|
|
Created:
Dec. 19, 2014, 1:16 p.m. by kzar Modified:
April 3, 2015, 9:48 a.m. CC:
Felix Dahlke, saroyanm Visibility:
Public. |
DescriptionIssue #395 Filter hits statistics backend
This code can be found in GitHub here https://github.com/kzar/adblockplus-sitescripts/tree/395-hitstats-backend . (The repository has one extra commit not included in this review that provides a Vagrant VM development environment. Clone and type `vagrant up` if you wish to test this, you can then access the webserver at http://localhost:5000 and the query interface at http://localhost:5000/static/query.html )
In case it helps I have also written some helpful utilities for generating dummy data and dumping the logged data into a flattened JSON format. These are not part of the sitescripts repository or this review but can be obtained separately here https://github.com/kzar/adblockplus-filterhits-helpers
The process_logs.py script included is a convenient way to re-load log files into the database. I included this as we are expecting the aggregation method to be adjusted in the future and this provides an easy way to then re-load the data. It can load either a single log file or all the log files in a directory, all as one transaction which is rolled back in case of error.
Patch Set 1 #
Total comments: 71
Patch Set 2 : Improvements regarding comments #
Total comments: 35
Patch Set 3 : Addressed comments. #Patch Set 4 : Added API tests, addressed comments and some other improvements. #
Total comments: 14
Patch Set 5 : Addressed more comments. #Patch Set 6 : Display friendly message if processing script can't connect to DB. #Patch Set 7 : Rebased. #Patch Set 8 : Simplified geometrical_mean code and reduced filter inserts. #
Total comments: 44
Patch Set 9 : Addressed some of Wladimir's comments #
Total comments: 2
Patch Set 10 : Addressed Sebastian's and Wladimir's comments. #
Total comments: 32
Patch Set 11 : Addressed further comments from Wladimir. #
Total comments: 8
Patch Set 12 : Addressed final nits. #Patch Set 13 : Rebased. #Patch Set 14 : Removed db.testing flag, instead overwrite config during testing. #
Total comments: 4
Patch Set 15 : Make test log directory path configurable and ensure it's always cleared. #
Total comments: 2
Patch Set 16 : Create temporary log directory with tempfile module for tests. #
Total comments: 2
Patch Set 17 : Make sure the temporary log directory is recreated for each test. #
Total comments: 20
Patch Set 18 : Addressed further feedback from Sebastian. #Patch Set 19 : Created base class for tests and reverted earlier content-type change. #
Total comments: 12
Patch Set 20 : Addressed further comments from Sebastian. #MessagesTotal messages: 59
http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/.hgi... File .hgignore (right): http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/.hgi... .hgignore:6: sitescripts/filterhits/test/temp While on it mind adding a .gitignore like we already did for some of our other repos? Should be done with a separate review though. http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... File sitescripts/filterhits/bin/__init__.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... sitescripts/filterhits/bin/__init__.py:1: Nit: Unneeded newline in empty file http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... File sitescripts/filterhits/bin/process_logs.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... sitescripts/filterhits/bin/process_logs.py:4: # Copyright (C) 2006-2014 Eyeo GmbH Nit: This line needs to be updated now. In other files as well. http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... sitescripts/filterhits/bin/process_logs.py:28: Provides a generator of filter hits log files for the given directory. Nit: Why do you indent the docstring one more space than the surrounding quotes? http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... File sitescripts/filterhits/common.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... sitescripts/filterhits/common.py:41: return ( I wonder whether we should rather properly check for exceptions than having logic duplicated in a (currently incomplete) validation function. Same above. http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... sitescripts/filterhits/common.py:58: return int((dt - datetime(1970, 1, 1)).total_seconds()) Hardcoding the epoch looks like a hack to me. Did you know that you can actually convert a datetime to a timetuple and that to a UNIX timestamp: time.mktime(dt.timetuple()) http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... sitescripts/filterhits/common.py:69: if not os.path.exists(path): I'd rather catch the OSError, checking for e.errno == errno.EEXIST, rather than checking whether the directory exists in advance, to eliminate race conditions and reduce syscalls. http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... File sitescripts/filterhits/db.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... sitescripts/filterhits/db.py:33: def disconnect(): It seems the global variable and diconnect() method is uneeded. You can simply call the connection's close() method in the calling code. http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... sitescripts/filterhits/db.py:48: global db Note that you only have to use the "global" statement when setting global variables not for reading them. But anyway, I think we should get rid of that global variable. http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... sitescripts/filterhits/db.py:68: if isinstance(sql, str): How about always expecting a sequence here, eliminating the need for a type switch? If the calling code usually writes only one item you might want to go with variable arguments, to eliminate boilerplate there. http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... sitescripts/filterhits/db.py:72: # Commit the changes You don't say? This comment doesn't add any information that the code below doesn't already tells you on first glance. http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... sitescripts/filterhits/db.py:76: [cursor.execute(s) for s in query.split(";") if s] I'd rather use a for loop, than list expressions, if you don't need the result. http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... sitescripts/filterhits/db.py:79: # On error roll them back Same as above. http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... sitescripts/filterhits/db.py:80: if db: How could a MySQLdb.Error be raised if db didn't exist? http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... sitescripts/filterhits/db.py:84: if cursor: This will result in a NameError if the cursor doesn't exist yet. The code should rather look like that: try: cursor = db.cursor() try: ... finally: cursor.close() except MySQLdb.Error: ... Same above. http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... File sitescripts/filterhits/geometrical_mean.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... sitescripts/filterhits/geometrical_mean.py:27: filter = db.escape(filter) Please don't manually escape or format SQL. Pass the variable parameters to the cursor's execute() method instead, like: cursor.execute("SELECT * FROM t WHERE foo = %s", ("bar",)) http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... sitescripts/filterhits/geometrical_mean.py:31: "(filter, md5) VALUES ('%s', UNHEX(MD5(filter)));" + MD5 is deprecate due to known collision issues. You should use SHA1 instead. http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... sitescripts/filterhits/geometrical_mean.py:34: "VALUES (UNHEX(MD5('%s')), '%s', %d, FROM_UNIXTIME(%d)) " + It seems that timezone information are ignored here. I highly recommend to convert all timestamps to UTC before storing/processing them, to avoid inconsistencies and bugs. http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... sitescripts/filterhits/geometrical_mean.py:62: return itertools.imap(lambda fields: apply(partial(update_sql, interval), fields), apply() is a Python 2.2 (?) relic before the *-syntax for variable arguments were introduced. In modern Python you can just write following code: itertools.imap(lambda fields: update_sql(interval, *fields), filter_hits(data)) http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... File sitescripts/filterhits/schema.sql (right): http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... sitescripts/filterhits/schema.sql:11: /*!40101 SET character_set_client = utf8 */; This is redundant with the above. Not that we have to bother about setting the client charset if we don't INSERT/UPDATE any text data. http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... File sitescripts/filterhits/static/query.html (right): http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... sitescripts/filterhits/static/query.html:1: <!DOCTYPE html> This web interface is currently not specified in the issue description. So either update the issue description or move it to a separate issue/review. http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... File sitescripts/filterhits/test/common_tests.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... sitescripts/filterhits/test/common_tests.py:1: # coding: utf-8 I wonder whether we should also have a test running the server, and actually testing the /submit and /query HTTP APIs. http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... File sitescripts/filterhits/web/query.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... sitescripts/filterhits/web/query.py:22: Nit: The newline should be between the corelib and own module. http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... sitescripts/filterhits/web/query.py:23: import sitescripts.filterhits.common as common Nit: You are misusing the "import .. as" syntax here. That is what "from .. import" is for: from sitescripts.filterhits import common from sitescripts.filterhits import db http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... sitescripts/filterhits/web/query.py:46: db.connect(config.get("filterhitstats", "dbuser"), It seems this pattern is repeated. How about retrieving the user, password and database inside the connect() method, or at least if it's not provided otherwise (e.g. when testing)? http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... File sitescripts/filterhits/web/submit.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... sitescripts/filterhits/web/submit.py:33: if environ["REQUEST_METHOD"].upper() != "POST": According to the specs REQUEST_METHOD is always already uppercase: http://wsgi.readthedocs.org/en/latest/definitions.html http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... sitescripts/filterhits/web/submit.py:40: except ValueError: If you just catch the KeyError as well you can use canonical key lookups, and don't need to specify the fallback twice: try: data_length = int(environ["CONTENT_LENGTH"]) except (ValueError, KeyError): data_length = 0 http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... sitescripts/filterhits/web/submit.py:58: except OSError, IOError: You need parentheses here. Otherwise - as it currently is - the catched OSError will be stored in the variable IOError, while IOErrors are not handled.
http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... File sitescripts/filterhits/bin/process_logs.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... sitescripts/filterhits/bin/process_logs.py:33: if f.endswith(".log") and f[0].isdigit(): Nit: Please use os.path.splitext() http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... sitescripts/filterhits/bin/process_logs.py:46: while s != "\" ": It took me a while to figure this code out. Maybe a proper state machine would be more readable: last = None while True: current = f.read(1) if not current: sys.exit("Unexpected EOF") if last == '\' and current == " ": break last = current Also note that EOF is currently ignored by your implementation resulting in an infinite loop.
http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/.hgi... File .hgignore (right): http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/.hgi... .hgignore:6: sitescripts/filterhits/test/temp On 2015/02/11 16:00:12, Sebastian Noack wrote: > While on it mind adding a .gitignore like we already did for some of our other > repos? Should be done with a separate review though. I've created an issue for this https://issues.adblockplus.org/ticket/1999#ticket http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... File sitescripts/filterhits/bin/__init__.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... sitescripts/filterhits/bin/__init__.py:1: On 2015/02/11 16:00:12, Sebastian Noack wrote: > Nit: Unneeded newline in empty file Done. http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... File sitescripts/filterhits/bin/process_logs.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... sitescripts/filterhits/bin/process_logs.py:4: # Copyright (C) 2006-2014 Eyeo GmbH On 2015/02/11 16:00:12, Sebastian Noack wrote: > Nit: This line needs to be updated now. In other files as well. Done. http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... sitescripts/filterhits/bin/process_logs.py:28: Provides a generator of filter hits log files for the given directory. On 2015/02/11 16:00:12, Sebastian Noack wrote: > Nit: Why do you indent the docstring one more space than the surrounding quotes? Done. http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... sitescripts/filterhits/bin/process_logs.py:33: if f.endswith(".log") and f[0].isdigit(): On 2015/02/11 16:29:32, Sebastian Noack wrote: > Nit: Please use os.path.splitext() Done. http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... sitescripts/filterhits/bin/process_logs.py:46: while s != "\" ": On 2015/02/11 16:29:32, Sebastian Noack wrote: > It took me a while to figure this code out. Maybe a proper state machine would > be more readable: > > last = None > while True: > current = f.read(1) > if not current: > sys.exit("Unexpected EOF") > if last == '\' and current == " ": > break > last = current > > Also note that EOF is currently ignored by your implementation resulting in an > infinite loop. Done. http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... File sitescripts/filterhits/common.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... sitescripts/filterhits/common.py:41: return ( On 2015/02/11 16:00:12, Sebastian Noack wrote: > I wonder whether we should rather properly check for exceptions than having > logic duplicated in a (currently incomplete) validation function. Same above. Done. http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... sitescripts/filterhits/common.py:58: return int((dt - datetime(1970, 1, 1)).total_seconds()) On 2015/02/11 16:00:12, Sebastian Noack wrote: > Hardcoding the epoch looks like a hack to me. Did you know that you can actually > convert a datetime to a timetuple and that to a UNIX timestamp: > > time.mktime(dt.timetuple()) Done. http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... sitescripts/filterhits/common.py:69: if not os.path.exists(path): On 2015/02/11 16:00:12, Sebastian Noack wrote: > I'd rather catch the OSError, checking for e.errno == errno.EEXIST, rather than > checking whether the directory exists in advance, to eliminate race conditions > and reduce syscalls. Done. http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... File sitescripts/filterhits/db.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... sitescripts/filterhits/db.py:33: def disconnect(): On 2015/02/11 16:00:12, Sebastian Noack wrote: > It seems the global variable and diconnect() method is uneeded. You can simply > call the connection's close() method in the calling code. Done. http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... sitescripts/filterhits/db.py:48: global db On 2015/02/11 16:00:12, Sebastian Noack wrote: > Note that you only have to use the "global" statement when setting global > variables not for reading them. But anyway, I think we should get rid of that > global variable. Done. http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... sitescripts/filterhits/db.py:72: # Commit the changes On 2015/02/11 16:00:12, Sebastian Noack wrote: > You don't say? This comment doesn't add any information that the code below > doesn't already tells you on first glance. Done. http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... sitescripts/filterhits/db.py:76: [cursor.execute(s) for s in query.split(";") if s] On 2015/02/11 16:00:12, Sebastian Noack wrote: > I'd rather use a for loop, than list expressions, if you don't need the result. Done. http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... sitescripts/filterhits/db.py:79: # On error roll them back On 2015/02/11 16:00:12, Sebastian Noack wrote: > Same as above. Done. http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... sitescripts/filterhits/db.py:80: if db: On 2015/02/11 16:00:12, Sebastian Noack wrote: > How could a MySQLdb.Error be raised if db didn't exist? I'm not sure but I seem to remember that this check prevents problems so I would rather leave it in here. http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... sitescripts/filterhits/db.py:84: if cursor: On 2015/02/11 16:00:12, Sebastian Noack wrote: > This will result in a NameError if the cursor doesn't exist yet. The code should > rather look like that: > > try: > cursor = db.cursor() > try: > ... > finally: > cursor.close() > except MySQLdb.Error: > ... > > Same above. Done. http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... File sitescripts/filterhits/geometrical_mean.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... sitescripts/filterhits/geometrical_mean.py:27: filter = db.escape(filter) On 2015/02/11 16:00:12, Sebastian Noack wrote: > Please don't manually escape or format SQL. Pass the variable parameters to the > cursor's execute() method instead, like: > > cursor.execute("SELECT * FROM t WHERE foo = %s", ("bar",)) Done. http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... sitescripts/filterhits/geometrical_mean.py:31: "(filter, md5) VALUES ('%s', UNHEX(MD5(filter)));" + On 2015/02/11 16:00:12, Sebastian Noack wrote: > MD5 is deprecate due to known collision issues. You should use SHA1 instead. Done. http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... sitescripts/filterhits/geometrical_mean.py:34: "VALUES (UNHEX(MD5('%s')), '%s', %d, FROM_UNIXTIME(%d)) " + On 2015/02/11 16:00:12, Sebastian Noack wrote: > It seems that timezone information are ignored here. I highly recommend to > convert all timestamps to UTC before storing/processing them, to avoid > inconsistencies and bugs. As long as the client sends the timestamp properly as UTC we should be OK. I've added a note to the ticket and review for the client side code that we expect this. http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... sitescripts/filterhits/geometrical_mean.py:62: return itertools.imap(lambda fields: apply(partial(update_sql, interval), fields), On 2015/02/11 16:00:12, Sebastian Noack wrote: > apply() is a Python 2.2 (?) relic before the *-syntax for variable arguments > were introduced. In modern Python you can just write following code: > > itertools.imap(lambda fields: update_sql(interval, *fields), filter_hits(data)) Done. http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... File sitescripts/filterhits/schema.sql (right): http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... sitescripts/filterhits/schema.sql:11: /*!40101 SET character_set_client = utf8 */; On 2015/02/11 16:00:12, Sebastian Noack wrote: > This is redundant with the above. Not that we have to bother about setting the > client charset if we don't INSERT/UPDATE any text data. Done. http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... File sitescripts/filterhits/static/query.html (right): http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... sitescripts/filterhits/static/query.html:1: <!DOCTYPE html> On 2015/02/11 16:00:12, Sebastian Noack wrote: > This web interface is currently not specified in the issue description. So > either update the issue description or move it to a separate issue/review. https://issues.adblockplus.org/ticket/395?action=diff&version=55 Done. http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... File sitescripts/filterhits/web/query.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... sitescripts/filterhits/web/query.py:22: On 2015/02/11 16:00:12, Sebastian Noack wrote: > Nit: The newline should be between the corelib and own module. Done. http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... sitescripts/filterhits/web/query.py:23: import sitescripts.filterhits.common as common On 2015/02/11 16:00:12, Sebastian Noack wrote: > Nit: You are misusing the "import .. as" syntax here. That is what "from .. > import" is for: > > from sitescripts.filterhits import common > from sitescripts.filterhits import db Done. http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... File sitescripts/filterhits/web/submit.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... sitescripts/filterhits/web/submit.py:33: if environ["REQUEST_METHOD"].upper() != "POST": On 2015/02/11 16:00:12, Sebastian Noack wrote: > According to the specs REQUEST_METHOD is always already uppercase: > http://wsgi.readthedocs.org/en/latest/definitions.html Done. http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... sitescripts/filterhits/web/submit.py:40: except ValueError: On 2015/02/11 16:00:12, Sebastian Noack wrote: > If you just catch the KeyError as well you can use canonical key lookups, and > don't need to specify the fallback twice: > > try: > data_length = int(environ["CONTENT_LENGTH"]) > except (ValueError, KeyError): > data_length = 0 Done. http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... sitescripts/filterhits/web/submit.py:58: except OSError, IOError: On 2015/02/11 16:00:12, Sebastian Noack wrote: > You need parentheses here. Otherwise - as it currently is - the catched OSError > will be stored in the variable IOError, while IOErrors are not handled. Done.
http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... File sitescripts/filterhits/db.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... sitescripts/filterhits/db.py:68: if isinstance(sql, str): On 2015/02/11 16:00:12, Sebastian Noack wrote: > How about always expecting a sequence here, eliminating the need for a type > switch? If the calling code usually writes only one item you might want to go > with variable arguments, to eliminate boilerplate there. You didn't mind to address or reply to this comment. I still think we should just always expect an iterable object here, getting rid of this silly type switch. http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... File sitescripts/filterhits/schema.sql (right): http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... sitescripts/filterhits/schema.sql:11: /*!40101 SET character_set_client = utf8 */; On 2015/02/17 10:52:24, kzar wrote: > On 2015/02/11 16:00:12, Sebastian Noack wrote: > > This is redundant with the above. Not that we have to bother about setting the > > client charset if we don't INSERT/UPDATE any text data. > > Done. I just realized a typo in my comment above. I actually meant that as long as we don't store any string data here, we don't have to bother about the client charset. However, setting it won't hurt either way. And we're safe in case things change in the future. But thanks for removing the code duplication. http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... File sitescripts/filterhits/web/query.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... sitescripts/filterhits/web/query.py:46: db.connect(config.get("filterhitstats", "dbuser"), On 2015/02/11 16:00:12, Sebastian Noack wrote: > It seems this pattern is repeated. How about retrieving the user, password and > database inside the connect() method, or at least if it's not provided otherwise > (e.g. when testing)? What's about this comment? http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site... File sitescripts/filterhits/bin/process_logs.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site... sitescripts/filterhits/bin/process_logs.py:46: while not (last == '"' and current == " "): If you use the != instead the == operator in the first place, you don't need to negate the result: last != '"' or current != ' ' http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site... File sitescripts/filterhits/common.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site... sitescripts/filterhits/common.py:30: now = time.gmtime() For reference, this will be UTC. I don't have a strong opinion whether UTC or local timezone is more appropriate for local filenames. http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site... File sitescripts/filterhits/db.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site... sitescripts/filterhits/db.py:50: This writes a given SQL string or iterator of tuples containing SQL Iterators (in Python) are objects that implement the iterator protocol, i.e. the next() method. When using for-loops, Python internally calls iter(obj) to create an iterator, however the object passed to this function isn't an iterator itself, it's merely an iterable object. http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site... sitescripts/filterhits/db.py:61: sql, params = query[0], query[1:] I'd prefer to use a two-dimensional tuple here and everywhere else like (sql, params) rather than prepanding the sql to the params. This makes it easier to understand the data structure, is more efficient, and reduces complexity. http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site... File sitescripts/filterhits/geometrical_mean.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site... sitescripts/filterhits/geometrical_mean.py:59: return flatten(itertools.imap(lambda fields: update_query(interval, *fields), I feel that you are kinda overusing itertools. A generator seems to be more appropriate and easier to read here: for fields in filter_hits(data): for query in update_query(interval, *fields): yield query http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site... File sitescripts/filterhits/web/query.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site... sitescripts/filterhits/web/query.py:26: def query(domain=None, filter=None, skip=0, take=20, order_by="hits DESC", **_): Any reason why you silently ignore additional keyword arguments? http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site... sitescripts/filterhits/web/query.py:37: where_fields = [(s, "%" + p + "%") for s, p in (("domain", domain), It's best practice to use format string when concatenating more than two string. Format strings are generally more readable and also faster. Yes, I know that you have to escape the percent-sign then. ;) http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site... sitescripts/filterhits/web/query.py:38: ("filter", filter)) if p] Nit: Seems that the indentation is a little off here. http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site... sitescripts/filterhits/web/query.py:42: order_by, order = order_by.split() Strings aren't proper data structures. So how about turning the order_by argument into a tuple like ("hists", "DESC"). Or maybe even better, using two arguments, one for the sort column, and another one to indicate the sort direction. http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site... sitescripts/filterhits/web/query.py:43: order = order.upper() if order.upper() in ["ASC", "DESC"] else "ASC" Nit: When a sequence doesn't need to be modified use tuples instead lists. http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site... sitescripts/filterhits/web/query.py:65: if db_connection: This will result in a NameError, in case db_connection failed to connect. You need following pattern here: try: connection = db.connect(...) try: ... finally: connection.close() except MySQLdb.Error: ... http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site... File sitescripts/filterhits/web/submit.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site... sitescripts/filterhits/web/submit.py:45: data = json.loads(data) It seems that parsing an empty dictionary is unnecessary. Why don't you just use |data = {}| here? http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site... sitescripts/filterhits/web/submit.py:77: if db_connection: This will result in a NameError, in case db_connection failed to connect. You need following pattern here: try: connection = db.connect(...) try: ... finally: connection.close() except MySQLdb.Error: ...
http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... File sitescripts/filterhits/test/common_tests.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... sitescripts/filterhits/test/common_tests.py:1: # coding: utf-8 On 2015/02/11 16:00:12, Sebastian Noack wrote: > I wonder whether we should also have a test running the server, and actually > testing the /submit and /query HTTP APIs. We don't have to run a real server, calling the WSGI handler directly would do. Actually sounds like a good idea here. http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site... File sitescripts/filterhits/common.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site... sitescripts/filterhits/common.py:30: now = time.gmtime() On 2015/02/17 14:59:17, Sebastian Noack wrote: > For reference, this will be UTC. I don't have a strong opinion whether UTC or > local timezone is more appropriate for local filenames. Local timezone is always a bad idea - already because of daylight saving time, there is one hour each year where the same "timestamp" can refer to two different points in time. Not that it should matter much as the "local" time for all our servers is UTC, for that very reason.
http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site... File sitescripts/filterhits/common.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site... sitescripts/filterhits/common.py:30: now = time.gmtime() On 2015/02/17 15:12:21, Wladimir Palant wrote: > On 2015/02/17 14:59:17, Sebastian Noack wrote: > > For reference, this will be UTC. I don't have a strong opinion whether UTC or > > local timezone is more appropriate for local filenames. > > Local timezone is always a bad idea I generally agree. Though I wasn't entirely sure in this particular case, since from a sysadmin perspective it might be confusing to have files with names based on the time in a different time zone. Log files on UNIX machines usually use the local timezone. But I don't mind. > - already because of daylight saving time, > there is one hour each year where the same "timestamp" can refer to two > different points in time. Mind leap seconds? So as long as you don't run your server with GPS time, you have that issue anyway. ;P
http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site... File sitescripts/filterhits/web/submit.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site... sitescripts/filterhits/web/submit.py:45: data = json.loads(data) data = {} try: data_length = int(environ["CONTENT_LENGTH"]) except (ValueError, KeyError): pass else: if data_length > 0: try: data = json.loads(environ["wsgi.input"].read(data_length)) except ValueError: return common.showError("Error while parsing JSON data.", start_response) Or if we just fail in case the request is empty, it can get even simpler: try: data = json.loads(environ["wsgi.input"].read(int(environ["CONTENT_LENGTH"]))) except (ValueError, KeyError): return common.showError("Error while parsing JSON data.", start_response)
http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site... File sitescripts/filterhits/web/submit.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site... sitescripts/filterhits/web/submit.py:45: data = json.loads(data) Possibly you could also just use json.load(): try: data = json.load(environ["wsgi.input"]) except (ValueError, IOError): return common.showError("Error while parsing JSON data.", start_response)
http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... File sitescripts/filterhits/db.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... sitescripts/filterhits/db.py:68: if isinstance(sql, str): On 2015/02/17 14:59:17, Sebastian Noack wrote: > On 2015/02/11 16:00:12, Sebastian Noack wrote: > > How about always expecting a sequence here, eliminating the need for a type > > switch? If the calling code usually writes only one item you might want to go > > with variable arguments, to eliminate boilerplate there. > > You didn't mind to address or reply to this comment. I still think we should > just always expect an iterable object here, getting rid of this silly type > switch. It just makes the function nicer to use if you only want to write one piece of SQL to the database and don't need to pass parameters. `db.write("DELETE FROM BLAH")` instead of `db.write(("DELETE FROM BLAH"),)`. Especially in the testing code it makes things a little cleaner. http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... File sitescripts/filterhits/test/common_tests.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... sitescripts/filterhits/test/common_tests.py:1: # coding: utf-8 On 2015/02/17 15:12:21, Wladimir Palant wrote: > On 2015/02/11 16:00:12, Sebastian Noack wrote: > > I wonder whether we should also have a test running the server, and actually > > testing the /submit and /query HTTP APIs. > > We don't have to run a real server, calling the WSGI handler directly would do. > Actually sounds like a good idea here. Yea, I agree a few tests for the API would be nice. I'll see what I can do. http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... File sitescripts/filterhits/web/query.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... sitescripts/filterhits/web/query.py:46: db.connect(config.get("filterhitstats", "dbuser"), On 2015/02/17 14:59:17, Sebastian Noack wrote: > On 2015/02/11 16:00:12, Sebastian Noack wrote: > > It seems this pattern is repeated. How about retrieving the user, password and > > database inside the connect() method, or at least if it's not provided > otherwise > > (e.g. when testing)? > > What's about this comment? I prefer to do it this way, during development I experimented with a few approaches but IMHO it's dangerous to set the database by default. All it takes is one test to be written that forgets to pass the database name and you've messed up your real database. http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site... File sitescripts/filterhits/bin/process_logs.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site... sitescripts/filterhits/bin/process_logs.py:46: while not (last == '"' and current == " "): On 2015/02/17 14:59:17, Sebastian Noack wrote: > If you use the != instead the == operator in the first place, you don't need to > negate the result: > > last != '"' or current != ' ' I'm aware of Demorgan's law but I think the intention is much clearer when it's written this way around. http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site... File sitescripts/filterhits/db.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site... sitescripts/filterhits/db.py:50: This writes a given SQL string or iterator of tuples containing SQL On 2015/02/17 14:59:17, Sebastian Noack wrote: > Iterators (in Python) are objects that implement the iterator protocol, i.e. the > next() method. When using for-loops, Python internally calls iter(obj) to create > an iterator, however the object passed to this function isn't an iterator > itself, it's merely an iterable object. Done. http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site... sitescripts/filterhits/db.py:61: sql, params = query[0], query[1:] On 2015/02/17 14:59:17, Sebastian Noack wrote: > I'd prefer to use a two-dimensional tuple here and everywhere else like (sql, > params) rather than prepanding the sql to the params. This makes it easier to > understand the data structure, is more efficient, and reduces complexity. I like doing it this way for db.query and db.write personally as it reduces the boilerplate when using the functions with zero or one parameters. I originally did it in a way similar to how you suggested but found lots of calls where I was passing empty or 1 item tuples - it seemed kind of ugly. http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site... File sitescripts/filterhits/geometrical_mean.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site... sitescripts/filterhits/geometrical_mean.py:59: return flatten(itertools.imap(lambda fields: update_query(interval, *fields), On 2015/02/17 14:59:17, Sebastian Noack wrote: > I feel that you are kinda overusing itertools. A generator seems to be more > appropriate and easier to read here: > > for fields in filter_hits(data): > for query in update_query(interval, *fields): > yield query Done. http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site... File sitescripts/filterhits/web/query.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site... sitescripts/filterhits/web/query.py:26: def query(domain=None, filter=None, skip=0, take=20, order_by="hits DESC", **_): On 2015/02/17 14:59:17, Sebastian Noack wrote: > Any reason why you silently ignore additional keyword arguments? I do that because we're taking the parameters straight from the query string. We want to ignore any passed parameters that we don't care about here. http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site... sitescripts/filterhits/web/query.py:37: where_fields = [(s, "%" + p + "%") for s, p in (("domain", domain), On 2015/02/17 14:59:17, Sebastian Noack wrote: > It's best practice to use format string when concatenating more than two string. > Format strings are generally more readable and also faster. Yes, I know that you > have to escape the percent-sign then. ;) I disagree that `"%%%s%%" % p` is easier to read than `"%" + p + "%"`. http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site... sitescripts/filterhits/web/query.py:38: ("filter", filter)) if p] On 2015/02/17 14:59:17, Sebastian Noack wrote: > Nit: Seems that the indentation is a little off here. Done. http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site... sitescripts/filterhits/web/query.py:42: order_by, order = order_by.split() On 2015/02/17 14:59:17, Sebastian Noack wrote: > Strings aren't proper data structures. So how about turning the order_by > argument into a tuple like ("hists", "DESC"). Or maybe even better, using two > arguments, one for the sort column, and another one to indicate the sort > direction. Done. http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site... sitescripts/filterhits/web/query.py:43: order = order.upper() if order.upper() in ["ASC", "DESC"] else "ASC" On 2015/02/17 14:59:17, Sebastian Noack wrote: > Nit: When a sequence doesn't need to be modified use tuples instead lists. Done. http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site... sitescripts/filterhits/web/query.py:65: if db_connection: On 2015/02/17 14:59:17, Sebastian Noack wrote: > This will result in a NameError, in case db_connection failed to connect. You > need following pattern here: > > try: > connection = db.connect(...) > try: > ... > finally: > connection.close() > except MySQLdb.Error: > ... Done. http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site... File sitescripts/filterhits/web/submit.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site... sitescripts/filterhits/web/submit.py:45: data = json.loads(data) On 2015/02/17 17:16:57, Sebastian Noack wrote: > Possibly you could also just use json.load(): > > try: > data = json.load(environ["wsgi.input"]) > except (ValueError, IOError): > return common.showError("Error while parsing JSON data.", start_response) Done. http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site... sitescripts/filterhits/web/submit.py:77: if db_connection: On 2015/02/17 14:59:17, Sebastian Noack wrote: > This will result in a NameError, in case db_connection failed to connect. You > need following pattern here: > > try: > connection = db.connect(...) > try: > ... > finally: > connection.close() > except MySQLdb.Error: > ... Done.
http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... File sitescripts/filterhits/db.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... sitescripts/filterhits/db.py:68: if isinstance(sql, str): On 2015/02/24 18:05:11, kzar wrote: > On 2015/02/17 14:59:17, Sebastian Noack wrote: > > On 2015/02/11 16:00:12, Sebastian Noack wrote: > > > How about always expecting a sequence here, eliminating the need for a type > > > switch? If the calling code usually writes only one item you might want to > go > > > with variable arguments, to eliminate boilerplate there. > > > > You didn't mind to address or reply to this comment. I still think we should > > just always expect an iterable object here, getting rid of this silly type > > switch. > > It just makes the function nicer to use if you only want to write one piece of > SQL to the database and don't need to pass parameters. > > `db.write("DELETE FROM BLAH")` instead of `db.write(("DELETE FROM BLAH"),)`. > Especially in the testing code it makes things a little cleaner. There is only one call in your whole code where you pass a single string to the write() method. This certainly doesn't justify to keep that logic. http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... sitescripts/filterhits/db.py:80: if db: On 2015/02/17 10:52:24, kzar wrote: > On 2015/02/11 16:00:12, Sebastian Noack wrote: > > How could a MySQLdb.Error be raised if db didn't exist? > > I'm not sure but I seem to remember that this check prevents problems so I would > rather leave it in here. If a random change you can't explain, fixes an issue you don't understand, this isn't necessarily the correct or best way doing it. So figure out what's the issue here, and if this is indeed the correct way of handling it add a comment explaining why this non-obvious check is necessary. Does the connection class maybe implement __nonzero__, returning False when the connection was closed or rolled back after an error? http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... File sitescripts/filterhits/web/query.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... sitescripts/filterhits/web/query.py:46: db.connect(config.get("filterhitstats", "dbuser"), On 2015/02/24 18:05:11, kzar wrote: > On 2015/02/17 14:59:17, Sebastian Noack wrote: > > On 2015/02/11 16:00:12, Sebastian Noack wrote: > > > It seems this pattern is repeated. How about retrieving the user, password > and > > > database inside the connect() method, or at least if it's not provided > > otherwise > > > (e.g. when testing)? > > > > What's about this comment? > > I prefer to do it this way, during development I experimented with a few > approaches but IMHO it's dangerous to set the database by default. All it takes > is one test to be written that forgets to pass the database name and you've > messed up your real database. That doesn't make any sense. The database and credentials come from the configuration anyway. I'm merely criticizing that you duplicate the code retrieving those settings. http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site... File sitescripts/filterhits/bin/process_logs.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site... sitescripts/filterhits/bin/process_logs.py:46: while not (last == '"' and current == " "): On 2015/02/24 18:05:11, kzar wrote: > On 2015/02/17 14:59:17, Sebastian Noack wrote: > > If you use the != instead the == operator in the first place, you don't need > to > > negate the result: > > > > last != '"' or current != ' ' > > I'm aware of Demorgan's law but I think the intention is much clearer when it's > written this way around. I'd rather say, as more logical operations involved, as harder it's to understand the logic. http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site... File sitescripts/filterhits/db.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site... sitescripts/filterhits/db.py:61: sql, params = query[0], query[1:] On 2015/02/24 18:05:11, kzar wrote: > On 2015/02/17 14:59:17, Sebastian Noack wrote: > > I'd prefer to use a two-dimensional tuple here and everywhere else like (sql, > > params) rather than prepanding the sql to the params. This makes it easier to > > understand the data structure, is more efficient, and reduces complexity. > > I like doing it this way for db.query and db.write personally as it reduces the > boilerplate when using the functions with zero or one parameters. I originally > did it in a way similar to how you suggested but found lots of calls where I was > passing empty or 1 item tuples - it seemed kind of ugly. I don't agree. So I am leaving it up to Wladimir. http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site... File sitescripts/filterhits/web/query.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site... sitescripts/filterhits/web/query.py:26: def query(domain=None, filter=None, skip=0, take=20, order_by="hits DESC", **_): On 2015/02/24 18:05:11, kzar wrote: > On 2015/02/17 14:59:17, Sebastian Noack wrote: > > Any reason why you silently ignore additional keyword arguments? > > I do that because we're taking the parameters straight from the query string. We > want to ignore any passed parameters that we don't care about here. I see. http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site... sitescripts/filterhits/web/query.py:37: where_fields = [(s, "%" + p + "%") for s, p in (("domain", domain), On 2015/02/24 18:05:11, kzar wrote: > On 2015/02/17 14:59:17, Sebastian Noack wrote: > > It's best practice to use format string when concatenating more than two > string. > > Format strings are generally more readable and also faster. Yes, I know that > you > > have to escape the percent-sign then. ;) > > I disagree that `"%%%s%%" % p` is easier to read than `"%" + p + "%"`. Fair enough.
Added API tests, addressed comments and some other improvements. http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... File sitescripts/filterhits/db.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... sitescripts/filterhits/db.py:68: if isinstance(sql, str): On 2015/02/26 16:39:25, Sebastian Noack wrote: > On 2015/02/24 18:05:11, kzar wrote: > > On 2015/02/17 14:59:17, Sebastian Noack wrote: > > > On 2015/02/11 16:00:12, Sebastian Noack wrote: > > > > How about always expecting a sequence here, eliminating the need for a > type > > > > switch? If the calling code usually writes only one item you might want to > > go > > > > with variable arguments, to eliminate boilerplate there. > > > > > > You didn't mind to address or reply to this comment. I still think we should > > > just always expect an iterable object here, getting rid of this silly type > > > switch. > > > > It just makes the function nicer to use if you only want to write one piece of > > SQL to the database and don't need to pass parameters. > > > > `db.write("DELETE FROM BLAH")` instead of `db.write(("DELETE FROM BLAH"),)`. > > Especially in the testing code it makes things a little cleaner. > > There is only one call in your whole code where you pass a single string to the > write() method. This certainly doesn't justify to keep that logic. Fair enough, I didn't realise that. Done. http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... sitescripts/filterhits/db.py:80: if db: On 2015/02/26 16:39:25, Sebastian Noack wrote: > On 2015/02/17 10:52:24, kzar wrote: > > On 2015/02/11 16:00:12, Sebastian Noack wrote: > > > How could a MySQLdb.Error be raised if db didn't exist? > > > > I'm not sure but I seem to remember that this check prevents problems so I > would > > rather leave it in here. > > If a random change you can't explain, fixes an issue you don't understand, this > isn't necessarily the correct or best way doing it. So figure out what's the > issue here, and if this is indeed the correct way of handling it add a comment > explaining why this non-obvious check is necessary. > > Does the connection class maybe implement __nonzero__, returning False when the > connection was closed or rolled back after an error? Well I agree with you of course but I guess the trouble is when there is a gap of months between writing code and getting it reviewed it's hard to remember the finer details. I've removed it now and I can't see any problems. The tests still pass, I can still submit data and I can still query it out again. http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... File sitescripts/filterhits/test/common_tests.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... sitescripts/filterhits/test/common_tests.py:1: # coding: utf-8 On 2015/02/24 18:05:11, kzar wrote: > On 2015/02/17 15:12:21, Wladimir Palant wrote: > > On 2015/02/11 16:00:12, Sebastian Noack wrote: > > > I wonder whether we should also have a test running the server, and actually > > > testing the /submit and /query HTTP APIs. > > > > We don't have to run a real server, calling the WSGI handler directly would > do. > > Actually sounds like a good idea here. > > Yea, I agree a few tests for the API would be nice. I'll see what I can do. OK, I've written some now. http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... File sitescripts/filterhits/web/query.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site... sitescripts/filterhits/web/query.py:46: db.connect(config.get("filterhitstats", "dbuser"), On 2015/02/26 16:39:25, Sebastian Noack wrote: > On 2015/02/24 18:05:11, kzar wrote: > > On 2015/02/17 14:59:17, Sebastian Noack wrote: > > > On 2015/02/11 16:00:12, Sebastian Noack wrote: > > > > It seems this pattern is repeated. How about retrieving the user, password > > and > > > > database inside the connect() method, or at least if it's not provided > > > otherwise > > > > (e.g. when testing)? > > > > > > What's about this comment? > > > > I prefer to do it this way, during development I experimented with a few > > approaches but IMHO it's dangerous to set the database by default. All it > takes > > is one test to be written that forgets to pass the database name and you've > > messed up your real database. > > That doesn't make any sense. The database and credentials come from the > configuration anyway. I'm merely criticizing that you duplicate the code > retrieving those settings. OK, I changed my mind here and I've now changed it. Done. http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site... File sitescripts/filterhits/bin/process_logs.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site... sitescripts/filterhits/bin/process_logs.py:46: while not (last == '"' and current == " "): On 2015/02/26 16:39:25, Sebastian Noack wrote: > On 2015/02/24 18:05:11, kzar wrote: > > On 2015/02/17 14:59:17, Sebastian Noack wrote: > > > If you use the != instead the == operator in the first place, you don't need > > to > > > negate the result: > > > > > > last != '"' or current != ' ' > > > > I'm aware of Demorgan's law but I think the intention is much clearer when > it's > > written this way around. > > I'd rather say, as more logical operations involved, as harder it's to > understand the logic. Well what we're saying is something like "While the last two characters aren't '" ' then do this...". I think `not (last == '"' and current == " ")` is closer to that than `last != '"' or current != ' '` but I guess it's subjective. How about this? It's even closer to the English version and reduces the logical operations.
http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site... File sitescripts/filterhits/bin/process_logs.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site... sitescripts/filterhits/bin/process_logs.py:46: while not (last == '"' and current == " "): On 2015/02/28 19:39:56, kzar wrote: > On 2015/02/26 16:39:25, Sebastian Noack wrote: > > On 2015/02/24 18:05:11, kzar wrote: > > > On 2015/02/17 14:59:17, Sebastian Noack wrote: > > > > If you use the != instead the == operator in the first place, you don't > need > > > to > > > > negate the result: > > > > > > > > last != '"' or current != ' ' > > > > > > I'm aware of Demorgan's law but I think the intention is much clearer when > > it's > > > written this way around. > > > > I'd rather say, as more logical operations involved, as harder it's to > > understand the logic. > > Well what we're saying is something like "While the last two characters aren't > '" ' then do this...". I think `not (last == '"' and current == " ")` is closer > to that than `last != '"' or current != ' '` but I guess it's subjective. > > How about this? It's even closer to the English version and reduces the logical > operations. Not sure whether I like it, but fair enough. http://codereview.adblockplus.org/4615801646612480/diff/5768755258851328/site... File sitescripts/filterhits/bin/process_logs.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5768755258851328/site... sitescripts/filterhits/bin/process_logs.py:20: import sitescripts.filterhits.common as common Nit: from ... import ... http://codereview.adblockplus.org/4615801646612480/diff/5768755258851328/site... sitescripts/filterhits/bin/process_logs.py:81: if isinstance(e, KeyError): Where does the KeyError come from? Certainly not form the code reading the configuration since, you use dict.get there. http://codereview.adblockplus.org/4615801646612480/diff/5768755258851328/site... sitescripts/filterhits/bin/process_logs.py:89: if db_connection: Again, if db.connect() fails, this results into a NameError. You need following logic: db_connection = db.connect() try: try: ... except: ... finally: db_connection.close() http://codereview.adblockplus.org/4615801646612480/diff/5768755258851328/site... File sitescripts/filterhits/common.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5768755258851328/site... sitescripts/filterhits/common.py:21: def showError(message, start_response, status="400 Processing Error"): Nit: PEP-8 please, i.e. showError -> show_error http://codereview.adblockplus.org/4615801646612480/diff/5768755258851328/site... File sitescripts/filterhits/web/submit.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5768755258851328/site... sitescripts/filterhits/web/submit.py:23: import sitescripts.filterhits.common as common Nit: from ... import ...
Patch Set 5 : Addressed more comments. http://codereview.adblockplus.org/4615801646612480/diff/5768755258851328/site... File sitescripts/filterhits/bin/process_logs.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5768755258851328/site... sitescripts/filterhits/bin/process_logs.py:20: import sitescripts.filterhits.common as common On 2015/03/02 10:04:01, Sebastian Noack wrote: > Nit: from ... import ... Done. http://codereview.adblockplus.org/4615801646612480/diff/5768755258851328/site... sitescripts/filterhits/bin/process_logs.py:81: if isinstance(e, KeyError): On 2015/03/02 10:04:01, Sebastian Noack wrote: > Where does the KeyError come from? Certainly not form the code reading the > configuration since, you use dict.get there. KeyError comes from processing invalid data, this is required because we no longer validate the data. (Well we check they submitted a JSON object at least but that's it!) http://codereview.adblockplus.org/4615801646612480/diff/5768755258851328/site... sitescripts/filterhits/bin/process_logs.py:89: if db_connection: On 2015/03/02 10:04:01, Sebastian Noack wrote: > Again, if db.connect() fails, this results into a NameError. You need following > logic: > > db_connection = db.connect() > try: > try: > ... > except: > ... > finally: > db_connection.close() > Done. http://codereview.adblockplus.org/4615801646612480/diff/5768755258851328/site... File sitescripts/filterhits/common.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5768755258851328/site... sitescripts/filterhits/common.py:21: def showError(message, start_response, status="400 Processing Error"): On 2015/03/02 10:04:01, Sebastian Noack wrote: > Nit: PEP-8 please, i.e. showError -> show_error (I was following the example from sitescripts.reports, sitescripts.crashes and sitescripts.urlfixer.) Done. http://codereview.adblockplus.org/4615801646612480/diff/5768755258851328/site... File sitescripts/filterhits/web/submit.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5768755258851328/site... sitescripts/filterhits/web/submit.py:23: import sitescripts.filterhits.common as common On 2015/03/02 10:04:01, Sebastian Noack wrote: > Nit: from ... import ... Done.
http://codereview.adblockplus.org/4615801646612480/diff/5768755258851328/site... File sitescripts/filterhits/bin/process_logs.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5768755258851328/site... sitescripts/filterhits/bin/process_logs.py:89: if db_connection: On 2015/03/02 10:39:03, kzar wrote: > On 2015/03/02 10:04:01, Sebastian Noack wrote: > > Again, if db.connect() fails, this results into a NameError. You need > following > > logic: > > > > db_connection = db.connect() > > try: > > try: > > ... > > except: > > ... > > finally: > > db_connection.close() > > > > Done. Sorry, I meant: try: db_connection = db.connect() try: ... finally: db_connection.close() except: ... Otherwise, you ignore MySQL errors that occur while connecting.
On 2015/03/02 10:56:35, Sebastian Noack wrote: > http://codereview.adblockplus.org/4615801646612480/diff/5768755258851328/site... > File sitescripts/filterhits/bin/process_logs.py (right): > > http://codereview.adblockplus.org/4615801646612480/diff/5768755258851328/site... > sitescripts/filterhits/bin/process_logs.py:89: if db_connection: > On 2015/03/02 10:39:03, kzar wrote: > > On 2015/03/02 10:04:01, Sebastian Noack wrote: > > > Again, if db.connect() fails, this results into a NameError. You need > > following > > > logic: > > > > > > db_connection = db.connect() > > > try: > > > try: > > > ... > > > except: > > > ... > > > finally: > > > db_connection.close() > > > > > > > Done. > > Sorry, I meant: > > try: > db_connection = db.connect() > try: > ... > finally: > db_connection.close() > except: > ... > > Otherwise, you ignore MySQL errors that occur while connecting. Yea I realised that but I figure that's OK. This script is called from the command line to process logs, if the database connection fails I think the script should return an exception about that.
Sorry, I hit the wrong reply button there. Ignore the last email. http://codereview.adblockplus.org/4615801646612480/diff/5768755258851328/site... File sitescripts/filterhits/bin/process_logs.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5768755258851328/site... sitescripts/filterhits/bin/process_logs.py:89: if db_connection: On 2015/03/02 10:56:36, Sebastian Noack wrote: > On 2015/03/02 10:39:03, kzar wrote: > > On 2015/03/02 10:04:01, Sebastian Noack wrote: > > > Again, if db.connect() fails, this results into a NameError. You need > > following > > > logic: > > > > > > db_connection = db.connect() > > > try: > > > try: > > > ... > > > except: > > > ... > > > finally: > > > db_connection.close() > > > > > > > Done. > > Sorry, I meant: > > try: > db_connection = db.connect() > try: > ... > finally: > db_connection.close() > except: > ... > > Otherwise, you ignore MySQL errors that occur while connecting. Yea I realised that but I figure it's OK. This script is called from the command line to process logs, if the database connection fails I think the script probably should return an exception.
http://codereview.adblockplus.org/4615801646612480/diff/5768755258851328/site... File sitescripts/filterhits/bin/process_logs.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5768755258851328/site... sitescripts/filterhits/bin/process_logs.py:89: if db_connection: On 2015/03/02 11:00:52, kzar wrote: > On 2015/03/02 10:56:36, Sebastian Noack wrote: > > On 2015/03/02 10:39:03, kzar wrote: > > > On 2015/03/02 10:04:01, Sebastian Noack wrote: > > > > Again, if db.connect() fails, this results into a NameError. You need > > > following > > > > logic: > > > > > > > > db_connection = db.connect() > > > > try: > > > > try: > > > > ... > > > > except: > > > > ... > > > > finally: > > > > db_connection.close() > > > > > > > > > > Done. > > > > Sorry, I meant: > > > > try: > > db_connection = db.connect() > > try: > > ... > > finally: > > db_connection.close() > > except: > > ... > > > > Otherwise, you ignore MySQL errors that occur while connecting. > > Yea I realised that but I figure it's OK. This script is called from > the command line to process logs, if the database connection fails I > think the script probably should return an exception. The exception would be printed as well if handled by the code above. With your line of arguing there would be no reason to handle any exceptions here. Please make it consistent.
Patch Set 6 : Display friendly message if processing script can't connect to DB. http://codereview.adblockplus.org/4615801646612480/diff/5768755258851328/site... File sitescripts/filterhits/bin/process_logs.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5768755258851328/site... sitescripts/filterhits/bin/process_logs.py:89: if db_connection: On 2015/03/02 11:06:05, Sebastian Noack wrote: > On 2015/03/02 11:00:52, kzar wrote: > > On 2015/03/02 10:56:36, Sebastian Noack wrote: > > > On 2015/03/02 10:39:03, kzar wrote: > > > > On 2015/03/02 10:04:01, Sebastian Noack wrote: > > > > > Again, if db.connect() fails, this results into a NameError. You need > > > > following > > > > > logic: > > > > > > > > > > db_connection = db.connect() > > > > > try: > > > > > try: > > > > > ... > > > > > except: > > > > > ... > > > > > finally: > > > > > db_connection.close() > > > > > > > > > > > > > Done. > > > > > > Sorry, I meant: > > > > > > try: > > > db_connection = db.connect() > > > try: > > > ... > > > finally: > > > db_connection.close() > > > except: > > > ... > > > > > > Otherwise, you ignore MySQL errors that occur while connecting. > > > > Yea I realised that but I figure it's OK. This script is called from > > the command line to process logs, if the database connection fails I > > think the script probably should return an exception. > > The exception would be printed as well if handled by the code above. With your > line of arguing there would be no reason to handle any exceptions here. Please > make it consistent. Well my line of reasoning was that if the database is down and we can't connect any message that makes it clear that the database is down is sufficient - including the exception itself. With the processing exceptions however this is not the case because the user needs to know which log file caused the problem - not just that a MySQL exception / KeyError was thrown. Either way it's not a big deal, I now catch the MySQL exception on connecting and display a friendlier message instead.
LGTM. But I suppose Wladimir wants to have a closer look himself before I merge this.
Patch Set 7 : Rebased.
Patch Set 8 : Simplified geometrical_mean code and reduced filter inserts.
This is only a partial review, I didn't get to reviewing the web handlers yet. http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/.sit... File .sitescripts.example (right): http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/.sit... .sitescripts.example:178: dbpasswrd=password This should be dbpassword. http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site... File sitescripts/filterhits/bin/process_logs.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site... sitescripts/filterhits/bin/process_logs.py:18: import MySQLdb, itertools, json, os, sys Nit: PEP 8 requires one import per line and a blank line between groups of imports (https://www.python.org/dev/peps/pep-0008/#imports). Same in the other files of course. Yes, I know that most existing scripts aren't following these requirements yet. http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site... sitescripts/filterhits/bin/process_logs.py:20: from sitescripts.filterhits import common, db, geometrical_mean common isn't actually used here. http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site... sitescripts/filterhits/bin/process_logs.py:22: last_log_file = None Nit: IMHO this isn't meant to be exported, call it _last_log_file? http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site... sitescripts/filterhits/bin/process_logs.py:31: if os.path.splitext(f)[1] == ".log" and f[0].isdigit(): Why the requirement that the first letter is a digit? You don't have that requirement when validating command line parameters. IMHO it serves no purpose, other than breaking this code should you change the file name format elsewhere. http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site... sitescripts/filterhits/bin/process_logs.py:47: sys.exit("Unexpected EOF in log file %s" % log_file) Wouldn't that be a lot simpler if the date and the GET parameters were stored in the first line of the file and the JSON data in the second? You would merely need to read a line then... http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site... sitescripts/filterhits/bin/process_logs.py:78: sys.exit("Failed to connect to the MySQL database.") I have the impression that the original exception will be more informative here, why catch it? http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site... File sitescripts/filterhits/common.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site... sitescripts/filterhits/common.py:25: def log_filterhits(data, basepath, query_string): This isn't a common function, it is only being used by the submit handler and should live there. http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site... sitescripts/filterhits/common.py:47: query_string, json.dumps(data))) Nit: you can use |print >>f, ...| - then you won't need to specify the trailing newline explicitly. http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site... File sitescripts/filterhits/db.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site... sitescripts/filterhits/db.py:28: db=config.get("filterhitstats", "test_database" if testing else "database"), When would that flag ever be used? This seems to imply that we would test on the production server - that sounds like a bad idea, with flag or without. We have VMs for that, using production configuration. http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site... sitescripts/filterhits/db.py:38: if kwargs.pop('dict_result', False): Why specify **kwargs if only one keyword parameter is ever used? Shouldn't this function be declared as |def query(db, sql, dict_result=False, *params)|? http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site... File sitescripts/filterhits/geometrical_mean.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site... sitescripts/filterhits/geometrical_mean.py:33: yield ("""INSERT INTO `geometrical_mean` geometrical_mean is merely the aggregation approach we are using right now. A table name should describe the meaning of its contents, not the implementation. Call that table "frequency"?
Patch Set 9 : Addressed some of Wladimir's comments http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/.sit... File .sitescripts.example (right): http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/.sit... .sitescripts.example:178: dbpasswrd=password On 2015/03/26 22:56:50, Wladimir Palant wrote: > This should be dbpassword. Done. http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site... File sitescripts/filterhits/bin/process_logs.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site... sitescripts/filterhits/bin/process_logs.py:18: import MySQLdb, itertools, json, os, sys On 2015/03/26 22:56:50, Wladimir Palant wrote: > Nit: PEP 8 requires one import per line and a blank line between groups of > imports (https://www.python.org/dev/peps/pep-0008/#imports). Same in the other > files of course. Yes, I know that most existing scripts aren't following these > requirements yet. Done. http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site... sitescripts/filterhits/bin/process_logs.py:20: from sitescripts.filterhits import common, db, geometrical_mean On 2015/03/26 22:56:50, Wladimir Palant wrote: > common isn't actually used here. Done. http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site... sitescripts/filterhits/bin/process_logs.py:22: last_log_file = None On 2015/03/26 22:56:50, Wladimir Palant wrote: > Nit: IMHO this isn't meant to be exported, call it _last_log_file? Done. http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site... sitescripts/filterhits/bin/process_logs.py:31: if os.path.splitext(f)[1] == ".log" and f[0].isdigit(): On 2015/03/26 22:56:50, Wladimir Palant wrote: > Why the requirement that the first letter is a digit? You don't have that > requirement when validating command line parameters. IMHO it serves no purpose, > other than breaking this code should you change the file name format elsewhere. That is a simple way to skip over other log files in the directory, namely processing-errors.log. Not strictly necessary but it makes the tool easier to use in practice. http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site... sitescripts/filterhits/bin/process_logs.py:47: sys.exit("Unexpected EOF in log file %s" % log_file) On 2015/03/26 22:56:50, Wladimir Palant wrote: > Wouldn't that be a lot simpler if the date and the GET parameters were stored in > the first line of the file and the JSON data in the second? You would merely > need to read a line then... Sure, I suppose this logic would be simpler that way. Originally it was done this way as each log file contained many entries, one per line and I didn't re-consider the format since we changed to having one entry per file. Thing is this logic works well and has been well tested now I would prefer to leave it alone. Otherwise I'll need to update the format for test data that we've generated, the unit tests, the supporting tools for Kirill and test it all still works. http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site... sitescripts/filterhits/bin/process_logs.py:78: sys.exit("Failed to connect to the MySQL database.") On 2015/03/26 22:56:50, Wladimir Palant wrote: > I have the impression that the original exception will be more informative here, > why catch it? This was Sebastian's idea, not mine. (See the inline discussion around here in Patch set 5.) http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site... File sitescripts/filterhits/common.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site... sitescripts/filterhits/common.py:25: def log_filterhits(data, basepath, query_string): On 2015/03/26 22:56:50, Wladimir Palant wrote: > This isn't a common function, it is only being used by the submit handler and > should live there. Done. http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site... sitescripts/filterhits/common.py:47: query_string, json.dumps(data))) On 2015/03/26 22:56:50, Wladimir Palant wrote: > Nit: you can use |print >>f, ...| - then you won't need to specify the trailing > newline explicitly. Well apparently that's "unpythonic" I don't mind either way, but are you sure? http://stackoverflow.com/questions/8598228/f-write-vs-print-f http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site... File sitescripts/filterhits/db.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site... sitescripts/filterhits/db.py:28: db=config.get("filterhitstats", "test_database" if testing else "database"), On 2015/03/26 22:56:50, Wladimir Palant wrote: > When would that flag ever be used? This seems to imply that we would test on the > production server - that sounds like a bad idea, with flag or without. We have > VMs for that, using production configuration. Well originally I was passing the parameters through to db.connect each time so that I could pass one database name for tests and another elsewhere. It lead to a lot of duplication though so Sebastian suggested I changed that. (Rightly so I think.) This flag is a little ugly, it's true, but it is very useful for the unit tests. It allows us to have api tests that don't trash the real data. (Logged and in the database.) It also allows us to test the DB code and geometrical mean code which performs calculations in the DB without trashing real data. I personally think it's a "foot-gun" waiting to happen if we have the unit tests completely trash all the logged and aggregated data and this is the best way around that I could see. (It's worth noting as well that I _have_ been testing and developing all this on a VM but even then having the database and logged data trashed when running unit tests would be a real pain. It would make manually testing everything much harder.) http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site... sitescripts/filterhits/db.py:38: if kwargs.pop('dict_result', False): On 2015/03/26 22:56:50, Wladimir Palant wrote: > Why specify **kwargs if only one keyword parameter is ever used? Shouldn't this > function be declared as |def query(db, sql, dict_result=False, *params)|? I tried this but unfortunately it doesn't work. You end up with "TypeError: query() got multiple values for keyword argument 'dict_result'". http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site... File sitescripts/filterhits/geometrical_mean.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site... sitescripts/filterhits/geometrical_mean.py:33: yield ("""INSERT INTO `geometrical_mean` On 2015/03/26 22:56:50, Wladimir Palant wrote: > geometrical_mean is merely the aggregation approach we are using right now. A > table name should describe the meaning of its contents, not the implementation. > Call that table "frequency"? (I chose the plural to keep consistent with the other table name "filters".) Done.
http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site... File sitescripts/filterhits/bin/process_logs.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site... sitescripts/filterhits/bin/process_logs.py:18: import MySQLdb, itertools, json, os, sys On 2015/03/27 11:59:57, kzar wrote: > On 2015/03/26 22:56:50, Wladimir Palant wrote: > > Nit: PEP 8 requires one import per line and a blank line between groups of > > imports (https://www.python.org/dev/peps/pep-0008/#imports). Same in the other > > files of course. Yes, I know that most existing scripts aren't following these > > requirements yet. > > Done. While we are on it, please also add a new line between corelib and own module imports. http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site... sitescripts/filterhits/bin/process_logs.py:47: sys.exit("Unexpected EOF in log file %s" % log_file) On 2015/03/27 11:59:57, kzar wrote: > On 2015/03/26 22:56:50, Wladimir Palant wrote: > > Wouldn't that be a lot simpler if the date and the GET parameters were stored > in > > the first line of the file and the JSON data in the second? You would merely > > need to read a line then... > > Sure, I suppose this logic would be simpler that way. Originally it was done > this way as each log file contained many entries, one per line and I didn't > re-consider the format since we changed to having one entry per file. > > Thing is this logic works well and has been well tested now I would prefer to > leave it alone. Otherwise I'll need to update the format for test data that > we've generated, the unit tests, the supporting tools for Kirill and test it all > still works. I agree with Wladimir FWIW. I also wouldn't consider any code tested, if it's neither used by real users nor passed QA. But you make a point, if we are going to change it, we should do it now. ;) http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site... sitescripts/filterhits/bin/process_logs.py:78: sys.exit("Failed to connect to the MySQL database.") On 2015/03/27 11:59:57, kzar wrote: > On 2015/03/26 22:56:50, Wladimir Palant wrote: > > I have the impression that the original exception will be more informative > here, > > why catch it? > > This was Sebastian's idea, not mine. (See the inline discussion around here in > Patch set 5.) There are no comments on patchset 5. You probably mean this discussion: http://codereview.adblockplus.org/4615801646612480/patch/5768755258851328/619... I didn't suggest to print this string instead the original exception here. That was your decision. I merely pointed out that you have to open the try/finally block closing the connection after you actually create the connection. http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site... File sitescripts/filterhits/common.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site... sitescripts/filterhits/common.py:47: query_string, json.dumps(data))) On 2015/03/27 11:59:57, kzar wrote: > On 2015/03/26 22:56:50, Wladimir Palant wrote: > > Nit: you can use |print >>f, ...| - then you won't need to specify the > trailing > > newline explicitly. > > Well apparently that's "unpythonic" I don't mind either way, but are you sure? > http://stackoverflow.com/questions/8598228/f-write-vs-print-f I don't agree. I agree though that >> syntax is kinda weird. That is why the print statement was replaced with a function in Python 3. However, the way Wladimir suggested is the way to do it in Python 2. Note that |file.write("foo")| isn't completely the same as |print >>file, "foo"|. Amongs others the print statement (or function in Python 3), considers the terminal encoding when processing unicode objects and automatically picks the correct newline sequence depending of the OS. http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site... File sitescripts/filterhits/db.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site... sitescripts/filterhits/db.py:28: db=config.get("filterhitstats", "test_database" if testing else "database"), On 2015/03/27 11:59:57, kzar wrote: > On 2015/03/26 22:56:50, Wladimir Palant wrote: > > When would that flag ever be used? This seems to imply that we would test on > the > > production server - that sounds like a bad idea, with flag or without. We have > > VMs for that, using production configuration. > > Well originally I was passing the parameters through to db.connect each time so > that I could pass one database name for tests and another elsewhere. It lead to > a lot of duplication though so Sebastian suggested I changed that. (Rightly so I > think.) > > This flag is a little ugly, it's true, but it is very useful for the unit tests. > It allows us to have api tests that don't trash the real data. (Logged and in > the database.) It also allows us to test the DB code and geometrical mean code > which performs calculations in the DB without trashing real data. > > I personally think it's a "foot-gun" waiting to happen if we have the unit tests > completely trash all the logged and aggregated data and this is the best way > around that I could see. > > (It's worth noting as well that I _have_ been testing and developing all this on > a VM but even then having the database and logged data trashed when running unit > tests would be a real pain. It would make manually testing everything much > harder.) I agree with Wladimir. The footgun here would be testing on the production system, which you shouldn't do anyway. If you are bothered about destroying your test data, you should just backup them. http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site... sitescripts/filterhits/db.py:38: if kwargs.pop('dict_result', False): On 2015/03/27 11:59:57, kzar wrote: > On 2015/03/26 22:56:50, Wladimir Palant wrote: > > Why specify **kwargs if only one keyword parameter is ever used? Shouldn't > this > > function be declared as |def query(db, sql, dict_result=False, *params)|? > > I tried this but unfortunately it doesn't work. You end up with "TypeError: > query() got multiple values for keyword argument 'dict_result'". Yep, that is because the first param given will be interpreted as dict_result argument. There will be keyword-only arguments in Python 3. But with Python 2 you have to do it the way you did. http://codereview.adblockplus.org/4615801646612480/diff/5105437288431616/site... File sitescripts/filterhits/web/submit.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5105437288431616/site... sitescripts/filterhits/web/submit.py:51: f.write("[%s] \"%s\" %s\n" % (time.strftime('%d/%b/%Y:%H:%M:%S', now), Same here, I'd rather go with the print statement.
http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site... File sitescripts/filterhits/db.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site... sitescripts/filterhits/db.py:28: db=config.get("filterhitstats", "test_database" if testing else "database"), On 2015/03/27 13:12:19, Sebastian Noack wrote: > On 2015/03/27 11:59:57, kzar wrote: > > On 2015/03/26 22:56:50, Wladimir Palant wrote: > > > When would that flag ever be used? This seems to imply that we would test on > > the > > > production server - that sounds like a bad idea, with flag or without. We > have > > > VMs for that, using production configuration. > > > > Well originally I was passing the parameters through to db.connect each time > so > > that I could pass one database name for tests and another elsewhere. It lead > to > > a lot of duplication though so Sebastian suggested I changed that. (Rightly so > I > > think.) > > > > This flag is a little ugly, it's true, but it is very useful for the unit > tests. > > It allows us to have api tests that don't trash the real data. (Logged and in > > the database.) It also allows us to test the DB code and geometrical mean code > > which performs calculations in the DB without trashing real data. > > > > I personally think it's a "foot-gun" waiting to happen if we have the unit > tests > > completely trash all the logged and aggregated data and this is the best way > > around that I could see. > > > > (It's worth noting as well that I _have_ been testing and developing all this > on > > a VM but even then having the database and logged data trashed when running > unit > > tests would be a real pain. It would make manually testing everything much > > harder.) > > I agree with Wladimir. The footgun here would be testing on the production > system, which you shouldn't do anyway. If you are bothered about destroying your > test data, you should just backup them. Regarding the unit tests, I think it would be a better approach to configure the database settings used to run the tests separately. So the unit tests will just fail to run in production where no test database is configured. This is quite a common approach for testing web apps.
Patch Set 10 : Addressed Sebastian's and Wladimir's comments. http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site... File sitescripts/filterhits/bin/process_logs.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site... sitescripts/filterhits/bin/process_logs.py:18: import MySQLdb, itertools, json, os, sys On 2015/03/27 13:12:19, Sebastian Noack wrote: > On 2015/03/27 11:59:57, kzar wrote: > > On 2015/03/26 22:56:50, Wladimir Palant wrote: > > > Nit: PEP 8 requires one import per line and a blank line between groups of > > > imports (https://www.python.org/dev/peps/pep-0008/#imports). Same in the > other > > > files of course. Yes, I know that most existing scripts aren't following > these > > > requirements yet. > > > > Done. > > While we are on it, please also add a new line between corelib and own module > imports. Done. http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site... sitescripts/filterhits/bin/process_logs.py:47: sys.exit("Unexpected EOF in log file %s" % log_file) On 2015/03/27 13:12:19, Sebastian Noack wrote: > On 2015/03/27 11:59:57, kzar wrote: > > On 2015/03/26 22:56:50, Wladimir Palant wrote: > > > Wouldn't that be a lot simpler if the date and the GET parameters were > stored > > in > > > the first line of the file and the JSON data in the second? You would merely > > > need to read a line then... > > > > Sure, I suppose this logic would be simpler that way. Originally it was done > > this way as each log file contained many entries, one per line and I didn't > > re-consider the format since we changed to having one entry per file. > > > > Thing is this logic works well and has been well tested now I would prefer to > > leave it alone. Otherwise I'll need to update the format for test data that > > we've generated, the unit tests, the supporting tools for Kirill and test it > all > > still works. > > I agree with Wladimir FWIW. I also wouldn't consider any code tested, if it's > neither used by real users nor passed QA. But you make a point, if we are going > to change it, we should do it now. ;) OK fair enough, Done. http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site... sitescripts/filterhits/bin/process_logs.py:78: sys.exit("Failed to connect to the MySQL database.") On 2015/03/27 13:12:19, Sebastian Noack wrote: > On 2015/03/27 11:59:57, kzar wrote: > > On 2015/03/26 22:56:50, Wladimir Palant wrote: > > > I have the impression that the original exception will be more informative > > here, > > > why catch it? > > > > This was Sebastian's idea, not mine. (See the inline discussion around here in > > Patch set 5.) > > There are no comments on patchset 5. You probably mean this discussion: > http://codereview.adblockplus.org/4615801646612480/patch/5768755258851328/619... > > I didn't suggest to print this string instead the original exception here. That > was your decision. I merely pointed out that you have to open the try/finally > block closing the connection after you actually create the connection. OK the discussion was in patch set 4... it doesn't matter. http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site... File sitescripts/filterhits/common.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site... sitescripts/filterhits/common.py:47: query_string, json.dumps(data))) On 2015/03/27 13:12:19, Sebastian Noack wrote: > On 2015/03/27 11:59:57, kzar wrote: > > On 2015/03/26 22:56:50, Wladimir Palant wrote: > > > Nit: you can use |print >>f, ...| - then you won't need to specify the > > trailing > > > newline explicitly. > > > > Well apparently that's "unpythonic" I don't mind either way, but are you sure? > > http://stackoverflow.com/questions/8598228/f-write-vs-print-f > > I don't agree. I agree though that >> syntax is kinda weird. That is why the > print statement was replaced with a function in Python 3. However, the way > Wladimir suggested is the way to do it in Python 2. > > Note that |file.write("foo")| isn't completely the same as |print >>file, > "foo"|. Amongs others the print statement (or function in Python 3), considers > the terminal encoding when processing unicode objects and automatically picks > the correct newline sequence depending of the OS. Done. http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site... File sitescripts/filterhits/db.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site... sitescripts/filterhits/db.py:28: db=config.get("filterhitstats", "test_database" if testing else "database"), On 2015/03/27 14:26:06, Sebastian Noack wrote: > On 2015/03/27 13:12:19, Sebastian Noack wrote: > > On 2015/03/27 11:59:57, kzar wrote: > > > On 2015/03/26 22:56:50, Wladimir Palant wrote: > > > > When would that flag ever be used? This seems to imply that we would test > on > > > the > > > > production server - that sounds like a bad idea, with flag or without. We > > have > > > > VMs for that, using production configuration. > > > > > > Well originally I was passing the parameters through to db.connect each time > > so > > > that I could pass one database name for tests and another elsewhere. It lead > > to > > > a lot of duplication though so Sebastian suggested I changed that. (Rightly > so > > I > > > think.) > > > > > > This flag is a little ugly, it's true, but it is very useful for the unit > > tests. > > > It allows us to have api tests that don't trash the real data. (Logged and > in > > > the database.) It also allows us to test the DB code and geometrical mean > code > > > which performs calculations in the DB without trashing real data. > > > > > > I personally think it's a "foot-gun" waiting to happen if we have the unit > > tests > > > completely trash all the logged and aggregated data and this is the best way > > > around that I could see. > > > > > > (It's worth noting as well that I _have_ been testing and developing all > this > > on > > > a VM but even then having the database and logged data trashed when running > > unit > > > tests would be a real pain. It would make manually testing everything much > > > harder.) > > > > I agree with Wladimir. The footgun here would be testing on the production > > system, which you shouldn't do anyway. If you are bothered about destroying > your > > test data, you should just backup them. > > Regarding the unit tests, I think it would be a better approach to configure the > database settings used to run the tests separately. So the unit tests will just > fail to run in production where no test database is configured. This is quite a > common approach for testing web apps. I guess I just disagree with you both on this one! If I do that I will not be able to run my unit tests whilst developing without having my database + logs trashed. (I like to test by hand too!) Also it kind of sucks either not being able to run the tests on the production server or having them trash all the real data! http://codereview.adblockplus.org/4615801646612480/diff/5105437288431616/site... File sitescripts/filterhits/web/submit.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5105437288431616/site... sitescripts/filterhits/web/submit.py:51: f.write("[%s] \"%s\" %s\n" % (time.strftime('%d/%b/%Y:%H:%M:%S', now), On 2015/03/27 13:12:19, Sebastian Noack wrote: > Same here, I'd rather go with the print statement. Done.
Done with the review. Note that I only skimmed the tests, might take a closer look at those later. http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site... File sitescripts/filterhits/bin/process_logs.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site... sitescripts/filterhits/bin/process_logs.py:31: if os.path.splitext(f)[1] == ".log" and f[0].isdigit(): On 2015/03/27 11:59:57, kzar wrote: > That is a simple way to skip over other log files in the directory, namely > processing-errors.log. Not strictly necessary but it makes the tool easier to > use in practice. Aren't all logs we are interested in inside of subdirectories? |if root != dir and os.path.splitext(f)[1] == ".log"|? http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site... sitescripts/filterhits/bin/process_logs.py:47: sys.exit("Unexpected EOF in log file %s" % log_file) On 2015/03/27 11:59:57, kzar wrote: > Otherwise I'll need to update the format for test data that > we've generated, the unit tests, the supporting tools for Kirill and test it all > still works. Yes, that's why we have reviews - so that we don't change this stuff *after* we create lots of tools depending on it ;) This should be changed now, it won't get any easier in a year. http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site... File sitescripts/filterhits/db.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site... sitescripts/filterhits/db.py:28: db=config.get("filterhitstats", "test_database" if testing else "database"), On 2015/03/27 14:26:06, Sebastian Noack wrote: > Regarding the unit tests, I think it would be a better approach to configure the > database settings used to run the tests separately. So the unit tests will just > fail to run in production where no test database is configured. This is quite a > common approach for testing web apps. A simpler way to deal with unit tests would having them change the configuration temporarily and prepend the database name by "test_". http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/.hgi... File .hgignore (right): http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/.hgi... .hgignore:6: sitescripts/filterhits/test/temp This seems to be an artifact of your personal test environment, not something that anybody else would need. You can add it to you personal hgignore file, not point having it here for everybody. http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/site... File sitescripts/filterhits/bin/process_logs.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/site... sitescripts/filterhits/bin/process_logs.py:76: sys.exit("Failed to connect to the MySQL database.") Given that Sebastian doesn't seem to disagree, could you just leave that exception alone here instead of turning it into a meaningless message? http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/site... sitescripts/filterhits/bin/process_logs.py:80: except (KeyError, MySQLdb.Error), e: The json module seems to raise a ValueError exception on invalid syntax. It seems that for this exception we won't tell which file caused it. Do we even care about the exception type here? How about: except: logging.error("Failed to process file %s, all changes rolled back." % _last_log_file) raise http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/site... File sitescripts/filterhits/common.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/site... sitescripts/filterhits/common.py:20: return [message.encode("utf-8")] Given that this is only useful for the web handlers, how about moving this file into the web/ directory? Side-note: I wonder whether you could somehow convince Git that is not a copy of sitescripts/reports/bin/removeOldUsers.py. http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/site... File sitescripts/filterhits/static/query.html (right): http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/site... sitescripts/filterhits/static/query.html:23: <th>Hits</th> Given that these aren't actual hits - name it Frequency maybe? http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/site... File sitescripts/filterhits/static/query.js (right): http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/site... sitescripts/filterhits/static/query.js:38: $('#filter, #domain').keyup(function () { table.fnDraw(); }); How about listening to the "input" event instead? This will work when text is being dragged into the input field as well. Note that I don't mind registering your event handler the old-fashioned way if jQuery doesn't support it ;) http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/site... File sitescripts/filterhits/web/query.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/site... sitescripts/filterhits/web/query.py:41: where_sql = "WHERE " + where if where else "" This is confusing, why the intermediate step? where_fields = ["%s LIKE %%%s%%" % (s, p) for s, p in ...] where_sql = "WHERE " + " AND ".join(where_fields) if where_fields else "" However, the value we are searching for should *not* be inserted into the query - that's an SQL injection vulnerability. *All* dynamic strings should be added to the parameters of the prepared statement. So it should really be something like: where_fields, params = zip(*[("%s LIKE %%s" % s, "%%%s%%" % p) for s, p in ...]) where_sql = "WHERE " + " AND ".join(where_fields) if where_fields else "" http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/site... sitescripts/filterhits/web/query.py:44: order_by_sql = "`%s` %s" % (MySQLdb.escape_string(order_by), order) How about you only allow certain values for order_by instead of escaping some bogus field name? As I said, no dynamic strings in the query please. http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/site... sitescripts/filterhits/web/query.py:63: "500 Database error") The actual exception needs to be logged still, calling traceback.print_exc() should do. http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/site... sitescripts/filterhits/web/query.py:70: response_headers = [("Content-type", "application/json")] "application/json; charset=utf-8" please. http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/site... sitescripts/filterhits/web/query.py:73: "total": total, "count": len(results)})] This should be json.dumps(..., ensure_ascii=False).encode("utf-8") http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/site... File sitescripts/filterhits/web/submit.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/site... sitescripts/filterhits/web/submit.py:52: query_string) I wonder whether quotation marks in the query string will break any of the tools you wrote. And before you object that most browsers will turn quotation marks into %22 - sure, but anybody using telnet or an older IE version can still send them to the server. Given that a query string cannot have whitespace in it due to specifics of the HTTP protocol - why use quotation marks as delimiters here? http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/site... sitescripts/filterhits/web/submit.py:82: "500 Logging error") This seems to be an unexpected error, one that is worth logging. Do traceback.print_exc() as well here? http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/site... sitescripts/filterhits/web/submit.py:104: print >> f, "[%s] %s" % (datetime.now().strftime('%d/%b/%Y:%H:%M:%S %z'), message) I'd suggest using traceback module here instead of generating custom messages. This should also allow logging any error types rather than only two.
http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/site... File sitescripts/filterhits/bin/process_logs.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/site... sitescripts/filterhits/bin/process_logs.py:1: # coding: utf-8 One more note: the purpose of this script wasn't clear to me initially, rename it into reprocess_logs.py?
http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/site... File sitescripts/filterhits/web/submit.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/site... sitescripts/filterhits/web/submit.py:53: print >> f, json.dumps(data) json.dump(data, f)
Patch Set 11 : Addressed further comments from Wladimir. http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site... File sitescripts/filterhits/bin/process_logs.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site... sitescripts/filterhits/bin/process_logs.py:31: if os.path.splitext(f)[1] == ".log" and f[0].isdigit(): On 2015/03/27 16:29:06, Wladimir Palant wrote: > On 2015/03/27 11:59:57, kzar wrote: > > That is a simple way to skip over other log files in the directory, namely > > processing-errors.log. Not strictly necessary but it makes the tool easier to > > use in practice. > > Aren't all logs we are interested in inside of subdirectories? |if root != dir > and os.path.splitext(f)[1] == ".log"|? Not always, for example you might choose to only re-process today's logs in which case all the log files _would_ be in the root directory. http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site... sitescripts/filterhits/bin/process_logs.py:47: sys.exit("Unexpected EOF in log file %s" % log_file) On 2015/03/27 16:29:06, Wladimir Palant wrote: > On 2015/03/27 11:59:57, kzar wrote: > > Otherwise I'll need to update the format for test data that > > we've generated, the unit tests, the supporting tools for Kirill and test it > all > > still works. > > Yes, that's why we have reviews - so that we don't change this stuff *after* we > create lots of tools depending on it ;) > > This should be changed now, it won't get any easier in a year. Better yet, I changed it in the past :p. (See patch set 10) http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site... File sitescripts/filterhits/db.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site... sitescripts/filterhits/db.py:28: db=config.get("filterhitstats", "test_database" if testing else "database"), On 2015/03/27 16:29:06, Wladimir Palant wrote: > On 2015/03/27 14:26:06, Sebastian Noack wrote: > > Regarding the unit tests, I think it would be a better approach to configure > the > > database settings used to run the tests separately. So the unit tests will > just > > fail to run in production where no test database is configured. This is quite > a > > common approach for testing web apps. > > A simpler way to deal with unit tests would having them change the configuration > temporarily and prepend the database name by "test_". But we are also avoiding logging files when performing the tests of the submit API. (Without that the real logged data gets mixed up with a bunch of test data.) http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/.hgi... File .hgignore (right): http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/.hgi... .hgignore:6: sitescripts/filterhits/test/temp On 2015/03/27 16:29:06, Wladimir Palant wrote: > This seems to be an artifact of your personal test environment, not something > that anybody else would need. You can add it to you personal hgignore file, not > point having it here for everybody. Done. http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/site... File sitescripts/filterhits/bin/process_logs.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/site... sitescripts/filterhits/bin/process_logs.py:1: # coding: utf-8 On 2015/03/27 16:30:48, Wladimir Palant wrote: > One more note: the purpose of this script wasn't clear to me initially, rename > it into reprocess_logs.py? Done. http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/site... sitescripts/filterhits/bin/process_logs.py:76: sys.exit("Failed to connect to the MySQL database.") On 2015/03/27 16:29:06, Wladimir Palant wrote: > Given that Sebastian doesn't seem to disagree, could you just leave that > exception alone here instead of turning it into a meaningless message? Done. http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/site... sitescripts/filterhits/bin/process_logs.py:80: except (KeyError, MySQLdb.Error), e: On 2015/03/27 16:29:06, Wladimir Palant wrote: > The json module seems to raise a ValueError exception on invalid syntax. It > seems that for this exception we won't tell which file caused it. > > Do we even care about the exception type here? How about: > > except: > logging.error("Failed to process file %s, all changes rolled back." % > _last_log_file) > raise Done. http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/site... File sitescripts/filterhits/common.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/site... sitescripts/filterhits/common.py:20: return [message.encode("utf-8")] On 2015/03/27 16:29:06, Wladimir Palant wrote: > Given that this is only useful for the web handlers, how about moving this file > into the web/ directory? > > Side-note: I wonder whether you could somehow convince Git that is not a copy of > sitescripts/reports/bin/removeOldUsers.py. Done. http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/site... File sitescripts/filterhits/static/query.html (right): http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/site... sitescripts/filterhits/static/query.html:23: <th>Hits</th> On 2015/03/27 16:29:06, Wladimir Palant wrote: > Given that these aren't actual hits - name it Frequency maybe? Done. http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/site... File sitescripts/filterhits/static/query.js (right): http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/site... sitescripts/filterhits/static/query.js:38: $('#filter, #domain').keyup(function () { table.fnDraw(); }); On 2015/03/27 16:29:06, Wladimir Palant wrote: > How about listening to the "input" event instead? This will work when text is > being dragged into the input field as well. > > Note that I don't mind registering your event handler the old-fashioned way if > jQuery doesn't support it ;) Done. http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/site... File sitescripts/filterhits/web/query.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/site... sitescripts/filterhits/web/query.py:41: where_sql = "WHERE " + where if where else "" On 2015/03/27 16:29:06, Wladimir Palant wrote: > This is confusing, why the intermediate step? > > where_fields = ["%s LIKE %%%s%%" % (s, p) for s, p in ...] > where_sql = "WHERE " + " AND ".join(where_fields) if where_fields else "" > > However, the value we are searching for should *not* be inserted into the query > - that's an SQL injection vulnerability. *All* dynamic strings should be added > to the parameters of the prepared statement. So it should really be something > like: > > where_fields, params = zip(*[("%s LIKE %%s" % s, "%%%s%%" % p) for s, p in ...]) > where_sql = "WHERE " + " AND ".join(where_fields) if where_fields else "" You're right this code was confusing, I hadn't looked at it for a while and I found it hard to re-grok. I've tidied it up a bit now like you suggested. (Although unfortunately I needed to add an extra step as we can't unpack a potentially empty array into two values.) That said in my defence I don't think there was a SQL vulnerability here, f[0] is either "domain" or "filter" and did not come from user input. f[1] is used below, when it is added to the parameter list which is then passed to the database query function. The database query function should then take care of the escaping of those parameters for us. http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/site... sitescripts/filterhits/web/query.py:44: order_by_sql = "`%s` %s" % (MySQLdb.escape_string(order_by), order) On 2015/03/27 16:29:06, Wladimir Palant wrote: > How about you only allow certain values for order_by instead of escaping some > bogus field name? As I said, no dynamic strings in the query please. Done. http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/site... sitescripts/filterhits/web/query.py:63: "500 Database error") On 2015/03/27 16:29:06, Wladimir Palant wrote: > The actual exception needs to be logged still, calling traceback.print_exc() > should do. Done. http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/site... sitescripts/filterhits/web/query.py:70: response_headers = [("Content-type", "application/json")] On 2015/03/27 16:29:06, Wladimir Palant wrote: > "application/json; charset=utf-8" please. Done. http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/site... sitescripts/filterhits/web/query.py:73: "total": total, "count": len(results)})] On 2015/03/27 16:29:06, Wladimir Palant wrote: > This should be json.dumps(..., ensure_ascii=False).encode("utf-8") Done. http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/site... File sitescripts/filterhits/web/submit.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/site... sitescripts/filterhits/web/submit.py:52: query_string) On 2015/03/27 16:29:06, Wladimir Palant wrote: > I wonder whether quotation marks in the query string will break any of the tools > you wrote. And before you object that most browsers will turn quotation marks > into %22 - sure, but anybody using telnet or an older IE version can still send > them to the server. > > Given that a query string cannot have whitespace in it due to specifics of the > HTTP protocol - why use quotation marks as delimiters here? (Give me a little credit, I would not assume data sent from the client is safe / correct.) Done. http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/site... sitescripts/filterhits/web/submit.py:53: print >> f, json.dumps(data) On 2015/03/27 16:31:04, Sebastian Noack wrote: > json.dump(data, f) Done. http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/site... sitescripts/filterhits/web/submit.py:82: "500 Logging error") On 2015/03/27 16:29:06, Wladimir Palant wrote: > This seems to be an unexpected error, one that is worth logging. Do > traceback.print_exc() as well here? Done. http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/site... sitescripts/filterhits/web/submit.py:104: print >> f, "[%s] %s" % (datetime.now().strftime('%d/%b/%Y:%H:%M:%S %z'), message) On 2015/03/27 16:29:06, Wladimir Palant wrote: > I'd suggest using traceback module here instead of generating custom messages. > This should also allow logging any error types rather than only two. Done.
Almost there, only a few nit left. http://codereview.adblockplus.org/4615801646612480/diff/5735425238892544/.hgi... File .hgignore (right): http://codereview.adblockplus.org/4615801646612480/diff/5735425238892544/.hgi... .hgignore:5: dist Now you removed the line break at the end of this file - and that's your only change here. You better revert it, it will bite you when you rebase (your patch is based on an older version of this file). http://codereview.adblockplus.org/4615801646612480/diff/5735425238892544/site... File sitescripts/filterhits/static/query.js (right): http://codereview.adblockplus.org/4615801646612480/diff/5735425238892544/site... sitescripts/filterhits/static/query.js:38: $('#filter, #domain').on('input', function () { table.fnDraw(); }); Style nit: Double quotation marks in JavaScript please unless there is a reason to use single quotation marks. http://codereview.adblockplus.org/4615801646612480/diff/5735425238892544/site... File sitescripts/filterhits/web/common.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5735425238892544/site... sitescripts/filterhits/web/common.py:1: # coding: utf-8 Unfortunately, this file will still have a messy history, it being considered a copy of sitescripts/reports/bin/removeOldUsers.py. I guess Git cannot be convinced otherwise?
Patch Set 12 : Addressed final nits. Patch Set 13 : Rebased. http://codereview.adblockplus.org/4615801646612480/diff/5735425238892544/.hgi... File .hgignore (right): http://codereview.adblockplus.org/4615801646612480/diff/5735425238892544/.hgi... .hgignore:5: dist On 2015/03/28 12:59:19, Wladimir Palant wrote: > Now you removed the line break at the end of this file - and that's your only > change here. You better revert it, it will bite you when you rebase (your patch > is based on an older version of this file). Ergh, sorry not sure how I missed that. Done. http://codereview.adblockplus.org/4615801646612480/diff/5735425238892544/site... File sitescripts/filterhits/static/query.js (right): http://codereview.adblockplus.org/4615801646612480/diff/5735425238892544/site... sitescripts/filterhits/static/query.js:38: $('#filter, #domain').on('input', function () { table.fnDraw(); }); On 2015/03/28 12:59:19, Wladimir Palant wrote: > Style nit: Double quotation marks in JavaScript please unless there is a reason > to use single quotation marks. Done. http://codereview.adblockplus.org/4615801646612480/diff/5735425238892544/site... File sitescripts/filterhits/web/common.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5735425238892544/site... sitescripts/filterhits/web/common.py:1: # coding: utf-8 On 2015/03/28 12:59:19, Wladimir Palant wrote: > Unfortunately, this file will still have a messy history, it being considered a > copy of sitescripts/reports/bin/removeOldUsers.py. I guess Git cannot be > convinced otherwise? Well I will be squashing down the commits into one before I create the patch so hopefully that will help?
LGTM http://codereview.adblockplus.org/4615801646612480/diff/5735425238892544/site... File sitescripts/filterhits/web/common.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5735425238892544/site... sitescripts/filterhits/web/common.py:1: # coding: utf-8 On 2015/03/28 14:11:56, kzar wrote: > Well I will be squashing down the commits into one before I create the patch so > hopefully that will help? Unlikely. But I guess I can modify the patch you give me to remove the bogus copy statement.
http://codereview.adblockplus.org/4615801646612480/diff/5735425238892544/site... File sitescripts/filterhits/web/common.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5735425238892544/site... sitescripts/filterhits/web/common.py:1: # coding: utf-8 On 2015/03/29 07:29:12, Wladimir Palant wrote: > On 2015/03/28 14:11:56, kzar wrote: > > Well I will be squashing down the commits into one before I create the patch > so > > hopefully that will help? > > Unlikely. But I guess I can modify the patch you give me to remove the bogus > copy statement. Actually it does seem to have helped, the diff considers it a new file now instead of a copy of sitescripts/reports/bin/removeOldUsers.py :)
Sorry, not ready after all. The way unit tests work still needs to be fixed. http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site... File sitescripts/filterhits/db.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site... sitescripts/filterhits/db.py:28: db=config.get("filterhitstats", "test_database" if testing else "database"), On 2015/03/27 22:15:00, kzar wrote: > But we are also avoiding logging files when performing the tests of the submit > API. Unit tests should not change the way the code works - they are supported to test it exactly the way it works in production. The obvious solution is changing the log file configuration in unit tests as well - direct log file generation to a temporary directory, remove it once the unit tests are done.
Patch Set 14 : Removed db.testing flag, instead overwrite config during testing. http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site... File sitescripts/filterhits/db.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site... sitescripts/filterhits/db.py:28: db=config.get("filterhitstats", "test_database" if testing else "database"), On 2015/03/30 11:23:55, Wladimir Palant wrote: > On 2015/03/27 22:15:00, kzar wrote: > > But we are also avoiding logging files when performing the tests of the submit > > API. > > Unit tests should not change the way the code works - they are supported to test > it exactly the way it works in production. The obvious solution is changing the > log file configuration in unit tests as well - direct log file generation to a > temporary directory, remove it once the unit tests are done. Done.
http://codereview.adblockplus.org/4615801646612480/diff/5154545407623168/site... File sitescripts/filterhits/test/test_helpers.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5154545407623168/site... sitescripts/filterhits/test/test_helpers.py:34: config.set("filterhitstats", "log_dir", test_config["log_dir"]) Maybe create a real temporary directory rather than assuming that you have write permissions to the repository? Generally, writing test files under the repository root seems to be a bad idea. http://codereview.adblockplus.org/4615801646612480/diff/5154545407623168/site... sitescripts/filterhits/test/test_helpers.py:39: config.set("filterhitstats", "log_dir", live_config["log_dir"]) Remove the temporary log directory here?
Patch Set 15 : Make test log directory path configurable and ensure it's always cleared. http://codereview.adblockplus.org/4615801646612480/diff/5154545407623168/site... File sitescripts/filterhits/test/test_helpers.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5154545407623168/site... sitescripts/filterhits/test/test_helpers.py:34: config.set("filterhitstats", "log_dir", test_config["log_dir"]) On 2015/03/30 13:13:46, Wladimir Palant wrote: > Maybe create a real temporary directory rather than assuming that you have write > permissions to the repository? Generally, writing test files under the > repository root seems to be a bad idea. Done. http://codereview.adblockplus.org/4615801646612480/diff/5154545407623168/site... sitescripts/filterhits/test/test_helpers.py:39: config.set("filterhitstats", "log_dir", live_config["log_dir"]) On 2015/03/30 13:13:46, Wladimir Palant wrote: > Remove the temporary log directory here? Done.
http://codereview.adblockplus.org/4615801646612480/diff/4815735301865472/site... File sitescripts/filterhits/test/test_helpers.py (right): http://codereview.adblockplus.org/4615801646612480/diff/4815735301865472/site... sitescripts/filterhits/test/test_helpers.py:29: "log_dir": config.get("filterhitstats", "test_log_dir") Why configure that directory? setup_config() can create a unique temporary directory via tempfile module.
Patch Set 16 : Create temporary log directory with tempfile module for tests. http://codereview.adblockplus.org/4615801646612480/diff/4815735301865472/site... File sitescripts/filterhits/test/test_helpers.py (right): http://codereview.adblockplus.org/4615801646612480/diff/4815735301865472/site... sitescripts/filterhits/test/test_helpers.py:29: "log_dir": config.get("filterhitstats", "test_log_dir") On 2015/03/30 18:35:53, Wladimir Palant wrote: > Why configure that directory? setup_config() can create a unique temporary > directory via tempfile module. Done.
http://codereview.adblockplus.org/4615801646612480/diff/5633494390669312/site... File sitescripts/filterhits/test/test_helpers.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5633494390669312/site... sitescripts/filterhits/test/test_helpers.py:30: "log_dir": tempfile.mkdtemp() This will create the directory exactly once rather than do it for each test run. This means that if this module is loaded yet no tests run then the directory will already be created but nothing will remove it. Also, the test results might depend on the order in which the tests run because only the first test will have the directory already - it will be removed after that test and not recreated. So I'd suggest removing this setting from test_config and rather calling mkdtemp() in setup_config() directly.
Patch Set 17 : Make sure the temporary log directory is recreated for each test. http://codereview.adblockplus.org/4615801646612480/diff/5633494390669312/site... File sitescripts/filterhits/test/test_helpers.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5633494390669312/site... sitescripts/filterhits/test/test_helpers.py:30: "log_dir": tempfile.mkdtemp() On 2015/03/30 19:07:04, Wladimir Palant wrote: > This will create the directory exactly once rather than do it for each test run. > This means that if this module is loaded yet no tests run then the directory > will already be created but nothing will remove it. Also, the test results might > depend on the order in which the tests run because only the first test will have > the directory already - it will be removed after that test and not recreated. > > So I'd suggest removing this setting from test_config and rather calling > mkdtemp() in setup_config() directly. Ugh sorry, good point. (I didn't notice this as I've been running each test individually so the log directory was always created properly as I expected.) Done.
LGTM
Looks mostly good. Only a few nits left. http://codereview.adblockplus.org/4615801646612480/diff/5722383033827328/site... File sitescripts/filterhits/db.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5722383033827328/site... sitescripts/filterhits/db.py:38: if kwargs.pop('dict_result', False): Nit: We don't use kwargs below. So there is no point in removing the item, hence calling .get() would be sufficient. http://codereview.adblockplus.org/4615801646612480/diff/5722383033827328/site... File sitescripts/filterhits/test/log_tests.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5722383033827328/site... sitescripts/filterhits/test/log_tests.py:31: self.config = test_helpers.setup_config() Why do you assign this attribute? It doesn't seem to be used in the other methods. Do we actually need to return it from setup_config()? http://codereview.adblockplus.org/4615801646612480/diff/5722383033827328/site... sitescripts/filterhits/test/log_tests.py:64: if __name__ == '__main__': Nit: Consistent whitespaces please. http://codereview.adblockplus.org/4615801646612480/diff/5722383033827328/site... File sitescripts/filterhits/test/test_helpers.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5722383033827328/site... sitescripts/filterhits/test/test_helpers.py:29: "database": config.get("filterhitstats", "test_database") Maybe we should also have test_dbuser and test_dbpassword? http://codereview.adblockplus.org/4615801646612480/diff/5722383033827328/site... File sitescripts/filterhits/web/submit.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5722383033827328/site... sitescripts/filterhits/web/submit.py:19: import MySQLdb Nit: Third-party modules go between corelib and own module imports. http://codereview.adblockplus.org/4615801646612480/diff/5722383033827328/site... sitescripts/filterhits/web/submit.py:106: response_headers = [("Content-type", "text/plain")] Do we actually need to specify a content type for no content?
http://codereview.adblockplus.org/4615801646612480/diff/5722383033827328/site... File sitescripts/filterhits/test/log_tests.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5722383033827328/site... sitescripts/filterhits/test/log_tests.py:64: if __name__ == '__main__': On 2015/03/31 07:55:21, Sebastian Noack wrote: > Nit: Consistent whitespaces please. I meant quotes of course.
Patch Set 18 : Addressed further feedback from Sebastian. http://codereview.adblockplus.org/4615801646612480/diff/5722383033827328/site... File sitescripts/filterhits/db.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5722383033827328/site... sitescripts/filterhits/db.py:38: if kwargs.pop('dict_result', False): On 2015/03/31 07:55:21, Sebastian Noack wrote: > Nit: We don't use kwargs below. So there is no point in removing the item, hence > calling .get() would be sufficient. Done. http://codereview.adblockplus.org/4615801646612480/diff/5722383033827328/site... File sitescripts/filterhits/test/log_tests.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5722383033827328/site... sitescripts/filterhits/test/log_tests.py:31: self.config = test_helpers.setup_config() On 2015/03/31 07:55:21, Sebastian Noack wrote: > Why do you assign this attribute? It doesn't seem to be used in the other > methods. Do we actually need to return it from setup_config()? Well the reason I've done it this way is that I'm trying to set a convention by calling the setup_config and resore_config functions in the same way for all the tests. I think having access to the configuration is generally something useful for tests to have and I want to avoid the situation where in the future someone forgets to call the setup_config function for a test and we end up trashing real data. It's true we could remove the return value, add an extra include here and get the configuration ourselves but I think it's cleaner how it is. http://codereview.adblockplus.org/4615801646612480/diff/5722383033827328/site... sitescripts/filterhits/test/log_tests.py:64: if __name__ == '__main__': On 2015/03/31 09:20:18, Sebastian Noack wrote: > On 2015/03/31 07:55:21, Sebastian Noack wrote: > > Nit: Consistent whitespaces please. > > I meant quotes of course. Done. http://codereview.adblockplus.org/4615801646612480/diff/5722383033827328/site... File sitescripts/filterhits/test/test_helpers.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5722383033827328/site... sitescripts/filterhits/test/test_helpers.py:29: "database": config.get("filterhitstats", "test_database") On 2015/03/31 07:55:21, Sebastian Noack wrote: > Maybe we should also have test_dbuser and test_dbpassword? Done. http://codereview.adblockplus.org/4615801646612480/diff/5722383033827328/site... File sitescripts/filterhits/web/submit.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5722383033827328/site... sitescripts/filterhits/web/submit.py:19: import MySQLdb On 2015/03/31 07:55:21, Sebastian Noack wrote: > Nit: Third-party modules go between corelib and own module imports. Done. http://codereview.adblockplus.org/4615801646612480/diff/5722383033827328/site... sitescripts/filterhits/web/submit.py:106: response_headers = [("Content-type", "text/plain")] On 2015/03/31 07:55:21, Sebastian Noack wrote: > Do we actually need to specify a content type for no content? I guess not, I've removed it.
http://codereview.adblockplus.org/4615801646612480/diff/5722383033827328/site... File sitescripts/filterhits/test/log_tests.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5722383033827328/site... sitescripts/filterhits/test/log_tests.py:31: self.config = test_helpers.setup_config() On 2015/03/31 09:48:56, kzar wrote: > On 2015/03/31 07:55:21, Sebastian Noack wrote: > > Why do you assign this attribute? It doesn't seem to be used in the other > > methods. Do we actually need to return it from setup_config()? > > Well the reason I've done it this way is that I'm trying to set a convention by > calling the setup_config and resore_config functions in the same way for all the > tests. I think having access to the configuration is generally something useful > for tests to have and I want to avoid the situation where in the future someone > forgets to call the setup_config function for a test and we end up trashing real > data. > > It's true we could remove the return value, add an extra include here and get > the configuration ourselves but I think it's cleaner how it is. I see, but this is the wrong way doing it then. Instead you should implement a base class or mixin, implementing the setUp() and tearDown() methods in test_helpers.py, deriving from it here and for the other tests.
http://codereview.adblockplus.org/4615801646612480/diff/5722383033827328/site... File sitescripts/filterhits/test/log_tests.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5722383033827328/site... sitescripts/filterhits/test/log_tests.py:31: self.config = test_helpers.setup_config() On 2015/03/31 10:19:04, Sebastian Noack wrote: > On 2015/03/31 09:48:56, kzar wrote: > > On 2015/03/31 07:55:21, Sebastian Noack wrote: > > > Why do you assign this attribute? It doesn't seem to be used in the other > > > methods. Do we actually need to return it from setup_config()? > > > > Well the reason I've done it this way is that I'm trying to set a convention > by > > calling the setup_config and resore_config functions in the same way for all > the > > tests. I think having access to the configuration is generally something > useful > > for tests to have and I want to avoid the situation where in the future > someone > > forgets to call the setup_config function for a test and we end up trashing > real > > data. > > > > It's true we could remove the return value, add an extra include here and get > > the configuration ourselves but I think it's cleaner how it is. > > I see, but this is the wrong way doing it then. Instead you should implement a > base class or mixin, implementing the setUp() and tearDown() methods in > test_helpers.py, deriving from it here and for the other tests. I agree, I thought of doing it that way originally. The problem I saw was that I would still have to call the base class' setup + teardown methods manually[1] and therefore I don't think the added complexity would actually help us. [1] - "If you want the setUpClass and tearDownClass on base classes called then you must call up to them yourself." https://docs.python.org/2/library/unittest.html#setupclass-and-teardownclass
http://codereview.adblockplus.org/4615801646612480/diff/5722383033827328/site... File sitescripts/filterhits/test/log_tests.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5722383033827328/site... sitescripts/filterhits/test/log_tests.py:31: self.config = test_helpers.setup_config() On 2015/03/31 10:27:47, kzar wrote: > On 2015/03/31 10:19:04, Sebastian Noack wrote: > > On 2015/03/31 09:48:56, kzar wrote: > > > On 2015/03/31 07:55:21, Sebastian Noack wrote: > > > > Why do you assign this attribute? It doesn't seem to be used in the other > > > > methods. Do we actually need to return it from setup_config()? > > > > > > Well the reason I've done it this way is that I'm trying to set a convention > > by > > > calling the setup_config and resore_config functions in the same way for all > > the > > > tests. I think having access to the configuration is generally something > > useful > > > for tests to have and I want to avoid the situation where in the future > > someone > > > forgets to call the setup_config function for a test and we end up trashing > > real > > > data. > > > > > > It's true we could remove the return value, add an extra include here and > get > > > the configuration ourselves but I think it's cleaner how it is. > > > > I see, but this is the wrong way doing it then. Instead you should implement a > > base class or mixin, implementing the setUp() and tearDown() methods in > > test_helpers.py, deriving from it here and for the other tests. > > I agree, I thought of doing it that way originally. The problem I saw was that I > would still have to call the base class' setup + teardown methods manually[1] > and therefore I don't think the added complexity would actually help us. > > [1] - "If you want the setUpClass and tearDownClass on base classes called then > you must call up to them yourself." > https://docs.python.org/2/library/unittest.html#setupclass-and-teardownclass Well, this documentation refers to setupClass/tearDownClass, not setUp/tearDown. Did you actually try to inherit the latter form a base class? I didn't find any documentation saying that it wouldn't work.
http://codereview.adblockplus.org/4615801646612480/diff/5722383033827328/site... File sitescripts/filterhits/web/submit.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5722383033827328/site... sitescripts/filterhits/web/submit.py:106: response_headers = [("Content-type", "text/plain")] On 2015/03/31 07:55:21, Sebastian Noack wrote: > Do we actually need to specify a content type for no content? Yes, we do - the browser still needs to know how to interpret the document, even if it is empty.
http://codereview.adblockplus.org/4615801646612480/diff/5722383033827328/site... File sitescripts/filterhits/web/submit.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5722383033827328/site... sitescripts/filterhits/web/submit.py:106: response_headers = [("Content-type", "text/plain")] On 2015/03/31 13:42:31, Wladimir Palant wrote: > On 2015/03/31 07:55:21, Sebastian Noack wrote: > > Do we actually need to specify a content type for no content? > > Yes, we do - the browser still needs to know how to interpret the document, even > if it is empty. Well, if we would also set the correct status code - i.e. "204 No Content" - the browser shouldn't expect any content anyway.
Patch Set 19 : Created base class for tests and reverted earlier content-type change. (Sorry for the delay with this, I've been having some issues with my new laptop.) http://codereview.adblockplus.org/4615801646612480/diff/5722383033827328/site... File sitescripts/filterhits/test/log_tests.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5722383033827328/site... sitescripts/filterhits/test/log_tests.py:31: self.config = test_helpers.setup_config() On 2015/03/31 10:32:23, Sebastian Noack wrote: > On 2015/03/31 10:27:47, kzar wrote: > > On 2015/03/31 10:19:04, Sebastian Noack wrote: > > > On 2015/03/31 09:48:56, kzar wrote: > > > > On 2015/03/31 07:55:21, Sebastian Noack wrote: > > > > > Why do you assign this attribute? It doesn't seem to be used in the > other > > > > > methods. Do we actually need to return it from setup_config()? > > > > > > > > Well the reason I've done it this way is that I'm trying to set a > convention > > > by > > > > calling the setup_config and resore_config functions in the same way for > all > > > the > > > > tests. I think having access to the configuration is generally something > > > useful > > > > for tests to have and I want to avoid the situation where in the future > > > someone > > > > forgets to call the setup_config function for a test and we end up > trashing > > > real > > > > data. > > > > > > > > It's true we could remove the return value, add an extra include here and > > get > > > > the configuration ourselves but I think it's cleaner how it is. > > > > > > I see, but this is the wrong way doing it then. Instead you should implement > a > > > base class or mixin, implementing the setUp() and tearDown() methods in > > > test_helpers.py, deriving from it here and for the other tests. > > > > I agree, I thought of doing it that way originally. The problem I saw was that > I > > would still have to call the base class' setup + teardown methods manually[1] > > and therefore I don't think the added complexity would actually help us. > > > > [1] - "If you want the setUpClass and tearDownClass on base classes called > then > > you must call up to them yourself." > > https://docs.python.org/2/library/unittest.html#setupclass-and-teardownclass > > Well, this documentation refers to setupClass/tearDownClass, not setUp/tearDown. > Did you actually try to inherit the latter form a base class? I didn't find any > documentation saying that it wouldn't work. You were right, I've done it that way now and it works fine. Done. http://codereview.adblockplus.org/4615801646612480/diff/5722383033827328/site... File sitescripts/filterhits/web/submit.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5722383033827328/site... sitescripts/filterhits/web/submit.py:106: response_headers = [("Content-type", "text/plain")] On 2015/03/31 13:46:53, Sebastian Noack wrote: > On 2015/03/31 13:42:31, Wladimir Palant wrote: > > On 2015/03/31 07:55:21, Sebastian Noack wrote: > > > Do we actually need to specify a content type for no content? > > > > Yes, we do - the browser still needs to know how to interpret the document, > even > > if it is empty. > > Well, if we would also set the correct status code - i.e. "204 No Content" - the > browser shouldn't expect any content anyway. OK so I tested returning a 200 with and without the content type and both worked for me, but if Wladimir says the browser still might need it I'm inclined to believe him. You're right that 204 is technically a more correct status code but my vote is to return 200 still as in practice maybe a client would check for that exact code. So I vote we just return 200 with the possibly redundant content type, as far as I can see it is the most likely to work as we want and I don't think there are any down sides apart from transmitting a few extra bytes.
http://codereview.adblockplus.org/4615801646612480/diff/5713050069893120/site... File sitescripts/filterhits/test/test_helpers.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5713050069893120/site... sitescripts/filterhits/test/test_helpers.py:45: # Set up test config There is a lot of boilerplate and duplication in the code mocking and restoring the config. How about this? def setUp(self): self._backup = [] for option in ("dbuser", "dbpassword", "database"): self._backup[option] = config.get("filterhitstats", option) config.set("filterhitstats", option, config.get("filterhitstats", "test_" + option)) self._backup["log_dir"] = config.get("filterhitstats", "log_dir") self._test_dir = tempfile.mkdtemp() config.set("filterhitstats", "log_dir", self._test_dir) ... def tearDown(self): ... for option, value in self._backup.iteritems(): self._config.set("filterhitstats", option, value) http://codereview.adblockplus.org/4615801646612480/diff/5713050069893120/site... sitescripts/filterhits/test/test_helpers.py:56: self.db = None Wouldn't it be better to fail here with the actual MySQLdb.Error, rather than with a meaningless TypeError during the test? http://codereview.adblockplus.org/4615801646612480/diff/5713050069893120/site... File sitescripts/filterhits/web/submit.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5713050069893120/site... sitescripts/filterhits/web/submit.py:108: start_response("200 OK", response_headers) Again, the status should be "204 No Content". And then you shouldn't need a mime type.
http://codereview.adblockplus.org/4615801646612480/diff/5713050069893120/site... File sitescripts/filterhits/test/test_helpers.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5713050069893120/site... sitescripts/filterhits/test/test_helpers.py:45: # Set up test config On 2015/04/02 07:37:04, Sebastian Noack wrote: > There is a lot of boilerplate and duplication in the code mocking and restoring > the config. How about this? > > def setUp(self): > self._backup = [] > > for option in ("dbuser", "dbpassword", "database"): > self._backup[option] = config.get("filterhitstats", option) > config.set("filterhitstats", option, config.get("filterhitstats", "test_" + > option)) > > self._backup["log_dir"] = config.get("filterhitstats", "log_dir") > self._test_dir = tempfile.mkdtemp() > config.set("filterhitstats", "log_dir", self._test_dir) > > ... > > def tearDown(self): > ... > > for option, value in self._backup.iteritems(): > self._config.set("filterhitstats", option, value) I prefer it as it is, although it doesn't matter much either way. (I suppose one advantage to my version is that the configuration values are read less often.) http://codereview.adblockplus.org/4615801646612480/diff/5713050069893120/site... sitescripts/filterhits/test/test_helpers.py:56: self.db = None On 2015/04/02 07:37:04, Sebastian Noack wrote: > Wouldn't it be better to fail here with the actual MySQLdb.Error, rather than > with a meaningless TypeError during the test? The tests that require database access first check that self.db is truthy, if not they skip. This way we can still run the other tests without a test database set up. http://codereview.adblockplus.org/4615801646612480/diff/5713050069893120/site... File sitescripts/filterhits/web/submit.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5713050069893120/site... sitescripts/filterhits/web/submit.py:108: start_response("200 OK", response_headers) On 2015/04/02 07:37:04, Sebastian Noack wrote: > Again, the status should be "204 No Content". And then you shouldn't need a mime > type. What's your opinion on this Wladimir? As earlier discussed I would rather return 200 in case clients (wrongly) check for that explicitly. Sebastian would rather return 204 as it's a more accurate status code.
http://codereview.adblockplus.org/4615801646612480/diff/5713050069893120/site... File sitescripts/filterhits/test/test_helpers.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5713050069893120/site... sitescripts/filterhits/test/test_helpers.py:45: # Set up test config On 2015/04/02 07:47:47, kzar wrote: > On 2015/04/02 07:37:04, Sebastian Noack wrote: > > There is a lot of boilerplate and duplication in the code mocking and > restoring > > the config. How about this? > > > > def setUp(self): > > self._backup = [] > > > > for option in ("dbuser", "dbpassword", "database"): > > self._backup[option] = config.get("filterhitstats", option) > > config.set("filterhitstats", option, config.get("filterhitstats", "test_" > + > > option)) > > > > self._backup["log_dir"] = config.get("filterhitstats", "log_dir") > > self._test_dir = tempfile.mkdtemp() > > config.set("filterhitstats", "log_dir", self._test_dir) > > > > ... > > > > def tearDown(self): > > ... > > > > for option, value in self._backup.iteritems(): > > self._config.set("filterhitstats", option, value) > > I prefer it as it is, although it doesn't matter much either way. (I suppose one > advantage to my version is that the configuration values are read less often.) We don't have to care about performance here. Moreover, this is actually an advantage. The test should backup the application's state before it runs, restoring it when it's done. Retrieving the config already on module intilization might cause problems if the config get's mocked by other parts of the application. More a theoretical scenario here, since as soon as the module is loaded the test runs, but still. However, the more important point here is that duplication and unneeded boilerplate isn't great. http://codereview.adblockplus.org/4615801646612480/diff/5713050069893120/site... sitescripts/filterhits/test/test_helpers.py:56: self.db = None On 2015/04/02 07:47:47, kzar wrote: > On 2015/04/02 07:37:04, Sebastian Noack wrote: > > Wouldn't it be better to fail here with the actual MySQLdb.Error, rather than > > with a meaningless TypeError during the test? > > The tests that require database access first check that self.db is truthy, if > not they skip. This way we can still run the other tests without a test database > set up. How about using a getter for the db property? @property def db(self): if (not self._db) self._db = db.connect() self._clear_database() return self._db That way tests that don't need the db won't even try to connect and will still run if there is a problem with the db. And tests that require the db will fail with a meaningful exception. http://codereview.adblockplus.org/4615801646612480/diff/5713050069893120/site... File sitescripts/filterhits/web/submit.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5713050069893120/site... sitescripts/filterhits/web/submit.py:108: start_response("200 OK", response_headers) On 2015/04/02 07:47:47, kzar wrote: > On 2015/04/02 07:37:04, Sebastian Noack wrote: > > Again, the status should be "204 No Content". And then you shouldn't need a > mime > > type. > > What's your opinion on this Wladimir? As earlier discussed I would rather return > 200 in case clients (wrongly) check for that explicitly. Sebastian would rather > return 204 as it's a more accurate status code. The only client we have to bother about here is our browser extension, where we can easily make sure that it correctly handles the response status.
Patch Set 20 : Addressed further comments from Sebastian. (Ignore "change" to __init__.py files, the patchset is a little noisy as I rebased.) http://codereview.adblockplus.org/4615801646612480/diff/5713050069893120/site... File sitescripts/filterhits/test/test_helpers.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5713050069893120/site... sitescripts/filterhits/test/test_helpers.py:45: # Set up test config On 2015/04/02 08:11:44, Sebastian Noack wrote: > On 2015/04/02 07:47:47, kzar wrote: > > On 2015/04/02 07:37:04, Sebastian Noack wrote: > > > There is a lot of boilerplate and duplication in the code mocking and > > restoring > > > the config. How about this? > > > > > > def setUp(self): > > > self._backup = [] > > > > > > for option in ("dbuser", "dbpassword", "database"): > > > self._backup[option] = config.get("filterhitstats", option) > > > config.set("filterhitstats", option, config.get("filterhitstats", > "test_" > > + > > > option)) > > > > > > self._backup["log_dir"] = config.get("filterhitstats", "log_dir") > > > self._test_dir = tempfile.mkdtemp() > > > config.set("filterhitstats", "log_dir", self._test_dir) > > > > > > ... > > > > > > def tearDown(self): > > > ... > > > > > > for option, value in self._backup.iteritems(): > > > self._config.set("filterhitstats", option, value) > > > > I prefer it as it is, although it doesn't matter much either way. (I suppose > one > > advantage to my version is that the configuration values are read less often.) > > We don't have to care about performance here. Moreover, this is actually an > advantage. The test should backup the application's state before it runs, > restoring it when it's done. Retrieving the config already on module > intilization might cause problems if the config get's mocked by other parts of > the application. More a theoretical scenario here, since as soon as the module > is loaded the test runs, but still. However, the more important point here is > that duplication and unneeded boilerplate isn't great. Fair enough, well how about this? I've reduced duplication like you've suggested while keeping the style to my taste. http://codereview.adblockplus.org/4615801646612480/diff/5713050069893120/site... sitescripts/filterhits/test/test_helpers.py:56: self.db = None On 2015/04/02 08:11:44, Sebastian Noack wrote: > On 2015/04/02 07:47:47, kzar wrote: > > On 2015/04/02 07:37:04, Sebastian Noack wrote: > > > Wouldn't it be better to fail here with the actual MySQLdb.Error, rather > than > > > with a meaningless TypeError during the test? > > > > The tests that require database access first check that self.db is truthy, if > > not they skip. This way we can still run the other tests without a test > database > > set up. > > How about using a getter for the db property? > > @property > def db(self): > if (not self._db) > self._db = db.connect() > self._clear_database() > return self._db > > That way tests that don't need the db won't even try to connect and will still > run if there is a problem with the db. And tests that require the db will fail > with a meaningful exception. Done. http://codereview.adblockplus.org/4615801646612480/diff/5713050069893120/site... File sitescripts/filterhits/web/submit.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5713050069893120/site... sitescripts/filterhits/web/submit.py:108: start_response("200 OK", response_headers) On 2015/04/02 08:11:44, Sebastian Noack wrote: > On 2015/04/02 07:47:47, kzar wrote: > > On 2015/04/02 07:37:04, Sebastian Noack wrote: > > > Again, the status should be "204 No Content". And then you shouldn't need a > > mime > > > type. > > > > What's your opinion on this Wladimir? As earlier discussed I would rather > return > > 200 in case clients (wrongly) check for that explicitly. Sebastian would > rather > > return 204 as it's a more accurate status code. > > The only client we have to bother about here is our browser extension, where we > can easily make sure that it correctly handles the response status. Fine, Done.
LGTM
LGTM |