Issue 4615801646612480: Issue 395 - Filter hits statistics backend

kzar

Dec. 19, 2014, 1:27 p.m. (2014-12-19 13:27:40 UTC) #1

Sebastian Noack

http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/.hgignore File .hgignore (right): http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/.hgignore#newcode6 .hgignore:6: sitescripts/filterhits/test/temp While on it mind adding a .gitignore like ...

Feb. 11, 2015, 4 p.m. (2015-02-11 16:00:11 UTC) #2

Sebastian Noack

http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/sitescripts/filterhits/bin/process_logs.py File sitescripts/filterhits/bin/process_logs.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/sitescripts/filterhits/bin/process_logs.py#newcode33 sitescripts/filterhits/bin/process_logs.py:33: if f.endswith(".log") and f[0].isdigit(): Nit: Please use os.path.splitext() http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/sitescripts/filterhits/bin/process_logs.py#newcode46 ...

Feb. 11, 2015, 4:29 p.m. (2015-02-11 16:29:32 UTC) #3

kzar

http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/.hgignore File .hgignore (right): http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/.hgignore#newcode6 .hgignore:6: sitescripts/filterhits/test/temp On 2015/02/11 16:00:12, Sebastian Noack wrote: > While ...

Feb. 17, 2015, 10:52 a.m. (2015-02-17 10:52:23 UTC) #4

Sebastian Noack

http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/sitescripts/filterhits/db.py File sitescripts/filterhits/db.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/sitescripts/filterhits/db.py#newcode68 sitescripts/filterhits/db.py:68: if isinstance(sql, str): On 2015/02/11 16:00:12, Sebastian Noack wrote: ...

Feb. 17, 2015, 2:59 p.m. (2015-02-17 14:59:17 UTC) #5

Wladimir Palant

http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/sitescripts/filterhits/test/common_tests.py File sitescripts/filterhits/test/common_tests.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/sitescripts/filterhits/test/common_tests.py#newcode1 sitescripts/filterhits/test/common_tests.py:1: # coding: utf-8 On 2015/02/11 16:00:12, Sebastian Noack wrote: ...

Feb. 17, 2015, 3:12 p.m. (2015-02-17 15:12:20 UTC) #6

Sebastian Noack

http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/sitescripts/filterhits/common.py File sitescripts/filterhits/common.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/sitescripts/filterhits/common.py#newcode30 sitescripts/filterhits/common.py:30: now = time.gmtime() On 2015/02/17 15:12:21, Wladimir Palant wrote: ...

Feb. 17, 2015, 3:19 p.m. (2015-02-17 15:19:43 UTC) #7

Sebastian Noack

http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/sitescripts/filterhits/web/submit.py File sitescripts/filterhits/web/submit.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/sitescripts/filterhits/web/submit.py#newcode45 sitescripts/filterhits/web/submit.py:45: data = json.loads(data) data = {} try: data_length = ...

Feb. 17, 2015, 5:09 p.m. (2015-02-17 17:09:47 UTC) #8

Sebastian Noack

http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/sitescripts/filterhits/web/submit.py File sitescripts/filterhits/web/submit.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/sitescripts/filterhits/web/submit.py#newcode45 sitescripts/filterhits/web/submit.py:45: data = json.loads(data) Possibly you could also just use ...

Feb. 17, 2015, 5:16 p.m. (2015-02-17 17:16:57 UTC) #9

kzar

Feb. 24, 2015, 6:05 p.m. (2015-02-24 18:05:11 UTC) #10

http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site...
File sitescripts/filterhits/db.py (right):

http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site...
sitescripts/filterhits/db.py:68: if isinstance(sql, str):
On 2015/02/17 14:59:17, Sebastian Noack wrote:
> On 2015/02/11 16:00:12, Sebastian Noack wrote:
> > How about always expecting a sequence here, eliminating the need for a type
> > switch? If the calling code usually writes only one item you might want to go
> > with variable arguments, to eliminate boilerplate there.
> 
> You didn't address or reply to this comment. I still think we should
> just always expect an iterable object here, getting rid of this silly type
> switch.

It just makes the function nicer to use if you only want to write one piece of
SQL to the database and don't need to pass parameters.

`db.write("DELETE FROM BLAH")` instead of `db.write(("DELETE FROM BLAH"),)`.
Especially in the testing code it makes things a little cleaner.
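
The two designs under discussion can be sketched side by side. This is a minimal illustration, assuming each query is a tuple of the SQL followed by its parameters; both functions are hypothetical stand-ins for db.write() and return what would be executed instead of touching a database.

```python
def write_with_type_switch(queries):
    """Accept either a bare SQL string or an iterable of (sql, *params) tuples."""
    if isinstance(queries, str):
        # The type switch: wrap a lone string so the loop below can treat it uniformly.
        queries = ((queries,),)
    return [(query[0], tuple(query[1:])) for query in queries]


def write_varargs(*queries):
    """Always expect (sql, *params) tuples, passed as variable arguments."""
    return [(query[0], tuple(query[1:])) for query in queries]
```

With variable arguments, a single statement is written as `write_varargs(("DELETE FROM BLAH",))`, which keeps one code path at the cost of a one-item tuple at the call site.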

http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site...
File sitescripts/filterhits/test/common_tests.py (right):

http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site...
sitescripts/filterhits/test/common_tests.py:1: # coding: utf-8
On 2015/02/17 15:12:21, Wladimir Palant wrote:
> On 2015/02/11 16:00:12, Sebastian Noack wrote:
> > I wonder whether we should also have a test running the server, and actually
> > testing the /submit and /query HTTP APIs.
> 
> We don't have to run a real server, calling the WSGI handler directly would do.
> Actually sounds like a good idea here.

Yea, I agree a few tests for the API would be nice. I'll see what I can do.
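
Calling a WSGI handler directly needs only an environ dict and a start_response callable. The sketch below assumes a hypothetical `query_handler`; it is not the handler from this patch, and `call_handler` is an illustrative helper, written for Python 3.

```python
import json
from urllib.parse import parse_qs


def query_handler(environ, start_response):
    """Hypothetical handler that echoes the requested domain as JSON."""
    params = parse_qs(environ.get("QUERY_STRING", ""))
    body = json.dumps({"domain": params.get("domain", [None])[0]})
    start_response("200 OK", [("Content-Type", "application/json")])
    return [body.encode("utf-8")]


def call_handler(handler, query_string):
    """Invoke a WSGI handler in-process and return (status, body bytes)."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    environ = {"REQUEST_METHOD": "GET", "QUERY_STRING": query_string}
    chunks = handler(environ, start_response)
    return captured["status"], b"".join(chunks)


status, body = call_handler(query_handler, "domain=example.com")
```

No socket or server process is involved, so such a test stays fast and can run alongside the existing unit tests.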

http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site...
File sitescripts/filterhits/web/query.py (right):

http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site...
sitescripts/filterhits/web/query.py:46: db.connect(config.get("filterhitstats", "dbuser"),
On 2015/02/17 14:59:17, Sebastian Noack wrote:
> On 2015/02/11 16:00:12, Sebastian Noack wrote:
> > It seems this pattern is repeated. How about retrieving the user, password and
> > database inside the connect() method, or at least if it's not provided otherwise
> > (e.g. when testing)?
> 
> What about this comment?

I prefer to do it this way. During development I experimented with a few
approaches, but IMHO it's dangerous to set the database by default. All it takes
is one test to be written that forgets to pass the database name and you've
messed up your real database.
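
The trade-off can be made concrete with a small sketch. It assumes a hypothetical `CONFIG` dict in place of the sitescripts configuration and a hypothetical `connection_settings()` helper; neither is part of the patch.

```python
# Hypothetical configuration values; the real ones would come from the config file.
CONFIG = {"dbuser": "filterhits", "dbpassword": "secret", "database": "filterhits"}


def connection_settings(user=None, password=None, database=None):
    """Return the settings to connect with, falling back to CONFIG when omitted."""
    return {
        "user": user if user is not None else CONFIG["dbuser"],
        "password": password if password is not None else CONFIG["dbpassword"],
        "database": database if database is not None else CONFIG["database"],
    }


production = connection_settings()
testing = connection_settings(database="filterhits_test")
```

The fallback removes the repeated config lookups at each call site, while a test that forgets the `database` argument silently receives the configured database, which is the risk described above.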

http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site...
File sitescripts/filterhits/bin/process_logs.py (right):

http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site...
sitescripts/filterhits/bin/process_logs.py:46: while not (last == '"' and current == " "):
On 2015/02/17 14:59:17, Sebastian Noack wrote:
> If you use the != instead of the == operator in the first place, you don't need to
> negate the result:
> 
> last != '"' or current != ' '

I'm aware of De Morgan's law but I think the intention is much clearer when it's
written this way around.
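
Both spellings are equivalent by De Morgan's law, which a short exhaustive check over a few sample characters confirms. The function names are illustrative only.

```python
import itertools


def negated_form(last, current):
    # The condition as written in process_logs.py.
    return not (last == '"' and current == " ")


def expanded_form(last, current):
    # The reviewer's suggestion, with the negation pushed into the comparisons.
    return last != '"' or current != " "


samples = ['"', " ", "a"]
results = [
    negated_form(last, current) == expanded_form(last, current)
    for last, current in itertools.product(samples, repeat=2)
]
```

Since the two forms always agree, the choice between them is purely about readability.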

http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site...
File sitescripts/filterhits/db.py (right):

http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site...
sitescripts/filterhits/db.py:50: This writes a given SQL string or iterator of tuples containing SQL
On 2015/02/17 14:59:17, Sebastian Noack wrote:
> Iterators (in Python) are objects that implement the iterator protocol, i.e. the
> next() method. When using for-loops, Python internally calls iter(obj) to create
> an iterator, however the object passed to this function isn't an iterator
> itself, it's merely an iterable object.

Done.
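
The distinction is easy to check interactively. This sketch is written for Python 3, where the iterator method is spelled `__next__` (Python 2 used `next()`); the query tuples are made-up sample data.

```python
from collections.abc import Iterable, Iterator

queries = [("DELETE FROM t",), ("UPDATE t SET a = %s", 1)]  # a list is iterable

iterator = iter(queries)   # what a for-loop does internally
first = next(iterator)     # only the iterator supports next()
```

A docstring that says "iterable" therefore promises less than one that says "iterator": lists, tuples and generators all qualify.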

http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site...
sitescripts/filterhits/db.py:61: sql, params = query[0], query[1:]
On 2015/02/17 14:59:17, Sebastian Noack wrote:
> I'd prefer to use a two-dimensional tuple here and everywhere else like (sql,
> params) rather than prepending the sql to the params. This makes it easier to
> understand the data structure, is more efficient, and reduces complexity.

I like doing it this way for db.query and db.write personally as it reduces the
boilerplate when using the functions with zero or one parameters. I originally
did it in a way similar to how you suggested but found lots of calls where I was
passing empty or 1 item tuples - it seemed kind of ugly.
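
The two tuple layouts differ only in how a query is unpacked. In this sketch `split_flat` mirrors the `query[0], query[1:]` line under review, and `split_pair` is a hypothetical version of the suggested `(sql, params)` layout.

```python
def split_flat(query):
    """Flat layout: the SQL comes first, any parameters follow in the same tuple."""
    sql, params = query[0], query[1:]
    return sql, tuple(params)


def split_pair(query):
    """Nested layout: a 2-tuple of the SQL and a separate parameter tuple."""
    sql, params = query
    return sql, tuple(params)
```

The flat layout allows `("DELETE FROM t",)` for a statement without parameters, whereas the nested layout needs `("DELETE FROM t", ())`, which is the boilerplate mentioned above.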

http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site...
File sitescripts/filterhits/geometrical_mean.py (right):

http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site...
sitescripts/filterhits/geometrical_mean.py:59: return flatten(itertools.imap(lambda fields: update_query(interval, *fields),
On 2015/02/17 14:59:17, Sebastian Noack wrote:
> I feel that you are kinda overusing itertools. A generator seems to be more
> appropriate and easier to read here: 
> 
> for fields in filter_hits(data):
>   for query in update_query(interval, *fields):
>     yield query

Done.
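
For comparison, here are both styles in runnable form. The sketch is Python 3, so the lazy built-in `map` replaces `itertools.imap`, and `itertools.chain.from_iterable` stands in for the `flatten` helper; `filter_hits` and `update_query` are simplified, hypothetical versions of the functions named in the review.

```python
import itertools


def filter_hits(data):
    """Hypothetical: yield one (filter, hits) pair per entry."""
    for name, hits in sorted(data.items()):
        yield name, hits


def update_query(interval, name, hits):
    """Hypothetical: yield the queries needed to record one filter."""
    yield ("INSERT filter", name)
    yield ("UPDATE hits", name, hits, interval)


def queries_itertools(interval, data):
    return itertools.chain.from_iterable(
        map(lambda fields: update_query(interval, *fields), filter_hits(data)))


def queries_generator(interval, data):
    for fields in filter_hits(data):
        for query in update_query(interval, *fields):
            yield query
```

Both produce the same lazy sequence of queries; the generator version simply spells out the nesting.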

http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site...
File sitescripts/filterhits/web/query.py (right):

http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site...
sitescripts/filterhits/web/query.py:26: def query(domain=None, filter=None, skip=0, take=20, order_by="hits DESC", **_):
On 2015/02/17 14:59:17, Sebastian Noack wrote:
> Any reason why you silently ignore additional keyword arguments?

I do that because we're taking the parameters straight from the query string. We
want to ignore any passed parameters that we don't care about here.
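
A minimal sketch of that behaviour, assuming the query string is parsed into a flat dict first. The `query` function here is a simplified stand-in that only echoes its arguments, and the `utm_source` parameter is an invented example of something to be ignored.

```python
from urllib.parse import parse_qsl


def query(domain=None, filter=None, skip=0, take=20, **_):
    """Unknown keyword arguments are swallowed by **_ rather than raising TypeError."""
    return {"domain": domain, "filter": filter, "skip": int(skip), "take": int(take)}


params = dict(parse_qsl("domain=example.com&take=5&utm_source=newsletter"))
result = query(**params)
```

Without `**_`, the extra parameter would make the call fail with a TypeError, so any stray query-string key could break the endpoint.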

http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site...
sitescripts/filterhits/web/query.py:37: where_fields = [(s, "%" + p + "%") for s, p in (("domain", domain),
On 2015/02/17 14:59:17, Sebastian Noack wrote:
> It's best practice to use format strings when concatenating more than two strings.
> Format strings are generally more readable and also faster. Yes, I know that you
> have to escape the percent-sign then. ;)

I disagree that `"%%%s%%" % p` is easier to read than `"%" + p + "%"`.
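
For reference, the two spellings produce the same LIKE pattern. The `domain` value is an arbitrary example.

```python
domain = "example.com"

concatenated = "%" + domain + "%"   # the form used in query.py
formatted = "%%%s%%" % domain       # the format-string form, with each literal % written as %%
```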

http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site...
sitescripts/filterhits/web/query.py:38: ("filter", filter)) if p]
On 2015/02/17 14:59:17, Sebastian Noack wrote:
> Nit: Seems that the indentation is a little off here.

Done.

http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site...
sitescripts/filterhits/web/query.py:42: order_by, order = order_by.split()
On 2015/02/17 14:59:17, Sebastian Noack wrote:
> Strings aren't proper data structures. So how about turning the order_by
> argument into a tuple like ("hits", "DESC")? Or maybe even better, using two
> arguments, one for the sort column, and another one to indicate the sort
> direction.

Done.

http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site...
sitescripts/filterhits/web/query.py:43: order = order.upper() if order.upper() in ["ASC", "DESC"] else "ASC"
On 2015/02/17 14:59:17, Sebastian Noack wrote:
> Nit: When a sequence doesn't need to be modified use tuples instead of lists.

Done.
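
Taken together, the last two comments suggest something like the sketch below. It assumes two separate arguments and tuple whitelists; the column whitelist and the `order_clause` helper are illustrative additions, while the fallback to "ASC" follows the line under review.

```python
ALLOWED_COLUMNS = ("hits", "domain", "filter")  # hypothetical set of sortable columns
ALLOWED_DIRECTIONS = ("ASC", "DESC")


def order_clause(order_by="hits", order="DESC"):
    """Build an ORDER BY clause from whitelisted values only."""
    column = order_by if order_by in ALLOWED_COLUMNS else "hits"
    direction = order.upper() if order.upper() in ALLOWED_DIRECTIONS else "ASC"
    return "ORDER BY %s %s" % (column, direction)
```

Because both values are checked against fixed tuples, nothing from the query string reaches the SQL text unvalidated.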

http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site...
sitescripts/filterhits/web/query.py:65: if db_connection:
On 2015/02/17 14:59:17, Sebastian Noack wrote:
> This will result in a NameError if the connection attempt fails. You
> need the following pattern here:
> 
> try:
>   connection = db.connect(...)
>   try:
>     ...
>   finally:
>     connection.close()
> except MySQLdb.Error:
>   ...

Done.
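
The suggested pattern can be exercised without a database. In this sketch `FakeDBError` and `FakeConnection` are stand-ins for MySQLdb.Error and a real connection, and `run` returns a log of what happened so the control flow is visible.

```python
class FakeDBError(Exception):
    """Stands in for MySQLdb.Error so the sketch runs without a database."""


class FakeConnection(object):
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


def connect(fail=False):
    if fail:
        raise FakeDBError("cannot connect")
    return FakeConnection()


def run(fail_connect=False, fail_work=False):
    log = []
    try:
        connection = connect(fail=fail_connect)
        try:
            if fail_work:
                raise FakeDBError("query failed")
            log.append("worked")
        finally:
            # Only reached when connect() succeeded, so the name is always bound.
            connection.close()
            log.append("closed")
    except FakeDBError as error:
        log.append("error: %s" % error)
    return log
```

The inner try/finally closes the connection on every path where it exists, and the outer except also covers a failure in connect() itself.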

http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site...
File sitescripts/filterhits/web/submit.py (right):

http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site...
sitescripts/filterhits/web/submit.py:45: data = json.loads(data)
On 2015/02/17 17:16:57, Sebastian Noack wrote:
> Possibly you could also just use json.load():
> 
> try:
>   data = json.load(environ["wsgi.input"])
> except (ValueError, IOError):
>   return common.showError("Error while parsing JSON data.", start_response)

Done.
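
A minimal sketch of the json.load() approach, assuming Python 3.6 or later, where the json module accepts a binary stream. `parse_body` is a hypothetical helper and `io.BytesIO` stands in for `environ["wsgi.input"]`.

```python
import io
import json


def parse_body(stream):
    """Return (data, None) on success or (None, error message) on failure."""
    try:
        return json.load(stream), None
    except (ValueError, IOError):
        return None, "Error while parsing JSON data."


ok = parse_body(io.BytesIO(b'{"filters": {}}'))
bad = parse_body(io.BytesIO(b"{"))
```

Malformed JSON raises a ValueError subclass, so one except clause covers both parse errors and read errors.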

http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site...
sitescripts/filterhits/web/submit.py:77: if db_connection:
On 2015/02/17 14:59:17, Sebastian Noack wrote:
> This will result in a NameError if the connection attempt fails. You
> need the following pattern here:
> 
> try:
>   connection = db.connect(...)
>   try:
>     ...
>   finally:
>     connection.close()
> except MySQLdb.Error:
>   ...

Done.

Sebastian Noack

Feb. 26, 2015, 4:39 p.m. (2015-02-26 16:39:24 UTC) #11

http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site...
File sitescripts/filterhits/db.py (right):

http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site...
sitescripts/filterhits/db.py:68: if isinstance(sql, str):
On 2015/02/24 18:05:11, kzar wrote:
> On 2015/02/17 14:59:17, Sebastian Noack wrote:
> > On 2015/02/11 16:00:12, Sebastian Noack wrote:
> > > How about always expecting a sequence here, eliminating the need for a type
> > > switch? If the calling code usually writes only one item you might want to go
> > > with variable arguments, to eliminate boilerplate there.
> > 
> > You didn't address or reply to this comment. I still think we should
> > just always expect an iterable object here, getting rid of this silly type
> > switch.
> 
> It just makes the function nicer to use if you only want to write one piece of
> SQL to the database and don't need to pass parameters.
> 
> `db.write("DELETE FROM BLAH")` instead of `db.write(("DELETE FROM BLAH"),)`.
> Especially in the testing code it makes things a little cleaner.

There is only one call in your whole code where you pass a single string to the
write() method. This certainly doesn't justify keeping that logic.

http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site...
sitescripts/filterhits/db.py:80: if db:
On 2015/02/17 10:52:24, kzar wrote:
> On 2015/02/11 16:00:12, Sebastian Noack wrote:
> > How could a MySQLdb.Error be raised if db didn't exist?
> 
> I'm not sure but I seem to remember that this check prevents problems so I would
> rather leave it in here.

If a random change you can't explain fixes an issue you don't understand, this
isn't necessarily the correct or best way of doing it. So figure out what the
issue is here, and if this is indeed the correct way of handling it, add a comment
explaining why this non-obvious check is necessary.

Does the connection class maybe implement __nonzero__, returning False when the
connection was closed or rolled back after an error?
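
To illustrate the mechanism being asked about: a class can control its own truth value. Whether the MySQLdb connection class actually does this is not established in this thread; the `Connection` class below is purely hypothetical.

```python
class Connection(object):
    """Hypothetical connection whose truth value tracks whether it is open."""

    def __init__(self):
        self.open = True

    def close(self):
        self.open = False

    def __bool__(self):        # truth hook in Python 3
        return self.open

    __nonzero__ = __bool__     # the same hook under its Python 2 name


db = Connection()
truthy_while_open = bool(db)
db.close()
truthy_after_close = bool(db)
```

If a connection class behaved like this, `if db:` would skip the cleanup for a closed connection, which is the kind of non-obvious behaviour that would deserve a comment.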

http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site...
File sitescripts/filterhits/web/query.py (right):

http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site...
sitescripts/filterhits/web/query.py:46: db.connect(config.get("filterhitstats", "dbuser"),
On 2015/02/24 18:05:11, kzar wrote:
> On 2015/02/17 14:59:17, Sebastian Noack wrote:
> > On 2015/02/11 16:00:12, Sebastian Noack wrote:
> > > It seems this pattern is repeated. How about retrieving the user, password and
> > > database inside the connect() method, or at least if it's not provided otherwise
> > > (e.g. when testing)?
> > 
> > What about this comment?
> 
> I prefer to do it this way. During development I experimented with a few
> approaches, but IMHO it's dangerous to set the database by default. All it takes
> is one test to be written that forgets to pass the database name and you've
> messed up your real database.

That doesn't make any sense. The database and credentials come from the
configuration anyway. I'm merely criticizing that you duplicate the code
retrieving those settings.

http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site...
File sitescripts/filterhits/bin/process_logs.py (right):

http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site...
sitescripts/filterhits/bin/process_logs.py:46: while not (last == '"' and current == " "):
On 2015/02/24 18:05:11, kzar wrote:
> On 2015/02/17 14:59:17, Sebastian Noack wrote:
> > If you use the != instead of the == operator in the first place, you don't need to
> > negate the result:
> > 
> > last != '"' or current != ' '
> 
> I'm aware of De Morgan's law but I think the intention is much clearer when it's
> written this way around.

I'd rather say that the more logical operations are involved, the harder it is to
understand the logic.

http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site...
File sitescripts/filterhits/db.py (right):

http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site...
sitescripts/filterhits/db.py:61: sql, params = query[0], query[1:]
On 2015/02/24 18:05:11, kzar wrote:
> On 2015/02/17 14:59:17, Sebastian Noack wrote:
> > I'd prefer to use a two-dimensional tuple here and everywhere else like (sql,
> > params) rather than prepending the sql to the params. This makes it easier to
> > understand the data structure, is more efficient, and reduces complexity.
> 
> I like doing it this way for db.query and db.write personally as it reduces the
> boilerplate when using the functions with zero or one parameters. I originally
> did it in a way similar to how you suggested but found lots of calls where I was
> passing empty or 1 item tuples - it seemed kind of ugly.

I don't agree. So I am leaving it up to Wladimir.

http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site...
File sitescripts/filterhits/web/query.py (right):

http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site...
sitescripts/filterhits/web/query.py:26: def query(domain=None, filter=None, skip=0, take=20, order_by="hits DESC", **_):
On 2015/02/24 18:05:11, kzar wrote:
> On 2015/02/17 14:59:17, Sebastian Noack wrote:
> > Any reason why you silently ignore additional keyword arguments?
> 
> I do that because we're taking the parameters straight from the query string. We
> want to ignore any passed parameters that we don't care about here.

I see.

http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site...
sitescripts/filterhits/web/query.py:37: where_fields = [(s, "%" + p + "%") for s, p in (("domain", domain),
On 2015/02/24 18:05:11, kzar wrote:
> On 2015/02/17 14:59:17, Sebastian Noack wrote:
> > It's best practice to use format strings when concatenating more than two strings.
> > Format strings are generally more readable and also faster. Yes, I know that you
> > have to escape the percent-sign then. ;)
> 
> I disagree that `"%%%s%%" % p` is easier to read than `"%" + p + "%"`.

Fair enough.

kzar

Feb. 28, 2015, 7:39 p.m. (2015-02-28 19:39:56 UTC) #12

Added API tests, addressed comments and some other improvements.

http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site...
File sitescripts/filterhits/db.py (right):

http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site...
sitescripts/filterhits/db.py:68: if isinstance(sql, str):
On 2015/02/26 16:39:25, Sebastian Noack wrote:
> On 2015/02/24 18:05:11, kzar wrote:
> > On 2015/02/17 14:59:17, Sebastian Noack wrote:
> > > On 2015/02/11 16:00:12, Sebastian Noack wrote:
> > > > How about always expecting a sequence here, eliminating the need for a type
> > > > switch? If the calling code usually writes only one item you might want to go
> > > > with variable arguments, to eliminate boilerplate there.
> > > 
> > > You didn't address or reply to this comment. I still think we should
> > > just always expect an iterable object here, getting rid of this silly type
> > > switch.
> > 
> > It just makes the function nicer to use if you only want to write one piece of
> > SQL to the database and don't need to pass parameters.
> > 
> > `db.write("DELETE FROM BLAH")` instead of `db.write(("DELETE FROM BLAH"),)`.
> > Especially in the testing code it makes things a little cleaner.
> 
> There is only one call in your whole code where you pass a single string to the
> write() method. This certainly doesn't justify keeping that logic.

Fair enough, I didn't realise that.
Done.

http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site...
sitescripts/filterhits/db.py:80: if db:
On 2015/02/26 16:39:25, Sebastian Noack wrote:
> On 2015/02/17 10:52:24, kzar wrote:
> > On 2015/02/11 16:00:12, Sebastian Noack wrote:
> > > How could a MySQLdb.Error be raised if db didn't exist?
> > 
> > I'm not sure but I seem to remember that this check prevents problems so I would
> > rather leave it in here.
> 
> If a random change you can't explain fixes an issue you don't understand, this
> isn't necessarily the correct or best way of doing it. So figure out what the
> issue is here, and if this is indeed the correct way of handling it, add a comment
> explaining why this non-obvious check is necessary.
> 
> Does the connection class maybe implement __nonzero__, returning False when the
> connection was closed or rolled back after an error?

Well, I agree with you of course, but I guess the trouble is that when there is a gap
of months between writing code and getting it reviewed, it's hard to remember the
finer details.

I've removed it now and I can't see any problems. The tests still pass, I can
still submit data and I can still query it out again.

http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site...
File sitescripts/filterhits/test/common_tests.py (right):

http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site...
sitescripts/filterhits/test/common_tests.py:1: # coding: utf-8
On 2015/02/24 18:05:11, kzar wrote:
> On 2015/02/17 15:12:21, Wladimir Palant wrote:
> > On 2015/02/11 16:00:12, Sebastian Noack wrote:
> > > I wonder whether we should also have a test running the server, and actually
> > > testing the /submit and /query HTTP APIs.
> > 
> > We don't have to run a real server, calling the WSGI handler directly would do.
> > Actually sounds like a good idea here.
> 
> Yea, I agree a few tests for the API would be nice. I'll see what I can do.

OK, I've written some now.

http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site...
File sitescripts/filterhits/web/query.py (right):

http://codereview.adblockplus.org/4615801646612480/diff/5629499534213120/site...
sitescripts/filterhits/web/query.py:46: db.connect(config.get("filterhitstats", "dbuser"),
On 2015/02/26 16:39:25, Sebastian Noack wrote:
> On 2015/02/24 18:05:11, kzar wrote:
> > On 2015/02/17 14:59:17, Sebastian Noack wrote:
> > > On 2015/02/11 16:00:12, Sebastian Noack wrote:
> > > > It seems this pattern is repeated. How about retrieving the user, password and
> > > > database inside the connect() method, or at least if it's not provided otherwise
> > > > (e.g. when testing)?
> > > 
> > > What about this comment?
> > 
> > I prefer to do it this way. During development I experimented with a few
> > approaches, but IMHO it's dangerous to set the database by default. All it takes
> > is one test to be written that forgets to pass the database name and you've
> > messed up your real database.
> 
> That doesn't make any sense. The database and credentials come from the
> configuration anyway. I'm merely criticizing that you duplicate the code
> retrieving those settings.

OK, I changed my mind here and I've now changed it.
Done.

http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site...
File sitescripts/filterhits/bin/process_logs.py (right):

http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/site...
sitescripts/filterhits/bin/process_logs.py:46: while not (last == '"' and current == " "):
On 2015/02/26 16:39:25, Sebastian Noack wrote:
> On 2015/02/24 18:05:11, kzar wrote:
> > On 2015/02/17 14:59:17, Sebastian Noack wrote:
> > > If you use the != instead of the == operator in the first place, you don't need to
> > > negate the result:
> > > 
> > > last != '"' or current != ' '
> > 
> > I'm aware of De Morgan's law but I think the intention is much clearer when it's
> > written this way around.
> 
> I'd rather say that the more logical operations are involved, the harder it is to
> understand the logic.

Well, what we're saying is something like "While the last two characters aren't
'" ' then do this...". I think `not (last == '"' and current == " ")` is closer
to that than `last != '"' or current != ' '` but I guess it's subjective.

How about this? It's even closer to the English version and reduces the logical
operations.

Sebastian Noack

http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/sitescripts/filterhits/bin/process_logs.py File sitescripts/filterhits/bin/process_logs.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5673385510043648/sitescripts/filterhits/bin/process_logs.py#newcode46 sitescripts/filterhits/bin/process_logs.py:46: while not (last == '"' and current == " ...

March 2, 2015, 10:04 a.m. (2015-03-02 10:04:01 UTC) #13

kzar

Patch Set 5 : Addressed more comments. http://codereview.adblockplus.org/4615801646612480/diff/5768755258851328/sitescripts/filterhits/bin/process_logs.py File sitescripts/filterhits/bin/process_logs.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5768755258851328/sitescripts/filterhits/bin/process_logs.py#newcode20 sitescripts/filterhits/bin/process_logs.py:20: import sitescripts.filterhits.common ...

March 2, 2015, 10:39 a.m. (2015-03-02 10:39:03 UTC) #14

Sebastian Noack

http://codereview.adblockplus.org/4615801646612480/diff/5768755258851328/sitescripts/filterhits/bin/process_logs.py File sitescripts/filterhits/bin/process_logs.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5768755258851328/sitescripts/filterhits/bin/process_logs.py#newcode89 sitescripts/filterhits/bin/process_logs.py:89: if db_connection: On 2015/03/02 10:39:03, kzar wrote: > On ...

March 2, 2015, 10:56 a.m. (2015-03-02 10:56:35 UTC) #15

kzar

On 2015/03/02 10:56:35, Sebastian Noack wrote: > http://codereview.adblockplus.org/4615801646612480/diff/5768755258851328/sitescripts/filterhits/bin/process_logs.py > File sitescripts/filterhits/bin/process_logs.py (right): > > http://codereview.adblockplus.org/4615801646612480/diff/5768755258851328/sitescripts/filterhits/bin/process_logs.py#newcode89 ...

March 2, 2015, 10:58 a.m. (2015-03-02 10:58:40 UTC) #16

kzar

Sorry, I hit the wrong reply button there. Ignore the last email. http://codereview.adblockplus.org/4615801646612480/diff/5768755258851328/sitescripts/filterhits/bin/process_logs.py File sitescripts/filterhits/bin/process_logs.py ...

March 2, 2015, 11 a.m. (2015-03-02 11:00:52 UTC) #17

Sebastian Noack

http://codereview.adblockplus.org/4615801646612480/diff/5768755258851328/sitescripts/filterhits/bin/process_logs.py File sitescripts/filterhits/bin/process_logs.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5768755258851328/sitescripts/filterhits/bin/process_logs.py#newcode89 sitescripts/filterhits/bin/process_logs.py:89: if db_connection: On 2015/03/02 11:00:52, kzar wrote: > On ...

March 2, 2015, 11:06 a.m. (2015-03-02 11:06:05 UTC) #18

kzar

March 2, 2015, 11:18 a.m. (2015-03-02 11:18:50 UTC) #19

Patch Set 6 : Display friendly message if processing script can't connect to DB.

http://codereview.adblockplus.org/4615801646612480/diff/5768755258851328/site...
File sitescripts/filterhits/bin/process_logs.py (right):

http://codereview.adblockplus.org/4615801646612480/diff/5768755258851328/site...
sitescripts/filterhits/bin/process_logs.py:89: if db_connection:
On 2015/03/02 11:06:05, Sebastian Noack wrote:
> On 2015/03/02 11:00:52, kzar wrote:
> > On 2015/03/02 10:56:36, Sebastian Noack wrote:
> > > On 2015/03/02 10:39:03, kzar wrote:
> > > > On 2015/03/02 10:04:01, Sebastian Noack wrote:
> > > > > Again, if db.connect() fails, this results in a NameError. You need the following
> > > > > logic:
> > > > > 
> > > > > db_connection = db.connect()
> > > > > try:
> > > > >   try:
> > > > >     ...
> > > > >   except:
> > > > >     ...
> > > > > finally:
> > > > >   db_connection.close()
> > > > >   
> > > > 
> > > > Done.
> > > 
> > > Sorry, I meant:
> > > 
> > > try:
> > >   db_connection = db.connect()
> > >   try:
> > >     ...
> > >   finally:
> > >     db_connection.close()
> > > except:
> > >   ...
> > > 
> > > Otherwise, you ignore MySQL errors that occur while connecting.
> > 
> > Yea I realised that but I figure it's OK. This script is called from
> > the command line to process logs, if the database connection fails I
> > think the script probably should raise an exception.
> 
> The exception would be printed as well if handled by the code above. With your
> line of arguing there would be no reason to handle any exceptions here. Please
> make it consistent.

Well, my line of reasoning was that if the database is down and we can't connect,
any message that makes it clear that the database is down is sufficient -
including the exception itself.

With the processing exceptions however this is not the case because the user
needs to know which log file caused the problem - not just that a MySQL
exception / KeyError was thrown.

Either way it's not a big deal, I now catch the MySQL exception on connecting
and display a friendlier message instead.
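
A rough sketch of that arrangement follows. `FakeDBError` stands in for MySQLdb.Error, `process_logs` is a hypothetical function that returns its messages instead of printing them, and continuing with the next file after a processing error is an assumption, not something stated in the thread.

```python
class FakeDBError(Exception):
    """Stands in for MySQLdb.Error so the sketch runs without a database."""


class FakeConnection(object):
    closed = False

    def close(self):
        self.closed = True


def process_logs(log_files, connect, process):
    try:
        connection = connect()
    except FakeDBError as error:
        # Connection failures get one friendly message; no log file is involved yet.
        return ["Could not connect to the database: %s" % error]
    messages = []
    try:
        for log_file in log_files:
            try:
                process(connection, log_file)
            except (FakeDBError, KeyError) as error:
                # Processing failures name the log file that caused them.
                messages.append("Problem processing %s: %r" % (log_file, error))
    finally:
        connection.close()
    return messages
```

The connection error and the per-file errors are handled at different levels, which keeps each message specific to what actually went wrong.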

Sebastian Noack

LGTM. But I suppose Wladimir wants to have a closer look himself before I merge ...

March 2, 2015, 11:27 a.m. (2015-03-02 11:27:52 UTC) #20

kzar

Patch Set 8 : Simplified geometrical_mean code and reduced filter inserts.

March 16, 2015, 4:25 p.m. (2015-03-16 16:25:41 UTC) #22

Wladimir Palant

This is only a partial review, I didn't get to reviewing the web handlers yet. ...

March 26, 2015, 10:56 p.m. (2015-03-26 22:56:50 UTC) #23

kzar

March 27, 2015, 11:59 a.m. (2015-03-27 11:59:57 UTC) #24

Patch Set 9 : Addressed some of Wladimir's comments

http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/.sit...
File .sitescripts.example (right):

http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/.sit...
.sitescripts.example:178: dbpasswrd=password
On 2015/03/26 22:56:50, Wladimir Palant wrote:
> This should be dbpassword.

Done.

http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site...
File sitescripts/filterhits/bin/process_logs.py (right):

http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site...
sitescripts/filterhits/bin/process_logs.py:18: import MySQLdb, itertools, json,
os, sys
On 2015/03/26 22:56:50, Wladimir Palant wrote:
> Nit: PEP 8 requires one import per line and a blank line between groups of
> imports (https://www.python.org/dev/peps/pep-0008/#imports). Same in the other
> files of course. Yes, I know that most existing scripts aren't following these
> requirements yet.

Done.

http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site...
sitescripts/filterhits/bin/process_logs.py:20: from sitescripts.filterhits
import common, db, geometrical_mean
On 2015/03/26 22:56:50, Wladimir Palant wrote:
> common isn't actually used here.

Done.

http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site...
sitescripts/filterhits/bin/process_logs.py:22: last_log_file = None
On 2015/03/26 22:56:50, Wladimir Palant wrote:
> Nit: IMHO this isn't meant to be exported, call it _last_log_file?

Done.

http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site...
sitescripts/filterhits/bin/process_logs.py:31: if os.path.splitext(f)[1] ==
".log" and f[0].isdigit():
On 2015/03/26 22:56:50, Wladimir Palant wrote:
> Why the requirement that the first letter is a digit? You don't have that
> requirement when validating command line parameters. IMHO it serves no
> purpose, other than breaking this code should you change the file name format
> elsewhere.

That is a simple way to skip over other log files in the directory, namely
processing-errors.log. Not strictly necessary but it makes the tool easier to
use in practice.
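
For illustration, here is a minimal sketch of the filter being discussed. The
function name find_log_files and the example file names are assumptions made
for this sketch, not the code under review; only the splitext/isdigit check
comes from the patch.

```python
import os


def find_log_files(directory):
    """Yield paths of *.log files whose file name starts with a digit."""
    for root, _dirs, files in os.walk(directory):
        for name in sorted(files):
            # The extension check keeps only log files; the digit check skips
            # other logs in the same tree, such as processing-errors.log.
            if os.path.splitext(name)[1] == ".log" and name[0].isdigit():
                yield os.path.join(root, name)
```

The digit check is a naming convention rather than a real validation, which is
the trade-off raised in the comment above.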

http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site...
sitescripts/filterhits/bin/process_logs.py:47: sys.exit("Unexpected EOF in log
file %s" % log_file)
On 2015/03/26 22:56:50, Wladimir Palant wrote:
> Wouldn't that be a lot simpler if the date and the GET parameters were stored
> in the first line of the file and the JSON data in the second? You would
> merely need to read a line then...

Sure, I suppose this logic would be simpler that way. Originally it was done
this way as each log file contained many entries, one per line and I didn't
re-consider the format since we changed to having one entry per file.

Thing is, this logic works well and has been well tested now, so I would prefer
to leave it alone. Otherwise I'll need to update the format for the test data
that we've generated, the unit tests and the supporting tools for Kirill, and
test that it all still works.
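
A rough sketch of the two-line format suggested above, assuming one entry per
file with the metadata on the first line and the JSON on the second. The
function names and the exact header layout are assumptions for illustration,
loosely modelled on the "[date] "query"" prefix used by the submit handler.

```python
import json


def write_entry(f, timestamp, query_string, data):
    # Line 1: metadata. Line 2: the JSON payload (json.dumps escapes newlines,
    # so the payload always fits on one line).
    f.write('[%s] "%s"\n' % (timestamp, query_string))
    f.write(json.dumps(data) + "\n")


def read_entry(f):
    header = f.readline().rstrip("\n")
    payload = f.readline()
    if not header or not payload:
        raise ValueError("Unexpected EOF in log entry")
    # Split '[timestamp] "query"' at the first "] " separator.
    timestamp, _, quoted = header[1:].partition("] ")
    return timestamp, quoted[1:-1], json.loads(payload)
```

With this layout the reader only needs two readline() calls, so there is no
need to scan for the end of the metadata.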

http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site...
sitescripts/filterhits/bin/process_logs.py:78: sys.exit("Failed to connect to
the MySQL database.")
On 2015/03/26 22:56:50, Wladimir Palant wrote:
> I have the impression that the original exception will be more informative
> here, why catch it?

This was Sebastian's idea, not mine. (See the inline discussion around here in
Patch set 5.)

http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site...
File sitescripts/filterhits/common.py (right):

http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site...
sitescripts/filterhits/common.py:25: def log_filterhits(data, basepath,
query_string):
On 2015/03/26 22:56:50, Wladimir Palant wrote:
> This isn't a common function, it is only being used by the submit handler and
> should live there.

Done.

http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site...
sitescripts/filterhits/common.py:47: query_string, json.dumps(data)))
On 2015/03/26 22:56:50, Wladimir Palant wrote:
> Nit: you can use |print >>f, ...| - then you won't need to specify the
> trailing newline explicitly.

Well, apparently that's "unpythonic". I don't mind either way, but are you sure?
http://stackoverflow.com/questions/8598228/f-write-vs-print-f

http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site...
File sitescripts/filterhits/db.py (right):

http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site...
sitescripts/filterhits/db.py:28: db=config.get("filterhitstats", "test_database"
if testing else "database"),
On 2015/03/26 22:56:50, Wladimir Palant wrote:
> When would that flag ever be used? This seems to imply that we would test on
> the production server - that sounds like a bad idea, with flag or without. We
> have VMs for that, using production configuration.

Well, originally I was passing the parameters through to db.connect each time so
that I could pass one database name for tests and another elsewhere. It led to
a lot of duplication though, so Sebastian suggested I change that. (Rightly so,
I think.)

This flag is a little ugly, it's true, but it is very useful for the unit tests.
It allows us to have api tests that don't trash the real data. (Logged and in
the database.) It also allows us to test the DB code and geometrical mean code
which performs calculations in the DB without trashing real data.

I personally think it's a "foot-gun" waiting to happen if we have the unit tests
completely trash all the logged and aggregated data and this is the best way
around that I could see.

(It's worth noting as well that I _have_ been testing and developing all this on
a VM but even then having the database and logged data trashed when running unit
tests would be a real pain. It would make manually testing everything much
harder.)

http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site...
sitescripts/filterhits/db.py:38: if kwargs.pop('dict_result', False):
On 2015/03/26 22:56:50, Wladimir Palant wrote:
> Why specify **kwargs if only one keyword parameter is ever used? Shouldn't
> this function be declared as |def query(db, sql, dict_result=False, *params)|?

I tried this but unfortunately it doesn't work. You end up with "TypeError:
query() got multiple values for keyword argument 'dict_result'".

http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site...
File sitescripts/filterhits/geometrical_mean.py (right):

http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site...
sitescripts/filterhits/geometrical_mean.py:33: yield ("""INSERT INTO
`geometrical_mean`
On 2015/03/26 22:56:50, Wladimir Palant wrote:
> geometrical_mean is merely the aggregation approach we are using right now. A
> table name should describe the meaning of its contents, not the
> implementation. Call that table "frequency"?

(I chose the plural to keep consistent with the other table name "filters".)
Done.

Sebastian Noack


March 27, 2015, 1:12 p.m. (2015-03-27 13:12:19 UTC) #25

http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site...
File sitescripts/filterhits/bin/process_logs.py (right):

http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site...
sitescripts/filterhits/bin/process_logs.py:18: import MySQLdb, itertools, json,
os, sys
On 2015/03/27 11:59:57, kzar wrote:
> On 2015/03/26 22:56:50, Wladimir Palant wrote:
> > Nit: PEP 8 requires one import per line and a blank line between groups of
> > imports (https://www.python.org/dev/peps/pep-0008/#imports). Same in the
> > other files of course. Yes, I know that most existing scripts aren't
> > following these requirements yet.
> 
> Done.

While we are at it, please also add a blank line between the standard library
imports and our own module imports.
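
As a sketch of the layout being asked for (the grouping follows PEP 8; which
group each module belongs to is my reading of the import lines quoted above,
and the non-stdlib imports are commented out so the snippet stands alone):

```python
# Standard library imports, one per line.
import itertools
import json
import os
import sys

# Third-party imports form their own group, e.g.:
# import MySQLdb

# Imports from our own modules come last, after a blank line, e.g.:
# from sitescripts.filterhits import db
```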

http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site...
sitescripts/filterhits/bin/process_logs.py:47: sys.exit("Unexpected EOF in log
file %s" % log_file)
On 2015/03/27 11:59:57, kzar wrote:
> On 2015/03/26 22:56:50, Wladimir Palant wrote:
> > Wouldn't that be a lot simpler if the date and the GET parameters were
> > stored in the first line of the file and the JSON data in the second? You
> > would merely need to read a line then...
> 
> Sure, I suppose this logic would be simpler that way. Originally it was done
> this way as each log file contained many entries, one per line and I didn't
> re-consider the format since we changed to having one entry per file.
> 
> Thing is this logic works well and has been well tested now I would prefer to
> leave it alone. Otherwise I'll need to update the format for test data that
> we've generated, the unit tests, the supporting tools for Kirill and test it
> all still works.

I agree with Wladimir FWIW. I also wouldn't consider any code tested, if it's
neither used by real users nor passed QA. But you make a point, if we are going
to change it, we should do it now. ;)

http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site...
sitescripts/filterhits/bin/process_logs.py:78: sys.exit("Failed to connect to
the MySQL database.")
On 2015/03/27 11:59:57, kzar wrote:
> On 2015/03/26 22:56:50, Wladimir Palant wrote:
> > I have the impression that the original exception will be more informative
> > here, why catch it?
> 
> This was Sebastian's idea, not mine. (See the inline discussion around here in
> Patch set 5.)

There are no comments on patchset 5. You probably mean this discussion:
http://codereview.adblockplus.org/4615801646612480/patch/5768755258851328/619...

I didn't suggest printing this string instead of the original exception here.
That was your decision. I merely pointed out that you have to open the
try/finally block closing the connection after you actually create the
connection.
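
To spell out the ordering point, here is a minimal sketch. It uses sqlite3 from
the standard library purely as a stand-in for MySQLdb, and the helper name run
is an assumption for illustration.

```python
import sqlite3


def run(connect, work):
    # Connect *before* entering the try block: if connecting raises, there is
    # no connection to close and the original exception propagates untouched.
    db = connect()
    try:
        return work(db)
    finally:
        # Runs whether work() returned normally or raised.
        db.close()


def select_one(db):
    return db.execute("SELECT 1").fetchone()[0]


result = run(lambda: sqlite3.connect(":memory:"), select_one)
```

If the connect call were inside the try block, a failed connection would reach
the finally clause with db unbound and mask the real error with a NameError.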

http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site...
File sitescripts/filterhits/common.py (right):

http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site...
sitescripts/filterhits/common.py:47: query_string, json.dumps(data)))
On 2015/03/27 11:59:57, kzar wrote:
> On 2015/03/26 22:56:50, Wladimir Palant wrote:
> > Nit: you can use |print >>f, ...| - then you won't need to specify the
> > trailing newline explicitly.
> 
> Well apparently that's "unpythonic" I don't mind either way, but are you sure?
> http://stackoverflow.com/questions/8598228/f-write-vs-print-f

I don't agree. I agree though that >> syntax is kinda weird. That is why the
print statement was replaced with a function in Python 3. However, the way
Wladimir suggested is the way to do it in Python 2.

Note that |file.write("foo")| isn't completely the same as |print >>file,
"foo"|. Among other things, the print statement (or function in Python 3)
considers the terminal encoding when processing unicode objects and
automatically picks the correct newline sequence depending on the OS.
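
A small sketch of the newline difference only, written with the Python 3
spelling |print(..., file=f)|; in Python 2 the equivalent is |print >>f, line|.
The sample line is made up for illustration.

```python
import io

line = '[27/Mar/2015:11:59:57] "version=1" {}'

written = io.StringIO()
written.write(line + "\n")   # write() needs the newline added by hand

printed = io.StringIO()
print(line, file=printed)    # print() appends the newline itself
```

Both buffers end up with the same content here; the sketch does not exercise
the encoding behaviour mentioned above.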

http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site...
File sitescripts/filterhits/db.py (right):

http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site...
sitescripts/filterhits/db.py:28: db=config.get("filterhitstats", "test_database"
if testing else "database"),
On 2015/03/27 11:59:57, kzar wrote:
> On 2015/03/26 22:56:50, Wladimir Palant wrote:
> > When would that flag ever be used? This seems to imply that we would test on
> > the production server - that sounds like a bad idea, with flag or without.
> > We have VMs for that, using production configuration.
> 
> Well originally I was passing the parameters through to db.connect each time
> so that I could pass one database name for tests and another elsewhere. It
> led to a lot of duplication though so Sebastian suggested I change that.
> (Rightly so I think.)
> 
> This flag is a little ugly, it's true, but it is very useful for the unit
> tests. It allows us to have api tests that don't trash the real data. (Logged
> and in the database.) It also allows us to test the DB code and geometrical
> mean code which performs calculations in the DB without trashing real data.
> 
> I personally think it's a "foot-gun" waiting to happen if we have the unit
> tests completely trash all the logged and aggregated data and this is the best
> way around that I could see.
> 
> (It's worth noting as well that I _have_ been testing and developing all this
> on a VM but even then having the database and logged data trashed when running
> unit tests would be a real pain. It would make manually testing everything
> much harder.)

I agree with Wladimir. The footgun here would be testing on the production
system, which you shouldn't do anyway. If you are bothered about destroying your
test data, you should just back it up.

http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site...
sitescripts/filterhits/db.py:38: if kwargs.pop('dict_result', False):
On 2015/03/27 11:59:57, kzar wrote:
> On 2015/03/26 22:56:50, Wladimir Palant wrote:
> > Why specify **kwargs if only one keyword parameter is ever used? Shouldn't
> > this function be declared as |def query(db, sql, dict_result=False, *params)|?
> 
> I tried this but unfortunately it doesn't work. You end up with "TypeError:
> query() got multiple values for keyword argument 'dict_result'".

Yep, that is because the first param given will be interpreted as the
dict_result argument. There will be keyword-only arguments in Python 3. But
with Python 2 you have to do it the way you did.
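
To make the failure concrete, a sketch of the three signatures. The function
bodies are placeholders that only return what they received, and the names are
made up for illustration; the keyword-only form is Python 3 syntax.

```python
def query_positional_default(db, sql, dict_result=False, *params):
    # The first extra positional argument lands in dict_result, not params.
    return dict_result, params


def query_kwargs(db, sql, *params, **kwargs):
    # The approach used in the patch: works in Python 2 and 3.
    dict_result = kwargs.pop("dict_result", False)
    return dict_result, params


def query_keyword_only(db, sql, *params, dict_result=False):
    # Python 3 only: dict_result can never be filled positionally.
    return dict_result, params
```

Calling the first variant as |query_positional_default(db, sql, 42,
dict_result=True)| raises the TypeError about multiple values, because 42 has
already been bound to dict_result.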

http://codereview.adblockplus.org/4615801646612480/diff/5105437288431616/site...
File sitescripts/filterhits/web/submit.py (right):

http://codereview.adblockplus.org/4615801646612480/diff/5105437288431616/site...
sitescripts/filterhits/web/submit.py:51: f.write("[%s] \"%s\" %s\n" %
(time.strftime('%d/%b/%Y:%H:%M:%S', now),
Same here, I'd rather go with the print statement.

Sebastian Noack


March 27, 2015, 2:26 p.m. (2015-03-27 14:26:06 UTC) #26

http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site...
File sitescripts/filterhits/db.py (right):

http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site...
sitescripts/filterhits/db.py:28: db=config.get("filterhitstats", "test_database"
if testing else "database"),
On 2015/03/27 13:12:19, Sebastian Noack wrote:
> On 2015/03/27 11:59:57, kzar wrote:
> > On 2015/03/26 22:56:50, Wladimir Palant wrote:
> > > When would that flag ever be used? This seems to imply that we would test
> > > on the production server - that sounds like a bad idea, with flag or
> > > without. We have VMs for that, using production configuration.
> > 
> > Well originally I was passing the parameters through to db.connect each time
> > so that I could pass one database name for tests and another elsewhere. It
> > led to a lot of duplication though so Sebastian suggested I change that.
> > (Rightly so I think.)
> > 
> > This flag is a little ugly, it's true, but it is very useful for the unit
> > tests. It allows us to have api tests that don't trash the real data.
> > (Logged and in the database.) It also allows us to test the DB code and
> > geometrical mean code which performs calculations in the DB without trashing
> > real data.
> > 
> > I personally think it's a "foot-gun" waiting to happen if we have the unit
> > tests completely trash all the logged and aggregated data and this is the
> > best way around that I could see.
> > 
> > (It's worth noting as well that I _have_ been testing and developing all
> > this on a VM but even then having the database and logged data trashed when
> > running unit tests would be a real pain. It would make manually testing
> > everything much harder.)
> 
> I agree with Wladimir. The footgun here would be testing on the production
> system, which you shouldn't do anyway. If you are bothered about destroying
> your test data, you should just back it up.

Regarding the unit tests, I think it would be a better approach to configure the
database settings used to run the tests separately. So the unit tests will just
fail to run in production where no test database is configured. This is quite a
common approach for testing web apps.
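
Something along these lines, as a rough sketch. The section name
"filterhitstats_test" and the helper name are assumptions for illustration, and
configparser is the Python 3 name of the module (ConfigParser in Python 2).

```python
import configparser


def get_test_database(config):
    # Hypothetical section that would only exist on development machines.
    if not config.has_section("filterhitstats_test"):
        raise RuntimeError("No test database configured, refusing to run tests")
    return config.get("filterhitstats_test", "database")
```

On a production box the section would simply be absent, so the test suite
aborts before it can touch any real data.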

kzar


March 27, 2015, 3:10 p.m. (2015-03-27 15:10:50 UTC) #27

Patch Set 10 : Addressed Sebastian's and Wladimir's comments.

http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site...
File sitescripts/filterhits/bin/process_logs.py (right):

http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site...
sitescripts/filterhits/bin/process_logs.py:18: import MySQLdb, itertools, json,
os, sys
On 2015/03/27 13:12:19, Sebastian Noack wrote:
> On 2015/03/27 11:59:57, kzar wrote:
> > On 2015/03/26 22:56:50, Wladimir Palant wrote:
> > > Nit: PEP 8 requires one import per line and a blank line between groups of
> > > imports (https://www.python.org/dev/peps/pep-0008/#imports). Same in the
> > > other files of course. Yes, I know that most existing scripts aren't
> > > following these requirements yet.
> > 
> > Done.
> 
> While we are on it, please also add a new line between corelib and own module
> imports.

Done.

http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site...
sitescripts/filterhits/bin/process_logs.py:47: sys.exit("Unexpected EOF in log
file %s" % log_file)
On 2015/03/27 13:12:19, Sebastian Noack wrote:
> On 2015/03/27 11:59:57, kzar wrote:
> > On 2015/03/26 22:56:50, Wladimir Palant wrote:
> > > Wouldn't that be a lot simpler if the date and the GET parameters were
> > > stored in the first line of the file and the JSON data in the second? You
> > > would merely need to read a line then...
> > 
> > Sure, I suppose this logic would be simpler that way. Originally it was done
> > this way as each log file contained many entries, one per line and I didn't
> > re-consider the format since we changed to having one entry per file.
> > 
> > Thing is this logic works well and has been well tested now I would prefer
> > to leave it alone. Otherwise I'll need to update the format for test data
> > that we've generated, the unit tests, the supporting tools for Kirill and
> > test it all still works.
> 
> I agree with Wladimir FWIW. I also wouldn't consider any code tested, if it's
> neither used by real users nor passed QA. But you make a point, if we are
> going to change it, we should do it now. ;)

OK fair enough, Done.

http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site...
sitescripts/filterhits/bin/process_logs.py:78: sys.exit("Failed to connect to
the MySQL database.")
On 2015/03/27 13:12:19, Sebastian Noack wrote:
> On 2015/03/27 11:59:57, kzar wrote:
> > On 2015/03/26 22:56:50, Wladimir Palant wrote:
> > > I have the impression that the original exception will be more informative
> > > here, why catch it?
> > 
> > This was Sebastian's idea, not mine. (See the inline discussion around here
> > in Patch set 5.)
> 
> There are no comments on patchset 5. You probably mean this discussion:
> http://codereview.adblockplus.org/4615801646612480/patch/5768755258851328/619...
> 
> I didn't suggest printing this string instead of the original exception here.
> That was your decision. I merely pointed out that you have to open the
> try/finally block closing the connection after you actually create the
> connection.

OK the discussion was in patch set 4... it doesn't matter.

http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site...
File sitescripts/filterhits/common.py (right):

http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site...
sitescripts/filterhits/common.py:47: query_string, json.dumps(data)))
On 2015/03/27 13:12:19, Sebastian Noack wrote:
> On 2015/03/27 11:59:57, kzar wrote:
> > On 2015/03/26 22:56:50, Wladimir Palant wrote:
> > > Nit: you can use |print >>f, ...| - then you won't need to specify the
> > > trailing newline explicitly.
> > 
> > Well apparently that's "unpythonic" I don't mind either way, but are you
> > sure?
> > http://stackoverflow.com/questions/8598228/f-write-vs-print-f
> 
> I don't agree. I agree though that >> syntax is kinda weird. That is why the
> print statement was replaced with a function in Python 3. However, the way
> Wladimir suggested is the way to do it in Python 2.
> 
> Note that |file.write("foo")| isn't completely the same as |print >>file,
> "foo"|. Among other things, the print statement (or function in Python 3)
> considers the terminal encoding when processing unicode objects and
> automatically picks the correct newline sequence depending on the OS.

Done.

http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site...
File sitescripts/filterhits/db.py (right):

http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site...
sitescripts/filterhits/db.py:28: db=config.get("filterhitstats", "test_database"
if testing else "database"),
On 2015/03/27 14:26:06, Sebastian Noack wrote:
> On 2015/03/27 13:12:19, Sebastian Noack wrote:
> > On 2015/03/27 11:59:57, kzar wrote:
> > > On 2015/03/26 22:56:50, Wladimir Palant wrote:
> > > > When would that flag ever be used? This seems to imply that we would
> > > > test on the production server - that sounds like a bad idea, with flag
> > > > or without. We have VMs for that, using production configuration.
> > > 
> > > Well originally I was passing the parameters through to db.connect each
> > > time so that I could pass one database name for tests and another
> > > elsewhere. It led to a lot of duplication though so Sebastian suggested I
> > > change that. (Rightly so I think.)
> > > 
> > > This flag is a little ugly, it's true, but it is very useful for the unit
> > > tests. It allows us to have api tests that don't trash the real data.
> > > (Logged and in the database.) It also allows us to test the DB code and
> > > geometrical mean code which performs calculations in the DB without
> > > trashing real data.
> > > 
> > > I personally think it's a "foot-gun" waiting to happen if we have the unit
> > > tests completely trash all the logged and aggregated data and this is the
> > > best way around that I could see.
> > > 
> > > (It's worth noting as well that I _have_ been testing and developing all
> > > this on a VM but even then having the database and logged data trashed
> > > when running unit tests would be a real pain. It would make manually
> > > testing everything much harder.)
> > 
> > I agree with Wladimir. The footgun here would be testing on the production
> > system, which you shouldn't do anyway. If you are bothered about destroying
> > your test data, you should just back it up.
> 
> Regarding the unit tests, I think it would be a better approach to configure
> the database settings used to run the tests separately. So the unit tests will
> just fail to run in production where no test database is configured. This is
> quite a common approach for testing web apps.

I guess I just disagree with you both on this one! If I do that I will not be
able to run my unit tests whilst developing without having my database + logs
trashed. (I like to test by hand too!)

Also it kind of sucks either not being able to run the tests on the production
server or having them trash all the real data!

http://codereview.adblockplus.org/4615801646612480/diff/5105437288431616/site...
File sitescripts/filterhits/web/submit.py (right):

http://codereview.adblockplus.org/4615801646612480/diff/5105437288431616/site...
sitescripts/filterhits/web/submit.py:51: f.write("[%s] \"%s\" %s\n" %
(time.strftime('%d/%b/%Y:%H:%M:%S', now),
On 2015/03/27 13:12:19, Sebastian Noack wrote:
> Same here, I'd rather go with the print statement.

Done.

Wladimir Palant

Done with the review. Note that I only skimmed the tests, might take a closer ...

March 27, 2015, 4:29 p.m. (2015-03-27 16:29:06 UTC) #28

Wladimir Palant

http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/sitescripts/filterhits/bin/process_logs.py File sitescripts/filterhits/bin/process_logs.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/sitescripts/filterhits/bin/process_logs.py#newcode1 sitescripts/filterhits/bin/process_logs.py:1: # coding: utf-8 One more note: the purpose of ...

March 27, 2015, 4:30 p.m. (2015-03-27 16:30:48 UTC) #29

Sebastian Noack

http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/sitescripts/filterhits/web/submit.py File sitescripts/filterhits/web/submit.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/sitescripts/filterhits/web/submit.py#newcode53 sitescripts/filterhits/web/submit.py:53: print >> f, json.dumps(data) json.dump(data, f)

March 27, 2015, 4:31 p.m. (2015-03-27 16:31:04 UTC) #30

kzar


March 27, 2015, 10:15 p.m. (2015-03-27 22:15:00 UTC) #31

Patch Set 11 : Addressed further comments from Wladimir.

http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site...
File sitescripts/filterhits/bin/process_logs.py (right):

http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site...
sitescripts/filterhits/bin/process_logs.py:31: if os.path.splitext(f)[1] ==
".log" and f[0].isdigit():
On 2015/03/27 16:29:06, Wladimir Palant wrote:
> On 2015/03/27 11:59:57, kzar wrote:
> > That is a simple way to skip over other log files in the directory, namely
> > processing-errors.log. Not strictly necessary but it makes the tool easier
> > to use in practice.
> 
> Aren't all logs we are interested in inside of subdirectories? |if root != dir
> and os.path.splitext(f)[1] == ".log"|?

Not always, for example you might choose to only re-process today's logs in
which case all the log files _would_ be in the root directory.

http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site...
sitescripts/filterhits/bin/process_logs.py:47: sys.exit("Unexpected EOF in log
file %s" % log_file)
On 2015/03/27 16:29:06, Wladimir Palant wrote:
> On 2015/03/27 11:59:57, kzar wrote:
> > Otherwise I'll need to update the format for test data that
> > we've generated, the unit tests, the supporting tools for Kirill and test it
> > all still works.
> 
> Yes, that's why we have reviews - so that we don't change this stuff *after*
> we create lots of tools depending on it ;)
> 
> This should be changed now, it won't get any easier in a year.

Better yet, I changed it in the past :p. (See patch set 10)

http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site...
File sitescripts/filterhits/db.py (right):

http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/site...
sitescripts/filterhits/db.py:28: db=config.get("filterhitstats", "test_database"
if testing else "database"),
On 2015/03/27 16:29:06, Wladimir Palant wrote:
> On 2015/03/27 14:26:06, Sebastian Noack wrote:
> > Regarding the unit tests, I think it would be a better approach to configure
> > the database settings used to run the tests separately. So the unit tests
> > will just fail to run in production where no test database is configured.
> > This is quite a common approach for testing web apps.
> 
> A simpler way to deal with unit tests would be having them change the
> configuration temporarily and prepend "test_" to the database name.

But we are also avoiding logging files when performing the tests of the submit
API. (Without that the real logged data gets mixed up with a bunch of test
data.)

http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/.hgi...
File .hgignore (right):

http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/.hgi...
.hgignore:6: sitescripts/filterhits/test/temp
On 2015/03/27 16:29:06, Wladimir Palant wrote:
> This seems to be an artifact of your personal test environment, not something
> that anybody else would need. You can add it to your personal hgignore file,
> no point having it here for everybody.

Done.

http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/site...
File sitescripts/filterhits/bin/process_logs.py (right):

http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/site...
sitescripts/filterhits/bin/process_logs.py:1: # coding: utf-8
On 2015/03/27 16:30:48, Wladimir Palant wrote:
> One more note: the purpose of this script wasn't clear to me initially, rename
> it into reprocess_logs.py?

Done.

http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/site...
sitescripts/filterhits/bin/process_logs.py:76: sys.exit("Failed to connect to
the MySQL database.")
On 2015/03/27 16:29:06, Wladimir Palant wrote:
> Given that Sebastian doesn't seem to disagree, could you just leave that
> exception alone here instead of turning it into a meaningless message?

Done.

http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/site...
sitescripts/filterhits/bin/process_logs.py:80: except (KeyError, MySQLdb.Error),
e:
On 2015/03/27 16:29:06, Wladimir Palant wrote:
> The json module seems to raise a ValueError exception on invalid syntax. It
> seems that for this exception we won't tell which file caused it.
> 
> Do we even care about the exception type here? How about:
> 
> except:
>   logging.error("Failed to process file %s, all changes rolled back." %
>                 _last_log_file)
>   raise

Done.
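
For reference, a sketch of the log-and-re-raise pattern suggested above. The
function name and the shape of its input are assumptions for illustration, and
this version catches Exception rather than using a bare except so that
KeyboardInterrupt passes through without being logged as a processing failure.

```python
import json
import logging


def process_entries(entries):
    """entries: iterable of (file_name, json_text) pairs (hypothetical shape)."""
    current = None
    results = []
    try:
        for current, text in entries:
            results.append(json.loads(text))
    except Exception:
        # Record which file was being handled, then let the original
        # exception (with its traceback) propagate unchanged.
        logging.error("Failed to process file %s", current)
        raise
    return results
```

The bare |raise| keeps the original exception type, so a ValueError from the
json module still surfaces as a ValueError to the caller.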

http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/site...
File sitescripts/filterhits/common.py (right):

http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/site...
sitescripts/filterhits/common.py:20: return [message.encode("utf-8")]
On 2015/03/27 16:29:06, Wladimir Palant wrote:
> Given that this is only useful for the web handlers, how about moving this
> file into the web/ directory?
> 
> Side-note: I wonder whether you could somehow convince Git that is not a copy
> of sitescripts/reports/bin/removeOldUsers.py.

Done.

http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/site...
File sitescripts/filterhits/static/query.html (right):

http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/site...
sitescripts/filterhits/static/query.html:23: <th>Hits</th>
On 2015/03/27 16:29:06, Wladimir Palant wrote:
> Given that these aren't actual hits - name it Frequency maybe?

Done.

http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/site...
File sitescripts/filterhits/static/query.js (right):

http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/site...
sitescripts/filterhits/static/query.js:38: $('#filter, #domain').keyup(function
() { table.fnDraw(); });
On 2015/03/27 16:29:06, Wladimir Palant wrote:
> How about listening to the "input" event instead? This will work when text is
> being dragged into the input field as well.
> 
> Note that I don't mind registering your event handler the old-fashioned way if
> jQuery doesn't support it ;)

Done.

http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/site...
File sitescripts/filterhits/web/query.py (right):

http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/site...
sitescripts/filterhits/web/query.py:41: where_sql = "WHERE " + where if where
else ""
On 2015/03/27 16:29:06, Wladimir Palant wrote:
> This is confusing, why the intermediate step?
> 
> where_fields = ["%s LIKE %%%s%%" % (s, p) for s, p in ...]
> where_sql = "WHERE " + " AND ".join(where_fields) if where_fields else ""
> 
> However, the value we are searching for should *not* be inserted into the
> query - that's an SQL injection vulnerability. *All* dynamic strings should be
> added to the parameters of the prepared statement. So it should really be
> something like:
>
> where_fields, params = zip(*[("%s LIKE %%s" % s, "%%%s%%" % p) for s, p in ...])
> where_sql = "WHERE " + " AND ".join(where_fields) if where_fields else ""

You're right, this code was confusing; I hadn't looked at it for a while and I
found it hard to re-grok. I've tidied it up a bit now as you suggested.
(Unfortunately I needed to add an extra step, as we can't unpack a potentially
empty array into two values.)

That said, in my defence, I don't think there was an SQL injection vulnerability
here: f[0] is either "domain" or "filter" and did not come from user input. f[1]
is used below, where it is added to the parameter list that is then passed to
the database query function. The database query function should then take care
of escaping those parameters for us.
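
A minimal sketch of what the exchange above is getting at, written for Python 3. The helper name `build_where` is hypothetical, and skipping empty search patterns is an assumption of this sketch rather than something the patch is known to do. Building the two lists in a loop avoids the `zip(*...)` unpacking problem with an empty list:

```python
def build_where(filters):
    """Build a WHERE clause and its parameter list.

    `filters` is a list of (column, pattern) pairs. Column names must come
    from a fixed set in the code, never from user input; patterns are passed
    as query parameters so the database driver does the escaping.
    """
    where_fields = []
    params = []
    for column, pattern in filters:
        if pattern:  # assumption: an empty pattern means "no filter"
            where_fields.append("%s LIKE %%s" % column)
            params.append("%%%s%%" % pattern)
    where_sql = "WHERE " + " AND ".join(where_fields) if where_fields else ""
    return where_sql, params
```

The returned string only ever contains fixed column names and `%s` placeholders; the search terms travel separately in `params`.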

http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/site...
sitescripts/filterhits/web/query.py:44: order_by_sql = "`%s` %s" %
(MySQLdb.escape_string(order_by), order)
On 2015/03/27 16:29:06, Wladimir Palant wrote:
> How about you only allow certain values for order_by instead of escaping some
> bogus field name? As I said, no dynamic strings in the query please.

Done.
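
One way to read "only allow certain values" is a fixed whitelist, sketched below for Python 3. The column names in `ALLOWED_ORDER_COLUMNS` and the fallback behaviour are assumptions made for illustration, not taken from the patch:

```python
ALLOWED_ORDER_COLUMNS = ("filter", "domain", "frequency")  # hypothetical names


def build_order_by(order_by, order):
    # Fall back to defaults rather than passing unknown values into the SQL.
    if order_by not in ALLOWED_ORDER_COLUMNS:
        order_by = ALLOWED_ORDER_COLUMNS[0]
    direction = "DESC" if str(order).lower() == "desc" else "ASC"
    return "`%s` %s" % (order_by, direction)
```

With this approach no request value ever reaches the query text, so no escaping is needed for the ORDER BY part.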

http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/site...
sitescripts/filterhits/web/query.py:63: "500 Database error")
On 2015/03/27 16:29:06, Wladimir Palant wrote:
> The actual exception needs to be logged still, calling traceback.print_exc()
> should do.

Done.
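
A small sketch of logging the actual exception before answering with a 500, assuming Python 3. The wrapper name `safe_call` and the response body are hypothetical; only `traceback.print_exc()` comes from the comment above:

```python
import traceback


def safe_call(func, start_response, log=None):
    """Run func(); on failure log the traceback and answer with a 500."""
    try:
        return func()
    except Exception:
        # With file=None, print_exc() writes to sys.stderr.
        traceback.print_exc(file=log)
        start_response("500 Database error",
                       [("Content-type", "text/plain; charset=utf-8")])
        return [b"Database error"]
```

The client only sees the generic status line, while the full traceback stays on the server side.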

http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/site...
sitescripts/filterhits/web/query.py:70: response_headers = [("Content-type",
"application/json")]
On 2015/03/27 16:29:06, Wladimir Palant wrote:
> "application/json; charset=utf-8" please.

Done.

http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/site...
sitescripts/filterhits/web/query.py:73: "total": total, "count": len(results)})]
On 2015/03/27 16:29:06, Wladimir Palant wrote:
> This should be json.dumps(..., ensure_ascii=False).encode("utf-8")

Done.
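
To illustrate the two comments above about the JSON response, here is a minimal Python 3 sketch. The function name `json_response_body` and the "results" key are assumptions; "total" and "count" appear in the reviewed line:

```python
import json


def json_response_body(results, total):
    # ensure_ascii=False keeps non-ASCII characters as-is instead of \uXXXX
    # escapes; the explicit encode produces the bytes a WSGI app must return.
    payload = {"results": results, "total": total, "count": len(results)}
    return [json.dumps(payload, ensure_ascii=False).encode("utf-8")]
```

The encoding chosen here is what the "application/json; charset=utf-8" header then announces to the client.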

http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/site...
File sitescripts/filterhits/web/submit.py (right):

http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/site...
sitescripts/filterhits/web/submit.py:52: query_string)
On 2015/03/27 16:29:06, Wladimir Palant wrote:
> I wonder whether quotation marks in the query string will break any of the
> tools you wrote. And before you object that most browsers will turn quotation
> marks into %22 - sure, but anybody using telnet or an older IE version can
> still send them to the server.
>
> Given that a query string cannot have whitespace in it due to specifics of the
> HTTP protocol - why use quotation marks as delimiters here?

(Give me a little credit, I would not assume data sent from the client is safe /
correct.)

Done.

http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/site...
sitescripts/filterhits/web/submit.py:53: print >> f, json.dumps(data)
On 2015/03/27 16:31:04, Sebastian Noack wrote:
> json.dump(data, f)

Done.

http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/site...
sitescripts/filterhits/web/submit.py:82: "500 Logging error")
On 2015/03/27 16:29:06, Wladimir Palant wrote:
> This seems to be an unexpected error, one that is worth logging. Do
> traceback.print_exc() as well here?

Done.

http://codereview.adblockplus.org/4615801646612480/diff/5755424754106368/site...
sitescripts/filterhits/web/submit.py:104: print >> f, "[%s] %s" %
(datetime.now().strftime('%d/%b/%Y:%H:%M:%S %z'), message)
On 2015/03/27 16:29:06, Wladimir Palant wrote:
> I'd suggest using traceback module here instead of generating custom messages.
> This should also allow logging any error types rather than only two.

Done.

Wladimir Palant

Almost there, only a few nits left. http://codereview.adblockplus.org/4615801646612480/diff/5735425238892544/.hgignore File .hgignore (right): http://codereview.adblockplus.org/4615801646612480/diff/5735425238892544/.hgignore#newcode5 .hgignore:5: dist Now ...

March 28, 2015, 12:59 p.m. (2015-03-28 12:59:19 UTC) #32

kzar

Patch Set 12 : Addressed final nits. Patch Set 13 : Rebased. http://codereview.adblockplus.org/4615801646612480/diff/5735425238892544/.hgignore File .hgignore ...

March 28, 2015, 2:11 p.m. (2015-03-28 14:11:56 UTC) #33

Wladimir Palant

LGTM http://codereview.adblockplus.org/4615801646612480/diff/5735425238892544/sitescripts/filterhits/web/common.py File sitescripts/filterhits/web/common.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5735425238892544/sitescripts/filterhits/web/common.py#newcode1 sitescripts/filterhits/web/common.py:1: # coding: utf-8 On 2015/03/28 14:11:56, kzar wrote: ...

March 29, 2015, 7:29 a.m. (2015-03-29 07:29:11 UTC) #34

kzar

http://codereview.adblockplus.org/4615801646612480/diff/5735425238892544/sitescripts/filterhits/web/common.py File sitescripts/filterhits/web/common.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5735425238892544/sitescripts/filterhits/web/common.py#newcode1 sitescripts/filterhits/web/common.py:1: # coding: utf-8 On 2015/03/29 07:29:12, Wladimir Palant wrote: ...

March 29, 2015, 11:23 a.m. (2015-03-29 11:23:40 UTC) #35

Wladimir Palant

Sorry, not ready after all. The way unit tests work still needs to be fixed. ...

March 30, 2015, 11:23 a.m. (2015-03-30 11:23:55 UTC) #36

kzar

Patch Set 14 : Removed db.testing flag, instead overwrite config during testing. http://codereview.adblockplus.org/4615801646612480/diff/5659118702428160/sitescripts/filterhits/db.py File sitescripts/filterhits/db.py ...

March 30, 2015, 1:01 p.m. (2015-03-30 13:01:42 UTC) #37

Wladimir Palant

http://codereview.adblockplus.org/4615801646612480/diff/5154545407623168/sitescripts/filterhits/test/test_helpers.py File sitescripts/filterhits/test/test_helpers.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5154545407623168/sitescripts/filterhits/test/test_helpers.py#newcode34 sitescripts/filterhits/test/test_helpers.py:34: config.set("filterhitstats", "log_dir", test_config["log_dir"]) Maybe create a real temporary directory ...

March 30, 2015, 1:13 p.m. (2015-03-30 13:13:46 UTC) #38

kzar

Patch Set 15 : Make test log directory path configurable and ensure it's always cleared. ...

March 30, 2015, 3:14 p.m. (2015-03-30 15:14:53 UTC) #39

Wladimir Palant

http://codereview.adblockplus.org/4615801646612480/diff/4815735301865472/sitescripts/filterhits/test/test_helpers.py File sitescripts/filterhits/test/test_helpers.py (right): http://codereview.adblockplus.org/4615801646612480/diff/4815735301865472/sitescripts/filterhits/test/test_helpers.py#newcode29 sitescripts/filterhits/test/test_helpers.py:29: "log_dir": config.get("filterhitstats", "test_log_dir") Why configure that directory? setup_config() can ...

March 30, 2015, 6:35 p.m. (2015-03-30 18:35:53 UTC) #40

kzar

Patch Set 16 : Create temporary log directory with tempfile module for tests. http://codereview.adblockplus.org/4615801646612480/diff/4815735301865472/sitescripts/filterhits/test/test_helpers.py File ...

March 30, 2015, 7:01 p.m. (2015-03-30 19:01:34 UTC) #41

Wladimir Palant

http://codereview.adblockplus.org/4615801646612480/diff/5633494390669312/sitescripts/filterhits/test/test_helpers.py File sitescripts/filterhits/test/test_helpers.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5633494390669312/sitescripts/filterhits/test/test_helpers.py#newcode30 sitescripts/filterhits/test/test_helpers.py:30: "log_dir": tempfile.mkdtemp() This will create the directory exactly once ...

March 30, 2015, 7:07 p.m. (2015-03-30 19:07:03 UTC) #42

kzar

Patch Set 17 : Make sure the temporary log directory is recreated for each test. ...

March 30, 2015, 7:29 p.m. (2015-03-30 19:29:02 UTC) #43

Sebastian Noack

Looks mostly good. Only a few nits left. http://codereview.adblockplus.org/4615801646612480/diff/5722383033827328/sitescripts/filterhits/db.py File sitescripts/filterhits/db.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5722383033827328/sitescripts/filterhits/db.py#newcode38 sitescripts/filterhits/db.py:38: if ...

March 31, 2015, 7:55 a.m. (2015-03-31 07:55:21 UTC) #45

Sebastian Noack

http://codereview.adblockplus.org/4615801646612480/diff/5722383033827328/sitescripts/filterhits/test/log_tests.py File sitescripts/filterhits/test/log_tests.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5722383033827328/sitescripts/filterhits/test/log_tests.py#newcode64 sitescripts/filterhits/test/log_tests.py:64: if __name__ == '__main__': On 2015/03/31 07:55:21, Sebastian Noack ...

March 31, 2015, 9:20 a.m. (2015-03-31 09:20:18 UTC) #46

kzar

Patch Set 18 : Addressed further feedback from Sebastian. http://codereview.adblockplus.org/4615801646612480/diff/5722383033827328/sitescripts/filterhits/db.py File sitescripts/filterhits/db.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5722383033827328/sitescripts/filterhits/db.py#newcode38 sitescripts/filterhits/db.py:38: ...

March 31, 2015, 9:48 a.m. (2015-03-31 09:48:56 UTC) #47

Sebastian Noack

http://codereview.adblockplus.org/4615801646612480/diff/5722383033827328/sitescripts/filterhits/test/log_tests.py File sitescripts/filterhits/test/log_tests.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5722383033827328/sitescripts/filterhits/test/log_tests.py#newcode31 sitescripts/filterhits/test/log_tests.py:31: self.config = test_helpers.setup_config() On 2015/03/31 09:48:56, kzar wrote: > ...

March 31, 2015, 10:19 a.m. (2015-03-31 10:19:03 UTC) #48

kzar

http://codereview.adblockplus.org/4615801646612480/diff/5722383033827328/sitescripts/filterhits/test/log_tests.py File sitescripts/filterhits/test/log_tests.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5722383033827328/sitescripts/filterhits/test/log_tests.py#newcode31 sitescripts/filterhits/test/log_tests.py:31: self.config = test_helpers.setup_config() On 2015/03/31 10:19:04, Sebastian Noack wrote: ...

March 31, 2015, 10:27 a.m. (2015-03-31 10:27:46 UTC) #49

Sebastian Noack

http://codereview.adblockplus.org/4615801646612480/diff/5722383033827328/sitescripts/filterhits/test/log_tests.py File sitescripts/filterhits/test/log_tests.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5722383033827328/sitescripts/filterhits/test/log_tests.py#newcode31 sitescripts/filterhits/test/log_tests.py:31: self.config = test_helpers.setup_config() On 2015/03/31 10:27:47, kzar wrote: > ...

March 31, 2015, 10:32 a.m. (2015-03-31 10:32:22 UTC) #50

http://codereview.adblockplus.org/4615801646612480/diff/5722383033827328/site...
File sitescripts/filterhits/test/log_tests.py (right):

http://codereview.adblockplus.org/4615801646612480/diff/5722383033827328/site...
sitescripts/filterhits/test/log_tests.py:31: self.config =
test_helpers.setup_config()
On 2015/03/31 10:27:47, kzar wrote:
> On 2015/03/31 10:19:04, Sebastian Noack wrote:
> > On 2015/03/31 09:48:56, kzar wrote:
> > > On 2015/03/31 07:55:21, Sebastian Noack wrote:
> > > > Why do you assign this attribute? It doesn't seem to be used in the
> > > > other methods. Do we actually need to return it from setup_config()?
> > >
> > > Well the reason I've done it this way is that I'm trying to set a
> > > convention by calling the setup_config and resore_config functions in the
> > > same way for all the tests. I think having access to the configuration is
> > > generally something useful for tests to have and I want to avoid the
> > > situation where in the future someone forgets to call the setup_config
> > > function for a test and we end up trashing real data.
> > >
> > > It's true we could remove the return value, add an extra include here and
> > > get the configuration ourselves but I think it's cleaner how it is.
> >
> > I see, but this is the wrong way doing it then. Instead you should implement
> > a base class or mixin, implementing the setUp() and tearDown() methods in
> > test_helpers.py, deriving from it here and for the other tests.
>
> I agree, I thought of doing it that way originally. The problem I saw was that
> I would still have to call the base class' setup + teardown methods manually[1]
> and therefore I don't think the added complexity would actually help us.
>
> [1] - "If you want the setUpClass and tearDownClass on base classes called
> then you must call up to them yourself."
> https://docs.python.org/2/library/unittest.html#setupclass-and-teardownclass

Well, this documentation refers to setUpClass/tearDownClass, not setUp/tearDown.
Did you actually try to inherit the latter from a base class? I didn't find any
documentation saying that it wouldn't work.

Wladimir Palant

http://codereview.adblockplus.org/4615801646612480/diff/5722383033827328/sitescripts/filterhits/web/submit.py File sitescripts/filterhits/web/submit.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5722383033827328/sitescripts/filterhits/web/submit.py#newcode106 sitescripts/filterhits/web/submit.py:106: response_headers = [("Content-type", "text/plain")] On 2015/03/31 07:55:21, Sebastian Noack ...

March 31, 2015, 1:42 p.m. (2015-03-31 13:42:31 UTC) #51

Sebastian Noack

http://codereview.adblockplus.org/4615801646612480/diff/5722383033827328/sitescripts/filterhits/web/submit.py File sitescripts/filterhits/web/submit.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5722383033827328/sitescripts/filterhits/web/submit.py#newcode106 sitescripts/filterhits/web/submit.py:106: response_headers = [("Content-type", "text/plain")] On 2015/03/31 13:42:31, Wladimir Palant ...

March 31, 2015, 1:46 p.m. (2015-03-31 13:46:53 UTC) #52

kzar

Patch Set 19 : Created base class for tests and reverted earlier content-type change. (Sorry ...

April 1, 2015, 7:09 p.m. (2015-04-01 19:09:39 UTC) #53

Patch Set 19 : Created base class for tests and reverted earlier content-type
change.

(Sorry for the delay with this, I've been having some issues with my new
laptop.)

http://codereview.adblockplus.org/4615801646612480/diff/5722383033827328/site...
File sitescripts/filterhits/test/log_tests.py (right):

http://codereview.adblockplus.org/4615801646612480/diff/5722383033827328/site...
sitescripts/filterhits/test/log_tests.py:31: self.config =
test_helpers.setup_config()
On 2015/03/31 10:32:23, Sebastian Noack wrote:
> On 2015/03/31 10:27:47, kzar wrote:
> > On 2015/03/31 10:19:04, Sebastian Noack wrote:
> > > On 2015/03/31 09:48:56, kzar wrote:
> > > > On 2015/03/31 07:55:21, Sebastian Noack wrote:
> > > > > Why do you assign this attribute? It doesn't seem to be used in the
> > > > > other methods. Do we actually need to return it from setup_config()?
> > > >
> > > > Well the reason I've done it this way is that I'm trying to set a
> > > > convention by calling the setup_config and resore_config functions in
> > > > the same way for all the tests. I think having access to the
> > > > configuration is generally something useful for tests to have and I want
> > > > to avoid the situation where in the future someone forgets to call the
> > > > setup_config function for a test and we end up trashing real data.
> > > >
> > > > It's true we could remove the return value, add an extra include here
> > > > and get the configuration ourselves but I think it's cleaner how it is.
> > >
> > > I see, but this is the wrong way doing it then. Instead you should
> > > implement a base class or mixin, implementing the setUp() and tearDown()
> > > methods in test_helpers.py, deriving from it here and for the other tests.
> >
> > I agree, I thought of doing it that way originally. The problem I saw was
> > that I would still have to call the base class' setup + teardown methods
> > manually[1] and therefore I don't think the added complexity would actually
> > help us.
> >
> > [1] - "If you want the setUpClass and tearDownClass on base classes called
> > then you must call up to them yourself."
> > https://docs.python.org/2/library/unittest.html#setupclass-and-teardownclass
>
> Well, this documentation refers to setUpClass/tearDownClass, not
> setUp/tearDown. Did you actually try to inherit the latter from a base class?
> I didn't find any documentation saying that it wouldn't work.

You were right, I've done it that way now and it works fine.

Done.

http://codereview.adblockplus.org/4615801646612480/diff/5722383033827328/site...
File sitescripts/filterhits/web/submit.py (right):

http://codereview.adblockplus.org/4615801646612480/diff/5722383033827328/site...
sitescripts/filterhits/web/submit.py:106: response_headers = [("Content-type",
"text/plain")]
On 2015/03/31 13:46:53, Sebastian Noack wrote:
> On 2015/03/31 13:42:31, Wladimir Palant wrote:
> > On 2015/03/31 07:55:21, Sebastian Noack wrote:
> > > Do we actually need to specify a content type for no content?
> > 
> > Yes, we do - the browser still needs to know how to interpret the document,
> > even if it is empty.
>
> Well, if we would also set the correct status code - i.e. "204 No Content" -
> the browser shouldn't expect any content anyway.

OK, so I tested returning a 200 with and without the content type and both
worked for me, but if Wladimir says the browser might still need it I'm inclined
to believe him. You're right that 204 is technically a more correct status code,
but my vote is to still return 200, as in practice a client might check for
that exact code.

So I vote we just return 200 with the possibly redundant content type. As far as
I can see it is the most likely to work as we want, and I don't think there are
any downsides apart from transmitting a few extra bytes.

Sebastian Noack

http://codereview.adblockplus.org/4615801646612480/diff/5713050069893120/sitescripts/filterhits/test/test_helpers.py File sitescripts/filterhits/test/test_helpers.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5713050069893120/sitescripts/filterhits/test/test_helpers.py#newcode45 sitescripts/filterhits/test/test_helpers.py:45: # Set up test config There is a lot ...

April 2, 2015, 7:37 a.m. (2015-04-02 07:37:04 UTC) #54

kzar

http://codereview.adblockplus.org/4615801646612480/diff/5713050069893120/sitescripts/filterhits/test/test_helpers.py File sitescripts/filterhits/test/test_helpers.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5713050069893120/sitescripts/filterhits/test/test_helpers.py#newcode45 sitescripts/filterhits/test/test_helpers.py:45: # Set up test config On 2015/04/02 07:37:04, Sebastian ...

April 2, 2015, 7:47 a.m. (2015-04-02 07:47:46 UTC) #55

http://codereview.adblockplus.org/4615801646612480/diff/5713050069893120/site...
File sitescripts/filterhits/test/test_helpers.py (right):

http://codereview.adblockplus.org/4615801646612480/diff/5713050069893120/site...
sitescripts/filterhits/test/test_helpers.py:45: # Set up test config
On 2015/04/02 07:37:04, Sebastian Noack wrote:
> There is a lot of boilerplate and duplication in the code mocking and
> restoring the config. How about this?
>
> def setUp(self):
>   self._backup = {}
>
>   for option in ("dbuser", "dbpassword", "database"):
>     self._backup[option] = config.get("filterhitstats", option)
>     config.set("filterhitstats", option,
>                config.get("filterhitstats", "test_" + option))
>
>   self._backup["log_dir"] = config.get("filterhitstats", "log_dir")
>   self._test_dir = tempfile.mkdtemp()
>   config.set("filterhitstats", "log_dir", self._test_dir)
>
>   ...
>
> def tearDown(self):
>   ...
>
>   for option, value in self._backup.iteritems():
>     config.set("filterhitstats", option, value)

I prefer it as it is, although it doesn't matter much either way. (I suppose one
advantage to my version is that the configuration values are read less often.)

http://codereview.adblockplus.org/4615801646612480/diff/5713050069893120/site...
sitescripts/filterhits/test/test_helpers.py:56: self.db = None
On 2015/04/02 07:37:04, Sebastian Noack wrote:
> Wouldn't it be better to fail here with the actual MySQLdb.Error, rather than
> with a meaningless TypeError during the test?

The tests that require database access first check that self.db is truthy, if
not they skip. This way we can still run the other tests without a test database
set up.

http://codereview.adblockplus.org/4615801646612480/diff/5713050069893120/site...
File sitescripts/filterhits/web/submit.py (right):

http://codereview.adblockplus.org/4615801646612480/diff/5713050069893120/site...
sitescripts/filterhits/web/submit.py:108: start_response("200 OK",
response_headers)
On 2015/04/02 07:37:04, Sebastian Noack wrote:
> Again, the status should be "204 No Content". And then you shouldn't need a
> mime type.

What's your opinion on this Wladimir? As earlier discussed I would rather return
200 in case clients (wrongly) check for that explicitly. Sebastian would rather
return 204 as it's a more accurate status code.

Sebastian Noack

http://codereview.adblockplus.org/4615801646612480/diff/5713050069893120/sitescripts/filterhits/test/test_helpers.py File sitescripts/filterhits/test/test_helpers.py (right): http://codereview.adblockplus.org/4615801646612480/diff/5713050069893120/sitescripts/filterhits/test/test_helpers.py#newcode45 sitescripts/filterhits/test/test_helpers.py:45: # Set up test config On 2015/04/02 07:47:47, kzar ...

April 2, 2015, 8:11 a.m. (2015-04-02 08:11:44 UTC) #56

http://codereview.adblockplus.org/4615801646612480/diff/5713050069893120/site...
File sitescripts/filterhits/test/test_helpers.py (right):

http://codereview.adblockplus.org/4615801646612480/diff/5713050069893120/site...
sitescripts/filterhits/test/test_helpers.py:45: # Set up test config
On 2015/04/02 07:47:47, kzar wrote:
> On 2015/04/02 07:37:04, Sebastian Noack wrote:
> > There is a lot of boilerplate and duplication in the code mocking and
> > restoring the config. How about this?
> >
> > def setUp(self):
> >   self._backup = {}
> >
> >   for option in ("dbuser", "dbpassword", "database"):
> >     self._backup[option] = config.get("filterhitstats", option)
> >     config.set("filterhitstats", option,
> >                config.get("filterhitstats", "test_" + option))
> >
> >   self._backup["log_dir"] = config.get("filterhitstats", "log_dir")
> >   self._test_dir = tempfile.mkdtemp()
> >   config.set("filterhitstats", "log_dir", self._test_dir)
> >
> >   ...
> >
> > def tearDown(self):
> >   ...
> >
> >   for option, value in self._backup.iteritems():
> >     config.set("filterhitstats", option, value)
>
> I prefer it as it is, although it doesn't matter much either way. (I suppose
> one advantage to my version is that the configuration values are read less
> often.)

We don't have to care about performance here. Moreover, reading the values again
for each test is actually an advantage. The test should back up the
application's state before it runs and restore it when it's done. Retrieving the
config at module initialization might cause problems if the config gets mocked
by other parts of the application. That is more of a theoretical scenario here,
since the test runs as soon as the module is loaded, but still. However, the
more important point is that duplication and unneeded boilerplate aren't great.
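
A self-contained Python 3 sketch of the backup-and-restore idea described above. The module-level `config` object, its values and the class names are stand-ins invented for this example; the real code uses the shared sitescripts config:

```python
import configparser
import shutil
import tempfile
import unittest

# Stand-in for the shared config object (assumption for this sketch).
config = configparser.RawConfigParser()
config.read_dict({"filterhitstats": {
    "database": "live_db",
    "test_database": "test_db",
    "log_dir": "/var/log/filterhits",
}})


class FilterHitsTestCase(unittest.TestCase):
    def setUp(self):
        # Read the current values when the test starts, not at import time.
        self._backup = {
            option: config.get("filterhitstats", option)
            for option in ("database", "log_dir")
        }
        config.set("filterhitstats", "database",
                   config.get("filterhitstats", "test_database"))
        self._test_dir = tempfile.mkdtemp()
        config.set("filterhitstats", "log_dir", self._test_dir)

    def tearDown(self):
        shutil.rmtree(self._test_dir)
        for option, value in self._backup.items():
            config.set("filterhitstats", option, value)


class ExampleTest(FilterHitsTestCase):
    def test_uses_test_database(self):
        self.assertEqual(config.get("filterhitstats", "database"), "test_db")
```

After each test the original values are back in place, whatever they were when that test began.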

http://codereview.adblockplus.org/4615801646612480/diff/5713050069893120/site...
sitescripts/filterhits/test/test_helpers.py:56: self.db = None
On 2015/04/02 07:47:47, kzar wrote:
> On 2015/04/02 07:37:04, Sebastian Noack wrote:
> > Wouldn't it be better to fail here with the actual MySQLdb.Error, rather
> > than with a meaningless TypeError during the test?
>
> The tests that require database access first check that self.db is truthy, if
> not they skip. This way we can still run the other tests without a test
> database set up.

How about using a getter for the db property?

@property
def db(self):
  if not self._db:
    self._db = db.connect()
    self._clear_database()
  return self._db

That way tests that don't need the db won't even try to connect and will still
run if there is a problem with the db. And tests that require the db will fail
with a meaningful exception.

http://codereview.adblockplus.org/4615801646612480/diff/5713050069893120/site...
File sitescripts/filterhits/web/submit.py (right):

http://codereview.adblockplus.org/4615801646612480/diff/5713050069893120/site...
sitescripts/filterhits/web/submit.py:108: start_response("200 OK",
response_headers)
On 2015/04/02 07:47:47, kzar wrote:
> On 2015/04/02 07:37:04, Sebastian Noack wrote:
> > Again, the status should be "204 No Content". And then you shouldn't need a
> > mime type.
>
> What's your opinion on this Wladimir? As earlier discussed I would rather
> return 200 in case clients (wrongly) check for that explicitly. Sebastian
> would rather return 204 as it's a more accurate status code.

The only client we have to bother about here is our browser extension, where we
can easily make sure that it correctly handles the response status.

kzar

Patch Set 20 : Addressed further comments from Sebastian. (Ignore "change" to __init__.py files, the ...

April 2, 2015, 10:16 a.m. (2015-04-02 10:16:54 UTC) #57

Patch Set 20 : Addressed further comments from Sebastian.

(Ignore "change" to __init__.py files, the patchset is a little noisy as I
rebased.)

http://codereview.adblockplus.org/4615801646612480/diff/5713050069893120/site...
File sitescripts/filterhits/test/test_helpers.py (right):

http://codereview.adblockplus.org/4615801646612480/diff/5713050069893120/site...
sitescripts/filterhits/test/test_helpers.py:45: # Set up test config
On 2015/04/02 08:11:44, Sebastian Noack wrote:
> On 2015/04/02 07:47:47, kzar wrote:
> > On 2015/04/02 07:37:04, Sebastian Noack wrote:
> > > There is a lot of boilerplate and duplication in the code mocking and
> > > restoring the config. How about this?
> > >
> > > def setUp(self):
> > >   self._backup = {}
> > >
> > >   for option in ("dbuser", "dbpassword", "database"):
> > >     self._backup[option] = config.get("filterhitstats", option)
> > >     config.set("filterhitstats", option,
> > >                config.get("filterhitstats", "test_" + option))
> > >
> > >   self._backup["log_dir"] = config.get("filterhitstats", "log_dir")
> > >   self._test_dir = tempfile.mkdtemp()
> > >   config.set("filterhitstats", "log_dir", self._test_dir)
> > >
> > >   ...
> > >
> > > def tearDown(self):
> > >   ...
> > >
> > >   for option, value in self._backup.iteritems():
> > >     config.set("filterhitstats", option, value)
> >
> > I prefer it as it is, although it doesn't matter much either way. (I suppose
> > one advantage to my version is that the configuration values are read less
> > often.)
>
> We don't have to care about performance here. Moreover, this is actually an
> advantage. The test should back up the application's state before it runs,
> restoring it when it's done. Retrieving the config already on module
> initialization might cause problems if the config gets mocked by other parts of
> the application. More a theoretical scenario here, since as soon as the module
> is loaded the test runs, but still. However, the more important point here is
> that duplication and unneeded boilerplate isn't great.

Fair enough, well how about this? I've reduced duplication like you've suggested
while keeping the style to my taste.

http://codereview.adblockplus.org/4615801646612480/diff/5713050069893120/site...
sitescripts/filterhits/test/test_helpers.py:56: self.db = None
On 2015/04/02 08:11:44, Sebastian Noack wrote:
> On 2015/04/02 07:47:47, kzar wrote:
> > On 2015/04/02 07:37:04, Sebastian Noack wrote:
> > > Wouldn't it be better to fail here with the actual MySQLdb.Error, rather
> > > than with a meaningless TypeError during the test?
> >
> > The tests that require database access first check that self.db is truthy,
> > if not they skip. This way we can still run the other tests without a test
> > database set up.
>
> How about using a getter for the db property?
>
> @property
> def db(self):
>   if not self._db:
>     self._db = db.connect()
>     self._clear_database()
>   return self._db
> 
> That way tests that don't need the db won't even try to connect and will still
> run if there is a problem with the db. And tests that require the db will fail
> with a meaningful exception.

Done.

http://codereview.adblockplus.org/4615801646612480/diff/5713050069893120/site...
File sitescripts/filterhits/web/submit.py (right):

http://codereview.adblockplus.org/4615801646612480/diff/5713050069893120/site...
sitescripts/filterhits/web/submit.py:108: start_response("200 OK",
response_headers)
On 2015/04/02 08:11:44, Sebastian Noack wrote:
> On 2015/04/02 07:47:47, kzar wrote:
> > On 2015/04/02 07:37:04, Sebastian Noack wrote:
> > > Again, the status should be "204 No Content". And then you shouldn't need
> > > a mime type.
> >
> > What's your opinion on this Wladimir? As earlier discussed I would rather
> > return 200 in case clients (wrongly) check for that explicitly. Sebastian
> > would rather return 204 as it's a more accurate status code.
>
> The only client we have to bother about here is our browser extension, where
> we can easily make sure that it correctly handles the response status.

Fine, Done.
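
For illustration, the outcome agreed on here could look roughly like the following WSGI handler. This is a Python 3 sketch; the function name `handler` is hypothetical and the real submit handler does more work before responding:

```python
def handler(environ, start_response):
    # Nothing to send back, so answer 204 and omit the Content-Type header;
    # the response body is an empty iterable.
    start_response("204 No Content", [])
    return []
```

A client that checks for success should then accept 204 as well as 200.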

LGTM

Issue 4615801646612480: Issue 395 - Filter hits statistics backend (Closed)


Patch Set 1
Patch Set 2 : Improvements regarding comments
Patch Set 3 : Addressed comments.
Patch Set 4 : Added API tests, addressed comments and some other improvements.
Patch Set 5 : Addressed more comments.
Patch Set 6 : Display friendly message if processing script can't connect to DB.
Patch Set 7 : Rebased.
Patch Set 8 : Simplified geometrical_mean code and reduced filter inserts.
Patch Set 9 : Addressed some of Wladimir's comments
Patch Set 10 : Addressed Sebastian's and Wladimir's comments.
Patch Set 11 : Addressed further comments from Wladimir.
Patch Set 12 : Addressed final nits.
Patch Set 13 : Rebased.
Patch Set 14 : Removed db.testing flag, instead overwrite config during testing.
Patch Set 15 : Make test log directory path configurable and ensure it's always cleared.
Patch Set 16 : Create temporary log directory with tempfile module for tests.
Patch Set 17 : Make sure the temporary log directory is recreated for each test.
Patch Set 18 : Addressed further feedback from Sebastian.
Patch Set 19 : Created base class for tests and reverted earlier content-type change.
Patch Set 20 : Addressed further comments from Sebastian.
