|
|
Created:
Aug. 7, 2015, 6:34 a.m. by Felix Dahlke Modified:
Aug. 7, 2015, 7:05 p.m. CC:
Wladimir Palant Visibility:
Public. |
DescriptionIssue 2868 - Don't pull the notification repo when parsing
We need to keep it up-to-date on the infrastructure side, see:
https://issues.adblockplus.org/ticket/2869
Patch Set 1 #
MessagesTotal messages: 10
If you only want to prevent multiple threads from concurrently running hg pull, how about using synchronization? hg_pull_pending = False hg_pull_finished = threading.Condition() def pull_once(): with hg_pull_finished: if hg_pull_pending: hg_pull_finished.wait() return hg_pull_pending = True subprocess.call(["hg", "-R", repo, "pull", "-q"]) with hg_pull_finished: hg_pull_pending = False hg_pull_finished.notify_all() This will only run "hg pull" if isn't called by a different thread yet. If it is we don't call it again, but simply wait until the pending call is finished. I don't have a too strong opinion though. But the advantages of this approach would be: 1. The repo is automatically pulled during testing and development. 2. The result is more predictable in production, as you don't need to consider two independent intervals (the cache expiration, and the cronjob pulling the repo) 3. You don't need to keep in mind that the repo is / needs to be updated by other means. 4. We make sure that the repo is updated where we access it. Hence we have higher code locality and don't need to rely on external mechanisms.
On 2015/08/07 07:58:06, Sebastian Noack wrote: > If you only want to prevent multiple threads from concurrently running hg pull, > how about using synchronization? > > hg_pull_pending = False > hg_pull_finished = threading.Condition() > > def pull_once(): > with hg_pull_finished: > if hg_pull_pending: > hg_pull_finished.wait() > return > hg_pull_pending = True > > subprocess.call(["hg", "-R", repo, "pull", "-q"]) > > with hg_pull_finished: > hg_pull_pending = False > hg_pull_finished.notify_all() > > This will only run "hg pull" if isn't called by a different thread yet. If it is > we don't call it again, but simply wait until the pending call is finished. I > don't have a too strong opinion though. But the advantages of this approach > would be: > > 1. The repo is automatically pulled during testing and development. > 2. The result is more predictable in production, as you don't need to consider > two independent intervals (the cache expiration, and the cronjob pulling the > repo) > 3. You don't need to keep in mind that the repo is / needs to be updated by > other means. > 4. We make sure that the repo is updated where we access it. Hence we have > higher code locality and don't need to rely on external mechanisms. I initially liked that, but actually, I'm worried about making this code too complex. It already depends on the notifications repository being present - why shouldn't it depend on that repository being up to date, too? It doesn't actually change so often that we'd have to check it continuously. We're also doing the same with the sitescripts repository - we just assumes it's present and up-to-date. I think in the long run, if we'd move away from syncing cron jobs and would trigger deployments from a CI server, e.g. after tests pass, it'd be a pretty nice way to do it. The main downside I see is that sitescripts and infrastructure are more entangled this way. They already are to quite some degree, making sitescripts difficult to test and understand in isolation. I'm not entirely sure how to best tackle that, but I feel that moving logic that's more understandably expressed in infrastructure to sitescripts is not the best way to achieve that.
On 2015/08/07 12:55:38, Felix Dahlke wrote: > The main downside I see is that sitescripts and infrastructure are more > entangled this way. They already are to quite some degree, making sitescripts > difficult to test and understand in isolation. I'm not entirely sure how to best > tackle that, but I feel that moving logic that's more understandably expressed > in infrastructure to sitescripts is not the best way to achieve that. Well, I do agree that the entanglement is an issue. But I do also believe that is is actually something we should tackle in module sitescripts. If that one would be a more regular Python package, with a setup.py and so forth, we could threat it like "regular" software. In both development and operation. (In fact it would even make sense to split it up in multiple packages, in order to account for the various problem domains the module is addressing. But that is most likely over-engineering for now.) Right now it's - from infrastructure perspective - basically a collection of glue. And the only official setup/installation mechanism in existence is not only inconsistent (which in turn is an infrastructure issue we are already improving on), but also requires the entire infrastructure development stack. This is of course not a problem when working with both modules, like I do. But it is probably a vast amount of overhead if one just wants to work with the Python repo. I would be glad if we could go into this topic a bit further. Not here in Rietveld, of course, but in general. I would even like to help on applying such changes to the repository. Let me know if you like the approach.
As I said, it was just a suggestion, I'm not convinced of myself yet. But I don't see how using synchronization to prevent concurring hg pull invocation, makes anything even more entangled with the infrastructure module. In fact, I think the opposite is the case. Instead of spreading the logic between infrastructure and sitescripts, the suggested approach would keep it isolated in sitescripts.
On 2015/08/07 13:36:25, Sebastian Noack wrote: > As I said, it was just a suggestion, I'm not convinced of myself yet. But I > don't see how using synchronization to prevent concurring hg pull invocation, > makes anything even more entangled with the infrastructure module. In fact, I > think the opposite is the case. Instead of spreading the logic between > infrastructure and sitescripts, the suggested approach would keep it isolated in > sitescripts. Fair enough. Yet I need to clarify that there is a clear requirement for the synchronization to not not be performed by the `.*.web.*` handler in the first place. As I wrote in some IRC channel before, that particular case would not have passed my review anyway, with or without manual locking or similar. IMHO this belongs in the `bin.*` counterpart, to be then invoked by a cron job. That way it's not implemented in infrastructure but also no overhead in serving requests. One could of course consider the Cron part entanglement itself. Unfortunately I do not see any reasonable alternative for that. Aside from extending the sitescripts module once again, in order to include job management of some sort. But that is probably over engineering as well.
Well, the idea was to avoid having an additional cronjob to update the repository, by doing locking in the WSGI handler to keep updating the repository there. If you disagree to that approach as well, we can just drop this disccussion, and simply merge this patch as is, adding a cron job to update the repository. As I said it were just an idea that came to my mind, in order to keep the logic here, where I thought we agreed, it belongs. In particular you seemd to be in favor of addressing the issue here, rather than in infrastructure. Note that my previous reply were addressing Felix' concerns, not yours. But fine with me.
LGTM (although I was not a reviewer in this ticket before, I now somehow ended up as one). On 2015/08/07 14:08:17, Sebastian Noack wrote: > Well, the idea was to avoid having an additional cronjob to update the > repository, by doing locking in the WSGI handler to keep updating the repository > there. If you disagree to that approach as well, we can just drop this > disccussion, and simply merge this patch as is, adding a cron job to update the > repository. As I said it were just an idea that came to my mind, in order to > keep the logic here, where I thought we agreed, it belongs. In particular you > seemd to be in favor of addressing the issue here, rather than in > infrastructure. Note that my previous reply were addressing Felix' concerns, not > yours. But fine with me. Very well. But please don't get me wrong, I do see the elegance in this approach. Yet I've also seen the computational overhead produced by such stuff (both in the past as well as with this particular change here), and as long as we don't run multi-core (4x+) machines exclusively for that we cannot operate such a handler.
Acquiring a lock, to check/set a variable, compared to calling a subprocess that will cause network IO, certainly doesn't add any measurable overhead. But sure, if we get that subprocess completely out of this code path, that would certainly remove some overhead. Though I think it probably doesn't matter too much here, as we (have to) rely on heavy caching anyway, to make this perform somehow. As far as I understood it, the problem merely was that we have to avoid running too many concurring "hg pull" subprocesses during cache warm up. So simply not calling "hg pull" if it is already running seemed an adequate solution. I also listed some advantages above. But Maybe I miss some point. However, either way, I don't care too much how this issue is tackeled. So if everybody is happy with doing so in infrastructure fine with me. LGTM.
On 2015/08/07 15:06:12, Sebastian Noack wrote: > Acquiring a lock, to check/set a variable, compared to calling a subprocess that > will cause network IO, certainly doesn't add any measurable overhead. But sure, > if we get that subprocess completely out of this code path, that would certainly > remove some overhead. Though I think it probably doesn't matter too much here, > as we (have to) rely on heavy caching anyway, to make this perform somehow. As > far as I understood it, the problem merely was that we have to avoid running too > many concurring "hg pull" subprocesses during cache warm up. So simply not > calling "hg pull" if it is already running seemed an adequate solution. I also > listed some advantages above. But Maybe I miss some point. However, either way, > I don't care too much how this issue is tackeled. So if everybody is happy with > doing so in infrastructure fine with me. LGTM. Alright, I'd prefer to go with that one for the sake of progress. To clarify: What I meant is that removing the hg pull here and moving it to infrastructure _increases_ entanglement between sitescripts and infrastructure. But I suppose we shouldn't tackle that general issue on a higher level, not with this particular issue here. |