|
I've been playing around with pgsql in Pike and have come across some
weird issues with what happens when the database server goes down and comes back again. It seems that either weird exceptions get thrown, or the engine gets stuck in a retry loop. Attached is a test-case (with my database's connection details sanitized out). In case it's significant, the database host is 'db' which is a DNS alias for localhost and resolves to ::1. The issue relates to a modified search_path, which is not automatically re-applied on reconnection. Theoretically, the _reconnect callback should allow me to do this. In actual fact, though, it doesn't. To achieve successful reconnection, I need two separate solutions: the callback, plus a trap on the original call that will, on error, reset the search path and retry. This seems odd. Also - not sure if it's significant or not - my reconnect callback gets called twice. The first one appears to succeed, and the db->query() call raises no error; in the second one, calling db->query() raises "Recursive mutex locks". I can simply swallow this error and return, and then an error gets raised at the original statement. Resetting the search path and retrying the query appears to work. Can anyone shed any light on what's going on here? Thanks! Chris Angelico |
|
Chris Angelico wrote:
>I've been playing around with pgsql in Pike and have come across some >weird issues with what happens when the database server goes down and >comes back again. It seems that either weird exceptions get thrown, or >the engine gets stuck in a retry loop. >The issue relates to a modified search_path, which is not >automatically re-applied on reconnection. Theoretically, the Interesting. The reconnect should re-set all the session variables, which should include your path. What method do you use to set the path in PostgreSQL? >Also - not sure if it's significant or not - my reconnect callback >gets called twice. The first one appears to succeed, and the >db->query() call raises no error; in the second one, calling >db->query() raises "Recursive mutex locks". I can simply swallow this >error and return, and then an error gets raised at the original >statement. Resetting the search path and retrying the query appears to >work. >Can anyone shed any light on what's going on here? I'll have to check if any of the fixes I still have for pgsql which are not in the mainstream relate to this problem. I'll get back on this as soon as I've synced fixes. -- Stephen. Being able to try has no purpose if failing is not an option. |
|
On Thu, Mar 29, 2012 at 1:44 AM, Stephen R. van den Berg <[hidden email]> wrote:
> Chris Angelico wrote: > Interesting. The reconnect should re-set all the session variables, > which should include your path. What method do you use to set the > path in PostgreSQL? Oh hey, glad there's some activity on this. We kinda shelved the whole matter and simply restart the Pike process any time Postgres gets restarted - not exactly using Pike optimally, but it's good enough for development. It's simply: db->query("set search_path to CHOSEN_SCHEMA,public"); Is there an alternative way of doing it? I looked through the docs and didn't see anything, but it's entirely possible that I missed it. > I'll have to check if any of the fixes I still have for pgsql which > are not in the mainstream relate to this problem. I'll get back on this > as soon as I've synced fixes. Thanks! I probably need to re-pull and build a new Pike at some point anyway. It's been a few months since I did that last. Most of the code I'm working with is, unfortunately, subject to corporate ownership and copyright, so I won't be able to share it or test against it except when I'm at work. But I'll try to do up test cases that aren't on restriction as needed. Chris Angelico |
|
Chris Angelico wrote:
>On Thu, Mar 29, 2012 at 1:44 AM, Stephen R. van den Berg <[hidden email]> wrote: >> Chris Angelico wrote: >> Interesting. ?The reconnect should re-set all the session variables, >> which should include your path. ?What method do you use to set the >> path in PostgreSQL? >It's simply: >db->query("set search_path to CHOSEN_SCHEMA,public"); >Is there an alternative way of doing it? I looked through the docs and >didn't see anything, but it's entirely possible that I missed it. No, it's fine. Basically, what should happen is, that you set it through this statement, and then PostgreSQL re-exports this variable to the driver using an out-of-band message on the wire. The driver then stores this assignment in a session related mapping, which is to be reapplied/resent to the server upon reconnection to restore the state. I'll have to verify that the database actually reports back, and that upon reconnect the variable is sent back to the server. >> I'll have to check if any of the fixes I still have for pgsql which >> are not in the mainstream relate to this problem. ?I'll get back on this >> as soon as I've synced fixes. >Thanks! I probably need to re-pull and build a new Pike at some point >anyway. It's been a few months since I did that last. Since it is going to be a bugfix, it will be in 7.9 and in 7.8. >Most of the code I'm working with is, unfortunately, subject to >corporate ownership and copyright, so I won't be able to share it or >test against it except when I'm at work. But I'll try to do up test >cases that aren't on restriction as needed. No problem. Out of curiosity, what *kind* of application is it you're using Pike for? Also of interest, what is the companysize? Since the Pike community is not very large, it is always interesting to know when it grows and in which direction :-). -- Stephen. Sex is like air. It's only a big deal if you can't get any. |
|
On Thu, Mar 29, 2012 at 5:35 PM, Stephen R. van den Berg <[hidden email]> wrote:
>>Most of the code I'm working with is, unfortunately, subject to >>corporate ownership and copyright, so I won't be able to share it or >>test against it except when I'm at work. But I'll try to do up test >>cases that aren't on restriction as needed. > > No problem. Out of curiosity, what *kind* of application is it you're > using Pike for? Also of interest, what is the companysize? > Since the Pike community is not very large, it is always interesting to > know when it grows and in which direction :-). The company is a tech startup with just a handful of staff at the moment, but a boss/owner with grand ideas :) I, uhh, kinda fell in love with Pike some years ago, brought her home to my parents, and ultimately married the girl... umm, yes, where was I? Initially, Pike usage at work was restricted to Pike's specialty: long-running servers that can reload code without losing existing connections. There are a couple of those now (not counting our web server, which is Apache with PHP scripts); some are purely internal, but I also have our external API server written in Pike - basically a MUD connection that allows for real-time information updates as things happen in our system. But since Pike is such an expressive and clear language, I've taken to using it for all sorts of internal command-line tools - anything larger than a half-dozen lines of bash script will probably end up being written in Pike. Hence the database updater script mentioned in another thread, plus a few other convenience scripts. (One that I'm going to ask my boss if I can release is a PHP/HTML modifier that puts a tag onto all references to js or css files, solving caching issues. It's designed to be run from a makefile, and is enormously handy.) We're proud to be using Pike. At least, I am, and my boss laughs a bit at my extensive plugging of the language ("You misspell a variable name in PHP, you get a Notice; you misspell one in Pike, you get a compilation error! Much better!" or "Look! I can just dereference a function's return value. Why can't I do that in every language?"), and nobody else even knows Pike...but I'm glad to use the same language at work as I use for the D&D MUD that I run. ChrisA |
|
In reply to this post by Stephen R. van den Berg
Stephen R. van den Berg wrote:
>Chris Angelico wrote: >>On Thu, Mar 29, 2012 at 1:44 AM, Stephen R. van den Berg <[hidden email]> wrote: >No, it's fine. Basically, what should happen is, that you set it through >this statement, and then PostgreSQL re-exports this variable to the >driver using an out-of-band message on the wire. The driver then stores >this assignment in a session related mapping, which is to be reapplied/resent >to the server upon reconnection to restore the state. >I'll have to verify that the database actually reports back, and that >upon reconnect the variable is sent back to the server. >>> I'll have to check if any of the fixes I still have for pgsql which >>> are not in the mainstream relate to this problem. ?I'll get back on this >>> as soon as I've synced fixes. >>Thanks! I probably need to re-pull and build a new Pike at some point >>anyway. It's been a few months since I did that last. >Since it is going to be a bugfix, it will be in 7.9 and in 7.8. Well, the fixes are in. Both in 7.8 and 7.9. It would be very convenient if you can test these fixes and see if it solves your issues already (chances are they will, there were several reconnect issues that got fixed). -- Stephen. Do more than anyone expects, and pretty soon everyone will expect more. |
|
On Wed, Apr 4, 2012 at 6:45 PM, Stephen R. van den Berg <[hidden email]> wrote:
> Well, the fixes are in. Both in 7.8 and 7.9. > It would be very convenient if you can test these fixes and see if > it solves your issues already (chances are they will, there were several > reconnect issues that got fixed). *chuckles* I've just an hour ago left work, with a long weekend ahead of me... well, this is a good excuse to get Pike+Postgres going on one of my boxes here. Or two of them, to test both 32-bit and 64-bit. I'll get back to you on that! ChrisA |
|
In reply to this post by Stephen R. van den Berg
On Wed, Apr 4, 2012 at 6:45 PM, Stephen R. van den Berg <[hidden email]> wrote:
> Well, the fixes are in. Both in 7.8 and 7.9. > It would be very convenient if you can test these fixes and see if > it solves your issues already (chances are they will, there were several > reconnect issues that got fixed). Haven't played with the 32-bit build as yet, as I don't have Postgres on any 32-bit box. Report from my 64-bit testbox, however, isn't good. I'm using code similar to pgtest.pike that I attached to the original post (ie, pretty much the same as we're using in the live system, only with different username/password etc). For this test, I'm doing everything in Hilfe. Database is postgres_9.1.2-1.amd64.openscg.deb if that helps anything. Pike v7.9 release 5 running Hilfe v3.5 (Incremental Pike Frontend) > object db=Sql.Sql("pgsql://fred:barney@localhost/foobar"); > db->query("set search_path to sch1,public"); (1) Result: ({ }) > db->query("select * from test"); (2) Result: ({ /* 3 elements */ ... data exactly as per what's in the table ... }) > db->query("select * from test"); After the successful query, I restarted Postgres from another terminal, let it settle down, and then simply up-entered in Hilfe. System started spinning, with Pike and Postgres taking up the CPU between them. Ctrl-C in Pike cleaned everything up. This test appears to be 100% repeatable. Adding in the reconnect callback: db->master_sql->set_notify_callback("_reconnect", lambda(int,string,string) { write("RECON\n"); write("Ex: %O\n",catch {db->query("set search_path to sch1,public");}); write("Done\n"); },1); produces this output: > db->query("select * from test"); RECON Ex: 0 Done RECON Ex: 0 Done FATAL 57P01: terminating connection due to administrator command (postgres.c:ProcessInterrupts:2894) Reconnected to database fd:10 TCP/IP localhost:5432 PID 20574 Reconnected to database fd:10 TCP/IP localhost:5432 PID 20583 /usr/local/pike/7.9.5/lib/modules/Sql.pmod/pgsql.pike:1920: pgsql(fred@localhost:5432/foobar,20583)->big_query("select * from test",UNDEFINED,UNDEFINED) /usr/local/pike/7.9.5/lib/modules/Sql.pmod/Sql.pike:646: Sql.Sql()->query("select * from test") > db->query("select * from test"); (5) Result: ({ /* 3 elements */ ... third time lucky, we get data ... }) I'm not entirely sure what's going on here. The first query after the server restart results in the reconnect hook being called twice, with no exceptions being thrown in the setting of search_path; the next query after that succeeds. Repeating the exercise results in exceptions being thrown of the form "ERROR 26000: prepared statement \"pike_prep_4\" does not exist", but again, the second query after a reconnect succeeds just fine. Presuming that this is working on your system, and given that it's failing in the same sort of way on each of mine, I'm thinking it's something I'm doing wrong. Any pointers as to what to look for? This test box has a simple default config for Postgres, I haven't tinkered at all. Chris Angelico |
|
Chris Angelico wrote:
>I'm using code similar to pgtest.pike that I attached to the original >post (ie, pretty much the same as we're using in the live system, only >with different username/password etc). For this test, I'm doing >everything in Hilfe. Database is postgres_9.1.2-1.amd64.openscg.deb if >that helps anything. That should not matter. The Postgres network protocol is version agnostic. >After the successful query, I restarted Postgres from another >terminal, let it settle down, and then simply up-entered in Hilfe. >System started spinning, with Pike and Postgres taking up the CPU >between them. Ctrl-C in Pike cleaned everything up. This test appears >to be 100% repeatable. >Adding in the reconnect callback: >FATAL 57P01: terminating connection due to administrator command > (postgres.c:ProcessInterrupts:2894) >Reconnected to database fd:10 TCP/IP localhost:5432 PID 20574 >Reconnected to database fd:10 TCP/IP localhost:5432 PID 20583 That is, at least, one reconnect too many. >query after that succeeds. Repeating the exercise results in >exceptions being thrown of the form "ERROR 26000: prepared statement >\"pike_prep_4\" does not exist", but again, the second query after a That would indicate that the prepared statement cache of the pgsql driver is not properly flushed after the reconnect. >Presuming that this is working on your system, and given that it's >failing in the same sort of way on each of mine, I'm thinking it's >something I'm doing wrong. Any pointers as to what to look for? This Well, since these tests always involve restarting a postgres server and needing enough context to reproduce it, I admit I don't replay every testcase because it involves setting up a spare database every time. Given the fact that it still fails your case, I'll first have a close look at the code once more to see if there is a state-machine flaw in there somewhere; if I can't find that, I'll reproduce your setup. -- Stephen. Do more than anyone expects, and pretty soon everyone will expect more. |
|
On Thu, Apr 5, 2012 at 12:05 AM, Stephen R. van den Berg <[hidden email]> wrote:
>>Presuming that this is working on your system, and given that it's >>failing in the same sort of way on each of mine, I'm thinking it's >>something I'm doing wrong. Any pointers as to what to look for? This > > Well, since these tests always involve restarting a postgres server > and needing enough context to reproduce it, I admit I don't replay every > testcase because it involves setting up a spare database every time. Understandable! I've poked around in the code a bit and haven't found a specific problem, but I have found a couple of things that "look odd". In reconnect(), at line 1052-1053, the preparedname entries are cleared out. This is NOT done if _c is 0. My understanding is that prepared statements on the server are wiped by the reconnect, and so the preparedname is removed to force them to be re-prepared on next use. If so, then possibly the foreach/m_delete loop needs to be moved above the if. Alternatively or as well: At line 1695, a name is given to a prepared statement. This however happens only if the statement was already found in the cache; this seems to mean that statements get their names only when they've been used a second time. Is this intentional? If not, then adding a similar assignment in the else block at line 1710ish will change this. I've tried each of these changes singly, and both together, but they don't solve the problem. Maybe it's completely the wrong track, but hopefully you'll be able to eyeball the ideas more efficiently than I can! Much appreciate your help on this. Chris Angelico |
|
Chris Angelico wrote:
>On Thu, Apr 5, 2012 at 12:05 AM, Stephen R. van den Berg <[hidden email]> wrote: >>>Presuming that this is working on your system, and given that it's >>>failing in the same sort of way on each of mine, I'm thinking it's >>>something I'm doing wrong. Any pointers as to what to look for? This >> Well, since these tests always involve restarting a postgres server >> and needing enough context to reproduce it, I admit I don't replay every >> testcase because it involves setting up a spare database every time. >In reconnect(), at line 1052-1053, the preparedname entries are >cleared out. This is NOT done if _c is 0. My understanding is that >prepared statements on the server are wiped by the reconnect, and so >the preparedname is removed to force them to be re-prepared on next >use. If so, then possibly the foreach/m_delete loop needs to be moved >above the if. If there is no connection, the cache should already have been cleared. >Alternatively or as well: At line 1695, a name is given to a prepared >statement. This however happens only if the statement was already >found in the cache; this seems to mean that statements get their names >only when they've been used a second time. Is this intentional? If >not, then adding a similar assignment in the else block at line >1710ish will change this. The cache is primed with an unnamed statement, so it will get its name on the first round. Well, I did some more testing and in the process tuned a bit here and there. The results were: - Your reconnect problem should be fixed now (fingers crossed). - There is a _text mode now for the pgsql module, to allow things like "BEGIN; UPDATE...; DELETE...; COMMIT" to be sent to the database in one go. - ping() and is_open() support is in now. - (Small) efficiency boost for simple statements in binary mode. Please test. If it tests ok for you, I'll reapply them to 7.8. -- Stephen. "Since light travels faster than sound, doesn't this explain why some people appear bright until you hear them speak?" |
|
On Tue, Apr 10, 2012 at 6:47 PM, Stephen R. van den Berg <[hidden email]> wrote:
> Well, I did some more testing and in the process tuned a bit here and > there. > The results were: > - Your reconnect problem should be fixed now (fingers crossed). My issue involves search_path, which according to your commit message on 6b81f4 doesn't work. But it can be made to! Woohoo!!! :) There's a small issue with the reconnection attempting to set some parameters that cannot be set; the attached patch solves this by adding to the blacklist. The only other change needed is that instead of setting the search path through SQL, I simply set it in the Sql.Sql constructor: db=Sql.Sql("pgsql://USER:PASSWORD@HOST/DATABASE",(["search_path":"SCHEMA,public"])); And then everything works!! Thank you ever so much for your help on this. Whee!! There's one other issue that I came across in testing. I think it's unrelated, but the search_path issue tended to trigger it. If, on reconnection, the statement cannot be successfully prepared, the connection is severed and restarted silently, and the prepare reattempted. This results in an infinite loop (spinning at full CPU usage) if the statement has an error - for instance, if the table has been dropped. This only occurs if it's the first statement after the reconnection. Not going to affect us, and I'd have to call that a corner case. ChrisA |
|
In reply to this post by Stephen R. van den Berg
Stephen R. van den Berg wrote:
>Please test. If it tests ok for you, I'll reapply them to 7.8. Incidentally: - The search_path still needs to be reset by hand, because PostgreSQL seems to forget to reexport the information to us; I guess this needs to be reported to Postgres. - If one uses the _text query, keep in mind that except for the result(set) of the first command, all results will be discarded. I'm not quite sure how to fix this (if at all). -- Stephen. "Since light travels faster than sound, doesn't this explain why some people appear bright until you hear them speak?" |
|
On Tue, Apr 10, 2012 at 8:10 PM, Stephen R. van den Berg <[hidden email]> wrote:
> - The search_path still needs to be reset by hand, because PostgreSQL seems > to forget to reexport the information to us; I guess this needs to be > reported to Postgres. Perhaps. But it works fine if listed as part of connection options (the second argument to Sql.Sql()). Getting the search_path back from Postgres is thus an optional benefit that would allow a change of path post-connection. Incidentally, I'm no longer using or needing the _reconnect hook. Chris Angelico |
|
In reply to this post by Chris Angelico
Chris Angelico wrote:
>On Tue, Apr 10, 2012 at 6:47 PM, Stephen R. van den Berg <[hidden email]> wrote: >> Well, I did some more testing and in the process tuned a bit here and >> there. >> The results were: >> - Your reconnect problem should be fixed now (fingers crossed). >My issue involves search_path, which according to your commit message >on 6b81f4 doesn't work. But it can be made to! Woohoo!!! :) >There's a small issue with the reconnection attempting to set some >parameters that cannot be set; the attached patch solves this by >adding to the blacklist. Ok, thanks, applied (to my own, will push it out in a minute). >The only other change needed is that instead of setting the search >path through SQL, I simply set it in the Sql.Sql constructor: >db=Sql.Sql("pgsql://USER:PASSWORD@HOST/DATABASE",(["search_path":"SCHEMA,public"])); Yes, this works. >There's one other issue that I came across in testing. I think it's >unrelated, but the search_path issue tended to trigger it. If, on >reconnection, the statement cannot be successfully prepared, the >connection is severed and restarted silently, and the prepare >reattempted. This results in an infinite loop (spinning at full CPU >usage) if the statement has an error - for instance, if the table has >been dropped. This only occurs if it's the first statement after the >reconnection. Not going to affect us, and I'd have to call that a >corner case. Hmmm..., this is indeed unrelated, but nonetheless it is undesirable. I'll see if I can find the reason it doesn't get out of this infinite tailspin. -- Stephen. "Since light travels faster than sound, doesn't this explain why some people appear bright until you hear them speak?" |
|
In reply to this post by Chris Angelico
Chris Angelico wrote:
>On Tue, Apr 10, 2012 at 8:10 PM, Stephen R. van den Berg <[hidden email]> wrote: >> - The search_path still needs to be reset by hand, because PostgreSQL seems >> ?to forget to reexport the information to us; I guess this needs to be >> ?reported to Postgres. >Perhaps. But it works fine if listed as part of connection options >(the second argument to Sql.Sql()). Getting the search_path back from >Postgres is thus an optional benefit that would allow a change of path >post-connection. Quite. According to Postgres docs, this is what should have happened. >Incidentally, I'm no longer using or needing the _reconnect hook. Obviously. Incidentally, your patch is in, plus a fix for the infinite reconnect loop. Please retest, if possible. If all is well, I'll reapply the lot to 7.8. -- Stephen. "Since light travels faster than sound, doesn't this explain why some people appear bright until you hear them speak?" |
|
On Tue, Apr 10, 2012 at 9:07 PM, Stephen R. van den Berg <[hidden email]> wrote:
> Incidentally, your patch is in, plus a fix for the infinite reconnect > loop. Please retest, if possible. > If all is well, I'll reapply the lot to 7.8. Yep! All working with my mini test harness on my test box. I'll pull the changes at work first thing tomorrow, tweak my code to set the search_path in options, and confirm that it works in the live environment, but I doubt it'll be any different from the test. Thanks again! ChrisA |
|
On Tue, Apr 10, 2012 at 9:41 PM, Chris Angelico <[hidden email]> wrote:
> On Tue, Apr 10, 2012 at 9:07 PM, Stephen R. van den Berg <[hidden email]> wrote: >> Incidentally, your patch is in, plus a fix for the infinite reconnect >> loop. Please retest, if possible. >> If all is well, I'll reapply the lot to 7.8. > > Yep! All working with my mini test harness on my test box. I'll pull > the changes at work first thing tomorrow, tweak my code to set the > search_path in options, and confirm that it works in the live > environment, but I doubt it'll be any different from the test. Confirmed working with live code, with one caveat (which is also true of the testbox I believe, but I never tested this): If the connection doesn't come good within the phasedreconnect time, an exception is thrown, and the db connection is permanently dead. Error in query: FATAL 57P01: terminating connection due to administrator command (postgres.c:ProcessInterrupts:2894) Couldn't connect to database on db:5432 Enter table name [q to quit]: ------------------------------------------------------------------------------- Error in query: Cannot index the NULL value with "setportal". backtrace_frame(/usr/local/pike/7.9.5/lib/master.pike:4031, _main(), Args: 1), backtrace_frame(-:1, `()(), Args: 1), backtrace_frame(client.pike:255, loadnodes(), No args), backtrace_frame(/usr/local/pike/7.9.5/lib/modules/Sql.pmod/Sql.pike:646, query(), Args: 1), backtrace_frame(/usr/local/pike/7.9.5/lib/modules/Sql.pmod/pgsql.pike:1829, big_query(), Args: 3), backtrace_frame(/usr/local/pike/7.9.5/lib/modules/Sql.pmod/pgsql_util.pmod:285, create(), Args: 7), backtrace_frame(/usr/local/pike/7.9.5/lib/modules/Sql.pmod/pgsql_util.pmod:399, steallock(), No args) ChrisA |
|
Chris Angelico wrote:
>Confirmed working with live code, with one caveat (which is also true >of the testbox I believe, but I never tested this): If the connection >doesn't come good within the phasedreconnect time, an exception is >thrown, and the db connection is permanently dead. That is by design. I mimiced the MySQL driver behaviour here; that tries an immediate reconnect and then dies if that fails. I try to give some headroom by waiting a little while longer and retrying. If this should become an infinite retry loop, that can be accomodated. >Error in query: FATAL 57P01: terminating connection due to administrator command > (postgres.c:ProcessInterrupts:2894) >Couldn't connect to database on db:5432 >Error in query: Cannot index the NULL value with "setportal". This last error looks a bit unclean. But it's just cosmetic. -- Stephen. While money can't buy happiness, it certainly lets you choose your own form of misery. |
|
On Wed, Apr 11, 2012 at 10:42 PM, Stephen R. van den Berg <[hidden email]> wrote:
> Chris Angelico wrote: >>Confirmed working with live code, with one caveat (which is also true >>of the testbox I believe, but I never tested this): If the connection >>doesn't come good within the phasedreconnect time, an exception is >>thrown, and the db connection is permanently dead. > > That is by design. > I mimiced the MySQL driver behaviour here; that tries an immediate reconnect > and then dies if that fails. > I try to give some headroom by waiting a little while longer and retrying. > > If this should become an infinite retry loop, that can be accomodated. Understandable, and a perfectly good design. Is it possible to set a max retries option on creation (maybe with -1 meaning infinite)? Otherwise, I'll probably end up whipping up a wrapper around db->query that catches any exceptions, checks for the conn-fail, and if not, rethrows. ChrisA |
| Powered by Nabble | See how NAML generates this page |
