Win32 hard crash problem
Hello,
Dave Cramer and I have dealt with a company today running 8.1.4 on
Windows 2003. The application is a web app that runs via JDBC/Hibernate.
The application will function perfectly for about 2/3 weeks and then we
will receive a:
"server sent data (\"D\" message) without prior row description (\"T\"
message)");
(not escaped of course).
Subsequent connections to the database will fail (such as pgAdmin) and
Windows must be completely rebooted. I did ask if they were able to kill
the process via the task manager. Instead they opt to use the service
options and when that fails (which is always) they reboot the machine
entirely.
PostgreSQL will also not recover on its own (e.g., auto restart and roll
through the logs).
The good news is that on reboot the problem goes away for 2/3 weeks.
I have verified that they are doing all requisite routine maintenance.
I currently have the customer running hardware checks to verify validity
of the hardware but...
Any thoughts?
Sincerely,
Joshua D. Drake
--
=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/
"Joshua D. Drake" <jd@commandprompt.com> writes:
Dave Cramer and I have dealt with a company today running 8.1.4 on
Windows 2003. The application is a web app that runs via JDBC/Hibernate.
The application will function perfectly for about 2/3 weeks and then we
will receive a:
"server sent data (\"D\" message) without prior row description (\"T\"
message)");
That sounds suspiciously close to the time from boot to wraparound of
GetTickCount:
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/sysinfo/base/gettickcount.asp
M$ list this as 49 days but that's the time to wrap clear around to
zero; the value overflows and goes negative in 24.85 days if I've
done the math correctly.
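The arithmetic behind those two figures can be checked with a short sketch. It assumes only that GetTickCount reports milliseconds since boot in a 32-bit value; the helper name `days_until` is made up for illustration.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000  # 86,400,000 milliseconds in a day


def days_until(ticks: int) -> float:
    """Convert a millisecond tick count into days of uptime."""
    return ticks / MS_PER_DAY


# Read as an unsigned 32-bit value, the counter wraps to zero after 2**32 ms.
unsigned_wrap_days = days_until(2**32)

# Read as a signed 32-bit value, the same bits go negative after 2**31 ms.
signed_overflow_days = days_until(2**31)

print(f"unsigned wrap:   {unsigned_wrap_days:.3f} days")
print(f"signed overflow: {signed_overflow_days:.3f} days")
```

The first figure is the documented 49-day wrap; the second is the roughly 24.85-day point at which code that stores the tick count in a signed integer would start to see negative values.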
My bet is something depending on GetTickCount to measure elapsed time
(and no, it's not used in the core Postgres code, but you've got plenty
of other possible culprits in that stack).
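As a rough illustration of how a timer built on a signed tick count could misbehave, the sketch below simulates 32-bit arithmetic in Python. Nothing here is taken from any component in the stack under discussion: `to_int32`, `timed_out_naive` and `timed_out_wrap_safe` are hypothetical names, and the naive version is just one plausible form of the bug.

```python
def to_int32(value: int) -> int:
    """Reinterpret the low 32 bits of value as a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


def timed_out_naive(now_ticks: int, start_ticks: int, timeout_ms: int) -> bool:
    """Hypothetical buggy pattern: compare signed ticks to an absolute deadline."""
    # Assumption: the deadline is held in a wider type, so it does not wrap.
    deadline = to_int32(start_ticks) + timeout_ms
    return to_int32(now_ticks) >= deadline


def timed_out_wrap_safe(now_ticks: int, start_ticks: int, timeout_ms: int) -> bool:
    """Wrap-safe pattern: subtract modulo 2**32, then compare the elapsed time."""
    elapsed = (now_ticks - start_ticks) & 0xFFFFFFFF
    return elapsed >= timeout_ms


# A 5 second wait that starts 1 second before the signed overflow point,
# checked again 10 seconds later.
start = 2**31 - 1000
now = start + 10_000

print(timed_out_naive(now, start, 5000))      # the wait never appears to expire
print(timed_out_wrap_safe(now, start, 5000))  # the wait is correctly seen as over
```

In the naive version the current tick value has gone negative while the deadline is still a large positive number, so the wait looks unexpired indefinitely, which is the kind of hang that would only show up after about 24.85 days of uptime.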
BTW, are you sure this is coming from JDBC? I see the exact same
message text in libpq:
libpq_gettext("server sent data (\"D\" message) without prior row description (\"T\" message)\n"));
Maybe the JDBC driver uses the identical message wording but my thought
is to look for something going through libpq.
Any thoughts?
I suppose "get a real operating system" won't go over well?
regards, tom lane
Tom Lane wrote:
"Joshua D. Drake" <jd@commandprompt.com> writes:
Dave Cramer and I have dealt with a company today running 8.1.4 on
Windows 2003. The application is a web app that runs via JDBC/Hibernate.
The application will function perfectly for about 2/3 weeks and then we
will receive a:
"server sent data (\"D\" message) without prior row description (\"T\"
message)");
That sounds suspiciously close to the time from boot to wraparound of
GetTickCount:
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/sysinfo/base/gettickcount.asp
M$ list this as 49 days but that's the time to wrap clear around to
zero; the value overflows and goes negative in 24.85 days if I've
done the math correctly.
My bet is something depending on GetTickCount to measure elapsed time
(and no, it's not used in the core Postgres code, but you've got plenty
of other possible culprits in that stack).
BTW, are you sure this is coming from JDBC? I see the exact same
message text in libpq:
libpq_gettext("server sent data (\"D\" message) without prior row description (\"T\" message)\n"));
Maybe the JDBC driver uses the identical message wording but my thought
is to look for something going through libpq.
The error is server side. I was just describing the environment.
Any thoughts?
I suppose "get a real operating system" won't go over well?
Tried that, I got nervous laughter on the other end ;)
Joshua D. Drake
"Joshua D. Drake" <jd@commandprompt.com> writes:
Tom Lane wrote:
BTW, are you sure this is coming from JDBC? I see the exact same
message text in libpq:
libpq_gettext("server sent data (\"D\" message) without prior row description (\"T\" message)\n"));
Maybe the JDBC driver uses the identical message wording but my thought
is to look for something going through libpq.
The error is server side. I was just describing the environment.
I can entirely assure you that that error message is not present in the
server code.
regards, tom lane
Tom Lane wrote:
"Joshua D. Drake" <jd@commandprompt.com> writes:
Tom Lane wrote:
BTW, are you sure this is coming from JDBC? I see the exact same
message text in libpq:
libpq_gettext("server sent data (\"D\" message) without prior row description (\"T\" message)\n"));
Maybe the JDBC driver uses the identical message wording but my thought
is to look for something going through libpq.
The error is server side. I was just describing the environment.
I can entirely assure you that that error message is not present in the
server code.
Ok, let me be more clear. The message is being thrown via PostgreSQL. I am
getting it per the message I posted:
http://projects.commandprompt.com/public/pgsql/browser/trunk/pgsql/src/interfaces/libpq/fe-protocol2.c?rev=22194
http://projects.commandprompt.com/public/pgsql/browser/trunk/pgsql/src/interfaces/libpq/fe-protocol3.c?rev=25989
It is libpq and the protocol, not the backend, that is giving me the
message. When I said server, I was referring to PostgreSQL inclusively,
not the driver that was actually connecting.
Sincerely,
Joshua D. Drake
On 31-Aug-06, at 6:01 PM, Tom Lane wrote:
"Joshua D. Drake" <jd@commandprompt.com> writes:
Tom Lane wrote:
BTW, are you sure this is coming from JDBC? I see the exact same
message text in libpq:
libpq_gettext("server sent data (\"D\" message) without prior row description (\"T\" message)\n"));
Maybe the JDBC driver uses the identical message wording but my thought
is to look for something going through libpq.
The error is server side. I was just describing the environment.
I can entirely assure you that that error message is not present in the
server code.
Well that's even more interesting because it doesn't exist in the
jdbc driver either.
Dave
---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly
Dave Cramer wrote:
On 31-Aug-06, at 6:01 PM, Tom Lane wrote:
"Joshua D. Drake" <jd@commandprompt.com> writes:
Tom Lane wrote:
BTW, are you sure this is coming from JDBC? I see the exact same
message text in libpq:
libpq_gettext("server sent data (\"D\" message) without prior row description (\"T\" message)\n"));
Maybe the JDBC driver uses the identical message wording but my
thought is to look for something going through libpq.
The error is server side. I was just describing the environment.
I can entirely assure you that that error message is not present in
the server code.
Well that's even more interesting because it doesn't exist in the
jdbc driver either.
Conclusion: they are using libpq in some form, so you should investigate
that.
Is there a way to alter the tick counter, so that a test run does not
need to take the full 3 weeks?
--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
That sounds suspiciously close to the time from boot to wraparound of
GetTickCount:
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/sysinfo/base/gettickcount.asp
M$ list this as 49 days but that's the time to wrap clear around to
zero; the value overflows and goes negative in 24.85 days if I've
done the math correctly.
My bet is something depending on GetTickCount to measure elapsed time
(and no, it's not used in the core Postgres code, but you've got plenty
of other possible culprits in that stack).
This doesn't quite make sense. The only reason we have to reboot is
because PostgreSQL no longer responds. The system itself is fine.
Sincerely,
Joshua D. Drake
Alvaro Herrera wrote:
Dave Cramer wrote:
On 31-Aug-06, at 6:01 PM, Tom Lane wrote:
"Joshua D. Drake" <jd@commandprompt.com> writes:
Tom Lane wrote:
BTW, are you sure this is coming from JDBC? I see the exact same
message text in libpq:
libpq_gettext("server sent data (\"D\" message) without prior row description (\"T\" message)\n"));
Maybe the JDBC driver uses the identical message wording but my
thought is to look for something going through libpq.
The error is server side. I was just describing the environment.
I can entirely assure you that that error message is not present in
the server code.
Well that's even more interesting because it doesn't exist in the
jdbc driver either.
Conclusion: they are using libpq in some form, so you should investigate
that.
Is there a way to alter the tick counter, so that a test run does not
need to take the full 3 weeks?
Sure, it is a registry entry, so we could (in theory) shrink that quite
a bit. However, I am confused: if we don't use it, what that is
connecting to libpq would trigger it?
I know they are using pgAdmin...
Joshua D. Drake
"Joshua D. Drake" <jd@commandprompt.com> writes:
My bet is something depending on GetTickCount to measure elapsed time
(and no, it's not used in the core Postgres code, but you've got plenty
of other possible culprits in that stack).
This doesn't quite make sense. The only reason we have to reboot is
because PostgreSQL no longer responds. The system itself is fine.
The Windows kernel may still work, but that doesn't mean that everything
Postgres depends on still works. I'm wondering about (a) the TCP stack
(and that includes 3rd party firewalls and such, not only the core
Windows code); (b) timing or threading stuff inside the application
that's using libpq, which the only thing we know about so far is that
it's *not* JDBC/Hibernate.
regards, tom lane
Tom Lane wrote:
"Joshua D. Drake" <jd@commandprompt.com> writes:
My bet is something depending on GetTickCount to measure elapsed time
(and no, it's not used in the core Postgres code, but you've got plenty
of other possible culprits in that stack).
This doesn't quite make sense. The only reason we have to reboot is
because PostgreSQL no longer responds. The system itself is fine.
The Windows kernel may still work, but that doesn't mean that everything
Postgres depends on still works. I'm wondering about (a) the TCP stack
(and that includes 3rd party firewalls and such, not only the core
Windows code); (b) timing or threading stuff inside the application
that's using libpq, which the only thing we know about so far is that
it's *not* JDBC/Hibernate.
/me grumbles in a not so polite way about Windows.
Which means we need to start stripping it down. Gah, I actually argued
*for* this port too. Next time slap me.
Joshua D. Drake
"Joshua D. Drake" <jd@commandprompt.com> writes:
Which means we need to start stripping it down. Gah, I actually argued
*for* this port too. Next time slap me.
Well, before you invest a lot of time barking up what might be the wrong
tree, there is a very easy test you can use to check the GetTickCount
theory: keep closer track of time-since-boot on the affected systems.
If that idea is right, it won't be "two or three weeks" between boot and
problems appearing, it'll be 24.85 days on the nose. It shouldn't take
much except waiting to either falsify the theory or make it look pretty
convincing.
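A minimal sketch of that check, assuming the boot time and the time of the first failure are known for each incident. The function name, the one-hour tolerance and the timestamps are all invented for illustration; real values would come from the affected machines' event logs.

```python
from datetime import datetime, timedelta

# Time from boot until a signed 32-bit millisecond counter goes negative:
# 2**31 ms, which is 24 days, 20:31:23.648.
SIGNED_OVERFLOW = timedelta(milliseconds=2**31)


def matches_tick_overflow(boot: datetime, failure: datetime,
                          tolerance: timedelta = timedelta(hours=1)) -> bool:
    """True if the uptime at failure is within tolerance of 2**31 ms."""
    uptime = failure - boot
    return abs(uptime - SIGNED_OVERFLOW) <= tolerance


# Hypothetical incidents, not data from the affected systems.
boot = datetime(2006, 8, 1, 9, 0, 0)
on_the_nose = boot + timedelta(days=24, hours=20, minutes=40)
too_early = boot + timedelta(days=17)

print(matches_tick_overflow(boot, on_the_nose))  # consistent with the theory
print(matches_tick_overflow(boot, too_early))    # would falsify the theory
```

A single failure well away from the 24.85-day mark would be enough to drop the theory, while several failures clustered near it would make it look convincing.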
regards, tom lane
On 31/8/06 23:34, "Joshua D. Drake" <jd@commandprompt.com> wrote:
Sure, it is a registry entry, so we could (in theory) shrink that quite
a bit. However, I am confused: if we don't use it, what that is
connecting to libpq would trigger it?
I know they are using pgAdmin...
Are they using pgAgent? That's the only part of pgAdmin that does any
sort of timing I can think of offhand (other than the query tool timer which
only runs whilst a query is running). Even then it's done indirectly through
wxWidgets so I'm not familiar with how it's implemented at the win32 API
level.
If it were pgAdmin (or any other client) though, how would that lock up the
entire PostgreSQL instance, but not the rest of the server?
Regards, Dave.
My bet is something depending on GetTickCount to measure elapsed time
(and no, it's not used in the core Postgres code, but you've got
plenty of other possible culprits in that stack).
This doesn't quite make sense. The only reason we have to reboot is
because PostgreSQL no longer responds. The system itself is fine.
The Windows kernel may still work, but that doesn't mean that
everything Postgres depends on still works. I'm wondering about
(a) the TCP stack (and that includes 3rd party firewalls and such,
not only the core Windows code); (b) timing or threading stuff
inside the application that's using libpq, which the only thing we
know about so far is that it's *not* JDBC/Hibernate.
How about getting a simple backtrace from a couple of the stuck postgres
processes? And from the postmaster which should be accepting new
connections... Or does that also hang completely?
How to get one? Well, since we don't have the MSVC build yet (yeah,
yeah, eventually), you can only get a semi-backtrace that only looks at
exported symbols. You can get this using process explorer (thread tab,
click stack), using WinDBG or using Visual Studio (you'll need VS 2005,
and you need to check the option for "Load DLL exports" in
options->debugging->native).
Oh, btw, if there is a 3rd-party firewall on the box, the standard
recommendation of uninstalling it definitely sounds like a good plan :-)
//Magnus
Oops, going backwards through the mails it seems :)
Subsequent connections to the database will fail (such as pgAdmin)
and Windows must be completely rebooted.
Fail in what way? Hang, not connect, or get an error msg?
PostgreSQL will also not recover on its own (e.g; auto restart and
roll through the logs).
What do you mean by this? It doesn't start upon reboot? What is needed
to make it start?
//Magnus
My bet is something depending on GetTickCount to measure elapsed time
(and no, it's not used in the core Postgres code, but you've got
plenty of other possible culprits in that stack).
This doesn't quite make sense. The only reason we have to reboot is
because PostgreSQL no longer responds. The system itself is fine.
The Windows kernel may still work, but that doesn't mean that
everything Postgres depends on still works.
It may be an unresponsive listen socket. This may be because of a handle
leak. Next time it blocks, look at the handle counts (e.g. with handle.exe
from Sysinternals).
You could also look at the handle count now with Task Manager and see if it
increases constantly (handle.exe shows you the details).
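One way to judge "increases constantly" from a handful of readings is sketched below. It assumes the handle counts for the postmaster process are noted by hand from Task Manager at regular intervals; the function name, the growth threshold and the sample numbers are made up for illustration.

```python
from typing import Sequence


def looks_like_leak(samples: Sequence[int], min_growth: int = 100) -> bool:
    """True if the counts never drop and grow by at least min_growth overall."""
    if len(samples) < 2:
        return False
    never_drops = all(later >= earlier
                      for earlier, later in zip(samples, samples[1:]))
    return never_drops and samples[-1] - samples[0] >= min_growth


# Hypothetical daily readings, not measurements from the affected server.
steady_climb = [410, 530, 655, 790]
normal_jitter = [410, 395, 420, 405]

print(looks_like_leak(steady_climb))
print(looks_like_leak(normal_jitter))
```

A count that only ever climbs between restarts points toward a leak; a count that rises and falls around a stable level does not.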
Andreas
Magnus Hagander wrote:
Oops, going backwards through the mails it seems :)
Subsequent connections to the database will fail (such as pgAdmin)
and Windows must be completely rebooted.
Fail in what way. Hang, not connect, or get an error msg?
PostgreSQL will also not recover on its own (e.g; auto restart and
roll through the logs).
What do you mean by this? It doesn't start upon reboot? What is needed
to make it start?
It means that postgresql doesn't recover on its own. On linux if a
backend crashes all of PostgreSQL will restart and come back up if it can.
On Win32 it doesn't.
Joshua D. Drake
PostgreSQL will also not recover on its own (e.g; auto restart and
roll through the logs).
What do you mean by this? It doesn't start upon reboot? What is needed
to make it start?
It means that postgresql doesn't recover on its own. On linux
if a backend crashes all of PostgreSQL will restart and come
back up if it can.
On Win32 it doesn't.
Ah, I thought you meant that the database recovery process (that runs
after a crash) failed and lost data. But it's not data-loss then, it
just took a reboot to fix it?
I think we're somehow seeing a complete postmaster hang, where it's
either not able to kill off the backends as required, or just not
capable of accepting new connections after that. Which makes a
stacktrace from the postmaster the most interesting one to look at.
//Magnus
On 9/5/06, Joshua D. Drake <jd@commandprompt.com> wrote:
Magnus Hagander wrote:
What do you mean by this? It doesn't start upon reboot? What is needed
to make it start?
It means that postgresql doesn't recover on its own. On linux if a
backend crashes all of PostgreSQL will restart and come back up if it can.
On Win32 it doesn't.
It does for me, or at least it did when I used to work with Windows :).
I think it just doesn't restart for this particular type of crash. I
had a couple of similarly weird undetectable Windows problems that I
could never quite figure out until I got hired by another company and
left that monster behind for good.
merlin
Magnus Hagander wrote:
PostgreSQL will also not recover on its own (e.g; auto restart and
roll through the logs).
What do you mean by this? It doesn't start upon reboot? What is needed
to make it start?
It means that postgresql doesn't recover on its own. On linux
if a backend crashes all of PostgreSQL will restart and come
back up if it can.
On Win32 it doesn't.
Ah, I thought you meant that the database recovery process (that runs
after a crash) failed and lost data. But it's not data-loss then, it
just took a reboot to fix it?
Right, but "just took a reboot to fix it" isn't very confidence inspiring ;)
I think we're somehow seeing a complete postmaster hang, where it's
either not able to kill off the backends as required, or just not
capable of accepting new connections after that. Which makes a
stacktrace from the postmaster the most interesting one to look at.
I have asked the customer to also look in Task Manager for one
particular process that is eating CPU, and to see whether that process
can be killed. If that process can be killed and PostgreSQL comes back
clean, then that is a step.
However, debugging this beast is a pain. I take it mingw doesn't have a
gdb we can use?