Winsock error 10035 while trying to upgrade from 8.0 to 8.2
I'm trying to upgrade a fairly large database (60 GB) from postgres 8.0 to
postgres 8.2 on Windows 2000 Server (both versions running on the same machine
on different ports). During the migration I always get an error at
some point (never the same one):
LOG: could not receive data from client: Unknown winsock error 10035
which is followed by
LOG: incomplete message from client
ERROR: unexpected EOF on a client connexion
FATAL: invalid frontend message type 53 psql -U postgres -p 5433
After moving the 8.2 postgres instance to a Windows XP Pro machine, the
migration is successful.
I've searched Google but didn't find anything related to postgres.
cyril
Source database : postgres 8.0.9
Destination database : postgres 8.2.4
OS : Windows 2000 Server SP4
The migration is run from Linux using the 8.2.4 binaries (since piping pg_dump
output on Windows stops at the first Ctrl-Z) with
"pg_dump -h YYY XXX | psql -h YYY -p 5433 XXX"
Cyril VELTER wrote:
> I'm trying to upgrade a pretty big database (60G) from postgres 8.0 to
> postgres 8.2 on windows 2000 Server (both versions running on the same machine
> on different ports). During the migration process, I always get an error at
> some point (never the same) :
Interesting. 10035 is "A non-blocking socket operation could not be
completed immediately".
Question: Does this error come from the 8.0 or the 8.2 server?
Also, do you use SSL?
//Magnus
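For readers following along: 10035 is the numeric value of WSAEWOULDBLOCK, which Winsock defines as its base error value (10000) plus 35. The sketch below only illustrates that arithmetic and the distinction between a "try again" code and a hard failure. The DEMO_ names and both helper functions are hypothetical and exist only so the example compiles without the Windows headers; real code would use the macros from <winsock2.h>.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Winsock error values, copied here under hypothetical DEMO_ names so the
 * sketch compiles without <winsock2.h>. On Windows, use the real macros.
 */
#define DEMO_WSABASEERR     10000
#define DEMO_WSAEWOULDBLOCK (DEMO_WSABASEERR + 35)   /* 10035 */
#define DEMO_WSAECONNRESET  (DEMO_WSABASEERR + 54)   /* 10054 */

/* Returns 1 if the code only means "no data yet, try again later". */
static int demo_is_transient(int wsa_error)
{
    return wsa_error == DEMO_WSAEWOULDBLOCK;
}

/* Returns a short label for the codes above, or NULL for anything else. */
static const char *demo_error_name(int wsa_error)
{
    switch (wsa_error)
    {
        case DEMO_WSAEWOULDBLOCK: return "WSAEWOULDBLOCK";
        case DEMO_WSAECONNRESET:  return "WSAECONNRESET";
        default:                  return NULL;
    }
}
```

Seen this way, the log line reports a condition that is normally transient on a non-blocking socket, which is why the question of where it surfaces in the server matters more than the code itself.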
magnus@hagander.net wrote:
> Interesting. 10035 is "A non-blocking socket operation could not be
> completed immediately".
> Question: Does this error come from the 8.0 or the 8.2 server?
It comes from the 8.2 server's message log.
> Also, do you use SSL?
No, I'm not. It's not even compiled into the server or the pg_dump binary.
The server is built on Windows using MSYS, simply with ./configure && make all
&& make install
I've been able to reproduce the problem 6 times (at random points in the
process; it never completes successfully). Is there any test I can do to
help investigate the problem?
cyril
Cyril VELTER wrote:
> No, I'm not. It's not even compiled into the server or the pg_dump binary.
> The server is built on Windows using MSYS, simply with ./configure && make all
> && make install
> I've been able to reproduce the problem 6 times (at random points in the
> process; it never completes successfully). Is there any test I can do to
> help investigate the problem?
Sorry I haven't gotten back to you for a while.
Yeah, if you can attach a debugger to the backend (assuming you have a
predictable backend it happens to - but if you're loading, you are using
a single session, I assume?), add a breakpoint around the area of the
problem, and get a backtrace from exactly where it shows up, that would
help.
//Magnus
Magnus Hagander wrote:
> Sorry I haven't gotten back to you for a while.
> Yeah, if you can attach a debugger to the backend (assuming you have a
> predictable backend it happens to - but if you're loading, you are using
> a single session, I assume?), add a breakpoint around the area of the
> problem, and get a backtrace from exactly where it shows up, that would
> help.
Thanks for your reply. I'll try to do this. I've installed gdb on the
problematic machine and recompiled postgres with debug symbols (configure
--enable-debug).
I'm not very familiar with gdb. Could you give me some direction on setting the
breakpoint? After running gdb on the postgres.exe file, I'm not able to set the
breakpoint (b socket.c:574 gives me an error).
Searching the source files, it seems the error message is generated in
port/win32/socket.c, line 594.
Thanks,
cyril
Cyril VELTER wrote:
> Thanks for your reply. I'll try to do this. I've installed gdb on the
> problematic machine and recompiled postgres with debug symbols (configure
> --enable-debug)
> I'm not very familiar with gdb. Could you give some direction on setting the
> breakpoint. After running gdb on the postgres.exe file, I'm not able to set the
> breakpoint (b socket.c:574 give me an error).
Hmm, I keep forgetting that. There is some serious black magic required
to get gdb to even approach working state on win32. I'm too used to
working with the msvc build now. I've never actually got it working
myself, but I know others have. Hopefully someone can speak up here? :-)
> Searching the source files, it seems the error message is generated in
> port/win32/socket.c line 594.
Right, but the important thing is which path down to that function it is
generated in. Which is why a backtrace would help.
Looking at the code, the problem is probably somewhere in
pgwin32_recv(). Now, it really shouldn't end up doing what you're
seeing, but obviously it is.
Perhaps we just need to have it retry if it gets the WSAEWOULDBLOCK?
Thoughts?
//Magnus
Magnus Hagander wrote:
> Hmm, I keep forgetting that. There is some serious black magic required
> to get gdb to even approach working state on win32. I'm too used to
> working with the msvc build now. I've never actually got it working
> myself, but I know others have. Hopefully someone can speak up here? :-)
I don't have msvc available.
> > Searching the source files, it seems the error message is generated in
> > port/win32/socket.c line 594.
> Right, but the important thing is which path down to that function is it
> generated in. Which is why a backtrace would help.
Yes, I understand that.
> Looking at the code, the problem is probably somewhere in
> pgwin32_recv(). Now, it really shouldn't end up doing what you're
> seeing, but obviously it is.
After looking at the code of pgwin32_recv(), I don't understand why
pgwin32_waitforsinglesocket() is called with the FD_ACCEPT argument.
> Perhaps we just need to have it retry if it gets the WSAEWOULDBLOCK?
> Thoughts?
I've modified pgwin32_recv() to do that (repeat the
pgwin32_waitforsinglesocket() / WSARecv sequence while the error is
WSAEWOULDBLOCK, without raising the error). An upgrade is running right now
(I will have the result in the next few hours).
cyril
Cyril VELTER wrote:
> I've modified pgwin32_recv() to do that (repeat the
> pgwin32_waitforsinglesocket() / WSARecv while the error is WSAEWOULDBLOCK and
> not raising this error. I've an upgrade running right now (I will have the
> result in the next hours).
Replying to myself: the upgrade is not finished yet, but I can confirm that
there are cases where pgwin32_waitforsinglesocket() returns and the WSARecv
immediately fails. I've modified the end of pgwin32_recv():
    /* No error, zero bytes (win2000+) or error+WSAEWOULDBLOCK (<=nt4) */
    for (;;)
    {
        /* Wait until the socket signals one of the requested events. */
        if (pgwin32_waitforsinglesocket(s, FD_READ | FD_CLOSE | FD_ACCEPT,
                                        INFINITE) == 0)
            return -1;

        r = WSARecv(s, &wbuf, 1, &b, &flags, NULL, NULL);
        if (r == SOCKET_ERROR)
        {
            /* Debug trace: WSARecv failed even though the wait returned. */
            printf("SOCKERROR");
            if (WSAGetLastError() != WSAEWOULDBLOCK)
            {
                TranslateSocketError();
                return -1;
            }
            /* WSAEWOULDBLOCK: loop around and wait again. */
        }
        else
        {
            return b;
        }
    }
The printf("SOCKERROR") line has been hit twice.
Any thoughts?
Once this upgrade is finished, I will try again with FD_ACCEPT removed from
the pgwin32_waitforsinglesocket() call.
cyril
Cyril VELTER wrote:
> Replying to myself, the upgrade is not finished yet, but I can confirm that
> there are cases where pgwin32_waitforsinglesocket() returns and the WSARecv
> immediately fails. I've modified the end of pgwin32_recv() :
> [...]
> The printf("SOCKERROR") line has been hit two times.
> Any thoughts?
> Once this upgrade is finished, I will make another try removing FD_ACCEPT from
> the pgwin32_waitforsinglesocket() call.
Hmm. That really isn't supposed to happen, but seems it is. Does it work
when you add that loop, though? Spits out the message and works, or does
it spit out the message and still not work?
I'm also a bit worried about it getting caught in a tight loop if the
error codes are wrong, but probably it just goes back into waitfor.. and
blocks the second time. Otherwise, you'd see screenfuls of that message.
Can you determine if it was hit two times right after each other, or if
there was time between them?
//Magnus
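One way to answer the "right after each other, or with time between them" question is to record a timestamp on every hit instead of printing a bare string. The following is a minimal sketch of that bookkeeping, assuming one-second resolution is good enough. The struct and both functions are hypothetical helpers, not part of PostgreSQL, and the time is passed in as a parameter so the grouping logic can be checked without a live socket.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical bookkeeping for would-block hits; not part of PostgreSQL. */
struct demo_hit_log
{
    int    total;      /* hits recorded so far */
    int    bursts;     /* groups of hits separated by more than gap_secs */
    time_t last;       /* time of the previous hit */
    double gap_secs;   /* hits at most this far apart count as one burst */
};

static void demo_hit_log_init(struct demo_hit_log *hits, double gap_secs)
{
    assert(hits != NULL);
    hits->total = 0;
    hits->bursts = 0;
    hits->last = (time_t) 0;
    hits->gap_secs = gap_secs;
}

/* Record one hit at time `now`; returns 1 if it starts a new burst, else 0. */
static int demo_hit_log_record(struct demo_hit_log *hits, time_t now)
{
    int new_burst;

    assert(hits != NULL);
    new_burst = (hits->total == 0) ||
                (difftime(now, hits->last) > hits->gap_secs);
    hits->total++;
    if (new_burst)
        hits->bursts++;
    hits->last = now;
    return new_burst;
}
```

In the modified loop, a call such as demo_hit_log_record(&hits, time(NULL)) would take the place of the printf. Many hits inside a single burst would point to a tight loop, while one hit per burst would suggest the second wait really does block.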
Magnus Hagander wrote:
> Hmm. That really isn't supposed to happen, but seems it is. Does it work
> when you add that loop, though? Spits out the message and works, or does
> it spit out the message and still not work?
OK, I have the results of my tests:
With the previous code, the message "SOCKERROR" is printed 5 times during the
whole process (100 GB dump import with psql). There is one group of three and
one group of two, but I don't have timestamps and am not sure whether they
were printed in the same loop or not. The import finally succeeds.
For the second test I removed FD_ACCEPT. The message is printed only once,
but it still happens. The import is also successful.
> I'm also a bit worried about it getting caught in a tight loop if the
> error codes are wrong, but probably it just goes back into waitfor.. and
> blocks the second time. Otherwise, you'd see screenfuls of that message.
> Can you determine if it was hit two times right after each other, or if
> there was time between them?
For the first test I don't know the amount of time between them (the two
groups are separated in the logs by other messages).
What do you think? It may be a bug in the Windows Server installation I have
(this machine has not been updated for some time; perhaps I should try to do
that and see if the problem is still there). In the long run, I plan to upgrade
to Windows 2003.
cyril
Cyril VELTER wrote:
> OK, I have the results of my tests:
> With the previous code, the message "SOCKERROR" is printed 5 times during the
> whole process (100 GB dump import with psql). There is one group of three and
> one group of two, but I don't have timestamps and am not sure whether they
> were printed in the same loop or not. The import finally succeeds.
Ok.
> For the second test I removed FD_ACCEPT. The message is printed only once,
> but it still happens. The import is also successful.
Ok. So FD_ACCEPT is not the fix. Good, I didn't think it would be.
> > I'm also a bit worried about it getting caught in a tight loop if the
> > error codes are wrong, but probably it just goes back into waitfor.. and
> > blocks the second time. Otherwise, you'd see screenfuls of that message.
> > Can you determine if it was hit two times right after each other, or if
> > there was time between them?
> For the first test I don't know the amount of time between them (the two
> groups are separated in the logs by other messages).
Ok. I'm thinking of just sticking a minimal wait in there to protect
against absolute runaway, but that should be enough I think.
> What do you think? It may be a bug in the Windows Server installation I have
> (this machine has not been updated for some time; perhaps I should try to do
> that and see if the problem is still there). In the long run, I plan to
> upgrade to Windows 2003.
I don't *think* it should be a bug with your version, it doesn't look
like it. But if you're not on the latest service pack, that's certainly
possible. Please update to the latest service pack + updates from Windows
Update / WSUS, and let me know if the problem persists.
Meanwhile, I'll try to cook up a patch.
//Magnus
From: magnus@hagander.net
> I don't *think* it should be a bug with your version, it doesn't look
> like it. But if you're not on the latest service pack, that's certainly
> possible. Please update to the latest service pack + updates from Windows
> Update / WSUS, and let me know if the problem persists.
I AM on the latest service pack (on 2k it would be VERY OLD otherwise), but I
only run Windows Update about once a year. I'll schedule an update
in the next few weeks and keep you informed about the results.
> Meanwhile, I'll try to cook up a patch.
Thanks for your help,
cyril
On Tue, May 29, 2007 at 11:25:30PM +0200, Magnus Hagander wrote:
> Meanwhile, I'll try to cook up a patch.
I have applied a patch for this to HEAD and 8.2. It includes a small wait
so we don't hit it too hard, and a limit of 5 retries before we simply give
up, so we don't end up in an infinite loop.
//Magnus
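The shape of the fix described above (retry on would-block, pause briefly between attempts, give up after 5 retries) can be sketched in portable C. This is not the committed patch. It is a simulation under stated assumptions: DEMO_WOULD_BLOCK stands in for the SOCKET_ERROR plus WSAEWOULDBLOCK case, the callback types are hypothetical, and the fake socket exists only so the control flow can be exercised without Winsock.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_WOULD_BLOCK (-2)  /* hypothetical stand-in for WSAEWOULDBLOCK */
#define DEMO_MAX_RETRIES 5     /* the retry limit mentioned in the message */

/* Hypothetical callbacks: one receive attempt, and one short pause. */
typedef int  (*demo_recv_fn)(void *ctx);
typedef void (*demo_wait_fn)(void *ctx);

/*
 * Call try_recv until it returns something other than DEMO_WOULD_BLOCK,
 * pausing between attempts. Makes one initial attempt plus at most
 * DEMO_MAX_RETRIES retries. Returns the receive result, or -1 if every
 * attempt would have blocked.
 */
static int demo_recv_with_retry(demo_recv_fn try_recv, demo_wait_fn pause,
                                void *ctx)
{
    int attempt;

    assert(try_recv != NULL && pause != NULL);

    for (attempt = 0; attempt <= DEMO_MAX_RETRIES; attempt++)
    {
        int r = try_recv(ctx);

        if (r != DEMO_WOULD_BLOCK)
            return r;          /* data, EOF, or a hard error */
        if (attempt < DEMO_MAX_RETRIES)
            pause(ctx);        /* small wait so the loop cannot spin */
    }
    return -1;                 /* retries exhausted */
}

/* Fake socket: would-block `failures_left` times, then yields 42 bytes. */
struct demo_fake_socket
{
    int failures_left;
    int pauses;
};

static int demo_fake_recv(void *ctx)
{
    struct demo_fake_socket *s = ctx;

    if (s->failures_left > 0)
    {
        s->failures_left--;
        return DEMO_WOULD_BLOCK;
    }
    return 42;
}

static void demo_fake_pause(void *ctx)
{
    struct demo_fake_socket *s = ctx;

    s->pauses++;               /* a real version would sleep briefly here */
}
```

With a fake socket that would-blocks three times, the call returns 42 after three pauses. With one that would-blocks six times, it returns -1 after five pauses, which is the bounded behaviour that rules out an infinite loop.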