[Win32] Problem with rename()

Started by Peter Brantalmost 20 years ago22 messagesbugs
Jump to latest
#1Peter Brant
Peter.Brant@wicourts.gov

Hi all,

In the last couple of days, we've been bitten (a couple of times, on
different servers) by an apparent glitch or bad interaction in the
Windows implementation of rename().

The relevant log message is:

[2006-04-17 16:49:22.583 ] 2252 LOG: could not rename file
"pg_xlog/000000010000010A000000BD" to
"pg_xlog/000000010000010A000000D7", continuing to try

It apparently just keeps on looping indefinitely. The "completed
rename" message from port/dirmod.c never shows up.

Shortly thereafter, Postgres becomes unresponsive. Attempts to make a
new connection just block. Autovacuums block. A "pg_ctl ... stop -m
fast" doesn't work. Only "pg_ctl ... stop -m immediate" does.

With the last occurrence, I saved off the output of "handle -a" and
"pslist -x" in case that's helpful.

Any thoughts on what might be going wrong? If it happens again, what
other clues should I be looking for?

Pete

#2Bruce Momjian
bruce@momjian.us
In reply to: Peter Brant (#1)
Re: [Win32] Problem with rename()

Peter Brant wrote:

Hi all,

In the last couple of days, we've been bitten (a couple of times, on
different servers) by an apparent glitch or bad interaction in the
Windows implementation of rename().

The relevant log message is:

[2006-04-17 16:49:22.583 ] 2252 LOG: could not rename file
"pg_xlog/000000010000010A000000BD" to
"pg_xlog/000000010000010A000000D7", continuing to try

It apparently just keeps on looping indefinitely. The "completed
rename" message from port/dirmod.c never shows up.

Shortly thereafter, Postgres becomes unresponsive. Attempts to make a
new connection just block. Autovacuums block. A "pg_ctl ... stop -m
fast" doesn't work. Only "pg_ctl ... stop -m immediate" does.

With the last occurrence, I saved off the output of "handle -a" and
"pslist -x" in case that's helpful.

Any thoughts on what might be going wrong? If it happens again, what
other clues should I be looking for?

Yes, comment I added to dirmod.c give a hint:

/*
* We need these loops because even though PostgreSQL uses flags that
* allow rename while the file is open, other applications might have
* these files open without those flags.
*/

so someone else has the file opened, but didn't use the required flags.
As to what could have it open, I don't know.

--
Bruce Momjian http://candle.pha.pa.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#3Qingqing Zhou
zhouqq@cs.toronto.edu
In reply to: Peter Brant (#1)
Re: [Win32] Problem with rename()

""Peter Brant"" <Peter.Brant@wicourts.gov>

In the last couple of days, we've been bitten (a couple of times, on
different servers) by an apparent glitch or bad interaction in the
Windows implementation of rename().

The relevant log message is:

[2006-04-17 16:49:22.583 ] 2252 LOG: could not rename file
"pg_xlog/000000010000010A000000BD" to
"pg_xlog/000000010000010A000000D7", continuing to try

It apparently just keeps on looping indefinitely. The "completed
rename" message from port/dirmod.c never shows up.

Similar problems have been reported before -- which PG version and do you
have any anti-virus software installed?

Regards,
Qingqing

#4Magnus Hagander
magnus@hagander.net
In reply to: Qingqing Zhou (#3)
Re: [Win32] Problem with rename()

Hi all,

In the last couple of days, we've been bitten (a couple of
times, on different servers) by an apparent glitch or bad
interaction in the Windows implementation of rename().

The relevant log message is:

[2006-04-17 16:49:22.583 ] 2252 LOG: could not rename file
"pg_xlog/000000010000010A000000BD" to
"pg_xlog/000000010000010A000000D7", continuing to try

It apparently just keeps on looping indefinitely. The
"completed rename" message from port/dirmod.c never shows up.

Shortly thereafter, Postgres becomes unresponsive. Attempts
to make a new connection just block. Autovacuums block. A
"pg_ctl ... stop -m fast" doesn't work. Only "pg_ctl ...
stop -m immediate" does.

With the last occurrence, I saved off the output of "handle
-a" and "pslist -x" in case that's helpful.

Any thoughts on what might be going wrong? If it happens
again, what other clues should I be looking for?

It would be interesting to see which processes have handle(s) open to
either of these two names. "handle -a" shuold give that, I assume?

//Magnus

#5Peter Brant
Peter.Brant@wicourts.gov
In reply to: Bruce Momjian (#2)
Re: [Win32] Problem with rename()

Unfortunately, it's not that simple. It would be straightforward to
track down if it were.

In response to other questions:

It's Postgres 8.1.3 running on Windows 2003 Server. No anti-virus
software is installed. The servers are essentially bare except for the
OS and Postgres.

We have "handle -a" output from two occurrences (different servers):

For the first one:

LOG: could not rename file "pg_xlog/000000010000010A000000BD" to
"pg_xlog/000000010000010A000000D7", continuing to try

Only one process (postgres.exe) is holding a handle to
pg_xlog/000000010000010A000000BD:

F84: Event \BaseNamedObjects\pgident: postgres: bigbird
bigbird 127.0.0.1(3306) BIND
FF4: File G:\pgsql\data\pg_xlog\000000010000010A000000BD

Nothing has the target file open.

The second is similar, except that two postgres.exe processes (and
nothing else) have the file open:

LOG: could not rename file "pg_xlog/000000010000010A0000006E" to
"pg_xlog/000000010000010A00000087", continuing to try

#1:
F84: Event \BaseNamedObjects\pgident: postgres: bigbird
bigbird 127.0.0.1(2367) SELECT
EFC: File G:\pgsql\data\pg_xlog\000000010000010A0000006E

#2:
F84: Event \BaseNamedObjects\pgident: postgres: bigbird
bigbird 127.0.0.1(2420) SELECT
FF4: File G:\pgsql\data\pg_xlog\000000010000010A0000006E

Nothing has the target file open.

Pete

Bruce Momjian <pgman@candle.pha.pa.us> 04/18/06 2:58 am >>>

Yes, comment I added to dirmod.c give a hint:

/*
* We need these loops because even though PostgreSQL uses flags
that
* allow rename while the file is open, other applications might
have
* these files open without those flags.
*/

so someone else has the file opened, but didn't use the required flags.

As to what could have it open, I don't know.

#6Harald Armin Massa
haraldarminmassa@gmail.com
In reply to: Peter Brant (#5)
Re: [Win32] Problem with rename()

Peter,

G:\pgsql\data\pg_xlog\000000010000010A000000BD

propably a very stupid question: "G" - is that really a LOKAL drive at that
server, or rather some NAS or similiar?

I had the same error in one logfile one time, but there where a large amount
of possible culprits (viral scanner, login script changing permissions,
backups, access control software...) and we could not reproduce the error.

Harald
--
GHUM Harald Massa
persuadere et programmare
Harald Armin Massa
Reinsburgstraße 202b
70197 Stuttgart
0173/9409607
-
PostgreSQL - supported by a community that does not put you on hold

#7Magnus Hagander
magnus@hagander.net
In reply to: Harald Armin Massa (#6)
Re: [Win32] Problem with rename()

Ok. So we're obviously blocking ourselves out.

Which process was the stalled one? Was it the same one that held the
file open, or a different one?

Looking at our code, we have the comment:
/* These flags allow concurrent rename/unlink */
(FILE_SHARE_READ |
FILE_SHARE_WRITE | FILE_SHARE_DELETE),

But I'm not sure that those flags actually guarantee that. They do allow
concurrent unlink, but not necessarily rename. I read elsewhere that it
should work, but can't find backing docs on MSDN. Seems it works in most
cases, but perhaps there are some where it doesn't?

Is there any way we can force our own other backends to close a file?
That would be an easy fix - have the postmaster tell all other backends
to close all files and reopen...

/Magnus

Show quoted text

-----Original Message-----
From: Peter Brant [mailto:Peter.Brant@wicourts.gov]
Sent: Tuesday, April 18, 2006 4:15 PM
To: Bruce Momjian; Qingqing Zhou <zhouqq@cs.toronto.edu;
Magnus Hagander <mha@sollentuna.net
Cc: pgsql-bugs@postgresql.org
Subject: Re: [BUGS] [Win32] Problem with rename()

Unfortunately, it's not that simple. It would be
straightforward to track down if it were.

In response to other questions:

It's Postgres 8.1.3 running on Windows 2003 Server. No
anti-virus software is installed. The servers are
essentially bare except for the OS and Postgres.

We have "handle -a" output from two occurrences (different servers):

For the first one:

LOG: could not rename file
"pg_xlog/000000010000010A000000BD" to
"pg_xlog/000000010000010A000000D7", continuing to try

Only one process (postgres.exe) is holding a handle to
pg_xlog/000000010000010A000000BD:

F84: Event \BaseNamedObjects\pgident: postgres: bigbird
bigbird 127.0.0.1(3306) BIND
FF4: File G:\pgsql\data\pg_xlog\000000010000010A000000BD

Nothing has the target file open.

The second is similar, except that two postgres.exe processes
(and nothing else) have the file open:

LOG: could not rename file
"pg_xlog/000000010000010A0000006E" to
"pg_xlog/000000010000010A00000087", continuing to try

#1:
F84: Event \BaseNamedObjects\pgident: postgres: bigbird
bigbird 127.0.0.1(2367) SELECT
EFC: File G:\pgsql\data\pg_xlog\000000010000010A0000006E

#2:
F84: Event \BaseNamedObjects\pgident: postgres: bigbird
bigbird 127.0.0.1(2420) SELECT
FF4: File G:\pgsql\data\pg_xlog\000000010000010A0000006E

Nothing has the target file open.

Pete

Bruce Momjian <pgman@candle.pha.pa.us> 04/18/06 2:58 am >>>

Yes, comment I added to dirmod.c give a hint:

/*
* We need these loops because even though PostgreSQL
uses flags that
* allow rename while the file is open, other
applications might have
* these files open without those flags.
*/

so someone else has the file opened, but didn't use the
required flags.

As to what could have it open, I don't know.

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Brant (#5)
Re: [Win32] Problem with rename()

"Peter Brant" <Peter.Brant@wicourts.gov> writes:

LOG: could not rename file "pg_xlog/000000010000010A000000BD" to
"pg_xlog/000000010000010A000000D7", continuing to try
...
Only one process (postgres.exe) is holding a handle to
pg_xlog/000000010000010A000000BD:
...
The second is similar, except that two postgres.exe processes (and
nothing else) have the file open:

Hmm, could these be backends that have been sitting idle for some time?
I'd expect a backend to be holding open a handle for whichever WAL
segment it last wrote to. If the backend sits idle for a couple of
checkpoints while others are advancing the end of WAL, then that segment
could become a target for renaming.

The only workable fix I can think of is to allow the checkpointer to
simply fail to rename this segment and go on about its business,
figuring that we'll be able to rename/delete the WAL segment in some
future checkpoint cycle. Not sure how messy that would be to implement.

regards, tom lane

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Magnus Hagander (#7)
Re: [Win32] Problem with rename()

"Magnus Hagander" <mha@sollentuna.net> writes:

Looking at our code, we have the comment:
/* These flags allow concurrent rename/unlink */
(FILE_SHARE_READ |
FILE_SHARE_WRITE | FILE_SHARE_DELETE),

But I'm not sure that those flags actually guarantee that. They do allow
concurrent unlink, but not necessarily rename. I read elsewhere that it
should work, but can't find backing docs on MSDN. Seems it works in most
cases, but perhaps there are some where it doesn't?

I think there are two different cases involved in rename:

1. Someone has a handle for the file-to-be-renamed;
2. Someone has a handle for the file that is to be deleted (ie currently
has the name being renamed to).

If #2 doesn't work then we've got serious problems. I think though that
#1 can only occur in the context of WAL segment recycling, so we can
probably work around it if that doesn't work.

regards, tom lane

#10Magnus Hagander
magnus@hagander.net
In reply to: Tom Lane (#9)
Re: [Win32] Problem with rename()

Looking at our code, we have the comment:
/* These flags allow concurrent rename/unlink */
(FILE_SHARE_READ |
FILE_SHARE_WRITE | FILE_SHARE_DELETE),

But I'm not sure that those flags actually guarantee that.

They do allow

concurrent unlink, but not necessarily rename. I read

elsewhere that it

should work, but can't find backing docs on MSDN. Seems it

works in most

cases, but perhaps there are some where it doesn't?

I think there are two different cases involved in rename:

1. Someone has a handle for the file-to-be-renamed;
2. Someone has a handle for the file that is to be deleted
(ie currently
has the name being renamed to).

If #2 doesn't work then we've got serious problems. I think
though that
#1 can only occur in the context of WAL segment recycling, so we can
probably work around it if that doesn't work.

The problem reported here was 1. Nobody had handles to the new filename.
I don't think I've seen any reports of issue 2, but most were never
researched to this depth (because most were just a case of
uninstalling-the-antivirus-to-make-it-work).

//Magnus

#11Peter Brant
Peter.Brant@wicourts.gov
In reply to: Harald Armin Massa (#6)
Re: [Win32] Problem with rename()

They are local.

Pete

"Harald Armin Massa" <haraldarminmassa@gmail.com> 04/18/06 4:35 pm

"G" - is that really a LOKAL drive at that server, or rather some NAS
or similiar?

#12Peter Brant
Peter.Brant@wicourts.gov
In reply to: Tom Lane (#8)
Re: [Win32] Problem with rename()

It's definitely possible. Both failures occurred around the end of the
business day as update traffic would have been coasting to a stop. The
middle tier never closes a connection unless it's forced to (e.g. as a
result of a query error, connection going away, etc.)

Pete

Tom Lane <tgl@sss.pgh.pa.us> 04/18/06 4:50 pm >>>

Hmm, could these be backends that have been sitting idle for some
time?

#13Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Brant (#1)
Re: [Win32] Problem with rename()

"Peter Brant" <Peter.Brant@wicourts.gov> writes:

[2006-04-17 16:49:22.583 ] 2252 LOG: could not rename file
"pg_xlog/000000010000010A000000BD" to
"pg_xlog/000000010000010A000000D7", continuing to try
It apparently just keeps on looping indefinitely. The "completed
rename" message from port/dirmod.c never shows up.

Shortly thereafter, Postgres becomes unresponsive. Attempts to make a
new connection just block. Autovacuums block. A "pg_ctl ... stop -m
fast" doesn't work. Only "pg_ctl ... stop -m immediate" does.

BTW, whatever we decide to do about the rename problem, I'd say that the
second point represents an independent bug. The rename loop would hang
up the bgwriter, which would probably cause performance to tank, but the
rest of the system shouldn't become completely unresponsive because of
an incomplete checkpoint. The checkpoint operation shouldn't be holding
any critical locks at this point.

Can you find out anything about what the other processes are blocking on?

regards, tom lane

#14Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#13)
Re: [Win32] Problem with rename()

I wrote:

"Peter Brant" <Peter.Brant@wicourts.gov> writes:

Shortly thereafter, Postgres becomes unresponsive. Attempts to make a
new connection just block. Autovacuums block. A "pg_ctl ... stop -m
fast" doesn't work. Only "pg_ctl ... stop -m immediate" does.

BTW, whatever we decide to do about the rename problem, I'd say that the
second point represents an independent bug. The rename loop would hang
up the bgwriter, which would probably cause performance to tank, but the
rest of the system shouldn't become completely unresponsive because of
an incomplete checkpoint. The checkpoint operation shouldn't be holding
any critical locks at this point.

I looked into this and found out that in fact, InstallXLogFileSegment
holds the ControlFileLock while trying to rename the WAL segment file.
It does this specifically as an interlock against someone else trying
to create the same new WAL segment name. So once the system runs out
of already-created WAL segments, XLogFileInit hangs up on the lock,
and then anything that wants to generate WAL entries is blocked.

It's possible that we could avoid using a lock here, but it would
require accepting some errors in creation/renaming of WAL segments as
being expected rather than fatal conditions. That seems a bit risky to
me, particularly for the Windows port where I have zero confidence that
I understand what errors Windows might report :-(. Maybe such a cure
is worse than the disease, since we intend to do something about fixing
the rename problem anyway. Any comments?

regards, tom lane

#15Peter Brant
Peter.Brant@wicourts.gov
In reply to: Tom Lane (#14)
Re: [Win32] Problem with rename()

Does that also explain why an attempt to make a new connection just
hangs?

One other thing regarding that is that connection attempt seems to
kinda, sorta succeed. It never makes it as far as a command prompt, but
on the "stop -m immediate", psql does print the "HINT: In a moment you
should be able to reconnect to the database and repeat your command.",
etc. log messages.

Pete

Tom Lane <tgl@sss.pgh.pa.us> 04/18/06 8:03 pm >>>

I looked into this and found out that in fact, InstallXLogFileSegment
holds the ControlFileLock while trying to rename the WAL segment file.
It does this specifically as an interlock against someone else trying
to create the same new WAL segment name. So once the system runs out
of already-created WAL segments, XLogFileInit hangs up on the lock,
and then anything that wants to generate WAL entries is blocked.

#16Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Brant (#15)
Re: [Win32] Problem with rename()

"Peter Brant" <Peter.Brant@wicourts.gov> writes:

Does that also explain why an attempt to make a new connection just
hangs?

Actually, I was just wondering about that --- seems like a bare
connection attempt should not generate any WAL entries. Do you have any
nondefault actions in ~/.psqlrc or something like that?

regards, tom lane

#17Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#16)
Re: [Win32] Problem with rename()

I wrote:

"Peter Brant" <Peter.Brant@wicourts.gov> writes:

Does that also explain why an attempt to make a new connection just
hangs?

Actually, I was just wondering about that --- seems like a bare
connection attempt should not generate any WAL entries. Do you have any
nondefault actions in ~/.psqlrc or something like that?

I just repeated the hangup scenario here, and confirmed that I can still
start and stop a plain-vanilla psql session (no ~/.psqlrc, no special
per-user or per-database settings) without it hanging. I can also do
simple read-only SELECTs. So I'm thinking your hang must involve some
additional non-read-only actions.

[ thinks for awhile longer ... ] No, I take that back. Once you'd
exhausted the current pg_clog page (32K transactions), even read-only
transactions would be blocked by the need to create a new pg_clog page
(which is a WAL-logged action). A read-only transaction never actually
makes a WAL entry, but it does still consume an XID and hence a slot on
the current pg_clog page. So I just hadn't tried enough transactions.

regards, tom lane

#18Peter Brant
Peter.Brant@wicourts.gov
In reply to: Peter Brant (#1)
Re: [Win32] Problem with rename()

This is probably somewhat superfluous, but we had another one these
incidents last night whose details confirm your explanation here.

[2006-04-21 00:22:19.500 ] 2452 LOG: could not rename file
"pg_xlog/000000010000011A0000004C" to
"pg_xlog/000000010000011A00000071", continuing to try

the autovacuums (which wouldn't actually have been vacuuming anything
since update traffic would have stopped by then) continued until:

[2006-04-21 01:57:35.968 ] 4048 LOG: autovacuum: processing database
"bigbird"

and the Web site first started noticing timeouts at 01:31:42,827.

Overnight traffic is so light that 70 minutes to work through 32K / 2
transactions is probably about right.

Pete

Tom Lane <tgl@sss.pgh.pa.us> 04/18/06 9:01 pm >>>

[ thinks for awhile longer ... ] No, I take that back. Once you'd
exhausted the current pg_clog page (32K transactions), even read-only
transactions would be blocked by the need to create a new pg_clog page
(which is a WAL-logged action). A read-only transaction never
actually
makes a WAL entry, but it does still consume an XID and hence a slot
on
the current pg_clog page. So I just hadn't tried enough transactions.

#19Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#17)
Re: [Win32] Problem with rename()

I am assuming this problem and the other rash of Win32 problems reported
in March are now all fixed in 8.1.4. If not, please let me know.

---------------------------------------------------------------------------

Tom Lane wrote:

I wrote:

"Peter Brant" <Peter.Brant@wicourts.gov> writes:

Does that also explain why an attempt to make a new connection just
hangs?

Actually, I was just wondering about that --- seems like a bare
connection attempt should not generate any WAL entries. Do you have any
nondefault actions in ~/.psqlrc or something like that?

I just repeated the hangup scenario here, and confirmed that I can still
start and stop a plain-vanilla psql session (no ~/.psqlrc, no special
per-user or per-database settings) without it hanging. I can also do
simple read-only SELECTs. So I'm thinking your hang must involve some
additional non-read-only actions.

[ thinks for awhile longer ... ] No, I take that back. Once you'd
exhausted the current pg_clog page (32K transactions), even read-only
transactions would be blocked by the need to create a new pg_clog page
(which is a WAL-logged action). A read-only transaction never actually
makes a WAL entry, but it does still consume an XID and hence a slot on
the current pg_clog page. So I just hadn't tried enough transactions.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

--
Bruce Momjian http://candle.pha.pa.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#20Peter Brant
Peter.Brant@wicourts.gov
In reply to: Bruce Momjian (#19)
Re: [Win32] Problem with rename()

Really? If there was a patch, I missed it.

My recollection is that there was general agreement about this
particular problem (see, for example,
http://archives.postgresql.org/pgsql-bugs/2006-04/msg00189.php ), but
things kind of trailed off after that without a resolution.

As far as the complete list of Win32 problems which affected us:
- The stats collector crashing should indeed be fixed in 8.1.4
- Missing stats caused by Windows PID recycling is fixed in 8.2
- Various semaphore problems are probably all fixed with the new
Win32 semaphore implementation in 8.2
- The stuck log rename problem mentioned above is still an issue
- The "permission denied on fsync" (or something like that) problem
is still an issue. Unfortunately, IIRC, we could never really nail down
the underlying problem.

None of these problems affect us any more: the production servers now
run Linux. Great to have options! (and we were moving that direction
anyway)

Pete

Bruce Momjian <pgman@candle.pha.pa.us> 16.06.2006 22:05 >>>

I am assuming this problem and the other rash of Win32 problems
reported
in March are now all fixed in 8.1.4. If not, please let me know.

#21Bruce Momjian
bruce@momjian.us
In reply to: Peter Brant (#20)
#22Peter Brant
Peter.Brant@wicourts.gov
In reply to: Bruce Momjian (#21)