could not read transaction log directory ...?

Started by Michael Brusser · over 22 years ago · 8 messages

#1 Michael Brusser
michael@synchronicity.com

Running a certain process against Postgres 7.3.2, I consistently end up with a crash.
This happens on Linux Red Hat 6.2, kernel 2.2.
Here's a fragment from the database log.

2003-05-07 14:38:48 LOG: recycled transaction log file 0000000000000005
2003-05-07 14:48:56 LOG: recycled transaction log file 0000000000000006
2003-05-07 15:04:10 LOG: recycled transaction log file 0000000000000007
2003-05-07 15:04:10 PANIC: could not read transaction log directory - (<my_dir_path>/pg_xlog): Unknown error 523
2003-05-07 15:04:11 LOG: server process (pid 449) was terminated by signal 6
2003-05-07 15:04:11 LOG: terminating any other active server processes
2003-05-07 15:04:11 WARNING: Message from PostgreSQL backend:
The Postmaster has informed me that some other backend
died abnormally and possibly corrupted shared memory.
I have rolled back the current transaction and am
going to terminate your database system connection and exit.
Please reconnect to the database system and repeat your query.
2003-05-07 15:04:11 WARNING: Message from PostgreSQL backend:
The Postmaster has informed me that some other backend
died abnormally and possibly corrupted shared memory.
I have rolled back the current transaction and am
going to terminate your database system connection and exit.
Please reconnect to the database system and repeat your query.

The process loads the database with a large number of records;
it runs for about 20-30 minutes before it crashes.
The problem apparently originates in the function MoveOfflineLogs (xlog.c).
At this point there are two files in the transaction log directory:
16777216 May 7 15:04 0000000000000008
16777216 May 7 14:55 0000000000000009
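
For reference, here is a stripped-down sketch of the kind of directory scan
MoveOfflineLogs performs. This is an illustration only, not the actual xlog.c
code; it just shows where the errno left behind by readdir() ends up in the
PANIC message.

/* illustration only - not the PostgreSQL source */
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <dirent.h>

int main(int argc, char **argv)
{
    const char *xlogdir = (argc > 1) ? argv[1] : "pg_xlog";
    struct dirent *de;
    DIR *dir = opendir(xlogdir);

    if (dir == NULL)
    {
        fprintf(stderr, "could not open transaction log directory (%s): %s\n",
                xlogdir, strerror(errno));
        return 1;
    }

    errno = 0;
    while ((de = readdir(dir)) != NULL)
    {
        printf("log segment candidate: %s\n", de->d_name);
        errno = 0;
    }
    if (errno != 0)
    {
        /* errno 523 has no name in libc, so strerror() reports it as
         * "Unknown error 523", which is what the log shows */
        fprintf(stderr, "could not read transaction log directory (%s): %s\n",
                xlogdir, strerror(errno));
        closedir(dir);
        return 1;
    }
    closedir(dir);
    return 0;
}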

Does anyone have an idea why this could happen?

Thanks,
Mike.

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Brusser (#1)
Re: could not read transaction log directory ...?

Michael Brusser <michael@synchronicity.com> writes:

2003-05-07 15:04:10 PANIC: could not read transaction log directory - (<my_dir_path>/pg_xlog): Unknown error 523

Bizarre. Can you dig around in your kernel sources and see what errno
523 might mean?

regards, tom lane

#3 Michael Brusser
michael@synchronicity.com
In reply to: Tom Lane (#2)
Re: could not read transaction log directory ...?

From errno.h :
... ...
/* Defined for the NFSv3 protocol */
#define EBADHANDLE 521 /* Illegal NFS file handle */
#define ENOTSYNC 522 /* Update synchronization mismatch */
#define EBADCOOKIE 523 /* Cookie is stale */
... ...
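
As a side note, 523 is outside the range libc knows about, which is why the
PANIC printed "Unknown error 523" rather than a symbolic name. A quick
throwaway check (a hypothetical one-off program, not anything we actually run):

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* glibc has no names for the kernel/NFS-protocol errnos (512 and up),
     * so this prints "Unknown error 523", matching the server log */
    printf("errno 523: %s\n", strerror(523));
    return 0;
}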

"Cookie is stale" - ..?
Should I consider some problems with the file server?
The strange thing is that the process always crashes some 30 minutes
after it starts. Another point is that it works fine on another machine:
Red Hat 7.2 with a 2.4.9 kernel.

I'm not sure what to make of this.
Thanks,
Mike.

#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Brusser (#3)
Re: could not read transaction log directory ...?

Michael Brusser <michael@synchronicity.com> writes:

"Cookie is stale" - ..?
Should I consider some problems with the file server?

Running a database over an NFS mount is widely considered foolish.
It exposes you to all sorts of failure modes that don't exist with
a local filesystem.

The strange thing is that the process always crashes some 30 minutes
after it starts.

Well, that would fit fine with the notion that there's some kind of
30-minute timeout on open files in your NFS stack.

Another point is that it works fine on another machine:
Red Hat 7.2 with a 2.4.9 kernel.
I'm not sure what to make of this.

I'd call it a kernel bug or NFS server bug.

regards, tom lane

#5 Christopher Browne
cbbrowne@cbbrowne.com
In reply to: Michael Brusser (#3)
Re: could not read transaction log directory ...?

From errno.h :

... ...
/* Defined for the NFSv3 protocol */
#define EBADHANDLE 521 /* Illegal NFS file handle */
#define ENOTSYNC 522 /* Update synchronization mismatch */
#define EBADCOOKIE 523 /* Cookie is stale */
... ...

"Cookie is stale" - ..?
Should I consider some problems with the file server?
The strange thing is that the process always crashes some 30 minutes
after it starts. Another point is that it works fine on another machine:
Red Hat 7.2 with a 2.4.9 kernel.

I'm not sure what to make of this.

Actually, that is somewhat suggestive...

chvatal:/usr/src/linux/fs/nfs# grep EBADCOOKIE *.c
dir.c: if (res == -EBADCOOKIE) {
nfs2xdr.c: return ERR_PTR(-EBADCOOKIE);
nfs2xdr.c: { NFSERR_BAD_COOKIE, EBADCOOKIE },
nfs3xdr.c: return ERR_PTR(-EBADCOOKIE);
chvatal:/usr/src/linux/fs/nfs#

Does this mean that you are storing your filesystems on NFS?

That could well be the root of the problem; NFS has been somewhat in
flux, and is usually not a highly recommended way of storing PG data.

I hear D'Arcy Cain uses NFS fairly successfully for the purpose, but I
believe he's using NetApp boxes, which are _quite_ different from the
norm, and likely aren't what you are using.

My first suggestion would be Stop Using NFS (unless you are really quite
certain of what you're doing).
--
(reverse (concatenate 'string "gro.mca@" "enworbbc"))
http://cbbrowne.com/info/x.html
Oh, boy, virtual memory! Now I'm gonna make myself a really *big*
RAMdisk!

#6 Christopher Kings-Lynne
chriskl@familyhealth.com.au
In reply to: Michael Brusser (#3)
Re: could not read transaction log directory ...?

Does this mean that you are storing your filesystems on NFS?

That could well be the root of the problem; NFS has been somewhat in
flux, and is usually not a highly recommended way of storing PG data.

I hear D'Arcy Cain uses NFS fairly successfully for the purpose, but I
believe he's using NetApp boxes, which are _quite_ different from the
norm, and likely aren't what you are using.

My first suggestion would be Stop Using NFS (unless you are really quite
certain of what you're doing).

Or switch to FreeBSD - ever since the 'fsx' NFS automatic testing tool came
out from Apple, FreeBSD has had an excellent NFS implementation.

Chris

#7 Michael Brusser
michael@synchronicity.com
In reply to: Christopher Kings-Lynne (#6)
Re: could not read transaction log directory ...?

I don't have much choice here - these are development and
test machines, a few different platforms, but all on NFS.
Testing is very intensive and Postgres takes quite a beating.
I think this is the first time we've run into this kind of problem.

I just want to thank everyone for the help.
Mike.

#8 scott.marlowe
scott.marlowe@ihs.com
In reply to: Michael Brusser (#7)
Re: could not read transaction log directory ...?

Do you have any options for NFS set in your fstab for this share? We had
to make a few changes to get NFS to work reliably (we still don't use it
for databases, just raw text files and such). My options are:

timeo=10,retry=1,bg,soft,intr,rsize=8192,wsize=8192
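
In /etc/fstab that works out to something like the following (the server
name and mount point are placeholders, and this is the mount we use for
plain files, not for database storage):

# hypothetical entry; host and paths are made up
fileserver:/export/data  /mnt/data  nfs  timeo=10,retry=1,bg,soft,intr,rsize=8192,wsize=8192  0  0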

If anyone has any suggestions about these, feel free; they are the
settings I got from one of the networking guys, and they may well be
non-optimal. But NFS doesn't just disappear for seconds at a time anymore
like it used to during snapshots.

