PostgreSQL, NetBSD and NFS

Started by D'Arcy J.M. Cain about 23 years ago | 52 messages | hackers
#1D'Arcy J.M. Cain
darcy@druid.net

I have posted before about this but I am now posting to both NetBSD and
PostgreSQL since it seems to be some sort of interaction between the two. I
have a NetApp filer on which I am putting a PostgreSQL database. I run
PostgreSQL on a NetBSD box. I used rsync to get the database onto the filer
with no problem whatsoever but as soon as I try to open the database the NFS
mount hangs and I can't do any operations on that mounted drive without
hanging. Other things continue to run but the minute I do a df or an ls on
that drive that terminal is lost.

On the NetBSD side I get a "server not responding" error. On the filer I see
no problems at all. A reboot of the filer doesn't correct anything.

Since NetBSD works just fine with this until I start PostgreSQL and
PostgreSQL, from all reports, works well with the NetApp filer, I assume that
there is something out of the ordinary about PostgreSQL's disk access that is
triggering some subtle bug in NetBSD. Does the shared memory stuff use disk
at all? Perhaps that's the difference between PostgreSQL and other
applications.

The NetApp people are being very helpful and are willing to follow up any
leads people might have and may even suggest fixes if necessary. I have
Bcc'd the engineer on this message and will send anything I get to them.

-- 
D'Arcy J.M. Cain <darcy@{druid|vex}.net>   |  Democracy is three wolves
http://www.druid.net/darcy/                |  and a sheep voting on
+1 416 425 1212     (DoD#0082)    (eNTP)   |  what's for dinner.
#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: D'Arcy J.M. Cain (#1)
Re: PostgreSQL, NetBSD and NFS

"D'Arcy J.M. Cain" <darcy@druid.net> writes:

I have posted before about this but I am now posting to both NetBSD and
PostgreSQL since it seems to be some sort of interaction between the two. I
have a NetAPP filer on which I am putting a PostgreSQL database. I run
PostgreSQL on a NetBSD box. I used rsync to get the database onto the filer
with no problem whatsoever but as soon as I try to open the database the NFS
mount hangs and I can't do any operations on that mounted drive without
hanging.

That's darn odd. But please be more specific: what's "open the
database"? Start the postmaster? Start a psql? Issue a query?

Does the shared memory stuff use disk at all?

No, I can't see that there would be any connection there.

Perhaps the next thing to do is to strace (ktrace, trace, truss,
whatever system-call tracing utility you got) the postmaster and
child processes. If we could determine what system call is hanging up,
we might be a little closer to solving the mystery.

regards, tom lane

#3Mark Woodward
pgsql@mohawksoft.com
In reply to: D'Arcy J.M. Cain (#1)
Re: PostgreSQL, NetBSD and NFS

Forgive my stupidity, are you running PostgreSQL with the data on an NFS
share?


#4Greg Copeland
greg@CopelandConsulting.Net
In reply to: Mark Woodward (#3)
Re: PostgreSQL, NetBSD and NFS

That was going to be my question too.

I thought NFS didn't have some of the requisite file system behaviors
(locking, flushing, etc. IIRC) for PostgreSQL to function correctly or
reliably.

Please correct as needed.

Regards,

Greg

On Thu, 2003-01-30 at 13:02, mlw wrote:

Forgive my stupidity, are you running PostgreSQL with the data on an NFS
share?


--
Greg Copeland <greg@copelandconsulting.net>
Copeland Computer Consulting

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Greg Copeland (#4)
Re: PostgreSQL, NetBSD and NFS

Greg Copeland <greg@CopelandConsulting.Net> writes:

That was going to be my question too.
I thought NFS didn't have some of the requisite file system behaviors
(locking, flushing, etc. IIRC) for PostgreSQL to function correctly or
reliably.

Whether the thing is trustworthy is a different issue ;-). I was just
surprised that it didn't seem to work at all.

In practice, if the NFS server never goes down then you probably haven't
got a problem. I'm not sure you could count on the database not getting
scrambled if the NFS server crashes. But that wasn't the question...

regards, tom lane

#6Larry Rosenman
ler@lerctr.org
In reply to: Tom Lane (#5)
Re: PostgreSQL, NetBSD and NFS

--On Thursday, January 30, 2003 16:02:17 -0500 Tom Lane <tgl@sss.pgh.pa.us>
wrote:

Greg Copeland <greg@CopelandConsulting.Net> writes:

That was going to be my question too.
I thought NFS didn't have some of the requisite file system behaviors
(locking, flushing, etc. IIRC) for PostgreSQL to function correctly or
reliably.

Whether the thing is trustworthy is a different issue ;-). I was just
surprised that it didn't seem to work at all.

In practice, if the NFS server never goes down then you probably haven't
got a problem. I'm not sure you could count on the database not getting
scrambled if the NFS server crashes. But that wasn't the question...

FWIW I use a netapp filer for my databases here for traffic analysis and IP
management.

The NetApp has battery-backed NVRAM and will replay the right stuff on its
own.

Just another datapoint.

LER


--
Larry Rosenman http://www.lerctr.org/~ler
Phone: +1 972-414-9812 E-Mail: ler@lerctr.org
US Mail: 1905 Steamboat Springs Drive, Garland, TX 75044-6749

#7Curt Sampson
cjs@cynic.net
In reply to: D'Arcy J.M. Cain (#1)
Re: PostgreSQL, NetBSD and NFS

On Thu, 30 Jan 2003, D'Arcy J.M. Cain wrote:

Does the shared memory stuff use disk at all? Perhaps that's the
difference between PostgreSQL and other applications.

Shared memory in NetBSD is just an interface to mmap'd pages, so it can
be swapped to disk. But I assume your swap is not on NFS....

A ktrace would be helpful. Also, it would be helpful if you tried doing
an initdb to a directory on the filer to see if you can even create a
database cluster, and tried doing that or rsyncing and accessing your
data over NFS with a NetBSD system as the NFS server.

cjs
--
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light. --XTC

#8D'Arcy J.M. Cain
darcy@druid.net
In reply to: Tom Lane (#2)
Re: PostgreSQL, NetBSD and NFS

On Thursday 30 January 2003 12:07, Tom Lane wrote:

"D'Arcy J.M. Cain" <darcy@druid.net> writes:

I have posted before about this but I am now posting to both NetBSD and
PostgreSQL since it seems to be some sort of interaction between the two.
I have a NetAPP filer on which I am putting a PostgreSQL database. I
run PostgreSQL on a NetBSD box. I used rsync to get the database onto
the filer with no problem whatsoever but as soon as I try to open the
database the NFS mount hangs and I can't do any operations on that
mounted drive without hanging.

That's darn odd. But please be more specific: what's "open the
database"? Start the postmaster? Start a psql? Issue a query?

Start the postmaster. It is possible that I have a corrupted database but I
was using that as a debugging tool because I still don't think that the whole
NFS subsystem should lock up. The other time I tested it took hours to fail
and I found it useful to have an immediate fail.

Perhaps the next thing to do is to strace (ktrace, trace, truss,
whatever system-call tracing utility you got) the postmaster and
child processes. If we could determine what system call is hanging up,
we might be a little closer to solving the mystery.

Ktrace. Yes, am doing another test at the moment - using 100Mb to 100Mb and
TCP option to the mount. Before I was using the default UDP and going 100Mb
to 1000 Mb. If this works I will try my "guaranteed" fail next and will add
ktrace. In fact, I will do that regardless.

#9D'Arcy J.M. Cain
darcy@druid.net
In reply to: Mark Woodward (#3)
Re: PostgreSQL, NetBSD and NFS

On Thursday 30 January 2003 14:02, mlw wrote:

Forgive my stupidity, are you running PostgreSQL with the data on an NFS
share?

Yes, sorry. PostgreSQL is running from the local disk but the data is on the
mounted drive.

#10D'Arcy J.M. Cain
darcy@druid.net
In reply to: Greg Copeland (#4)
Re: PostgreSQL, NetBSD and NFS

On Thursday 30 January 2003 14:27, Greg Copeland wrote:

That was going to be my question too.

I thought NFS didn't have some of the requisite file system behaviors
(locking, flushing, etc. IIRC) for PostgreSQL to function correctly or
reliably.

Please correct as needed.

Yes, doubly so here please. I think I remember someone else saying that they
use PostgreSQL over NFS so hopefully this is not the situation.

#11D'Arcy J.M. Cain
darcy@druid.net
In reply to: D'Arcy J.M. Cain (#1)
Re: PostgreSQL, NetBSD and NFS

On Thursday 30 January 2003 18:32, Simon J. Gerraty wrote:

Is postgreSQL trying to lock a file perhaps? Would seem a sensible thing
for it to be doing...

Is that a problem? FWIW I am running statd and lockd on the NetBSD box.

#12Bill Studenmund
wrstuden@netbsd.org
In reply to: D'Arcy J.M. Cain (#8)
Re: PostgreSQL, NetBSD and NFS

On Fri, 31 Jan 2003, D'Arcy J.M. Cain wrote:

On Thursday 30 January 2003 12:07, Tom Lane wrote:

Perhaps the next thing to do is to strace (ktrace, trace, truss,
whatever system-call tracing utility you got) the postmaster and
child processes. If we could determine what system call is hanging up,
we might be a little closer to solving the mystery.

Ktrace. Yes, am doing another test at the moment - using 100Mb to 100Mb and
TCP option to the mount. Before I was using the default UDP and going 100Mb
to 1000 Mb. If this works I will try my "guaranteed" fail next and will add
ktrace. In fact, I will do that regardless.

Look at the -t option to ktrace. It controls what ktrace looks at
(syscalls, NAMEI lookups, etc.). Most importantly, you might want to NOT
include the 'i' option in there, which is in there by default. It logs the
data of all i/o transfers, which balloons the logs. While you may need the
data in the end, tracing w/o 'i' could show you the syscalls around the
failure which might be enough.
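
As a rough sketch of the advice above, the snippet below assembles a ktrace command line that traces syscalls, namei lookups and signals but leaves out the 'i' (I/O data) trace point. The helper name `ktrace_argv`, the output file name and the PID 1234 are hypothetical, and the flags (-f, -t, -i, -p) are assumed to behave as in the BSD ktrace(1) manual, so check them against the man page on the system in question.

```python
import shlex


def ktrace_argv(pid, trace_points="cns", out_file="ktrace.out",
                follow_children=True):
    """Build an argv list for ktrace(1) without the 'i' (I/O) trace point.

    Assumed flag meanings, taken from BSD ktrace(1):
      -f FILE  write the trace to FILE
      -t STR   trace points (c = syscalls, n = namei, s = signals)
      -i       pass the trace flags on to future children
      -p PID   attach to an already running process
    """
    argv = ["ktrace", "-f", out_file, "-t", trace_points]
    if follow_children:
        # The postmaster forks a backend per connection, so inherit flags.
        argv.append("-i")
    argv += ["-p", str(pid)]
    return argv


if __name__ == "__main__":
    # 1234 is a placeholder for the postmaster's PID.
    print(shlex.join(ktrace_argv(1234)))
```

The resulting trace file would then be read with kdump. If the syscalls around the hang are not enough, 'i' can be added back to the trace string at the cost of a much larger log.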

Take care,

Bill

#13Mark Woodward
pgsql@mohawksoft.com
In reply to: D'Arcy J.M. Cain (#1)
Re: PostgreSQL, NetBSD and NFS

D'Arcy J.M. Cain wrote:

On Thursday 30 January 2003 14:02, mlw wrote:

Forgive my stupidity, are you running PostgreSQL with the data on an NFS
share?

Yes, sorry. PostgreSQL is running from the local disk but the data is on the
mounted drive.

I'm not sure, I guess it could work, but NFS is a pretty poor file
system. There are always issues with file locking across various
platforms. I recall reading about mmap issues across NFS a while ago
(forget the platform, sorry). Depending on the NFS server, there may be
problems there. The NFS client may also have issues with locking, fsync,
and mmap.

If possible, look for a network block device protocol. The file level
NFS is probably inadequate for PostgreSQL.

#14Greg A. Woods
woods@weird.com
In reply to: D'Arcy J.M. Cain (#11)
Re: PostgreSQL, NetBSD and NFS

[ On Friday, January 31, 2003 at 11:54:27 (-0500), D'Arcy J.M. Cain wrote: ]

Subject: Re: PostgreSQL, NetBSD and NFS

On Thursday 30 January 2003 18:32, Simon J. Gerraty wrote:

Is postgreSQL trying to lock a file perhaps? Would seem a sensible thing
for it to be doing...

Is that a problem? FWIW I am running statd and lockd on the NetBSD box.

NetBSD's NFS implementation only supports locking as a _server_, not a
client.

http://www.unixcircle.com/features/nfs.php

Optional for file locking (lockd+statd):

lockd:

Rpc.lockd is a daemon which provides file and record-locking services
in an NFS environment.

FreeBSD, NetBSD and OpenBSD file locking is only supported on server
side.

NFS server support for locking was introduced in NetBSD-1.5:

http://www.netbsd.org/Releases/formal-1.5/NetBSD-1.5.html

* Server part of NFS locking (implemented by rpc.lockd(8)) now works.

and as you can also see from rpc.lockd/lockd.c:

----------------------------
revision 1.5
date: 2000/06/07 14:34:40; author: bouyer; state: Exp; lines: +67 -25
Implement file locking in lockd. All the stuff is done in userland, using
fhopen() and flock(). This means that if you kill lockd, all locks will
be relased (but you're supposed to kill statd at the same time, so
remote hosts will know it and re-establish the lock).
Tested against solaris 2.7 and linux 2.2.14 clients.
Shared lock are not handled efficiently, they're serialised in lockd when they
could be granted.
----------------------------

Terry Lambert has some proposed fixes to add NFS client level locking to
the FreeBSD kernel:

http://www.freebsd.org/~terry/DIFF.LOCKS.txt
http://www.freebsd.org/~terry/DIFF.LOCKS.MAN
http://www.freebsd.org/~terry/DIFF.LOCKS
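
To see how a given mount actually behaves, one could probe it directly. The following is a minimal sketch, not something from the thread: the helper name `probe_lock` is hypothetical, and it only tests a non-blocking fcntl-style lock on a scratch file. A result of "lock ok" does not prove the lock reached the NFS server, since a client may grant it locally, and on a mount that is already hung the probe may block at file creation.

```python
import errno
import fcntl
import os
import tempfile


def probe_lock(directory):
    """Try a non-blocking exclusive fcntl lock on a scratch file.

    Returns "lock ok" on success, or "lock failed: <ERRNO NAME>" if the
    kernel rejects the lock request.
    """
    fd, path = tempfile.mkstemp(prefix="lockprobe-", dir=directory)
    try:
        try:
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError as exc:
            name = errno.errorcode.get(exc.errno, str(exc.errno))
            return f"lock failed: {name}"
        fcntl.lockf(fd, fcntl.LOCK_UN)
        return "lock ok"
    finally:
        # Always clean up the scratch file, even if locking failed.
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    # Replace the temp dir with a directory on the NFS mount to test it.
    print(probe_lock(tempfile.gettempdir()))
```

Running it once against a local directory and once against the mounted filer gives a quick comparison of how the client treats lock requests.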

--
Greg A. Woods

+1 416 218-0098; <g.a.woods@ieee.org>; <woods@robohack.ca>
Planix, Inc. <woods@planix.com>; VE3TCP; Secrets of the Weird <woods@weird.com>

#15Curt Sampson
cjs@cynic.net
In reply to: Mark Woodward (#13)
Re: PostgreSQL, NetBSD and NFS

On Fri, 31 Jan 2003, mlw wrote:

. There are always issues with file locking across various
platforms. I recall reading about mmap issues across NFS a while ago...

Postgres uses neither of these, IIRC, so that should be fine. (Actually,
postgres does effectively use mmap for shared memory on NetBSD, but
that's not mapping data on the NFS filesystem, so it's not an issue.)

The NFS client may also have issues with locking, fsync, and mmap.

Any fsync problems would affect data integrity during a crash, but
nothing otherwise.

(Of course, I'm happy to be corrected on any of these issues, if someone
can point out particular parts of postgres that would fail over NFS.)

cjs

#16Tom Lane
tgl@sss.pgh.pa.us
In reply to: D'Arcy J.M. Cain (#1)
Re: PostgreSQL, NetBSD and NFS

"D'Arcy J.M. Cain" <darcy@druid.net> writes:

100Mb instead of 100Mb -->1000Mb. I tried mounting with and without the TCP
option and it seemed to act the same but it was better than before. Now it
doesn't crash but trying to load a large table hangs. It gets to a point
where it is calling semop over and over getting a 0 return. It does that 81
times in 0.989004 seconds and then hangs in the PostgreSQL code. It must be
in some sort of busy loop because there are no further system calls after the
last semop return and the CPU usage continues to climb.

Very bizarre. Looks like the last page it read was block 104
(851968/8192) in file "/source/data/cert/base/16556/17063". Could you
provide a formatted dump of that page? I'm partial to pg_filedump which
you can get from http://sources.redhat.com/rhdb/tools.html. Use
switches -f -i to get a reasonably complete dump.
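
For anyone following the arithmetic, here is a small sketch of how a byte offset maps to a block number, assuming the default 8192-byte page size. The function names are illustrative only and are not part of PostgreSQL or pg_filedump.

```python
BLCKSZ = 8192  # assumed page size in bytes (the PostgreSQL default)


def block_of(offset, block_size=BLCKSZ):
    """Return the zero-based block number that contains byte `offset`."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    return offset // block_size


def block_range(block, block_size=BLCKSZ):
    """Return (start, end) byte offsets of `block`; end is exclusive."""
    start = block * block_size
    return start, start + block_size


if __name__ == "__main__":
    offset = 851968  # the offset quoted above
    blk = block_of(offset)
    print(blk, block_range(blk))  # prints: 104 (851968, 860160)
```

So the page of interest covers bytes 851968 up to, but not including, 860160 of the file, with the neighbouring pages being blocks 103 and 105.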

regards, tom lane

#17D'Arcy J.M. Cain
darcy@druid.net
In reply to: Tom Lane (#16)
Re: PostgreSQL, NetBSD and NFS

On Saturday 01 February 2003 13:09, Tom Lane wrote:

Very bizarre. Looks like the last page it read was block 104
(851968/8192) in file "/source/data/cert/base/16556/17063". Could you
provide a formatted dump of that page? I'm partial to pg_filedump which
you can get from http://sources.redhat.com/rhdb/tools.html. Use
switches -f -i to get a reasonably complete dump.

That's a 4.7 MB file. The dump might be quite huge. I can send you the file
itself (privately) if you want. Wouldn't that be even better?

I can tell you what the file is. It is the primary key file for the
certificate database which is the 8 million record table that I am trying to
load.

#18Tom Lane
tgl@sss.pgh.pa.us
In reply to: D'Arcy J.M. Cain (#17)
Re: PostgreSQL, NetBSD and NFS

"D'Arcy J.M. Cain" <darcy@druid.net> writes:

That's a 4.7 MB file. The dump might be quite huge.

I really just want to see the dump of that one page, and maybe the pages
before and after it for comparison's sake.

regards, tom lane

#19Tom Lane
tgl@sss.pgh.pa.us
In reply to: D'Arcy J.M. Cain (#17)
Re: PostgreSQL, NetBSD and NFS

What was the query it failed on, exactly? That last page it read
seems to be an empty index page --- it should have moved on to the
next index page, I'd think, rather than doing anything that could
hang up.

regards, tom lane

#20D'Arcy J.M. Cain
darcy@druid.net
In reply to: Tom Lane (#19)
Re: PostgreSQL, NetBSD and NFS

On Saturday 01 February 2003 14:00, Tom Lane wrote:

What was the query it failed on, exactly? That last page it read
seems to be an empty index page --- it should have moved on to the
next index page, I'd think, rather than doing anything that could
hang up.

Here's the log. As you can see, nothing was logged after the COPY command.

It's possible that the file was corrupted. I will do a new test from scratch
now that I am not switching speeds.


Attachments:

pg.log (text/x-log; charset=iso-8859-1; name=pg.log)
#21Tom Lane
tgl@sss.pgh.pa.us
In reply to: D'Arcy J.M. Cain (#20)
#22D'Arcy J.M. Cain
darcy@druid.net
In reply to: Tom Lane (#21)
#23Tom Lane
tgl@sss.pgh.pa.us
In reply to: D'Arcy J.M. Cain (#22)
#24Manuel Bouyer
bouyer@antioche.eu.org
In reply to: Greg Copeland (#4)
#25D'Arcy J.M. Cain
darcy@druid.net
In reply to: Tom Lane (#23)
#26Tom Lane
tgl@sss.pgh.pa.us
In reply to: D'Arcy J.M. Cain (#25)
#27D'Arcy J.M. Cain
darcy@druid.net
In reply to: Tom Lane (#26)
#28D'Arcy J.M. Cain
darcy@druid.net
In reply to: Tom Lane (#26)
#29Tom Lane
tgl@sss.pgh.pa.us
In reply to: D'Arcy J.M. Cain (#28)
#30D'Arcy J.M. Cain
darcy@druid.net
In reply to: Tom Lane (#29)
#31Tom Lane
tgl@sss.pgh.pa.us
In reply to: D'Arcy J.M. Cain (#30)
#32D'Arcy J.M. Cain
darcy@druid.net
In reply to: Tom Lane (#31)
#33Tom Lane
tgl@sss.pgh.pa.us
In reply to: D'Arcy J.M. Cain (#32)
#34Justin Clift
justin@postgresql.org
In reply to: Tom Lane (#31)
#35Byron Servies
bservies@pacang.com
In reply to: Justin Clift (#34)
#36Greg Copeland
greg@CopelandConsulting.Net
In reply to: Tom Lane (#33)
#37James Hubbard
jhubbard@mcs.uvawise.edu
In reply to: Justin Clift (#34)
#38Ian Fry
Ian.Fry@sophos.com
In reply to: Tom Lane (#33)
#39Tom Lane
tgl@sss.pgh.pa.us
In reply to: Greg Copeland (#36)
#40Justin Clift
justin@postgresql.org
In reply to: James Hubbard (#37)
#41Kevin Brown
kevin@sysexperts.com
In reply to: Tom Lane (#39)
#42D'Arcy J.M. Cain
darcy@druid.net
In reply to: Ian Fry (#38)
#43Tom Lane
tgl@sss.pgh.pa.us
In reply to: D'Arcy J.M. Cain (#42)
#44Thor Lancelot Simon
tls@rek.tjls.com
In reply to: Tom Lane (#43)
#45Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thor Lancelot Simon (#44)
#46Michael Hertrick
m.hertrick@neovera.com
In reply to: D'Arcy J.M. Cain (#1)
#47Thor Lancelot Simon
tls@rek.tjls.com
In reply to: Tom Lane (#45)
#48David Laight
david@l8s.co.uk
In reply to: Thor Lancelot Simon (#44)
#49Greywolf
greywolf@starwolf.com
In reply to: D'Arcy J.M. Cain (#42)
#50Greywolf
greywolf@starwolf.com
In reply to: Tom Lane (#45)
#51Christopher Kings-Lynne
chriskl@familyhealth.com.au
In reply to: Justin Clift (#34)
#52Andrew Gillham
gillham@vaultron.com
In reply to: David Laight (#48)