NFS vs. PostgreSQL on Solaris

Started by Thomas F. O'Connellalmost 19 years ago4 messagesgeneral
Jump to latest
#1Thomas F. O'Connell
tf@o.ptimized.com

I've got an interesting case study that I don't fully understand from
a postgres perspective, and I'm hoping that someone in the community
might help me understand what led to the outcome and whether it's
easily prevented or not.

The setup: PostgreSQL 8.2.3 on Solaris 10 x86

postgresql.conf:
shared_buffers = 200000

This had been a box that had been completely dedicated to postgres. A
new project required that the box start to also share a directory
over NFS. In preparation for this, nfsd had been running for quite
some time, although today was the first day that production mount
points on other boxes were added. The first external mount point was
created this morning, and all seemed quiet. The second external mount
point was then added this afternoon. That second mount point caused
postgres great consternation.

Within several few seconds of enabling the second external NFS mount
point, postgres began reporting out of memory errors. An INSERT
generated the following:

TopMemoryContext: 118832 total in 10 blocks; 17512 free (24 chunks);
101320 used
TopTransactionContext: 8192 total in 1 blocks; 7856 free (0 chunks);
336 used
Type information cache: 8192 total in 1 blocks; 1800 free (0 chunks);
6392 used
Operator class cache: 8192 total in 1 blocks; 4872 free (0 chunks);
3320 used
MessageContext: 262144 total in 6 blocks; 75208 free (5 chunks);
186936 used
smgr relation table: 8192 total in 1 blocks; 1776 free (0 chunks);
6416 used
TransactionAbortContext: 32768 total in 1 blocks; 32752 free (0
chunks); 16 used
Portal hash: 8192 total in 1 blocks; 3912 free (0 chunks); 4280 used
PortalMemory: 8192 total in 1 blocks; 8040 free (0 chunks); 152 used
PortalHeapMemory: 1024 total in 1 blocks; 400 free (0 chunks); 624 used
ExecutorState: 147648 total in 7 blocks; 59816 free (11 chunks);
87832 used
HashTableContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
HashBatchContext: 25174100 total in 13 blocks; 7754600 free (16
chunks); 17419500 used
TupleSort: 4448280 total in 10 blocks; 1888 free (25 chunks); 4446392
used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
Unique: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
AggContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
Relcache by OID: 8192 total in 1 blocks; 2856 free (0 chunks); 5336 used
CacheMemoryContext: 659000 total in 19 blocks; 140248 free (1
chunks); 518752 us
ed

This was followed by a number of statements about individual
relations and, eventually, this:

MdSmgr: 8192 total in 1 blocks; 7120 free (0 chunks); 1072 used
LOCALLOCK hash: 8192 total in 1 blocks; 3912 free (0 chunks); 4280 used
2007-04-26 14:47:42 CDT 15175 :LOG: could not fork new process for
connection: Not enough space
2007-04-26 14:47:42 CDT 15175 :LOG: could not fork new process for
connection: Not enough space
2007-04-26 14:47:42 CDT 15175 :LOG: could not fork new process for
connection: Not enough space
2007-04-26 14:47:42 CDT 15175 :LOG: could not fork new process for
connection: Not enough space
2007-04-26 14:47:42 CDT 15175 :LOG: could not fork new process for
connection: Not enough space
2007-04-26 14:47:42 CDT 15175 :LOG: could not fork new process for
connection: Not enough space
Timezones: 48616 total in 2 blocks; 5968 free (0 chunks); 42648 used
ErrorContext: 8192 total in 1 blocks; 8176 free (4 chunks); 16 used
2007-04-26 14:47:42 CDT 17018 10.0.4.168(46723):ERROR: out of memory
2007-04-26 14:47:42 CDT 17018 10.0.4.168(46723):DETAIL: Failed on
request of size 257.

This happened intermittently for the next several minutes, eventually
building to FATAL:

2007-04-26 14:50:12 CDT 18421 10.0.4.168(44414):FATAL: out of memory
2007-04-26 14:50:12 CDT 18421 10.0.4.168(44414):DETAIL: Failed on
request of size 1864.

The FATAL error was preceded by a similar report to the original
ERRORs. I don't know what distinguishes the ERROR "out of memory"
from the FATAL version.

Then, eventually, without any evidence of other things on the box
behaving badly, we saw this:

2007-04-26 14:53:05 CDT 15175 :LOG: server process (PID 18543) was
terminated by signal 11
2007-04-26 14:53:05 CDT 15175 :LOG: terminating any other active
server processes

So my questions are:

1. What aspect of postgres' memory usage would create an "out of
memory" condition?
2. What is the difference between an ERROR and FATAL "out of memory"
message?
3. What would cause postgres to die from a signal 11?

I've also got a core file if that's necessary for further forensics.

--
Thomas F. O'Connell

optimizing modern web applications
: for search engines, for usability, and for performance :

http://o.ptimized.com/
615-260-0005

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas F. O'Connell (#1)
Re: NFS vs. PostgreSQL on Solaris

"Thomas F. O'Connell" <tf@o.ptimized.com> writes:

1. What aspect of postgres' memory usage would create an "out of
memory" condition?

I'm guessing you ran the box out of swap space --- look into what other
processes got started as a result of adding the NFS mount, and how much
memory they wanted to eat.

3. What would cause postgres to die from a signal 11?
I've also got a core file if that's necessary for further forensics.

Send gdb backtrace, please.

regards, tom lane

#3Thomas F. O'Connell
tf@o.ptimized.com
In reply to: Tom Lane (#2)
Re: NFS vs. PostgreSQL on Solaris

On Apr 26, 2007, at 6:51 PM, Tom Lane wrote:

"Thomas F. O'Connell" <tf@o.ptimized.com> writes:

1. What aspect of postgres' memory usage would create an "out of
memory" condition?

I'm guessing you ran the box out of swap space --- look into what
other
processes got started as a result of adding the NFS mount, and how
much
memory they wanted to eat.

3. What would cause postgres to die from a signal 11?
I've also got a core file if that's necessary for further forensics.

Send gdb backtrace, please.

regards, tom lane

Unfortunately, the production build in question is lacking --enable-
debug. :(

--
Thomas F. O'Connell

optimizing modern web applications
: for search engines, for usability, and for performance :

http://o.ptimized.com/
615-260-0005

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas F. O'Connell (#3)
Re: NFS vs. PostgreSQL on Solaris

"Thomas F. O'Connell" <tf@o.ptimized.com> writes:

On Apr 26, 2007, at 6:51 PM, Tom Lane wrote:

"Thomas F. O'Connell" <tf@o.ptimized.com> writes:

3. What would cause postgres to die from a signal 11?
I've also got a core file if that's necessary for further forensics.

Send gdb backtrace, please.

Unfortunately, the production build in question is lacking --enable-
debug. :(

Well, if it wasn't actually stripped then gdb could still get function
names out of it, which might or might not be enough but it's sure more
info than you provided so far.

If you built with gcc, then a possible plan B is to recompile with all
the same options plus --enable-debug, and hope that the resulting
executables are bit-for-bit the same except for addition of debug
symbols, so you could use them with the corefile. This theoretically
should work, if nothing has changed in your build environment, though
that assumption is obviously a bit shaky.

regards, tom lane