Out of memory in CIFS leads to database crash

Started by Umesh Kirdat, over 12 years ago, 3 messages, bugs
#1 Umesh Kirdat
umesh.kirdat@yahoo.com

Hello All,
 
In our setup, under heavy load (too many clients performing
updates), we have observed that the underlying CIFS module runs out of memory and the
database crashes or goes into recovery mode.
 
Nov 28 09:17:32 ng78 kernel: CIFS VFS
(1006f1e5e,pid=19342): Error in Open = Out of memory<3> ISVS(0615,i=96)
ro nOe u fmoy<3> CIFS VFS(1006f16,pid196) ror i pn u fmmor
Nov 28 09:17:32 ng78 postgres[19342]: [10-1] 192.168.20.78 19342 2013-11-28
09:17:32.882 PST ERROR: could not open file "base/16384/16794":
Cannot allocate memory
Nov 28 09:17:32 ng78 postgres[19342]: [10-2] 192.168.20.78 19342 2013-11-28
09:17:32.882 PST STATEMENT: select
 
The physical memory on the machine is 64 GB
Postgres version 9.0.4
Hardware is 64 bit
 
I wish to know why the database is crashing if the file
open fails. Why can't it handle this gracefully by rolling back the transaction?
Umesh

#2 Jeff Janes
jeff.janes@gmail.com
In reply to: Umesh Kirdat (#1)
Re: Out of memory in CIFS leads to database crash

On Tue, Jan 7, 2014 at 2:03 AM, Umesh Kirdat <umesh.kirdat@yahoo.com> wrote:

Hello All,

In our setup, under heavy load (too many clients performing updates), we have
observed that the underlying CIFS module runs out of memory and the database
crashes or goes into recovery mode.

Nov 28 09:17:32 ng78 kernel: CIFS VFS (1006f1e5e,pid=19342): Error in Open
= Out of memory<3> ISVS(0615,i=96) ro nOe u fmoy<3> CIFS
VFS(1006f16,pid196) ror i pn u fmmor
Nov 28 09:17:32 ng78 postgres[19342]: [10-1] 192.168.20.78 19342
2013-11-28 09:17:32.882 PST ERROR: could not open file "base/16384/16794":
Cannot allocate memory
Nov 28 09:17:32 ng78 postgres[19342]: [10-2] 192.168.20.78 19342
2013-11-28 09:17:32.882 PST STATEMENT: select

The physical memory on the machine is 64 GB
Postgres version 9.0.4
Hardware is 64 bit

I wish to know why the database is crashing if the file open fails. Why
can't it handle this gracefully by rolling back the transaction?

Based on the section of the log you are showing, it looks like it did just
roll back the transaction. A crash should be showing you PANIC messages,
not just ERROR. Is there more to the log than you are showing? If you are
logging over CIFS as well, perhaps the PANIC messages are getting lost
because they can't be logged.
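
One way to check a larger portion of the log for the distinction drawn above is to count severity keywords. The following is a minimal sketch, not a PostgreSQL tool: the function name count_severities and the regular expression are assumptions made for illustration, and the sample lines are the ones quoted in this thread, joined back into single lines.

```python
import re
from collections import Counter

# Severity keywords as they appear in PostgreSQL log lines, e.g. "PST ERROR: ...".
# The match is case-sensitive, so the kernel's "Error in Open" is not counted.
SEVERITY_RE = re.compile(r"\b(PANIC|FATAL|ERROR|WARNING):")


def count_severities(lines):
    """Count how often each severity keyword appears in the given log lines."""
    counts = Counter()
    for line in lines:
        match = SEVERITY_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Sample input: the log lines quoted in this thread.
sample_log = [
    "Nov 28 09:17:32 ng78 kernel: CIFS VFS (1006f1e5e,pid=19342): "
    "Error in Open = Out of memory",
    "Nov 28 09:17:32 ng78 postgres[19342]: [10-1] 192.168.20.78 19342 "
    '2013-11-28 09:17:32.882 PST ERROR: could not open file "base/16384/16794": '
    "Cannot allocate memory",
    "Nov 28 09:17:32 ng78 postgres[19342]: [10-2] 192.168.20.78 19342 "
    "2013-11-28 09:17:32.882 PST STATEMENT: select",
]

counts = count_severities(sample_log)
for level in ("PANIC", "FATAL", "ERROR", "WARNING"):
    print(level, counts[level])
```

To scan a real log, pass an open file object instead of the list. A nonzero PANIC count would point to an actual crash. A log containing only ERROR entries is consistent with transactions simply being rolled back.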

I don't think that running with the data directory on CIFS is supported. I
certainly wouldn't be brave enough to do that with data I care about.

Cheers,

Jeff

#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jeff Janes (#2)
Re: Out of memory in CIFS leads to database crash

Jeff Janes <jeff.janes@gmail.com> writes:

On Tue, Jan 7, 2014 at 2:03 AM, Umesh Kirdat <umesh.kirdat@yahoo.com> wrote:

I wish to know why the database is crashing if the file open fails. Why
can't it handle this gracefully by rolling back the transaction?

Based on the section of the log you are showing, it looks like it did just
roll back the transaction. A crash should be showing you PANIC messages,
not just ERROR. Is there more to the log than you are showing? If you are
logging over CIFS as well, perhaps the PANIC messages are getting lost
because they can't be logged.

We will PANIC on I/O failure involving the WAL log, but as you say, this
log extract isn't showing instances of that. I/O failures on ordinary
data files shouldn't result in a panic. (I'm not sure whether it'd be
practical to downgrade the panic for WAL write failures. Certainly, the
database won't be good for much if it can't commit transactions. A WAL
write failure also implies that data from transactions besides the one
doing the write may be in jeopardy, so just pretending that the system
as a whole can carry on doesn't sound all that safe.)

I don't think that running with the data directory on CIFS is supported. I
certainly wouldn't be brave enough to do that with data I care about.

You should certainly be keeping the WAL log on a trustworthy filesystem;
and frankly I'm not sure what the point is of using a database on
known-untrustworthy storage of any breed. We can't be more reliable
than the underlying storage is.
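
To see which filesystem the data directory and the WAL directory actually live on, one option on Linux is to look them up in /proc/mounts. The following is a rough sketch under stated assumptions: the helper names, the NETWORK_FS set, and the mount points in the sample text are all hypothetical. Only pg_xlog, the name of the WAL directory in the 9.0 series, comes from PostgreSQL itself.

```python
import posixpath

# Filesystem types treated as network storage in this sketch. This set is an
# assumption for illustration, not an official PostgreSQL list.
NETWORK_FS = {"cifs", "smbfs", "smb3", "nfs", "nfs4"}


def parse_mounts(text):
    """Parse /proc/mounts-style text into (mount_point, fs_type) pairs."""
    mounts = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            mounts.append((fields[1], fields[2]))
    return mounts


def fs_type_for(path, mounts):
    """Return the fs type of the longest mount point containing path."""
    path = posixpath.normpath(path)
    best_point, best_type = "", None
    for mount_point, fs_type in mounts:
        point = mount_point.rstrip("/") or "/"
        inside = path == point or path.startswith(point.rstrip("/") + "/")
        # ">=" lets a later mount on the same mount point override an earlier one.
        if inside and len(point) >= len(best_point):
            best_point, best_type = point, fs_type
    return best_type


def is_network_fs(path, mounts):
    """True if path sits on one of the filesystem types in NETWORK_FS."""
    return fs_type_for(path, mounts) in NETWORK_FS


# Hypothetical mount table: data directory on CIFS, WAL directory on local ext4.
sample_mounts = parse_mounts(
    "/dev/sda1 / ext4 rw 0 0\n"
    "//nas/share /mnt/pgdata cifs rw 0 0\n"
    "/dev/sdb1 /mnt/pgdata/pg_xlog ext4 rw 0 0\n"
)

for directory in ("/mnt/pgdata", "/mnt/pgdata/pg_xlog"):
    print(directory, fs_type_for(directory, sample_mounts))
```

On a real system the text would come from reading /proc/mounts. Paths should be resolved with os.path.realpath first, because the WAL directory is often a symlink to another volume. The sketch also ignores the octal escapes (such as \040 for a space) that /proc/mounts uses in mount point names.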

regards, tom lane

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)