Problem with lost connection while running long PL/R query
Hi,
I have a query that uses a PL/R function to run a statistical model on
data in a PostgreSQL table. The query runs the function 4 times, each
of which generates about 2 million rows of results, producing a final
table of about 8 million rows. Each call to the function takes about
3-5 minutes to finish its work. I know that the function is running
fine, as it reports its progress, and I have had no trouble using it
on slightly smaller datasets. This time, however, it completes the 4
runs but ends with psql saying the connection to the server has been
lost. In the PostgreSQL log file, I have:
2013-05-16 14:05:47 CEST LOG: server process (PID 27488) was terminated by signal 9: Killed
2013-05-16 14:05:47 CEST LOG: terminating any other active server processes
2013-05-16 14:05:47 CEST WARNING: terminating connection because of crash of another server process
2013-05-16 14:05:47 CEST DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
The process ID it mentions is that of the backend running my query:
postgres 27488 87.1 26.1 2272848 2101804 ? Rs 13:44 1:01 postgres: dkaplan fads [local] SELECT
Does anyone know what might be causing this problem and how to fix it?
Is something timing out? Or is it a memory problem? How would I go
about getting more diagnostic information?
Thanks for the help,
David
--
**********************************
David M. Kaplan
Charge de Recherche 1
Institut de Recherche pour le Developpement
Centre de Recherche Halieutique Mediterraneenne et Tropicale
av. Jean Monnet
B.P. 171
34203 Sete cedex
France
Phone: +33 (0)4 99 57 32 27
Fax: +33 (0)4 99 57 32 95
http://www.umr-eme.org/team/dkaplan/
http://www.amped.ird.fr/
**********************************
On Thu, May 16, 2013 at 02:47:28PM +0200, David M. Kaplan wrote:
> [...]
> This time, however, it completes the 4 runs but ends with psql saying
> the connection to the server has been lost. In the PostgreSQL log
> file, I have:
> 2013-05-16 14:05:47 CEST LOG: server process (PID 27488) was terminated by signal 9: Killed
This usually indicates the OOM killer has killed your backend. Please
see the documentation here:
http://www.postgresql.org/docs/9.2/interactive/kernel-resources.html#LINUX-MEMORY-OVERCOMMIT
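For reference, roughly what I would check and change from a root shell
(the exact kernel log wording varies by kernel version):

    # Confirm it really was the OOM killer (kernel log):
    dmesg | grep -i 'killed process'

    # Disable memory overcommit; add "vm.overcommit_memory = 2" to
    # /etc/sysctl.conf to make it persist across reboots:
    sysctl -w vm.overcommit_memory=2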
-Ryan
"David M. Kaplan" <david.kaplan@ird.fr> writes:
> [...]
> This time, however, it completes the 4 runs but ends with psql saying
> the connection to the server has been lost. In the PostgreSQL log
> file, I have:
> 2013-05-16 14:05:47 CEST LOG: server process (PID 27488) was terminated by signal 9: Killed
You've been hit by the dreaded OOM kill "feature" of Linux, as you can
confirm by looking into the kernel logs. A lot of people just turn that
off, especially on servers. Google for "memory overcommit" for more info.
regards, tom lane
Hi,
Thanks for the help. You have definitely identified the problem, but I
am still looking for a solution that works for me. I tried setting
vm.overcommit_memory=2, but this just made the query crash sooner than
before, though without killing the entire connection to the database. I
imagine this means that I really am trying to use more memory than the
system can handle?
I am wondering if there is a way to tell PostgreSQL to flush a set of
table rows out to disk so that the memory they are using can be freed.
Basically, I am creating my table of 8 million rows in 4 chunks, and
each chunk fits in memory. As all 4 are produced inside one query, I
imagine that at some point PostgreSQL tries to assemble the four chunks
and write them to disk as a table, and that is when things break. Is it
possible to tell PostgreSQL to write one "chunk" at a time? I imagine I
could do this with PL/pgSQL and iteration (a rough sketch below), but
perhaps there is a PostgreSQL configuration parameter that can help
with this?
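To make the iteration idea concrete, here is roughly what I have in
mind. The model_results table and the integer "chunk" argument are
hypothetical stand-ins; my real function's signature differs:

    DO $$
    BEGIN
        FOR i IN 1..4 LOOP
            -- Materialize one ~2M-row chunk and write it to the
            -- table before starting the next one.
            INSERT INTO model_results
                SELECT * FROM predict_classification_model(i);
            RAISE NOTICE 'chunk % of 4 done', i;
        END LOOP;
    END;
    $$;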
Thanks again,
David
On 05/16/2013 03:35 PM, Ryan Kelly wrote:
> This usually indicates the OOM killer has killed your backend. Please
> see the documentation here:
> http://www.postgresql.org/docs/9.2/interactive/kernel-resources.html#LINUX-MEMORY-OVERCOMMIT
"David M. Kaplan" <david.kaplan@ird.fr> writes:
> Thanks for the help. You have definitely identified the problem, but
> I am still looking for a solution that works for me. I tried setting
> vm.overcommit_memory=2, but this just made the query crash sooner
> than before, though without killing the entire connection to the
> database. I imagine this means that I really am trying to use more
> memory than the system can handle?
> I am wondering if there is a way to tell PostgreSQL to flush a set of
> table rows out to disk so that the memory they are using can be
> freed.
Assuming you don't have work_mem set to something unreasonably large,
it seems likely that the excessive memory consumption is inside your
PL/R function, and not the fault of Postgres per se. You might try
asking in some R-related forums about how to reduce the code's memory
usage.
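(A quick way to rule that out; the 64MB here is only an example value:)

    SHOW work_mem;           -- what each sort/hash step may use
    SET work_mem = '64MB';   -- session-level override, for testing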
Also, if by "crash" this time you meant you got an "out of memory" error
from Postgres, there should be a memory map in the postmaster log
showing all the memory consumption Postgres itself is aware of. If that
doesn't add up to a lot, it would be pretty solid proof that the problem
is inside R. If there are any memory contexts that seem to have bloated
unreasonably, knowing which one(s) would be helpful information.
regards, tom lane
On 05/16/2013 08:40 AM, Tom Lane wrote:
"David M. Kaplan" <david.kaplan@ird.fr> writes:
Thanks for the help. You have definitely identified the problem,
but I am still looking for a solution that works for me. I tried
setting vm.overcommit_memory=2, but this just made the query
crash quicker than before, though without killing the entire
connection to the database. I imagine that this means that I
really am trying to use more memory than the system can handle?I am wondering if there is a way to tell postgresql to flush a
set of table lines out to disk so that the memory they are using
can be liberated.Assuming you don't have work_mem set to something unreasonably
large, it seems likely that the excessive memory consumption is
inside your PL/R function, and not the fault of Postgres per se.
You might try asking in some R-related forums about how to reduce
the code's memory usage.
The two "classic" approaches to this with PL/R are either to create a
custom aggregate with the PL/R function as the final function (i.e.
work on one group at a time) or to use the SPI cursor functionality
within the PL/R function and page through your data with a cursor. Not
all forms of analysis lend themselves to these approaches, but perhaps
yours does; a skeletal example of the cursor approach follows.
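This is only a sketch, with placeholder names (my_table, the function
name, and the 10000-row batch size), but it shows the shape of it:

    CREATE OR REPLACE FUNCTION page_through_table()
    RETURNS void AS $$
        plan <- pg.spi.prepare("SELECT * FROM my_table")
        curs <- pg.spi.cursor_open("c1", plan)
        repeat {
            batch <- pg.spi.cursor_fetch(curs, TRUE, 10000)
            if (is.null(batch) || nrow(batch) == 0) break
            # ... run the model on this 10000-row data.frame ...
        }
        pg.spi.cursor_close(curs)
    $$ LANGUAGE plr;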
Ultimately I would like to implement a form of R data.frame that does
the paging with a cursor transparently for you, but I have not been
able to find the time so far.
Joe
--
Joe Conway
credativ LLC: http://www.credativ.us
Linux, PostgreSQL, and general Open Source
Training, Service, Consulting, & 24x7 Support
Hi Joe,
Thanks for responding, as you are clearly the expert on this sort of
problem. My current function already pages through the data using a
cursor, precisely to avoid out-of-memory problems, which is why I am
somewhat surprised and stumped as to how this can be happening. It does
return all the data at once, but one call to the function seems to
work, so I can't see why 4 wouldn't. I am currently planning to do some
test runs using memory.profile() to see if each successive call to the
PL/R function is somehow accumulating memory usage somewhere. Perhaps I
am not properly closing a query or something like that?
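(For the profiling, I am planning on something as simple as the
following at the start and end of the function body, so the counts land
in the postgres log as NOTICEs:)

    # Emit R's per-type allocation counts as NOTICEs:
    pg.thrownotice("Starting predict_classification_model")
    for (ln in capture.output(print(memory.profile())))
        pg.thrownotice(ln)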
I am attaching my code; perhaps you will have some insight. To give
you a basic idea of what I am trying to do: I have separately developed
a classification model for the state of a "system" based on the data in
the PostgreSQL table, and I want to apply that model to each row of the
table. I loop over the cursor and predict the state for batches of
10,000 rows at a time.
Thanks again for the help.
Cheers,
David
On 05/16/2013 11:40 PM, Joe Conway wrote:
> The two "classic" approaches to this with PL/R are either to create a
> custom aggregate with the PL/R function as the final function (i.e.
> work on one group at a time) or to use the SPI cursor functionality
> within the PL/R function and page through your data with a cursor.
Attachment: functions_model_prediction.sql (text/x-sql)
Hi,
Of course, today after a reboot to update to the newest kernel,
everything works without crashing...
I imagine that yesterday's problem was that I had forgotten I had a
Windows virtual machine running on the server that was eating a good
chunk of memory. Still, using a cursor to page through the data should
have avoided major memory difficulties...
As a footnote, I can confirm that PL/R was not eating up memory little
by little (see the results of memory.profile() at the beginning and end
of each function invocation below).
Thanks,
David
Run 1:
NOTICE: Starting predict_classification_model
NOTICE: NULL symbol pairlist closure environment promise
NOTICE: 1 7847 173768 4308 1129 6273
NOTICE: language special builtin char logical integer
NOTICE: 46996 189 1839 9387 7351 17229
NOTICE: double complex character ... any list
NOTICE: 2190 1 42810 1 0 13456
NOTICE: expression bytecode externalptr weakref raw S4
NOTICE: 3 10246 823 198 199 1061
NOTICE: Ending predict_classification_model
NOTICE: NULL symbol pairlist closure environment promise
NOTICE: 1 8062 181354 4360 1129 6273
NOTICE: language special builtin char logical integer
NOTICE: 50616 189 1839 2085318 7496 17972
NOTICE: double complex character ... any list
NOTICE: 2393 1 43459 1 0 13929
NOTICE: expression bytecode externalptr weakref raw S4
NOTICE: 3 10683 814 187 188 1061
Run 2:
NOTICE: Starting predict_classification_model
NOTICE: NULL symbol pairlist closure environment promise
NOTICE: 1 8062 180936 4360 1128 6271
NOTICE: language special builtin char logical integer
NOTICE: 50536 189 1839 9684 7496 17878
NOTICE: double complex character ... any list
NOTICE: 2258 1 43312 1 0 13899
NOTICE: expression bytecode externalptr weakref raw S4
NOTICE: 3 10683 813 188 189 1061
NOTICE: Ending predict_classification_model
NOTICE: NULL symbol pairlist closure environment promise
NOTICE: 1 8062 181356 4360 1129 6273
NOTICE: language special builtin char logical integer
NOTICE: 50617 189 1839 2085317 7496 17972
NOTICE: double complex character ... any list
NOTICE: 2393 1 43459 1 0 13929
NOTICE: expression bytecode externalptr weakref raw S4
NOTICE: 3 10683 814 187 188 1061
Run 3:
NOTICE: Starting predict_classification_model
NOTICE: NULL symbol pairlist closure environment promise
NOTICE: 1 8062 180936 4360 1128 6271
NOTICE: language special builtin char logical integer
NOTICE: 50536 189 1839 9684 7496 17878
NOTICE: double complex character ... any list
NOTICE: 2258 1 43312 1 0 13899
NOTICE: expression bytecode externalptr weakref raw S4
NOTICE: 3 10683 813 188 189 1061
NOTICE: Ending predict_classification_model
NOTICE: NULL symbol pairlist closure environment promise
NOTICE: 1 8062 181356 4360 1129 6273
NOTICE: language special builtin char logical integer
NOTICE: 50617 189 1839 2085319 7496 17972
NOTICE: double complex character ... any list
NOTICE: 2393 1 43459 1 0 13929
NOTICE: expression bytecode externalptr weakref raw S4
NOTICE: 3 10683 814 187 188 1061
On 05/16/2013 11:40 PM, Joe Conway wrote:
> [...] use the SPI cursor functionality within the PL/R function and
> page through your data with a cursor. [...]