ERROR: invalid memory alloc request size 1073741824
Hello,
We've tripped over an error when doing a "COPY.. TO STDOUT WITH BINARY"
query.
"ERROR: invalid memory alloc request size 1073741824"
(exactly 1GB)
So a palloc() call is failing on the AllocSizeIsValid() check.
Does anyone know if this is a safety catch we are tripping with a bad query,
or whether this check is something that should never be hit (i.e. a bug)?
There was some discussion before the 9.5beta1 release about making a
change in this area:
/messages/by-id/9A28C8860F777E439AA12E8AEA7694F8010F6F3F@BPXM15GP.gisp.nec.co.jp
/messages/by-id/CAKJS1f_sxPyBHF6=nEYwPJdKKDNjVDP91b_EJApfyGTei=ViDw@mail.gmail.com
postgresql 9.5.4 (2PGDG.rhel6)
work_mem = 256MB
Thanks,
Stefan
On 01/30/2018 06:35 PM, Stefan Blanke wrote:
Hello,
We've tripped over an error when doing a "COPY.. TO STDOUT WITH BINARY"
query.
"ERROR: invalid memory alloc request size 1073741824"
(exactly 1GB)
So a palloc() call is failing on the AllocSizeIsValid() check.
Does anyone know if this is a safety catch we are tripping with a bad
query, or whether this check is something that should never be hit
(i.e. a bug)?
That very much depends on where exactly the failure happens - it might
be both (or it might be relying on an assumption that has changed over
time). It's not clear
if it fails during query execution or formatting the output (I seem to
remember there were/are issues with rows containing multiple large bytea
values, for example).
Can you attach gdb to the backend, set a breakpoint on that elog and share
the backtrace? Then we can investigate further.
In case you don't have experience with doing that, here's a howto on the
community wiki:
https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD
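For reference, the wiki procedure amounts to roughly the following gdb session (breaking on errfinish, which every elog/ereport goes through; the backend PID here is a placeholder):

```
gdb -p <backend_pid>          # find the PID via pg_stat_activity
(gdb) break errfinish
(gdb) continue
# re-run the failing COPY; when the breakpoint fires:
(gdb) bt
```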
There was some discussion before the 9.5beta1 release about making a
change in this area:
/messages/by-id/9A28C8860F777E439AA12E8AEA7694F8010F6F3F@BPXM15GP.gisp.nec.co.jp
/messages/by-id/CAKJS1f_sxPyBHF6=nEYwPJdKKDNjVDP91b_EJApfyGTei=ViDw@mail.gmail.com
Both of those threads are about a hashjoin issue, which should have been
fixed in 9.5. Moreover, you haven't shared the query and its query
plan, so it's unclear whether it's doing hash joins at all. That would be
useful for investigating this issue, BTW.
postgresql 9.5.4 (2PGDG.rhel6)
work_mem = 256MB
FWIW, you're missing more than a year of bug fixes.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Jan 30, 2018 at 12:35 PM, Stefan Blanke
<stefan.blanke@framestore.com> wrote:
Hello,
We've tripped over an error when doing a "COPY.. TO STDOUT WITH BINARY"
query.
"ERROR: invalid memory alloc request size 1073741824"
(exactly 1GB)
I have my money on a corrupted TOAST entry. Is this happening on
trustworthy hardware or beige box with no ECC or RAID?
Regards, Jan
--
Jan Wieck
Senior Postgres Architect
http://pgblog.wi3ck.info
On 01/30/2018 10:43 PM, Jan Wieck wrote:
On Tue, Jan 30, 2018 at 12:35 PM, Stefan Blanke
<stefan.blanke@framestore.com> wrote:
We've tripped over an error when doing a "COPY.. TO STDOUT WITH
BINARY" query.
"ERROR: invalid memory alloc request size 1073741824"
(exactly 1GB)
I have my money on a corrupted TOAST entry. Is this happening on
trustworthy hardware or beige box with no ECC or RAID?
I'll bet you it's not that. It's quite unlikely that would fail with
exactly 1GB request size. It seems much more like a buffer that we keep
to be power of 2. The question is which one.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
That very much depends on where exactly the failure happens - it might
be both (or it might be relying on an assumption that has changed over
time). It's not clear if it fails during query execution or formatting
the output (I seem to remember there were/are issues with rows
containing multiple large bytea values, for example).
Can you attach gdb to the backend, set a breakpoint on that elog and share
the backtrace? Then we can investigate further.
I can't easily get a meaningful query plan (i.e. from the live system)
as the query that failed is based on a temporary table built by some
pl/pgsql (the database is more of a conveyor belt than a long term store).
Unfortunately it's difficult to reproduce the ERROR. This has only
occurred a couple of times on a live system in the last year and these
queries run frequently.
I think we will move to 9.5.10 and see if it ever happens again; then, if
it does, get a backtrace on that newer version. Incidentally we never saw
this error on 8.4 where this code was running from 2009 until 2017.
I'll bet you it's not that. It's quite unlikely that would fail with
exactly 1GB request size. It seems much more like a buffer that we keep
to be power of 2. The question is which one.
I had dismissed corruption before writing in. It's exactly 1GB every
time this has happened - and we can dump the full dataset periodically
without issue.
I have my money on a corrupted TOAST entry. Is this happening on
trustworthy hardware or beige box with no ECC or RAID?
It's good quality commercial hardware in our colo - not exactly sure what.
Thanks for taking the time to look at this!
Stefan
On Wed, Jan 31, 2018 at 12:32 PM, Stefan Blanke
<stefan.blanke@framestore.com> wrote:
I'll bet you it's not that. It's quite unlikely that would fail with
exactly 1GB request size. It seems much more like a buffer that we keep
to be power of 2. The question is which one.
I had dismissed corruption before writing in. It's exactly 1GB every time
this has happened - and we can dump the full dataset periodically without
issue.
I have my money on a corrupted TOAST entry. Is this happening on
trustworthy hardware or beige box with no ECC or RAID?
It's good quality commercial hardware in our colo - not exactly sure what.
If it is a sporadic issue and you can dump the full dataset, then I just
lost my money (Tomas, you coming to PGConf in Jersey City?).
But then, if this is a plain COPY to stdout ... I am wondering what could
possibly be in that path that wants to allocate 1GB. Or is this not so
plain but rather a COPY ... SELECT ...?
Regards, Jan
--
Jan Wieck
Senior Postgres Architect
http://pgblog.wi3ck.info
On 01/31/2018 09:51 PM, Jan Wieck wrote:
On Wed, Jan 31, 2018 at 12:32 PM, Stefan Blanke
<stefan.blanke@framestore.com> wrote:
I'll bet you it's not that. It's quite unlikely that would fail with
exactly 1GB request size. It seems much more like a buffer that we keep
to be power of 2. The question is which one.
I had dismissed corruption before writing in. It's exactly 1GB every
time this has happened - and we can dump the full dataset
periodically without issue.
I have my money on a corrupted TOAST entry. Is this happening on
trustworthy hardware or beige box with no ECC or RAID?
It's good quality commercial hardware in our colo - not exactly sure
what.
If it is a sporadic issue and you can dump the full dataset, then I just
lost my money (Tomas, you coming to PGConf in Jersey City?).
Unfortunately no, but I'm sure there will be other opportunities to buy
me a beer ;-) Like pgcon, for example.
But then, if this is a plain COPY to stdout ... I am wondering what
could possibly be in that path that wants to allocate 1GB. Or is this
not so plain but rather a COPY ... SELECT ...?
That's what I've been guessing, and why I was asking for a query plan.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,
We've upgraded to PostgreSQL 11.5 (postgresql.org rhel 6 rpm) and I have
had another occurrence of this invalid alloc of 1GB. Apologies for never
providing a query plan when discussing this two years ago; we decided to
move to a newer PostgreSQL to see if the issue went away, but it took a
while to complete the move.
The invalid alloc still occurs only occasionally, every few months, on a
query that we run every minute; so unfortunately we still don't have a
contained reproducible test case.
This is the SQL we are running, with a query plan - the query plan is
from a new empty database so the planner has no stats.
CREATE OR REPLACE FUNCTION
create_table()
RETURNS TABLE (a INTEGER, b INTEGER)
AS $$
UPDATE y
SET c = true,
d = false
WHERE e IS NOT true
AND f IS NOT true
AND g = 1
AND h = 0
AND i = 0
AND (j IS NULL
OR j > 0)
RETURNING y.a, y.b;
$$ LANGUAGE SQL;
-- Prepared statement (PQprepare)
CREATE TEMPORARY TABLE temp_table AS
SELECT * FROM create_table()
-- Prepared statement (PQprepare)
SELECT y.a, y.b,
x.k,
x.l,
y.m,
y.n,
y.o
FROM temp_table
INNER JOIN y ON temp_table.b = y.b
AND temp_table.a = y.a
INNER JOIN x ON x.a = y.a
-- The two prepared statements are executed one after another
-- in the order shown many times an hour.
The query plan for the second prepared statement is:
Nested Loop (cost=17.14..64.38 rows=16 width=112)
-> Hash Join (cost=17.00..61.47 rows=16 width=80)
Hash Cond: ((temp_table.b = y.b) AND (temp_table.a = y.a))
-> Seq Scan on temp_table (cost=0.00..32.60 rows=2260 width=8)
-> Hash (cost=12.80..12.80 rows=280 width=76)
-> Seq Scan on y (cost=0.00..12.80 rows=280 width=76)
-> Index Scan using x_pkey on x (cost=0.14..0.18 rows=1 width=40)
Index Cond: (a = temp_table.a)
Thanks,
Stefan
Stefan Blanke <stefan.blanke@framestore.com> writes:
We've upgraded to PostgreSQL 11.5 (postgresql.org rhel 6 rpm) and I have
had another occurrence of this invalid alloc of 1GB. Apologies for never
providing a query plan when discussing this two years ago; we decided to
move to a newer PostgreSQL to see if the issue went away but took a
while to complete the move.
The invalid alloc still only occurs occasionally every few months on a
query that we run every minute; so unfortunately we still don't have a
contained reproducible test case.
Given the lack of stats, I wonder whether the issue could be related
to the plan sometimes being horribly bad, eg due to the temp table
being much larger than expected. (A possible mechanism would be
hash table bloat, perhaps, but that's getting way ahead of the
evidence.)
Could you adjust your process to log the actual temp table size
each time, ie "select count(*) from temp_table" in between the
two steps, and then note whether the failures are correlated
with unusual temp table sizes?
regards, tom lane
Could you adjust your process to log the actual temp table size
each time, ie "select count(*) from temp_table" in between the
two steps, and then note whether the failures are correlated
with unusual temp table sizes?
Thanks, I'll add a count there and come back with a number when we next
hit the "invalid memory alloc".
Stefan