dsa_allocate() failure
I'm wondering if there is anything I can tune in my PG 10.1 database to
avoid these errors:
$ psql -f failing_query.sql
psql:failing_query.sql:46: ERROR: dsa_allocate could not find 7 free pages
CONTEXT: parallel worker
I tried throttling back the number of parallel workers to just 2, but
that didn't help.
The query joins two views, each of which has 50 or so underlying queries
unioned together. Unfortunately, due to an invalid index, it is doing
sequential scans on some of the tables. I can't fix the indexes until a few
create materialized view commands that are currently running (and have been
running for 6 days) finish, or until I kill them, because they are holding
a lock that is blocking any attempt to reindex.
So that leaves me looking for some tunable (hopefully one that doesn't
require a restart) which will fix this by adding sufficient resources to
the system to allow the dsa_allocate() to find enough (contiguous?) pages.
My system seems to have plenty of extra capacity.
There was a thread on pghackers in December where someone else was seeing a
similar error but couldn't reproduce it consistently. I've run the above
query hundreds of times over the last 24 hours; it only fails when I select
just the right parameters - and then it fails every time I run it with
those parameters.
In that thread someone speculated it had to do with running many parallel
bitmap heap scans in one query. I count 98 in the query plan.
I'm hoping there is a "magic X tunable" which I just need to bump up a
little to let queries like this run without the fatal failure.
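A count like the "98" above can be pulled out of the EXPLAIN output mechanically; a small illustrative sketch (the plan text below is invented for the example, not the failing query's real plan):

```python
# Count how often a node type appears in EXPLAIN output.
# The plan text here is a made-up miniature, not the real 98-scan plan.
plan_text = """\
Gather  (cost=1000.00..2000.00 rows=100 width=8)
  Workers Planned: 2
  ->  Parallel Bitmap Heap Scan on t1
        ->  Bitmap Index Scan on t1_idx
  ->  Parallel Bitmap Heap Scan on t2
        ->  Bitmap Index Scan on t2_idx
"""

def count_nodes(plan, node_type):
    """Count plan lines mentioning the given node type."""
    return sum(node_type in line for line in plan.splitlines())

print(count_nodes(plan_text, "Parallel Bitmap Heap Scan"))  # 2 in this toy plan
```

In practice one would feed it the output of EXPLAIN (or `EXPLAIN (FORMAT JSON)` for robust parsing) captured from psql.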
Rick Otten <rottenwindfish@gmail.com> writes:
I'm wondering if there is anything I can tune in my PG 10.1 database to
avoid these errors:
$ psql -f failing_query.sql
psql:failing_query.sql:46: ERROR: dsa_allocate could not find 7 free pages
CONTEXT: parallel worker
Hmm. There's only one place in the source code that emits that message
text:
    /*
     * Ask the free page manager for a run of pages.  This should always
     * succeed, since both get_best_segment and make_new_segment should
     * only return a non-NULL pointer if it actually contains enough
     * contiguous freespace.  If it does fail, something in our backend
     * private state is out of whack, so use FATAL to kill the process.
     */
    if (!FreePageManagerGet(segment_map->fpm, npages, &first_page))
        elog(FATAL,
             "dsa_allocate could not find %zu free pages", npages);
Now maybe that comment is being unreasonably optimistic, but it sure
appears that this is supposed to be a can't-happen case, in which case
you've found a bug.
cc'ing the DSA authors for comment.
regards, tom lane
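To illustrate the "contiguous freespace" point in the comment Tom quotes: an allocator that hands out runs of pages can fail to find, say, 7 free pages even when more than 7 are free in total, if no 7 of them are adjacent. A toy Python model (purely illustrative; PostgreSQL's real FreePageManager is a different, more sophisticated structure):

```python
# Toy model (NOT PostgreSQL's FreePageManager) of why an allocator that
# needs a *contiguous* run of pages can fail even with enough free pages.

def find_free_run(page_map, npages):
    """Return the start index of a run of npages free (True) pages, or None."""
    run_start = None
    run_len = 0
    for i, free in enumerate(page_map):
        if free:
            if run_len == 0:
                run_start = i
            run_len += 1
            if run_len == npages:
                return run_start
        else:
            run_len = 0
    return None

# 8 free pages in total, but fragmented into runs of 3, 3 and 2:
fragmented = [True, True, True, False, True, True, True, False, True, True]
assert sum(fragmented) == 8
assert find_free_run(fragmented, 7) is None   # no run of 7 -> allocation fails

# With 7 adjacent free pages the same request succeeds:
contiguous = [False] + [True] * 7 + [False]
assert find_free_run(contiguous, 7) == 1
```

The point of the quoted comment is that dsa.c only asks a segment for a run after verifying the segment claims to have one, which is why a failure here indicates corrupted state rather than ordinary fragmentation.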
On Tue, Jan 30, 2018 at 5:37 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Now maybe that comment is being unreasonably optimistic, but it sure
appears that this is supposed to be a can't-happen case, in which case
you've found a bug.
This is probably the bug fixed here:
/messages/by-id/E1eQzIl-0004wM-K3@gemulon.postgresql.org
That was back patched, so 10.2 will contain the fix. The bug was not
in dsa.c itself, but in the parallel query code that mixed up DSA
areas, corrupting them. The problem comes up when the query plan has
multiple Gather nodes (and a particular execution pattern) -- is that
the case here, in the EXPLAIN output? That seems plausible given the
description of a 50-branch UNION. The only workaround until 10.2
would be to reduce max_parallel_workers_per_gather to 0 to prevent
parallelism completely for this query.
--
Thomas Munro
http://www.enterprisedb.com
If I do a "set max_parallel_workers_per_gather=0;" before I run the query
in that session, it runs just fine.
If I set it to 2, the query dies with the dsa_allocate error.
I'll use that as a workaround until 10.2 comes out. Thanks! I have
something that will help.
dsa_allocate could not find 7 free pages
I just saw this error message again on all of my worker nodes (I am using
the Citus 7.4 release). The PG core is my own build of release_10_stable
(10.4) out of GitHub on Ubuntu.
What's the best way to debug this? I am running pre-production tests
for the next few days, so I could gather info if necessary (I cannot
pinpoint a query to repro this yet, as we have 10K queries running
concurrently).
On Wed, May 23, 2018 at 4:10 PM, Sand Stone <sand.m.stone@gmail.com> wrote:
dsa_allocate could not find 7 free pages
I just saw this error message again on all of my worker nodes (I am using
the Citus 7.4 release). The PG core is my own build of release_10_stable
(10.4) out of GitHub on Ubuntu.
At which commit ID?
All of your worker nodes... so this happened at the same time or at
different times? I don't know much about Citus -- do you mean that
these were separate PostgreSQL clusters, and they were all running the
same query and they all crashed like this?
What's the best way to debug this? I am running pre-production tests
for the next few days, so I could gather info. if necessary (I cannot
pinpoint a query to repro this yet, as we have 10K queries running
concurrently).
Any chance of an EXPLAIN plan for the query that crashed like this?
Do you know if it's using multiple Gather[Merge] nodes and parallel
bitmap heap scans? Was it a regular backend process or a parallel
worker process (or a Citus worker process, if that is a thing?) that
raised the error?
--
Thomas Munro
http://www.enterprisedb.com
At which commit ID?
83fcc615020647268bb129cbf86f7661feee6412 (5/6)
do you mean that these were separate PostgreSQL clusters, and they were all running the same query and they all crashed like this?
A few worker nodes; a table is hash partitioned by "aTable.did" by
Citus, and further partitioned by PG10 by time range on field "ts". As
far as I could tell, Citus just does a query rewrite, and executes the
same type of queries on all nodes.
so this happened at the same time or at different times?
At the same time. The queries are simple count and sum queries; here
is the relevant part from one of the worker nodes:
2018-05-23 01:24:01.492 UTC [130536] ERROR: dsa_allocate could not
find 7 free pages
2018-05-23 01:24:01.492 UTC [130536] CONTEXT: parallel worker
STATEMENT: COPY (SELECT count(1) AS count, sum(worker_column_1) AS
sum FROM (SELECT subquery.avg AS worker_column_1 FROM (SELECT
aTable.did, avg((aTable.sum OPERATOR(pg_catalog./)
(aTable.count)::double precision)) AS avg FROM public.aTable_102117
aTable WHERE ((aTable.ts OPERATOR(pg_catalog.>=) '2018-04-25
00:00:00+00'::timestamp with time zone) AND (aTable.ts
OPERATOR(pg_catalog.<=) '2018-04-30 00:00:00+00'::timestamp with time
zone) AND (aTable.v OPERATOR(pg_catalog.=) 12345)) GROUP BY
aTable.did) subquery) worker_subquery) TO STDOUT WITH (FORMAT binary)
a parallel worker process
I think this is more of a PG10 parallel bg worker issue. I don't think
Citus just lets each worker PG server do its own planning.
I will try to do more experiments on this, and see if there is any
specific query that causes the parallel query execution to fail. As far
as I can tell, the level of concurrency triggered this issue. That is,
when executing 10s of queries as shown on the worker nodes, the PG10
core may or may not spawn more bg workers, depending on the stats.
Thanks for your time!
Just as a follow up. I tried the parallel execution again (in a stress
test environment). Now the crash seems gone. I will keep an eye on
this for the next few weeks.
My theory is that the Citus cluster created and shut down a lot of TCP
connections between coordinator and workers. If running on untuned
Linux machines, the TCP ports might run out.
Of course, I am using "newer" PG10 bits and Citus7.5 this time.
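Sand Stone's port-exhaustion theory concerns the ephemeral port range that short-lived outbound TCP connections draw from. A hedged sketch of how one might check it on Linux (the /proc path is Linux-specific, and the sample numbers are a common default, not measurements from this cluster):

```python
# Linux keeps the ephemeral (client-side) port range in a proc file.
# Exhausting this range makes new outbound connections fail, which is
# the failure mode hypothesised for the coordinator/worker traffic.

def parse_port_range(text):
    """Parse the 'low high' pair as stored in ip_local_port_range."""
    low, high = map(int, text.split())
    return low, high

def ephemeral_ports_available(path="/proc/sys/net/ipv4/ip_local_port_range"):
    """Read the live range on a Linux host (path is Linux-specific)."""
    with open(path) as f:
        low, high = parse_port_range(f.read())
    return high - low + 1

# A common Linux default; an untuned box has fewer than 30k client ports.
low, high = parse_port_range("32768\t60999")
print(f"range {low}-{high}: {high - low + 1} ports")  # 28232 ports
```

Whether this is actually related to the dsa_allocate failure is, as Thomas notes later in the thread, far from clear.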
On Thu, Aug 16, 2018 at 8:32 AM, Sand Stone <sand.m.stone@gmail.com> wrote:
Just as a follow up. I tried the parallel execution again (in a stress
test environment). Now the crash seems gone. I will keep an eye on
this for the next few weeks.
Thanks for the report. That's great news, but it'd be good to
understand why it was happening.
My theory is that the Citus cluster created and shut down a lot of TCP
connections between coordinator and workers. If running on untuned
Linux machines, the TCP ports might run out.
I'm not sure how that's relevant, unless perhaps it causes executor
nodes to be invoked in a strange sequence that commit fd7c0fa7 didn't
fix? I wonder if there could be something different about the control
flow with custom scans, or something about the way Citus worker nodes
invoke plan fragments, or some error path that I failed to consider...
It's a clue that all of your worker nodes reliably crashed at the same
time on the same/similar queries (presumably distributed query
fragments for different shards), making it seem more like a
common-or-garden bug rather than some kind of timing-based heisenbug.
If you ever manage to reproduce it, an explain plan and a back trace
would be very useful.
Of course, I am using "newer" PG10 bits and Citus7.5 this time.
Hmm. There weren't any relevant commits to REL_10_STABLE that I can
think of. And (with the proviso that I know next to nothing about
Citus) I just cloned https://github.com/citusdata/citus.git and
skimmed through "git diff origin/release-7.4..origin/release-7.5", and
nothing is jumping out at me. Can you still see the problem with
Citus 7.4?
--
Thomas Munro
http://www.enterprisedb.com
Can you still see the problem with Citus 7.4?
Hi, Thomas. I actually went back to the cluster with Citus7.4 and
PG10.4. And modified the parallel param. So far, I haven't seen any
server crash.
The main difference between the runs that crashed and the ones that
didn't is the set of Linux TCP timeout parameters (tuned to release the
ports faster).
Unfortunately, I cannot "undo" the Linux params and run the stress
tests anymore, as this is a multi-million $ cluster and people are
doing more useful things on it. I will keep an eye on any parallel
execution issue.
I attached a query (and its query plan) that caused the crash:
"dsa_allocate could not find 13 free pages" on one of the worker nodes. I
anonymised the query text a bit. Interestingly, this time only one (the
same one) of the nodes is crashing. Since this is a production
environment, I cannot get the stack trace. Once I turned off parallel
execution for this node, the whole query finished just fine. So the
parallel query plan is from one of the nodes that did not crash; hopefully
the same plan would have been executed on the crashed node. In theory,
every worker node has the same bits, and very similar data.
===
psql (10.4)
\dx
List of installed extensions
Name | Version | Schema | Description
----------------+---------+------------+-----------------------------------
citus | 7.4-3 | pg_catalog | Citus distributed database
hll | 2.10 | public | type for storing hyperloglog data
plpgsql | 1.0 | pg_catalog | PL/pgSQL procedural language
Attachments:
parallelquery-crash-plan.txt (text/plain)
On Wed, Aug 29, 2018 at 5:48 PM Sand Stone <sand.m.stone@gmail.com> wrote:
I attached a query (and its query plan) that caused the crash: "dsa_allocate could not find 13 free pages" on one of the worker nodes. I anonymised the query text a bit. Interestingly, this time only one (the same one) of the nodes is crashing. Since this is a production environment, I cannot get the stack trace. Once I turned off parallel execution for this node, the whole query finished just fine. So the parallel query plan is from one of the nodes that did not crash; hopefully the same plan would have been executed on the crashed node. In theory, every worker node has the same bits, and very similar data.
I wonder if this was a different symptom of the problem fixed here:
/messages/by-id/194c0706-c65b-7d81-ab32-2c248c3e2344@2ndquadrant.com
Can you still reproduce it on current master, REL_11_STABLE or REL_10_STABLE?
--
Thomas Munro
http://www.enterprisedb.com
Hello,
I'm not sure whether this is connected at all, but I'm facing the same error with a generated query on PostgreSQL 10.6.
It works with parallel query disabled and gives "dsa_allocate could not find 7 free pages" otherwise.
I've attached the query and an strace. The table is partitioned on (o, date). It doesn't depend on the precise lists I'm using, while it obviously does depend on the fact that the optimizer chooses a parallel query.
Regards
Arne Roland
Hello,
does anybody have any idea what goes wrong here? Is there some additional information that could be helpful?
All the best
Arne Roland
On Tue, Jan 29, 2019 at 2:50 AM Arne Roland <A.Roland@index.de> wrote:
does anybody have any idea what goes wrong here? Is there some additional information that could be helpful?
Hi Arne,
This seems to be a bug; that error should not be reached. I wonder if
it is a different manifestation of the bug reported as #15585 (ie some
type of rare corruption). Are you able to reproduce this
consistently? Can you please show the query plan?
--
Thomas Munro
http://www.enterprisedb.com
Hello,
we are facing a similar issue on a Production system using PostgreSQL 10.6:
org.postgresql.util.PSQLException: ERROR: EXCEPTION on getstatistics ; ID: EXCEPTION on getstatistics_media ; ID: uidatareader.
run_query_media(2): [a1] REMOTE FATAL: dsa_allocate could not find 7 free pages
The query reads remotely (via pl/proxy) tables containing a lot of data (up to millions of rows for each table/node); after a remote "group by", only a few hundred rows from each "slave" node are returned to the caller "master" node.
The tables are partitioned using the INHERITANCE method, which we have been using for years with no issue. All tables have the same column structure and number, about 300 columns. In the query there are no joins, only a variable set of partitions depending on the date range.
The "REMOTE FATAL" refers to the pl/proxy that runs on 2 different slaves, [a0] and [a1], nodes with identical configuration and database structure, but it seems to fail only on node [a1].
When we get the error, reducing the date range and therefore the quantity of data read makes the error disappear; the same happens if we set max_parallel_workers_per_gather = 0.
Obviously we cannot force the user to use short periods of time to avoid the error, and so we have disabled the parallel query feature for the time being.
It is difficult to reproduce the issue because the user does not always get the error; furthermore, re-running the same query at different moments/days usually works. It is kind of weird.
We would prefer not to stop the Production system to upgrade it to PG11. And even then, would this guarantee a permanent fix?
Any suggestion?
Regards,
Fabio Isabettini
Voipfuture (Germany)
The failing node [a1] configuration:
OS: CentOS 7 kernel 3.10.0-862.11.6.el7.x86_64
Postgres: postgres-10.5-862.11.6.1
RAM: 256 GB (The main server containing the master node and [a0] node, the slave that has no issue, has 384 GB of RAM)
CPU cores: 32
shared_buffers = 64GB
max_worker_processes = 32
max_parallel_workers_per_gather = 8
max_parallel_workers = 32
On Tue, Jan 29, 2019 at 10:32 PM Fabio Isabettini
<fisabettini@voipfuture.com> wrote:
we are facing a similar issue on a Production system using a Postgresql 10.6:
org.postgresql.util.PSQLException: ERROR: EXCEPTION on getstatistics ; ID: EXCEPTION on getstatistics_media ; ID: uidatareader.
run_query_media(2): [a1] REMOTE FATAL: dsa_allocate could not find 7 free pages
We would like not to stop the Production system and upgrade it to PG11. And even though would this guarantee a permanent fix?
Any suggestion?
Hi Fabio,
Thanks for your report. Could you please also show the query plan
that runs on the "remote" node (where the error occurred)?
There is no indication that upgrading to PG11 would help here. It
seems we have an undiagnosed bug (in 10 and 11), and so far no one has
been able to reproduce it at will. I personally have chewed a lot of
CPU time on several machines trying various plan shapes and not seen
this or the possibly related symptom from bug #15585 even once. But
we have about three reports of each of the two symptoms. One reporter
wrote to me off-list to say that they'd seen #15585 twice, the second
time by running the same query in a tight loop for 8 hours, and then
not seen it again in the past 3 weeks. Clearly there is an issue needing
a fix here, but I don't yet know what it is.
--
Thomas Munro
http://www.enterprisedb.com
Hi Thomas,
it is a Production system and we don't have permanent access to it.
Also, having the auto_explain feature always on is not an option in production.
I will ask the customer to notify us as soon as the error presents itself, so we can connect immediately and try to get a query plan.
Regards
Fabio Isabettini
www.voipfuture.com
Hi Thomas,
this is reproducible, though it's highly sensitive to the change of plans (i.e. the precise queries that break change with every new ANALYZE). Disabling parallel query seems to solve the problem (as expected).
At some point even the simple query
select count(*) from test_tab where (o = '0' and date >= '30.01.2019'::date-interval '14 days' or o = '1' and date >= '30.01.2019'::date-interval '14 days') and coalesce(fid,fallback) >=6 and coalesce(fid,fallback) <=6
was reported to fail (with the same error) at the live database, but I wasn't able to obtain a plan, since it works again with the current live data (maybe autoanalyze is at fault here).
The table test_tab has roughly 70 children that inherit from it. The children and the corresponding indexes should be named like '%part%'.
I attached a query with a plan that fails on my test database.
I don't want to rule out the possibility that it could be related to #15585; at least both issues seem to be related to Parallel Bitmap and inheritance/partitioned tables. But the error occurs relatively quickly in my case, and every one of my processes (the children and the master) is failing with 'FATAL: dsa_allocate could not find 7 free pages'.
Regards
Arne
On Thu, Jan 31, 2019 at 06:19:54PM +0000, Arne Roland wrote:
this is reproducible, while it's highly sensitive to the change of plans. [...] I attached a query with a plan that fails on my test database.
Thanks - note that previously Thomas said:
On Mon, Dec 03, 2018 at 11:45:00AM +1300, Thomas Munro wrote:
On Sat, Dec 1, 2018 at 9:46 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
     elog(FATAL,
          "dsa_allocate could not find %zu free pages", npages);
+    abort();
If anyone can reproduce this problem with a debugger, it'd be
interesting to see the output of dsa_dump(area), and
FreePageManagerDump(segment_map->fpm). This error condition means
Are you able to cause the error in a test/devel/non-production environment to
run under a debugger, or could you compile with "abort();" after that elog() to
save a corefile?
Justin