dsa_allocate() failure

Started by Rick Otten · almost 8 years ago · 43 messages
#1 Rick Otten
rottenwindfish@gmail.com

I'm wondering if there is anything I can tune in my PG 10.1 database to
avoid these errors:

$ psql -f failing_query.sql
psql:failing_query.sql:46: ERROR: dsa_allocate could not find 7 free pages
CONTEXT: parallel worker

I tried throttling back the number of parallel workers to just 2, but that
didn't help.

The query joins two views that each union 50 or so underlying queries.
Unfortunately, due to an invalid index, it is doing sequential scans on some
of the tables. I can't fix the indexes until a few CREATE MATERIALIZED VIEW
commands that are currently running (and have been running for 6 days) finish
or I kill them, because they are holding a lock that blocks any attempt to
reindex.

So that leaves me looking for some tunable (hopefully one that doesn't
require a restart) which will fix this by adding sufficient resources to
the system to allow the dsa_allocate() to find enough (contiguous?) pages.
My system seems to have plenty of extra capacity.

There was a thread on pgsql-hackers in December where someone else was seeing
a similar error but couldn't reproduce it consistently. I've run the above
query hundreds of times over the last 24 hours, but it only fails when I
select just the right parameters - and then it fails every time I run it with
those parameters.

In that thread someone speculated it had to do with running many parallel
bitmap heap scans in one query. I count 98 in the query plan.
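
For illustration, a minimal, hypothetical sketch of that query shape (toy table and view names; the real views union ~50 branches each over much larger tables):

-- Toy reproduction of the query shape only; all names here are made up.
CREATE TABLE t1 (id int, val text);
CREATE TABLE t2 (id int, val text);
CREATE TABLE t3 (id int, val text);
CREATE TABLE t4 (id int, val text);

CREATE VIEW view_a AS
    SELECT id, val FROM t1
    UNION ALL
    SELECT id, val FROM t2;   -- the real view has ~50 branches

CREATE VIEW view_b AS
    SELECT id, val FROM t3
    UNION ALL
    SELECT id, val FROM t4;   -- likewise

-- Joining the two views: with parallel query enabled, each branch can get its
-- own parallel (bitmap heap) scan, and the plan may contain many such nodes.
EXPLAIN
SELECT count(*)
FROM view_a a
JOIN view_b b USING (id)
WHERE a.val = 'x';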

I'm hoping there is a "magic X tunable" which I just need to bump up a
little to let queries like this run without the fatal failure.

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Rick Otten (#1)
Re: dsa_allocate() failure

Rick Otten <rottenwindfish@gmail.com> writes:

I'm wondering if there is anything I can tune in my PG 10.1 database to
avoid these errors:

$ psql -f failing_query.sql
psql:failing_query.sql:46: ERROR: dsa_allocate could not find 7 free pages
CONTEXT: parallel worker

Hmm. There's only one place in the source code that emits that message
text:

    /*
     * Ask the free page manager for a run of pages.  This should always
     * succeed, since both get_best_segment and make_new_segment should
     * only return a non-NULL pointer if it actually contains enough
     * contiguous freespace.  If it does fail, something in our backend
     * private state is out of whack, so use FATAL to kill the process.
     */
    if (!FreePageManagerGet(segment_map->fpm, npages, &first_page))
        elog(FATAL,
             "dsa_allocate could not find %zu free pages", npages);

Now maybe that comment is being unreasonably optimistic, but it sure
appears that this is supposed to be a can't-happen case, in which case
you've found a bug.

cc'ing the DSA authors for comment.

regards, tom lane

#3 Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Tom Lane (#2)
Re: dsa_allocate() failure

On Tue, Jan 30, 2018 at 5:37 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Rick Otten <rottenwindfish@gmail.com> writes:

I'm wondering if there is anything I can tune in my PG 10.1 database to
avoid these errors:

$ psql -f failing_query.sql
psql:failing_query.sql:46: ERROR: dsa_allocate could not find 7 free pages
CONTEXT: parallel worker

Hmm. There's only one place in the source code that emits that message
text:

    /*
     * Ask the free page manager for a run of pages.  This should always
     * succeed, since both get_best_segment and make_new_segment should
     * only return a non-NULL pointer if it actually contains enough
     * contiguous freespace.  If it does fail, something in our backend
     * private state is out of whack, so use FATAL to kill the process.
     */
    if (!FreePageManagerGet(segment_map->fpm, npages, &first_page))
        elog(FATAL,
             "dsa_allocate could not find %zu free pages", npages);

Now maybe that comment is being unreasonably optimistic, but it sure
appears that this is supposed to be a can't-happen case, in which case
you've found a bug.

This is probably the bug fixed here:

/messages/by-id/E1eQzIl-0004wM-K3@gemulon.postgresql.org

That was back patched, so 10.2 will contain the fix. The bug was not
in dsa.c itself, but in the parallel query code that mixed up DSA
areas, corrupting them. The problem comes up when the query plan has
multiple Gather nodes (and a particular execution pattern) -- is that
the case here, in the EXPLAIN output? That seems plausible given the
description of a 50-branch UNION. The only workaround until 10.2
would be to reduce max_parallel_workers_per_gather to 0 to prevent
parallelism completely for this query.
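
For example, rather than disabling parallelism cluster-wide, the setting can be scoped more narrowly (a sketch; "mydb" is a hypothetical database name):

SET max_parallel_workers_per_gather = 0;           -- current session only

BEGIN;
SET LOCAL max_parallel_workers_per_gather = 0;     -- just for this transaction
-- run the affected query here
COMMIT;

-- Or pin it for one database until the fix is installed:
ALTER DATABASE mydb SET max_parallel_workers_per_gather = 0;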

--
Thomas Munro
http://www.enterprisedb.com

#4 Rick Otten
rottenwindfish@gmail.com
In reply to: Thomas Munro (#3)
Re: dsa_allocate() failure

If I do a "set max_parallel_workers_per_gather=0;" before I run the query
in that session, it runs just fine.
If I set it to 2, the query dies with the dsa_allocate error.

I'll use that as a workaround until 10.2 comes out. Thanks! That gives me
something that will help.


#5 Sand Stone
sand.m.stone@gmail.com
In reply to: Rick Otten (#4)
Re: dsa_allocate() failure

dsa_allocate could not find 7 free pages

I just got this error message again on all of my worker nodes (I am using
the Citus 7.4 release). The PG core is my own build of release_10_stable
(10.4) from GitHub, on Ubuntu.

What's the best way to debug this? I am running pre-production tests
for the next few days, so I could gather info if necessary (I cannot
pinpoint a query to repro this yet, as we have 10K queries running
concurrently).


#6 Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Sand Stone (#5)
Re: dsa_allocate() failure

On Wed, May 23, 2018 at 4:10 PM, Sand Stone <sand.m.stone@gmail.com> wrote:

dsa_allocate could not find 7 free pages

I just got this error message again on all of my worker nodes (I am using
the Citus 7.4 release). The PG core is my own build of release_10_stable
(10.4) from GitHub, on Ubuntu.

At which commit ID?

All of your worker nodes... so this happened at the same time or at
different times? I don't know much about Citus -- do you mean that
these were separate PostgreSQL clusters, and they were all running the
same query and they all crashed like this?

What's the best way to debug this? I am running pre-production tests
for the next few days, so I could gather info if necessary (I cannot
pinpoint a query to repro this yet, as we have 10K queries running
concurrently).

Any chance of an EXPLAIN plan for the query that crashed like this?
Do you know if it's using multiple Gather[Merge] nodes and parallel
bitmap heap scans? Was it a regular backend process or a parallel
worker process (or a Citus worker process, if that is a thing?) that
raised the error?

--
Thomas Munro
http://www.enterprisedb.com

#7 Sand Stone
sand.m.stone@gmail.com
In reply to: Thomas Munro (#6)
Re: dsa_allocate() failure

At which commit ID?

83fcc615020647268bb129cbf86f7661feee6412 (5/6)

do you mean that these were separate PostgreSQL clusters, and they were all running the same query and they all crashed like this?

There are a few worker nodes; a table is hash-partitioned by "aTable.did" by
Citus, and further partitioned by PG10 by time range on the field "ts". As
far as I can tell, Citus just does a query rewrite and executes the
same type of queries on all nodes.
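
For context, a rough sketch of that kind of layout (purely illustrative: column types are guesses based on the logged query below, and the exact Citus call and ordering may differ by version):

-- PG10 declarative partitioning by time range on "ts"; the distribution step
-- requires the citus extension. All names and types here are assumptions.
CREATE TABLE aTable (
    did   bigint,
    ts    timestamptz,
    v     int,
    count bigint,
    sum   double precision
) PARTITION BY RANGE (ts);

CREATE TABLE aTable_2018_04 PARTITION OF aTable
    FOR VALUES FROM ('2018-04-01') TO ('2018-05-01');

-- Citus then hash-distributes the table by "did" across the worker nodes.
SELECT create_distributed_table('aTable', 'did');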

so this happened at the same time or at different times?

At the same time. The queries are simple count and sum queries; here
is the relevant part of the log from one of the worker nodes:
2018-05-23 01:24:01.492 UTC [130536] ERROR: dsa_allocate could not
find 7 free pages
2018-05-23 01:24:01.492 UTC [130536] CONTEXT: parallel worker
STATEMENT: COPY (SELECT count(1) AS count, sum(worker_column_1) AS
sum FROM (SELECT subquery.avg AS worker_column_1 FROM (SELECT
aTable.did, avg((aTable.sum OPERATOR(pg_catalog./)
(aTable.count)::double precision)) AS avg FROM public.aTable_102117
aTable WHERE ((aTable.ts OPERATOR(pg_catalog.>=) '2018-04-25
00:00:00+00'::timestamp with time zone) AND (aTable.ts
OPERATOR(pg_catalog.<=) '2018-04-30 00:00:00+00'::timestamp with time
zone) AND (aTable.v OPERATOR(pg_catalog.=) 12345)) GROUP BY
aTable.did) subquery) worker_subquery) TO STDOUT WITH (FORMAT binary)

a parallel worker process

I think this is more of a PG10 parallel bg worker issue. I don't think
Citus just lets each worker PG server do its own planning.

I will try to do more experiments and see if there is any specific query
that causes the parallel query execution to fail. As far as I can tell,
the level of concurrency triggered this issue: when executing tens of
queries of the kind shown above on the worker nodes, the PG10 core may or
may not spawn more bg workers, depending on the stats.

Thanks for your time!


#8 Sand Stone
sand.m.stone@gmail.com
In reply to: Sand Stone (#7)
Re: dsa_allocate() failure

Just as a follow up. I tried the parallel execution again (in a stress
test environment). Now the crash seems gone. I will keep an eye on
this for the next few weeks.

My theory is that the Citus cluster created and shut down a lot of TCP
connections between coordinator and workers. If running on untuned
Linux machines, the TCP ports might run out.

Of course, I am using "newer" PG10 bits and Citus 7.5 this time.


#9 Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Sand Stone (#8)
Re: dsa_allocate() failure

On Thu, Aug 16, 2018 at 8:32 AM, Sand Stone <sand.m.stone@gmail.com> wrote:

Just as a follow up. I tried the parallel execution again (in a stress
test environment). Now the crash seems gone. I will keep an eye on
this for the next few weeks.

Thanks for the report. That's great news, but it'd be good to
understand why it was happening.

My theory is that the Citus cluster created and shut down a lot of TCP
connections between coordinator and workers. If running on untuned
Linux machines, the TCP ports might run out.

I'm not sure how that's relevant, unless perhaps it causes executor
nodes to be invoked in a strange sequence that commit fd7c0fa7 didn't
fix? I wonder if there could be something different about the control
flow with custom scans, or something about the way Citus worker nodes
invoke plan fragments, or some error path that I failed to consider...
It's a clue that all of your worker nodes reliably crashed at the same
time on the same/similar queries (presumably distributed query
fragments for different shards), making it seem more like a
common-or-garden bug rather than some kind of timing-based heisenbug.
If you ever manage to reproduce it, an explain plan and a back trace
would be very useful.

Of course, I am using "newer" PG10 bits and Citus 7.5 this time.

Hmm. There weren't any relevant commits to REL_10_STABLE that I can
think of. And (with the proviso that I know next to nothing about
Citus) I just cloned https://github.com/citusdata/citus.git and
skimmed through "git diff origin/release-7.4..origin/release-7.5", and
nothing is jumping out at me. Can you still see the problem with
Citus 7.4?

--
Thomas Munro
http://www.enterprisedb.com

#10 Sand Stone
sand.m.stone@gmail.com
In reply to: Thomas Munro (#9)
Re: dsa_allocate() failure

Can you still see the problem with Citus 7.4?

Hi Thomas. I actually went back to the cluster with Citus 7.4 and
PG 10.4 and modified the parallel parameter. So far, I haven't seen any
server crash.

The main difference between the runs that crashed and those that didn't is
the set of Linux TCP timeout parameters (tuned to release the ports faster).
Unfortunately, I cannot "undo" the Linux params and run the stress
tests anymore, as this is a multi-million-dollar cluster and people are
doing more useful things on it. I will keep an eye out for any parallel
execution issue.


#11 Sand Stone
sand.m.stone@gmail.com
In reply to: Sand Stone (#10)
1 attachment(s)
Re: dsa_allocate() failure

I attached a query (and its query plan) that caused the crash:
"dsa_allocate could not find 13 free pages" on one of the worker nodes. I
anonymised the query text a bit. Interestingly, this time only one (and
always the same) of the nodes is crashing. Since this is a production
environment, I cannot get the stack trace. Once I turned off parallel
execution for this node, the whole query finished just fine. So the attached
parallel query plan is from one of the nodes that did not crash; hopefully
the same plan would have been executed on the crashed node. In theory, every
worker node has the same bits and very similar data.

===
psql (10.4)
\dx
List of installed extensions
   Name   | Version |   Schema   |            Description
----------+---------+------------+-----------------------------------
 citus    | 7.4-3   | pg_catalog | Citus distributed database
 hll      | 2.10    | public     | type for storing hyperloglog data
 plpgsql  | 1.0     | pg_catalog | PL/pgSQL procedural language


Attachments:

parallelquery-crash-plan.txt (text/plain)
#12 Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Sand Stone (#11)
Re: dsa_allocate() failure

On Wed, Aug 29, 2018 at 5:48 PM Sand Stone <sand.m.stone@gmail.com> wrote:

I attached a query (and its query plan) that caused the crash: "dsa_allocate could not find 13 free pages" on one of the worker nodes. I anonymised the query text a bit. Interestingly, this time only one (and always the same) of the nodes is crashing. Since this is a production environment, I cannot get the stack trace. Once I turned off parallel execution for this node, the whole query finished just fine. So the attached parallel query plan is from one of the nodes that did not crash; hopefully the same plan would have been executed on the crashed node. In theory, every worker node has the same bits and very similar data.

I wonder if this was a different symptom of the problem fixed here:

/messages/by-id/194c0706-c65b-7d81-ab32-2c248c3e2344@2ndquadrant.com

Can you still reproduce it on current master, REL_11_STABLE or REL_10_STABLE?

--
Thomas Munro
http://www.enterprisedb.com

#13 Arne Roland
A.Roland@index.de
In reply to: Thomas Munro (#12)
2 attachment(s)
RE: dsa_allocate() failure

Hello,

I'm not sure whether this is connected at all, but I'm facing the same error with a generated query on postgres 10.6.
It works with parallel query disabled and gives "dsa_allocate could not find 7 free pages" otherwise.

I've attached the query and an strace log. The table is partitioned on (o, date). It's not dependent on the precise lists I'm using, but it obviously does depend on the fact that the optimizer chooses a parallel query.

Regards
Arne Roland


Attachments:

strace.log (application/octet-stream)
query.sql (application/octet-stream)
#14 Arne Roland
A.Roland@index.de
In reply to: Arne Roland (#13)
RE: dsa_allocate() failure

Hello,

does anybody have any idea what goes wrong here? Is there some additional information that could be helpful?

All the best
Arne Roland

#15 Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Arne Roland (#14)
Re: dsa_allocate() failure

On Tue, Jan 29, 2019 at 2:50 AM Arne Roland <A.Roland@index.de> wrote:

does anybody have any idea what goes wrong here? Is there some additional information that could be helpful?

Hi Arne,

This seems to be a bug; that error should not be reached. I wonder if
it is a different manifestation of the bug reported as #15585 (ie some
type of rare corruption). Are you able to reproduce this
consistently? Can you please show the query plan?

--
Thomas Munro
http://www.enterprisedb.com

#16 Fabio Isabettini
fisabettini@voipfuture.com
In reply to: Thomas Munro (#15)
Re: dsa_allocate() failure

Hello,
we are facing a similar issue on a production system using PostgreSQL 10.6:

org.postgresql.util.PSQLException: ERROR: EXCEPTION on getstatistics ; ID: EXCEPTION on getstatistics_media ; ID: uidatareader.
run_query_media(2): [a1] REMOTE FATAL: dsa_allocate could not find 7 free pages

The query reads tables remotely (via PL/Proxy) that contain a lot of data (up to millions of rows for each table/node); after a remote "group by", only a few hundred rows from each "slave" node are returned to the calling "master" node.
The tables are partitioned using the INHERITANCE method that we have been using for years with no issue. All tables have the same column structure and number, about 300 columns. There are no joins in the query, only a variable set of partitions depending on the date range.
The "REMOTE FATAL" refers to PL/Proxy, which runs on 2 different slaves, [a0] and [a1], nodes with identical configuration and database structure, but it seems to fail only on node [a1].
When we get the error, reducing the date range (and therefore the quantity of data read) makes it disappear, as does setting max_parallel_workers_per_gather = 0.
Obviously we cannot force the user to use short periods of time to avoid the error, so we have disabled the parallel query feature for the time being.
It is difficult to reproduce the issue because the user does not always get the error; furthermore, re-running the same query at different moments/days usually works. It is kind of weird.
We would prefer not to stop the production system to upgrade it to PG 11. And even if we did, would this guarantee a permanent fix?
Any suggestion?

Regards,
Fabio Isabettini
Voipfuture (Germany)

The failing node [a1] configuration:

OS: CentOS 7, kernel 3.10.0-862.11.6.el7.x86_64
Postgres: postgres-10.5-862.11.6.1
RAM: 256 GB (The main server containing the master node and [a0] node, the slave that has no issue, has 384 GB of RAM)
CPU cores: 32

shared_buffers = 64GB
max_worker_processes = 32
max_parallel_workers_per_gather = 8
max_parallel_workers = 32


#17 Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Fabio Isabettini (#16)
Re: dsa_allocate() failure

On Tue, Jan 29, 2019 at 10:32 PM Fabio Isabettini
<fisabettini@voipfuture.com> wrote:

we are facing a similar issue on a production system using PostgreSQL 10.6:

org.postgresql.util.PSQLException: ERROR: EXCEPTION on getstatistics ; ID: EXCEPTION on getstatistics_media ; ID: uidatareader.
run_query_media(2): [a1] REMOTE FATAL: dsa_allocate could not find 7 free pages

We would prefer not to stop the production system to upgrade it to PG 11. And even if we did, would this guarantee a permanent fix?
Any suggestion?

Hi Fabio,

Thanks for your report. Could you please also show the query plan
that runs on the "remote" node (where the error occurred)?

There is no indication that upgrading to PG11 would help here. It
seems we have an undiagnosed bug (in 10 and 11), and so far no one has
been able to reproduce it at will. I personally have chewed a lot of
CPU time on several machines trying various plan shapes and not seen
this or the possibly related symptom from bug #15585 even once. But
we have about three reports of each of the two symptoms. One reporter
wrote to me off-list to say that they'd seen #15585 twice, the second
time by running the same query in a tight loop for 8 hours, and then
not seen it again in the past 3 weeks. Clearly there is an issue needing
a fix here, but I don't yet know what it is.

--
Thomas Munro
http://www.enterprisedb.com

#18 Fabio Isabettini
fisabettini@voipfuture.com
In reply to: Thomas Munro (#17)
Re: dsa_allocate() failure

Hi Thomas,
it is a production system and we don't have permanent access to it.
Also, having auto_explain always on is not an option in production.
I will ask the customer to notify us as soon as the error presents itself, so we can connect immediately and try to get a query plan.

Regards

Fabio Isabettini
www.voipfuture.com


#19 Arne Roland
A.Roland@index.de
In reply to: Thomas Munro (#15)
2 attachment(s)
RE: dsa_allocate() failure

Hi Thomas,

this is reproducible, though it's highly sensitive to changes of plans (i.e. the precise queries that break change with every new ANALYZE). Disabling parallel query seems to solve the problem (as expected).
At some point even the simple query
select count(*) from test_tab where (o = '0' and date >= '30.01.2019'::date-interval '14 days' or o = '1' and date >= '30.01.2019'::date-interval '14 days') and coalesce(fid,fallback) >=6 and coalesce(fid,fallback) <=6
was reported to fail (with the same error) on the live database, but I wasn't able to obtain a plan, since it works again with the current live data (maybe autoanalyze is at fault here).
The table test_tab has roughly 70 children that inherit from it. The children and the corresponding indexes should be named like '%part%'.
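
For reference, a rough sketch of that kind of inheritance layout (hypothetical: the column types, check constraint, and index are guesses based on the query above; the real table has roughly 70 children):

-- Parent table and one child; names follow the '%part%' convention.
CREATE TABLE test_tab (
    o        text,
    date     date,
    fid      int,
    fallback bigint
);

CREATE TABLE test_tab_part_201901 (
    CHECK (date >= DATE '2019-01-01' AND date < DATE '2019-02-01')
) INHERITS (test_tab);

CREATE INDEX test_tab_part_201901_date_idx ON test_tab_part_201901 (date);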

I attached a query with a plan that fails on my test database.

I don't want to rule out the possibility that it could be related to #15585; at least both issues seem to be related to Parallel Bitmap and inheritance/partitioned tables. But the error occurs relatively quickly in my case, and every one of my processes (the children and the master) fails with 'FATAL: dsa_allocate could not find 7 free pages'.

Regards
Arne

Attachments:

explain1s.txt (text/plain)
query1s.sql (application/octet-stream)
#20 Justin Pryzby
pryzby@telsasoft.com
In reply to: Arne Roland (#19)
Re: dsa_allocate() failure

On Thu, Jan 31, 2019 at 06:19:54PM +0000, Arne Roland wrote:

this is reproducible, though it's highly sensitive to changes of plans (i.e. the precise queries that break change with every new ANALYZE). Disabling parallel query seems to solve the problem (as expected).
At some point even the simple query
select count(*) from test_tab where (o = '0' and date >= '30.01.2019'::date-interval '14 days' or o = '1' and date >= '30.01.2019'::date-interval '14 days') and coalesce(fid,fallback) >=6 and coalesce(fid,fallback) <=6
was reported to fail (with the same error) on the live database, but I wasn't able to obtain a plan, since it works again with the current live data (maybe autoanalyze is at fault here).
The table test_tab has roughly 70 children that inherit from it. The children and the corresponding indexes should be named like '%part%'.

I attached a query with a plan that fails on my test database.

Thanks - note that previously Thomas said:

On Mon, Dec 03, 2018 at 11:45:00AM +1300, Thomas Munro wrote:

On Sat, Dec 1, 2018 at 9:46 AM Justin Pryzby <pryzby@telsasoft.com> wrote:

elog(FATAL,
"dsa_allocate could not find %zu free pages", npages);
+ abort()

If anyone can reproduce this problem with a debugger, it'd be
interesting to see the output of dsa_dump(area), and
FreePageManagerDump(segment_map->fpm). This error condition means

Are you able to cause the error in a test/devel/non-production environment to
run under a debugger, or could you compile with "abort();" after that elog() to
save a core file?

Justin

#21 Jakub Glapa
jakub.glapa@gmail.com
In reply to: Thomas Munro (#17)
Re: dsa_allocate() failure

Hi Thomas,
I was one of the reporters in early December last year.
I somehow dropped the ball and forgot about the issue.
Anyhow, I upgraded the clusters to PG 11.1 and nothing changed. I also have a
rule to produce a core dump, but a segfault does not happen while this is occurring.
I see the error showing up every night on 2 different servers. But it's a
bit of a heisenbug, because if I go there now it won't be reproducible.
It was suggested by Justin Pryzby that I recompile the PG source with his patch
that would cause a core dump.
But I don't feel comfortable doing this, especially if I would have to run
it with prod data.
My question is: can I do anything like increasing the logging level or enabling
some additional options?
It's a production server, but I'm willing to sacrifice a bit of its
performance if that would help.

--
regards,
pozdrawiam,
Jakub Glapa


#22 Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Jakub Glapa (#21)
Re: dsa_allocate() failure

On Mon, Feb 4, 2019 at 6:52 PM Jakub Glapa <jakub.glapa@gmail.com> wrote:

I see the error showing up every night on 2 different servers. But it's a bit of a heisenbug, because if I go there now it won't be reproducible.

Huh. Ok, well, that's a lot more frequent than I thought. Is it always
the same query? Any chance you can get the plan? Are there more
things going on on the server, like perhaps concurrent parallel
queries?

It was suggested by Justin Pryzby that I recompile the PG source with his patch that would cause a core dump.

Small correction to Justin's suggestion: don't abort() after
elog(ERROR, ...), it'll never be reached.

But I don't feel comfortable doing this, especially if I would have to run it with prod data.
My question is: can I do anything like increasing the logging level or enabling some additional options?
It's a production server, but I'm willing to sacrifice a bit of its performance if that would help.

If you're able to run a throwaway copy of your production database on
another system that you don't have to worry about crashing, you could
just replace ERROR with PANIC and run a high-speed loop of the query
that crashed in production, or something. This might at least tell us
whether it reaches that condition via something dereferencing a
dsa_pointer or via something manipulating the segment lists while
allocating/freeing.

In my own 100% unsuccessful attempts to reproduce this I was mostly
running the same query (based on my guess at what ingredients are
needed), but perhaps it requires a particular allocation pattern that
will require more randomness to reach... hmm.

--
Thomas Munro
http://www.enterprisedb.com

#23 Arne Roland
A.Roland@index.de
In reply to: Thomas Munro (#22)
RE: dsa_allocate() failure

It's definitely quite a complex pattern. The query I sent you last time was minimal with respect to predicates (removing any single one of the predicates turned it into a working query).

Huh. Ok, well, that's a lot more frequent than I thought. Is it always the same query? Any chance you can get the plan? Are there more things going on on the server, like perhaps concurrent parallel queries?

I had this bug occurring while I was the only one working on the server. I checked: there was just one transaction with a snapshot at all, and it was an autovacuum busy with a totally unrelated relation my colleague was working on.

The bug is indeed behaving like a ghost.
One child relation needed a few new rows to test a particular application a colleague of mine was working on. The insert triggered an autoanalyze and the explain changed slightly:
Besides row and cost estimates the change is that the line
Recheck Cond: (((COALESCE((fid)::bigint, fallback) ) >= 1) AND ((COALESCE((fid)::bigint, fallback) ) <= 1) AND (gid && '{853078,853080,853082}'::integer[]))
is now
Recheck Cond: ((gid && '{853078,853080,853082}'::integer[]) AND ((COALESCE((fid)::bigint, fallback) ) >= 1) AND ((COALESCE((fid)::bigint, fallback) ) <= 1))
and the error vanished.

I could try to hunt down another query by assembling seemingly random queries. I don't see a very clear pattern in the queries aborting with this error on our production servers. I'm not surprised that the bug is hard to chase on production servers; they usually are quite lively.

If you're able to run a throwaway copy of your production database on another system that you don't have to worry about crashing, you could just replace ERROR with PANIC and run a high-speed loop of the query that crashed in production, or something. This might at least tell us whether it reaches that condition via something dereferencing a dsa_pointer or via something manipulating the segment lists while allocating/freeing.

I could take a backup and restore the relevant tables on a throwaway system. You are just suggesting to replace line 728
elog(FATAL,
"dsa_allocate could not find %zu free pages", npages);
by
elog(PANIC,
"dsa_allocate could not find %zu free pages", npages);
correct? Just for my understanding: why would the shutdown of the whole instance create more helpful logging?

All the best
Arne

#24 Justin Pryzby
pryzby@telsasoft.com
In reply to: Arne Roland (#23)
Re: dsa_allocate() failure

On Mon, Feb 04, 2019 at 08:31:47PM +0000, Arne Roland wrote:

I could take a backup and restore the relevant tables on a throwaway system. You are just suggesting to replace line 728
elog(FATAL,
"dsa_allocate could not find %zu free pages", npages);
by
elog(PANIC,
"dsa_allocate could not find %zu free pages", npages);
correct? Just for my understanding: why would the shutdown of the whole instance create more helpful logging?

You'd also start with pg_ctl -c, which would allow it to dump core, which could
be inspected with GDB to show a backtrace and other internals, which up to now
nobody (including myself) has been able to provide.

Justin

#25 Justin Pryzby
pryzby@telsasoft.com
In reply to: Jakub Glapa (#21)
Re: dsa_allocate() failure

Moving to -hackers, hopefully it doesn't confuse the list scripts too much.

On Mon, Feb 04, 2019 at 08:52:17AM +0100, Jakub Glapa wrote:

I see the error showing up every night on 2 different servers. But it's a
bit of a heisenbug, because if I go there now it won't be reproducible.

Do you have query logging enabled? If not, could you consider it on at least
one of those servers? I'm interested to know what ELSE is running at the time
that query failed.

Perhaps you could enable query logging JUST for the interval of time during which the
server usually errors? The CSV logs can be imported to postgres for analysis.
You might do something like SELECT left(message,99),COUNT(1),max(session_id) FROM postgres_log WHERE log_time BETWEEN .. AND .. GROUP BY 1 ORDER BY 2;
And just maybe there'd be a query there that only runs once per day which would
allow reproducing the error at will. Or a utility command like VACUUM.
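
For example, a sketch of that workflow (the staging table should match the csvlog column list documented for your server version; the file path and time window below are just placeholders):

-- Staging table matching the PostgreSQL 10/11 csvlog format.
CREATE TABLE postgres_log
(
    log_time               timestamp(3) with time zone,
    user_name              text,
    database_name          text,
    process_id             integer,
    connection_from        text,
    session_id             text,
    session_line_num       bigint,
    command_tag            text,
    session_start_time     timestamp with time zone,
    virtual_transaction_id text,
    transaction_id         bigint,
    error_severity         text,
    sql_state_code         text,
    message                text,
    detail                 text,
    hint                   text,
    internal_query         text,
    internal_query_pos     integer,
    context                text,
    query                  text,
    query_pos              integer,
    location               text,
    application_name       text,
    PRIMARY KEY (session_id, session_line_num)
);

-- Load one rotated log file (example path only):
COPY postgres_log FROM '/var/log/postgresql/postgresql-2019-02-07_000000.csv' WITH csv;

-- Summarize what was running around the time of the failure:
SELECT left(message, 99), count(1), max(session_id)
FROM postgres_log
WHERE log_time BETWEEN '2019-02-07 02:00' AND '2019-02-07 03:00'
GROUP BY 1
ORDER BY 2;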

I think ideally you'd set:

log_statement = all
log_min_messages = info
log_destination = 'stderr,csvlog'
# stderr isn't important for this purpose, but I keep it set to capture crash messages, too

You should set these to something that works well at your site:

log_rotation_age = '2min'
log_rotation_size = '32MB'

I would normally set these, and I don't see any reason why you wouldn't set
them too:

log_checkpoints = on
log_lock_waits = on
log_min_error_statement = notice
log_temp_files = 0
log_min_duration_statement = '9sec'
log_autovacuum_min_duration = '999sec'

And I would set these too but maybe you'd prefer to do something else:

log_directory = /var/log/postgresql
log_file_mode = 0640
log_filename = postgresql-%Y-%m-%d_%H%M%S.log

Justin

#26 Jakub Glapa
jakub.glapa@gmail.com
In reply to: Justin Pryzby (#25)
Re: dsa_allocate() failure

Do you have query logging enabled? If not, could you consider it on at least one of those servers? I'm interested to know what ELSE is running at the time that query failed.

Ok, I have configured that and will enable it in the time window when the
errors usually occur. I'll report as soon as I have something.

--
regards,
pozdrawiam,
Jakub Glapa


#27 Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Jakub Glapa (#26)
Re: dsa_allocate() failure

On Thu, Feb 7, 2019 at 9:10 PM Jakub Glapa <jakub.glapa@gmail.com> wrote:

Do you have query logging enabled? If not, could you consider it on at least one of those servers? I'm interested to know what ELSE is running at the time that query failed.

Ok, I have configured that and will enable it in the time window when the errors usually occur. I'll report as soon as I have something.

I don't have the answer yet but I have some progress: I finally
reproduced the "could not find %d free pages" error by running lots of
concurrent parallel queries. Will investigate.

Set up:

create table foo (p int, a int, b int) partition by list (p);
create table foo_1 partition of foo for values in (1);
create table foo_2 partition of foo for values in (2);
create table foo_3 partition of foo for values in (3);
alter table foo_1 set (parallel_workers = 4);
alter table foo_2 set (parallel_workers = 4);
alter table foo_3 set (parallel_workers = 4);
insert into foo
select generate_series(1, 10000000)::int % 3 + 1,
generate_series(1, 10000000)::int % 50,
generate_series(1, 10000000)::int % 50;
create index on foo_1(a);
create index on foo_2(a);
create index on foo_3(a);
create index on foo_1(b);
create index on foo_2(b);
create index on foo_3(b);
analyze;

Then I ran three copies of this script:

#!/bin/sh
(
  echo "set max_parallel_workers_per_gather = 4;"
  for I in `seq 1 100000`; do
    echo "explain analyze select count(*) from foo where a between 5 and 6 or b between 5 and 6;"
  done
) | psql postgres

--
Thomas Munro
http://www.enterprisedb.com

#28 Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Thomas Munro (#27)
Re: dsa_allocate() failure

On Fri, Feb 8, 2019 at 4:49 AM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

I don't have the answer yet but I have some progress: I finally
reproduced the "could not find %d free pages" error by running lots of
concurrent parallel queries. Will investigate.

Sometimes FreePageManagerPutInternal() returns a
number-of-contiguous-pages-created-by-this-insertion that is too large
by one. If this happens to be a new max-number-of-contiguous-pages,
it causes trouble some arbitrary time later because the max is wrong
and this FPM cannot satisfy a request that large, and it may not be
recomputed for some time because the incorrect value prevents
recomputation. Not sure yet if this is due to the lazy computation
logic or a plain old fence-post error in the btree consolidation code
or something else.

--
Thomas Munro
http://www.enterprisedb.com

#29 Robert Haas
robertmhaas@gmail.com
In reply to: Thomas Munro (#28)
Re: dsa_allocate() failure

On Fri, Feb 8, 2019 at 8:00 AM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

Sometimes FreePageManagerPutInternal() returns a
number-of-contiguous-pages-created-by-this-insertion that is too large
by one. If this happens to be a new max-number-of-contiguous-pages,
it causes trouble some arbitrary time later because the max is wrong
and this FPM cannot satisfy a request that large, and it may not be
recomputed for some time because the incorrect value prevents
recomputation. Not sure yet if this is due to the lazy computation
logic or a plain old fence-post error in the btree consolidation code
or something else.

I spent a long time thinking about this and staring at code this
afternoon, but I didn't really come up with much of anything useful.
It seems like a strange failure mode, because
FreePageManagerPutInternal() normally just returns its third argument
unmodified. The only cases where anything else happens are the ones
where we're able to consolidate the returned span with a preceding or
following span, and I'm scratching my head as to how that logic could
be wrong, especially since it also has some Assert() statements that
seem like they would detect the kinds of inconsistencies that would
lead to trouble. For example, if we somehow ended up with two spans
that (improperly) overlapped, we'd trip an Assert(). And if that
didn't happen -- because we're not in an Assert-enabled build -- the
code is written so that it only relies on the npages value of the last
of the consolidated spans, so an error in the npages value of one of
the earlier spans would just get fixed up.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#30 Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Robert Haas (#29)
Re: dsa_allocate() failure

On Sat, Feb 9, 2019 at 9:21 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Feb 8, 2019 at 8:00 AM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

Sometimes FreePageManagerPutInternal() returns a
number-of-contiguous-pages-created-by-this-insertion that is too large
by one. [...]

I spent a long time thinking about this and starting at code this
afternoon, but I didn't really come up with much of anything useful.
It seems like a strange failure mode, because
FreePageManagerPutInternal() normally just returns its third argument
unmodified. [...]

Bleugh. Yeah. What I said before wasn't quite right. The value
returned by FreePageManagerPutInternal() is actually correct at the
moment it is returned, but it ceases to be correct immediately
afterwards if the following call to FreePageBtreeCleanup() happens to
reduce the size of that particular span. The problem is that we
clobber fpm->contiguous_pages with the earlier (and by now incorrect)
value that we were holding in a local variable.

--
Thomas Munro
http://www.enterprisedb.com

#31 Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Thomas Munro (#30)
Re: dsa_allocate() failure

On Sun, Feb 10, 2019 at 7:24 AM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

On Sat, Feb 9, 2019 at 9:21 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Feb 8, 2019 at 8:00 AM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

Sometimes FreePageManagerPutInternal() returns a
number-of-contiguous-pages-created-by-this-insertion that is too large
by one. [...]

I spent a long time thinking about this and starting at code this
afternoon, but I didn't really come up with much of anything useful.
It seems like a strange failure mode, because
FreePageManagerPutInternal() normally just returns its third argument
unmodified. [...]

Bleugh. Yeah. What I said before wasn't quite right. The value
returned by FreePageManagerPutInternal() is actually correct at the
moment it is returned, but it ceases to be correct immediately
afterwards if the following call to FreePageBtreeCleanup() happens to
reduce the size of that particular span.

... but why would it do that? I can reproduce cases where (for
example) FreePageManagerPutInternal() returns 179, and then
FreePageManagerLargestContiguous() returns 179, but then after
FreePageBtreeCleanup() it returns 178. At that point FreePageManagerDump()
says:

btree depth 1:
77@0 l: 27(1) 78(178)
freelists:
1: 27
129: 78(178)

But at first glance it shouldn't be allocating pages, because it just
does consolidation to try to convert to singleton format, and then it
does recycle list cleanup using soft=true so that no allocation of
btree pages should occur.

--
Thomas Munro
http://www.enterprisedb.com

#32Robert Haas
robertmhaas@gmail.com
In reply to: Thomas Munro (#30)
Re: dsa_allocate() faliure

On Sun, Feb 10, 2019 at 1:55 AM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

Bleugh. Yeah. What I said before wasn't quite right. The value
returned by FreePageManagerPutInternal() is actually correct at the
moment it is returned, but it ceases to be correct immediately
afterwards if the following call to FreePageBtreeCleanup() happens to
reduce the size of that particular span. The problem is that we
clobber fpm->contiguous_pages with the earlier (and by now incorrect)
value that we were holding in a local variable.

Yeah, I had similar bugs to that during the initial development work I
did on freepage.c, and that's why I got rid of some lazy recomputation
thing that I had tried at some point. The version that got committed
brought that back again, but possibly it's got the same kind of
problem.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#33Robert Haas
robertmhaas@gmail.com
In reply to: Thomas Munro (#31)
1 attachment(s)
Re: dsa_allocate() faliure

On Sun, Feb 10, 2019 at 2:37 AM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

... but why would it do that? I can reproduce cases where (for
example) FreePageManagerPutInternal() returns 179, and then
FreePageManagerLargestContiguous() returns 179, but then after
FreePageBtreeCleanup() it returns 178. At that point FreePageManagerDump()
says:

btree depth 1:
77@0 l: 27(1) 78(178)
freelists:
1: 27
129: 78(178)

But at first glance it shouldn't be allocating pages, because it just
does consolidation to try to convert to singleton format, and then it
does recycle list cleanup using soft=true so that no allocation of
btree pages should occur.

I think I see what's happening. At the moment the problem occurs,
there is no btree - there is only a singleton range. So
FreePageManagerPutInternal() takes the fpm->btree_depth == 0 branch and
then ends up in the section with the comment /* Not contiguous; we
need to initialize the btree. */. And that section, sadly, does not
respect the 'soft' flag, so kaboom. Something like the attached might
fix it.

Boy, I love FreePageManagerDump!

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

dsa-allocate-btree-init-soft.patch (application/octet-stream)
diff --git a/src/backend/utils/mmgr/freepage.c b/src/backend/utils/mmgr/freepage.c
index d110030b63..18c3c188e8 100644
--- a/src/backend/utils/mmgr/freepage.c
+++ b/src/backend/utils/mmgr/freepage.c
@@ -1526,6 +1526,8 @@ FreePageManagerPutInternal(FreePageManager *fpm, Size first_page, Size npages,
 
 			if (!relptr_is_null(fpm->btree_recycle))
 				root = FreePageBtreeGetRecycled(fpm);
+			else if (soft)
+				return 0;		/* Should not allocate if soft. */
 			else if (FreePageManagerGetInternal(fpm, 1, &root_page))
 				root = (FreePageBtree *) fpm_page_to_pointer(base, root_page);
 			else
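
Read in isolation, the added guard is doing something quite general: a routine
that might have to allocate an internal page for its own bookkeeping declines to
do so when the call is only incidental ("soft") cleanup, and signals that by
returning 0. The toy below shows the same guard in a runnable, self-contained
form; all names are invented and nothing in it is PostgreSQL code.

/* soft_flag_demo.c -- sketch of the "soft" guard idea; all names invented. */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

typedef struct Manager
{
	bool		have_recycled_node;	/* reusable internal node available? */
	size_t		free_pages;			/* pages usable by callers */
} Manager;

/*
 * Record npages of free space.  If doing so needs a fresh internal page and
 * this is only an incidental ("soft") call, do nothing and return 0, so the
 * caller's cached largest-run value stays valid.
 */
static size_t
put_internal(Manager *mgr, size_t npages, bool soft)
{
	if (mgr->have_recycled_node)
		mgr->have_recycled_node = false;	/* reuse: no allocation needed */
	else if (soft)
		return 0;							/* should not allocate if soft */
	else
		mgr->free_pages--;					/* consume a page for bookkeeping */

	mgr->free_pages += npages;
	return mgr->free_pages;
}

int
main(void)
{
	Manager		mgr = {false, 10};
	size_t		r;

	/* Incidental cleanup: refuses to allocate and reports nothing done. */
	r = put_internal(&mgr, 5, true);
	printf("soft put -> %zu (free_pages still %zu)\n", r, mgr.free_pages);

	/* Ordinary put: allowed to consume a page for its own bookkeeping. */
	r = put_internal(&mgr, 5, false);
	printf("hard put -> %zu (free_pages now %zu)\n", r, mgr.free_pages);
	return 0;
}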
#34Justin Pryzby
pryzby@telsasoft.com
In reply to: Robert Haas (#33)
Re: dsa_allocate() faliure

On Sun, Feb 10, 2019 at 12:10:52PM +0530, Robert Haas wrote:

I think I see what's happening. At the moment the problem occurs,
there is no btree - there is only a singleton range. So
FreePageManagerPutInternal() takes the fpm->btree_depth == 0 branch and
then ends up in the section with the comment /* Not contiguous; we
need to initialize the btree. */. And that section, sadly, does not
respect the 'soft' flag, so kaboom. Something like the attached might
fix it.

I ran overnight with this patch, but all parallel processes ended up stuck in
the style of bug#15585. So that's either not the root cause, or there's a 2nd
issue.

/messages/by-id/15585-324ff6a93a18da46@postgresql.org

Justin

#35Sergei Kornilov
In reply to: Justin Pryzby (#34)
Re: dsa_allocate() faliure

Hi

I ran overnight with this patch, but all parallel processes ended up stuck in
the style of bug#15585. So that's either not the root cause, or there's a 2nd
issue.

Maybe I missed something in this discussion, but you can reproduce bug#15585? How? With this testcase: /messages/by-id/CAEepm=1MvOE-Sfv1chudx5KEmw_qHYrj8F9Og_WmdBRhXSQ++w@mail.gmail.com ?

regards, Sergei

#36Justin Pryzby
pryzby@telsasoft.com
In reply to: Sergei Kornilov (#35)
Re: dsa_allocate() faliure

On Sun, Feb 10, 2019 at 07:11:22PM +0300, Sergei Kornilov wrote:

I ran overnight with this patch, but all parallel processes ended up stuck in
the style of bug#15585. So that's either not the root cause, or there's a 2nd
issue.

Maybe I missed something in this discussion, but you can reproduce bug#15585? How? With this testcase: /messages/by-id/CAEepm=1MvOE-Sfv1chudx5KEmw_qHYrj8F9Og_WmdBRhXSQ++w@mail.gmail.com ?

By running the queued_alters query multiple times in a loop:
/messages/by-id/20181231221734.GB25379@telsasoft.com

I'm able to trigger dsa "ERROR"s with that query on a newly initdb cluster with
only that table. But I think some servers are more likely to hit it than
others.

I've only tripped on bug#15585 twice, and only while trying to trigger other
DSA bugs (the working hypothesis is that it is a secondary issue which happens
after hitting some other bug), and not consistently or quickly.

Justin

#37Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Robert Haas (#33)
1 attachment(s)
Re: dsa_allocate() faliure

On Sun, Feb 10, 2019 at 5:41 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Sun, Feb 10, 2019 at 2:37 AM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

But at first glance it shouldn't be allocating pages, because it just
does consolidation to try to convert to singleton format, and then it
does recycle list cleanup using soft=true so that no allocation of
btree pages should occur.

I think I see what's happening. At the moment the problem occurs,
there is no btree - there is only a singleton range. So
FreePageManagerPutInternal() takes the fpm->btree_depth == 0 branch and
then ends up in the section with the comment /* Not contiguous; we
need to initialize the btree. */. And that section, sadly, does not
respect the 'soft' flag, so kaboom. Something like the attached might
fix it.

Ouch. Yeah, that'd do it and matches the evidence. With this change,
I couldn't reproduce the problem after 90 minutes with a test case
that otherwise hits it within a couple of minutes.

Here's a patch with a commit message explaining the change.

It also removes an obsolete comment, which is in fact related. The
comment refers to an output parameter internal_pages_used, which must
have been used to report this exact phenomenon in an earlier
development version. But there is no such parameter in the committed
version, and instead there is the soft flag to prevent internal
allocation. I have no view on which approach is best, but yeah, if
we're using a soft flag, it has to work reliably.

This brings us to a difficult choice: we're about to cut a new
release, and this could in theory be included. Even though the fix is
quite convincing, it doesn't seem wise to change such complicated code
at the last minute, and I know from an off-list chat that that is also
Robert's view. So I'll wait until after the release, and we'll have
to live with the bug for another 3 months.

Note that this patch addresses the error "dsa_allocate could not find
%zu free pages". (The error "dsa_area could not attach to segment" is
something else and apparently rarer.)

Boy, I love FreePageManagerDump!

Yeah. And I love reproducible bugs.

--
Thomas Munro
http://www.enterprisedb.com

Attachments:

0001-Fix-freepage.c-bug-that-causes-rare-dsa_allocate-fai.patch (application/octet-stream)
From 744bbad5c865d4accfca9003a2ff1ec0172312d8 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@enterprisedb.com>
Date: Mon, 11 Feb 2019 10:11:39 +1300
Subject: [PATCH] Fix freepage.c bug that causes rare dsa_allocate() failures.

In a corner case, a btree page was allocated during a clean-up operation
that could cause the book-keeping of the largest contiguous span of free
space to get out of whack.  That was supposed to be prevented by the use
of the "soft" flag to avoid allocating internal pages during incidental
clean-up work, but the flag was ignored in the case where the FPM was
promoted from singleton format to btree format.  Repair.

Remove an obsolete comment in passing.

Author: Robert Haas
Diagnosed-by: Robert Haas and Thomas Munro
Reported-by: Justin Pryzby, Rick Otten, Sand Stone, Arne Roland and others
Discussion: https://postgr.es/m/CAMAYy4%2Bw3NTBM5JLWFi8twhWK4%3Dk_5L4nV5%2BbYDSPu8r4b97Zg%40mail.gmail.com
---
 src/backend/utils/mmgr/freepage.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/backend/utils/mmgr/freepage.c b/src/backend/utils/mmgr/freepage.c
index d110030b63..9257235001 100644
--- a/src/backend/utils/mmgr/freepage.c
+++ b/src/backend/utils/mmgr/freepage.c
@@ -1470,9 +1470,7 @@ FreePageManagerGetInternal(FreePageManager *fpm, Size npages, Size *first_page)
  * pages; if false, do it always.  Returns 0 if the soft flag caused the
  * insertion to be skipped, or otherwise the size of the contiguous span
  * created by the insertion.  This may be larger than npages if we're able
- * to consolidate with an adjacent range.  *internal_pages_used is set to
- * true if the btree allocated pages for internal purposes, which might
- * invalidate the current largest run requiring it to be recomputed.
+ * to consolidate with an adjacent range.
  */
 static Size
 FreePageManagerPutInternal(FreePageManager *fpm, Size first_page, Size npages,
@@ -1526,6 +1524,8 @@ FreePageManagerPutInternal(FreePageManager *fpm, Size first_page, Size npages,
 
 			if (!relptr_is_null(fpm->btree_recycle))
 				root = FreePageBtreeGetRecycled(fpm);
+			else if (soft)
+				return 0;		/* Should not allocate if soft. */
 			else if (FreePageManagerGetInternal(fpm, 1, &root_page))
 				root = (FreePageBtree *) fpm_page_to_pointer(base, root_page);
 			else
-- 
2.20.1

#38Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#37)
Re: dsa_allocate() faliure

Thomas Munro <thomas.munro@enterprisedb.com> writes:

This brings us to a difficult choice: we're about to cut a new
release, and this could in theory be included. Even though the fix is
quite convincing, it doesn't seem wise to change such complicated code
at the last minute, and I know from an off-list chat that that is also
Robert's view.

Yeah ... at this point we're just too close to the release deadline,
I'm afraid, even though the fix *looks* pretty safe. Not worth the risk
given that this seems to be a low-probability bug.

I observe from

https://coverage.postgresql.org/src/backend/utils/mmgr/freepage.c.gcov.html

that the edge cases in this function aren't too well exercised by
our regression tests, meaning that the buildfarm might not prove
much either way about the correctness of this patch. That is one
factor pushing me to think we shouldn't risk it. But, taking a
longer view, is that something that's practical to improve?

So I'll wait until after the release, and we'll have
to live with the bug for another 3 months.

Check. Please hold off committing until you see the release tags
appear, probably late Tuesday my time / Wednesday noonish yours.

regards, tom lane

#39Justin Pryzby
pryzby@telsasoft.com
In reply to: Thomas Munro (#37)
Re: dsa_allocate() faliure

On Mon, Feb 11, 2019 at 09:45:07AM +1100, Thomas Munro wrote:

Ouch. Yeah, that'd do it and matches the evidence. With this change,
I couldn't reproduce the problem after 90 minutes with a test case
that otherwise hits it within a couple of minutes.

...

Note that this patch addresses the error "dsa_allocate could not find
%zu free pages". (The error "dsa_area could not attach to segment" is
something else and apparently rarer.)

"could not attach" is the error reported early this morning while
stress-testing this patch with queued_alters queries in loops, so that's
consistent with your understanding. And I guess it preceded getting stuck on
lock; although I don't how long between the first happened and the second, I'm
guess not long and perhaps immedidately; since the rest of the processes were
all stuck as in bug#15585 rather than ERRORing once every few minutes.

I mentioned that "could not attach to segment" occurs in either the leader or
a parallel worker, and most of the time causes an ERROR only and doesn't wedge
all future parallel workers. Maybe the bug#15585 "wedged" state only occurs
after some pattern of leader+worker failures(?). I've just triggered bug#15585
again, but if there's a pattern, I don't see it.

Please let me know whether you're able to reproduce the "not attach" bug using
simultaneous loops around the queued_alters query; it's easy here.

Justin

#40Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Justin Pryzby (#39)
Re: dsa_allocate() faliure

On Mon, Feb 11, 2019 at 11:02 AM Justin Pryzby <pryzby@telsasoft.com> wrote:

On Mon, Feb 11, 2019 at 09:45:07AM +1100, Thomas Munro wrote:

Ouch. Yeah, that'd do it and matches the evidence. With this change,
I couldn't reproduce the problem after 90 minutes with a test case
that otherwise hits it within a couple of minutes.

...

Note that this patch addresses the error "dsa_allocate could not find
%zu free pages". (The error "dsa_area could not attach to segment" is
something else and apparently rarer.)

"could not attach" is the error reported early this morning while
stress-testing this patch with queued_alters queries in loops, so that's
consistent with your understanding. And I guess it preceded getting stuck on
lock; although I don't how long between the first happened and the second, I'm
guess not long and perhaps immedidately; since the rest of the processes were
all stuck as in bug#15585 rather than ERRORing once every few minutes.

I mentioned that "could not attach to segment" occurs in either the leader or
a parallel worker, and most of the time causes an ERROR only and doesn't wedge
all future parallel workers. Maybe the bug#15585 "wedged" state only occurs
after some pattern of leader+worker failures(?). I've just triggered bug#15585
again, but if there's a pattern, I don't see it.

Please let me know whether you're able to reproduce the "not attach" bug using
simultaneous loops around the queued_alters query; it's easy here.

I haven't ever managed to reproduce that one yet. It's great you have
a reliable repro... Let's discuss it on the #15585 thread.

--
Thomas Munro
http://www.enterprisedb.com

#41Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Tom Lane (#38)
Re: dsa_allocate() faliure

On Mon, Feb 11, 2019 at 10:33 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I observe from

https://coverage.postgresql.org/src/backend/utils/mmgr/freepage.c.gcov.html

that the edge cases in this function aren't too well exercised by
our regression tests, meaning that the buildfarm might not prove
much either way about the correctness of this patch. That is one
factor pushing me to think we shouldn't risk it. But, taking a
longer view, is that something that's practical to improve?

Yeah. This is a nice example of code that really deserves unit tests
written in C. Could be good motivation to build the infrastructure I
mentioned here:

/messages/by-id/CAEepm=2heu+5zwB65jWap3XY-UP6PpJZiKLQRSV2UQH9BmVRXQ@mail.gmail.com

--
Thomas Munro
http://www.enterprisedb.com

#42Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#41)
Re: dsa_allocate() faliure

Thomas Munro <thomas.munro@enterprisedb.com> writes:

On Mon, Feb 11, 2019 at 10:33 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I observe from
https://coverage.postgresql.org/src/backend/utils/mmgr/freepage.c.gcov.html
that the edge cases in this function aren't too well exercised by
our regression tests, meaning that the buildfarm might not prove
much either way about the correctness of this patch. That is one
factor pushing me to think we shouldn't risk it. But, taking a
longer view, is that something that's practical to improve?

Yeah. This is a nice example of code that really deserves unit tests
written in C. Could be good motivation to build the infrastructure I
mentioned here:
/messages/by-id/CAEepm=2heu+5zwB65jWap3XY-UP6PpJZiKLQRSV2UQH9BmVRXQ@mail.gmail.com

Meh. I think if you hold out for that, you're going to be waiting a
long time. I was thinking more along the lines of making a test API
in src/test/modules/, akin to what we've got for predtest or rbtree.

regards, tom lane
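
Purely as a sketch of what such a src/test/modules API might look like -- in the
spirit of test_predtest or test_rbtree, but not an actual PostgreSQL module --
the function below exercises FreePageManager directly from a SQL-callable C
function. The file, function, and behaviour are made up, and the freepage.h
entry points it uses (FPM_PAGE_SIZE, FreePageManagerInitialize(fpm, base),
FreePageManagerPut(fpm, first_page, npages), and
FreePageManagerGet(fpm, npages, &first_page)) are assumed to have the
signatures that dsa.c relies on.

/* test_freepage.c -- hypothetical sketch only, not a reviewed test module. */
#include "postgres.h"

#include "fmgr.h"
#include "utils/freepage.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(test_freepage_basic);

Datum
test_freepage_basic(PG_FUNCTION_ARGS)
{
	int32		npages = PG_GETARG_INT32(0);
	char	   *raw;
	char	   *base;
	FreePageManager *fpm;
	Size		first_page;
	int			i;

	if (npages < 2)
		elog(ERROR, "need at least 2 pages");

	/*
	 * Carve out a region: page 0 holds the FreePageManager itself and pages
	 * 1..npages are handed to it.  Aligning base to FPM_PAGE_SIZE is just a
	 * precaution for this sketch.
	 */
	raw = palloc(((Size) npages + 2) * FPM_PAGE_SIZE);
	base = (char *) TYPEALIGN(FPM_PAGE_SIZE, raw);
	fpm = (FreePageManager *) base;
	FreePageManagerInitialize(fpm, base);

	/* Return the pages one at a time, forcing repeated consolidation. */
	for (i = npages; i >= 1; i--)
		FreePageManagerPut(fpm, (Size) i, 1);

	/* The manager should now hand back one big contiguous run. */
	if (!FreePageManagerGet(fpm, (Size) npages, &first_page))
		elog(ERROR, "could not get %d contiguous pages back", npages);
	if (first_page != 1)
		elog(ERROR, "unexpected first page %zu", first_page);

	/*
	 * Now return only every second page, exercising the singleton-to-btree
	 * promotion path (the manager must take btree pages from the free set),
	 * and check that no two contiguous pages can be obtained.
	 */
	for (i = 1; i <= npages; i += 2)
		FreePageManagerPut(fpm, (Size) i, 1);
	if (FreePageManagerGet(fpm, 2, &first_page))
		elog(ERROR, "got 2 contiguous pages from a fragmented manager");

	pfree(raw);
	PG_RETURN_VOID();
}

A SQL wrapper (CREATE FUNCTION ... RETURNS void AS 'MODULE_PATHNAME' LANGUAGE C
STRICT) plus a handful of calls with varying page counts in a regression script
would give the buildfarm at least some coverage of these edge cases.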

#43Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Tom Lane (#38)
Re: dsa_allocate() faliure

On Mon, Feb 11, 2019 at 10:33 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Thomas Munro <thomas.munro@enterprisedb.com> writes:

So I'll wait until after the release, and we'll have
to live with the bug for another 3 months.

Check. Please hold off committing until you see the release tags
appear, probably late Tuesday my time / Wednesday noonish yours.

Pushed.

--
Thomas Munro
http://www.enterprisedb.com