BUG in 10.1 - dsa_area could not attach to a segment that has been freed

Started by Alexander Voytsekhovskyyover 8 years ago19 messagesbugs

Jump to latest

Alexander Voytsekhovskyy

young.inbox@gmail.com

over 8 years ago

Hi,

in very certain case i am getting error

"ERROR: dsa_area could not attach to a segment that has been freed"

I am running
PostgreSQL 10.1 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu
4.8.4-2ubuntu1~14.04.3) 4.8.4, 64-bit

I am getting this error when

1) i set set enable_nestloop to off

2) run query

SELECT
delivery_data.id_product AS id_product,
client_tt.id_client_tt
FROM delivery_data
JOIN client_tt AS client_tt_tmp ON (client_tt_tmp.id_client_tt =
delivery_data.id_client_tt)
JOIN client_tt ON (client_tt_tmp.id_unique = client_tt.id_client_tt)
WHERE delivery_data.delivery_date BETWEEN '2017-09-01' AND '2017-11-30'
AND client_tt.id_client_tt IN (167427, 167429, 167430, 167433,
167434, 167435, 167436, 167437, 167438, 167485, 167520, 167521, 167527,
167528, 167529, 167530)
AND delivery_data.id_ta IN (3911)
GROUP BY client_tt.id_client_tt, delivery_data.id_product
HAVING SUM(delivery_data.amount) > 0
ORDER BY client_tt.id_client_tt

Here is explain for this query:
https://explain.depesz.com/s/slpM

If i change any part of this query, like remove filter
client_tt.id_client_tt or try it on other data set - it works

For me it's looks like a bug (

Michael Paquier

michael@paquier.xyz

over 8 years ago

In reply to: Alexander Voytsekhovskyy (#1)

Re: BUG in 10.1 - dsa_area could not attach to a segment that has been freed

On Wed, Nov 29, 2017 at 2:17 AM, Alexander Voytsekhovskyy
<young.inbox@gmail.com> wrote:

in very certain case i am getting error

"ERROR: dsa_area could not attach to a segment that has been freed"

You are likely seeing the same problem as discussed here:
/messages/by-id/15e84857859.d4f2ff2b8299.3910523281971861294@zohocorp.com
--
Michael

Thomas Munro

thomas.munro@gmail.com

over 8 years ago

In reply to: Michael Paquier (#2)

Re: BUG in 10.1 - dsa_area could not attach to a segment that has been freed

On Wed, Nov 29, 2017 at 12:18 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Wed, Nov 29, 2017 at 2:17 AM, Alexander Voytsekhovskyy
<young.inbox@gmail.com> wrote:

in very certain case i am getting error

"ERROR: dsa_area could not attach to a segment that has been freed"

You are likely seeing the same problem as discussed here:
/messages/by-id/15e84857859.d4f2ff2b8299.3910523281971861294@zohocorp.com

Hmm. Well, it's the same error message but I don't think it's the
same: that other thread was about C extension code using APIs
incorrectly. This is parallel query, which shouldn't reach that error
unless all the backends detach before one backend tries to attach,
which I wouldn't normally expect to happen unless something else went
wrong first. Then Alexander would presumably have also seen another
error message too but he didn't say that. So I'm not yet sure how
this condition is being reached.

--
Thomas Munro
http://www.enterprisedb.com

Robert Haas

robertmhaas@gmail.com

over 8 years ago

In reply to: Thomas Munro (#3)

Re: BUG in 10.1 - dsa_area could not attach to a segment that has been freed

On Tue, Nov 28, 2017 at 7:04 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

Hmm. Well, it's the same error message but I don't think it's the
same: that other thread was about C extension code using APIs
incorrectly. This is parallel query, which shouldn't reach that error
unless all the backends detach before one backend tries to attach,
which I wouldn't normally expect to happen unless something else went
wrong first.

Why not? Can't it just be that the workers are slow getting started?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Thomas Munro

thomas.munro@gmail.com

over 8 years ago

In reply to: Robert Haas (#4)

Re: BUG in 10.1 - dsa_area could not attach to a segment that has been freed

On Wed, Nov 29, 2017 at 1:33 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Why not? Can't it just be that the workers are slow getting started?

In the normal non-error control flow, don't we expect
ExecShutdownGather() to run ExecParallelFinish() before
ExecParallelCleanup(), meaning that the leader waits for workers to
finish completely before it detaches itself? Doesn't that need to be
case to avoid random "unable to map dynamic shared memory segment" and
"dsa_area could not attach to a segment that has been freed" errors,
and for the parallel instrumentation shown in EXPLAIN to be reliable?

Could it be that the leader thought that a worker didn't start up, but
in fact it did?

--
Thomas Munro
http://www.enterprisedb.com

Amit Kapila

amit.kapila16@gmail.com

over 8 years ago

In reply to: Alexander Voytsekhovskyy (#1)

Re: BUG in 10.1 - dsa_area could not attach to a segment that has been freed

On Tue, Nov 28, 2017 at 10:47 PM, Alexander Voytsekhovskyy
<young.inbox@gmail.com> wrote:

Hi,

in very certain case i am getting error

"ERROR: dsa_area could not attach to a segment that has been freed"

Can you try to get call stack so that we can get a better picture of
what is happening here? One way could be you can change this ERROR to
PANIC which will generate core dump and you can get stack trace.

Here is explain for this query:
https://explain.depesz.com/s/slpM

If i change any part of this query, like remove filter
client_tt.id_client_tt or try it on other data set - it works

Can you share the plan after you change the query?

For me it's looks like a bug (

Yeah.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Alexander Voytsekhovskyy

young.inbox@gmail.com

over 8 years ago

In reply to: Amit Kapila (#6)

Re: BUG in 10.1 - dsa_area could not attach to a segment that has been freed

On Wed, Nov 29, 2017 at 2:38 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Tue, Nov 28, 2017 at 10:47 PM, Alexander Voytsekhovskyy
<young.inbox@gmail.com> wrote:

Hi,

in very certain case i am getting error

"ERROR: dsa_area could not attach to a segment that has been freed"

Can you try to get call stack so that we can get a better picture of
what is happening here? One way could be you can change this ERROR to
PANIC which will generate core dump and you can get stack trace.

Yes, i can do both - i can reproduce it on test environment also

is there somewhere manual for that?

Here is explain for this query:
https://explain.depesz.com/s/slpM

If i change any part of this query, like remove filter
client_tt.id_client_tt or try it on other data set - it works

Can you share the plan after you change the query?

Yes, here is query without one of conditions in WHERE section

SELECT
delivery_data.id_product AS id_product,
client_tt.id_client_tt
FROM delivery_data
JOIN client_tt AS client_tt_tmp ON (client_tt_tmp.id_client_tt =
delivery_data.id_client_tt)
JOIN client_tt ON (client_tt_tmp.id_unique = client_tt.id_client_tt)
WHERE delivery_data.delivery_date BETWEEN '2017-09-01' AND '2017-11-30'
AND delivery_data.id_ta IN (3911)
GROUP BY client_tt.id_client_tt, delivery_data.id_product
HAVING SUM(delivery_data.amount) > 0
ORDER BY client_tt.id_client_tt

And explain:

https://explain.depesz.com/s/9aBI

Show quoted text

For me it's looks like a bug (

Yeah.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Amit Kapila

amit.kapila16@gmail.com

over 8 years ago

In reply to: Alexander Voytsekhovskyy (#7)

Re: BUG in 10.1 - dsa_area could not attach to a segment that has been freed

On Wed, Nov 29, 2017 at 6:58 PM, Alexander Voytsekhovskyy
<young.inbox@gmail.com> wrote:

On Wed, Nov 29, 2017 at 2:38 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Tue, Nov 28, 2017 at 10:47 PM, Alexander Voytsekhovskyy
<young.inbox@gmail.com> wrote:

Hi,

in very certain case i am getting error

"ERROR: dsa_area could not attach to a segment that has been freed"

Can you try to get call stack so that we can get a better picture of
what is happening here? One way could be you can change this ERROR to
PANIC which will generate core dump and you can get stack trace.

Yes, i can do both - i can reproduce it on test environment also

is there somewhere manual for that?

You can get some information from the below link:
https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Alexander Voytsekhovskyy

young.inbox@gmail.com

over 8 years ago

In reply to: Amit Kapila (#8)

Re: BUG in 10.1 - dsa_area could not attach to a segment that has been freed

Here is GDB log:

Continuing.

Program received signal SIGUSR1, User defined signal 1.
0x00007f53e538e9b3 in __epoll_wait_nocancel () at
../sysdeps/unix/syscall-template.S:84
84 in ../sysdeps/unix/syscall-template.S
Continuing.

Program received signal SIGUSR1, User defined signal 1.
0x00005560250a10cf in ?? ()
Continuing.

Program received signal SIGUSR1, User defined signal 1.
0x0000556024ed08e3 in ?? ()
Continuing.

Program received signal SIGUSR1, User defined signal 1.
0x00007f53e538e9b3 in __epoll_wait_nocancel () at
../sysdeps/unix/syscall-template.S:84
84 in ../sysdeps/unix/syscall-template.S
Continuing.

Program received signal SIGUSR1, User defined signal 1.
0x0000556024ec9896 in bringetbitmap ()
Detaching from program: /usr/lib/postgresql/10/bin/postgres, process 7453

On Wed, Nov 29, 2017 at 4:12 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

Show quoted text

On Wed, Nov 29, 2017 at 6:58 PM, Alexander Voytsekhovskyy
<young.inbox@gmail.com> wrote:

On Wed, Nov 29, 2017 at 2:38 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Tue, Nov 28, 2017 at 10:47 PM, Alexander Voytsekhovskyy
<young.inbox@gmail.com> wrote:

Hi,

in very certain case i am getting error

"ERROR: dsa_area could not attach to a segment that has been freed"

Can you try to get call stack so that we can get a better picture of
what is happening here? One way could be you can change this ERROR to
PANIC which will generate core dump and you can get stack trace.

Yes, i can do both - i can reproduce it on test environment also

is there somewhere manual for that?

You can get some information from the below link:
https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_
a_running_PostgreSQL_backend_on_Linux/BSD

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#10

Robert Haas

robertmhaas@gmail.com

over 8 years ago

In reply to: Thomas Munro (#5)

Re: BUG in 10.1 - dsa_area could not attach to a segment that has been freed

On Tue, Nov 28, 2017 at 8:17 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

On Wed, Nov 29, 2017 at 1:33 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Why not? Can't it just be that the workers are slow getting started?

In the normal non-error control flow, don't we expect
ExecShutdownGather() to run ExecParallelFinish() before
ExecParallelCleanup(), meaning that the leader waits for workers to
finish completely before it detaches itself? Doesn't that need to be
case to avoid random "unable to map dynamic shared memory segment" and
"dsa_area could not attach to a segment that has been freed" errors,
and for the parallel instrumentation shown in EXPLAIN to be reliable?

Oh, hmm.

Could it be that the leader thought that a worker didn't start up, but
in fact it did?

Well, I don't know how that could happen, but I can't prove it didn't.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#11

Alexander Voytsekhovskyy

young.inbox@gmail.com

over 8 years ago

In reply to: Amit Kapila (#8)

Re: BUG in 10.1 - dsa_area could not attach to a segment that has been freed

I see my fault

here is GDB with correct symbols

Program received signal SIGUSR1, User defined signal 1.
pagetable_insert (found=<synthetic pointer>, key=8192, tb=<optimized out>)
at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/include/lib/simplehash.h:540
540
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/include/lib/simplehash.h:
No such file or directory.
Continuing.

Program received signal SIGUSR1, User defined signal 1.
pagetable_insert (found=<synthetic pointer>, key=8192, tb=<optimized out>)
at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/include/lib/simplehash.h:540
540 in
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/include/lib/simplehash.h
Continuing.

Program received signal SIGUSR1, User defined signal 1.
0x0000556024ed08ee in slot_deform_tuple (slot=slot@entry=0x556026638e60,
natts=natts@entry=29) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/access/common/heaptuple.c:1073
1073
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/access/common/heaptuple.c:
No such file or directory.
Continuing.

Program received signal SIGUSR1, User defined signal 1.
0x0000556024ed08ee in slot_deform_tuple (slot=slot@entry=0x556026638e60,
natts=natts@entry=29) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/access/common/heaptuple.c:1073
1073 in
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/access/common/heaptuple.c
Continuing.

Program received signal SIGUSR1, User defined signal 1.
0x00007f53e538e9b3 in __epoll_wait_nocancel () at
../sysdeps/unix/syscall-template.S:84
84 ../sysdeps/unix/syscall-template.S: No such file or directory.
Continuing.

Program received signal SIGUSR1, User defined signal 1.
0x00007f53e538e9b3 in __epoll_wait_nocancel () at
../sysdeps/unix/syscall-template.S:84
84 in ../sysdeps/unix/syscall-template.S
Continuing.

Program received signal SIGUSR1, User defined signal 1.
0x00005560250a0c4f in pagetable_insert (found=<synthetic pointer>,
key=14592, tb=0x556026699df0) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/include/lib/simplehash.h:493
493
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/include/lib/simplehash.h:
No such file or directory.
Detaching from program: /usr/lib/postgresql/10/bin/postgres, process 7453

On Wed, Nov 29, 2017 at 4:12 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

Show quoted text

On Wed, Nov 29, 2017 at 6:58 PM, Alexander Voytsekhovskyy
<young.inbox@gmail.com> wrote:

On Wed, Nov 29, 2017 at 2:38 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Tue, Nov 28, 2017 at 10:47 PM, Alexander Voytsekhovskyy
<young.inbox@gmail.com> wrote:

Hi,

in very certain case i am getting error

"ERROR: dsa_area could not attach to a segment that has been freed"

Can you try to get call stack so that we can get a better picture of
what is happening here? One way could be you can change this ERROR to
PANIC which will generate core dump and you can get stack trace.

Yes, i can do both - i can reproduce it on test environment also

is there somewhere manual for that?

You can get some information from the below link:
https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_
a_running_PostgreSQL_backend_on_Linux/BSD

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#12

Tomas Vondra

tomas.vondra@2ndquadrant.com

over 8 years ago

In reply to: Alexander Voytsekhovskyy (#11)

Re: BUG in 10.1 - dsa_area could not attach to a segment that has been freed

On 11/29/2017 04:46 PM, Alexander Voytsekhovskyy wrote:

I see my fault

here is GDB with correct symbols

Program received signal SIGUSR1, User defined signal 1.
pagetable_insert (found=<synthetic pointer>, key=8192, tb=<optimized
out>) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/include/lib/simplehash.h:540
540
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/include/lib/simplehash.h:
No such file or directory.
Continuing.

I think you really need to do "handle SIGUSR1 noprint nostop" to
eliminate the SIGUSR1 noise. That is absolutely irrelevant to the issue
you're trying to investigate.

Furthermore, this does not give us any sort of stack trace. You need to
either set a breakpoint at the elog() call, which I believe is this one:

https://github.com/postgres/postgres/blob/REL_10_STABLE/src/backend/utils/mmgr/dsa.c#L1736

wait for it to get triggered, and then generate backtrace using 'bt'.

So something like

1) gdb -p $PID

2) (gdb) handle SIGUSR1 noprint nostop

3) (gdb) break dsa.c:1736

4) run the query, gdb should interrupt at the breakpoint

5) (gdb) bt

and then send us the output of that gdb command.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#13

Alexander Voytsekhovskyy

young.inbox@gmail.com

over 8 years ago

In reply to: Tomas Vondra (#12)

Re: BUG in 10.1 - dsa_area could not attach to a segment that has been freed

thanks for detailed instructions!

Here it is:

Note: breakpoint 1 also set at pc 0x5560252c21f5.
Breakpoint 2 at 0x5560252c21f5: file
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/utils/mmgr/dsa.c,
line 1736.
(gdb) bt
#0 0x00007f53e538e9b3 in __epoll_wait_nocancel () at
../sysdeps/unix/syscall-template.S:84
#1 0x000055602516a791 in WaitEventSetWaitBlock (nevents=1,
occurred_events=0x7ffc2849a460, cur_timeout=-1, set=0x556026359498) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/storage/ipc/latch.c:1048
#2 WaitEventSetWait (set=0x556026359498, timeout=timeout@entry=-1,
occurred_events=occurred_events@entry=0x7ffc2849a460, nevents=nevents@entry=1,
wait_event_info=wait_event_info@entry=100663296) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/storage/ipc/latch.c:1000
#3 0x0000556025089767 in secure_read (port=0x55602639ddb0,
ptr=0x5560257195a0 <PqRecvBuffer>, len=8192) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/libpq/be-secure.c:166
#4 0x0000556025094f48 in pq_recvbuf () at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/libpq/pqcomm.c:963
#5 0x0000556025095c85 in pq_getbyte () at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/libpq/pqcomm.c:1006
#6 0x000055602518d350 in SocketBackend (inBuf=0x7ffc2849a620) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/tcop/postgres.c:328
#7 ReadCommand (inBuf=0x7ffc2849a620) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/tcop/postgres.c:501
#8 PostgresMain (argc=<optimized out>, argv=argv@entry=0x5560263a06e0,
dbname=0x5560263a05d8 "ddc_pvmrussia", username=<optimized out>) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/tcop/postgres.c:4030
#9 0x0000556024ec7fe9 in BackendRun (port=0x55602639ddb0) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/postmaster/postmaster.c:4357
#10 BackendStartup (port=0x55602639ddb0) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/postmaster/postmaster.c:4029
#11 ServerLoop () at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/postmaster/postmaster.c:1753
#12 0x000055602511d36b in PostmasterMain (argc=5, argv=<optimized out>) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/postmaster/postmaster.c:1361
#13 0x0000556024ec93e5 in main (argc=5, argv=0x556026358850) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/main/main.c:228
(gdb)

On Wed, Nov 29, 2017 at 6:02 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:

Show quoted text

On 11/29/2017 04:46 PM, Alexander Voytsekhovskyy wrote:

I see my fault

here is GDB with correct symbols

Program received signal SIGUSR1, User defined signal 1.
pagetable_insert (found=<synthetic pointer>, key=8192, tb=<optimized
out>) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../

src/include/lib/simplehash.h:540

540
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../

src/include/lib/simplehash.h:

No such file or directory.
Continuing.

I think you really need to do "handle SIGUSR1 noprint nostop" to
eliminate the SIGUSR1 noise. That is absolutely irrelevant to the issue
you're trying to investigate.

Furthermore, this does not give us any sort of stack trace. You need to
either set a breakpoint at the elog() call, which I believe is this one:

https://github.com/postgres/postgres/blob/REL_10_STABLE/
src/backend/utils/mmgr/dsa.c#L1736

wait for it to get triggered, and then generate backtrace using 'bt'.

So something like

1) gdb -p $PID

2) (gdb) handle SIGUSR1 noprint nostop

3) (gdb) break dsa.c:1736

4) run the query, gdb should interrupt at the breakpoint

5) (gdb) bt

and then send us the output of that gdb command.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#14

Tomas Vondra

tomas.vondra@2ndquadrant.com

over 8 years ago

In reply to: Alexander Voytsekhovskyy (#13)

Re: BUG in 10.1 - dsa_area could not attach to a segment that has been freed

On 11/29/2017 05:23 PM, Alexander Voytsekhovskyy wrote:

thanks for detailed instructions!

Here it is:

Note: breakpoint 1 also set at pc 0x5560252c21f5.
Breakpoint 2 at 0x5560252c21f5: file
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/utils/mmgr/dsa.c,
line 1736.
(gdb) bt
#0 0x00007f53e538e9b3 in __epoll_wait_nocancel () at
../sysdeps/unix/syscall-template.S:84
#1 0x000055602516a791 in WaitEventSetWaitBlock (nevents=1,
occurred_events=0x7ffc2849a460, cur_timeout=-1, set=0x556026359498) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/storage/ipc/latch.c:1048

Unfortunately, that's still not it. This shows the backtrace at the
moment when gdb attached to the process (because it interrupts the
exectution). You need to do "c" first, to continue the execution.

So it should be:

1) gdb -p $PID
2) (gdb) handle SIGUSR1 noprint nostop
3) (gdb) break dsa.c:1736
4) (gdb) c
5) run the query, gdb should interrupt at the breakpoint
6) (gdb) bt

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#15

Alexander Voytsekhovskyy

young.inbox@gmail.com

over 8 years ago

In reply to: Tomas Vondra (#14)

Re: BUG in 10.1 - dsa_area could not attach to a segment that has been freed

Thanks for helping, here is one more try

#0 get_segment_by_index (area=area@entry=0x556026700be8, index=1) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/utils/mmgr/dsa.c:1736
#1 0x00005560252c2b90 in dsa_get_address (area=area@entry=0x556026700be8,
dp=dp@entry=1099511685280) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/utils/mmgr/dsa.c:945
#2 0x00005560250a2c2b in tbm_attach_shared_iterate
(dsa=dsa@entry=0x556026700be8,
dp=1099511685280) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/nodes/tidbitmap.c:1503
#3 0x0000556025066c7b in BitmapHeapNext (node=node@entry=0x556026460710)
at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/executor/nodeBitmapHeapscan.c:176
#4 0x000055602505b36d in ExecScanFetch (recheckMtd=0x556025066470
<BitmapHeapRecheck>, accessMtd=0x556025066500 <BitmapHeapNext>,
node=0x556026460710) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/executor/execScan.c:97
#5 ExecScan (node=0x556026460710, accessMtd=0x556025066500
<BitmapHeapNext>, recheckMtd=0x556025066470 <BitmapHeapRecheck>) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/executor/execScan.c:164
#6 0x0000556025061033 in ExecProcNode (node=0x556026460710) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/include/executor/executor.h:250
#7 ExecAppend (pstate=0x556026582ff8) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/executor/nodeAppend.c:222
#8 0x0000556025068ce6 in ExecProcNode (node=0x556026582ff8) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/include/executor/executor.h:250
#9 gather_getnext (gatherstate=0x5560263b0ae0) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/executor/nodeGather.c:281
#10 ExecGather (pstate=0x5560263b0ae0) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/executor/nodeGather.c:215
#11 0x000055602506b6d8 in ExecProcNode (node=0x5560263b0ae0) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/include/executor/executor.h:250
#12 ExecHashJoinOuterGetTuple (hashvalue=0x7ffc2849a0dc,
hjstate=0x5560263b0220, outerNode=0x5560263b0ae0) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/executor/nodeHashjoin.c:612
#13 ExecHashJoin (pstate=0x5560263b0220) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/executor/nodeHashjoin.c:201
#14 0x0000556025077af6 in ExecProcNode (node=0x5560263b0220) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/include/executor/executor.h:250
#15 ExecSort (pstate=0x5560263affb0) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/executor/nodeSort.c:106
#16 0x000055602506138c in ExecProcNode (node=0x5560263affb0) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/include/executor/executor.h:250
#17 fetch_input_tuple (aggstate=aggstate@entry=0x5560263aef38) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/executor/nodeAgg.c:695
#18 0x000055602506363b in agg_retrieve_direct (aggstate=0x5560263aef38) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/executor/nodeAgg.c:2336
#19 ExecAgg (pstate=0x5560263aef38) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/executor/nodeAgg.c:2147
#20 0x000055602505528b in ExecProcNode (node=0x5560263aef38) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/include/executor/executor.h:250
#21 ExecutePlan (execute_once=<optimized out>, dest=0x5560266986c0,
direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>,
operation=CMD_SELECT, use_parallel_mode=<optimized out>,
planstate=0x5560263aef38, estate=0x5560263aecd8) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/executor/execMain.c:1722
#22 standard_ExecutorRun (queryDesc=0x556026567f28, direction=<optimized
out>, count=0, execute_once=<optimized out>) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/executor/execMain.c:363
#23 0x000055602518f95b in PortalRunSelect (portal=portal@entry=0x556026377d48,
forward=forward@entry=1 '\001', count=0, count@entry=9223372036854775807,
dest=dest@entry=0x5560266986c0) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/tcop/pquery.c:932
#24 0x0000556025190f58 in PortalRun (portal=portal@entry=0x556026377d48,
count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=1
'\001', run_once=run_once@entry=1 '\001', dest=dest@entry=0x5560266986c0,
altdest=altdest@entry=0x5560266986c0, completionTag=0x7ffc2849a470 "") at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/tcop/pquery.c:773
#25 0x000055602518c8dc in exec_simple_query (query_string=0x55602640eda8
"SELECT\n delivery_data.id_product AS id_product,\n
client_tt.id_client_tt\nFROM delivery_data\n JOIN client_tt AS
client_tt_tmp ON (client_tt_tmp.id_client_tt
=\ndelivery_data.id_client_tt)\n JOIN clien"...) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/tcop/postgres.c:1099
#26 0x000055602518ddf6 in PostgresMain (argc=<optimized out>,
argv=argv@entry=0x5560263a08e0, dbname=0x5560263a07d8 "ddc_pvmrussia",
username=<optimized out>) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/tcop/postgres.c:4088
#27 0x0000556024ec7fe9 in BackendRun (port=0x55602639e570) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/postmaster/postmaster.c:4357
#28 BackendStartup (port=0x55602639e570) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/postmaster/postmaster.c:4029
#29 ServerLoop () at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/postmaster/postmaster.c:1753
#30 0x000055602511d36b in PostmasterMain (argc=5, argv=<optimized out>) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/postmaster/postmaster.c:1361
#31 0x0000556024ec93e5 in main (argc=5, argv=0x556026358850) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/main/main.c:228

On Wed, Nov 29, 2017 at 8:41 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:

Show quoted text

On 11/29/2017 05:23 PM, Alexander Voytsekhovskyy wrote:

thanks for detailed instructions!

Here it is:

Note: breakpoint 1 also set at pc 0x5560252c21f5.
Breakpoint 2 at 0x5560252c21f5: file
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../

src/backend/utils/mmgr/dsa.c,

line 1736.
(gdb) bt
#0 0x00007f53e538e9b3 in __epoll_wait_nocancel () at
../sysdeps/unix/syscall-template.S:84
#1 0x000055602516a791 in WaitEventSetWaitBlock (nevents=1,
occurred_events=0x7ffc2849a460, cur_timeout=-1, set=0x556026359498) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../

src/backend/storage/ipc/latch.c:1048

Unfortunately, that's still not it. This shows the backtrace at the
moment when gdb attached to the process (because it interrupts the
exectution). You need to do "c" first, to continue the execution.

So it should be:

1) gdb -p $PID
2) (gdb) handle SIGUSR1 noprint nostop
3) (gdb) break dsa.c:1736
4) (gdb) c
5) run the query, gdb should interrupt at the breakpoint
6) (gdb) bt

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#16

Thomas Munro

thomas.munro@gmail.com

over 8 years ago

In reply to: Alexander Voytsekhovskyy (#15)

Re: BUG in 10.1 - dsa_area could not attach to a segment that has been freed

On Thu, Nov 30, 2017 at 9:34 AM, Alexander Voytsekhovskyy
<young.inbox@gmail.com> wrote:

Thanks for helping, here is one more try

#0 get_segment_by_index (area=area@entry=0x556026700be8, index=1) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/utils/mmgr/dsa.c:1736
#1 0x00005560252c2b90 in dsa_get_address (area=area@entry=0x556026700be8,
dp=dp@entry=1099511685280) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/utils/mmgr/dsa.c:945
#2 0x00005560250a2c2b in tbm_attach_shared_iterate
(dsa=dsa@entry=0x556026700be8, dp=1099511685280) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/nodes/tidbitmap.c:1503
#3 0x0000556025066c7b in BitmapHeapNext (node=node@entry=0x556026460710) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/executor/nodeBitmapHeapscan.c:176

Thank you for the report and the back trace. I think this might be a
manifestation of the problem I just described[1] on -hackers.
Depending on the shape of a multi-Gather query plan and therefore the
order of control flow, you might finish up using the DSA area that
belongs to a different Gather node and then find that it goes away too
soon. Investigating.

/messages/by-id/CAEepm=1U6as=brnVvMNixEV2tpi8NuyQoTmO8Qef0-VV+=7MDA@mail.gmail.com

--
Thomas Munro
http://www.enterprisedb.com

#17

Thomas Munro

thomas.munro@gmail.com

over 8 years ago

In reply to: Thomas Munro (#16)

Re: BUG in 10.1 - dsa_area could not attach to a segment that has been freed

On Thu, Nov 30, 2017 at 10:18 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

On Thu, Nov 30, 2017 at 9:34 AM, Alexander Voytsekhovskyy
<young.inbox@gmail.com> wrote:

Thanks for helping, here is one more try

#0 get_segment_by_index (area=area@entry=0x556026700be8, index=1) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/utils/mmgr/dsa.c:1736
#1 0x00005560252c2b90 in dsa_get_address (area=area@entry=0x556026700be8,
dp=dp@entry=1099511685280) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/utils/mmgr/dsa.c:945
#2 0x00005560250a2c2b in tbm_attach_shared_iterate
(dsa=dsa@entry=0x556026700be8, dp=1099511685280) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/nodes/tidbitmap.c:1503
#3 0x0000556025066c7b in BitmapHeapNext (node=node@entry=0x556026460710) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/executor/nodeBitmapHeapscan.c:176

Thank you for the report and the back trace. I think this might be a
manifestation of the problem I just described[1] on -hackers.
Depending on the shape of a multi-Gather query plan and therefore the
order of control flow, you might finish up using the DSA area that
belongs to a different Gather node and then find that it goes away too
soon. Investigating.

I haven't managed to reproduce this, but I was coincidentally
investigating a bug that appears to explain it. I think what happened
is that a background worker was first to execute BitmapHeapNext and
allocated a dsa_pointer, and then the leader process reached
BitmapHeapNext and called tbm_attach_shared_iterate which tried to
deference it, but it had es_query_dsa set to another gather node's DSA
area (whichever Gather most recently ran ExecInitParallelPlan). That
requires a certain order of execution and timing that I'm not sure how
to reach. I have posted a patch that should fix it over here:

/messages/by-id/CAEepm=0Mv9BigJPpribGQhnHqVGYo2+kmzekGUVJJc9Y_ZVaYA@mail.gmail.com

Are you able to provide a minimal reproducer, an anonymised partial
dump, or perhaps try out the patch on a copy of your database?

--
Thomas Munro
http://www.enterprisedb.com

#18

Alexander Voytsekhovskyy

av@mobile-ua.com

over 8 years ago

In reply to: Thomas Munro (#17)

Re: BUG in 10.1 - dsa_area could not attach to a segment that has been freed

Hi,

i can confirm that this patch fix the bug for me.

I can’t provide minimal reproducer, but i can share whole data set with you

Show quoted text

On Thu, Nov 30, 2017 at 10:18 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

On Thu, Nov 30, 2017 at 9:34 AM, Alexander Voytsekhovskyy
<young.inbox@gmail.com> wrote:

Thanks for helping, here is one more try

#0 get_segment_by_index (area=area@entry=0x556026700be8, index=1) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/utils/mmgr/dsa.c:1736
#1 0x00005560252c2b90 in dsa_get_address (area=area@entry=0x556026700be8,
dp=dp@entry=1099511685280) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/utils/mmgr/dsa.c:945
#2 0x00005560250a2c2b in tbm_attach_shared_iterate
(dsa=dsa@entry=0x556026700be8, dp=1099511685280) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/nodes/tidbitmap.c:1503
#3 0x0000556025066c7b in BitmapHeapNext (node=node@entry=0x556026460710) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/executor/nodeBitmapHeapscan.c:176

Thank you for the report and the back trace. I think this might be a
manifestation of the problem I just described[1] on -hackers.
Depending on the shape of a multi-Gather query plan and therefore the
order of control flow, you might finish up using the DSA area that
belongs to a different Gather node and then find that it goes away too
soon. Investigating.

I haven't managed to reproduce this, but I was coincidentally
investigating a bug that appears to explain it. I think what happened
is that a background worker was first to execute BitmapHeapNext and
allocated a dsa_pointer, and then the leader process reached
BitmapHeapNext and called tbm_attach_shared_iterate which tried to
deference it, but it had es_query_dsa set to another gather node's DSA
area (whichever Gather most recently ran ExecInitParallelPlan). That
requires a certain order of execution and timing that I'm not sure how
to reach. I have posted a patch that should fix it over here:

/messages/by-id/CAEepm=0Mv9BigJPpribGQhnHqVGYo2+kmzekGUVJJc9Y_ZVaYA@mail.gmail.com

Are you able to provide a minimal reproducer, an anonymised partial
dump, or perhaps try out the patch on a copy of your database?

--
Thomas Munro
http://www.enterprisedb.com

#19

Thomas Munro

thomas.munro@gmail.com

over 8 years ago

In reply to: Alexander Voytsekhovskyy (#18)

Re: BUG in 10.1 - dsa_area could not attach to a segment that has been freed

On Fri, Dec 8, 2017 at 9:55 AM, Alexander Voytsekhovskyy
<av@mobile-ua.com> wrote:

i can confirm that this patch fix the bug for me.

Great news. Thanks for taking the time to test it!

I can’t provide minimal reproducer, but i can share whole data set with you

It's OK, your report is enough for me (I don't want to receive any
non-anonymised data).

--
Thomas Munro
http://www.enterprisedb.com