Somebody has not thought through subscription locking considerations

Started by Tom Lanealmost 9 years ago17 messages
#1Tom Lane
tgl@sss.pgh.pa.us

I noticed this failure report:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dangomushi&dt=2017-03-29%2019%3A45%3A27

in which we find

*** /home/buildfarm/data/buildroot/HEAD/pgsql.build/src/test/regress/expected/updatable_views.out	Thu Mar 30 04:45:43 2017
--- /home/buildfarm/data/buildroot/HEAD/pgsql.build/src/test/regress/results/updatable_views.out	Thu Mar 30 05:32:37 2017
***************
*** 349,354 ****
--- 349,358 ----
  DROP VIEW ro_view10, ro_view12, ro_view18;
  DROP SEQUENCE seq CASCADE;
  NOTICE:  drop cascades to view ro_view19
+ ERROR:  deadlock detected
+ DETAIL:  Process 7576 waits for AccessShareLock on relation 1259 of database 16384; blocked by process 7577.
+ Process 7577 waits for ShareRowExclusiveLock on relation 6102 of database 16384; blocked by process 7576.
+ HINT:  See server log for query details.
  -- simple updatable view
  CREATE TABLE base_tbl (a int PRIMARY KEY, b text DEFAULT 'Unspecified');
  INSERT INTO base_tbl SELECT i, 'Row ' || i FROM generate_series(-2, 2) g(i);

and the referenced bit of log is

[58dc19dd.1d98:175] ERROR: deadlock detected
[58dc19dd.1d98:176] DETAIL: Process 7576 waits for AccessShareLock on relation 1259 of database 16384; blocked by process 7577.
Process 7577 waits for ShareRowExclusiveLock on relation 6102 of database 16384; blocked by process 7576.
Process 7576: DROP SEQUENCE seq CASCADE;
Process 7577: VACUUM FULL pg_class;
[58dc19dd.1d98:177] HINT: See server log for query details.
[58dc19dd.1d98:178] STATEMENT: DROP SEQUENCE seq CASCADE;

Of course, 1259 is pg_class and 6102 is pg_subscription_rel.

I await with interest an explanation of what "VACUUM FULL pg_class" is
doing trying to acquire ShareRowExclusiveLock on pg_subscription_rel, not
to mention why a DROP SEQUENCE is holding some fairly strong lock on that
relation. *Especially* in a situation where no subscriptions exist ---
but even if any did, this seems unacceptable on its face. Access to core
catalogs like pg_class cannot depend on random other stuff.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2Petr Jelinek
petr.jelinek@2ndquadrant.com
In reply to: Tom Lane (#1)
Re: Somebody has not thought through subscription locking considerations

On 30/03/17 07:25, Tom Lane wrote:

I noticed this failure report:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dangomushi&dt=2017-03-29%2019%3A45%3A27

in which we find

*** /home/buildfarm/data/buildroot/HEAD/pgsql.build/src/test/regress/expected/updatable_views.out	Thu Mar 30 04:45:43 2017
--- /home/buildfarm/data/buildroot/HEAD/pgsql.build/src/test/regress/results/updatable_views.out	Thu Mar 30 05:32:37 2017
***************
*** 349,354 ****
--- 349,358 ----
DROP VIEW ro_view10, ro_view12, ro_view18;
DROP SEQUENCE seq CASCADE;
NOTICE:  drop cascades to view ro_view19
+ ERROR:  deadlock detected
+ DETAIL:  Process 7576 waits for AccessShareLock on relation 1259 of database 16384; blocked by process 7577.
+ Process 7577 waits for ShareRowExclusiveLock on relation 6102 of database 16384; blocked by process 7576.
+ HINT:  See server log for query details.
-- simple updatable view
CREATE TABLE base_tbl (a int PRIMARY KEY, b text DEFAULT 'Unspecified');
INSERT INTO base_tbl SELECT i, 'Row ' || i FROM generate_series(-2, 2) g(i);

and the referenced bit of log is

[58dc19dd.1d98:175] ERROR: deadlock detected
[58dc19dd.1d98:176] DETAIL: Process 7576 waits for AccessShareLock on relation 1259 of database 16384; blocked by process 7577.
Process 7577 waits for ShareRowExclusiveLock on relation 6102 of database 16384; blocked by process 7576.
Process 7576: DROP SEQUENCE seq CASCADE;
Process 7577: VACUUM FULL pg_class;
[58dc19dd.1d98:177] HINT: See server log for query details.
[58dc19dd.1d98:178] STATEMENT: DROP SEQUENCE seq CASCADE;

Of course, 1259 is pg_class and 6102 is pg_subscription_rel.

I await with interest an explanation of what "VACUUM FULL pg_class" is
doing trying to acquire ShareRowExclusiveLock on pg_subscription_rel, not
to mention why a DROP SEQUENCE is holding some fairly strong lock on that
relation. *Especially* in a situation where no subscriptions exist ---
but even if any did, this seems unacceptable on its face. Access to core
catalogs like pg_class cannot depend on random other stuff.

Hmm, the DROP SEQUENCE is result of not having dependency info for
relations/subscriptions I think. I was told during review it's needless
bloat of dependency catalog. I guess we should revisit that. It's also
likely that RemoveSubscriptionRel will work fine with lower lock level.

I have no idea why VACUUM FULL of pg_class would touch the
pg_subscription_rel though, I'll have to dig into that.

I can see that locking for example pg_trigger or pg_depend in
ShareRowExclusiveLock will also block VACUUM FULL on pg_class (and other
related tables like pg_attribute). So maybe we just need to be careful
about not taking such a strong lock...

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Petr Jelinek (#2)
Re: Somebody has not thought through subscription locking considerations

On Fri, Mar 31, 2017 at 9:53 AM, Petr Jelinek
<petr.jelinek@2ndquadrant.com> wrote:

On 30/03/17 07:25, Tom Lane wrote:

I noticed this failure report:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dangomushi&amp;dt=2017-03-29%2019%3A45%3A27

in which we find

*** /home/buildfarm/data/buildroot/HEAD/pgsql.build/src/test/regress/expected/updatable_views.out     Thu Mar 30 04:45:43 2017
--- /home/buildfarm/data/buildroot/HEAD/pgsql.build/src/test/regress/results/updatable_views.out      Thu Mar 30 05:32:37 2017
***************
*** 349,354 ****
--- 349,358 ----
DROP VIEW ro_view10, ro_view12, ro_view18;
DROP SEQUENCE seq CASCADE;
NOTICE:  drop cascades to view ro_view19
+ ERROR:  deadlock detected
+ DETAIL:  Process 7576 waits for AccessShareLock on relation 1259 of database 16384; blocked by process 7577.
+ Process 7577 waits for ShareRowExclusiveLock on relation 6102 of database 16384; blocked by process 7576.
+ HINT:  See server log for query details.
-- simple updatable view
CREATE TABLE base_tbl (a int PRIMARY KEY, b text DEFAULT 'Unspecified');
INSERT INTO base_tbl SELECT i, 'Row ' || i FROM generate_series(-2, 2) g(i);

and the referenced bit of log is

[58dc19dd.1d98:175] ERROR: deadlock detected
[58dc19dd.1d98:176] DETAIL: Process 7576 waits for AccessShareLock on relation 1259 of database 16384; blocked by process 7577.
Process 7577 waits for ShareRowExclusiveLock on relation 6102 of database 16384; blocked by process 7576.
Process 7576: DROP SEQUENCE seq CASCADE;
Process 7577: VACUUM FULL pg_class;
[58dc19dd.1d98:177] HINT: See server log for query details.
[58dc19dd.1d98:178] STATEMENT: DROP SEQUENCE seq CASCADE;

Of course, 1259 is pg_class and 6102 is pg_subscription_rel.

I await with interest an explanation of what "VACUUM FULL pg_class" is
doing trying to acquire ShareRowExclusiveLock on pg_subscription_rel, not
to mention why a DROP SEQUENCE is holding some fairly strong lock on that
relation. *Especially* in a situation where no subscriptions exist ---
but even if any did, this seems unacceptable on its face. Access to core
catalogs like pg_class cannot depend on random other stuff.

Hmm, the DROP SEQUENCE is result of not having dependency info for
relations/subscriptions I think. I was told during review it's needless
bloat of dependency catalog. I guess we should revisit that. It's also
likely that RemoveSubscriptionRel will work fine with lower lock level.

I have no idea why VACUUM FULL of pg_class would touch the
pg_subscription_rel though, I'll have to dig into that.

VACUUM FULL of any table acquires ShareRowExclusiveLock on
pg_subscription_rel because when doDeletion removes old heap the
RemoveSubscriptionRel is called in heap_drop_with_catalog. The
following stack trace is what I got when run VACUUM FULL pg_class.

#2 0x000000000084b9d2 in LockRelationOid (relid=6102, lockmode=6) at lmgr.c:115
#3 0x00000000004cec37 in relation_open (relationId=6102, lockmode=6)
at heapam.c:1122
#4 0x00000000004ceef9 in heap_open (relationId=6102, lockmode=6) at
heapam.c:1288
#5 0x0000000000599a02 in RemoveSubscriptionRel (subid=0, relid=16409)
at pg_subscription.c:361
#6 0x000000000056037f in heap_drop_with_catalog (relid=16409) at heap.c:1842
#7 0x000000000055b420 in doDeletion (object=0x1bca758, flags=1) at
dependency.c:1125
#8 0x000000000055b192 in deleteOneObject (object=0x1bca758,
depRel=0x7fff000716a0, flags=1) at dependency.c:1028
#9 0x000000000055a169 in deleteObjectsInList
(targetObjects=0x1b99708, depRel=0x7fff000716a0, flags=1) at
dependency.c:263
#10 0x000000000055a21a in performDeletion (object=0x7fff00071750,
behavior=DROP_RESTRICT, flags=1) at dependency.c:344
#11 0x0000000000615597 in finish_heap_swap (OIDOldHeap=1259,
OIDNewHeap=16409, is_system_catalog=1 '\001', swap_toast_by_content=0
'\000', check_constraints=0 '\000', is_internal=1 '\001',
frozenXid=571, cutoffMulti=1, newrelpersistence=112 'p') at
cluster.c:1574
#12 0x0000000000613bd3 in rebuild_relation (OldHeap=0x7f541f0d24a0,
indexOid=0, verbose=0 '\000') at cluster.c:590
#13 0x000000000061362a in cluster_rel (tableOid=1259, indexOid=0,
recheck=0 '\000', verbose=0 '\000') at cluster.c:404
#14 0x000000000069f40f in vacuum_rel (relid=1259, relation=0x1b9a770,
options=17, params=0x7fff00071a80) at vacuum.c:1441
#15 0x000000000069db9b in vacuum (options=17, relation=0x1b9a770,
relid=0, params=0x7fff00071a80, va_cols=0x0, bstrategy=0x1bf2f50,
isTopLevel=1 '\001') at vacuum.c:304
#16 0x000000000069d7ec in ExecVacuum (vacstmt=0x1b9a7c8, isTopLevel=1
'\001') at vacuum.c:122
#17 0x0000000000873a9c in standard_ProcessUtility (pstmt=0x1b9ab28,
queryString=0x1b99d50 "vacuum FULL pg_class;",
context=PROCESS_UTILITY_TOPLEVEL, params=0x0, dest=0x1b9ac20,
completionTag=0x7fff00071eb0 "") at utility.c:670
#18 0x0000000000873222 in ProcessUtility (pstmt=0x1b9ab28,
queryString=0x1b99d50 "vacuum FULL pg_class;",
context=PROCESS_UTILITY_TOPLEVEL, params=0x0, dest=0x1b9ac20,
completionTag=0x7fff00071eb0 "") at utility.c:353
#19 0x000000000087220c in PortalRunUtility (portal=0x1b35540,
pstmt=0x1b9ab28, isTopLevel=1 '\001', setHoldSnapshot=0 '\000',
dest=0x1b9ac20, completionTag=0x7fff00071eb0 "") at pquery.c:1174
#20 0x0000000000872419 in PortalRunMulti (portal=0x1b35540,
isTopLevel=1 '\001', setHoldSnapshot=0 '\000', dest=0x1b9ac20,
altdest=0x1b9ac20, completionTag=0x7fff00071eb0 "") at pquery.c:1317
#21 0x0000000000871945 in PortalRun (portal=0x1b35540,
count=9223372036854775807, isTopLevel=1 '\001', run_once=1 '\001',
dest=0x1b9ac20, altdest=0x1b9ac20, completionTag=0x7fff00071eb0 "") at
pquery.c:795
#22 0x000000000086ba9e in exec_simple_query (query_string=0x1b99d50
"vacuum FULL pg_class;") at postgres.c:1101
#23 0x000000000086fbf8 in PostgresMain (argc=1, argv=0x1b44220,
dbname=0x1b44080 "postgres", username=0x1b44058 "masahiko") at
postgres.c:4071
#24 0x00000000007d6e46 in BackendRun (port=0x1b3c5d0) at postmaster.c:4317
#25 0x00000000007d65c6 in BackendStartup (port=0x1b3c5d0) at postmaster.c:3989
#26 0x00000000007d2bde in ServerLoop () at postmaster.c:1729
#27 0x00000000007d22a2 in PostmasterMain (argc=5, argv=0x1b15ed0) at
postmaster.c:1337
#28 0x0000000000710cfa in main (argc=5, argv=0x1b15ed0) at main.c:228

I can see that locking for example pg_trigger or pg_depend in
ShareRowExclusiveLock will also block VACUUM FULL on pg_class (and other
related tables like pg_attribute). So maybe we just need to be careful
about not taking such a strong lock...

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Masahiko Sawada (#3)
Re: Somebody has not thought through subscription locking considerations

Masahiko Sawada <sawada.mshk@gmail.com> writes:

On Fri, Mar 31, 2017 at 9:53 AM, Petr Jelinek
<petr.jelinek@2ndquadrant.com> wrote:

On 30/03/17 07:25, Tom Lane wrote:

I await with interest an explanation of what "VACUUM FULL pg_class" is
doing trying to acquire ShareRowExclusiveLock on pg_subscription_rel, not
to mention why a DROP SEQUENCE is holding some fairly strong lock on that
relation.

VACUUM FULL of any table acquires ShareRowExclusiveLock on
pg_subscription_rel because when doDeletion removes old heap the
RemoveSubscriptionRel is called in heap_drop_with_catalog.

This seems entirely horrid: it *guarantees* deadlock possibilities.
And I wonder what happens when I VACUUM FULL pg_subscription_rel
itself.

At the very least, it would be a good idea to exclude the system
catalogs from logical replication, wouldn't it?

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#5Petr Jelinek
petr.jelinek@2ndquadrant.com
In reply to: Tom Lane (#4)
Re: Somebody has not thought through subscription locking considerations

On 31/03/17 19:35, Tom Lane wrote:

Masahiko Sawada <sawada.mshk@gmail.com> writes:

On Fri, Mar 31, 2017 at 9:53 AM, Petr Jelinek
<petr.jelinek@2ndquadrant.com> wrote:

On 30/03/17 07:25, Tom Lane wrote:

I await with interest an explanation of what "VACUUM FULL pg_class" is
doing trying to acquire ShareRowExclusiveLock on pg_subscription_rel, not
to mention why a DROP SEQUENCE is holding some fairly strong lock on that
relation.

VACUUM FULL of any table acquires ShareRowExclusiveLock on
pg_subscription_rel because when doDeletion removes old heap the
RemoveSubscriptionRel is called in heap_drop_with_catalog.

This seems entirely horrid: it *guarantees* deadlock possibilities.
And I wonder what happens when I VACUUM FULL pg_subscription_rel
itself.

At the very least, it would be a good idea to exclude the system
catalogs from logical replication, wouldn't it?

They are excluded. It works same way for triggers and many other objects
so I would not say it's horrid.

The problematic part is that the pg_subscription_rel manipulation
functions acquire ShareRowExclusiveLock and not the usual
RowExclusiveLock because there were some worries about concurrency. I
think though that it's not needed though given the access patterns
there. It's only updated by CREATE SUBSCRIPTION/ALTER SUBSCRIPTION
REFRESH and then by tablesync which holds exclusive lock on the table
itself anyway.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Petr Jelinek (#5)
Re: Somebody has not thought through subscription locking considerations

Petr Jelinek <petr.jelinek@2ndquadrant.com> writes:

On 31/03/17 19:35, Tom Lane wrote:

At the very least, it would be a good idea to exclude the system
catalogs from logical replication, wouldn't it?

They are excluded.

Well, the exclusion is apparently not checked early enough. I would say
that touches of system catalogs should never result in any attempts to
access the logical relation infrastructure, but what we have here is
such an access.

The problematic part is that the pg_subscription_rel manipulation
functions acquire ShareRowExclusiveLock and not the usual
RowExclusiveLock because there were some worries about concurrency.

No, the problematic part is that there is any heap_open happening at
all. That open could very easily result in a recursive attempt to read
pg_class, for example, which is going to be fatal if we're in the midst
of vacuum full'ing or reindex'ing pg_class. It's frankly astonishing
to me that this patch seems to have survived testing under
CLOBBER_CACHE_ALWAYS, because it's only the catalog caches that are
preventing such recursive lookups.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#7Petr Jelinek
petr.jelinek@2ndquadrant.com
In reply to: Tom Lane (#6)
Re: Somebody has not thought through subscription locking considerations

On 31/03/17 20:23, Tom Lane wrote:

Petr Jelinek <petr.jelinek@2ndquadrant.com> writes:

On 31/03/17 19:35, Tom Lane wrote:

At the very least, it would be a good idea to exclude the system
catalogs from logical replication, wouldn't it?

They are excluded.

Well, the exclusion is apparently not checked early enough. I would say
that touches of system catalogs should never result in any attempts to
access the logical relation infrastructure, but what we have here is
such an access.

The problematic part is that the pg_subscription_rel manipulation
functions acquire ShareRowExclusiveLock and not the usual
RowExclusiveLock because there were some worries about concurrency.

No, the problematic part is that there is any heap_open happening at
all. That open could very easily result in a recursive attempt to read
pg_class, for example, which is going to be fatal if we're in the midst
of vacuum full'ing or reindex'ing pg_class. It's frankly astonishing
to me that this patch seems to have survived testing under
CLOBBER_CACHE_ALWAYS, because it's only the catalog caches that are
preventing such recursive lookups.

Hmm okay, so the solution is to either use standard dependency info for
this so that it's only called for tables that are actually know to be
subscribed or have some exceptions in the current code to call the
function only for user catalogs. Any preferences?

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Petr Jelinek (#7)
Re: Somebody has not thought through subscription locking considerations

Petr Jelinek <petr.jelinek@2ndquadrant.com> writes:

On 31/03/17 20:23, Tom Lane wrote:

No, the problematic part is that there is any heap_open happening at
all. That open could very easily result in a recursive attempt to read
pg_class, for example, which is going to be fatal if we're in the midst
of vacuum full'ing or reindex'ing pg_class. It's frankly astonishing
to me that this patch seems to have survived testing under
CLOBBER_CACHE_ALWAYS, because it's only the catalog caches that are
preventing such recursive lookups.

Hmm okay, so the solution is to either use standard dependency info for
this so that it's only called for tables that are actually know to be
subscribed or have some exceptions in the current code to call the
function only for user catalogs. Any preferences?

Looking at dependency info isn't going to fix this, it only moves the
unsafe catalog access somewhere else (ie pg_depend instead of
pg_subscription_rel). I suspect the only safe solution is doing an
IsCatalogRelation or similar test pretty early in the logical replication
code paths.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#9Petr Jelinek
petr.jelinek@2ndquadrant.com
In reply to: Tom Lane (#8)
Re: Somebody has not thought through subscription locking considerations

On 31/03/17 21:00, Tom Lane wrote:

Petr Jelinek <petr.jelinek@2ndquadrant.com> writes:

On 31/03/17 20:23, Tom Lane wrote:

No, the problematic part is that there is any heap_open happening at
all. That open could very easily result in a recursive attempt to read
pg_class, for example, which is going to be fatal if we're in the midst
of vacuum full'ing or reindex'ing pg_class. It's frankly astonishing
to me that this patch seems to have survived testing under
CLOBBER_CACHE_ALWAYS, because it's only the catalog caches that are
preventing such recursive lookups.

Hmm okay, so the solution is to either use standard dependency info for
this so that it's only called for tables that are actually know to be
subscribed or have some exceptions in the current code to call the
function only for user catalogs. Any preferences?

Looking at dependency info isn't going to fix this, it only moves the
unsafe catalog access somewhere else (ie pg_depend instead of
pg_subscription_rel). I suspect the only safe solution is doing an
IsCatalogRelation or similar test pretty early in the logical replication
code paths.

I don't follow, everything else does dependency info check in this
situation, how is this any different?

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Petr Jelinek (#9)
Re: Somebody has not thought through subscription locking considerations

Petr Jelinek <petr.jelinek@2ndquadrant.com> writes:

On 31/03/17 21:00, Tom Lane wrote:

Looking at dependency info isn't going to fix this, it only moves the
unsafe catalog access somewhere else (ie pg_depend instead of
pg_subscription_rel). I suspect the only safe solution is doing an
IsCatalogRelation or similar test pretty early in the logical replication
code paths.

I don't follow, everything else does dependency info check in this
situation, how is this any different?

What do you mean by "everything else"? The key point here is that
access to the bootstrap catalogs like pg_class and pg_attribute can't
be dependent on accesses to other catalogs, or we get into circularities.
We certainly aren't trying to look in pg_depend when we do a heap_open.

(Or, if you tell me that we are now because the logical replication
patch added it, I'm going to start agitating for reverting the patch
and sending it back for redesign.)

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#11Petr Jelinek
petr.jelinek@2ndquadrant.com
In reply to: Tom Lane (#10)
Re: Somebody has not thought through subscription locking considerations

On 01/04/17 00:52, Tom Lane wrote:

Petr Jelinek <petr.jelinek@2ndquadrant.com> writes:

On 31/03/17 21:00, Tom Lane wrote:

Looking at dependency info isn't going to fix this, it only moves the
unsafe catalog access somewhere else (ie pg_depend instead of
pg_subscription_rel). I suspect the only safe solution is doing an
IsCatalogRelation or similar test pretty early in the logical replication
code paths.

I don't follow, everything else does dependency info check in this
situation, how is this any different?

What do you mean by "everything else"? The key point here is that
access to the bootstrap catalogs like pg_class and pg_attribute can't
be dependent on accesses to other catalogs, or we get into circularities.
We certainly aren't trying to look in pg_depend when we do a heap_open.

(Or, if you tell me that we are now because the logical replication
patch added it, I'm going to start agitating for reverting the patch
and sending it back for redesign.)

But the pg_subscription_rel is also not accessed on heap_open, the
problematic code is called from heap_drop_with_catalog. And VACUUM FULL
pg_class calls heap_drop_with_catalog() when doing the heap swap (and it
goes through performDeletion so through dependency info which is what I
mean by everything else does this).

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#12Tom Lane
tgl@sss.pgh.pa.us
In reply to: Petr Jelinek (#11)
Re: Somebody has not thought through subscription locking considerations

Petr Jelinek <petr.jelinek@2ndquadrant.com> writes:

But the pg_subscription_rel is also not accessed on heap_open, the
problematic code is called from heap_drop_with_catalog. And VACUUM FULL
pg_class calls heap_drop_with_catalog() when doing the heap swap (and it
goes through performDeletion so through dependency info which is what I
mean by everything else does this).

Hmmm. What the heap_drop_with_catalog call is being applied to is a
short-lived table that is not pg_class. It happens to own the disk
storage that used to belong to pg_class, but it is not pg_class;
in particular it doesn't have the same OID, and it is not what would
be looked in if someone happened to need to fetch a pg_class row
at that point. So there's no circularity involved.

However ... by that same token, it ought to be okay to consult
pg_subscription_rel during that drop step. Indeed, if we put
an IsCatalogRelation check in there, it wouldn't fire, because
the target table has OID 16409 according to your stack trace.
So that line of thought is indeed not very helpful.

On further reflection it seems like you were right, the problem is
taking a self-exclusive lock on pg_subscription_rel during low-level
catalog operations. We're going to have to find a way to reduce that
lock strength, or we're going to have a lot of deadlock problems.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#13Petr Jelinek
petr.jelinek@2ndquadrant.com
In reply to: Tom Lane (#12)
Re: Somebody has not thought through subscription locking considerations

On 01/04/17 01:20, Tom Lane wrote:

Petr Jelinek <petr.jelinek@2ndquadrant.com> writes:

But the pg_subscription_rel is also not accessed on heap_open, the
problematic code is called from heap_drop_with_catalog. And VACUUM FULL
pg_class calls heap_drop_with_catalog() when doing the heap swap (and it
goes through performDeletion so through dependency info which is what I
mean by everything else does this).

Hmmm. What the heap_drop_with_catalog call is being applied to is a
short-lived table that is not pg_class. It happens to own the disk
storage that used to belong to pg_class, but it is not pg_class;
in particular it doesn't have the same OID, and it is not what would
be looked in if someone happened to need to fetch a pg_class row
at that point. So there's no circularity involved.

On further reflection it seems like you were right, the problem is
taking a self-exclusive lock on pg_subscription_rel during low-level
catalog operations. We're going to have to find a way to reduce that
lock strength, or we're going to have a lot of deadlock problems.

Well we have heavy lock because we were worried about failure scenarios
in our dump upsert in SetSubscriptionRelState which does cache lookup
and if it finds something it updates, otherwise inserts. We have similar
pattern in other places but they are called from DDL statements usually
so the worst that can happen is DDL fails with weird errors when done
concurrently with same name so using RowExclusiveLock is okay.

That being said, looking at use-cases for SetSubscriptionRelState that's
basically CREATE SUBSCRIPTION, ALTER SUBSCRIPTION REFRESH and tablesync
worker. So the DDL thing applies to first ones as well and tablesync
should not be running in case the record does not exist so it's fine if
it fails. In terms of RemoveSubscriptionRel that's only called from
heap_drop_with_catalog and tablesync holds relation lock so there is no
way heap_drop_with_catalog will happen on the same relation. This leads
me to thinking that RowExclusiveLock is fine for both
SetSubscriptionRelState and RemoveSubscriptionRel as long as we document
that callers should be aware that SetSubscriptionRelState has
concurrency might fail on unique index check.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#14Petr Jelinek
petr.jelinek@2ndquadrant.com
In reply to: Petr Jelinek (#13)
Re: Somebody has not thought through subscription locking considerations

On 01/04/17 01:57, Petr Jelinek wrote:

On 01/04/17 01:20, Tom Lane wrote:

Petr Jelinek <petr.jelinek@2ndquadrant.com> writes:

But the pg_subscription_rel is also not accessed on heap_open, the
problematic code is called from heap_drop_with_catalog. And VACUUM FULL
pg_class calls heap_drop_with_catalog() when doing the heap swap (and it
goes through performDeletion so through dependency info which is what I
mean by everything else does this).

Hmmm. What the heap_drop_with_catalog call is being applied to is a
short-lived table that is not pg_class. It happens to own the disk
storage that used to belong to pg_class, but it is not pg_class;
in particular it doesn't have the same OID, and it is not what would
be looked in if someone happened to need to fetch a pg_class row
at that point. So there's no circularity involved.

On further reflection it seems like you were right, the problem is
taking a self-exclusive lock on pg_subscription_rel during low-level
catalog operations. We're going to have to find a way to reduce that
lock strength, or we're going to have a lot of deadlock problems.

Well we have heavy lock because we were worried about failure scenarios
in our dump upsert in SetSubscriptionRelState which does cache lookup

*dumb (ie, like me ;) )

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#15Petr Jelinek
petr.jelinek@2ndquadrant.com
In reply to: Petr Jelinek (#13)
1 attachment(s)
Re: Somebody has not thought through subscription locking considerations

On 01/04/17 01:57, Petr Jelinek wrote:

On 01/04/17 01:20, Tom Lane wrote:

Petr Jelinek <petr.jelinek@2ndquadrant.com> writes:

But the pg_subscription_rel is also not accessed on heap_open, the
problematic code is called from heap_drop_with_catalog. And VACUUM FULL
pg_class calls heap_drop_with_catalog() when doing the heap swap (and it
goes through performDeletion so through dependency info which is what I
mean by everything else does this).

Hmmm. What the heap_drop_with_catalog call is being applied to is a
short-lived table that is not pg_class. It happens to own the disk
storage that used to belong to pg_class, but it is not pg_class;
in particular it doesn't have the same OID, and it is not what would
be looked in if someone happened to need to fetch a pg_class row
at that point. So there's no circularity involved.

On further reflection it seems like you were right, the problem is
taking a self-exclusive lock on pg_subscription_rel during low-level
catalog operations. We're going to have to find a way to reduce that
lock strength, or we're going to have a lot of deadlock problems.

Well we have heavy lock because we were worried about failure scenarios
in our dumb upsert in SetSubscriptionRelState which does cache lookup
and if it finds something it updates, otherwise inserts. We have similar
pattern in other places but they are called from DDL statements usually
so the worst that can happen is DDL fails with weird errors when done
concurrently with same name so using RowExclusiveLock is okay.

That being said, looking at use-cases for SetSubscriptionRelState that's
basically CREATE SUBSCRIPTION, ALTER SUBSCRIPTION REFRESH and tablesync
worker. So the DDL thing applies to first ones as well and tablesync
should not be running in case the record does not exist so it's fine if
it fails. In terms of RemoveSubscriptionRel that's only called from
heap_drop_with_catalog and tablesync holds relation lock so there is no
way heap_drop_with_catalog will happen on the same relation. This leads
me to thinking that RowExclusiveLock is fine for both
SetSubscriptionRelState and RemoveSubscriptionRel as long as we document
that callers should be aware that SetSubscriptionRelState has
concurrency issues and fail on unique index check.

And a simple patch to do so. Peter do you see any problem with doing this?

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

Use-weaker-locks-when-updating-subscription-relation.patchapplication/x-patch; name=Use-weaker-locks-when-updating-subscription-relation.patchDownload
From a12259efe7b4d0af570e7d2d7cb48b249158a1bd Mon Sep 17 00:00:00 2001
From: Petr Jelinek <pjmodos@pjmodos.net>
Date: Sat, 1 Apr 2017 02:20:16 +0200
Subject: [PATCH] Use weaker locks when updating subscription relation state

---
 src/backend/catalog/pg_subscription.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index e7a1634..b0bd248 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -215,6 +215,9 @@ textarray_to_stringlist(ArrayType *textarray)
 
 /*
  * Set the state of a subscription table.
+ *
+ * The insert or update logic in this function is not concurrency safe so
+ * it may raise an error.
  */
 Oid
 SetSubscriptionRelState(Oid subid, Oid relid, char state,
@@ -227,7 +230,7 @@ SetSubscriptionRelState(Oid subid, Oid relid, char state,
 	Datum		values[Natts_pg_subscription_rel];
 
 	/* Prevent concurrent changes. */
-	rel = heap_open(SubscriptionRelRelationId, ShareRowExclusiveLock);
+	rel = heap_open(SubscriptionRelRelationId, RowExclusiveLock);
 
 	/* Try finding existing mapping. */
 	tup = SearchSysCacheCopy2(SUBSCRIPTIONRELMAP,
@@ -358,8 +361,7 @@ RemoveSubscriptionRel(Oid subid, Oid relid)
 	HeapTuple	tup;
 	int			nkeys = 0;
 
-	/* Prevent concurrent changes (see SetSubscriptionRelState()). */
-	rel = heap_open(SubscriptionRelRelationId, ShareRowExclusiveLock);
+	rel = heap_open(SubscriptionRelRelationId, RowExclusiveLock);
 
 	if (OidIsValid(subid))
 	{
@@ -387,7 +389,7 @@ RemoveSubscriptionRel(Oid subid, Oid relid)
 	}
 	heap_endscan(scan);
 
-	heap_close(rel, ShareRowExclusiveLock);
+	heap_close(rel, RowExclusiveLock);
 }
 
 
-- 
2.7.4

#16Noah Misch
noah@leadboat.com
In reply to: Petr Jelinek (#15)
Re: Somebody has not thought through subscription locking considerations

On Sat, Apr 01, 2017 at 02:25:54AM +0200, Petr Jelinek wrote:

On 01/04/17 01:57, Petr Jelinek wrote:

On 01/04/17 01:20, Tom Lane wrote:

Petr Jelinek <petr.jelinek@2ndquadrant.com> writes:

But the pg_subscription_rel is also not accessed on heap_open, the
problematic code is called from heap_drop_with_catalog. And VACUUM FULL
pg_class calls heap_drop_with_catalog() when doing the heap swap (and it
goes through performDeletion so through dependency info which is what I
mean by everything else does this).

Hmmm. What the heap_drop_with_catalog call is being applied to is a
short-lived table that is not pg_class. It happens to own the disk
storage that used to belong to pg_class, but it is not pg_class;
in particular it doesn't have the same OID, and it is not what would
be looked in if someone happened to need to fetch a pg_class row
at that point. So there's no circularity involved.

On further reflection it seems like you were right, the problem is
taking a self-exclusive lock on pg_subscription_rel during low-level
catalog operations. We're going to have to find a way to reduce that
lock strength, or we're going to have a lot of deadlock problems.

Well we have heavy lock because we were worried about failure scenarios
in our dumb upsert in SetSubscriptionRelState which does cache lookup
and if it finds something it updates, otherwise inserts. We have similar
pattern in other places but they are called from DDL statements usually
so the worst that can happen is DDL fails with weird errors when done
concurrently with same name so using RowExclusiveLock is okay.

That being said, looking at use-cases for SetSubscriptionRelState that's
basically CREATE SUBSCRIPTION, ALTER SUBSCRIPTION REFRESH and tablesync
worker. So the DDL thing applies to first ones as well and tablesync
should not be running in case the record does not exist so it's fine if
it fails. In terms of RemoveSubscriptionRel that's only called from
heap_drop_with_catalog and tablesync holds relation lock so there is no
way heap_drop_with_catalog will happen on the same relation. This leads
me to thinking that RowExclusiveLock is fine for both
SetSubscriptionRelState and RemoveSubscriptionRel as long as we document
that callers should be aware that SetSubscriptionRelState has
concurrency issues and fail on unique index check.

And a simple patch to do so. Peter do you see any problem with doing this?

[Action required within three days. This is a generic notification.]

The above-described topic is currently a PostgreSQL 10 open item. Peter,
since you committed the patch believed to have created it, you own this open
item. If some other commit is more relevant or if this does not belong as a
v10 open item, please let us know. Otherwise, please observe the policy on
open item ownership[1]/messages/by-id/20170404140717.GA2675809@tornado.leadboat.com and send a status update within three calendar days of
this message. Include a date for your subsequent status update. Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping v10. Consequently, I will appreciate your efforts
toward speedy resolution. Thanks.

[1]: /messages/by-id/20170404140717.GA2675809@tornado.leadboat.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#17Peter Eisentraut
peter.eisentraut@2ndquadrant.com
In reply to: Petr Jelinek (#15)
Re: Somebody has not thought through subscription locking considerations

On 3/31/17 20:25, Petr Jelinek wrote:

On 01/04/17 01:57, Petr Jelinek wrote:

That being said, looking at use-cases for SetSubscriptionRelState that's
basically CREATE SUBSCRIPTION, ALTER SUBSCRIPTION REFRESH and tablesync
worker. So the DDL thing applies to first ones as well and tablesync
should not be running in case the record does not exist so it's fine if
it fails. In terms of RemoveSubscriptionRel that's only called from
heap_drop_with_catalog and tablesync holds relation lock so there is no
way heap_drop_with_catalog will happen on the same relation. This leads
me to thinking that RowExclusiveLock is fine for both
SetSubscriptionRelState and RemoveSubscriptionRel as long as we document
that callers should be aware that SetSubscriptionRelState has
concurrency issues and fail on unique index check.

And a simple patch to do so. Peter do you see any problem with doing this?

committed

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers