BUG #14808: V10-beta4, backend abort
The following bug has been logged on the website:
Bug reference: 14808
Logged by: Philippe BEAUDOIN
Email address: phb07@apra.asso.fr
PostgreSQL version: 10beta4
Operating system: Linux
Description:
Hi all,
While continuing to play with transition tables in statement-level triggers,
I have encountered what looks like a backend abort.
I have been able to reproduce the case with the following simple script:
#!/bin/sh
export PGHOST=localhost
export PGPORT=5410
dropdb test
createdb test
psql test <<*EOF*
\set ON_ERROR_STOP on
CREATE OR REPLACE FUNCTION create_tbl(grpdef_schema TEXT, grpdef_tblseq
TEXT)
RETURNS VOID LANGUAGE plpgsql SECURITY DEFINER AS
\$_create_tbl\$
DECLARE
v_fullTableName TEXT;
v_logTableName TEXT;
v_logFnctName TEXT;
v_colList1 TEXT;
v_colList2 TEXT;
v_colList3 TEXT;
v_colList4 TEXT;
BEGIN
-- build the different name for table, trigger, functions,...
v_fullTableName = grpdef_schema || '.' || grpdef_tblseq;
v_logTableName = grpdef_tblseq || '_log';
v_logFnctName = grpdef_tblseq || '_log_idx';
-- build the tables's columns lists
SELECT string_agg('tbl.' || col_name, ','),
string_agg('o.' || col_name || ' AS ' || col_name_o || ', n.' ||
col_name || ' AS ' || col_name_n, ','),
string_agg('r.' || col_name_o, ','),
string_agg('r.' || col_name_n, ',')
INTO v_colList1, v_colList2, v_colList3, v_colList4 FROM (
SELECT quote_ident(attname) AS col_name, quote_ident('o_' || attname)
AS col_name_o, quote_ident('n_' || attname) AS col_name_n
FROM pg_catalog.pg_attribute
WHERE attrelid = v_fullTableName::regclass
AND attnum > 0 AND NOT attisdropped
ORDER BY attnum) AS t;
-- create the log table: it looks like the application table, with some additional technical columns
EXECUTE 'DROP TABLE IF EXISTS ' || v_logTableName;
EXECUTE 'CREATE TABLE ' || v_logTableName
|| ' (LIKE ' || v_fullTableName || ') ';
EXECUTE 'ALTER TABLE ' || v_logTableName
|| ' ADD COLUMN verb VARCHAR(3),'
|| ' ADD COLUMN tuple VARCHAR(3)';
-- create the log function
EXECUTE 'CREATE OR REPLACE FUNCTION ' || v_logFnctName || '() RETURNS
TRIGGER AS \$logfnct\$'
|| 'DECLARE'
|| ' r RECORD;'
|| 'BEGIN'
|| ' IF (TG_OP = ''DELETE'') THEN'
|| ' INSERT INTO ' || v_logTableName || ' SELECT ' || v_colList1
|| ', ''DEL'', ''OLD'' FROM old_table tbl;'
|| ' ELSIF (TG_OP = ''INSERT'') THEN'
|| ' INSERT INTO ' || v_logTableName || ' SELECT ' || v_colList1
|| ', ''INS'', ''NEW'' FROM new_table tbl;'
|| ' ELSIF (TG_OP = ''UPDATE'') THEN'
|| ' FOR r IN'
|| ' WITH'
|| ' o AS (SELECT ' || v_colList1 || ', row_number() OVER
() AS ord FROM old_table tbl'
|| ' ),'
|| ' n AS (SELECT ' || v_colList1 || ', row_number() OVER
() AS ord FROM new_table tbl'
|| ' )'
|| ' SELECT ' || v_colList2
|| ' FROM o JOIN n USING(ord)'
|| ' LOOP'
|| ' INSERT INTO ' || v_logTableName
|| ' SELECT ' || v_colList3 || ', ''UPD'', ''OLD'';'
|| ' INSERT INTO ' || v_logTableName
|| ' SELECT ' || v_colList4 || ', ''UPD'', ''NEW'';'
|| ' END LOOP;'
|| ' END IF;'
|| ' RETURN NULL;'
|| 'END;'
|| '\$logfnct\$ LANGUAGE plpgsql SECURITY DEFINER;';
-- creation of the log trigger on the application table, using the previously created log function
EXECUTE 'CREATE TRIGGER insert_log_trg'
|| ' AFTER INSERT ON ' || v_fullTableName || ' REFERENCING NEW
TABLE AS new_table'
|| ' FOR EACH STATEMENT EXECUTE PROCEDURE ' || v_logFnctName ||
'()';
EXECUTE 'CREATE TRIGGER update_log_trg'
|| ' AFTER UPDATE ON ' || v_fullTableName || ' REFERENCING OLD
TABLE AS old_table NEW TABLE AS new_table'
|| ' FOR EACH STATEMENT EXECUTE PROCEDURE ' || v_logFnctName ||
'()';
EXECUTE 'CREATE TRIGGER delete_log_trg'
|| ' AFTER DELETE ON ' || v_fullTableName || ' REFERENCING OLD
TABLE AS old_table'
|| ' FOR EACH STATEMENT EXECUTE PROCEDURE ' || v_logFnctName ||
'()';
RETURN;
END;
\$_create_tbl\$;
CREATE TABLE myTbl1 (
col11 INT NOT NULL,
col12 TEXT ,
col13 TEXT ,
PRIMARY KEY (col11)
);
CREATE TABLE myTbl3 (
col41 INT NOT NULL,
col44 INT ,
PRIMARY KEY (col41),
FOREIGN KEY (col44) REFERENCES myTbl1 (col11) ON DELETE CASCADE ON UPDATE
SET NULL
);
select create_tbl('public','mytbl1');
select create_tbl('public','mytbl3');
insert into myTbl1 select i, 'ABC', 'abc' from generate_series (1,10100) as
i;
update myTbl1 set col13=E'\\034'::bytea where col11 <= 500;
delete from myTbl1 where col11 > 10000;
*EOF*
As a result, the last DELETE statement fails. I get:
CREATE FUNCTION
CREATE TABLE
CREATE TABLE
NOTICE: table "mytbl1_log" does not exist, skipping
create_tbl
------------
(1 row)
NOTICE: table "mytbl3_log" does not exist, skipping
create_tbl
------------
(1 row)
INSERT 0 1101
UPDATE 0
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
connection to server was lost
The postgresql.conf file has default parameters, except:
listen_addresses = '*'
port = 5410
max_prepared_transactions 5
logging_collector = on
track_functions = all
track_commit_timestamp = on
Best regards.
Philippe Beaudoin.
--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
On Sat, Sep 9, 2017 at 3:48 PM, <phb07@apra.asso.fr> wrote:
INSERT 0 1101
UPDATE 0
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
connection to server was lost
Crash is confirmed here so I am adding an open item. I have not dug
into the issue seriously, but at short glance we have a code path
trying to access something that has already been free'd:
#0 0x0000561bfe0f0959 in tuplestore_tuple_count
(state=0x7f7f7f7f7f7f7f7f) at tuplestore.c:548
548 return state->tuples;
(gdb) bt
#0 0x0000561bfe0f0959 in tuplestore_tuple_count
(state=0x7f7f7f7f7f7f7f7f) at tuplestore.c:548
#1 0x0000561bfdd8ca22 in SPI_register_trigger_data
(tdata=0x7ffc92083860) at spi.c:2764
#2 0x00007f7075da7156 in plpgsql_exec_trigger (func=0x561bfe7bc5a8,
trigdata=0x7ffc92083860) at pl_exec.c:692
#3 0x00007f7075da08e7 in plpgsql_call_handler (fcinfo=0x7ffc920833d0)
at pl_handler.c:24
(gdb) p state
$1 = (Tuplestorestate *) 0x7f7f7f7f7f7f7f7f
--
Michael
Michael Paquier <michael.paquier@gmail.com> writes:
Crash is confirmed here so I am adding an open item. I have not dug
into the issue seriously, but at short glance we have a code path
trying to access something that has already been free'd:
I think this can be blamed on commit c46c0e52.
What is happening is that the AFTER triggers are queuing more triggers,
which have TransitionCaptureStates associated, but
ExecEndModifyTable thinks it can DestroyTransitionCaptureState
unconditionally. When the subsidiary triggers eventually get executed,
their ats_transition_capture pointers are pointing at garbage.
My first instinct is to get rid of DestroyTransitionCaptureState
altogether, on the grounds that the TransitionCaptureState will
go away at transaction cleanup and we can't really get rid of it
any sooner than that.
The other question that seems pretty relevant is why the subsidiary
triggers, which are the constraint triggers associated with the
tables' foreign keys, are getting queued with TransitionCaptureState
pointers in the first place. This seems horridly expensive and
unnecessary. It also directly contradicts the claim in
MakeTransitionCaptureState that constraint triggers cannot have
transition tables.
regards, tom lane
I wrote:
What is happening is that the AFTER triggers are queuing more triggers,
which have TransitionCaptureStates associated, but
ExecEndModifyTable thinks it can DestroyTransitionCaptureState
unconditionally. When the subsidiary triggers eventually get executed,
their ats_transition_capture pointers are pointing at garbage.
On closer inspection, the issue is specific to TCS-using AFTER triggers
that are being fired as a result of foreign key enforcement triggers.
In the example, each row deleted from myTbl1 causes firing of the
ON DELETE CASCADE enforcement trigger, which will execute a DELETE
against myTbl3. That won't delete anything (since myTbl3 is empty)
but we nonetheless queue a firing of the TCS-using AFTER STATEMENT
trigger for myTbl3.
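The failure path can be distilled to something much smaller than the script in the report; the following is an untested sketch with invented names, built from the description above:

```sql
-- Hedged sketch of the failure path (table/function names invented).
CREATE TABLE parent (id int PRIMARY KEY);
CREATE TABLE child  (id int PRIMARY KEY,
                     parent_id int REFERENCES parent ON DELETE CASCADE);

CREATE FUNCTION note_deletes() RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
  RAISE NOTICE 'child rows deleted: %', (SELECT count(*) FROM old_rows);
  RETURN NULL;
END $$;

-- A TCS-using AFTER STATEMENT trigger on the referencing table.
CREATE TRIGGER child_del AFTER DELETE ON child
  REFERENCING OLD TABLE AS old_rows
  FOR EACH STATEMENT EXECUTE PROCEDURE note_deletes();

INSERT INTO parent VALUES (1);
INSERT INTO child  VALUES (1, 1);

-- Each parent row deleted fires the ON DELETE CASCADE enforcement trigger,
-- which runs a DELETE on child with EXEC_FLAG_SKIP_TRIGGERS set; the
-- AFTER STATEMENT trigger it queues later dereferences the freed TCS
-- on unpatched 10beta4.
DELETE FROM parent;
```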
Now, the ExecEndModifyTable instance for the DELETE supposes that
all TCS-using triggers must have been fired during ExecutorFinish;
but *that doesn't happen if EXEC_FLAG_SKIP_TRIGGERS is set*, which
it is because ri_PerformCheck tells SPI not to fire triggers.
I do not recall the exact details of why that is, but I believe
the intention is to fire RI-triggered triggers at the end of the
outer statement rather than exposing the individual RI action as
a statement.
I made a quick hack to not delete the TCS if EXEC_FLAG_SKIP_TRIGGERS
is set, as attached. The example succeeds given this. However,
note that the AFTER STATEMENT trigger will be fired once per myTbl1 row
deletion, each time seeing a transition table consisting of only the
myTbl3 rows that went away as a result of that one single row deletion.
I'm not sure that this is per the letter or spirit of the SQL spec.
If it isn't, I don't think there is much we can do about it for v10.
In v11 or later, we could think about somehow merging the transition
tables for all the RI actions triggered by one statement. This might
well need to go along with rewriting the RI framework to use statement
triggers and TCS state itself. I think people had already muttered
about doing that.
The other question that seems pretty relevant is why the subsidiary
triggers, which are the constraint triggers associated with the
tables' foreign keys, are getting queued with TransitionCaptureState
pointers in the first place. This seems horridly expensive and
unnecessary. It also directly contradicts the claim in
MakeTransitionCaptureState that constraint triggers cannot have
transition tables.
The other part of the attached patch tweaks AfterTriggerSaveEvent
to not store an ats_transition_capture pointer for a deferrable
trigger event. This doesn't have anything directly to do with
the current bug, because although the RI triggers are being stored
with such pointers, they aren't actually dereferencing them. However,
it seems like a good idea anyway, to (1) ensure that we don't have
dangling pointers in the trigger queue, and (2) possibly allow
more merging of shared trigger states.
regards, tom lane
Attachments:
Attachment: trigger-fix-quick-hack.patch (text/x-diff)
I wrote:
Now, the ExecEndModifyTable instance for the DELETE supposes that
all TCS-using triggers must have been fired during ExecutorFinish;
but *that doesn't happen if EXEC_FLAG_SKIP_TRIGGERS is set*, which
it is because ri_PerformCheck tells SPI not to fire triggers.
In view of the fact that 10rc1 wraps tomorrow, there doesn't seem
to be time to do anything better about this than my quick hack.
So I went ahead and pushed that, with a regression test case.
regards, tom lane
On Mon, Sep 11, 2017 at 7:02 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I wrote:
Now, the ExecEndModifyTable instance for the DELETE supposes that
all TCS-using triggers must have been fired during ExecutorFinish;
but *that doesn't happen if EXEC_FLAG_SKIP_TRIGGERS is set*, which
it is because ri_PerformCheck tells SPI not to fire triggers.
In view of the fact that 10rc1 wraps tomorrow, there doesn't seem
to be time to do anything better about this than my quick hack.
So I went ahead and pushed that, with a regression test case.
Thank you for the testing and report Philippe, and for the analysis and fix Tom.
@@ -2318,8 +2318,14 @@ ExecEndModifyTable(ModifyTableState *node)
{
int i;
- /* Free transition tables */
- if (node->mt_transition_capture != NULL)
+ /*
+ * Free transition tables, unless this query is being run in
+ * EXEC_FLAG_SKIP_TRIGGERS mode, which means that it may have queued AFTER
+ * triggers that won't be run till later. In that case we'll just leak
+ * the transition tables till end of (sub)transaction.
+ */
+ if (node->mt_transition_capture != NULL &&
+ !(node->ps.state->es_top_eflags & EXEC_FLAG_SKIP_TRIGGERS))
DestroyTransitionCaptureState(node->mt_transition_capture);
As an idea for a v11 patch, I wonder if it would be better to count
references instead of leaking the TCS as soon as fk-on-cascade
triggers enter the picture. The ModifyTable node would release its
reference in ExecEndModifyTable(), and the queued events' references
would be counted too. I briefly considered a scheme like that before
proposing 501ed02c, but concluded, as it turns out incorrectly, that
there was no way for a trigger referencing node->mt_transition_capture
to fire after that point.
--
Thomas Munro
http://www.enterprisedb.com
Thomas Munro <thomas.munro@enterprisedb.com> writes:
As an idea for a v11 patch, I wonder if it would be better to count
references instead of leaking the TCS as soon as fk-on-cascade
triggers enter the picture.
I thought of reference counting as well, but I think it's not really
necessary if we organize things correctly. The key problem here IMO
is that the transition tables need to be attached to a trigger execution
context, rather than supposing that they can be completely managed by
ModifyTable plan nodes.
I dug around in the SQL spec and convinced myself that indeed it does
require all RI updates triggered by a single statement to be presented
in a single transition table. Concretely, it says
A trigger execution context consists of a set of state changes.
Within a trigger execution context, each state change is uniquely
identified by a trigger event, a subject table, and a column list.
The trigger event can be DELETE, INSERT, or UPDATE. A state change
SC contains a set of transitions, a set of statement-level triggers
considered as executed for SC, and a set of row-level triggers,
each paired with the set of rows in SC for which it is considered
as executed.
Note the "uniquely identified" bit --- you aren't allowed to have multiple
SCs for the same table and same kind of event, at least up to the bit
about column lists. I've not fully wrapped my head around the column list
part of it, but certainly all effects of a particular RI constraint will
have the same column list.
Now, the lifespan of a trigger execution context is one SQL-statement,
but that's one user-executed SQL-statement --- the queries that we gin
up for RI enforcement are an implementation detail that don't get their
own context. (The part of the spec that defines RI actions seems pretty
clear that the actions insert state changes into the active statement's
trigger execution context, not create their own context.)
It's also interesting in this context to re-read commit 9cb840976,
which is what created the "skip trigger" business. That exhibits a
practical problem that you hit if you don't do it like this.
So ISTM that basically we need trigger.c to own the transition tables.
ModifyTable, instead of just creating a TCS, needs to ask trigger.c for a
TCS relevant to its target table and command type (and column list if we
decide we need to bother with that). trigger.c would discard TCSes during
AfterTriggerEndQuery, where it closes out a given query_depth level.
This actually seems like it might not be that big a change, other than
the issue of what do the column lists mean exactly. Maybe we should try
to get it done for v10, rather than shipping known-non-spec-compliant
behavior.
regards, tom lane
Tom Lane wrote:
This actually seems like it might not be that big a change, other than
the issue of what do the column lists mean exactly. Maybe we should try
to get it done for v10, rather than shipping known-non-spec-compliant
behavior.
I think this means we need to abort RC1 today and get another beta out.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
I wrote:
Note the "uniquely identified" bit --- you aren't allowed to have multiple
SCs for the same table and same kind of event, at least up to the bit
about column lists. I've not fully wrapped my head around the column list
part of it, but certainly all effects of a particular RI constraint will
have the same column list.
After further study of the standard, it seems that the column list
business is meant to allow identification of which UPDATE triggers
with column lists ("AFTER UPDATE OF columnList" syntax) are supposed
to be triggered. The spec is confusing because they describe this in
a way that would be impossibly inefficient if implemented verbatim.
They say that, given a row UPDATE affecting a certain set of columns,
you're supposed to create state changes labeled with every possibly
relevant trigger column set:
— Let OC be the set of column names identifying the columns [being updated]
— Let PSC be the set consisting of the empty set and every subset of the
set of column names of [the target table] that has at least one column
that is in OC
- [ create a state change for each element of PSC ]
Then an update trigger is triggered by a particular state change if its
column list exactly matches that state change's list. This seems like a
remarkably stupid way to go about it; you'd end up with many state changes
that correspond to no existing trigger and are never of any use.
However, if I'm reading it right, there is a property of this that is
very significant in the context of RI updates. For an ordinary UPDATE
SQL command, all the row updates have the same set of target columns,
but that's not necessarily true for all the RI updates that a single SQL
command could trigger. If there's more than one RI constraint between
two tables then an update on the referenced table could trigger sets of
updates that affect overlapping, but not identical, sets of rows in the
referencing tables --- and those sets would have different sets of target
columns. So a given column-specific trigger might be interested in some
or all of the RI updates. And if it is interested, and is a statement
trigger, it is supposed to be fired just once with a transition table
containing all the rows it is interested in.
In other words, UPDATE triggers with different column lists potentially
need to see different transition tables, and any given row that was
updated might need to appear in some subset of those tables.
This seems like rather a mess to implement. I wonder whether I'm
reading it right, and whether other DBMSes actually implement it
like that.
I think that what might be a good short-term solution is to refuse
creation of column-specific triggers with transition tables (ie,
if you ask for a transition table then you can only say AFTER UPDATE
not AFTER UPDATE OF columnList). Then, all TT-using triggers are
interested in all modified rows and we don't have to distinguish
different column lists for the purposes of transition tables.
Then the problem reduces to one TCS per target table and event type,
which doesn't seem too hard to do.
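To make the proposed restriction concrete (a sketch only; f() is a placeholder trigger function, and the rule itself is just a proposal at this point):

```sql
-- Allowed: transition table without a column list.
CREATE TRIGGER t1 AFTER UPDATE ON tbl
  REFERENCING NEW TABLE AS new_rows
  FOR EACH STATEMENT EXECUTE PROCEDURE f();

-- Allowed: column list without a transition table.
CREATE TRIGGER t2 AFTER UPDATE OF col1 ON tbl
  FOR EACH STATEMENT EXECUTE PROCEDURE f();

-- Rejected under the proposal: column list combined with a transition table.
CREATE TRIGGER t3 AFTER UPDATE OF col1 ON tbl
  REFERENCING NEW TABLE AS new_rows
  FOR EACH STATEMENT EXECUTE PROCEDURE f();
```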
regards, tom lane
On Sat, Sep 9, 2017 at 1:08 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
My first instinct is to get rid of DestroyTransitionCaptureState
altogether, on the grounds that the TransitionCaptureState will
go away at transaction cleanup and we can't really get rid of it
any sooner than that.
End of transaction, or end of query? I'm not sure what happens when
triggers are deferred, but I think there are a lot of cases when we
want to throw away the tuplestore immediately, not hold on to it for
the rest of the transaction.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Sat, Sep 9, 2017 at 1:08 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
My first instinct is to get rid of DestroyTransitionCaptureState
altogether, on the grounds that the TransitionCaptureState will
go away at transaction cleanup and we can't really get rid of it
any sooner than that.
End of transaction, or end of query? I'm not sure what happens when
triggers are deferred, but I think there are a lot of cases when we
want to throw away the tuplestore immediately, not hold on to it for
the rest of the transaction.
As things stand, it's end of subtransaction, because the TCSes
are allocated in CurTransactionContext. See the argument in
MakeTransitionCaptureState.
And yes, this is inefficient. The quick-hack patch I committed yesterday
only pays the price if you have RI triggers cascading changes into a table
that also has triggers-with-transition-tables, but I can certainly believe
that it could get expensive in such a case.
The fix proposal discussed downthread should fix the inefficiency as well
as the spec compliance problem. But personally I'm far more worried about
the latter.
regards, tom lane
On Tue, Sep 12, 2017 at 5:03 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I wrote:
Note the "uniquely identified" bit --- you aren't allowed to have multiple
SCs for the same table and same kind of event, at least up to the bit
about column lists. I've not fully wrapped my head around the column list
part of it, but certainly all effects of a particular RI constraint will
have the same column list.
Aside from the RI case, the other user visible change in behaviour
will be for statements that update the same table via multiple
ModifyTable nodes (wCTEs). Our regression test has:
with wcte as (insert into table1 values (42))
insert into table2 values ('hello world');
... which demonstrates the fix for the original complaint that table1
and table2 earlier tried to use the same transition table (and
crashed). A new variant inserting into table1 twice would show the
difference. Today we get:
postgres=# with wcte as (insert into table1 values (42))
insert into table1 values (43);
NOTICE: trigger = table1_trig, new table = (43,)
NOTICE: trigger = table1_trig, new table = (42,)
INSERT 0 1
Presumably with your change there will be a single transition table
for inserts into table1, holding both (42,) and (43,). But will we
fire the trigger once or twice? There is something fishy about making
it fire twice but show the same tuples to both invocations (for
example, it might break Kevin's proposed counting algorithm which this
feature is intended to support), but firing only once requires some
new inter-node co-ordination.
In other words, UPDATE triggers with different column lists potentially
need to see different transition tables, and any given row that was
updated might need to appear in some subset of those tables.
This seems like rather a mess to implement. I wonder whether I'm
reading it right, and whether other DBMSes actually implement it
like that.
I guess the alternative is storing extra per-tuple metadata,
transferring more work to the reader.
The DB2 documentation has this example (https://www.ibm.com/support/knowledgecenter/en/SSEPEK_10.0.0/sqlref/src/tpc/db2z_sql_createtrigger.html):
CREATE TRIGGER REORDER
AFTER UPDATE OF ON_HAND, MAX_STOCKED ON PARTS
REFERENCING NEW_TABLE AS NTABLE
FOR EACH STATEMENT MODE DB2SQL
BEGIN ATOMIC
SELECT ISSUE_SHIP_REQUEST(MAX_STOCKED - ON_HAND, PARTNO)
FROM NTABLE
WHERE (ON_HAND < 0.10 * MAX_STOCKED);
END
I can't find any explicit discussion of whether this trigger could
ever see a transition row that results from an update that didn't name
ON_HAND or MAX_STOCKED. I don't have DB2 access and I'm not sure how
I'd test that... maybe with a self-referencing fk declared ON UPDATE
CASCADE?
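For what it's worth, the probe could be set up like this (untested sketch in PostgreSQL syntax; the DB2 equivalent and its observed behaviour are exactly the unknowns here):

```sql
-- Self-referencing FK whose cascaded UPDATE touches a column (parent)
-- outside the trigger's column list.
CREATE TABLE parts (
  partno      int PRIMARY KEY,
  parent      int REFERENCES parts (partno) ON UPDATE CASCADE,
  on_hand     numeric,
  max_stocked numeric
);
-- Given an AFTER UPDATE OF on_hand, max_stocked statement trigger with a
-- NEW table on parts, the question is whether an UPDATE of partno (whose
-- cascade rewrites parent, a column not in the list) ever contributes
-- rows to that transition table.
```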
Thanks to prodding from Peter Geoghegan we tackled the question of
whether the <trigger action time> clause controls just when the
trigger fires or also which transition tuples it sees. By looking at
some wording relating to MERGE we concluded it must be both,
culminating in commit 8c55244a which separates the UPDATEs and INSERTs
resulting from INSERT ... ON CONFLICT DO UPDATE. That had the
annoying result that we had to ban the use of (non-standard) "OR" in
<trigger action time> when transition tables are in play. This FOR
UPDATE OF ... transition business seems sort of similar, but would
create arbitrarily many transition tables and require the executor to
write into all of them, or perhaps store extra meta data along with
captured rows for later filtering during scan.
I think that what might be a good short-term solution is to refuse
creation of column-specific triggers with transition tables (ie,
if you ask for a transition table then you can only say AFTER UPDATE
not AFTER UPDATE OF columnList). Then, all TT-using triggers are
interested in all modified rows and we don't have to distinguish
different column lists for the purposes of transition tables.
Then the problem reduces to one TCS per target table and event type,
which doesn't seem too hard to do.
+1
Presumably would-be authors of triggers-with-transition-tables that
would fire only AFTER UPDATE OF foo already have to deal with the
possibility that you updated foo to the same value. So I don't think
too much is lost, except perhaps some efficiency.
Thinking a bit harder about whether you might have semantic (rather
than performance) reasons to use AFTER UPDATE OF ... with subset TTs,
it occurs to me that there may be implications for transition tables
with inheritance. We decided to disallow transition tables on
non-partition inheritance children anyway (see 501ed02c), but DB2
supports the equivalent. It has a system of inheritance ("typed
tables", possibly conforming to SQL:1999 though I've never looked into
that) but has different rules about when row triggers and statement
triggers fire when you run DML statements on a table hierarchy. Don't
quote me but it's something like our rules plus (1) row triggers of
all supertables of an affected table also fire (unless created with
CREATE TRIGGER ... ONLY), and (2) statement triggers of affected
subtables also fire. Implementing that for our transition tables
would probably require more tuplestores and/or dynamic tuple
conversion and filtering during later scan. Perhaps AFTER UPDATE OF
column_that_only_this_child_and_its_children_have would fire for
direct and subtable updates but not via-the-supertable updates. That
is currently completely irrelevant due to our set of supported
features: different firing rules, and prohibition on children with
transition tables. Some related topics might return soon when people
get more experience with partitions and start wanting to declare row
triggers on partitioned tables (perhaps including foreign key checks)
or implement Kevin's clever batch-mode foreign key check concept.
--
Thomas Munro
http://www.enterprisedb.com
Thomas Munro <thomas.munro@enterprisedb.com> writes:
Aside from the RI case, the other user visible change in behaviour
will be for statements that update the same table via multiple
ModifyTable nodes (wCTEs). Our regression test has:
with wcte as (insert into table1 values (42))
insert into table2 values ('hello world');
... which demonstrates the fix for the original complaint that table1
and table2 earlier tried to use the same transition table (and
crashed). A new variant inserting into table1 twice would show the
difference. Today we get:
postgres=# with wcte as (insert into table1 values (42))
insert into table1 values (43);
NOTICE: trigger = table1_trig, new table = (43,)
NOTICE: trigger = table1_trig, new table = (42,)
INSERT 0 1
Presumably with your change there will be a single transition table
for inserts into table, holding both (42,) and (43,). But will we
fire the trigger once or twice?
Not necessarily. That would be true only if we allow the WCTE to share
trigger context with the outer query, which I think it does not today.
I've not checked the code, but presumably if we fire the trigger twice
right now, that means there are separate trigger contexts, ie somebody
is calling AfterTriggerBeginQuery/AfterTriggerEndQuery around the WCTE.
If not, maybe we could make it do so. OTOH, looking at the text of
the spec, I think it's darn hard to justify the behavior shown above.
The reason that the RI case would share trigger context with the outer
query is that we'd *not* call AfterTriggerBeginQuery/AfterTriggerEndQuery
around the RI query, which would be driven by the same skip_trigger
logic that exists today.
This seems like rather a mess to implement. I wonder whether I'm
reading it right, and whether other DBMSes actually implement it
like that.
I guess the alternative is storing extra per-tuple metadata,
transferring more work to the reader.
I really don't want any more per-tuple state. Adding the TCS link
was costly enough in terms of how big the tuple queue storage is.
regards, tom lane
On Tue, Sep 12, 2017 at 12:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
OTOH, looking at the text of
the spec, I think it's darn hard to justify the behavior shown above.
Yeah. I assume we always fired statement triggers for each separate
instance of the same table mentioned in a wCTE since they were
invented. I just confirmed that that is the case in 9.6. That may
not have been in the spirit of the spec, but it's hard to say because
the spec doesn't have wCTEs IIUC and it mattered less because they
didn't receive any data.
Now that they can optionally see data resulting from modifications, it
seems pretty hard to use this feature to build anything that consumes
the transition data and has to be reliable (matview state,
replication-like systems etc) if we make any choice other than (1)
each instance of a given table fires a statement trigger separately
and sees only the rows it touched, or (2) the statement trigger is
fired once for all instances of a table, and sees all the transition
tuples. Based on the SQL spec excerpts you've highlighted, I suppose
it seems likely that if the spec had wCTEs it would expect them to
work like (2).
--
Thomas Munro
http://www.enterprisedb.com
Thomas Munro <thomas.munro@enterprisedb.com> writes:
Aside from the RI case, the other user visible change in behaviour
will be for statements that update the same table via multiple
ModifyTable nodes (wCTEs). Our regression test has:
with wcte as (insert into table1 values (42))
insert into table2 values ('hello world');
... which demonstrates the fix for the original complaint that table1
and table2 earlier tried to use the same transition table (and
crashed).
BTW, as I'm digging around in trigger.c, I can't help noticing that
it provides a single "fdw_tuplestore" per trigger query level (a/k/a
trigger execution context). I've not tried to test this, but it
sure looks like a wCTE like your example above, directed at two
separate foreign tables with triggers, would fail for exactly the
same reason. That'd be a bug of pretty long standing.
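The suspected failure mode could be sketched like this (assuming ft1 and
ft2 are foreign tables, e.g. via postgres_fdw, each with an AFTER INSERT
FOR EACH ROW trigger — names are hypothetical):

```sql
-- One query level, two foreign tables with row triggers: both sets of
-- captured tuples would pass through the single per-query-level
-- "fdw_tuplestore", mirroring the original transition-table crash.
WITH wcte AS (INSERT INTO ft1 VALUES (42))
INSERT INTO ft2 VALUES ('hello world');
```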
regards, tom lane
On Thu, Sep 14, 2017 at 10:44 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Thomas Munro <thomas.munro@enterprisedb.com> writes:
Aside from the RI case, the other user visible change in behaviour
will be for statements that update the same table via multiple
ModifyTable nodes (wCTEs). Our regression test has:
with wcte as (insert into table1 values (42))
insert into table2 values ('hello world');
... which demonstrates the fix for the original complaint that table1
and table2 earlier tried to use the same transition table (and
crashed).
BTW, as I'm digging around in trigger.c, I can't help noticing that
it provides a single "fdw_tuplestore" per trigger query level (a/k/a
trigger execution context). I've not tried to test this, but it
sure looks like a wCTE like your example above, directed at two
separate foreign tables with triggers, would fail for exactly the
same reason. That'd be a bug of pretty long standing.
I had the impression that that fdw_tuplestore was doing something a
bit sneaky that actually works out OK: tuples get enqueued and later
dequeued in exactly the same sequence as the after row trigger events
that need them, so even though it seems to violate at least the POLA
if not the spirit of tuplestores by storing tuples of potentially
different types in one tuplestore, nothing bad should happen. I
suppose it was by copying that coding that Kevin finished up with the
initial bug that wCTEs mix stuff from different wCTEs and it all blows
up, because it has no similar sequencing trick: triggers with
transition tables were seeing all of them, and they weren't even
guaranteed to be of the right type.
--
Thomas Munro
http://www.enterprisedb.com
On Thu, Sep 14, 2017 at 11:00 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
On Thu, Sep 14, 2017 at 10:44 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
BTW, as I'm digging around in trigger.c, I can't help noticing that
it provides a single "fdw_tuplestore" per trigger query level (a/k/a
trigger execution context). I've not tried to test this, but it
sure looks like a wCTE like your example above, directed at two
separate foreign tables with triggers, would fail for exactly the
same reason. That'd be a bug of pretty long standing.
I had the impression that that fdw_tuplestore was doing something a
bit sneaky that actually works out OK: tuples get enqueued and later
dequeued in exactly the same sequence as the after row trigger events
that need them, so even though it seems to violate at least the POLA
if not the spirit of tuplestores by storing tuples of potentially
different types in one tuplestore, nothing bad should happen. I
suppose it was by copying that coding that Kevin finished up with the
initial bug that wCTEs mix stuff from different wCTEs and it all blows
up, because it has no similar sequencing trick: triggers with
transition tables were seeing all of them, and they weren't even
guaranteed to be of the right type.
Incidentally, understanding that made me wonder why we don't have a
binary chunk-oriented in-memory-up-to-some-size-then-spill-to-disk
spooling mechanism that could be used for the trigger queue itself
(which currently doesn't know how to spill to disk and therefore can
take your server out), including holding these tuple images directly
(instead of spilling just the tuples in synchronised order with the
in-memory trigger queue).
--
Thomas Munro
http://www.enterprisedb.com
Thomas Munro <thomas.munro@enterprisedb.com> writes:
On Thu, Sep 14, 2017 at 10:44 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
BTW, as I'm digging around in trigger.c, I can't help noticing that
it provides a single "fdw_tuplestore" per trigger query level (a/k/a
trigger execution context). I've not tried to test this, but it
sure looks like a wCTE like your example above, directed at two
separate foreign tables with triggers, would fail for exactly the
same reason. That'd be a bug of pretty long standing.
I had the impression that that fdw_tuplestore was doing something a
bit sneaky that actually works out OK: tuples get enqueued and later
dequeued in exactly the same sequence as the after row trigger events
that need them, so even though it seems to violate at least the POLA
if not the spirit of tuplestores by storing tuples of potentially
different types in one tuplestore, nothing bad should happen.
Oh? Now my fear level is up to 11, because it is completely trivial to
cause triggers to fire in a different order than they were enqueued.
All you need is a mix of deferrable and nondeferrable triggers.
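A hypothetical illustration of that reordering (trigger and table names
invented for the sketch; note this uses ordinary tables, since constraint
triggers are not allowed on foreign tables):

```sql
-- Dequeue order differs from enqueue order: the deferrable trigger's
-- event is enqueued first but does not fire until COMMIT, after the
-- nondeferrable trigger on the second statement has already run.
BEGIN;
SET CONSTRAINTS my_deferrable_trigger DEFERRED;
INSERT INTO t1 VALUES (1);  -- deferrable AFTER ROW trigger: event queued
INSERT INTO t2 VALUES (2);  -- nondeferrable AFTER ROW trigger: fires now
COMMIT;                     -- deferred trigger fires here, out of order
```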
In fact, it also seems entirely broken that a per-query-level tuplestore
is being used at all, because deferrable triggers might not get fired
until some outer query level.
[ Pokes around... ] Hm, looks like we get around that by forbidding
constraint triggers on foreign tables, but I don't see anything in the
CREATE TRIGGER man page saying that there's such a prohibition. And
there's certainly no comments in the source code explaining this rickety
set of requirements :-(
regards, tom lane
Thomas Munro <thomas.munro@enterprisedb.com> writes:
Incidentally, understanding that made me wonder why we don't have a
binary chunk-oriented in-memory-up-to-some-size-then-spill-to-disk
spooling mechanism that could be used for the trigger queue itself
(which currently doesn't know how to spill to disk and therefore can
take your server out), including holding these tuple images directly
(instead of spilling just the tuples in synchronised order with the
in-memory trigger queue).
The past discussions about spilling the trigger queue have generally
concluded that by the time your event list was long enough to cause
serious pain, you already had a query that was never gonna complete.
That may be getting less true as time goes on, but I'm not sure ---
seems like RAM capacity is growing faster than CPU speed. Anyway,
that's why it never got done.
Given the addition of transition tables, I suspect there will be
even less motivation to fix it: the right thing to do with mass
updates will be to use a TT with an after-statement trigger, and
that fixes it by putting the bulk data into a spillable tuplestore.
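The suggested pattern might look like this (table and function names are
hypothetical): the transition table carries the bulk data in a spillable
tuplestore, and a single statement-level firing stands in for a flood of
per-row trigger queue entries.

```sql
-- Audit a mass update via a transition table instead of row triggers.
CREATE FUNCTION audit_bulk() RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
  INSERT INTO audit_log SELECT now(), * FROM changed;
  RETURN NULL;
END;
$$;

CREATE TRIGGER big_update_audit AFTER UPDATE ON big_table
  REFERENCING NEW TABLE AS changed
  FOR EACH STATEMENT EXECUTE PROCEDURE audit_bulk();
```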
regards, tom lane
On Tue, Sep 12, 2017 at 1:22 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
On Tue, Sep 12, 2017 at 12:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
OTOH, looking at the text of
the spec, I think it's darn hard to justify the behavior shown above.
Yeah. I assume we have always fired statement triggers for each separate
instance of the same table mentioned in a wCTE, ever since they were
invented. I just confirmed that that is the case in 9.6. That may
not have been in the spirit of the spec, but it's hard to say because
the spec doesn't have wCTEs IIUC and it mattered less because they
didn't receive any data.
Now that they can optionally see data resulting from modifications, it
seems pretty hard to use this feature to build anything that consumes
the transition data and has to be reliable (matview state,
replication-like systems etc) if we make any choice other than (1)
each instance of a given table fires a statement trigger separately
and sees only the rows it touched, or (2) the statement trigger is
fired once for all instances of a table, and sees all the transition
tuples. Based on the SQL spec excerpts you've highlighted, I suppose
it seems likely that if the spec had wCTEs it would expect them to
work like (2).
So I guess there are about 3 parts to this puzzle:
1. Merging the transition tables when there are multiple wCTEs
referencing the same table. Here's one idea: Rename
MakeTransitionCaptureState() to GetTransitionCaptureState() and use a
hash table keyed by table OID in
afterTriggers.transition_capture_states[afterTriggers.query_depth] to
find the TCS for the given TriggerDesc or create it if not found, so
that all wCTEs find the same TransitionCaptureState object. Then all
existing callers continue to do what they're doing now, but they'll be
sharing TCSs appropriately with other things in the plan. Note that
TransitionCaptureState already holds tuplestores for each operation
(INSERT, UPDATE, DELETE) so the OID of the table alone is a suitable
key for the hash table (assuming we are ignoring the column-list part
of the spec as you suggested).
2. Hiding the fact that we implement fk CASCADE using another level
of queries. Perhaps we could arrange for
afterTriggers.transition_capture_states[afterTriggers.query_depth] to
point to the same hash table as query_depth - 1, so that the effects
of statements at this implementation-internal level appear to the user
as part of the level below?
3. Merging the invocation after statement firing so that if you
updated the same table directly and also via a wCTE and also
indirectly via fk ON DELETE/UPDATE trigger then you still only get one
invocation of the after statement trigger. Not sure exactly how...
perhaps using a flag in the TransitionCaptureState to prevent multiple
firings. As I argued in the above-quoted email, if we've merged the
transition tables then we'll need to merge the trigger firing too or
it won't be possible to make coherent integrity, summary, replication
etc systems using TT triggers (even though that's a user-visible
change in after statement firing behaviour for wCTEs compared to
earlier releases).
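An end-to-end scenario the three parts together aim to handle (schema and
names hypothetical): with the transition tables merged across wCTE
instances and the FK implementation level, an AFTER STATEMENT trigger on
"child" should fire exactly once for this statement.

```sql
-- "child" has an AFTER DELETE statement trigger with a transition table,
-- and a foreign key to "parent" declared ON DELETE CASCADE. The child
-- rows removed by the wCTE and by the cascade should appear in one
-- transition table, seen by one firing of the trigger.
WITH wcte AS (DELETE FROM child WHERE x = 1)
DELETE FROM parent WHERE id = 1;  -- cascade also deletes from child
```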
Does this make any sense?
--
Thomas Munro
http://www.enterprisedb.com