Coping with huge deferred-trigger lists

Started by Tom Lane over 24 years ago, 11 messages
#1 Tom Lane
tgl@sss.pgh.pa.us

I had a thought just now about how to deal with the TODO item about
coping with deferred trigger lists that are so long as to overrun
main memory. This might be a bit harebrained, but I offer it for
consideration:

What we need to do, at the end of a transaction in which deferred
triggers were fired, is to find each tuple that was inserted or
updated in the current transaction in each table that has such
triggers. Well, we know where those tuples are: to a first
approximation, they're all near the end of the table. Perhaps instead
of storing each and every trigger-related tuple in memory, we only need
to store one value per affected table: the lowest CTID of any tuple
that we need to revisit for deferred-trigger purposes. At the end of
the transaction, scan forward from that point to the end of the table,
looking for tuples that were inserted by the current xact. Process each
one using the table's list of deferred triggers.
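
In rough C terms, the per-table state and the end-of-transaction pass might
look something like the sketch below. The type and helper names
(DeferredTableEntry, scan_from_ctid, tuple_xmin_is_current_xact,
fire_deferred_triggers_for) are made up for illustration, not actual backend
APIs:

    #include <stddef.h>
    #include <stdint.h>

    /* Simplified stand-ins for the backend's own types. */
    typedef uint32_t Oid;
    typedef uint32_t BlockNumber;
    typedef uint16_t OffsetNumber;
    typedef struct HeapTupleData *HeapTuple;
    typedef struct TableScanData TableScan;

    /* One entry per table that has deferred triggers pending in this xact. */
    typedef struct
    {
        Oid          relid;         /* table with deferred triggers */
        BlockNumber  first_block;   /* lowest block we must revisit */
        OffsetNumber first_offset;  /* lowest offset within that block */
    } DeferredTableEntry;

    /* Hypothetical helpers standing in for the real heapam / trigger calls. */
    extern TableScan *scan_from_ctid(Oid relid, BlockNumber blk, OffsetNumber off);
    extern HeapTuple  scan_next(TableScan *scan);
    extern void       scan_end(TableScan *scan);
    extern int        tuple_xmin_is_current_xact(HeapTuple tuple);
    extern void       fire_deferred_triggers_for(Oid relid, HeapTuple tuple);

    /*
     * At end of transaction: scan forward from the remembered position to the
     * current end of the table, and run the table's deferred triggers for
     * every tuple that the current transaction inserted or updated.
     */
    static void
    run_deferred_triggers(DeferredTableEntry *entry)
    {
        TableScan *scan = scan_from_ctid(entry->relid,
                                         entry->first_block,
                                         entry->first_offset);
        HeapTuple  tuple;

        while ((tuple = scan_next(scan)) != NULL)
        {
            if (tuple_xmin_is_current_xact(tuple))
                fire_deferred_triggers_for(entry->relid, tuple);
        }
        scan_end(scan);
    }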

Instead of a list of all tuples subject to deferred triggers, we now
need only a list of all tables subject to deferred triggers, which
should pose no problems for memory consumption. It might be objected
that this means more disk activity --- but in an xact that hasn't
inserted very many tuples, most likely the disk blocks containing 'em
are still in memory and won't need a physical re-read. Once we get to
inserting so many tuples that that's not true, this approach should
require less disk activity overall than the previous idea of writing
(and re-reading) a separate disk file for the tuple list.

I am not sure exactly what the "triggered data change violation" test
does or is good for, but if we want to keep it, I *think* that in these
terms we'd just need to signal error if we come across a tuple that was
both inserted and deleted by the current xact. I'm a bit fuzzy on this
though.

An interesting property of this approach is that if the set of triggers
for the table changes during the xact (which could only happen if this
same xact created or deleted triggers; no other xact can, since changing
triggers requires an exclusive lock on the table), the set of triggers
applied to a tuple is the set that exists at the end of the xact, not
the set that existed when the tuple was modified. Offhand I think this
is a good change.

Comments?

regards, tom lane

#2 Zeugswetter Andreas SB
ZeugswetterA@wien.spardat.at
In reply to: Tom Lane (#1)
AW: Coping with huge deferred-trigger lists

Perhaps instead
of storing each and every trigger-related tuple in memory, we only need
to store one value per affected table: the lowest CTID of any tuple
that we need to revisit for deferred-trigger purposes. At the end of
the transaction, scan forward from that point to the end of the table,
looking for tuples that were inserted by the current xact.

I thought that the current placement of new rows at the end of the file is
subject to change soon (overwriting smgr)?

I thus think it would be better to remember all ctids per table.
The rest imho sounds great.

Andreas

#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Zeugswetter Andreas SB (#2)
Re: AW: Coping with huge deferred-trigger lists

Zeugswetter Andreas SB <ZeugswetterA@wien.spardat.at> writes:

Perhaps instead
of storing each and every trigger-related tuple in memory, we only need
to store one value per affected table: the lowest CTID of any tuple
that we need to revisit for deferred-trigger purposes. At the end of
the transaction, scan forward from that point to the end of the table,
looking for tuples that were inserted by the current xact.

I thought that the current placement of new rows at the end of the file is
subject to change soon (overwriting smgr)?

Well, the scheme would still *work* if rows were not always placed at
the end of file, though it might get inefficient. But you're right, the
merits of this trigger idea depend a lot on whether we decide to go to
an overwriting smgr, and so we should probably wait till that's decided
before we think about doing this. I just wanted to get the idea
recorded before I forgot about it.

BTW, I don't think the overwriting-smgr idea is a done deal. We haven't
seen any design yet for exactly how it should work. Moreover, I'm
really hesitant to throw away one of the fundamental design choices of
Postgres: the no-overwrite smgr is one of the things that got us to where we
are today. Before we commit to that, we ought to do some serious study
of the alternatives. ISTM the problem with VACUUM is not that you need
to do a regular maintenance procedure, it's that it grabs an exclusive
lock on the table for so long. We could live with VACUUM if it could be
made either incremental (do a few pages and release the lock) or capable
of running in parallel with reader & writer transactions. Vadim's
still-not-integrated LAZY VACUUM code is an indicator that progress
might be made in that direction. (Actually, I suppose if you look at it
in the right way, you might think that a backgroundable VACUUM *is* an
overwriting smgr, just an asynchronous implementation of it...)

I thus think it would be better to remember all ctids per table.

If we do that then we still have a problem with overrunning memory
after a sufficiently large number of tuples. However, that'd improve
the constant factor by at least an order of magnitude, so it might be
worth doing as an intermediate step. Still have to figure out whether
the triggered-data-change business is significant or not.
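
Concretely, that intermediate step would just fatten the per-table entry from
a single low-water mark into a growable array of tuple addresses, something
like the sketch below (hypothetical names, not the real trigger-queue code);
each remembered address is only a few bytes instead of a full per-tuple
deferred-event record:

    #include <stddef.h>
    #include <stdint.h>
    #include <stdlib.h>

    /* A bare tuple address: block number plus line-pointer offset. */
    typedef struct
    {
        uint32_t block;
        uint16_t offset;
    } TupleAddr;

    /* Per-table entry remembering every affected CTID.  Memory still grows
     * with the number of modified tuples, but each entry is only a few bytes
     * instead of a full per-tuple deferred-event record. */
    typedef struct
    {
        uint32_t   relid;        /* table with deferred triggers */
        size_t     ntuples;      /* addresses remembered so far */
        size_t     maxtuples;    /* allocated capacity */
        TupleAddr *tuples;       /* grown by doubling as needed */
    } DeferredTableCtids;

    static void
    remember_ctid(DeferredTableCtids *t, uint32_t block, uint16_t offset)
    {
        if (t->ntuples == t->maxtuples)
        {
            /* error handling omitted in this sketch */
            t->maxtuples = t->maxtuples ? t->maxtuples * 2 : 64;
            t->tuples = realloc(t->tuples, t->maxtuples * sizeof(TupleAddr));
        }
        t->tuples[t->ntuples].block = block;
        t->tuples[t->ntuples].offset = offset;
        t->ntuples++;
    }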

regards, tom lane

#4 Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#3)
Re: AW: Coping with huge deferred-trigger lists

BTW, I don't think the overwriting-smgr idea is a done deal. We haven't
seen any design yet for exactly how it should work. Moreover, I'm
really hesitant to throw away one of the fundamental design choices of
Postgres: the no-overwrite smgr is one of the things that got us to where we
are today. Before we commit to that, we ought to do some serious study
of the alternatives. ISTM the problem with VACUUM is not that you need
to do a regular maintenance procedure, it's that it grabs an exclusive
lock on the table for so long. We could live with VACUUM if it could be
made either incremental (do a few pages and release the lock) or capable
of running in parallel with reader & writer transactions. Vadim's
still-not-integrated LAZY VACUUM code is an indicator that progress
might be made in that direction. (Actually, I suppose if you look at it
in the right way, you might think that a backgroundable VACUUM *is* an
overwriting smgr, just an asynchronous implementation of it...)

I agree the overwriting storage manager is not a done deal, and I don't see
us eliminating it entirely. We have to keep the old tuples in scope, so
I assume we would just create new tuples, and reuse the expired tuples
once they were out of scope.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#5 Hannu Krosing
hannu@tm.ee
In reply to: Zeugswetter Andreas SB (#2)
Re: AW: Coping with huge deferred-trigger lists

Tom Lane wrote:

BTW, I don't think the overwriting-smgr idea is a done deal. We haven't
seen any design yet for exactly how it should work. Moreover, I'm
really hesitant to throw away one of the fundamental design choices of
Postgres: the no-overwrite smgr is one of the things that got us to where we
are today. Before we commit to that, we ought to do some serious study
of the alternatives. ISTM the problem with VACUUM is not that you need
to do a regular maintenance procedure, it's that it grabs an exclusive
lock on the table for so long. We could live with VACUUM if it could be
made either incremental (do a few pages and release the lock) or capable
of running in parallel with reader & writer transactions. Vadim's
still-not-integrated LAZY VACUUM code is an indicator that progress
might be made in that direction. (Actually, I suppose if you look at it
in the right way, you might think that a backgroundable VACUUM *is* an
overwriting smgr, just an asynchronous implementation of it...)

And it allows the writes that need to be done quickly to be kept together,
and the slow part to be asynchronous. I suspect that we will never be able
to get very good statistics without a separate ANALYZE, so we will have
asynchronous processes anyhow.

Also, we might want to get time travel back sometime, which I guess is
still done most effectively with the current scheme plus having VACUUM
keep some history on a per-table basis.

Other than that, time travel only ;) needs recording of the wall-clock time
of commits that have modified data, plus some extended query features.

The (wall-clock-time, xid) table is naturally ordered by said wall-clock
time, so it won't even need an index, just a binary-search access method.

------------------
Hannu

#6 Stephan Szabo
sszabo@megazone23.bigpanda.com
In reply to: Tom Lane (#3)
Re: AW: Coping with huge deferred-trigger lists

If we do that then we still have a problem with overrunning memory
after a sufficiently large number of tuples. However, that'd improve
the constant factor by at least an order of magnitude, so it might be
worth doing as an intermediate step. Still have to figure out whether
the triggered-data-change business is significant or not.

I think that was part of the misunderstanding of the spec. I think the
spec means it to be within one statement (and its associated immediate
actions) rather than the rest of the transaction. I think it's mostly to
prevent loop cases: A row 1 modifies B row 1, which modifies A row 1, which
modifies ... However, I only looked at it briefly a while back.

#7 Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Zeugswetter Andreas SB (#2)
Re: AW: Coping with huge deferred-trigger lists

Tom Lane wrote:

Zeugswetter Andreas SB <ZeugswetterA@wien.spardat.at> writes:

Perhaps instead
of storing each and every trigger-related tuple in memory, we only need
to store one value per affected table: the lowest CTID of any tuple
that we need to revisit for deferred-trigger purposes. At the end of
the transaction, scan forward from that point to the end of the table,
looking for tuples that were inserted by the current xact.

I thought that the current placement of new rows at the end of the file is
subject to change soon (overwriting smgr)?

Well, the scheme would still *work* if rows were not always placed at
the end of file, though it might get inefficient.

Even under current smgr, new rows aren't necessarily at the end.

[snip]

BTW, I don't think the overwriting-smgr idea is a done deal. We haven't
seen any design yet for exactly how it should work. Moreover, I'm
really hesitant to throw away one of the fundamental design choices of
Postgres: the no-overwrite smgr is one of the things that got us to where we
are today.

I don't think we could/should introduce an overwriting smgr
in 7.2 unless we give up the current level of stability/
reliability. We don't have UNDO functionality yet, even
under the current simple no-overwrite smgr.

Before we commit to that, we ought to do some serious study
of the alternatives. ISTM the problem with VACUUM is not that you need
to do a regular maintenance procedure, it's that it grabs an exclusive
lock on the table for so long. We could live with VACUUM if it could be
made either incremental (do a few pages and release the lock) or capable
of running in parallel with reader & writer transactions. Vadim's
still-not-integrated LAZY VACUUM code is an indicator that progress
might be made in that direction. (Actually, I suppose if you look at it
in the right way, you might think that a backgroundable VACUUM *is* an
overwriting smgr, just an asynchronous implementation of it...)

A backgroundable VACUUM, reuse of dead space, etc. could never
be an overwriting smgr. When a tuple is updated, corresponding
index tuples must always be inserted.

regards,
Hiroshi Inoue

#8 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hiroshi Inoue (#7)
Re: AW: Coping with huge deferred-trigger lists

Hiroshi Inoue <Inoue@tpf.co.jp> writes:

I thought that the current placement of new rows at the end of the file is
subject to change soon (overwriting smgr)?

Even under current smgr, new rows aren't necessarily at the end.

Hmm ... you're right, heap_update will try to store an updated tuple on
the same page as its original.

That doesn't make my suggestion unworkable, however, since this case is
not very likely to occur except on pages at/near the end of file. One
way to deal with it is to keep a list of pages (still not individual
tuples) that contain tuples we need to revisit for deferred triggers.
The list would be of the form "scan these individual pages plus all
pages from point X to the end of file", where point X would be at or
perhaps a little before the end of file as it stood at the start of the
transaction. We'd only need to explicitly store the page numbers for
relatively few pages, usually.
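
Spelled out, the per-table bookkeeping for that refinement might look roughly
like this (hypothetical names again, just to sketch the "explicit pages plus
tail range" idea):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Per-table entry for the refined scheme: a short list of explicitly
     * remembered pages (updates that landed on earlier pages) plus a starting
     * block from which everything through end-of-file is scanned anyway. */
    typedef struct
    {
        uint32_t  relid;        /* table with deferred triggers */
        uint32_t  tail_start;   /* scan every block >= this one */
        size_t    npages;       /* explicitly listed pages; usually few */
        uint32_t *pages;        /* blocks < tail_start that also need a visit */
    } DeferredPageSet;

    /* A block needs revisiting iff it is at or beyond the tail start, or it
     * is one of the explicitly remembered pages. */
    static bool
    block_needs_visit(const DeferredPageSet *set, uint32_t block)
    {
        size_t i;

        if (block >= set->tail_start)
            return true;
        for (i = 0; i < set->npages; i++)
        {
            if (set->pages[i] == block)
                return true;
        }
        return false;
    }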

BTW, thanks for pointing that out --- it validates my idea in another
thread that we can avoid locking on every single call to
RelationGetBufferForTuple, if it's OK to store newly inserted tuples
on pages that aren't necessarily last in the file.

regards, tom lane

#9 Jan Wieck
JanWieck@Yahoo.com
In reply to: Tom Lane (#1)
Re: Coping with huge deferred-trigger lists

Tom Lane wrote:

I had a thought just now about how to deal with the TODO item about
coping with deferred trigger lists that are so long as to overrun
main memory. This might be a bit harebrained, but I offer it for
consideration:

What we need to do, at the end of a transaction in which deferred
triggers were fired, is to find each tuple that was inserted or
updated in the current transaction in each table that has such
triggers. Well, we know where those tuples are: to a first
approximation, they're all near the end of the table. Perhaps instead
of storing each and every trigger-related tuple in memory, we only need
to store one value per affected table: the lowest CTID of any tuple
that we need to revisit for deferred-trigger purposes. At the end of
the transaction, scan forward from that point to the end of the table,
looking for tuples that were inserted by the current xact. Process each
one using the table's list of deferred triggers.

Instead of a list of all tuples subject to deferred triggers, we now
need only a list of all tables subject to deferred triggers, which
should pose no problems for memory consumption. It might be objected
that this means more disk activity --- but in an xact that hasn't
inserted very many tuples, most likely the disk blocks containing 'em
are still in memory and won't need a physical re-read. Once we get to
inserting so many tuples that that's not true, this approach should
require less disk activity overall than the previous idea of writing
(and re-reading) a separate disk file for the tuple list.

I am not sure exactly what the "triggered data change violation" test
does or is good for, but if we want to keep it, I *think* that in these
terms we'd just need to signal error if we come across a tuple that was
both inserted and deleted by the current xact. I'm a bit fuzzy on this
though.

The check came from my possibly wrong understanding of the
SQL3 specs. The idea I had is that the SUMMARY of all
changes during a transaction counts. If you INSERT a row into
a table and have immediate triggers invoked, a later DELETE
cannot undo the triggers. So the question is: did this row
ever exist?

An interesting property of this approach is that if the set of triggers
for the table changes during the xact (which could only happen if this
same xact created or deleted triggers; no other xact can, since changing
triggers requires an exclusive lock on the table), the set of triggers
applied to a tuple is the set that exists at the end of the xact, not
the set that existed when the tuple was modified. Offhand I think this
is a good change.

Comments?

Suppose you have two separate, named, deferred constraints on
one table. Now, after a couple of INSERTs and UPDATEs, you SET
one of them to IMMEDIATE and back to DEFERRED. This has to
run the triggers for one, and only one, of the constraints now.
If you don't worry about needing to run the checks again
later, it's OK.

The detail I'm wondering about most is how you'd know in an
UPDATE case which two tuples (one deleted during this XACT
and one inserted) are the two for OLD and NEW in the call to
the trigger. Note that the referential action ON UPDATE SET
NULL for example doesn't have to take place if the user
didn't change the referenced key fields.

Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #

#10 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jan Wieck (#9)
Re: Coping with huge deferred-trigger lists

Jan Wieck <JanWieck@Yahoo.com> writes:

The detail I'm wondering about most is how you'd know in an
UPDATE case which two tuples (one deleted during this XACT
and one inserted) are the two for OLD and NEW in the call to
the trigger.

Ugh ... good point. There's no back-link from the updated tuple to
its original on disk, is there?

Back to the drawing board ...

regards, tom lane

#11 Jan Wieck
JanWieck@Yahoo.com
In reply to: Tom Lane (#10)
Re: Coping with huge deferred-trigger lists

Tom Lane wrote:

Jan Wieck <JanWieck@Yahoo.com> writes:

The detail I'm wondering about most is how you'd know in an
UPDATE case which two tuples (one deleted during this XACT
and one inserted) are the two for OLD and NEW in the call to
the trigger.

Ugh ... good point. There's no back-link from the updated tuple to
its original on disk, is there?

AFAIK nothing other than the Oid. And that's IMHO a weak one.

Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #
