SELECT * FROM <table> LIMIT 1; is really slow

Started by David Blasby · over 21 years ago · 59 messages
#1David Blasby
dblasby@refractions.net

I have a table with about 16,000,000 rows in it.

When I do a:

SELECT * FROM <table> LIMIT 1;

it takes about 10 minutes (that's about how long it takes to do a full
sequential scan).

I had originally thought that there might be a large number of
"wasted/retired" tuples in the table so I "vacuum analysed" the
database. It had no effect. I had a "vacuum full" going on the table
for 17 hours before I killed it.

Any idea why it's so slow? It used to be instant.

dave

#2Gaetano Mendola
mendola@bigfoot.com
In reply to: David Blasby (#1)
Re: SELECT * FROM <table> LIMIT 1; is really slow

David Blasby wrote:

I have a table with about 16,000,000 rows in it.

When I do a:

SELECT * FROM <table> LIMIT 1;

it takes about 10 minutes (that's about how long it takes to do a full
sequential scan).

I had originally thought that there might be a large number of
"wasted/retired" tuples in the table so I "vacuum analysed" the
database. It had no effect. I had a "vacuum full" going on the table
for 17 hours before I killed it.

Are you sure the vacuum full was actually running, and not just sitting
there waiting on an idle transaction?

Regards
Gaetano Mendola

#3David Blasby
dblasby@refractions.net
In reply to: Gaetano Mendola (#2)
Re: SELECT * FROM <table> LIMIT 1; is really slow

Gaetano Mendola wrote:

David Blasby wrote:

I have a table with about 16,000,000 rows in it.

When I do a:

SELECT * FROM <table> LIMIT 1;

it takes about 10 minutes (that's about how long it takes to do a full
sequential scan).

I had originally thought that there might be a large number of
"wasted/retired" tuples in the table so I "vacuum analysed" the
database. It had no effect. I had a "vacuum full" going on the table
for 17 hours before I killed it.

Are you sure the vacuum full was actually running, and not just sitting
there waiting on an idle transaction?

It was running at about 90% CPU.

I just did another vacuum analyse on the table:

cwb_prod_5_20=# VACUUM ANALYZE verbose csn_edges;
INFO: vacuuming "public.csn_edges"
INFO: index "csn_edges_group_code_idx" now contains 16289929 row
versions in 75789 pages
INFO: index "csn_edges_edge_id_idx" now contains 16289929 row versions
in 55210 pages
INFO: index "csn_edges_code_idx" now contains 16289929 row versions in
61203 pages
INFO: index "csn_edges_outside_idx" now contains 16289929 row versions
in 75719 pages

INFO: index "csn_edges_the_geom_idx" now contains 16289929 row versions
in 238795 pages
INFO: "csn_edges": found 0 removable, 16289929 nonremovable row
versions in 2783986 pages
INFO: vacuuming "pg_toast.pg_toast_126945560"
INFO: index "pg_toast_126945560_index" now contains 441432 row versions
in 3064 pages
INFO: "pg_toast_126945560": found 0 removable, 441432 nonremovable row
versions in 154691 pages
INFO: analyzing "public.csn_edges"

INFO: "csn_edges": 2783986 pages, 3000 rows sampled, 6724 estimated
total rows
VACUUM

It's still slow!

dave

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: David Blasby (#1)
Re: SELECT * FROM <table> LIMIT 1; is really slow

David Blasby <dblasby@refractions.net> writes:

I have a table with about 16,000,000 rows in it.
When I do a:

SELECT * FROM <table> LIMIT 1;

it takes about 10 minutes (that's about how long it takes to do a full
sequential scan).

The only explanation that comes to mind is huge amounts of dead space
before the first live row. But the output of "vacuum verbose" would
probably be enough to tell whether this is really the correct diagnosis.

I had originally thought that there might be a large number of
"wasted/retired" tuples in the table so I "vacuum analysed" the
database. It had no effect.

It wouldn't; you'd need vacuum full to collapse out the dead space.
You could also try CLUSTER which is faster than VACUUM FULL when most
of the data has to be moved anyway. (Some days I think we should dump
VACUUM FULL, because it's optimized for a case that's no longer very
interesting...)
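
For example, using the 7.4-era CLUSTER syntax and (purely as an illustration)
one of the csn_edges indexes shown in the VACUUM VERBOSE output above:

CLUSTER csn_edges_edge_id_idx ON csn_edges;  -- rewrites the table in index order, dropping the dead space
ANALYZE csn_edges;                           -- refresh planner statistics on the rewritten table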

regards, tom lane

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: David Blasby (#3)
Re: SELECT * FROM <table> LIMIT 1; is really slow

David Blasby <dblasby@refractions.net> writes:

I just did another vacuum analyse on the table:

Ah, here we go:

INFO: "csn_edges": found 0 removable, 16289929 nonremovable row
versions in 2783986 pages

That works out to just under 6 rows per 8K page, which wouldn't be too
bad if the rows are 1K wide on average, but are they? (You might want
to run contrib/pgstattuple to get some exact information about average
tuple size.)
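
A quick way to get those numbers, assuming the contrib/pgstattuple install
script has been loaded into this database, is something like:

SELECT * FROM pgstattuple('csn_edges');
-- reports physical table size, live and dead tuple counts and lengths, and free space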

INFO: analyzing "public.csn_edges"
INFO: "csn_edges": 2783986 pages, 3000 rows sampled, 6724 estimated
total rows

This looks like a smoking gun to me. The huge underestimate of number
of rows from ANALYZE is a known failure mode of the existing sampling
method when the early pages of the table are thinly populated. (Manfred
just fixed that for 7.5, btw.)

I think you want to VACUUM FULL or CLUSTER the table, and then take a
look at your FSM settings and routine vacuuming frequency to see if
you need to adjust them to keep this from happening again.
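
The FSM knobs live in postgresql.conf and need a server restart to change;
the values below are only illustrative, not a recommendation:

max_fsm_pages = 2000000      # must be large enough to cover all pages holding reusable free space
max_fsm_relations = 1000     # number of relations tracked by the free space map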

regards, tom lane

#6David Blasby
dblasby@refractions.net
In reply to: Tom Lane (#5)
Re: SELECT * FROM <table> LIMIT 1; is really slow

Tom Lane wrote:

INFO: "csn_edges": found 0 removable, 16289929 nonremovable row
versions in 2783986 pages

That works out to just under 6 rows per 8K page, which wouldn't be too
bad if the rows are 1K wide on average, but are they? (You might want
to run contrib/pgstattuple to get some exact information about average
tuple size.)

The rows are "wide" - there's a PostGIS geometry present.

INFO: analyzing "public.csn_edges"
INFO: "csn_edges": 2783986 pages, 3000 rows sampled, 6724 estimated
total rows

This looks like a smoking gun to me. The huge underestimate of number
of rows from ANALYZE is a known failure mode of the existing sampling
method when the early pages of the table are thinly populated. (Manfred
just fixed that for 7.5, btw.)

I think you want to VACUUM FULL or CLUSTER the table, and then take a
look at your FSM settings and routine vacuuming frequency to see if
you need to adjust them to keep this from happening again.

I'm now clustering - thanks for the help!

The history of this table is quite short - I just created it last week.

The original table had a bigint column that I converted to int (using
the "alter table csn_edges rename to csn_edges_backup;CREATE TABLE
csn_edges AS SELECT a,b,c::int,d,e FROM csn_edges; delete table
csn_edges_backup;" trick). I dont think there were any changes to the
current csn_edges table after it was created.
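
Spelled out (columns a through e are placeholders; the SELECT reads from the
renamed backup table, and the last step is a DROP TABLE), the conversion was
essentially:

ALTER TABLE csn_edges RENAME TO csn_edges_backup;
CREATE TABLE csn_edges AS
    SELECT a, b, c::int, d, e FROM csn_edges_backup;
DROP TABLE csn_edges_backup;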

I have another copy of this table in another database - vacuum analyse
verbose says it's "only" 1,500,000 pages (vs 2,800,000). Shouldn't vacuum
know your table is wasting 10Gb of space and fix it for you? Or at
least HINT? Or a "TIDY" command?

Should I be upping my FSM to 2,000,000 pages?

dave

#7Noname
pgsql@mohawksoft.com
In reply to: Tom Lane (#5)
Re: SELECT * FROM <table> LIMIT 1; is really slow

David Blasby <dblasby@refractions.net> writes:

I just did another vacuum analyse on the table:

Ah, here we go:

INFO: "csn_edges": found 0 removable, 16289929 nonremovable row
versions in 2783986 pages

That works out to just under 6 rows per 8K page, which wouldn't be too
bad if the rows are 1K wide on average, but are they? (You might want
to run contrib/pgstattuple to get some exact information about average
tuple size.)

INFO: analyzing "public.csn_edges"
INFO: "csn_edges": 2783986 pages, 3000 rows sampled, 6724 estimated
total rows

This looks like a smoking gun to me. The huge underestimate of number
of rows from ANALYZE is a known failure mode of the existing sampling
method when the early pages of the table are thinly populated. (Manfred
just fixed that for 7.5, btw.)

Tom, is there a way to choose between a sample and full?

I think you want to VACUUM FULL or CLUSTER the table, and then take a
look at your FSM settings and routine vacuuming frequency to see if
you need to adjust them to keep this from happening again.

regards, tom lane


#8Christopher Kings-Lynne
chriskl@familyhealth.com.au
In reply to: Tom Lane (#4)
Re: SELECT * FROM <table> LIMIT 1; is really slow

It wouldn't; you'd need vacuum full to collapse out the dead space.
You could also try CLUSTER which is faster than VACUUM FULL when most
of the data has to be moved anyway. (Some days I think we should dump
VACUUM FULL, because it's optimized for a case that's no longer very
interesting...)

Out of interest, is CLUSTER another fast way of truly removing OIDs from
a table, after going SET WITHOUT OIDS?

Chris

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Christopher Kings-Lynne (#8)
Re: SELECT * FROM <table> LIMIT 1; is really slow

Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes:

It wouldn't; you'd need vacuum full to collapse out the dead space.
You could also try CLUSTER which is faster than VACUUM FULL when most
of the data has to be moved anyway. (Some days I think we should dump
VACUUM FULL, because it's optimized for a case that's no longer very
interesting...)

Out of interest, is CLUSTER another fast way of truly removing OIDs from
a table, after going SET WITHOUT OIDS?

I think not --- IIRC, CLUSTER just copies the tuples verbatim. It'd do
fine for getting rid of dead tuples and unused space, but not for making
any actual changes in the individual tuples.

What would be useful for that (as of 7.5) is a dummy ALTER TABLE TYPE,
viz

ALTER TABLE tab ALTER COLUMN col TYPE same-type-it-already-has

There not being any special case to short-circuit this, the thing will
sit there and reconstruct the tuples and write them into a fresh table
file (and rebuild the indexes, too). Net effect is to collapse out any
dropped columns (or OIDs) as well as losing dead tuples and free space.

I was just noticing today that the ALTER TABLE documentation suggests
a dummy UPDATE and VACUUM FULL to get rid of the space occupied by a
dropped column. On reflection the ALTER TYPE method is likely to be
an order of magnitude faster. Will update the docs.

A further thought along this line: if you have to do an UPDATE that
affects every row of a large table, the traditional way is, eg,

UPDATE tab SET col = col + 1

which leaves you with N live rows, N dead rows, and lots of pain to get
back down to a less-than-twice-normal-size table. (Traditional way is
VACUUM FULL; CLUSTER is better, but still painful.) As of 7.5 you could
hack this with

ALTER TABLE tab ALTER COLUMN col TYPE same-type USING col + 1

which will have the effect of rewriting a fresh table file containing
the updated rows, and dropping the old file at commit. Sweet. (Peak
disk usage is 2x normal either way, but you don't pay through the nose
to get back down to 1x.) I'm not real sure where to document this
trick but it seems like we ought to mention it someplace.
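
Concretely (table and column names are placeholders, and integer stands in
for whatever type the column already has):

ALTER TABLE tab ALTER COLUMN col TYPE integer;                -- dummy rewrite: collapses dead space, dropped columns, OIDs
ALTER TABLE tab ALTER COLUMN col TYPE integer USING col + 1;  -- whole-table "update" done as a table rewrite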

regards, tom lane

#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: David Blasby (#6)
Re: SELECT * FROM <table> LIMIT 1; is really slow

David Blasby <dblasby@refractions.net> writes:

I have another copy of this table in another database - vacuum analyse
verbose says its "only" 1,500,000 pages (vs 2,800,000).

Hmm ... this is consistent with the idea that you did an UPDATE
affecting every row of the table. That would generate N new rows which
would all get added to the end of the table, if there were no dead space
available within the table. After commit you have a lot of pages worth
of dead tuples followed by a lot of pages worth of live tuples. Plain
VACUUM cannot do much to fix this since it doesn't move rows around.
VACUUM FULL will fix it, but its index-update overhead is high enough
that CLUSTER is a better deal.

See followup message discussing ALTER TABLE as another alternative.

regards, tom lane

#11Paul Ramsey
pramsey@refractions.net
In reply to: Tom Lane (#10)
Re: SELECT * FROM <table> LIMIT 1; is really slow

Tom Lane wrote:

of dead tuples followed by a lot of pages worth of live tuples. Plain
VACUUM cannot do much to fix this since it doesn't move rows around.
VACUUM FULL will fix it, but its index-update overhead is high enough
that CLUSTER is a better deal.

Tom: I was interested in performance improvements from cluster, so I
tried to cluster a table on a spatial index:

dra_working=# \d geomtest
Table "public.geomtest"
Column | Type | Modifiers
----------------+----------+-----------
rd_segment_id | integer |
admit_date | date |
retire_date | date |
most_recent | boolean |
lineargeometry | geometry |
Indexes: geomtest_idx gist (lineargeometry)

dra_working=# cluster geomtest_idx on geomtest;
ERROR: CLUSTER: cannot cluster when index access method does not handle
nulls
You may be able to work around this by marking column "lineargeometry"
NOT NULL
dra_working=# select version();
version
---------------------------------------
PostgreSQL 7.3.6 on i686-pc-linux-gnu

As of quite a while ago (7.2?) the GiST access method was made null-safe
by Teodor and Oleg, I think. Is this a safety wrapper left over from
before the upgrade to GiST?

--
__
/
| Paul Ramsey
| Refractions Research
| Email: pramsey@refractions.net
| Phone: (250) 885-0632
\_

#12Dennis Bjorklund
db@zigo.dhs.org
In reply to: Tom Lane (#9)
Re: SELECT * FROM <table> LIMIT 1; is really slow

On Wed, 26 May 2004, Tom Lane wrote:

if you have to do an UPDATE that affects every row of a large table

UPDATE tab SET col = col + 1

which leaves you with N live rows, N dead rows, and lots of pain to get
back down to a less-than-twice-normal-size table. (Traditional way is
VACUUM FULL; CLUSTER is better, but still painful.) As of 7.5 you could
hack this with

ALTER TABLE tab ALTER COLUMN col TYPE same-type USING col + 1

which will have the effect of rewriting a fresh table file containing
the updated rows, and dropping the old file at commit.

I'm not real sure where to document this
trick but it seems like we ought to mention it someplace.

Isn't it better to detect an UPDATE without a WHERE clause and do that update
in the same way as the ALTER TABLE above? Then we don't need to document and
learn a new non-standard way of doing an update.

--
/Dennis Björklund

#13Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Dennis Bjorklund (#12)
Re: SELECT * FROM <table> LIMIT 1; is really slow

What would be useful for that (as of 7.5) is a dummy ALTER TABLE TYPE,
viz

ALTER TABLE tab ALTER COLUMN col TYPE same-type-it-already-has

I think we should add special syntax for this purpose, since I (or someone
else later on) would like to see all possible cases of "alter column"
short-circuited.

The syntax I would see fit is something like:
ALTER TABLE tab [MOVE] TABLESPACE xy;

For the above special case the tablespace would be the same as before.

Andreas

#14Tom Lane
tgl@sss.pgh.pa.us
In reply to: Dennis Bjorklund (#12)
Re: SELECT * FROM <table> LIMIT 1; is really slow

Dennis Bjorklund <db@zigo.dhs.org> writes:

On Wed, 26 May 2004, Tom Lane wrote:

I'm not real sure where to document this
trick but it seems like we ought to mention it someplace.

Isn't it better to detect an UPDATE without a WHERE clause and do that update
in the same way as the ALTER TABLE above? Then we don't need to document and
learn a new non-standard way of doing an update.

No, because the locking implications are completely different. I don't
want UPDATE to suddenly decide it needs an exclusive lock on the table
based on the shape of the WHERE clause.

regards, tom lane

#15David Blasby
dblasby@refractions.net
In reply to: David Blasby (#6)
Re: SELECT * FROM <table> LIMIT 1; is really slow

I did a CLUSTER - it took just under 12 hours. The original load of the
table (including parsing all the geometries) took about 4 hours (+4 to
build the indexes and vacuum analyse). I don't know why the cluster is
so much slower...

My "SELECT * FROM <table> LIMIT 1;" is now fast.

Thanks for the help!

As a side note, I think the behavior of the Database is a bit
counter-intuitive. If I were to do a:

UPDATE <table> SET i = i + 1;

My table will "leak" 10Gb of disk space, make queries extreamly slow,
and require taking the database off-line for 12 hours to fix it!

Both the VACUUM and the UPDATE manual should be updated to say that this
is whats happening. If I do my above query 100 times, will I be leaking
a Terrabyte of disk space?

Will increasing my max_fsm_pages to 2,000,000 solve my problem?

dave

#16Manfred Koizar
mkoi-pg@aon.at
In reply to: Tom Lane (#4)
Re: SELECT * FROM <table> LIMIT 1; is really slow

On Wed, 26 May 2004 18:17:55 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote:

(Some days I think we should dump
VACUUM FULL, because it's optimized for a case that's no longer very
interesting...)

So we still have to stick with VACUUM FULL for some time, right?

The next set of compatibility breakers I'm currently working on requires
a change in VACUUM FULL behaviour. I would only move tuples that are
visible to all running transactions. OTOH I wouldn't stop at the first
unmovable tuple.

With
X active tuple
. free space or dead tuple
y new tuple, not yet visible to a running transaction
z deleted tuple, still visible to a running transaction

the current implementation transforms this relation

XXXX.... XXXX.... XXXX.... XXXX.... XXyX.... XXXX.... XzXX.... XXXX....

into

XXXXXXXX XXXXXzXX XXXXXXXX XXXXXXyX

The new implementation would leave it as

XXXXXXXX XXXXXXXX XXXXXXXX XXXXXX.. ..y..... ........ .z......

If there are concurrent long-running transactions, the new VACUUM FULL
wouldn't truncate the relation as aggressively as it does now. It could
leave the relation with lots of free space near the end. This was
absolutely unacceptable at the time when VACUUM FULL was designed. But
now we could use lazy VACUUM as an excuse for VACUUM FULL not working so
hard. After the transaction still seeing z terminates, VACUUM (without
FULL) can truncate the relation to

XXXXXXXX XXXXXXXX XXXXXXXX XXXXXX.. ..y.....

and when y is updated the new version will be stored in a lower block
and plain VACUUM can truncate the relation again:

XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXY.

AFAICS this would make vacuum.c much simpler (no more chain moves).
Clearly this change alone doesn't have any merit. But would such a
patch have any chance of being accepted, if it facilitates improvements
in other areas?

Servus
Manfred

#17Tom Lane
tgl@sss.pgh.pa.us
In reply to: Manfred Koizar (#16)
Re: SELECT * FROM <table> LIMIT 1; is really slow

Manfred Koizar <mkoi-pg@aon.at> writes:

XXXXXXXX XXXXXXXX XXXXXXXX XXXXXX.. ..y.....

and when y is updated the new version will be stored in a lower block

Oh? What makes you think that? I see no guarantee of it.

AFAICS this would make vacuum.c much simpler (no more chain moves).

How will it do that? I think you'd have to not move *any* updated
tuples to be sure you don't need the chain-move mechanism. Not moving
the outdated tuple isn't sufficient, you can't move the one it points at
either.

Clearly this change alone doesn't have any merit. But would such a
patch have any chance of being accepted, if it facilitates improvements
in other areas?

I'm disinclined to mess with VACUUM FULL without a clearer explanation
of where you're headed. So far as I can see, you're simply going to
make VACUUM FULL less effective with no stated benefit.

(BTW, it now occurs to me that CLUSTER and ALTER TABLE in their present
forms may be broken, because they only copy rows that are valid
according to SnapshotNow; this means that rows that are still visible to
old transactions could be lost. The lack of any attempt to preserve
update chain relationships seems ungood as well, if we might have old
transactions come along and try to update the table later.)

regards, tom lane

#18Manfred Koizar
mkoi-pg@aon.at
In reply to: Tom Lane (#17)
Re: SELECT * FROM <table> LIMIT 1; is really slow

On Thu, 27 May 2004 14:23:07 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote:

and when y is updated the new version will be stored in a lower block

Oh? What makes you think that? I see no guarantee of it.

You're right, I see only a tendency, because the majority of free space
is before the last block (obviously). But don't we try to store the new
version on the same block as the old version? That'd weaken my argument
a bit.

I think you'd have to not move *any* updated
tuples to be sure you don't need the chain-move mechanism.

Yes, per definitionem (move only tuples that are visible to all).

I'm disinclined to mess with VACUUM FULL without a clearer explanation
of where you're headed.

Have no fear. I won't change anything in the near term. As you were
talking about the future of VACUUM FULL, I thought this might be a good
opportunity to ask. The fact that you didn't outright reject the idea
is good enough for now.

I have no clear explanation at the moment, just some fuzzy ideas that
are beginning to crystallise. I'm messing around with heap tuple
headers again, and the Xvac field seems to get in the way, unless I can
cut down the number of different scenarios where it is needed.

Servus
Manfred

#19Alvaro Herrera
alvherre@dcc.uchile.cl
In reply to: Manfred Koizar (#18)
Re: SELECT * FROM <table> LIMIT 1; is really slow

On Thu, May 27, 2004 at 09:52:30PM +0200, Manfred Koizar wrote:

I have no clear explanation at the moment, just some fuzzy ideas that
are beginning to crystallise. I'm messing around with heap tuple
headers again, and the Xvac field seems to get in the way, unless I can
cut down the number of different scenarios where it is needed.

Now that you are on the subject, can I ask you to take a peek at what I did
regarding tuple headers?

At first I thought I'd have to add back Xmax as a field on its own, but
later (in chat with Bruce) I arrived at the conclusion that it isn't
really necessary, and I only added a bit to the infomask to flag when
the Cmin is overridden with Xmax.

However I'm not convinced that this is enough --- is there a situation
in which we would need to peek at Cmin after setting Xmax for a
particular tuple?

The problem was

BEGIN;
insert into foo values (1)
begin
delete from foo
rollback
-- at this point the tuple should be visible,
-- but it has my Xid as Xmin and Cmin was
-- overridden with Xmax
commit

I'd appreciate your (Manfred's and Tom's) comments on the topic.

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"And as an added bonus, now my computer goes to the toilet for me, leaving me
free to spend time on more useful activities! yay slug codefests!" (C. Parker)

#20Manfred Koizar
mkoi-pg@aon.at
In reply to: Alvaro Herrera (#19)
Re: SELECT * FROM <table> LIMIT 1; is really slow

On Thu, 27 May 2004 16:50:24 -0400, Alvaro Herrera
<alvherre@dcc.uchile.cl> wrote:

Now that you are on the subject, can I ask you to take a peek at what I did
regarding tuple headers?

I did read your patch, but I didn't understand it. :-(

At first I thought I'd have to add back Xmax as a field on its own

Veto! This would increase heap tuple header size ==> fewer tuples per
page ==> more pages per table ==> more I/O ==> performance loss.

is there a situation
in which we would need to peek at Cmin after setting Xmax for a
particular tuple?

http://archives.postgresql.org/pgsql-hackers/2002-05/msg00090.php

Servus
Manfred

#21Tom Lane
tgl@sss.pgh.pa.us
In reply to: Manfred Koizar (#20)
Re: SELECT * FROM <table> LIMIT 1; is really slow

Manfred Koizar <mkoi-pg@aon.at> writes:

On Thu, 27 May 2004 16:50:24 -0400, Alvaro Herrera

At first I thought I'd have to add back Xmax as a field on its own

Veto! This would increase heap tuple header size ==> fewer tuples per
page ==> more pages per table ==> more I/O ==> performance loss.

If putting back xmax is the price we must pay for nested transactions,
then we *will* pay that price. Maybe not in this release, but it will
inevitably happen. Don't bother hollering "veto" ;-)

I suspect that in fact this may be necessary. The justification for
overlapping cmin and xmax hinges on two things:
1. within the creating transaction, we can handle the
xmax=xmin case specially;
2. another xact could only want to store xmax into a committed
tuple, therefore the original xact is done and we don't need
cmin anymore.
However this breaks down with nested xacts. For instance imagine
this situation:

* Outer transaction creates a tuple.

* Inner transaction deletes this tuple (which it presumably can see).

* Inner transaction rolls back.

The tuple must still be visible to the outer xact. However now we have
a problem: we've wiped out its cmin, which we need for visibility tests
in the outer xact.

We could possibly avoid this particular issue with sufficiently complex
visibility rules. (I am thinking that we might be able to say that the
inner xact can't see the tuple in question unless the creating command
was "done" in the terms of the outer transaction, in which case perhaps
we don't need its cmin anymore. But I fear that that won't work either.
For instance a serializable cursor opened before the tuple was created
should not be able to see it, so it sure seems like we need cmin.)
And I don't feel confident that there are no other, even harder-to-avoid,
cases to worry about.

Something that just now occurred to me: could we identify
subtransactions with commands? That is, cmin *is* the subtransaction
ID, and xmin/xmax are always the parent xact? I'm not sure this works
either, but it might be something to think about.

regards, tom lane

#22Alvaro Herrera
alvherre@dcc.uchile.cl
In reply to: Tom Lane (#21)
Re: SELECT * FROM <table> LIMIT 1; is really slow

On Fri, May 28, 2004 at 02:47:01PM -0400, Tom Lane wrote:

We could possibly avoid this particular issue with sufficiently complex
visibility rules. (I am thinking that we might be able to say that the
inner xact can't see the tuple in question unless the creating command
was "done" in the terms of the outer transaction, in which case perhaps
we don't need its cmin anymore. But I fear that that won't work either.
For instance a serializable cursor opened before the tuple was created
should not be able to see it, so it sure seems like we need cmin.)
And I don't feel confident that there are no other, even harder-to-avoid,
cases to worry about.

Hm, the serializable cursor was the example I was looking for to show
why the current idea does not work.

Something that just now occurred to me: could we identify
subtransactions with commands? That is, cmin *is* the subtransaction
ID, and xmin/xmax are always the parent xact? I'm not sure this works
either, but it might be something to think about.

This seems a nice idea. We wouldn't need pg_subtrans at all, for
starters -- no multiple Xids for a transaction tree. And the cmin/cmax
test would only be done inside the backend running the xact tree, so it
doesn't need to be stored permanently, nor shared.

We would need to be able to mark individual CommandIds as aborted, and
while checking Cmin and Cmax, not only see how they compare to the
CurrentCommandId, but also whether they aborted.

It looks simpler to me than the current design.

The only problem would be _how_ to mark a bunch of CommandIds as
aborted -- keeping them all in memory seems too heavy. A bitmap could
be an interesting idea, but for a very big transaction we could need at
most 2^32 bits, which is way too much. Runlength encoding maybe? Any
graphic-library hacker around here with knowledge about compressing
bit strings? I know nothing of this stuff.

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
Si no sabes adonde vas, es muy probable que acabes en otra parte.

#23Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#22)
Re: SELECT * FROM <table> LIMIT 1; is really slow

Alvaro Herrera <alvherre@dcc.uchile.cl> writes:

On Fri, May 28, 2004 at 02:47:01PM -0400, Tom Lane wrote:

Something that just now occurred to me: could we identify
subtransactions with commands? That is, cmin *is* the subtransaction
ID, and xmin/xmax are always the parent xact? I'm not sure this works
either, but it might be something to think about.

It looks simpler to me than the current design.

The only problem would be _how_ to mark a bunch of CommandIds as
aborted -- keeping them all in memory seems too heavy. A bitmap could
be an interesting idea, but for a very big transaction we could need at
most 2^32 bits, which is way too much.

Not at all. In the first place, if there was anyone out there using
anything like 2^32 commands per xact, we'd have heard loud complaints
about the limit. ("My xact ran for a week, and then fell over because
of this stupid *%#*!*# limit! Fix it!") Since we've never ever heard
such a complaint, I think it's safe to assume there is plenty of
headroom there. I think ordinary cases are very unlikely to have more
than a few thousand commands per xact, and I don't think this will
change much with the availability of subtransactions. So I'd go for a
plain bitmap. (I *think* we only need one bit per CID, since we only
really care about "aborted or not" --- the visibility rules are simpler
than for concurrent xacts.)

We'd still need a plain CommandCounterIncrement facility, which means
that actually a subtransaction would have to be a group of CIDs not just
one. So there'd also need to be a data structure showing the CIDs
associated with each open subtransaction --- this is what you'd consult
to go and set the "aborted" bits if the subxact rolls back.

regards, tom lane

#24Alvaro Herrera
alvherre@dcc.uchile.cl
In reply to: Tom Lane (#23)
Re: SELECT * FROM <table> LIMIT 1; is really slow

On Fri, May 28, 2004 at 03:19:29PM -0400, Tom Lane wrote:

We'd still need a plain CommandCounterIncrement facility, which means
that actually a subtransaction would have to be a group of CIDs not just
one.

Right, this is why I suggested runlength (the group is contiguous).

So there'd also need to be a data structure showing the CIDs
associated with each open subtransaction --- this is what you'd
consult to go and set the "aborted" bits if the subxact rolls back.

Right. We only need to store the "borders" though. Not even that: only
the start, because the end is what is current at AbortSubTransaction()
time.

I'll try this.

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"El miedo atento y previsor es la madre de la seguridad" (E. Burke)

#25Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#24)
Re: SELECT * FROM <table> LIMIT 1; is really slow

Alvaro Herrera <alvherre@dcc.uchile.cl> writes:

On Fri, May 28, 2004 at 03:19:29PM -0400, Tom Lane wrote:

We'd still need a plain CommandCounterIncrement facility, which means
that actually a subtransaction would have to be a group of CIDs not just
one.

Right, this is why I suggested runlength (the group is contiguous).

Not necessarily.

Right. We only need to store the "borders" though. Not even that: only
the start, because the end is what is current at AbortSubTransaction()
time.

Nope. Think about sub-subtransactions.

A runlength encoding might be worth using, though, since you're right
that subxacts would tend to get runs of consecutive CIDs.

In theory we could also use a runlength representation for the master
bitmap of aborted CIDs, but that would be costly since you would have to
do a search, and not just a trivial index, to check the state of a
particular CID. My inclination would be to stick with a bitmap for now.
It'd be easy enough to revise the implementation later if that gut feel
proves out wrong.

regards, tom lane

#26Alvaro Herrera
alvherre@dcc.uchile.cl
In reply to: Tom Lane (#25)
Re: SELECT * FROM <table> LIMIT 1; is really slow

On Fri, May 28, 2004 at 03:48:11PM -0400, Tom Lane wrote:

Alvaro Herrera <alvherre@dcc.uchile.cl> writes:

On Fri, May 28, 2004 at 03:19:29PM -0400, Tom Lane wrote:

We'd still need a plain CommandCounterIncrement facility, which means
that actually a subtransaction would have to be a group of CIDs not just
one.

Right. We only need to store the "borders" though. Not even that: only
the start, because the end is what is current at AbortSubTransaction()
time.

Nope. Think about sub-subtransactions.

They are all aborted if the parent is, so

BEGIN;
BEGIN; -- cid = 1
BEGIN; -- cid = 2
INSERT INTO foo VALUES (1) -- cid=3
COMMIT;
ROLLBACK; -- aborts from Cid 1 to Cid 3
-- here we can't see the tuple because Xmin == my-xid
-- and Cmin=1 is aborted
COMMIT;

I assume this is how you think it works, isn't it? Because if not I am
in serious trouble. But if it is, then CommandIds from 1 ("start Cid")
and 3 ("current Cid") are aborted. And this can be kept in the master
bitmap --- in fact, I didn't think about another one.

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"Un poeta es un mundo encerrado en un hombre" (Victor Hugo)

#27Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#26)
Re: SELECT * FROM <table> LIMIT 1; is really slow

Alvaro Herrera <alvherre@dcc.uchile.cl> writes:

On Fri, May 28, 2004 at 03:48:11PM -0400, Tom Lane wrote:

Nope. Think about sub-subtransactions.

They are all aborted if the parent is, so

BEGIN;
BEGIN; -- cid = 1
BEGIN; -- cid = 2
INSERT INTO foo VALUES (1) -- cid=3
COMMIT;
ROLLBACK; -- aborts from Cid 1 to Cid 3
-- here we can't see the tuple because Xmin == my-xid
-- and Cmin=1 is aborted
COMMIT;

I assume this is how you think it works, isn't it?

[ thinks about it for a while... ] Yeah, I guess you are right. Since
we don't have threading, an outer transaction cannot assign any new CIDs
while a subtransaction is in progress. Therefore, when a subtransaction
ends, all CIDs from its start to current belong to either itself or its
subtransactions. On abort we can just mark *all* of these as aborted.
If we had to do anything at subtrans commit, we'd need more state, but
we don't have to do anything at subtrans commit.

So you're right, the per-open-subtrans state is just its starting CID.
Slick.

However, I just remembered why we rejected this idea to start with :-(.
If we do it this way then when the overall xact commits, we no longer
have state that tells which particular tuples are good or not. We would
have to trawl for tuples written by aborted subtransactions and mark
them dead before committing, else other transactions would think they
were good.

What this says is that we still need persistent pg_subtrans status.
I'm not sure if we can use CIDs as subtrans IDs this way and still have
a reasonably efficient storage representation for the global pg_subtrans
table.

regards, tom lane

#28Alvaro Herrera
alvherre@dcc.uchile.cl
In reply to: Tom Lane (#27)
Re: SELECT * FROM <table> LIMIT 1; is really slow

On Fri, May 28, 2004 at 04:45:28PM -0400, Tom Lane wrote:

However, I just remembered why we rejected this idea to start with :-(.
If we do it this way then when the overall xact commits, we no longer
have state that tells which particular tuples are good or not. We would
have to trawl for tuples written by aborted subtransactions and mark
them dead before committing, else other transactions would think they
were good.

What this says is that we still need persistent pg_subtrans status.
I'm not sure if we can use CIDs as subtrans IDs this way and still have
a reasonably efficient storage representation for the global pg_subtrans
table.

I'm not sure if I understand your last assertion. We can in no way use
the CID as subtrans ID in pg_subtrans, of course (it starts from 0 at
each main transaction start).

So pg_subtrans remains the same, and we assign a new Xid to each
subtransaction. Each tuple gets Xmin/Xmax according to the Xid of the
current subtransaction. Within the transaction tree we don't use the
Xid to check for visibility, but Cmin/Cmax and the abort bitmap.

When the Xmin/xmax does not belong to our transaction tree, we use
pg_subtrans and pg_clog.

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"We are who we choose to be", sang the goldfinch
when the sun is high (Sandman)

#29Alvaro Herrera
alvherre@dcc.uchile.cl
In reply to: Alvaro Herrera (#28)
Re: SELECT * FROM <table> LIMIT 1; is really slow

I wrote a while ago

So pg_subtrans remains the same, and we assign a new Xid to each
subtransaction. Each tuple gets Xmin/Xmax according to the Xid of the
current subtransaction. Within the transaction tree we don't use the
Xid to check for visibility, but Cmin/Cmax and the abort bitmap.

... and meanwhile I forgot why this subthread started: I haven't solved
the original problem, which is that I want to avoid enlarging the
HeapTupleHeader. But currently there is need for storing both a Cmin
and a Xmax :-(

Manfred: the code you read adds a single bit to the infomask, which is
there basically to be able to say whether the Cmin has been overwritten
with a Xmax. So the assumption was that when we see that this has
happened, the Cmin is no longer important (== every future command can
already see the tuple), and we have to check the Xmax in pg_clog to see
if the deleting subtransaction has aborted (in which case the tuple is
visible to us).

This idea was a result of a couple of hours of chat with Bruce where I
explained that I didn't want to enlarge the HeapTupleHeader, and he came
up with the bit idea.

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"En las profundidades de nuestro inconsciente hay una obsesiva necesidad
de un universo lógico y coherente. Pero el universo real se halla siempre
un paso más allá de la lógica" (Irulan)

#30Manfred Koizar
mkoi-pg@aon.at
In reply to: Alvaro Herrera (#29)
Re: SELECT * FROM <table> LIMIT 1; is really slow

On Fri, 28 May 2004 21:59:53 -0400, Alvaro Herrera
<alvherre@dcc.uchile.cl> wrote:

So the assumption was that when we see that this has
happened, the Cmin is no longer important (== every future command can
already see the tuple)

If it was that simple, we wouldn't have had to invent the CMIN_IS_CMAX
flag. This was discussed two years ago. Did you follow the link I
posted last week? "Every future command" is not enough. You have to
consider the current command and even commands started before this.

Servus
Manfred

#31Manfred Koizar
mkoi-pg@aon.at
In reply to: Tom Lane (#21)
Re: SELECT * FROM <table> LIMIT 1; is really slow

On Fri, 28 May 2004 14:47:01 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote:

If putting back xmax is the price we must pay for nested transactions,
then we *will* pay that price. Maybe not in this release, but it will
inevitably happen.

"we" = every Postgres user, even those that do not use subtransactions.

"price" = 2% to 5% performance loss for databases where the working set
is larger than main memory.

Don't bother hollering "veto" ;-)

Ok, I'll shut up till I have something concrete to support my opinion.

Servus
Manfred

#32Alvaro Herrera
alvherre@dcc.uchile.cl
In reply to: Manfred Koizar (#30)
Re: SELECT * FROM <table> LIMIT 1; is really slow

On Tue, Jun 01, 2004 at 02:37:37PM +0200, Manfred Koizar wrote:

On Fri, 28 May 2004 21:59:53 -0400, Alvaro Herrera
<alvherre@dcc.uchile.cl> wrote:

So the assumption was that when we see that this has
happened, the Cmin is no longer important (== every future command can
already see the tuple)

If it was that simple, we wouldn't have had to invent the CMIN_IS_CMAX
flag. This was discussed two years ago. Did you follow the link I
posted last week? "Every future command" is not enough. You have to
consider the current command and even commands started before this.

Yes, I did follow it (you mean XMAX_IS_XMIN, right? I suppose no tuple
can really have Cmin == Cmax). I'm not claiming I understood it fully
though. But as you see, since the assumption is not valid we have to
drop the idea and put back the Xmax as a field on its own on
HeapTupleHeader (which is what I had done before Bruce persuaded me not
to). I don't really like this idea but I don't see any other way out.

A couple of days ago I was going to propose putting Xmax as a separate
field only as needed, in a way similar to the way Oid is handled ---
thus we would enlarge the tuple if and only if the creating transaction
deletes it. This would be nice because one would expect that there are
not that many tuples created and deleted by the same transaction, so
we'd localize the inefficiency of storing both fields (Cmin and Xmax)
only on tuples that need it. While I was writing the proposal I
realised that it'd mean enlarging tuples that are already on disk, and
there's no way we can do that.

If you have other ideas I'm all ears. I'm the last one to want that
nested xacts make everything else work slower.

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"The eagle never lost so much time, as
when he submitted to learn of the crow." (William Blake)

#33Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Alvaro Herrera (#32)
Nested transactions and tuple header info

Alvaro Herrera wrote:

Yes, I did follow it (you mean XMAX_IS_XMIN, right? I suppose no tuple
can really have Cmin == Cmax). I'm not claiming I understood it fully
though. But as you see, since the assumption is not valid we have to
drop the idea and put back the Xmax as a field on its own on
HeapTupleHeader (which is what I had done before Bruce persuaded me not
to). I don't really like this idea but I don't see other way out.

A couple of days ago I was going to propose putting Xmax as a separate
field only as needed, in a way similar to the way Oid is handled ---
thus we would enlarge the tuple if and only if the creating transaction
deletes it. This would be nice because one would expect that there are
not that many tuples created and deleted by the same transaction, so
we'd localize the inefficiency of storing both fields (Cmin and Xmax)
only on tuples that need it. While I was writing the proposal I
realised that it'd mean enlarging tuples that are already on disk, and
there's no way we can do that.

I have read the archives and I think I understand the issue. Before
subtransactions, the only transaction that could see and hence delete a
tuple created by an open transaction was the transaction itself, and to
keep the cmin and cmax, we created a separate tuple bit which indicated
the xmin and xmax were the same.

With subtransactions, other xids (subtransaction ids) can see and delete
tuples created by earlier parts of the main transaction, and the tuple
bit cmin=cmax doesn't work.

So, we need a way to record the xmin and xmax while keeping cmin and
cmax in the tuple header. My idea is for subtransactions to create
additional xid's that represent the opposite of the commit state for
changing tuples created by earlier subtransactions.

BEGIN; xid=1
INSERT a;
BEGIN; xid=2
INSERT b;
DELETE a; xid=3
COMMIT;
COMMIT;

When "DELETE a" happens, we remove the xmin=1 from the tuple header and
replace it with xmin=3. xid=3 will be marked as committed if xid2
aborts, and will be marked as aborted if xid2 commits.

So, if xid2 aborts, the insert of xid1 should be honored, and xid3 is
marked as committed, and the opposite if xid2 commits.

We would have to use pg_subtrans so these phantom xids could point to
the base xid and a list would have to be maintained so higher-level
subtransactions aborting would trigger changes in these phantom xids,
that is, if xid1 aborts, xid2 should abort as well.

Anyway, this is more of a sketch of a possible way to do this without
extending the tuple header for all transactions.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#34Alvaro Herrera
alvherre@dcc.uchile.cl
In reply to: Bruce Momjian (#33)
Re: Nested transactions and tuple header info

On Tue, Jun 01, 2004 at 06:40:07PM -0400, Bruce Momjian wrote:

So, we need a way to record the xmin and xmax while keeping cmin and
cmax in the tuple header. My idea is for subtransactions to create
additional xid's that represent the opposite of the commit state for
changing tuples created by earlier subtransactions.

Hmm, interesting idea. What seems more interesting is that the change
seems to be confined to HeapTupleHeaderSetXmax. Every transaction and
subtransaction will need two Xids (I think we can even optimize it so
the "abort xid" is taken only as needed).

I don't see anything immediately that would invalidate this idea. I'll
marinate it while I write the trigger stuff, and wait for other
comments. If nothing bad arises I'll try an implementation and report
back.

Thanks,

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"Nunca confiar� en un traidor. Ni siquiera si el traidor lo he creado yo"
(Bar�n Vladimir Harkonnen)

#35Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#33)
Re: Nested transactions and tuple header info

Bruce Momjian <pgman@candle.pha.pa.us> writes:

BEGIN; xid=1
INSERT a;
BEGIN; xid=2
INSERT b;
DELETE a; xid=3
COMMIT;
COMMIT;

When "DELETE a" happens, we remove the xmin=1 from the tuple header and
replace it with xmin=3.

You can't change xmin --- this would break visibility tests. Consider
a cursor opened in the outer transaction after the INSERT a. It should
be able to see the "a" row (note that this depends on recognizing xid
equality and then comparing cid's within the outer transaction). If the
subtransaction mangles xmin then it is no longer possible to make this
test correctly.

This is exactly the same argument as not being able to overwrite cmin.

regards, tom lane

#36Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#35)
Re: Nested transactions and tuple header info

Tom Lane wrote:

Bruce Momjian <pgman@candle.pha.pa.us> writes:

BEGIN; xid=1
INSERT a;
BEGIN; xid=2
INSERT b;
DELETE a; xid=3
COMMIT;
COMMIT;

When "DELETE a" happens, we remove the xmin=1 from the tuple header and
replace it with xmin=3.

You can't change xmin --- this would break visibility tests. Consider
a cursor opened in the outer transaction after the INSERT a. It should
be able to see the "a" row (note that this depends on recognizing xid
equality and then comparing cid's within the outer transaction). If the
subtransaction mangles xmin then it is no longer possible to make this
test correctly.

This is exactly the same argument as not being able to overwrite cmin.

Basically the phantom xid's are a shorthand for saying the tuple was
created by xid1 and deleted by xid2, both part of the same main
transaction.

A cursor looking at the rows has to recognize the xid is a phantom (via
pg_subtrans) and look up the creation xid.

Also, we will need a phantom xid for every xid1/xid2 pair. You can't
just create one phantom xid per subtransaction because you must be able
to control independently commit/rollback rows based on the status of the
insert transaction.

In this case:

BEGIN;
BEGIN; xid=1
INSERT a;
BEGIN; xid=2
INSERT b;
BEGIN; xid=3
DELETE a; xid=4
DELETE b; xid=5
COMMIT;
COMMIT;
COMMIT;
COMMIT;

xid4 and xid5 have to be adjusted based on the status of xid1 and xid2.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#37Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#36)
Re: Nested transactions and tuple header info

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Tom Lane wrote:

You can't change xmin --- this would break visibility tests.

Basically the phantom xid's are a shorthand for saying the tuple was
created by xid1 and deleted by xid2, both part of the same main
transaction.

That would be fine if the shorthand were readable, but it's not.

A cursor looking at the rows has to recognize the xid is a phantom (via
pg_subtrans) and look up the creation xid.

And it will find that how? Imagine that the creating transaction is
itself a subtransaction, and the deleting one is a few nesting levels
further down. I don't see how the tuple is going to carry enough
information to let you determine what's what, if the deleting subxact
overwrites the creating one's XID.

regards, tom lane

#38Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#37)
Re: Nested transactions and tuple header info

Tom Lane wrote:

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Tom Lane wrote:

You can't change xmin --- this would break visibility tests.

Basically the phantom xid's are a shorthand for saying the tuple was
created by xid1 and deleted by xid2, both part of the same main
transaction.

That would be fine if the shorthand were readable, but it's not.

A cursor looking at the rows has to recognize the xid is a phantom (via
pg_subtrans) and look up the creation xid.

And it will find that how? Imagine that the creating transaction is
itself a subtransaction, and the deleting one is a few nesting levels
further down. I don't see how the tuple is going to carry enough
information to let you determine what's what, if the deleting subxact
overwrites the creating one's XID.

The backend who created _and_ expired the tuple has to do a lookup to
find the creation or expire xid.

We need two things. First, we need to control the visibility of the
tuple once the entire transaction is done. This does that.

Second, we need to be able to find the creation and expire xid, and that
information is only required to be visible by the main transaction and
its subtransactions. On commit we can adjust these xid status to show
the proper visibility.

How do they do the lookup? Well, one idea would be to just create a
local backend hash of these xids and their creation/expire xids.
Another idea is that pg_subtrans already points to a parent xid. We
could use the same method and point to both creation and expire xids.

Why does this not work?

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#39Alvaro Herrera
alvherre@dcc.uchile.cl
In reply to: Bruce Momjian (#36)
Re: Nested transactions and tuple header info

On Tue, Jun 01, 2004 at 11:17:40PM -0400, Bruce Momjian wrote:

Basically the phantom xid's are a shorthand for saying the tuple was
created by xid1 and deleted by xid2, both part of the same main
transaction.

Hmm... this would spread the ugliness elsewhere (hopefully only
HeapTupleHeaderGetXmin).

A cursor looking at the rows has to recognize the xid is a phantom (via
pg_subtrans) and look up the creation xid.

No need to look at pg_subtrans because we know all the Xids of our
transaction tree. I think this can be kept in local memory.

Also, we will need a phantom xid for every xid1/xid2 pair. You can't
just create one phantom xid per subtransaction because you must be able
to control independently commit/rollback rows based on the status of the
insert transaction.

Oh, sure. This could get huge pretty fast.

We still need to think on the effects this could have on crash recovery
though -- we'd have to write the phantom Xids to Xlog somehow
(indicating which ones are committed and which are aborted). And we
still don't know what effect it would have on CPU cost for every
visibility check.

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"Los dioses no protegen a los insensatos. �stos reciben protecci�n de
otros insensatos mejor dotados" (Luis Wu, Mundo Anillo)

#40Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Alvaro Herrera (#39)
Re: Nested transactions and tuple header info

Alvaro Herrera wrote:

Also, we will need a phantom xid for every xid1/xid2 pair. You can't
just create one phantom xid per subtransaction because you must be able
to control independently commit/rollback rows based on the status of the
insert transaction.

Oh, sure. This could get huge pretty fast.

We still need to think on the effects this could have on crash recovery
though -- we'd have to write the phantom Xids to Xlog somehow
(indicating which ones are committed and which are aborted). And we
still don't know what effect it would have on CPU cost for every
visibility check.

As I understand, this overhead would only be needed for subtransactions.
I also don't think there will be a lot of them because it is only for
creation/expire in the same main transaction, and it is only needed for
unique creation/expire combinations, which should be pretty small.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#41Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#36)
Re: Nested transactions and tuple header info

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Tom Lane wrote:

This is exactly the same argument as not being able to overwrite cmin.

Basically the phantom xid's are a shorthand for saying the tuple was
created by xid1 and deleted by xid2, both part of the same main
transaction.

A cursor looking at the rows has to recognize the xid is a phantom (via
pg_subtrans) and look up the creation xid.

You still don't see the point. Consider

BEGIN;
DECLARE CURSOR c1 FOR SELECT * FROM a ...;
INSERT INTO a VALUES(...); -- call this row x
DECLARE CURSOR c2 FOR SELECT * FROM a ...;
BEGIN;
DELETE FROM a WHERE ...; -- assume this deletes row x
ROLLBACK;
FETCH FROM c1; -- must NOT see row x
FETCH FROM c2; -- must see row x

AFAICS your proposal does not support this. The two cursors' snapshots
will differ only in the recorded current-cid for the outer transaction.
If the subtrans has overwritten xmin/cmin, there is no way to make that
decision correctly.

regards, tom lane

#42Alvaro Herrera
alvherre@dcc.uchile.cl
In reply to: Tom Lane (#41)
Re: Nested transactions and tuple header info

On Wed, Jun 02, 2004 at 09:52:28AM -0400, Tom Lane wrote:

BEGIN;
DECLARE CURSOR c1 FOR SELECT * FROM a ...;
INSERT INTO a VALUES(...); -- call this row x
DECLARE CURSOR c2 FOR SELECT * FROM a ...;
BEGIN;
DELETE FROM a WHERE ...; -- assume this deletes row x
ROLLBACK;
FETCH FROM c1; -- must NOT see row x
FETCH FROM c2; -- must see row x

AFAICS your proposal does not support this. The two cursors' snapshots
will differ only in the recorded current-cid for the outer transaction.
If the subtrans has overwritten xmin/cmin, there is no way to make that
decision correctly.

Why would it overwrite cmin? Only a new xmin is needed (and cmax and
xmax, but the cursors don't care about those)

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"La fuerza no est� en los medios f�sicos
sino que reside en una voluntad indomable" (Gandhi)

#43Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#42)
Re: Nested transactions and tuple header info

Alvaro Herrera <alvherre@dcc.uchile.cl> writes:

On Wed, Jun 02, 2004 at 09:52:28AM -0400, Tom Lane wrote:

AFAICS your proposal does not support this. The two cursors' snapshots
will differ only in the recorded current-cid for the outer transaction.
If the subtrans has overwritten xmin/cmin, there is no way to make that
decision correctly.

Why would it overwrite cmin? Only a new xmin is needed (and cmax and
xmax, but the cursors don't care about those)

If you overwrite xmin and not cmin, you've created a nonsensical
situation. How do you distinguish this tuple from tuples created by the
subxact itself? More generally, cmin is only meaningful in combination
with a particular transaction ID; you can't just arbitrarily replace
xmin without changing cmin.

I've been trying to think of ways to solve these problems by having a
main xact and all its subxacts share a common CID sequence (ie, a
subxact would have its own xid but would not start CID over at one).
If you assume that, then Bruce's idea may indeed work, since you would
never replace xmin in a way that would shift the interpretation of cmin
into a different CID sequence. But I suspect there is a simpler way to
solve it given that constraint.

regards, tom lane

#44Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#41)
Re: Nested transactions and tuple header info

Tom Lane wrote:

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Tom Lane wrote:

This is exactly the same argument as not being able to overwrite cmin.

Basically the phantom xid's are a shorthand for saying the tuple was
created by xid1 and deleted by xid2, both part of the same main
transaction.

A cursor looking at the rows has to recognize the xid is a phantom (via
pg_subtrans) and look up the creation xid.

You still don't see the point. Consider

BEGIN;
DECLARE CURSOR c1 FOR SELECT * FROM a ...;
INSERT INTO a VALUES(...); -- call this row x
DECLARE CURSOR c2 FOR SELECT * FROM a ...;
BEGIN;
DELETE FROM a WHERE ...; -- assume this deletes row x
ROLLBACK;
FETCH FROM c1; -- must NOT see row x
FETCH FROM c2; -- must see row x

AFAICS your proposal does not support this. The two cursors' snapshots
will differ only in the recorded current-cid for the outer transaction.
If the subtrans has overwritten xmin/cmin, there is no way to make that
decision correctly.

I do not overwrite cmin or cmax. If xid=1 creates a row:

xmin=1
cmin=1

and xid=2 goes to expire it, we get:

xmin=3 (phantom for xmin=1, xmax=2)
cmin=1
cmax=1

and we set a phantom bit on the tuple. When we see it later and need to
know the xmin or xmax, we look it up in local memory. (Maybe we don't
even need a hash, just a List because I can't imagine more than a few of
these phantoms being used.) Other backends see the tuple's xmin as
"in progress", so they don't need to look any further.

Then, on commit, we decide if a tuple created by xid=1 and expired by
xid=2 should appear created or not, and mark the phantom commit status
accordingly.
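
For concreteness, a standalone sketch of the backend-local bookkeeping this
implies (a plain array stands in for the List; all names here are invented
for the example):

#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* One entry per phantom xid handed out by this backend: the phantom stands
 * for "created by xmin, deleted by xmax, both inside the same main
 * transaction". */
typedef struct PhantomEntry
{
    TransactionId phantom;  /* xid actually stored in the tuple header */
    TransactionId xmin;     /* real creating (sub)transaction */
    TransactionId xmax;     /* real deleting (sub)transaction */
} PhantomEntry;

/* Few phantoms are expected, so a linear scan is fine.  Returns NULL if
 * the xid is not one of our phantoms. */
static const PhantomEntry *
LookupPhantom(const PhantomEntry *entries, int n, TransactionId xid)
{
    for (int i = 0; i < n; i++)
    {
        if (entries[i].phantom == xid)
            return &entries[i];
    }
    return NULL;
}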

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#45Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#43)
Re: Nested transactions and tuple header info

Tom Lane wrote:

Alvaro Herrera <alvherre@dcc.uchile.cl> writes:

On Wed, Jun 02, 2004 at 09:52:28AM -0400, Tom Lane wrote:

AFAICS your proposal does not support this. The two cursors' snapshots
will differ only in the recorded current-cid for the outer transaction.
If the subtrans has overwritten xmin/cmin, there is no way to make that
decision correctly.

Why would it overwrite cmin? Only a new xmin is needed (and cmax and
xmax, but the cursors don't care about those)

If you overwrite xmin and not cmin, you've created a nonsensical
situation. How do you distinguish this tuple from tuples created by the
subxact itself? More generally, cmin is only meaningful in combination
with a particular transaction ID; you can't just arbitrarily replace
xmin without changing cmin.

I've been trying to think of ways to solve these problems by having a
main xact and all its subxacts share a common CID sequence (ie, a
subxact would have its own xid but would not start CID over at one).
If you assume that, then Bruce's idea may indeed work, since you would
never replace xmin in a way that would shift the interpretation of cmin
into a different CID sequence. But I suspect there is a simpler way to
solve it given that constraint.

I thought about using a global command counter. The problem there is
that there is no way to control the visibility of tuples by other
transactions on commit except going back and fixing up tuples, which is
unacceptable.

By creating phantoms, we can decide if a specific xmin/xmax pair should
appear as committed and set it accordingly on commit.
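
Put as code, the commit-time rule described earlier in the thread amounts
to the following (a minimal sketch, not backend code):

#include <stdbool.h>

/* For a phantom that stands for "row created by subxact A, deleted by
 * subxact B" within one main transaction: the insert should be visible to
 * others only if the delete was rolled back.  (If A itself aborted, the
 * row should not appear at all.) */
static bool
PhantomAppearsCommitted(bool creator_committed, bool deleter_committed)
{
    return creator_committed && !deleter_committed;
}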

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#46Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#45)
Re: Nested transactions and tuple header info

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Tom Lane wrote:

I've been trying to think of ways to solve these problems by having a
main xact and all its subxacts share a common CID sequence (ie, a
subxact would have its own xid but would not start CID over at one).
If you assume that, then Bruce's idea may indeed work, since you would
never replace xmin in a way that would shift the interpretation of cmin
into a different CID sequence. But I suspect there is a simpler way to
solve it given that constraint.

I thought about using a global command counter. The problem there is
that there is no way to control the visibility of tuples by other
transactions on commit except going back and fixing up tuples, which is
unacceptable.

No, I said own xid --- so the "phantom xid" part is still there. But
your idea definitely does *not* work unless you use a single CID
sequence for the whole main xact; and I'm still wondering if there's
not a simpler implementation possible given that assumption.

regards, tom lane

#47Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#46)
Re: Nested transactions and tuple header info

Tom Lane wrote:

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Tom Lane wrote:

I've been trying to think of ways to solve these problems by having a
main xact and all its subxacts share a common CID sequence (ie, a
subxact would have its own xid but would not start CID over at one).
If you assume that, then Bruce's idea may indeed work, since you would
never replace xmin in a way that would shift the interpretation of cmin
into a different CID sequence. But I suspect there is a simpler way to
solve it given that constraint.

I thought about using a global command counter. The problem there is
that there is no way to control the visibility of tuples by other
transactions on commit except going back and fixing up tuples, which is
unacceptable.

No, I said own xid --- so the "phantom xid" part is still there. But
your idea definitely does *not* work unless you use a single CID
sequence for the whole main xact; and I'm still wondering if there's
not a simpler implementation possible given that assumption.

I don't understand why a single counter is needed for phantom xids. We
keep the cmin/cmax on the tuple already, and our own backend can look up
the xmin/xmax that goes with the phantom.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#48Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#47)
Re: Nested transactions and tuple header info

Bruce Momjian <pgman@candle.pha.pa.us> writes:

I don't understand why a single counter is needed for phantom xids. We
keep the cmin/cmax on the tuple already, and our own backend can look up
the xmin/xmax that goes with the phantom.

Oh, so you're thinking of an internal table that provides a mapping back
to the replaced xmin? Ugh. Perhaps it could be made to work, but it's
a lot of mechanism, and it will slow down visibility checks (since
AFAICS you'd have to check every subxid against the phantoms table).
If we go with a global CID counter then we don't have to add that step.

A global CID counter would also simplify other visibility tests. Alvaro
hasn't said anything about how he's doing visibility checks across
different subxacts of the same main xact, but without global CID there
would need to be some pretty ugly checks to determine whether a subxact
happened before or after the CID cutoff your outer xact is interested
in.
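
As a rough illustration of the simplification: with one CID sequence, the
"did our own transaction tree insert this before my snapshot?" test stays a
single comparison, whichever subxact did the insert (simplified sketch, not
the actual tqual logic):

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t CommandId;

/* With a global CID, the same cutoff applies no matter which
 * subtransaction of the current main transaction did the insert. */
static bool
InsertedBeforeSnapshot(CommandId tuple_cmin, CommandId snapshot_curcid)
{
    return tuple_cmin < snapshot_curcid;
}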

regards, tom lane

#49Alvaro Herrera
alvherre@dcc.uchile.cl
In reply to: Tom Lane (#48)
Re: Nested transactions and tuple header info

On Wed, Jun 02, 2004 at 11:12:31AM -0400, Tom Lane wrote:

A global CID counter would also simplify other visibility tests. Alvaro
hasn't said anything about how he's doing visibility checks across
different subxacts of the same main xact, but without global CID there
would need to be some pretty ugly checks to determine whether a subxact
happened before or after the CID cutoff your outer xact is interested
in.

Yes, I'm using a global CID counter.

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"La vida es para el que se aventura"

#50Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#48)
Re: Nested transactions and tuple header info

Tom Lane wrote:

Bruce Momjian <pgman@candle.pha.pa.us> writes:

I don't understand why a single counter is needed for phantom xids. We
keep the cmin/cmax on the tuple already, and our own backend can look up
the xmin/xmax that goes with the phantom.

Oh, so you're thinking of an internal table that provides a mapping back
to the replaced xmin? Ugh. Perhaps it could be made to work, but it's
a lot of mechanism, and it will slow down visibility checks (since
AFAICS you'd have to check every subxid against the phantoms table).

My idea was to have a tuple bit indicating the xid is a phantom.

If we go with a global CID counter then we don't have to add that step.

Seems Alvaro is already using a global counter.

A global CID counter would also simplify other visibility tests. Alvaro
hasn't said anything about how he's doing visibility checks across
different subxacts of the same main xact, but without global CID there
would need to be some pretty ugly checks to determine whether a subxact
happened before or after the CID cutoff your outer xact is interested
in.

If a global counter will reduce the number of phantom checks, then good.
However, I assume Alvaro has to access the creation/expire xid to
determine if the subtransaction committed, so I didn't think a global
counter would help reduce the number of lookups.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#51Alvaro Herrera
alvherre@dcc.uchile.cl
In reply to: Bruce Momjian (#50)
Re: Nested transactions and tuple header info

On Wed, Jun 02, 2004 at 12:23:37PM -0400, Bruce Momjian wrote:

Tom Lane wrote:

If we go with a global CID counter then we don't have to add that step.

Seems Alvaro is already using a global counter.

I think I stated already that I'm in fact using it. Not sure why it
didn't show up in the list.

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"�Que diferencia tiene para los muertos, los hu�rfanos, y aquellos que han
perdido su hogar, si la loca destrucci�n ha sido realizada bajo el nombre
del totalitarismo o del santo nombre de la libertad y la democracia?" (Gandhi)

#52Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Alvaro Herrera (#51)
Re: Nested transactions and tuple header info

Alvaro Herrera wrote:

On Wed, Jun 02, 2004 at 12:23:37PM -0400, Bruce Momjian wrote:

Tom Lane wrote:

If we go with a global CID counter then we don't have to add that step.

Seems Alvaro is already using a global counter.

I think I stated already that I'm in fact using it. Not sure why it
didn't show up in the list.

Yes, I saw it. I wrote "seems" when I should have stated it as a fact.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#53Alvaro Herrera
alvherre@dcc.uchile.cl
In reply to: Bruce Momjian (#47)
Re: Nested transactions and tuple header info

On Wed, Jun 02, 2004 at 10:57:05AM -0400, Bruce Momjian wrote:

Tom Lane wrote:

No, I said own xid --- so the "phantom xid" part is still there. But
your idea definitely does *not* work unless you use a single CID
sequence for the whole main xact; and I'm still wondering if there's
not a simpler implementation possible given that assumption.

I don't understand why a single counter is needed for phantom xids. We
keep the cmin/cmax on the tuple already, and our own backend can look up
the xmin/xmax that goes with the phantom.

Not sure either way (maybe you are right), but I use the global counter
anyway because some tests would become a very ugly mess if I didn't. I
think the phantom idea is also simpler with the global counter.

And I see no reason to use a local counter. We certainly are not
hitting the limit with the global counter, as Tom pointed out recently
in a thread about the aborted CID bitmaps.

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"Crear es tan dif�cil como ser libre" (Elsa Triolet)

#54Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Alvaro Herrera (#53)
Re: Nested transactions and tuple header info

Alvaro Herrera wrote:

On Wed, Jun 02, 2004 at 10:57:05AM -0400, Bruce Momjian wrote:

Tom Lane wrote:

No, I said own xid --- so the "phantom xid" part is still there. But
your idea definitely does *not* work unless you use a single CID
sequence for the whole main xact; and I'm still wondering if there's
not a simpler implementation possible given that assumption.

I don't understand why a single counter is needed for phantom xids. We
keep the cmin/cmax on the tuple already, and our own backend can look up
the xmin/xmax that goes with the phantom.

Not sure either way (maybe you are right), but I use the global counter
anyway because some tests would become a very ugly mess if I didn't. I
think the phantom idea is also simpler with the global counter.

And I see no reason to use a local counter. We certainly are not
hitting the limit with the global counter, as Tom pointed out recently
in a thread about the aborted CID bitmaps.

Sure, use the global counter if it helps. My only point was that I
didn't see how a global counter could reduce the number of times we have
to look up the phantom xid to find the xmin/xmax.

However, if you have a map showing which command counter goes with which
subtransaction, then you _certainly_ could reduce the number of phantom
lookups. (The command counter cmin/cmax tells you the xmin/xmax). In
fact, in that case, the phantom xids are only used on commit so you can
make certain tuples visible/invisible to other transactions.
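
One shape such a map could take, assuming the global CID counter (again,
purely illustrative names): record the CID range each subtransaction used,
and a tuple's cmin then identifies the creating subtransaction without
touching the phantom list.

#include <stdint.h>

typedef uint32_t TransactionId;
typedef uint32_t CommandId;

/* CID range used by one finished subtransaction.  Entries are appended in
 * start order, so with nesting a backward scan finds the innermost
 * subtransaction whose range contains a given cmin. */
typedef struct CidRange
{
    CommandId     firstCid;
    CommandId     lastCid;
    TransactionId xid;
} CidRange;

static TransactionId
SubxactForCid(const CidRange *ranges, int n, CommandId cmin)
{
    for (int i = n - 1; i >= 0; i--)
    {
        if (ranges[i].firstCid <= cmin && cmin <= ranges[i].lastCid)
            return ranges[i].xid;
    }
    return 0;   /* InvalidTransactionId: command ran in the main xact */
}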

To come up with the phantom xid idea, I had to wander around the house for a few
hours thinking of what was in that tuple header and how this could be
accomplished. I am sorry I led Alvaro initially into thinking he didn't
need to keep the cmin.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#55Alvaro Herrera
alvherre@dcc.uchile.cl
In reply to: Bruce Momjian (#33)
Re: Nested transactions and tuple header info

On Tue, Jun 01, 2004 at 06:40:07PM -0400, Bruce Momjian wrote:

When "DELETE a" happens, we remove the xmin=1 from the tuple header and
replace it with xmin=3. xid=3 will be marked as committed if xid2
aborts, and will be marked as aborted if xid2 commits.

So, if xid2 aborts, the insert of xid1 should be honored, and xid3 is
marked as committed, and the opposite if xid2 commits.

Ok, I've been looking at implementing this. However, it just occurred to
me that a transaction, different from the one modifying the tuple, could
try to see its xmax.

Since the xmin signals the tuple as being updated concurrently by
another transaction (it's in progress), this can only happen if the
other transaction tries to read it using SnapshotDirty.

One such possible caller is EvalPlanQual. It could go to sleep using
XactLockTableWait() on the SnapshotDirty's xmax. But the tuple has
something strange in its xmax -- it's the tuple's cmin actually.
Leaving this as it is would probably be a bug.

However, if the tuple is new, then EvalPlanQual won't even try to see
it. Or maybe it will. And then, maybe there are other callers (I can
only see _bt_check_unique).

Do I have to worry about this? Maybe (probably) it's not a problem, but
I want to be sure.

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"El sentido de las cosas no viene de las cosas, sino de
las inteligencias que las aplican a sus problemas diarios
en busca del progreso." (Ernesto Hern�ndez-Novich)

#56Alvaro Herrera
alvherre@dcc.uchile.cl
In reply to: Alvaro Herrera (#55)
Re: Nested transactions and tuple header info

I said

One such possible caller is EvalPlanQual. It could go to sleep using
XactLockTableWait() on the SnapshotDirty's xmax. But the tuple has
something strange in its xmax -- it's the tuple's cmin actually.
Leaving this as it is would probably be a bug.

However, if the tuple is new, then EvalPlanQual won't even try to see
it. Or maybe it will. And then, maybe there are other callers (I can
only see _bt_check_unique).

Right, this is a worse problem than I thought :-( I can't get past
initdb because _bt_check_unique calls XactLockTableWait() with its own
Xid, which we refuse. Actually not its own Xid, but the phantom Xid
that was just assigned to some tuple, in the same transaction.

Not sure how to fix this -- I'm not even sure what the exact problem is,
because it's trying to insert the oid of a toast table in
pg_class_oid_index during the first ALTER TABLE ... CREATE TOAST TABLE.
Why it would see an old tuple with the same value, I don't know.

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"Always assume the user will do much worse than the stupidest thing
you can imagine." (Julien PUYDT)

#57Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Alvaro Herrera (#55)
Re: Nested transactions and tuple header info

Alvaro Herrera wrote:

On Tue, Jun 01, 2004 at 06:40:07PM -0400, Bruce Momjian wrote:

When "DELETE a" happens, we remove the xmin=1 from the tuple header and
replace it with xmin=3. xid=3 will be marked as committed if xid2
aborts, and will be marked as aborted if xid2 commits.

So, if xid2 aborts, the insert of xid1 should be honored, and xid3 is
marked as committed, and the opposite if xid2 commits.

Ok, I've been looking at implementing this. However, it just occurred to
me that a transaction, different from the one modifying the tuple, could
try to see its xmax.

First, I assume it isn't a problem to change the cmin because we have a
lock on the tuple while we are modifying it.

Since the xmin signals the tuple as being updated concurrently by
another transaction (it's in progress), this can only happen if the
other transaction tries to read it using SnapshotDirty.

As for someone trying to read the xmax when it isn't there, I assumed
we had a tuple bit set indicating we have xmin/cmin/cmax, and that the
outside transaction doesn't need to look up the xmax.

However, for ourselves, we need to look that phantom xid up in our local
memory and find the xmin/xmax for ourselves, right?

One such possible caller is EvalPlanQual. It could go to sleep using
XactLockTableWait() on the SnapshotDirty's xmax. But the tuple has
something strange in its xmax -- it's the tuple's cmin actually.
Leaving this as it is would probably be a bug.

Again, the bit is set: everyone has to look up the phantom xid in their
own phantom xid list, or look in pg_subtrans to find out whether they own
that xid, and if so, look up the xmin/xmax in their local memory.

As for SnapshotDirty(), can you explain why it is used and whether a
tuple being created in an open transaction is subject to it? I assume
it would be if we were updating a tuple and need to sleep on it. Can't
we use pg_subtrans to find the main transaction's xid and sleep on that?
Once the main transaction is done, we then need to re-check the phantom
xid to see whether it was marked committed or aborted, because it might
not match the main transaction's status.
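
A sketch of that waiting scheme; the helpers are hypothetical stand-ins
for the ideas in this thread (the backend does have XactLockTableWait(),
but the other names here are made up):

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Hypothetical stand-ins for the facilities discussed above. */
extern TransactionId TopmostXidOf(TransactionId xid);      /* via pg_subtrans */
extern void          WaitForXactLock(TransactionId xid);   /* like XactLockTableWait() */
extern bool          PhantomMarkedCommitted(TransactionId phantom);

/* Instead of sleeping on the phantom xid found in the tuple, sleep on the
 * main transaction's xid; once it finishes, re-check the phantom's own
 * commit status, which was set at main-transaction commit. */
static bool
WaitAndRecheckPhantom(TransactionId phantom)
{
    WaitForXactLock(TopmostXidOf(phantom));
    return PhantomMarkedCommitted(phantom);
}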

Sorry I didn't reply earlier. I had to think on this one.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#58Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#56)
Re: Nested transactions and tuple header info

Alvaro Herrera <alvherre@dcc.uchile.cl> writes:

Not sure how to fix this -- I'm not even sure what the exact problem is,
because it's trying to insert the oid of a toast table in
pg_class_oid_index during the first ALTER TABLE ... CREATE TOAST TABLE.
Why it would see an old tuple with the same value, I don't know.

This is probably an UPDATE operation, not an INSERT. There are quite a
few paths through CREATE TABLE that involve repeated updates of the
new pg_class row.

I would think however that these changes would occur as successive
commands of a single transaction, not as subtransactions, unless you've
done something odd to CommandCounterIncrement.

regards, tom lane

#59Alvaro Herrera
alvherre@dcc.uchile.cl
In reply to: Tom Lane (#58)
Re: Nested transactions and tuple header info

On Sun, Jun 13, 2004 at 01:22:05PM -0400, Tom Lane wrote:

Alvaro Herrera <alvherre@dcc.uchile.cl> writes:

Not sure how to fix this -- I'm not even sure what the exact problem is,
because it's trying to insert the oid of a toast table in
pg_class_oid_index during the first ALTER TABLE ... CREATE TOAST TABLE.
Why it would see an old tuple with the same value, I don't know.

This is probably an UPDATE operation, not an INSERT. There are quite a
few paths through CREATE TABLE that involve repeated updates of the
new pg_class row.

Huh, right.

I would think however that these changes would occur as successive
commands of a single transaction, not as subtransactions, unless you've
done something odd to CommandCounterIncrement.

Right, but I've taken the XMAX_IS_XMIN bit and replaced it with the
phantom Ids idea. Probably this problem wouldn't have shown up if I
hadn't done that, but we need to cope anyway.

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"La principal caracter�stica humana es la tonter�a"
(Augusto Monterroso)