CLUSTER and MVCC

Started by Heikki Linnakangas · about 19 years ago · 36 messages · pgsql-hackers
#1Heikki Linnakangas
heikki.linnakangas@enterprisedb.com

Is there a particular reason why CLUSTER isn't MVCC-safe? It seems to me
that it would be trivial to fix, by using SnapshotAny instead of
SnapshotNow, and not overwriting the xmin/xmax with the xid of the
cluster command.

I feel that I must be missing something, or someone would've already fixed
it a long time ago...

Csaba, you mentioned recently
(http://archives.postgresql.org/pgsql-hackers/2007-03/msg00027.php) that
you're actually using the MVCC-violation to clean up tables during a
backup. Can you tell us a bit more about that? Would you be upset if we
shut that backdoor?

In any case, the MVCC-violation needs to be documented. I'll send a doc
patch to pgsql-patches shortly.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#2Csaba Nagy
nagy@ecircle-ag.com
In reply to: Heikki Linnakangas (#1)
Re: CLUSTER and MVCC

On Fri, 2007-03-09 at 12:29, Heikki Linnakangas wrote:

Csaba, you mentioned recently
(http://archives.postgresql.org/pgsql-hackers/2007-03/msg00027.php) that
you're actually using the MVCC-violation to clean up tables during a
backup. Can you tell us a bit more about that? Would you be upset if we
shut that backdoor?

My use case: a queue-like table (in fact a 'task' table) which is very
frequently inserted/updated/deleted. This table tends to be bloated in
the presence of any long running transaction... the only transactional
behavior we need from this table is to make sure that when we insert
something in this table in a transaction (possibly together with other
actions) and then commit/rollback, it commits/rolls back the insert.
CLUSTER's violation of MVCC does not affect this, as CLUSTER will not be
able to lock the table if another transaction inserted something in it
(the inserting transaction will have a lock on the table). Selections on
this table are not critical for us, it just doesn't matter which job
processor is getting which task and in what order... (actually it does
matter, but CLUSTER won't affect that either).

So what I do is execute CLUSTER once every 5 minutes on this table. This
works just fine, and keeps the table size small even if I have long
running transactions in progress. The DB backup is one of such
unavoidable long running transactions, and I use the table exclusion
switch to exclude this task table from the backup so it won't get locked
by it and let CLUSTER still do its job (I had a rudimentary patch to do
this even before the feature was introduced to pg_dump). The table can
be dumped separately, which is a brief operation, but I would have to
clear it anyway on a crash...

Now I could try and disable the CLUSTER cron job and see if I get
problems; the last time it was disabled was with postgres 7.4, maybe
something changed in between... but I can tell for sure that last time I
enabled it, it really fixed our load on the DB server...

Wouldn't it be possible to do it like Simon (IIRC) suggested, and add a
parameter to enable/disable the current behavior, and use the MVCC
behavior as the default?

Cheers,
Csaba.

#3Hannu Krosing
hannu@tm.ee
In reply to: Heikki Linnakangas (#1)
Re: CLUSTER and MVCC

Ühel kenal päeval, R, 2007-03-09 kell 11:29, kirjutas Heikki
Linnakangas:

Is there a particular reason why CLUSTER isn't MVCC-safe? It seems to me
that it would be trivial to fix, by using SnapshotAny instead of
SnapshotNow, and not overwriting the xmin/xmax with the xid of the
cluster command.

I feel that I must be missing something, or someone would've already fixed
it a long time ago...

Probably it is not MVCC-safe because the relation is swapped out from
under pg_class.

That is, it is possible that older and newer transactions read
different data files, and so simple MVCC does not work.

Csaba, you mentioned recently
(http://archives.postgresql.org/pgsql-hackers/2007-03/msg00027.php) that
you're actually using the MVCC-violation to clean up tables during a
backup. Can you tell us a bit more about that? Would you be upset if we
shut that backdoor?

In any case, the MVCC-violation needs to be documented. I'll send a doc
patch to pgsql-patches shortly.

--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com

#4Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#1)
Re: CLUSTER and MVCC

On Fri, 2007-03-09 at 11:29 +0000, Heikki Linnakangas wrote:

Is there a particular reason why CLUSTER isn't MVCC-safe? It seems to me
that it would be trivial to fix, by using SnapshotAny instead of
SnapshotNow, and not overwriting the xmin/xmax with the xid of the
cluster command.

It's trivial to fix now in this way, but it would break HOT, since an
indexscan only returns one row per index entry.

I feel that I must be missing something, or someone would've already fixed
it a long time ago...

Csaba, you mentioned recently
(http://archives.postgresql.org/pgsql-hackers/2007-03/msg00027.php) that
you're actually using the MVCC-violation to clean up tables during a
backup. Can you tell us a bit more about that? Would you be upset if we
shut that backdoor?

On that thread I suggested we add NOWAIT syntax to allow the existing
behaviour to continue, as Csaba requested. The default should be to wait
for other transactions to complete, like CREATE INDEX CONCURRENTLY, when
the command is run outside of a transaction block.

If you do this any other way, you'll need to fix it for HOT.

In any case, the MVCC-violation needs to be documented. I'll send a doc
patch to pgsql-patches shortly.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

#5Bruce Momjian
bruce@momjian.us
In reply to: Csaba Nagy (#2)
Re: CLUSTER and MVCC

"Csaba Nagy" <nagy@ecircle-ag.com> writes:

Wouldn't it be possible to do it like Simon (IIRC) suggested, and add a
parameter to enable/disable the current behavior, and use the MVCC
behavior as the default?

Doing it in CLUSTER would be weird. However perhaps it would be useful to have
some sort of stand-alone tool that just bumped all the xmin/xmax's. It would
have to be super-user-only and carry big warning labels saying it breaks MVCC.

But it would be useful any time you want to exempt a particular table
from serializable snapshots. Basically a per-table way to
force a read-committed snapshot on. Though, actually it's not quite a
read-committed snapshot is it? Anyone using an old serializable snapshot will
see what, no tuples at all?

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

#6Csaba Nagy
nagy@ecircle-ag.com
In reply to: Bruce Momjian (#5)
Re: CLUSTER and MVCC

On Fri, 2007-03-09 at 13:42, Gregory Stark wrote:

"Csaba Nagy" <nagy@ecircle-ag.com> writes:

Wouldn't it be possible to do it like Simon (IIRC) suggested, and add a
parameter to enable/disable the current behavior, and use the MVCC
behavior as the default?

Doing it in CLUSTER would be weird. However perhaps it would be useful to have
some sort of stand-alone tool that just bumped all the xmin/xmax's. It would
have to be super-user-only and carry big warning labels saying it breaks MVCC.

Well, the current behavior of CLUSTER is just perfect for what I'm using
it. If anything else would do the job, I would be happy to use it
instead...

But it would be useful any time you want to exempt a particular table
from serializable snapshots. Basically a per-table way to
force a read-committed snapshot on. Though, actually it's not quite a
read-committed snapshot is it? Anyone using an old serializable snapshot will
see what, no tuples at all?

I'm afraid what I need has nothing to do with serializable snapshots...
I still want the table to be completely transactional, except that if
somebody can get an exclusive lock on it, it can be compacted regardless
of other running transactions. I'm not sure how to express this in another way...
it means something like: no transaction cares about the content of the
table until it gets some kind of lock on it. In other words the table's
state is not connected with the state of other tables until I actually
do something on it...

Cheers,
Csaba.

#7Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Csaba Nagy (#2)
Re: CLUSTER and MVCC

Csaba Nagy wrote:

On Fri, 2007-03-09 at 12:29, Heikki Linnakangas wrote:

Csaba, you mentioned recently
(http://archives.postgresql.org/pgsql-hackers/2007-03/msg00027.php) that
you're actually using the MVCC-violation to clean up tables during a
backup. Can you tell us a bit more about that? Would you be upset if we
shut that backdoor?

My use case: a queue-like table (in fact a 'task' table) which is very
frequently inserted/updated/deleted. This table tends to be bloated in
the presence of any long running transaction... the only transactional
behavior we need from this table is to make sure that when we insert
something in this table in a transaction (possibly together with other
actions) and then commit/rollback, it commits/rolls back the insert.
CLUSTER's violation of MVCC does not affect this, as CLUSTER will not be
able to lock the table if another transaction inserted something in it
(the inserting transaction will have a lock on the table). Selections on
this table are not critical for us, it just doesn't matter which job
processor is getting which task and in what order... (actually it does
matter, but CLUSTER won't affect that either).

Hmm. You could use something along these lines instead:

0. LOCK TABLE queue_table
1. SELECT * INTO queue_table_new FROM queue_table
2. DROP TABLE queue_table
3. ALTER TABLE queue_table_new RENAME TO queue_table

After all, it's not that you care about the clustering of the table, you
just want to remove old tuples.
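As an aside for readers, the swap can be tried end to end in miniature; the following is a minimal sketch using an in-memory SQLite database as a stand-in for PostgreSQL (the table, its columns, and the `done = 0` liveness test are all invented for illustration):

```python
# Copy-and-swap cleanup: keep only the live rows, drop the bloated table,
# and rename the compacted copy into place. SQLite stands in for Postgres.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE queue_table (id INTEGER, task TEXT, done INTEGER)")
conn.executemany("INSERT INTO queue_table VALUES (?, ?, ?)",
                 [(1, "a", 1), (2, "b", 0), (3, "c", 1)])

# Step 1: copy only the rows worth keeping into a new table.
conn.execute("CREATE TABLE queue_table_new AS "
             "SELECT * FROM queue_table WHERE done = 0")
# Step 2: drop the bloated original.
conn.execute("DROP TABLE queue_table")
# Step 3: rename the compacted copy into place.
conn.execute("ALTER TABLE queue_table_new RENAME TO queue_table")

print(conn.execute("SELECT id, task FROM queue_table ORDER BY id").fetchall())
# [(2, 'b')]
```

As the follow-up replies point out, the DROP/RENAME destroys the original table's identity, which is what invalidates cached plans and breaks concurrent sessions referencing the table.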

As a long term solution, it would be nice if we had more fine-grained
bookkeeping of snapshots that are in use in the system. In your case,
there's a lot of tuples that are not visible to pg_dump because xmin is
too new, and also not visible to any other transaction because xmax is
too old. If we had a way to recognize situations like that, and vacuum
those tuples, much of the problem with long-running transactions would
go away.

Wouldn't it be possible to do it like Simon (IIRC) suggested, and add a
parameter to enable/disable the current behavior, and use the MVCC
behavior as the default?

I guess we could, but I don't see why we should encourage using CLUSTER for
that. A more aggressive, MVCC-breaking version of VACUUM would make more
sense to me, but I don't like the idea of adding "break-MVCC" flags to
any commands.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#8Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Bruce Momjian (#5)
Re: CLUSTER and MVCC

Gregory Stark wrote:

"Csaba Nagy" <nagy@ecircle-ag.com> writes:

Wouldn't it be possible to do it like Simon (IIRC) suggested, and add a
parameter to enable/disable the current behavior, and use the MVCC
behavior as the default?

Doing it in CLUSTER would be weird. However perhaps it would be useful to have
some sort of stand-alone tool that just bumped all the xmin/xmax's. It would
have to be super-user-only and carry big warning labels saying it breaks MVCC.

But it would be useful any time you want to exempt a particular table
from serializable snapshots. Basically a per-table way to
force a read-committed snapshot on. Though, actually it's not quite a
read-committed snapshot is it? Anyone using an old serializable snapshot will
see what, no tuples at all?

Unless you used FrozenTransactionId ...

But I'm not really seeing the problem here. Why isn't Csaba's problem
fixed by the fact that HOT reduces the number of dead tuples in the
first place? If it does, then he no longer needs the CLUSTER
workaround, or at least, he needs it to a much lesser extent.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#9Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Simon Riggs (#4)
Re: CLUSTER and MVCC

Is there a particular reason why CLUSTER isn't MVCC-safe? It seems to
me that it would be trivial to fix, by using SnapshotAny instead of
SnapshotNow, and not overwriting the xmin/xmax with the xid of the
cluster command.

It's trivial to fix now in this way, but it would break HOT,
since an indexscan only returns one row per index entry.

Well, with SnapshotAny HOT should probably return all possibly visible
tuples with an indexscan. (Btw, does CLUSTER really do an index scan?
Seems for reading a whole table a seq scan and sort is usually cheaper,
at least when the clustering is so bad that a CLUSTER is needed.)

Andreas

#10Csaba Nagy
nagy@ecircle-ag.com
In reply to: Heikki Linnakangas (#7)
Re: CLUSTER and MVCC

Hmm. You could use something along these lines instead:

0. LOCK TABLE queue_table
1. SELECT * INTO queue_table_new FROM queue_table
2. DROP TABLE queue_table
3. ALTER TABLE queue_table_new RENAME TO queue_table

After all, it's not that you care about the clustering of the table, you
just want to remove old tuples.

... and then restart the app so all my pooled connections drop their
cached plans ;-)

Seriously, that won't work. If a session tries to insert a new row after
I lock the table to clean it up, I still want it to be able to insert
after the cleanup is finished... if I drop the table it tries to insert
to, it will fail.

As a long term solution, it would be nice if we had more fine-grained
bookkeeping of snapshots that are in use in the system. In your case,
there's a lot of tuples that are not visible to pg_dump because xmin is
too new, and also not visible to any other transaction because xmax is
too old. If we had a way to recognize situations like that, and vacuum
those tuples, much of the problem with long-running transactions would
go away.

In the general case that won't work either in a strict MVCC sense... if
you have an old transaction, you should never clean up a dead tuple
which could still be visible to it.

Wouldn't it be possible to do it like Simon (IIRC) suggested, and add a
parameter to enable/disable the current behavior, and use the MVCC
behavior as the default?

I guess we could, but I don't see why we should encourage using CLUSTER for
that. A more aggressive, MVCC-breaking version of VACUUM would make more
sense to me, but I don't like the idea of adding "break-MVCC" flags to
any commands.

Well, if there would be any other way to avoid the table bloat I would
agree.

Cheers,
Csaba.

#11Csaba Nagy
nagy@ecircle-ag.com
In reply to: Alvaro Herrera (#8)
Re: CLUSTER and MVCC

On Fri, 2007-03-09 at 14:00, Alvaro Herrera wrote:

But I'm not really seeing the problem here. Why isn't Csaba's problem
fixed by the fact that HOT reduces the number of dead tuples in the
first place? If it does, then he no longer needs the CLUSTER
workaround, or at least, he needs it to a much lesser extent.

Is this actually true in the case of HOT + long running transactions ? I
was supposing HOT has the same problems in the presence of long running
transactions...

Cheers,
Csaba.

#12Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Csaba Nagy (#11)
Re: CLUSTER and MVCC

Csaba Nagy wrote:

On Fri, 2007-03-09 at 14:00, Alvaro Herrera wrote:

But I'm not really seeing the problem here. Why isn't Csaba's problem
fixed by the fact that HOT reduces the number of dead tuples in the
first place? If it does, then he no longer needs the CLUSTER
workaround, or at least, he needs it to a much lesser extent.

Is this actually true in the case of HOT + long running transactions ? I
was supposing HOT has the same problems in the presence of long running
transactions...

It does, HOT won't help you here. A long-running transaction is just as
much of a problem with HOT as without. Besides, I don't recall that
you're doing updates in the first place.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#13Florian Pflug
fgp@phlo.org
In reply to: Heikki Linnakangas (#12)
Re: CLUSTER and MVCC

Heikki Linnakangas wrote:

Csaba Nagy wrote:

On Fri, 2007-03-09 at 14:00, Alvaro Herrera wrote:

But I'm not really seeing the problem here. Why isn't Csaba's problem
fixed by the fact that HOT reduces the number of dead tuples in the
first place? If it does, then he no longer needs the CLUSTER
workaround, or at least, he needs it to a much lesser extent.

Is this actually true in the case of HOT + long running transactions ? I
was supposing HOT has the same problems in the presence of long running
transactions...

It does, HOT won't help you here. A long-running transaction is just as
much of a problem with HOT as without. Besides, I don't recall that
you're doing updates in the first place.

Couldn't HOT in principle deal with this? Let's say you have two long-running
transactions, which see row versions A and D. While those transactions
are running, the row is constantly updated, leading to row versions B, C (before
the second long-running transaction started), D, E, F, ... Z.
Now, the versions B,C,E,F,...Z could be removed by HOT or vacuum, because they
are not currently visible, nor will they ever become visible because they are
already deleted.
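Florian's scenario can be modelled with a toy visibility check (plain Python, not PostgreSQL code; the numeric "times" standing in for transaction ids are an invention of the sketch): a deleted row version is reclaimable exactly when no active snapshot falls inside its lifetime.

```python
# Toy model: each version lives from `created` until `deleted` (None = live),
# a snapshot is a point in time, and a dead version may be reclaimed only if
# no snapshot can see it.
def removable(versions, snapshots):
    """versions: dict name -> (created, deleted); returns reclaimable names."""
    dead = []
    for name, (created, deleted) in versions.items():
        if deleted is None:
            continue  # still the live version, never removable
        # visible to snapshot s iff created <= s < deleted
        if not any(created <= s < deleted for s in snapshots):
            dead.append(name)
    return sorted(dead)

# Row updated repeatedly: A -> B -> C -> D -> E -> F; version i lives [i, i+1).
versions = {chr(ord("A") + i): (i, i + 1) for i in range(5)}
versions["F"] = (5, None)
# Two long-running snapshots, taken while A and D were the current version.
print(removable(versions, snapshots=[0, 3]))  # ['B', 'C', 'E']
```

The intermediate versions B, C and E fall between the two snapshots and are never visible again, which is exactly the set Florian says HOT or vacuum could in principle reclaim.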

greetings, Florian Pflug

#14Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Csaba Nagy (#10)
Re: CLUSTER and MVCC

Csaba Nagy wrote:

Hmm. You could use something along these lines instead:

0. LOCK TABLE queue_table
1. SELECT * INTO queue_table_new FROM queue_table
2. DROP TABLE queue_table
3. ALTER TABLE queue_table_new RENAME TO queue_table

After all, it's not that you care about the clustering of the table, you
just want to remove old tuples.

... and then restart the app so all my pooled connections drop their
cached plans ;-)

Yeah, though Tom's working on plan invalidation for 8.3, so that
wouldn't be an issue.

Seriously, that won't work. If a session tries to insert a new row after
I lock the table to clean it up, I still want it to be able to insert
after the cleanup is finished... if I drop the table it tries to insert
to, it will fail.

Hmm. How about:

1. LOCK TABLE queue_table
2. SELECT * INTO temp_table FROM queue_table
3. TRUNCATE queue_table
4. INSERT INTO queue_table SELECT * FROM temp_table

That way you're copying the rows twice, but if there aren't many live
tuples it shouldn't matter too much.
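For illustration, the same four steps in runnable form, again with in-memory SQLite standing in for PostgreSQL (SQLite has no TRUNCATE, so a bare DELETE is used as a stand-in, and the table layout is invented):

```python
# Copy-truncate-reinsert: the table itself survives, so anything referring
# to it (cached plans, concurrent sessions) stays valid.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE queue_table (id INTEGER, task TEXT, done INTEGER)")
conn.executemany("INSERT INTO queue_table VALUES (?, ?, ?)",
                 [(1, "a", 1), (2, "b", 0), (3, "c", 1)])

with conn:  # one transaction: copy live rows out, empty the table, copy back
    conn.execute("CREATE TEMP TABLE scratch AS "
                 "SELECT * FROM queue_table WHERE done = 0")
    conn.execute("DELETE FROM queue_table")  # stand-in for TRUNCATE
    conn.execute("INSERT INTO queue_table SELECT * FROM scratch")
    conn.execute("DROP TABLE scratch")

print(conn.execute("SELECT id, task FROM queue_table").fetchall())
# [(2, 'b')]
```

Unlike the DROP/RENAME variant, the queue table keeps its identity here, which is what addresses Csaba's objection about sessions inserting into a dropped table.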

As a long term solution, it would be nice if we had more fine-grained
bookkeeping of snapshots that are in use in the system. In your case,
there's a lot of tuples that are not visible to pg_dump because xmin is
too new, and also not visible to any other transaction because xmax is
too old. If we had a way to recognize situations like that, and vacuum
those tuples, much of the problem with long-running transactions would
go away.

In the general case that won't work either in a strict MVCC sense... if
you have an old transaction, you should never clean up a dead tuple
which could still be visible to it.

We wouldn't clean up tuples that are visible to a transaction, but if
you have one long-running transaction like pg_dump in a database with
otherwise short transaction, you'll have a lot of tuples that are not
vacuumable because of the long-running process, but are not in fact
visible to any transaction. That's tuples that were inserted too late
to be seen by the old transaction, and deleted too long ago to be seen
by any other transaction. Let me illustrate this with a timeline:

   xmin1    xmax1
     |        |
-----+--X-X+X-+ooooooooooooooXoooooXoXoXXo+------>now
           |                              |
         xmin2                          xmax2

xmin1 and xmax1 are the xmin and xmax of an old, long-running
serializable transaction, like pg_dump. The Xs between them are xids of
transactions that the old transaction sees as in-progress, IOW the
SnapshotData.xip-array.

xmin2 and xmax2 are the xmin and xmax of a newer transaction. Because of
the old-running transaction, xmin2 is far behind xmax2, but there's a
wide gap between that and the next transaction that the newer
transaction sees as in-progress.

The current rule to determine if a tuple is dead or not is to check that
tuple's xmax < oldestxmin. Oldestxmin is in this case xmin1. But in
addition to that, any tuple with an xmin > xmax1 and xmax that's not in
the xip-array of any snapshot in use (marked with o above), isn't
visible to any current or future transaction and can therefore be safely
vacuumed.

The implementation problem is that we don't have a global view of all
snapshots in the system. If we solve that, we can be more aggressive
with vacuuming in presence of long-running transactions. It's not an
easy problem, we don't want to add a lot of accounting overhead, but
maybe we could have some kind of an approximation of the global state
with little overhead, that would give most of the benefit.
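The rule above can be sketched as a toy visibility model (plain Python, not the real visibility-check code; the xid values and snapshot contents are invented, and a committed xmax is assumed, since tuples with an uncommitted xmax would never be removed):

```python
# Toy sketch of the extended vacuum rule: a deleted tuple is reclaimable not
# only when xmax < oldest xmin, but whenever no snapshot in the system sees
# it -- its xmin is "in the future" for old snapshots and its xmax is already
# visible to every newer one.
def xid_visible(xid, snap):
    """snap = (xmin, xmax, xip): a committed xid is visible unless it was
    still in progress (in xip) or started at/after the snapshot's xmax."""
    xmin, xmax, xip = snap
    return xid < xmax and xid not in xip

def tuple_vacuumable(t_xmin, t_xmax, snapshots):
    """Deleted tuple (committed t_xmax assumed) is reclaimable iff no snapshot
    sees it: for each snapshot, its insert isn't visible or its delete is."""
    return all(not xid_visible(t_xmin, s) or xid_visible(t_xmax, s)
               for s in snapshots)

old = (10, 20, {12, 15})  # long-running pg_dump-style snapshot
new = (10, 100, {95})     # recent snapshot; its xmin is held back by `old`

# Inserted at 30 (after old's xmax), deleted at 40 (visible to new):
# invisible to everyone, hence reclaimable despite the old transaction.
print(tuple_vacuumable(30, 40, [old, new]))  # True
# Deleted by xid 95, which is still in progress for `new`: must be kept.
print(tuple_vacuumable(30, 95, [old, new]))  # False
```

The hard part, as noted above, is that the real system has no cheap global view of all the `xip` arrays in use, only the single oldest-xmin horizon.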

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#15Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Zeugswetter Andreas SB SD (#9)
Re: CLUSTER and MVCC

Zeugswetter Andreas ADI SD wrote:

Is there a particular reason why CLUSTER isn't MVCC-safe? It seems to
me that it would be trivial to fix, by using SnapshotAny instead of
SnapshotNow, and not overwriting the xmin/xmax with the xid of the
cluster command.

It's trivial to fix now in this way, but it would break HOT,
since an indexscan only returns one row per index entry.

Well, with SnapshotAny HOT should probably return all possibly visible
tuples with an indexscan. (Btw, does CLUSTER really do an index scan?
Seems for reading a whole table a seq scan and sort is usually cheaper,
at least when the clustering is so bad that a CLUSTER is needed.)

Yes, it does an indexscan (last time I checked, at least). I think if a
performance improvement is demonstrated, we would accept a patch ...

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#16Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Florian Pflug (#13)
Re: CLUSTER and MVCC

Florian G. Pflug wrote:

Couldn't HOT in principle deal with this? Let's say you have two
long-running transactions, which see row versions A and D. While those
transactions are running, the row is constantly updated, leading to row
versions B, C (before the second long-running transaction started), D,
E, F, ... Z. Now, the versions B,C,E,F,...Z could be removed by HOT or
vacuum, because they are not currently visible, nor will they ever
become visible because they are already deleted.

Yes, you could detect that but you'd need a global view of all snapshots
in the system. I just posted a reply in this thread with more details..

It's not just with HOT, it's the way we determine that a tuple is
vacuumable in general.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#17Mike Rylander
mrylander@gmail.com
In reply to: Florian Pflug (#13)
Re: CLUSTER and MVCC

On 3/9/07, Florian G. Pflug <fgp@phlo.org> wrote:

Heikki Linnakangas wrote:

Csaba Nagy wrote:

On Fri, 2007-03-09 at 14:00, Alvaro Herrera wrote:

But I'm not really seeing the problem here. Why isn't Csaba's problem
fixed by the fact that HOT reduces the number of dead tuples in the
first place? If it does, then he no longer needs the CLUSTER
workaround, or at least, he needs it to a much lesser extent.

Is this actually true in the case of HOT + long running transactions ? I
was supposing HOT has the same problems in the presence of long running
transactions...

It does, HOT won't help you here. A long-running transaction is just as
much of a problem with HOT as without. Besides, I don't recall that
you're doing updates in the first place.

Couldn't HOT in principle deal with this? Let's say you have two long-running
transactions, which see row versions A and D. While those transactions
are running, the row is constantly updated, leading to row versions B, C (before
the second long-running transaction started), D, E, F, ... Z.
Now, the versions B,C,E,F,...Z could be removed by HOT or vacuum, because they
are not currently visible, nor will they ever become visible because they are
already deleted.

Couldn't they (or at least one of them) become visible due to
SAVEPOINT rollback?

greetings, Florian Pflug


--
Mike Rylander
mrylander@gmail.com
GPLS -- PINES Development
Database Developer
http://open-ils.org

#18Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Mike Rylander (#17)
Re: CLUSTER and MVCC

Mike Rylander wrote:

On 3/9/07, Florian G. Pflug <fgp@phlo.org> wrote:

Couldn't HOT in principle deal with this? Let's say you have two
long-running transactions, which see row versions A and D. While those
transactions are running, the row is constantly updated, leading to row
versions B, C (before the second long-running transaction started), D,
E, F, ... Z. Now, the versions B,C,E,F,...Z could be removed by HOT or
vacuum, because they are not currently visible, nor will they ever
become visible because they are already deleted.

Couldn't they (or at least one of them) become visible due to
SAVEPOINT rollback?

You wouldn't remove tuples with an uncommitted xmax, of course.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#19Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#1)
Re: CLUSTER and MVCC

Heikki Linnakangas <heikki@enterprisedb.com> writes:

Is there a particular reason why CLUSTER isn't MVCC-safe? It seems to me
that it would be trivial to fix, by using SnapshotAny instead of
SnapshotNow, and not overwriting the xmin/xmax with the xid of the
cluster command.

The reason it's not trivial is that you also have to preserve the t_ctid
links of update chains. If you look into VACUUM FULL, a very large part
of its complexity is that it moves update chains as a unit to make that
possible. (BTW, I believe the problem Pavan Deolasee reported yesterday
is a bug somewhere in there --- it looks to me like sometimes the same
update chain is getting copied multiple times.)
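To see why preserving chains is tricky, here is a toy model (plain Python, not the VACUUM FULL code; representing ctids as (page, offset) tuples and new positions as plain list indices is a simplification of the sketch): because each t_ctid names its successor by position, a chain has to be relocated as a unit so that every link can be remapped to its successor's new position.

```python
# Copying tuples one by one would leave t_ctid links pointing at the old
# locations; moving the whole chain at once lets us remap every link.
def move_chain(heap, head, new_heap):
    """heap: dict ctid -> (payload, next_ctid). Copy the update chain
    starting at `head` into new_heap, remapping the next-pointers."""
    # First pass: walk the chain and assign each old ctid its new position.
    order, ctid = [], head
    while ctid is not None:
        order.append(ctid)
        ctid = heap[ctid][1]
    new_pos = {old: len(new_heap) + i for i, old in enumerate(order)}
    # Second pass: append the tuples with remapped t_ctid links.
    for old in order:
        payload, nxt = heap[old]
        new_heap.append((payload, new_pos[nxt] if nxt is not None else None))
    return new_heap

# A three-version update chain scattered across two pages.
heap = {(0, 1): ("v1", (0, 7)), (0, 7): ("v2", (1, 2)), (1, 2): ("v3", None)}
print(move_chain(heap, (0, 1), []))
# [('v1', 1), ('v2', 2), ('v3', None)]
```

Copying the same chain twice, as in the bug Tom mentions, would correspond here to appending a second remapped copy of `order` into `new_heap`.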

regards, tom lane

#20Simon Riggs
simon@2ndQuadrant.com
In reply to: Alvaro Herrera (#15)
Re: CLUSTER and MVCC

On Fri, 2007-03-09 at 12:48 -0300, Alvaro Herrera wrote:

Zeugswetter Andreas ADI SD wrote:

Is there a particular reason why CLUSTER isn't MVCC-safe? It seems to
me that it would be trivial to fix, by using SnapshotAny instead of
SnapshotNow, and not overwriting the xmin/xmax with the xid of the
cluster command.

It's trivial to fix now in this way, but it would break HOT,
since an indexscan only returns one row per index entry.

Well, with SnapshotAny HOT should probably return all possibly visible
tuples with an indexscan. (Btw, does CLUSTER really do an index scan?
Seems for reading a whole table a seq scan and sort is usually cheaper,
at least when the clustering is so bad that a CLUSTER is needed.)

Yes, it does an indexscan (last time I checked, at least). I think if a
performance improvement is demonstrated, we would accept a patch ...

Again, right now, most things people do here will break HOT. At this
late stage before freeze, please everybody be careful to look and plan
for patch conflicts. (That doesn't mean stay away, just be careful.) Thanks.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

#21Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#19)
#22Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#14)
#23Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#22)
#24Florian Pflug
fgp@phlo.org
In reply to: Heikki Linnakangas (#14)
#25Florian Pflug
fgp@phlo.org
In reply to: Tom Lane (#22)
#26Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#21)
#27Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#26)
#28Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#27)
#29Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#26)
#30Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#29)
#31Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#30)
#32Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#22)
#33Bruce Momjian
bruce@momjian.us
In reply to: Bruce Momjian (#32)
#34Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#33)
#35Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#34)
#36Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#35)