Re: TimeOf(Subselects|Joins)FromLargeTables?
-----Original Message-----
From: Hegedus, Tamas . [mailto:Hegedus.Tamas@mayo.edu]
Sent: Friday, June 04, 2004 12:37 PM
To: Dann Corbit
Subject: RE: [GENERAL] TimeOf(Subselects|Joins)FromLargeTables?

Dear Dann,
Although the pgsql developers do not recommend using hash
indexes, the speed doubled for me when I used hash indexes.
("Note: Testing has shown PostgreSQL's hash indexes to
perform no better than B-tree indexes, and the index size and
build time for hash indexes is much worse. For these reasons,
hash index use is presently discouraged.")

Question:
When I created hash indexes, the query planner did not use
them; it used the b-tree indexes. I had to drop the b-tree
indexes to get it to use the hash indexes. Why? Can I have two types of
indexes on the same column and force the planner to use the one I want?
Hash indexes and btree indexes work very differently. Normally, hash
indexes are best for equality comparisons and btree indexes are best for
any predicate that is not based on exact matching. So, for instance:

WHERE colname = some_constant
or
WHERE colname IN (constant1, constant2, constant3, ..., constantk)

would be better handled by a hash index, and

WHERE colname <= some_constant
or
WHERE colname LIKE 'const%'

would be better handled by a btree.
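As a concrete sketch of the two index types side by side (the table and column
names below are made up for illustration, not taken from this thread):

-- Sketch only: "events" and "colname" are hypothetical names.
CREATE TABLE events (colname integer, payload text);
CREATE INDEX events_colname_btree ON events USING btree (colname);
CREATE INDEX events_colname_hash  ON events USING hash  (colname);

-- Equality predicates are the ones a hash index can serve:
EXPLAIN SELECT * FROM events WHERE colname = 42;

-- Range (and prefix) predicates can only use the btree:
EXPLAIN SELECT * FROM events WHERE colname <= 42;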
In most database systems I have used, you could make any sorts of
indexes on any sorts of columns in any sorts of combinations up to some
limits (e.g. 16 columns max).
The planner should be smart enough to pick the best choice. I think that
the planner is simply making a mistake here. Perhaps the PostgreSQL
gurus can suggest a way to force the plan.
There seems to be something seriously defective with hash indexes in old
versions of PostgreSQL. I thought that it had been repaired in recent
versions (7.4 and above), but maybe it is not completely fixed yet.
Thanks,
Tamas

===================================================================
explain ANALYZE select p.name, p.seq from prots p, kwx k
where p.fid=k.fid AND k.kw_acc=812;
                                    QUERY PLAN
------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..662253.63 rows=69180 width=344) (actual time=43.337..36076.045 rows=78050 loops=1)
   ->  Index Scan using ix_kwx_acc on kwx k  (cost=0.00..245382.38 rows=69179 width=4) (actual time=0.109..422.159 rows=78050 loops=1)
         Index Cond: (kw_acc = 812)
   ->  Index Scan using prt_fid_ix on prots p  (cost=0.00..6.01 rows=1 width=348) (actual time=0.414..0.450 rows=1 loops=78050)
         Index Cond: (p.fid = "outer".fid)
 Total runtime: 36134.105 ms
===================================================================

-----Original Message-----
From: Dann Corbit [mailto:DCorbit@connx.com]
Sent: Thursday, June 03, 2004 7:36 PM
To: Hegedus, Tamas .
Subject: RE: [GENERAL] TimeOf(Subselects|Joins)FromLargeTables?

With a subselect, it limits the optimizer's choice of which table to process
first. It might be worth trying a hash index on fid for both tables, since
you are doing an equality join.
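A minimal sketch of that suggestion, using the table and column names from the
original post (the index names here are made up):

CREATE INDEX ix_prots_fid_hash ON prots USING hash (fid);
CREATE INDEX ix_kwx_fid_hash   ON kwx   USING hash (fid);

-- then re-check the plan for the equality join:
EXPLAIN ANALYZE
SELECT p.name, p.seq
FROM prots p, kwx k
WHERE p.fid = k.fid AND k.kw_acc = 812;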
-----Original Message-----
From: Hegedus, Tamas . [mailto:Hegedus.Tamas@mayo.edu]
Sent: Thursday, June 03, 2004 7:33 PM
To: Dann Corbit
Subject: RE: [GENERAL] TimeOf(Subselects|Joins)FromLargeTables?

A little bit better. But I do not understand why?
EXPLAIN ANALYZE select p.name, p.seq from prots p, kwx k
where p.fid=k.fid AND k.kw_acc=812;
                                    QUERY PLAN
------------------------------------------------------------------------------
 Merge Join  (cost=0.00..160429.66 rows=84473 width=349) (actual time=0.263..69192.828 rows=78050 loops=1)
   Merge Cond: ("outer".fid = "inner".fid)
   ->  Index Scan using ix_kwx_fid on kwx k  (cost=0.00..44987.55 rows=84473 width=4) (actual time=0.137..5675.701 rows=78050 loops=1)
         Filter: (kw_acc = 812)
   ->  Index Scan using prots_pkey on prots p  (cost=0.00..112005.24 rows=981127 width=353) (actual time=0.059..61816.725 rows=1210377 loops=1)
 Total runtime: 69251.488 ms

-----Original Message-----
From: Dann Corbit [mailto:DCorbit@connx.com]
Sent: Thursday, June 03, 2004 6:59 PM
To: Hegedus, Tamas .; pgsql-general@postgresql.org
Subject: RE: [GENERAL] TimeOf(Subselects|Joins)FromLargeTables?

How does this query perform:
SELECT p.name, p.seq
FROM prots p, kwx k
WHERE
p.fid=k.fid
AND
k.kw_acc=812
;

-----Original Message-----
From: Hegedus, Tamas . [mailto:Hegedus.Tamas@mayo.edu]
Sent: Thursday, June 03, 2004 6:48 PM
To: 'pgsql-general@postgresql.org'
Subject: [GENERAL] TimeOf(Subselects|Joins)FromLargeTables?

Dear All,
I am a biologist and I do not know what to expect from an RDB
(PgSQL). I have large tables: 1215607 rows in prots, 2184596 rows in
kwx (see table details below). I would like to do something like
this:

SELECT name, seq FROM prots WHERE fid IN (SELECT fid FROM kwx WHERE kw_acc=812);

After executing this (either as a subquery or as a join), the
best/fastest result I got (with SET enable_seqscan=off) was 83643.482
ms (see EXPLAIN ANALYZE below).

The two (similar) parts of this query execute much faster:
SELECT fid FROM kwx WHERE kw_acc=812          -- takes 302 ms, n(rows)=78050
SELECT name, seq FROM prots WHERE fid < 80000 -- takes 1969.231 ms

Is this realistic? OK?
If not: how can I increase the speed by fine-tuning the RDB
(indexes, run-time parameters) or my SQL query? (It just came to my
mind: if I decrease the number of columns in the prots table, to have
only 3 fields (fid, name, seq) instead of 20 columns, then the prots
table will have a smaller file size on disk, so it may need fewer
disk page fetches and queries may be faster. Is this true?)
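(A quick way to test that narrower-table idea is sketched below; prots_slim is
a made-up table name, while the column and table names otherwise come from the
schema shown further down.)

-- Copy only the three needed columns into a narrow table and compare
-- the same query against it. prots_slim is hypothetical.
CREATE TABLE prots_slim AS SELECT fid, name, seq FROM prots;
CREATE INDEX ix_prots_slim_fid ON prots_slim (fid);
ANALYZE prots_slim;

EXPLAIN ANALYZE
SELECT p.name, p.seq
FROM prots_slim p, kwx k
WHERE p.fid = k.fid AND k.kw_acc = 812;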
Thanks for your help!
Tamas

===============================================
Table "public.prots"
Column | Type | Modifiers
-----------+----------------------+----------
fid | integer | not null
name | character varying(10) | not null
[...other 17 columns...]
seq | text |
Indexes:
"prots_pkey" primary key, btree (fid)
"ix_prots_acc" unique, btree (acc)
"ix_prots_name" unique, btree (name)
"ix_prots_class" btree ("class")
===============================================
Table "public.kwx"
Column | Type | Modifiers
--------+--------+----------
fid | integer |
kw_acc | integer |
Indexes:
"ix_kwx_acc" btree (kw_acc)
"ix_kwx_fid" btree (fid)
Foreign-key constraints:
"fk_kws_acc" FOREIGN KEY (kw_acc) REFERENCES kw_ref(kw_acc)
"fk_kws_fid" FOREIGN KEY (fid) REFERENCES prots(fid)
===============================================

EXPLAIN ANALYZE SELECT name, seq from prots inner join kwx on
(prots.fid=kwx.fid) where kwx.kw_acc = 812;
                                    QUERY PLAN
------------------------------------------------------------------------------
 Merge Join  (cost=0.00..160429.66 rows=84473 width=349) (actual time=29.039..83505.629 rows=78050 loops=1)
   Merge Cond: ("outer".fid = "inner".fid)
   ->  Index Scan using ix_kwx_fid on kwx  (cost=0.00..44987.55 rows=84473 width=4) (actual time=18.893..5730.468 rows=78050 loops=1)
         Filter: (kw_acc = 812)
   ->  Index Scan using prots_pkey on prots  (cost=0.00..112005.24 rows=981127 width=353) (actual time=0.083..76059.235 rows=1210377 loops=1)
 Total runtime: 83643.482 ms
(6 rows)
"Dann Corbit" <DCorbit@connx.com> writes:
There seems to be something seriously defective with hash indexes in old
versions of PostgreSQL.
They still suck; I'm not aware of any situation where I'd recommend hash
over btree indexes in Postgres. I think we have fixed the hash indexes'
deadlock problems as of 7.4, but there's still no real performance
advantage.
I just had an epiphany as to the probable reason why the performance sucks.
It's this: the hash bucket size is the same as the page size (ie, 8K).
This means that if you have only one or a few items per bucket, the
information density is awful, and you lose big on I/O requirements
compared to a btree index. On the other hand, if you have enough
items per bucket to make the storage density competitive, you will
be doing linear searches through dozens if not hundreds of items that
are all in the same bucket, and you lose on CPU time (compared to btree
which can do binary search to find an item within a page).
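To put rough numbers on that density argument (both figures below are assumed
for illustration only, not taken from the hash AM's actual on-disk layout):

-- Back-of-the-envelope density check; 8192 bytes/page and ~20 bytes/entry
-- are assumptions, not measured values.
SELECT 8192 / 20                        AS entries_an_8k_bucket_could_hold,
       round(100.0 * 3 * 20 / 8192, 1) AS pct_of_page_used_by_3_entries;
-- roughly 409 entries would fit, while 3 entries use under 1% of the page;
-- conversely, a nearly full bucket means a linear scan of hundreds of items.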
It would probably be interesting to look into making the hash bucket
size be just a fraction of a page, with the intent of having no more
than a couple dozen items per bucket. I'm not sure what the
implications would be for intra-page storage management or index locking
conventions, but offhand it seems like there wouldn't be any
insurmountable problems.
I'm not planning on doing this myself, just throwing it out as a
possible TODO item for anyone who's convinced that hash indexes ought
to work better than they do.
regards, tom lane
"Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
Tom> This means that if you have only one or a few items per
Tom> bucket, the information density is awful, and you lose big on
Tom> I/O requirements compared to a btree index. On the other
Tom> hand, if you have enough items per bucket to make the storage
Tom> density competitive, you will be doing linear searches
Tom> through dozens if not hundreds of items that are all in the
Tom> same bucket, and you lose on CPU time (compared to btree
Tom> which can do binary search to find an item within a page).
This is probably a crazy idea, but is it possible to organize the data
in a page of a hash bucket as a binary tree ? Then you wouldn't lose
wrt CPU time at least.
--
Pip-pip
Sailesh
http://www.cs.berkeley.edu/~sailesh
Sailesh Krishnamurthy <sailesh@cs.berkeley.edu> writes:
This is probably a crazy idea, but is it possible to organize the data
in a page of a hash bucket as a binary tree ?
Only if you want to require a hash opclass to supply ordering operators,
which sort of defeats the purpose I think. Hash is only supposed to
need equality not ordering.
regards, tom lane
On Sat, 2004-06-05 at 13:31, Tom Lane wrote:
Only if you want to require a hash opclass to supply ordering operators,
which sort of defeats the purpose I think. Hash is only supposed to
need equality not ordering.
Is it possible to assume some kind of ordering (i.e. strcmp() the binary
data of the type) as long as it's consistent?
Regards,
Jeff
Jeff Davis <jdavis-pgsql@empires.org> writes:
On Sat, 2004-06-05 at 13:31, Tom Lane wrote:
Only if you want to require a hash opclass to supply ordering operators,
which sort of defeats the purpose I think. Hash is only supposed to
need equality not ordering.
Is it possible to assume some kind of ordering (i.e. strcmp() the binary
data of the type) as long as it's consistent?
Not really; that would assume that equality of the datatype is the same
as bitwise equality, which is not the case in general (consider -0
versus +0 in IEEE floats, or any datatype with pad bits in the struct).
Some time ago we got rid of the assumption that hash should hash on all
the bits without any type-specific intervention, and I don't want to
reintroduce those bugs.
We could safely sort on the hash value, but I'm not sure how effective
that would be, considering that we're talking about values that already
hashed into the same bucket --- there's likely not to be very many
distinct hash values there.
regards, tom lane
Sailesh Krishnamurthy <sailesh@cs.berkeley.edu> writes:
This is probably a crazy idea, but is it possible to organize the data
in a page of a hash bucket as a binary tree ?

Only if you want to require a hash opclass to supply ordering operators,
which sort of defeats the purpose I think. Hash is only supposed to
need equality not ordering.
A btree is frequently used within the buckets of a hash table, especially
if you expect to have a large number of items in each bucket.
If PostgreSQL could create a hash index that is a single top-level
hash table with each hash bucket being a btree, you could eliminate a
number of btree searches by hashing, and then fall into btree performance
after the first hash lookup. The administrator should be able to gather
statistics about the population of the hash buckets and rehash if
performance begins to behave like a btree or the data is not distributed
evenly. Given proper selection of the initial number of buckets, a hash
table could blow btree out of the water. Given a poor selection of the
number of buckets, i.e. 1, a hash will behave no worse than a btree.
Also, it would be helpful to be able to specify a hash function during the
create or rehash: for a specific class of data, extraction of the
distinguishing elements can be more efficient and/or effective given
knowledge of the data. Think of something like bar codes: there are portions
of the data which are usually the same and portions of the data which are
usually different. Focusing on the portions of the data which tend to be
different will generally provide a more evenly distributed hash.
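Short of writing a custom operator class, the "hash only the portion that
differs" idea can be approximated today with an expression index. A sketch
with made-up table, column, and offset (queries must use the same expression
for the index to be considered):

CREATE TABLE barcodes (code text NOT NULL);

-- Hash only the variable suffix; the fixed prefix (assumed 8 chars here)
-- contributes nothing to distribution, so leave it out of the hashed value.
CREATE INDEX ix_barcodes_suffix ON barcodes USING hash ((substring(code FROM 9)));

EXPLAIN SELECT * FROM barcodes WHERE substring(code FROM 9) = '000421337';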
We could safely sort on the hash value, but I'm not sure how effective
that would be, considering that we're talking about values that already
hashed into the same bucket --- there's likely not to be very many
distinct hash values there.
I think we can safely put that on the todo list.
The existing hash algorithm is very good, so I would on the
contrary believe that only a few keys share a hash value per page-sized bucket.
For the equal-keys case it does not matter, since we want all of the rows anyway.
For the equal-hash-value case it would probably be best to sort by ctid.

TODO?: order heap pointers inside hash index pages by hash value and ctid
Andreas
Added to TODO:
* Order heap pointers on hash index pages by hash value and ctid
---------------------------------------------------------------------------
Zeugswetter Andreas SB SD wrote:
We could safely sort on the hash value, but I'm not sure how effective
that would be, considering that we're talking about values that already
hashed into the same bucket --- there's likely not to be very many
distinct hash values there.

I think we can safely put that on the todo list.
The existing hash algorithm is very good, so I would on the
contrary believe that only a few keys share a hash value per page-sized bucket.
For the equal-keys case it does not matter, since we want all of the rows anyway.
For the equal-hash-value case it would probably be best to sort by ctid.

TODO?: order heap pointers inside hash index pages by hash value and ctid
Andreas
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Added to TODO:
* Order heap pointers on hash index pages by hash value and ctid
[blink] This seems to miss out on the actual point of the thread (hash
bucket size shouldn't be a disk page) in favor of an entirely
unsupported sub-suggestion.
regards, tom lane
Tom Lane wrote:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Added to TODO:
* Order heap pointers on hash index pages by hash value and ctid

[blink] This seems to miss out on the actual point of the thread (hash
bucket size shouldn't be a disk page) in favor of an entirely
unsupported sub-suggestion.
Yes, I was unsure of the text myself. I have changed it to:
* Allow hash buckets to fill disk pages, rather than being
sparse
If we sorted the keys, how do we insert new entries efficiently?
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Tom Lane wrote:
[blink] This seems to miss out on the actual point of the thread (hash
bucket size shouldn't be a disk page) in favor of an entirely
unsupported sub-suggestion.
Yes, I was unsure of the text myself. I have changed it to:
* Allow hash buckets to fill disk pages, rather than being
sparse
OK, though maybe "pack hash index buckets onto disk pages more
efficiently" would be clearer.
If we sorted the keys, how do we insert new entries efficiently?
That is why I called it "unsupported". I'm not clear what would happen
in buckets that overflow onto multiple pages --- do we try to maintain
ordering across all the pages, or just within a page, or what? How much
does this buy us versus what it costs to maintain? Maybe there's a win
there but I think it's pretty speculative ...
regards, tom lane
Tom Lane wrote:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Tom Lane wrote:
[blink] This seems to miss out on the actual point of the thread (hash
bucket size shouldn't be a disk page) in favor of an entirely
unsupported sub-suggestion.

Yes, I was unsure of the text myself. I have changed it to:
* Allow hash buckets to fill disk pages, rather than being
  sparse

OK, though maybe "pack hash index buckets onto disk pages more
efficiently" would be clearer.
OK, updated.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
"Min Xu (Hsu)" <xu@cs.wisc.edu> writes:
It seems to me this is an interesting phenomenon of interaction between
frequent events (transaction commits) and infrequent events (system
checkpoints). A potential alternative to adding a new shared lock to the
frequent commit operation is to let the infrequent checkpoint operation
take on more overhead. I suppose acquiring/releasing an extra lock for
each commit would incur extra performance overhead, even when the lock is
not contended. On the other hand, letting the checkpoint operation acquire
some existing locks (exclusively), to effectively disallow committing
transactions from interfering with the checkpoint process, might be a
better solution since it incurs higher overhead only when necessary.

Unfortunately, there isn't any pre-existing lock that will serve.
A transaction that is between XLogInsert'ing its COMMIT record and
updating the shared pg_clog data area does not hold any lock that
could be used to prevent a checkpoint from starting. (Or it didn't
until yesterday's patch, anyway.)

I looked briefly at reorganizing the existing code so that we'd do the
COMMIT XLogInsert while we're holding lock on the shared pg_clog data,
which would solve the problem without adding any new lock acquisition.
But this seemed extremely messy to do. Also it would be optimizing
transaction commit at the cost of pessimizing other uses of pg_clog,
which might have to wait longer to get at the shared data. Adding the
new lock has the advantage that we can be sure it's not blocking
anything we don't want it to block.

Thanks for thinking about the problem though ...
regards, tom lane
One problem with a high-traffic LWLock is that it requires a write
to shared memory for both the shared lock and the exclusive lock. On
the increasingly prevalent SMP machines, this will cause the invalidation
of the cache-line containing the lock and the consequent reload and its
inherent delay. Would it be possible to use a latch + version number in
this case to minimize this problem by allowing all but the checkpoint to
perform a read-only action on the latch? This should eliminate the cache-line
shenanigans on SMP machines.
Ken Marshall
Kenneth Marshall <ktm@is.rice.edu> writes:
Would it be possible to use a latch + version number in
this case to minimize this problem by allowing all but the checkpoint to
perform a read-only action on the latch?
How would a read-only action work to block out the checkpoint?
More generally, though, this lock is hardly the one I'd be most
concerned about in an SMP situation. It's only taken once per
transaction, while there are others that may be taken many times.
(At least two of these, the WALInsertLock and the lock on shared
pg_clog, will need to be taken again in the process of recording
transaction commit.)
What I'd most like to find is a way to reduce contention for the
BufMgrLock --- there are at least some behavioral patterns in which
that is demonstrably a dominant cost. See past discussions in the
archives ("context swap storm" should find you some recent threads).
regards, tom lane
On Thu, Aug 12, 2004 at 09:58:56AM -0400, Tom Lane wrote:
Kenneth Marshall <ktm@is.rice.edu> writes:
Would it be possible to use a latch + version number in
this case to minimize this problem by allowing all but the checkpoint to
perform a read-only action on the latch?

How would a read-only action work to block out the checkpoint?
More generally, though, this lock is hardly the one I'd be most
concerned about in an SMP situation. It's only taken once per
transaction, while there are others that may be taken many times.
(At least two of these, the WALInsertLock and the lock on shared
pg_clog, will need to be taken again in the process of recording
transaction commit.)

What I'd most like to find is a way to reduce contention for the
BufMgrLock --- there are at least some behavioral patterns in which
that is demonstrably a dominant cost. See past discussions in the
archives ("context swap storm" should find you some recent threads).regards, tom lane
The latch+version number is used by the checkpoint process. The
other processes can do a read of the latch to determine if it has
been set. This does not cause a cache invalidation hit. If the
latch is set, the competing processes read until it has been
cleared and the version updated. This makes the general case of
no checkpoint not incur a write and the consequent cache-line
invalidation and reload by all processors on an SMP system.
Ken
Kenneth Marshall <ktm@is.rice.edu> writes:
On Thu, Aug 12, 2004 at 09:58:56AM -0400, Tom Lane wrote:
How would a read-only action work to block out the checkpoint?
The latch+version number is used by the checkpoint process. The
other processes can do a read of the latch to determine if it has
been set. This does not cause a cache invalidation hit. If the
latch is set, the competing processes read until it has been
cleared and the version updated. This makes the general case of
no checkpoint not incur a write and the consequent cache-line
invalidation and reload by all processors on an SMP system.
Except that reading the latch and finding it clear offers no guarantee
that a checkpoint isn't about to start. The problem is that we are
performing two separate actions (write a COMMIT xlog record and update
transaction status in clog) and we have to prevent a checkpoint from
starting in between those actions. I don't see that there's any way to
do that with a read-only latch.
regards, tom lane
On Thu, Aug 12, 2004 at 01:13:46PM -0400, Tom Lane wrote:
Kenneth Marshall <ktm@is.rice.edu> writes:
On Thu, Aug 12, 2004 at 09:58:56AM -0400, Tom Lane wrote:
How would a read-only action work to block out the checkpoint?
The latch+version number is used by the checkpoint process. The
other processes can do a read of the latch to determine if it has
been set. This does not cause a cache invalidation hit. If the
latch is set, the competing processes read until it has been
cleared and the version updated. This makes the general case of
no checkpoint not incur a write and the consequent cache-line
invalidation and reload by all processors on an SMP system.

Except that reading the latch and finding it clear offers no guarantee
that a checkpoint isn't about to start. The problem is that we are
performing two separate actions (write a COMMIT xlog record and update
transaction status in clog) and we have to prevent a checkpoint from
starting in between those actions. I don't see that there's any way to
do that with a read-only latch.

regards, tom lane
Yes, you are correct. I missed that part of the previous thread. When
I saw "exclusive lock" I thought latch since that is what I am investigating
to solve other performance issues that I am addressing.
Ken
On Thu, Aug 12, 2004 at 01:13:46PM -0400, Tom Lane wrote:
Kenneth Marshall <ktm@is.rice.edu> writes:
On Thu, Aug 12, 2004 at 09:58:56AM -0400, Tom Lane wrote:
How would a read-only action work to block out the checkpoint?
The latch+version number is used by the checkpoint process. The
other processes can do a read of the latch to determine if it has
been set. This does not cause a cache invalidation hit. If the
latch is set, the competing processes read until it has been
cleared and the version updated. This makes the general case of
no checkpoint not incur a write and the consequent cache-line
invalidation and reload by all processors on an SMP system.

Except that reading the latch and finding it clear offers no guarantee
that a checkpoint isn't about to start. The problem is that we are
performing two separate actions (write a COMMIT xlog record and update
transaction status in clog) and we have to prevent a checkpoint from
starting in between those actions. I don't see that there's any way to
do that with a read-only latch.
...just caught up on this.
ISTM that more heavily loading the checkpoint process IS possible if the
checkpoint uses a two-phase lock. That would replace 1 write lock with 2
lock reads...which is likely to be beneficial for SMP, given I have faith
that the other two problems you mention will succumb to some solution in the
mid-term. The first lock is an "intent lock" followed by a second,
heavyweight lock just as you now have it.
Committer:
1. prior to COMMIT: reads for an intent lock, if found then it attempts to
take heavyweight lock...if that is not possible, then the commit waits until
after the checkpoint, just as you currently suggest
2. prior to update clog: reads for an intent lock, if found then takes
heavyweight lock...if that is not possible, then report a server error
Checkpointer: (straight to step 4 for a shutdown checkpoint)
1. writes an intent lock (it always can)
2. wait for the group commit timeout
3. wait for 0.5 second more
4. begins to wait on an exclusive heavyweight lock, before starting
checkpoint proper
This is not a provably correct state machine, but the error message should
not occur under current "normal" situations. (It is possible that an intent
lock could be written by the Checkpointer (step 1) after a Committer reads for
it (step 1), and that a very long delay then occurs before the Committer's
step 2, such that Checkpointer step 4 begins before Committer step 2.) It is
very likely that this would be noticed by Committer step 2 and reported upon,
in the unlikely event that it occurs.
Is a longer term solution for pg to use a background log writer? That would
make group commit much easier to perform automatically without the
false-delay model currently available.
Best Regards, Simon Riggs
"Simon@2ndquadrant.com" <simon@2ndquadrant.com> writes:
This is not a provably correct state machine
I think the discussion ends right there. You are assuming that the
commit is guaranteed to finish in X amount of time, when it is not
possible to make any such guarantee. We are not putting in an
unreliable commit mechanism in order to save a small amount of lock
contention. (As I tried to point out already, but it doesn't seem
to have sunk in: this newly-added lock is not likely to be that much
more contention added to the commit path, seeing that the path of
control it protects already involves taking at least two exclusive
LWLocks. Those locks will likely each cause as much or more SMP
cache thrashing as this one.)
What we could use is a better way to build LWLocks in general. I do not
know how to do that, though, in the face of SMP machines that seem to
fundamentally not have any cheap locking mechanisms...
regards, tom lane
Tom Lane wrote:
"Simon@2ndquadrant.com" <simon@2ndquadrant.com> writes:
This is not a provably correct state machine
I think the discussion ends right there.
Yes...
Negative results are worth documenting too, IMHO.
Best Regards, Simon Riggs