Process local hint bit cache
In a previous thread
(http://postgresql.1045698.n5.nabble.com/Set-hint-bits-upon-eviction-from-BufMgr-td4264323.html)
I was playing with the idea of granting the bgwriter the ability to
do last-chance hint bit setting before evicting the page out. I still
think it might be a good idea, and it might be an especially good idea
in scenarios where you get to set the PD_ALL_VISIBLE bit when
writing out the page -- a huge win in some cases. That said, it bugged
me that it didn't really fix the general case of large data insertions
and the subsequent i/o problems from setting the bits, which is my
primary objective in the short term.
So I went back to the drawing board, reviewed the archives, and came
up with a new proposal. I'd like to see a process local clog page
cache of around 1-4 pages (8-32kb typically) that would replace the
current TransactionLogFetch last xid cache. It should be small,
because I doubt more would really help much (see below) and we want to
keep this data in the tight cpu caches since it's going to be
constantly scanned.
The cache itself will be the clog pages and a small header per page
which will contain the minimum information necessary to match an xid
against a page to determine a hit, and a count of hits. Additionally
we keep a separate small array (say 100 entries) of xids that we
insert into on a cache miss.
So, the cache lookup algorithm basically is:
  scan the clog cache headers and check for a hit
  if found (xid in range covered by a cached clog page):
      header.hit_count++;
  else:
      miss_array[miss_count++] = xid;
A cache hit is defined as getting useful information from the page,
that is, a transaction being committed or invalid.
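The lookup described above might be sketched roughly like this toy, standalone version. The sizes, names, and the 2-bit status packing are illustrative stand-ins, not the actual patch; per the hit definition above, a resolved (committed/invalid) status counts as a hit, while an in-progress answer from a cached page is treated as neither a hit nor a miss:

```c
#include <assert.h>
#include <stdint.h>

#define CLOG_XACTS_PER_PAGE 32768          /* 8 kB page, 2 status bits per xid */
#define CACHE_PAGES 4
#define MISS_ARRAY_SIZE 100

#define XID_STATUS_IN_PROGRESS 0
#define XID_STATUS_COMMITTED   1
#define XID_STATUS_ABORTED     2

typedef struct
{
    int      valid;                        /* is this slot populated? */
    uint32_t pageno;                       /* clog page number held here */
    int      hit_count;                    /* hits since last flush */
    uint8_t  status[CLOG_XACTS_PER_PAGE / 4];  /* 2 bits per xid, clog-style */
} ClogCacheHeader;

static ClogCacheHeader cache[CACHE_PAGES];
static uint32_t miss_array[MISS_ARRAY_SIZE];
static int      miss_count = 0;

/* Return the xid's status from the local cache, or -1 on a miss.  Only a
 * resolved status (committed/aborted) bumps the hit count. */
static int
clog_cache_lookup(uint32_t xid)
{
    uint32_t pageno = xid / CLOG_XACTS_PER_PAGE;
    uint32_t off = xid % CLOG_XACTS_PER_PAGE;

    for (int i = 0; i < CACHE_PAGES; i++)
    {
        if (cache[i].valid && cache[i].pageno == pageno)
        {
            int status = (cache[i].status[off / 4] >> (2 * (off % 4))) & 3;

            if (status != XID_STATUS_IN_PROGRESS)
            {
                cache[i].hit_count++;
                return status;
            }
            return -1;                     /* in progress: neither hit nor miss */
        }
    }
    if (miss_count < MISS_ARRAY_SIZE)
        miss_array[miss_count++] = xid;    /* record for the eviction pass */
    return -1;
}
```

The scan is a tiny linear pass over at most four headers, which is the point: it should stay in L1 and cost next to nothing per tuple.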
When the miss count array fills, we sort it and determine the most
commonly hit clog page that is not in the cache and use that
information to replace pages in the cache if necessary, then reset the
counts. Maybe we can add a minimum threshold of hits, say 5-10% of the
miss_array size, for a page to be deemed interesting enough to be
loaded into the cache.
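The replacement pass could look something like this standalone sketch; the 10% threshold, array sizes, and names are placeholders for tuning, not the patch:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define CLOG_XACTS_PER_PAGE 32768
#define MISS_ARRAY_SIZE 100
#define MIN_HIT_THRESHOLD (MISS_ARRAY_SIZE / 10)   /* ~10% of misses */

static int
cmp_u32(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *) a, y = *(const uint32_t *) b;

    return (x > y) - (x < y);
}

/* When the miss array fills: sort it, find the most frequently missed
 * clog page, and return it if it clears the threshold, else -1. */
static long
pick_page_to_cache(uint32_t *miss_array, int miss_count)
{
    long best_page = -1;
    int  best_count = 0;

    qsort(miss_array, miss_count, sizeof(uint32_t), cmp_u32);

    for (int i = 0; i < miss_count;)
    {
        uint32_t page = miss_array[i] / CLOG_XACTS_PER_PAGE;
        int      run = 0;

        /* xids on the same clog page are now adjacent after the sort */
        while (i < miss_count && miss_array[i] / CLOG_XACTS_PER_PAGE == page)
        {
            run++;
            i++;
        }
        if (run > best_count)
        {
            best_count = run;
            best_page = page;
        }
    }
    return (best_count >= MIN_HIT_THRESHOLD) ? best_page : -1;
}
```

Sorting 100 ints is cheap and happens only once per filled miss array, so the amortized per-lookup cost stays tiny.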
Interaction w/set hint bits:
*) If a clog lookup faults through the cache, we basically keep the
current behavior. That is, the hint bits are set and the page is
marked BM_DIRTY and the hint bits get written back
*) If we get a clog cache hit, that is the hint bits are not set but
we pulled the transaction status from the cache, the hint bits are
recorded on the page *but the page is not written back*, at least on
hint bit basis alone. This behavior branch is more or less the
BM_UNTIDY state suggested by Haas (see archives), except it's only seen
in 'cache hit' scenarios. We are not writing pages back because the
cache is suggesting there is little/no benefit to writing them back.
Thus, if a single backend is scanning a lot of pages with transactions
touching a very small number of clog pages, hint bits are generally
not written back because they are not needed and in fact not helping.
However, if the xids are spread around a large number of clog pages, we
get the current behavior more or less (plus the overhead of cache
maintenance).
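The two branches above could be sketched like so, with toy stand-ins for the buffer flags (BM_UNTIDY here is just illustrative shorthand for "hinted in memory but not scheduled for write-back", per the archives; none of this is the real buffer manager):

```c
#include <assert.h>

#define BM_CLEAN  0
#define BM_UNTIDY 1   /* hinted in memory, not scheduled for write-back */
#define BM_DIRTY  2   /* must be written back */

typedef struct
{
    int hint_bits_set;
    int state;
} ToyBuffer;

/* Setting hint bits always updates the page image, but the buffer is
 * marked dirty only when the status came from a real clog fault, not
 * from the backend-local cache. */
static void
set_hint_bits(ToyBuffer *buf, int status_from_local_cache)
{
    buf->hint_bits_set = 1;
    if (status_from_local_cache)
    {
        /* cache hit: hint the in-memory page only */
        if (buf->state == BM_CLEAN)
            buf->state = BM_UNTIDY;
    }
    else
        buf->state = BM_DIRTY;    /* cache fault: current behavior */
}
```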
With the current code base, hint bits are very beneficial when the xid
entropy is high and the number of repeated scans is high, and not so
good when the xid entropy is low and the number of repeated scans is
low. The process local cache attempts to redress this without
disadvantaging the already good cases. Furthermore, if it can be
proven that the cache overhead is epsilon, it's pretty unlikely to
negatively impact anyone; at least, that's my hope. Traffic
to clog will be reduced (although not much, since I'd wager the current
'last xid' cache works pretty well), but i/o should be reduced, in
some cases quite significantly, for a tiny cpu cost (although that
remains to be proven).
merlin
On Tue, Mar 29, 2011 at 4:34 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
A couple of details I missed:
*) clog lookups that return no cacheable information will not have
their xid inserted into the 'missed' array -- this will prevent a clog
page returning 'in progress' type states for transactions from pushing
out pages that are returning useful information. In other words, an
in-progress transaction is neither a hit nor a miss from the point of
view of the cache -- it's nothing.
*) If we fault to the clog and get useful information
(commit/invalid), and the clog page is already cached -- either the
particular bit is set or the entire cache page is refreshed (not sure
which is the better way to go yet)
*) The clog cache might be useful in other places, like during the page
eviction hint bit setting scenarios I mentioned earlier. In non-bgwriter
scenarios it's almost certainly a win to at least check the
cache and set hint bits for BM_HEAP pages since you are leveraging the
work already paid during scans. In the bgwriter case, you would have
to build the cache by checking clog pages. I'm somewhat skeptical that
this would actually help the bgwriter though, since it involves
tradeoffs that are hard to estimate.
*) Maybe the shared buffer cache currently being maintained over the
clog can be scrapped. I'm going to leave it alone for now, but I'm
quite skeptical it provides much benefit even without a local process
cache. clog pages have a very nice property: you don't have to
worry about what else is going on in other processes, and thus there are no
complicated locking or invalidation issues when considering the cache
structure. IMNSHO -- this makes a local cache a much better fit even
if you have to keep it smaller for memory usage reasons.
merlin
On 30.03.2011 17:05, Merlin Moncure wrote:
*) Maybe the shared buffer cache currently being maintained over the
clog can be scrapped. I'm going to leave it alone for now, but I'm
quite skeptical it provides much benefit even without local process
cache. clog page have a very nice property that you don't have to
worry about what else is going on from other processes and thus no
complicated locking or invalidation issues when considering cache
structure. IMNSHO -- this makes a local cache a much better fit even
if you have to keep it smaller for memory usage reasons.
A related idea I've sometimes pondered is to mmap() the clog in every
process.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Tue, Mar 29, 2011 at 10:34 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
So I went back to the drawing board, reviewed the archives, and came
up with a new proposal. I'd like to see a process local clog page
cache of around 1-4 pages (8-32kb typically) that would replace the
current TransactionLogFetch last xid cache. It should be small,
because I doubt more would really help much (see below) and we want to
keep this data in the tight cpu caches since it's going to be
constantly scanned.
How is this different from the existing clog SLRU? It seems like the
fundamental difference is that you plan to defer updating the hint
bits rather than update them every time the row is accessed. That
doesn't seem like a net win, it'll just defer the i/o, not eliminate
it. I suppose the existing clog SLRU is page-based whereas this could
cache individual xids in a btree so that it could have a higher hit
rate. Or we could just increase the clog SLRU size if there's any
evidence there are often cache misses on it. I suggested having the
SLRU share memory pages with the buffer cache so it would
automatically size itself rather than having to be statically sized.
There is something to be gained by trying to update *all* the hint
bits on a page whenever any row is updated. And there might be
something to be gained by not dirtying the page so we only update the
hint bits on disk if the page is dirtied for some other reason.
But one way or another the hint bits have to get set sometime. The
sooner that happens the less clog i/o has to happen in the meantime.
--
greg
On Wed, Mar 30, 2011 at 10:40 AM, Greg Stark <gsstark@mit.edu> wrote:
But one way or another the hint bits have to get set sometime. The
sooner that happens the less clog i/o has to happen in the meantime.
I talked about this with Merlin a bit yesterday. I think that his
thought is that most transactions will access a small enough number of
distinct CLOG pages, and the cache accesses might be fast enough, that
we really wouldn't need to get the hint bits set, or at least that
vacuum/freeze time would be soon enough. I'm not sure if that's
actually true, though - I think the overhead of the cache might be
higher than he's imagining. However, there's a sure-fire way to find
out... code it up and see how it plays.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Mar 30, 2011 at 9:40 AM, Greg Stark <gsstark@mit.edu> wrote:
On Tue, Mar 29, 2011 at 10:34 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
How is this different from the existing clog SLRU? It seems like the
fundamental difference is that you plan to defer updating the hint
bits rather than update them every time the row is accessed. That
doesn't seem like a net win, it'll just defer the i/o, not eliminate
it.
It is very different -- the slru layer is in shared memory and
requires locks to access. The entire point is trying to avoid
accessing this structure in tight code paths. I'm actually very
skeptical the slru layer provides any benefit at all. I bet it would
be cheaper to just mmap the pages in directly (as Heikki is also
speculating).
Regarding deferring i/o, you have a good chance of eliminating i/o if
the tuples are deleted or touched before a particular backend gets to
a page without already having seen the transaction (very high chance
under certain workloads). Worst case scenario, you get the old
behavior plus the overhead of maintaining the cache (which is why i'm
keen on bucketing at the page level -- so it's ultra cheap to scan and
memory efficient).
merlin
Excerpts from Merlin Moncure's message of Wed Mar 30 12:14:20 -0300 2011:
It is very different -- the slru layer is in shared memory and
requires locks to access. The entire point is trying to avoid
accessing this structure in tight code paths. I'm actually very
skeptical the slru layer provides any benefit at all. I bet it would
be cheaper just mmap in the pages directly (as Heikki is also
speculating).
Maybe it would be useful to distinguish the last SLRU page(s) (the ones
where clog writing actually takes place) from the older ones (which only
ever see reading). You definitely need locks to be able to access the
active pages, but all the rest could be held as mmapped and accessed
without locks because they never change (except to be truncated away).
You know that any page behind RecentXmin is not going to be written
anymore, so why go through all the locking hoops?
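That split could be sketched like so. This is purely illustrative: `recent_xmin`, the "mmap'ed" read, and the locked path are toy stand-ins for the real PostgreSQL machinery, and the status values returned are dummies:

```c
#include <assert.h>
#include <stdint.h>

#define CLOG_XACTS_PER_PAGE 32768

static uint32_t recent_xmin = 200000;   /* no xid below this is still running */
static int locked_reads;                /* counts reads that took the locked path */

/* Route a clog status read: pages wholly behind RecentXmin are immutable
 * and could be read lock-free (e.g. straight from an mmap'ed file), while
 * only the active tail pages need the SLRU lock. */
static int
read_xid_status(uint32_t xid)
{
    uint32_t pageno = xid / CLOG_XACTS_PER_PAGE;
    uint32_t first_active_page = recent_xmin / CLOG_XACTS_PER_PAGE;

    if (pageno < first_active_page)
    {
        /* page can never change again: read without any lock */
        return 1;   /* stand-in for a direct mmap'ed lookup */
    }
    /* active region: fall back to the locked SLRU path */
    locked_reads++;
    return 1;       /* stand-in for SimpleLruReadPage-style access */
}
```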
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Wed, Mar 30, 2011 at 10:43 AM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
Maybe it would be useful to distinguish the last SLRU page(s) (the ones
where clog writing actually takes place) from the older ones (which only
ever see reading). You definitely need locks to be able to access the
active pages, but all the rest could be held as mmapped and accessed
without locks, because they never change (except to be truncated away).
You know that any page behind RecentXmin is not going to be written
anymore, so why go through all the locking hoops?
hm, that's a good thought -- cheap and easy -- and good to implement
irrespective of other changes. any improvement helps here, although
from my perspective the largest hobgoblins are shared memory access
generally and the extra i/o.
merlin
On 30.03.2011 18:02, Robert Haas wrote:
On Wed, Mar 30, 2011 at 10:40 AM, Greg Stark<gsstark@mit.edu> wrote:
But one way or another the hint bits have to get set sometime. The
sooner that happens the less clog i/o has to happen in the meantime.

I talked about this with Merlin a bit yesterday. I think that his
thought is that most transactions will access a small enough number of
distinct CLOG pages, and the cache accesses might be fast enough, that
we really wouldn't need to get the hint bits set, or at least that
vacuum/freeze time would be soon enough. I'm not sure if that's
actually true, though - I think the overhead of the cache might be
higher than he's imagining. However, there's a sure-fire way to find
out... code it up and see how it plays.
I did a little experiment: I hacked SetHintBits() to do nothing, so that
hint bits are never set. I then created a table with 100000 rows in it:
postgres=# \d foo
Table "public.foo"
Column | Type | Modifiers
--------+---------+-----------
a | integer |
postgres=# INSERT INTO foo SELECT generate_series(1, 100000);
INSERT 0 100000
And ran queries on the table:
postgres=# do $$
declare
i int4;
begin
loop
perform COUNT(*) FROM foo;
end loop;
end;
$$;
This test case is designed to exercise the visibility tests as much as
possible. However, all the tuples are loaded in one transaction, so the
one-element cache in TransactionLogFetch is 100% effective.
I ran oprofile on that. It shows that about 15% of the time is spent in
HeapTupleSatisfiesMVCC and its subroutines. 6.6% is spent in
HeapTupleSatisfiesMVCC itself. Here's the breakdown of that:
$ opreport -c --global-percent
CPU: Intel Architectural Perfmon, speed 2266 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a
unit mask of 0x00 (No unit mask) count 100000
samples % app name symbol name
...
-------------------------------------------------------------------------------
  2143      0.4419  postgres  postgres  heapgettup_pagemode
 73277     15.1091  postgres  postgres  heapgetpage
 31858      6.5688  postgres  postgres  HeapTupleSatisfiesMVCC
 31858      6.5688  postgres  postgres  HeapTupleSatisfiesMVCC [self]
 12809      2.6411  postgres  postgres  TransactionIdIsInProgress
 12091      2.4931  postgres  postgres  XidInMVCCSnapshot
  7150      1.4743  postgres  postgres  TransactionIdIsCurrentTransactionId
  7056      1.4549  postgres  postgres  TransactionIdDidCommit
  1839      0.3792  postgres  postgres  TransactionIdPrecedes
  1467      0.3025  postgres  postgres  SetHintBits
  1155      0.2382  postgres  postgres  TransactionLogFetch
-------------------------------------------------------------------------------
...
I then ran the same test again with an unpatched version, to set the
hint bits. After the hint bits were set, I ran oprofile again:
-------------------------------------------------------------------------------
275 0.4986 postgres heapgettup_pagemode
4459 8.0851 postgres heapgetpage
3005 5.4487 postgres HeapTupleSatisfiesMVCC
3005 5.4487 postgres HeapTupleSatisfiesMVCC [self]
1620 2.9374 postgres XidInMVCCSnapshot
110 0.1995 postgres TransactionIdPrecedes
-------------------------------------------------------------------------------
So with hint bits set, HeapTupleSatisfiesMVCC accounts for 8% of the
total CPU time, and without hint bits, 15%.
Even if clog access was free, hint bits still give a significant speedup
thanks to skipping all the other overhead like
TransactionIdIsInProgress() and TransactionIdIsCurrentTransactionId().
Speeding up clog access is important; when the one-element cache isn't
saving you, the clog access becomes a dominant factor. But you have to
address that other overhead too before you can get rid of hint bits.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Wed, Mar 30, 2011 at 11:23 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
Even if clog access was free, hint bits still give a significant speedup
thanks to skipping all the other overhead like TransactionIdIsInProgress()
and TransactionIdIsCurrentTransactionId(). Speeding up clog access is
important; when the one-element cache isn't saving you the clog access
becomes a dominant factor. But you have to address that other overhead too
before you can get rid of hint bits.
Yes, note I am not getting rid of the hint bits -- either you get them
directly from the tuple or the cache. The clog access layers are:
1. hint bit
2. backend local clog cache (my proposal)
====shared memory layer ====
3. slru
4. os file cache
5. clog file
My idea hinges on rigging a cache (#2) that is not materially
slower than the raw hint bit check. If you think out all the cases
around what I'm suggesting, there is no way you hit #3 that you
wouldn't otherwise with the old behavior, since when you fault
through #2 you mark the page dirty, but there are cases where full
faults to clog are avoided. I'm not optimizing clog accesses, I'm
avoiding them.
Basically, I'm extending the last xid cache check in
TransactionLogFetch so it holds >1 xid, and not setting BM_DIRTY *if
and only if* you got xid status from that cache.
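That extension might look something like this standalone sketch; the array size, the ring replacement, and the names are guesses for illustration, not the patch:

```c
#include <assert.h>
#include <stdint.h>

#define XID_CACHE_SIZE 64   /* small on purpose: must stay in L1 */

typedef struct
{
    uint32_t xid;
    int      status;        /* resolved commit/abort status */
} XidCacheEntry;

static XidCacheEntry xid_cache[XID_CACHE_SIZE];
static int n_cached;

/* Returns the cached status, or -1 meaning "fault to clog"; only on a
 * fault would the caller be allowed to mark the buffer BM_DIRTY. */
static int
xid_cache_lookup(uint32_t xid)
{
    for (int i = 0; i < n_cached; i++)
        if (xid_cache[i].xid == xid)
            return xid_cache[i].status;
    return -1;
}

/* Remember a freshly resolved xid; trivial ring-order replacement. */
static void
xid_cache_remember(uint32_t xid, int status)
{
    static int next;

    if (n_cached < XID_CACHE_SIZE)
        n_cached++;
    xid_cache[next].xid = xid;
    xid_cache[next].status = status;
    next = (next + 1) % XID_CACHE_SIZE;
}
```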
merlin
On Wed, Mar 30, 2011 at 11:23 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
Even if clog access was free, hint bits still give a significant speedup
thanks to skipping all the other overhead like TransactionIdIsInProgress()
and TransactionIdIsCurrentTransactionId(). Speeding up clog access is
important; when the one-element cache isn't saving you, the clog access
becomes a dominant factor. But you have to address that other overhead too
before you can get rid of hint bits.
Here is a patch demonstrating the caching action, but without the
cache table, which isn't done yet -- it only works using the very last
transaction id fetched. I used macros so I could keep the changes
quite localized.
The payoff is obvious:
stock postgres:
postgres=# create table v as select generate_series(1,50000000) v;
select count(*) from v;
SELECT 50000000
Time: 70010.160 ms
select count(*) from v;
run 1: 64.5 seconds <-- !
run 2: 21.3 seconds
run 3: 19.3 seconds
hint bit patch:
run 1: 19.2 seconds <-- the 'money shot'
run 2: 20.7 seconds
run 3: 19.3 seconds
Of course, until I get the cache table mechanism finished, you only
see real benefit if you have a significant # of pages all from the same
transaction. otoh, checking the last transaction has a cost of ~0, so
your worst case performance is the old behavior. I'm pretty sure I
can make a cache that is cheap to check and maintain because I'm
completely free from locks, side effects, etc.
btw I haven't forgotten your idea to move TransactionIdIsInProgress
down. I think this is a good idea, and will experiment with it pre and
post cache.
merlin
Attachments:
hbcache.patch (application/octet-stream)
On Wed, Mar 30, 2011 at 2:35 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
btw I haven't forgotten your idea to move TransactionIdInProgress
Down. I think this is a good idea, and will experiment with it pre and
post cache.
aside:
Moving TransactionIdIsInProgress below TransactionIdDidCommit can help
in one sense: TransactionIdDidCommit grabs the XidStatus but discards
the knowledge of whether the transaction is known aborted. If it is in fact
aborted you can immediately set the hint bits and punt. This should
save an awful lot of calls to TransactionIdIsInProgress when scanning
unhinted dead tuples.
otoh, checking commit status first can burn you if you are repeatedly
scanning tuples that are in transaction, especially your own. Probably this
could be mitigated by splitting TransactionIdIsInProgress so that you
can do just the local checks and not shared memory, so that you could:
1. is this transaction mine? (if so, we are done)
2. get xid status if commit/abort done
3. do ProcArray portions of TransactionIdIsInProgress
merlin
On Wed, Mar 30, 2011 at 6:30 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
On Wed, Mar 30, 2011 at 2:35 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
btw I haven't forgotten your idea to move TransactionIdInProgress
Down. I think this is a good idea, and will experiment with it pre and
post cache.

aside:
Moving TransactionIdInProgress below TransactionIdDidCommit can help
in once sense: TransactionIdDidCommit grabs the XidStatus but discards
the knowledge if the transaction is known aborted. If it is in fact
aborted you can immediately set the hint bits and punt. This should
save an awful lot of calls to TransactionIdInProgress when scanning
unhinted dead tuples.
Yeah, it might make sense to have a function that returns
commit/abort/unsure, where unsure can only happen if the transaction
ID follows RecentXmin. You might also want to rearrange things so
that that function starts by checking the passed-in XID against
cachedFetchXid so we don't lose the benefit of that one-element cache.
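A commit/abort/unsure function of that shape might be sketched as follows. This is toy code, not transam.c: the name cachedFetchXid comes from the thread, the clog fetch here is a dummy rule, and `recent_xmin` stands in for RecentXmin:

```c
#include <assert.h>
#include <stdint.h>

typedef enum { XACT_COMMITTED, XACT_ABORTED, XACT_UNSURE } XactVerdict;

static uint32_t recent_xmin = 1000;    /* no xid below this is running */
static uint32_t cachedFetchXid;        /* last xid resolved via clog */
static XactVerdict cachedFetchVerdict;

/* stand-in for the real clog fetch of a definitely-finished xid */
static XactVerdict
clog_fetch(uint32_t xid)
{
    return (xid % 2) ? XACT_COMMITTED : XACT_ABORTED;   /* toy rule */
}

/* Returns commit/abort/unsure; "unsure" can only happen for xids at or
 * after RecentXmin, and the one-element cache is checked first so its
 * benefit is kept. */
static XactVerdict
transaction_verdict(uint32_t xid)
{
    if (xid == cachedFetchXid && cachedFetchXid != 0)
        return cachedFetchVerdict;      /* one-element cache win */
    if (xid >= recent_xmin)
        return XACT_UNSURE;             /* might still be in progress */
    cachedFetchXid = xid;
    cachedFetchVerdict = clog_fetch(xid);
    return cachedFetchVerdict;
}
```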
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Mar 30, 2011 at 2:35 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
On Wed, Mar 30, 2011 at 11:23 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:On 30.03.2011 18:02, Robert Haas wrote:
On Wed, Mar 30, 2011 at 10:40 AM, Greg Stark<gsstark@mit.edu> wrote:
But one way or another the hint bits have to get set sometime. The
sooner that happens the less clog i/o has to happen in the meantime.I talked about this with Merlin a bit yesterday. I think that his
thought is that most transactions will access a small enough number of
distinct CLOG pages, and the cache accesses might be fast enough, that
we really wouldn't need to get the hint bits set, or at least that
vacuum/freeze time would be soon enough. I'm not sure if that's
actually true, though - I think the overhead of the cache might be
higher than he's imagining. However, there's a sure-fire way to find
out... code it up and see how it plays.I did a little experiment: I hacked SetHintBits() to do nothing, so that
hint bits are never set. I then created a table with 100000 rows in it:postgres=# \d foo
Table "public.foo"
Column | Type | Modifiers
--------+---------+-----------
a | integer |postgres=# INSERT INTO foo SELECT generate_series(1, 100000);
INSERT 0 100000And ran queries on the table:
postgres=# do $$
declare
i int4;
begin
loop
perform COUNT(*) FROM foo;
end loop;
end;
$$;This test case is designed to exercise the visibility tests as much as
possible. However, all the tuples are loaded in one transaction, so the
one-element cache in TransactionLogFetch is 100% effective.I ran oprofile on that. It shows that about 15% of the time is spent in
HeapTupleSatisfiesMVCC and its subroutines. 6.6% is spent in
HeapTupleSatisfiesMVCC itself. Here's the breakdown of that:$ opreport -c --global-percent
CPU: Intel Architectural Perfmon, speed 2266 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit
mask of 0x00 (No unit mask) count 100000
samples % app name symbol name
...
-------------------------------------------------------------------------------
2143 0.4419 postgres postgres heapgettup_pagemode
73277 15.1091 postgres postgres heapgetpage
31858 6.5688 postgres postgres HeapTupleSatisfiesMVCC
31858 6.5688 postgres postgres HeapTupleSatisfiesMVCC
[self]
12809 2.6411 postgres postgres
TransactionIdIsInProgress
12091 2.4931 postgres postgres XidInMVCCSnapshot
7150 1.4743 postgres postgres
TransactionIdIsCurrentTransactionId
7056 1.4549 postgres postgres TransactionIdDidCommit
1839 0.3792 postgres postgres TransactionIdPrecedes
1467 0.3025 postgres postgres SetHintBits
1155 0.2382 postgres postgres TransactionLogFetch
-------------------------------------------------------------------------------
...I then ran the same test again with an unpatched version, to set the hint
bits. After the hint bits were set, I ran oprofile again:-------------------------------------------------------------------------------
275 0.4986 postgres heapgettup_pagemode
4459 8.0851 postgres heapgetpage
3005 5.4487 postgres HeapTupleSatisfiesMVCC
3005 5.4487 postgres HeapTupleSatisfiesMVCC [self]
1620 2.9374 postgres XidInMVCCSnapshot
110 0.1995 postgres TransactionIdPrecedes
-------------------------------------------------------------------------------So with hint bits set, HeapTupleSatisfiesMVCC accounts for 8% of the total
CPU time, and without hint bits, 15%.Even if clog access was free, hint bits still give a significant speedup
thanks to skipping all the other overhead like TransactionIdIsInProgress()
and TransactionIdIsCurrentTransactionId(). Speeding up clog access is
important; when the one-element cache isn't saving you the clog access
becomes a dominant factor. But you have to address that other overhead too
before you can get rid of hint bits.

Here is a patch demonstrating the caching action, but without the
cache table, which isn't done yet -- It only works using the very last
transaction id fetched. I used macros so I could keep the changes
quite localized.
While playing with the data structure to implement an xid cache inside
TransactionLogFetch, I learned a nasty lesson.
HeapTupleSatisfiesMVCC is super sensitive to any changes, and even
jumping through a couple of calls to TransactionLogFetch was
measurable in a couple of performance test cases I came up with. I
hauntingly recalled Tom's warnings to beware unexpected performance
downsides.
In short, regardless of implementation, I unfortunately don't think
any caching at or under the clog abstraction is going to work without
impacting at least some cases.
OTOH, maybe some micro-optimization of HeapTupleSatisfiesMVCC could
provide the same benefit I'm going after here. If you save off the
last xid that came back committed in a static variable, you can
check that at the same level as if it were a hint bit. From my
testing, this is well and truly 'free' in the sense that it can
defer/eliminate clog checks and hint bit i/o.
The patch attached gives all the advantages I was after in the
previous patch without the downsides I found (extra calls to
TransactionIdInProgress and inefficient calls out to
TransactionLogFetch).
It remains to be determined if it's possible to make a structure that
is fast enough to expand the above to more than 1 xid. For starters,
all the calls should be inline. Considering we are only after the
commit bits, a super tight hashing/bucketing algorithm might fit the
bill.
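As a concrete sketch of such a structure (everything below is purely illustrative -- the names and sizes are hypothetical, not from any posted patch), the cache could be as simple as a handful of known-committed xids scanned inline:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Small enough that the whole cache fits in one cache line. */
#define XID_CACHE_SIZE 4

/* Recently-seen xids known to be committed (0 = empty slot,
 * which can never match a real xid). */
static TransactionId commit_xid_cache[XID_CACHE_SIZE];

/* The hot path: four integer compares, no out-of-line calls. */
static inline bool
XidCacheIsCommitted(TransactionId xid)
{
    int i;

    for (i = 0; i < XID_CACHE_SIZE; i++)
        if (commit_xid_cache[i] == xid)
            return true;
    return false;
}

/* Called after a real clog lookup reports "committed". */
static inline void
XidCacheRemember(TransactionId xid)
{
    static int next_slot = 0;

    commit_xid_cache[next_slot] = xid;  /* round-robin replacement */
    next_slot = (next_slot + 1) % XID_CACHE_SIZE;
}
```

The lookup would sit at the same level as the HEAP_XMIN_COMMITTED hint bit test, so a hit never has to leave HeapTupleSatisfiesMVCC at all.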
merlin
diff -x '*.sql' -x '*.o' -x '*.txt' -rupN
postgresql-9.1alpha4/src/backend/utils/time/tqual.c
postgresql-9.1alpha4_hbcache/src/backend/utils/time/tqual.c
--- postgresql-9.1alpha4/src/backend/utils/time/tqual.c 2011-03-09
08:19:24.000000000 -0600
+++ postgresql-9.1alpha4_hbcache/src/backend/utils/time/tqual.c
2011-03-31 17:03:43.135195002 -0500
@@ -912,7 +912,10 @@ bool
HeapTupleSatisfiesMVCC(HeapTupleHeader tuple, Snapshot snapshot,
Buffer buffer)
{
- if (!(tuple->t_infomask & HEAP_XMIN_COMMITTED))
+ static TransactionId LastCommitXid = InvalidTransactionId;
+
+ if (!(tuple->t_infomask & HEAP_XMIN_COMMITTED)
+ && LastCommitXid != HeapTupleHeaderGetXmin(tuple))
{
if (tuple->t_infomask & HEAP_XMIN_INVALID)
return false;
@@ -985,8 +988,11 @@ HeapTupleSatisfiesMVCC(HeapTupleHeader t
 	else if (TransactionIdIsInProgress(HeapTupleHeaderGetXmin(tuple)))
 		return false;
else if (TransactionIdDidCommit(HeapTupleHeaderGetXmin(tuple)))
+ {
+ LastCommitXid = HeapTupleHeaderGetXmin(tuple);
SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
HeapTupleHeaderGetXmin(tuple));
+ }
else
{
/* it must have aborted or crashed */
On Thu, Mar 31, 2011 at 5:33 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
On Wed, Mar 30, 2011 at 2:35 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
On Wed, Mar 30, 2011 at 11:23 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

On 30.03.2011 18:02, Robert Haas wrote:
On Wed, Mar 30, 2011 at 10:40 AM, Greg Stark<gsstark@mit.edu> wrote:
But one way or another the hint bits have to get set sometime. The
sooner that happens the less clog i/o has to happen in the meantime.

I talked about this with Merlin a bit yesterday. I think that his
thought is that most transactions will access a small enough number of
distinct CLOG pages, and the cache accesses might be fast enough, that
we really wouldn't need to get the hint bits set, or at least that
vacuum/freeze time would be soon enough. I'm not sure if that's
actually true, though - I think the overhead of the cache might be
higher than he's imagining. However, there's a sure-fire way to find
out... code it up and see how it plays.

I did a little experiment: I hacked SetHintBits() to do nothing, so that
hint bits are never set. I then created a table with 100000 rows in it:

postgres=# \d foo
Table "public.foo"
Column | Type | Modifiers
--------+---------+-----------
 a      | integer |

postgres=# INSERT INTO foo SELECT generate_series(1, 100000);
INSERT 0 100000

And ran queries on the table:
postgres=# do $$
declare
i int4;
begin
loop
perform COUNT(*) FROM foo;
end loop;
end;
$$;

This test case is designed to exercise the visibility tests as much as
possible. However, all the tuples are loaded in one transaction, so the
one-element cache in TransactionLogFetch is 100% effective.

[rest of quoted message snipped]
hm, this isn't quite right -- on a 'last xid cache hit', the tuple's
hint bits should still be set if they are not already (but the page
should not be dirtied).
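In other words (a toy sketch with simplified stand-in types, not the real PostgreSQL structures or functions): on a cache hit the commit status is already known without touching clog, so the in-memory hint bit can be set for free, while skipping the dirty-marking that makes hint bit setting expensive in the first place:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for the real htup.h / buffer definitions. */
#define HEAP_XMIN_COMMITTED 0x0100

typedef struct HeapTupleHeaderStub
{
    uint16_t t_infomask;
} HeapTupleHeaderStub;

typedef struct PageStub
{
    bool is_dirty;
} PageStub;

/*
 * Like SetHintBits(), but for the xid-cache-hit path: set the bit in
 * memory so later visibility checks short-circuit, without marking the
 * buffer dirty, so no write i/o is generated just for the hint.
 */
static void
SetHintBitsNonDirtying(HeapTupleHeaderStub *tuple, PageStub *page,
                       uint16_t infomask)
{
    tuple->t_infomask |= infomask;
    (void) page;            /* deliberately NOT page->is_dirty = true */
}
```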
working on expanding the cache to # xid > 1.
merlin
On Thu, Mar 31, 2011 at 5:38 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
On Thu, Mar 31, 2011 at 5:33 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
On Wed, Mar 30, 2011 at 2:35 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
On Wed, Mar 30, 2011 at 11:23 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

[quoted text snipped]

working on expanding the cache to # xid > 1.
patch attached. this is essentially my original idea except it's
injected directly into tqual.c as a kind of expansion of the 'last
seen' concept. Because the cache loops are inlined and very tight (4
int compares), it's actually cheaper than jumping out of line into
clog just to test this_int == that_int.

Still needs some more code docs and refactoring/cleanup (I'm not even
gonna pretend this is up to pg coding standards), but it passes
regression and gets the performance I'm looking for w/o the downsides
I was seeing with the other implementation.

maybe some extra cycles could be eked out by doing aligned reads on
machine word size (32 or 64). It's pretty fast as is though, at least
on x86. Feel free to take a look.
merlin
Attachment: hbcache_v2.patch (application/octet-stream, +189 -2)
Merlin Moncure <mmoncure@gmail.com> writes:
On Wed, Mar 30, 2011 at 2:35 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
btw I haven't forgotten your idea to move TransactionIdIsInProgress
down. I think this is a good idea, and will experiment with it pre and
post cache.
The reason it's done in that order is to avoid race conditions. If you
change the order you will get wrong behavior if the other transaction
ends between the TransactionIdDidCommit and the TransactionIdInProgress
tests. I suppose you could recheck TransactionIdDidCommit a second
time, but that hardly seems likely to result in performance gains.
aside:
Moving TransactionIdIsInProgress below TransactionIdDidCommit can help
in one sense: TransactionIdDidCommit grabs the XidStatus but discards
the knowledge if the transaction is known aborted.
Doesn't the single-entry cache help with that?
regards, tom lane
Merlin Moncure <mmoncure@gmail.com> writes:
On Thu, Mar 31, 2011 at 5:38 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
working on expanding the cache to # xid > 1.

patch attached. this is essentially my original idea except it's
injected directly into tqual.c as a kind of expansion of the 'last
seen' concept. Because the cache loops are inlined and very tight (4
int compares), it's actually cheaper than jumping out of line into
clog just to test this_int == that_int.
This seems like a mess. Why is the criterion for caching commit states
different from whether the hint bit can be set? And why are you
cluttering tqual.c with it? Based on the initial description I'd
expected to find this enlarging the single-entry cache in transam.c
to have multiple entries.
regards, tom lane
On Sat, Apr 2, 2011 at 8:37 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Merlin Moncure <mmoncure@gmail.com> writes:
On Thu, Mar 31, 2011 at 5:38 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
working on expanding the cache to # xid > 1.

patch attached. this is essentially my original idea except it's
injected directly into tqual.c as a kind of expansion of the 'last
seen' concept. Because the cache loops are inlined and very tight (4
int compares), it's actually cheaper than jumping out of line into
clog just to test this_int == that_int.

This seems like a mess. Why is the criterion for caching commit states
different from whether the hint bit can be set?

The hint bits are always set. The page is flagged dirty if and only
if you have to dip below the process local cache to get it. The
rationale for this is simple: it caps your downside because unlike
hint bit removal or BM_UNTIDY, you only have to dip into clog and pay
the i/o once per page -- so that your downside is limited to pre-cache
behavior (minus the overhead of maintaining the cache).
cluttering tqual.c with it? Based on the initial description I'd
expected to find this enlarging the single-entry cache in transam.c
to have multiple entries.
This was my original idea --- abstracting it all under transam. For
simplicity's sake I was playing with a dynahash of the transaction
commit status, hit count, and log position keyed around the xid, and a
small sweeper that came around every N misses and purged out the
entries that did not have enough hit counts (or a defined minimum).
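A toy version of that rejected design (a plain direct-mapped array standing in for dynahash, and every name and constant below hypothetical) would look roughly like:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

#define CACHE_ENTRIES  64    /* direct-mapped stand-in for dynahash */
#define SWEEP_EVERY    100   /* sweep after this many misses */
#define MIN_HITS       2     /* entries colder than this get purged */

typedef struct XidStatusEntry
{
    TransactionId xid;       /* 0 = empty slot */
    bool          committed;
    int           hit_count;
} XidStatusEntry;

static XidStatusEntry xid_status_cache[CACHE_ENTRIES];
static int miss_count = 0;

/* Purge cold entries and start a fresh counting epoch. */
static void
xid_cache_sweep(void)
{
    int i;

    for (i = 0; i < CACHE_ENTRIES; i++)
    {
        if (xid_status_cache[i].xid != 0 &&
            xid_status_cache[i].hit_count < MIN_HITS)
            xid_status_cache[i].xid = 0;
        xid_status_cache[i].hit_count = 0;
    }
    miss_count = 0;
}

/* On a hit, *committed is valid and the entry's popularity is bumped. */
static bool
xid_cache_lookup(TransactionId xid, bool *committed)
{
    XidStatusEntry *e = &xid_status_cache[xid % CACHE_ENTRIES];

    if (e->xid == xid)
    {
        e->hit_count++;
        *committed = e->committed;
        return true;
    }
    if (++miss_count >= SWEEP_EVERY)
        xid_cache_sweep();
    return false;
}

/* Record the status fetched from clog on a miss. */
static void
xid_cache_insert(TransactionId xid, bool committed)
{
    XidStatusEntry *e = &xid_status_cache[xid % CACHE_ENTRIES];

    e->xid = xid;
    e->committed = committed;
    e->hit_count = 0;
}
```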
The problem with that approach, at least according to my initial
testing, was that putting two non-inline functions after the hint bit
test into HeapTupleSatisfiesMVCC ate up a lot of the savings I was
looking for. I was testing on a VM, which maybe penalizes out-of-line
code execution more than real hardware, but I wasn't happy with it and
tried working the hint bit check angle. If you look upthread you can
see my skepticism that a non-inline cache lookup is going to work.

You can trivially test this yourself if you're curious; my test was to just run:

drop table t; create table t as select v from generate_series(1,500000) v;

and repeatedly do: select count(*) from t;

If you disable the hint bit action, you should see a not insignificant
difference in time even though the current last-xid cache in transam
is there.
hm. If the exposure to transam is not to your liking (I'm not happy
with it either), I could make an inline stub for TransactionIdDidCommit
which checks the cache and, failing anything useful there, falls
through to the out-of-line logic.
merlin
On Sun, Apr 3, 2011 at 6:40 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
On Sat, Apr 2, 2011 at 8:37 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
[quoted text snipped]
I might have missed part of the thrust of your question. In the
patch, the commit status is only stored if the LSN for the xid is
known safe (instead of checking it just before setting the bit as
SetHintBits does). This is indeed different, and it was done so as to
avoid having to (a) repeatedly check XLogNeedsFlush or (b) cache the xid
LSN as transam.c does. Perhaps this is weak sauce, so I think the
first thing to do is have another go at doing it at the transam level.
Stay tuned...
merlin