Reduce pinning in btree indexes

Started by Kevin Grittner about 11 years ago, 41 messages, pgsql-hackers
#1 Kevin Grittner
Kevin.Grittner@wicourts.gov

We have a customer who was unable to migrate from a well-known
commercial database product to pg because they have a very large
software base that holds cursors open for very long periods of
time, preventing GlobalXmin values from advancing, leading to
bloat. They have a standard test that exercises hundreds of client
connections for days at a time, and saw disk space and CPU usage
climbing to unacceptable levels[1]. The other database product had
no such problems. We suggested the obvious solutions that we
always suggest on the community lists, and they had perfectly good
reasons why those were not viable for them -- we either needed to
prevent bloat under their current software or they could not
migrate.

What they wanted was what happened in the other database product --
if a snapshot got sufficiently old that cleaning up the MVCC data
was a problem *and* the snapshot was used again *and* it read a
page which had been modified far enough back that it was not
possible to return correct data, then they wanted to receive a
"snapshot too old" error. I wrote a patch to do that, but they
didn't seem much improvement, because all (auto)vacuum processes
blocked indefinitely on the btree index pages where a cursor was
parked. So they still got massive bloat and consequent problems.

It seemed odd to me that a btree implementation based on the
Lehman & Yao techniques would block anything, because the concept
is that it reads everything it needs off the index page and pauses
"between" pages. We sorta do that, but the "interlock" between
vacuum processes and other index usages basically involves holding
a pin on the page we just read until we are done using the values
we read off that page, and treating the pin as a lighter lock than
a shared (or READ) lock -- which only conflicts with a "super
exclusive" lock, which consists of getting an exclusive lock only
once there are no other processes holding a pin on the page. This
ensures that at the end of a vacuum pass over a btree index there
are no TIDs in process-local memory from before the start of the
pass, and consequently no scan can read a new tuple assigned to the
same TID value after the rest of the vacuum phases run. So a pin
on a btree page blocks a vacuum process indefinitely.
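
To make the interlock concrete, here is a minimal sketch using the
real bufmgr calls -- ReadBuffer(), LockBufferForCleanup(), and
UnlockReleaseBuffer() -- though the helper function itself is
hypothetical, not code from the patch or from nbtree.c:

#include "postgres.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

/*
 * Hypothetical sketch of a vacuum pass over one btree page.
 * LockBufferForCleanup() grants an exclusive lock only once no other
 * backend holds a pin on the page -- the "super exclusive" lock --
 * so a cursor parked on the page with a mere pin blocks it forever.
 */
static void
vacuum_one_btree_page(Relation rel, BlockNumber blkno)
{
    Buffer      buf = ReadBuffer(rel, blkno);   /* our own pin */

    LockBufferForCleanup(buf);  /* waits for every other pin to drop */

    /* ... now safe to remove index tuples and later recycle TIDs ... */

    UnlockReleaseBuffer(buf);
}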

Interestingly, the btree README points out that using the old TID
with a new tuple poses no hazard for a scan using an MVCC snapshot,
because the new tuple would not be visible to a snapshot created
that long ago. The customer's cursors which were causing the
problems all use MVCC snapshots, so I went looking to see whether
we couldn't take advantage of this fact. For the most part it
seemed we were OK if we dropped pins with the READ lock for a btree
scan which used an MVCC snapshot. I found that the LP_DEAD hinting
would be a problem with an old TID, but I figured we could work
around that by storing the page LSN into the scan position structure
when the page contents were read, and only doing hinting if that
matched when we were ready to do the hinting. That wouldn't work
for an index which was not WAL-logged, so the patch still holds
pins for those. Robert pointed out that the visibility information
for an index-only scan wasn't checked while the index page READ
lock was held, so those scans also still hold the pins.
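
Roughly, the LSN check looks like this -- a simplified sketch with a
hypothetical helper, not the patch verbatim; BufferGetLSNAtomic() is
the real call for reading a page's LSN while holding only a shared
lock:

/* earlier, while the READ lock was held as the page's items were copied: */
so->currPos.lsn = BufferGetLSNAtomic(so->currPos.buf);

/* much later, when ready to apply LP_DEAD hints without having kept
 * the pin (sketch; caller keeps the lock on success and applies the
 * hints, then releases): */
static bool
position_still_valid(Relation rel, BTScanOpaque so)
{
    Buffer      buf = ReadBuffer(rel, so->currPos.currPage);

    LockBuffer(buf, BT_READ);
    if (BufferGetLSNAtomic(buf) != so->currPos.lsn)
    {
        /* page changed since we read it: a TID may have been
         * recycled, so applying LP_DEAD hints is no longer safe */
        UnlockReleaseBuffer(buf);
        return false;
    }
    return true;
}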

There was also a problem with the fact that the code conflated the
notion of having a valid scan position with holding a pin on a
block in the index. Those two concepts needed to be teased apart,
which I did using some new macros and figuring out which one to use
where. Without a pin on the block, we also needed to remember what
block we had been processing to be able to do the LP_DEAD hinting;
for that the block number was added to the structure which tracks
scan position.
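
Condensed, the new arrangement looks something like this; this is my
simplification, with assertions and most fields omitted:

typedef struct BTScanPosData
{
    Buffer      buf;        /* InvalidBuffer whenever the pin is dropped */
    BlockNumber currPage;   /* remembered so LP_DEAD hinting can find the
                             * page again without holding the pin */
    XLogRecPtr  lsn;        /* page LSN taken when the items were read */
    /* ... item array, firstItem/lastItem/itemIndex, etc. ... */
} BTScanPosData;

/* "position is valid" and "we hold a pin" are now separate questions */
#define BTScanPosIsPinned(scanpos)  BufferIsValid((scanpos).buf)
#define BTScanPosIsValid(scanpos)   BlockNumberIsValid((scanpos).currPage)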

Finally, there was an "optimization" for marking buffer position
for possible restore that was incompatible with releasing the pin.
I use quotes because the optimization adds overhead to every move
to the next page in order to set some variables in a structure when a
mark is requested instead of running two fairly small memcpy()
statements. The two-day benchmark of the customer showed no
performance hit, and looking at the code I would be amazed if the
optimization yielded a measurable benefit. In general, optimization
by adding overhead to moving through a scan to save time in a mark
operation seems dubious.
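
For the curious, the shape of that optimization -- as I read the
master code, condensed and approximate -- is this:

/* btmarkpos() under the "optimization": no copy, just note the spot */
so->markItemIndex = so->currPos.itemIndex;

/* ...which obliges every step to the next page to check: */
if (so->markItemIndex >= 0)
{
    /* scan is leaving the marked page: do the deferred copy now */
    memcpy(&so->markPos, &so->currPos, sizeof(BTScanPosData));
    so->markPos.itemIndex = so->markItemIndex;
    so->markItemIndex = -1;
}

/* whereas the straightforward form is just one memcpy() in
 * btmarkpos() and another in btrestrpos() */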

At some point we could consider building on this patch to recheck
index conditions for heap access when a non-MVCC snapshot is used,
check the visibility map for referenced heap pages when the TIDs
are read for an index-only scan, and/or skip LP_DEAD hinting for
non-WAL-logged indexes. But all those are speculative future work;
this is a conservative implementation that just didn't modify
pinning where there were any confounding factors.

In combination with the snapshot-too-old patch the 2-day customer
benchmark ran without problem levels of bloat, controlling disk
space growth and keeping CPU usage level. Without the other patch
this patch would at least allow autovacuum to maintain statistics,
which probably makes it worthwhile even without the other patch,
but will probably not have a very big impact on bloat without both.

At least two README files and a lot of comments need to be modified
before commit to reflect the change in approach if this is accepted
by the community. Since this work has been very recent I haven't
had time to do that yet.

git diff --stat tells me this patch has:

4 files changed, 208 insertions(+), 128 deletions(-)

... and the related "snapshot too old" patch has:

16 files changed, 145 insertions(+), 19 deletions(-)

Given that these are going into the last CF and the related patch
is really still at "proof of concept" phase, it may be too late to
consider them for PostgreSQL 9.5, but I would still like a serious
review now in case people feel strongly about controlling bloat in
the face of old snapshots, and to know whether to work on this as
just for the Postgres Plus Advanced Server fork or for the
community.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[1]: This test maintains a certain transaction rate, with user think
times on the connections, so performance isn't measured by a tps
rate but by how much CPU is consumed to maintain that rate.

Attachments:

bt-nopin-v1.patch (text/x-diff, +305 -253)
#2 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Kevin Grittner (#1)
Re: Reduce pinning in btree indexes

On 02/15/2015 02:19 AM, Kevin Grittner wrote:

Interestingly, the btree README points out that using the old TID
with a new tuple poses no hazard for a scan using an MVCC snapshot,
because the new tuple would not be visible to a snapshot created
that long ago.

The first question is: Do we really need that interlock for the non-MVCC
snapshots either?

If we do: For non-MVCC snapshots, we need to ensure that all index scans
that started before the VACUUM did complete before the VACUUM does. I
wonder if we could find a different mechanism to enforce that. Using the
pin-interlock for that has always seemed a bit silly to me. Perhaps grab
a new heavy-weight lock on the table whenever a non-MVCC index scan on
the table begins, and have VACUUM wait on it.

I found that the LP_DEAD hinting
would be a problem with an old TID, but I figured we could work
around that by storing the page LSN into the scan position structure
when the page contents were read, and only doing hinting if that
matched when we were ready to do the hinting. That wouldn't work
for an index which was not WAL-logged, so the patch still holds
pins for those.

Or you could use GetFakeLSNForUnloggedRel().
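
That function, in src/backend/access/transam/xlog.c, hands out
monotonically increasing fake LSNs; GiST already takes this approach
for unlogged relations. Sketching the idea at the point a page of a
non-WAL-logged index is modified -- illustrative only, not code from
any patch:

if (RelationNeedsWAL(rel))
    PageSetLSN(page, recptr);   /* LSN of the WAL record just written */
else
    PageSetLSN(page, GetFakeLSNForUnloggedRel());   /* fake, but increasing */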

Robert pointed out that the visibility information
for an index-only scan wasn't checked while the index page READ
lock was held, so those scans also still hold the pins.

Why does an index-only scan need to hold the pin?

Finally, there was an "optimization" for marking buffer position
for possible restore that was incompatible with releasing the pin.
I use quotes because the optimization adds overhead to every move
to the next page in order to set some variables in a structure when a
mark is requested instead of running two fairly small memcpy()
statements. The two-day benchmark of the customer showed no
performance hit, and looking at the code I would be amazed if the
optimization yielded a measurable benefit. In general, optimization
by adding overhead to moving through a scan to save time in a mark
operation seems dubious.

Hmm. Did your test case actually exercise mark/restore? The memcpy()s
are not that small. Especially if it's an index-only scan, you're
copying a large portion of the page. Some scans call markpos on every
tuple, so that could add up.

At some point we could consider building on this patch to recheck
index conditions for heap access when a non-MVCC snapshot is used,
check the visibility map for referenced heap pages when the TIDs
are read for an index-only scan, and/or skip LP_DEAD hinting for
non-WAL-logged indexes. But all those are speculative future work;
this is a conservative implementation that just didn't modify
pinning where there were any confounding factors.

Understood. Still, I'd like to see if we can easily get rid of the
pinning altogether.
- Heikki


#3 Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#2)
Re: Reduce pinning in btree indexes

On Mon, Feb 23, 2015 at 2:48 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

Robert pointed out that the visibility information
for an index-only scan wasn't checked while the index page READ
lock was held, so those scans also still hold the pins.

Why does an index-only scan need to hold the pin?

Suppose there's a dead tuple in the heap someplace, and an index
pointer pointing to that dead tuple. An index scan reaches the
relevant page and copies all of the index pointers. VACUUM removes
the index pointer and marks the heap page all-visible. The scan now
uses the index pointers copied to backend-local memory and notes that
the heap-page is all-visible, so the scan sees the tuple even though
that tuple is totally gone from the heap by that point. Holding the
pin prevents this, because we can't reach the second heap pass until
the scan is done with the page, and therefore the dead line pointer in
the heap page can't be marked unused, and therefore the page can't be
marked all-visible.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#4 Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Heikki Linnakangas (#2)
Re: Reduce pinning in btree indexes

Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
On 02/15/2015 02:19 AM, Kevin Grittner wrote:

Interestingly, the btree README points out that using the old TID
with a new tuple poses no hazard for a scan using an MVCC snapshot,
because the new tuple would not be visible to a snapshot created
that long ago.

The first question is: Do we really need that interlock for the
non-MVCC snapshots either?

We don't need the interlock for non-MVCC snapshots if the conditions we
limit or filter on at the index level are rechecked when we get to the heap
tuple; otherwise we do.

If we do: For non-MVCC snapshots, we need to ensure that all index
scans that started before the VACUUM did complete before the VACUUM
does.

Isn't it enough to be sure that, if we're not going to recheck any
index tests against the heap tuple, any process-local copies of index
entries which exist at the start of the index scan phase of the vacuum
are not used after the vacuum enters the phase of freeing line
pointers -- that is, that any scan which has read a page when the
vacuum starts to scan the index moves on to another page before vacuum
finishes scanning that index?

I wonder if we could find a different mechanism to enforce that. Using
the pin-interlock for that has always seemed a bit silly to me.

It certainly is abusing the semantics of the pin to treat it as a type of
lock on the contents.

Perhaps grab a new heavy-weight lock on the table whenever a non-MVCC index
scan on the table begins, and have VACUUM wait on it.

-1

The problem I'm most interested in fixing based on user feedback is a vacuum
blocking when a cursor which is using an index scan is sitting idle. Not
only does the vacuum of that table stall, but if it is an autovacuum worker,
that worker is prevented from cleaning up any other tables. I have seen all
autovacuum workers blocked in exactly this way, leading to massive bloat.
What you suggest is just using a less awkward way to keep the problem.

I found that the LP_DEAD hinting
would be a problem with an old TID, but I figured we could work
around that by storing the page LSN into the scan position structure
when the page contents were read, and only doing hinting if that
matched when we were ready to do the hinting. That wouldn't work
for an index which was not WAL-logged, so the patch still holds
pins for those.

Or you could use GetFakeLSNForUnloggedRel().

I'm unfamiliar with that, but will take a look. (I will be back at
my usual development machine late tomorrow.)

Robert pointed out that the visibility information
for an index-only scan wasn't checked while the index page READ
lock was held, so those scans also still hold the pins.

Why does an index-only scan need to hold the pin?

Robert already answered this one, but there is a possible solution
-- provide some way to check the heap's visibility map for the TIDs
read from an index page before releasing the read lock on that page.
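
Sketching that idea -- hedged; visibilitymap_test() is the real
function in the current sources, but the loop, the allvisible[] array,
and the surrounding names are made up for illustration:

/* while the index page's READ lock (and hence our pin) is still held:
 * record, for each copied TID, whether its heap page is all-visible.
 * Vacuum cannot have finished its pass over this index page yet, so a
 * set VM bit seen here cannot be the result of the race Robert
 * described. */
int         i;
Buffer      vmbuf = InvalidBuffer;

for (i = so->currPos.firstItem; i <= so->currPos.lastItem; i++)
{
    BlockNumber hblkno =
        ItemPointerGetBlockNumber(&so->currPos.items[i].heapTid);

    allvisible[i] = visibilitymap_test(scan->heapRelation, hblkno, &vmbuf);
}
if (BufferIsValid(vmbuf))
    ReleaseBuffer(vmbuf);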

Finally, there was an "optimization" for marking buffer position
for possible restore that was incompatible with releasing the pin.
I use quotes because the optimization adds overhead to every move
to the next page in order to set some variables in a structure when a
mark is requested instead of running two fairly small memcpy()
statements. The two-day benchmark of the customer showed no
performance hit, and looking at the code I would be amazed if the
optimization yielded a measurable benefit. In general, optimization
by adding overhead to moving through a scan to save time in a mark
operation seems dubious.

Hmm. Did your test case actually exercise mark/restore? The memcpy()s
are not that small. Especially if it's an index-only scan, you're
copying a large portion of the page. Some scans call markpos on every
tuple, so that could add up.

I have results from the `make world` regression tests and a 48-hour
customer test. Unfortunately I don't know how heavily either of those
exercise this code. Do you have a suggestion for a way to test whether
there is a serious regression for something approaching a "worst case"?

At some point we could consider building on this patch to recheck
index conditions for heap access when a non-MVCC snapshot is used,
check the visibility map for referenced heap pages when the TIDs
are read for an index-only scan, and/or skip LP_DEAD hinting for
non-WAL-logged indexes. But all those are speculative future work;
this is a conservative implementation that just didn't modify
pinning where there were any confounding factors.

Understood. Still, I'd like to see if we can easily get rid of the
pinning altogether.

That would be great, but I felt that taking care of the easy cases in
on patch and following with patches to handle the trickier cases as
separate follow-on patches made more sense than trying to do everything
at once. Assuming we sort out the mark/restore issues for the initial
patch, it provides infrastructure to keep the other cases simpler.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#5 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kevin Grittner (#4)
Re: Reduce pinning in btree indexes

Hello, I measured the performance of this patch considering
markpos/restorepos. The loss seems to be up to about 10%
unfortunately.

At Thu, 26 Feb 2015 17:49:23 +0000 (UTC), Kevin Grittner <kgrittn@ymail.com> wrote in <440831854.629116.1424972963735.JavaMail.yahoo@mail.yahoo.com>

Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
On 02/15/2015 02:19 AM, Kevin Grittner wrote:

Hmm. Did your test case actually exercise mark/restore? The memcpy()s
are not that small. Especially if it's an index-only scan, you're
copying a large portion of the page. Some scans call markpos on every
tuple, so that could add up.

I have results from the `make world` regression tests and a 48-hour
customer test. Unfortunately I don't know how heavily either of those
exercise this code. Do you have a suggestion for a way to test whether
there is a serious regression for something approaching a "worst case"?

ammarkpos/amrestrpos are called in merge joins. By the steps shown
below, I got 1M markpos calls and no restorepos calls for a 1M-row
result, and 500k markpos calls with the same number of restorepos
calls for a 2M-row result with a slightly different configuration. I
suppose we can say these are the worst cases for markpos/restrpos.
The call counts can be taken using the attached patch.
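
(The attachment isn't reproduced here, but presumably it amounts to
something like the following hypothetical instrumentation, which
matches the LOG line quoted later in this thread:)

/* hypothetical counting patch: static counters bumped in
 * btmarkpos()/btrestrpos() and reported when the scan ends */
static long markpos_count = 0;
static long restrpos_count = 0;

/* in btmarkpos():  */ markpos_count++;
/* in btrestrpos(): */ restrpos_count++;

/* at scan end: */
elog(LOG, "markpos=%ld, restrpos=%ld", markpos_count, restrpos_count);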

Both binaries were compiled with -O2. shared_buffers=1GB, and all
shared pages used in the query were already cached when measuring.

The numbers were taken 12 times for each configuration; I took the
average and stddev of 10 of them, excluding the best and the worst.

Case 1. 500k markpos/restorepos for 2M result rows.

Index scan: the patch loses about 2% (1.98%).
master: 6166 ms (std dev = 3.2 ms)
Patched: 6288 ms (std dev = 3.7 ms)

Index-only scan: the patch loses about 2% (2.14%).
master: 5946 ms (std dev = 5.0 ms)
Patched: 6073 ms (std dev = 3.3 ms)

The patched version is about 2% slower. Of course not all of that is
the effect of the memcpy(), but I don't know the breakdown.

Case 2. 1M markpos, no restorepos, for a 1M-row result.

Index-only scan: the patch loses about 10%.
master: 3655 ms (std dev = 2.5 ms)
Patched: 4038 ms (std dev = 2.6 ms)

The loss is about 10%. The case looks a bit impractical, but
unfortunately the number may be too large to ignore. The breakdown of
the loss is unknown, too.

regards,

===========
CREATE TABLE t1 (a int);
CREATE TABLE t2 (a int);
delete from t1;
delete from t2;
-- This causes 1M times of markpos and no restorepos
INSERT INTO t1 (SELECT a FROM generate_series(0, 999999) a);
INSERT INTO t2 (SELECT a FROM generate_series(0, 999999) a);
-- This causes 500k times of markpos and the same number of restorepos
-- INSERT INTO t1 (SELECT a/2 FROM generate_series(0, 999999) a);
-- INSERT INTO t2 (SELECT a/2 FROM generate_series(0, 999999) a);
CREATE INDEX it1 ON t1 (a);
CREATE INDEX it2 ON t2 (a);
ANALYZE t1;
ANALYZE t2;
SET enable_seqscan to false;
SET enable_material to false;
SET enable_hashjoin to false;
SET enable_nestloop to false;
SET enable_indexonlyscan to false; -- omit this to do indexonly scan

EXPLAIN (ANALYZE) SELECT t1.a, t2.a FROM t1 JOIN t2 on (t1.a = t2.a);

QUERY PLAN
--------------------------------------------------------------------------
Merge Join (cost=2.83..322381.82 rows=3031231 width=8) (actual time=0.013..5193.566 rows=2000000 loops=1)
Merge Cond: (t1.a = t2.a)
-> Index Scan using it1 on t1 (cost=0.43..65681.87 rows=1666667 width=4) (actual time=0.004..727.557 rows=1000000 loops=1)
-> Index Scan using it2 on t2 (cost=0.43..214642.89 rows=3031231 width=4) (actual time=0.004..1478.361 rows=1999999 loops=1)
Planning time: 25.585 ms
Execution time: 6299.521 ms
(6 rows)

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

bt_mark_rstr_count.patch (text/x-patch, +10 -1)
#6 Andrew Gierth
andrew@tao11.riddles.org.uk
In reply to: Kyotaro Horiguchi (#5)
Re: Reduce pinning in btree indexes

"Kyotaro" == Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:

Kyotaro> ammarkpos/amrestrpos are called in merge joins. By the steps
Kyotaro> shown below, I got 1M markpos calls and no restorepos calls
Kyotaro> for a 1M-row result, and 500k markpos calls with the same
Kyotaro> number of restorepos calls for a 2M-row result with a
Kyotaro> slightly different configuration. I suppose we can say these
Kyotaro> are the worst cases for markpos/restrpos. The call counts
Kyotaro> can be taken using the attached patch.

You might want to try running the same test, but after patching
ExecSupportsMarkRestore to return false for index scans. This will cause
the planner to insert a Materialize node to handle the mark/restore.

This way, you can get an idea of how much gain (if any) the direct
support of mark/restore in the scan is actually providing.
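
(As of the 9.4-era sources that function lives in
src/backend/executor/execAmi.c and switches on the plan node tag; the
experiment is just this, with the case list abridged here:)

bool
ExecSupportsMarkRestore(NodeTag plantype)
{
    switch (plantype)
    {
        case T_IndexScan:
        case T_IndexOnlyScan:
            return false;   /* was true: planner now adds a Materialize */

        case T_SeqScan:
        case T_Material:
        case T_Sort:
            return true;

        default:
            break;
    }

    return false;
}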

--
Andrew (irc:RhodiumToad)


#7 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Andrew Gierth (#6)
Re: Reduce pinning in btree indexes

Hi,

At Fri, 27 Feb 2015 05:56:18 +0000, Andrew Gierth <andrew@tao11.riddles.org.uk> wrote in <874mq77vuu.fsf@news-spur.riddles.org.uk>

"Kyotaro" == Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:

Kyotaro> ammarkpos/amrestrpos are called in merge joins. By the steps
Kyotaro> shown below, I got 1M markpos calls and no restorepos calls
Kyotaro> for a 1M-row result, and 500k markpos calls with the same
Kyotaro> number of restorepos calls for a 2M-row result with a
Kyotaro> slightly different configuration. I suppose we can say these
Kyotaro> are the worst cases for markpos/restrpos. The call counts
Kyotaro> can be taken using the attached patch.

You might want to try running the same test, but after patching
ExecSupportsMarkRestore to return false for index scans. This will cause
the planner to insert a Materialize node to handle the mark/restore.

Mmm? The patch bt-nopin-v1.patch does not seem to contain a change to
ExecSupportsMarkRestore, and that very simple function still returns
true for T_Index(Only)Scan after the patch is applied.

EXPLAIN shows that the plan does not involve a Materialize step, and
in addition the counters I put into both btmarkpos() and btrestrpos()
showed that they are actually called during execution, like this, for
both unpatched and patched:

| LOG: markpos=1000000, restrpos=0
| STATEMENT: EXPLAIN (ANALYZE,BUFFERS) SELECT t1.a, t2.a FROM t1 JOIN t2 on (t1.a = t2.a);

To make sure, I looked into btmarkpos and btrestrpos to confirm that
they really do the memcpy() we're talking about, then recompiled the
whole source tree.

This way, you can get an idea of how much gain (if any) the direct
support of mark/restore in the scan is actually providing.

Does "the scan" mean T_Material node? But no material in plan and
counters in bt*pos were incremented in fact as mentioned above..

Could you point out any other possible mistake in my thought?

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


#8 Andrew Gierth
andrew@tao11.riddles.org.uk
In reply to: Kyotaro Horiguchi (#7)
Re: Reduce pinning in btree indexes

"Kyotaro" == Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:

You might want to try running the same test, but after patching
ExecSupportsMarkRestore to return false for index scans. This will
cause the planner to insert a Materialize node to handle the
mark/restore.

Kyotaro> Mmm? The patch bt-nopin-v1.patch does not seem to contain a
Kyotaro> change to ExecSupportsMarkRestore, and that very simple
Kyotaro> function still returns true for T_Index(Only)Scan after the
Kyotaro> patch is applied.

Right. I'm suggesting you change that, in order to determine what
performance cost, if any, would result from abandoning the idea of doing
mark/restore entirely.

--
Andrew (irc:RhodiumToad)


#9 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Andrew Gierth (#8)
Re: Reduce pinning in btree indexes

Just a reminder, but I am not the author of this patch. :)

At Fri, 27 Feb 2015 07:26:38 +0000, Andrew Gierth <andrew@tao11.riddles.org.uk> wrote in <87zj7z6ckc.fsf@news-spur.riddles.org.uk>

"Kyotaro" == Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:

You might want to try running the same test, but after patching
ExecSupportsMarkRestore to return false for index scans. This will
cause the planner to insert a Materialize node to handle the
mark/restore.

Kyotaro> Mmm? The patch bt-nopin-v1.patch does not seem to contain a
Kyotaro> change to ExecSupportsMarkRestore, and that very simple
Kyotaro> function still returns true for T_Index(Only)Scan after the
Kyotaro> patch is applied.

Right. I'm suggesting you change that, in order to determine what
performance cost, if any, would result from abandoning the idea of doing
mark/restore entirely.

I understand that you'd like to see the net performance drag of the
memcpy(), right?

That doesn't seem possible without breaking (part of) the patch's
functionality, but for the btmarkpos()-only case the mark struct can
contain garbage, so changing only the refcount is a viable way to see
the pure loss from the memcpy().

The attached patch shows how I broke it, and the results were as
below.

Case 2. 1M markpos, no restorepos for 1M result rows.

Index-only scan: the patch loses about 10%.
master: 3655 ms (std dev = 2.5 ms)
Patched: 4038 ms (std dev = 2.6 ms)

Patched: 4036 ms (std dev = 3.5 ms)
Broken: 3718 ms (std dev = 1.6 ms)

The "Patched" just above is the retook numbers. It's a little
under 9% loss from "broken" version. That is the pure drag of
memcpy(). This seems quite big as Heikki said. It's an extreme
case, though.

The rest 1.7% should be the share of all the other stuff.
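
(The breakage isn't shown inline, but from the description it
presumably amounts to something like this guess -- keep the patch's
pin bookkeeping, skip the copy, and tolerate a garbage markPos
because this workload never restores; the function signature is
simplified:)

static void
btmarkpos_broken(IndexScanDesc scan)
{
    BTScanOpaque so = (BTScanOpaque) scan->opaque;

    BTScanPosUnpinIfPinned(so->markPos);
    if (BTScanPosIsValid(so->currPos))
    {
        /* keep the pin accounting for the mark... */
        if (BTScanPosIsPinned(so->currPos))
            IncrBufferRefCount(so->currPos.buf);
        so->markPos.buf = so->currPos.buf;
        /* ...but deliberately omit the memcpy() into so->markPos */
    }
}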

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

bt_nopin_nomemcpy.patch (text/x-patch, +18 -12)
#10 Andrew Gierth
andrew@tao11.riddles.org.uk
In reply to: Kyotaro Horiguchi (#9)
Re: Reduce pinning in btree indexes

"Kyotaro" == Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:

Kyotaro> Just a reminder, but I am not the author of this patch. :)

Yes, I understand that.

Kyotaro> Mmm? The patch bt-nopin-v1.patch does not seem to contain a
Kyotaro> change to ExecSupportsMarkRestore, and that very simple
Kyotaro> function still returns true for T_Index(Only)Scan after the
Kyotaro> patch is applied.

Right. I'm suggesting you change that, in order to determine what
performance cost, if any, would result from abandoning the idea of
doing mark/restore entirely.

Kyotaro> I understand that you'd like to see the net performance drag
Kyotaro> of the memcpy(), right?

No.

What I am suggesting is this: if mark/restore is a performance issue,
then it would be useful to know how much gain we're getting (if any)
from supporting it _at all_.

Let me try and explain it another way. If you change
ExecSupportsMarkRestore to return false for index scans, then
btmarkpos/btrestrpos will no longer be called. What is the performance
of this case compared to the original and patched versions?

--
Andrew (irc:RhodiumToad)


#11 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Andrew Gierth (#10)
Re: Reduce pinning in btree indexes

Hello.

Mmm... We might be focusing on slightly different things.

On 2015/02/27 17:27, Andrew Gierth wrote:

Kyotaro> I understand that you'd like to see the net performance drag
Kyotaro> of the memcpy(), right?

No.

What I am suggesting is this: if mark/restore is a performance issue,
then it would be useful to know how much gain we're getting (if any)
from supporting it _at all_.

Mark/restore works as before with this patch, except for the early
dropping of buffer pins. As far as I understand, everything in the
patch serves that purpose and is not intended to boost the btree
itself. That is, it reduces the chance of blocking vacuum at the cost
of a possible burden on general performance. And I think the issue
currently in focus is that the burden might be a bit too heavy in
specific situations. Therefore I tried to measure how heavy or light
that burden is.
Am I correct so far?

Let me try and explain it another way. If you change
ExecSupportsMarkRestore to return false for index scans, then
btmarkpos/btrestrpos will no longer be called. What is the performance
of this case compared to the original and patched versions?

As you mentioned before, such a change will make the planner turn to,
perhaps, a materialize plan or a rescan, or some other scan that is
of no use compared to an index scan for the issue in focus: the
performance drag when a btree scan does extremely frequent
mark/restore, relative to the unpatched version that copies less.
Your suggestion seems to be aiming at something different from this.

Anyway, I'm sorry, but I am away from my dev environment and will not
have time until next week.

regards,

--
Kyotaro Horiguchi

#12 Andrew Gierth
andrew@tao11.riddles.org.uk
In reply to: Kyotaro Horiguchi (#11)
Re: Reduce pinning in btree indexes

"Kyotaro" == Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:

Kyotaro> Anyway, I'm sorry, but I am away from my dev environment and
Kyotaro> will not have time until next week.

The point is moot; rather than try and explain it further I did the test
myself, and the performance penalty of disabling mark/restore is
substantial, on the order of 30%, so that's a non-starter. (I was a bit
surprised that it was so bad...)

--
Andrew (irc:RhodiumToad)


#13 Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Kyotaro Horiguchi (#5)
Re: Reduce pinning in btree indexes

Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:

Hello, I measured the performance of this patch considering
markpos/restorepos. The loss seems to be up to about 10%
unfortunately.

Thanks for the test case!

I took another look at this optimization, and found that it didn't
really depend on the pin (as I had first thought), so I put it back
(keeping the rest of the patch unchanged). I saw a 1.4% increase
in run time with the patch (compared to master) for the mark-only
test, and a 1.6% increase in run time for the mark/restore test.
I'll look into what might be causing that. I have seen bigger
differences just due to changing where executable code crossed
cache boundaries, but this is big enough to worry about; it needs
to be checked.

New patch attached.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

bt-nopin-v2.patch (+253 -197)
#14 Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Kevin Grittner (#13)
Re: Reduce pinning in btree indexes

Kevin Grittner <kgrittn@ymail.com> wrote:

[need to check performance more]

It looks like the remaining performance regression was indeed a
result of code alignment. I found two "paranoia" assignments I had
accidentally failed to put back with the rest of the mark/restore
optimization; after that trivial change the patched version is
ever-so-slightly faster than master on all tests.

Average of 3 `make check-world` runs:
master: 336.288 seconds
patch: 332.237 seconds

Average of 6 `make check` runs:
master: 25.409 seconds
patch: 25.270 seconds

The following were all run 12 times, the worst 2 dropped, the rest
averaged:

Kyotaro-san's 1m mark "worst case" benchmark:
master: 962.581 ms, 6.765 stdev
patch: 947.518 ms, 3.584 stdev

Kyotaro-san's 500k mark, 500k restore "worst case" benchmark:
master: 1366.639 ms, 5.314 stdev
patch: 1363.844 ms, 5.896 stdev

pgbench -i -s 16 pgbench / pgbench -c 16 -j 4 -T 500 pgbench
master: 265.011 tps, 4.952 stdev
patch: 267.164 tps, 9.094 stdev

How do people feel about the idea of me pushing this for 9.5 (after
I clean up all the affected comments and README files)? I know
this first appeared in the last CF, but the footprint is fairly
small and the only user-visible behavior change is that a btree
index scan of a WAL-logged table using an MVCC snapshot no longer
blocks a vacuum indefinitely. (If there are objections I will move
this to the first CF for the next release.)

src/backend/access/nbtree/nbtree.c | 50 +++++-------
src/backend/access/nbtree/nbtsearch.c | 141 +++++++++++++++++++++++++---------
src/backend/access/nbtree/nbtutils.c | 58 ++++++++++----
src/include/access/nbtree.h | 36 ++++++++-
4 files changed, 201 insertions(+), 84 deletions(-)

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#15 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kevin Grittner (#14)
Re: Reduce pinning in btree indexes

Hello,

It looks like the remaining performance regression was indeed a
result of code alignment. I found two "paranoia" assignments I had
accidentally failed to put back with the rest of the mark/restore
optimization; after that trivial change the patched version is
ever-so-slightly faster than master on all tests.

The performance your test shows looks great. The gain might come from
the removal of buffer locking in _bt_next. I would also like to see
this make it into 9.5, but that is a matter for the committers.

Apart from that, I looked into the patch itself more closely. Please
reconcile my comments below as you see fit; some of them contradict
each other. :)

In nbtutils.c, _bt_killitems:

- This is not the fault of this patch, but the comment for
_bt_killitems looks to have a trivial typo.

* _bt_killitems - set LP_DEAD state for items an indexscan caller has
* told us were killed
*
* scan->so contains information about the current page and killed tuples
* thereon (generally, this should only be called if so->numKilled > 0).

I suppose "scan->so" should be "scan->opaque" and
"so->numKilled" be "opaque->numKilled".

- The following comment looks somewhat wrong.

/* Since the else condition needs page, get it here, too. */

"the else condition needs page" means "the page of the buffer
is needed later"? But, I suppose the comment itself is not
necessary.

- If (BTScanPosIsPinned(so->currPos)):

As I mention below for nbtsearch.c, the page acquired in the
if-then block may be vacuumed, so the LSN check done in the
if-else block is needed in the if-then block as well. That can be
accomplished simply by moving the end of the if-else block.

- _bt_killitems is called without a pin when rescanning, as
before, so I think the previous code already considered the
unpinned case (the "if (ItemPointerEquals(&ituple..." test does
that). Vacuum rearranges line pointers after deleting LP_DEAD
items, so the previous check seems sufficient for the purpose. The
previous code is also more efficient, because the matching items
will still be processed even after a vacuum.

In nbtsearch.c

- _bt_first(): It now releases the share lock on the page before
the items to be killed are processed. This allows other backends
to insert items into the page concurrently. That seems dangerous
by itself, but _bt_killitems already considers the case. Anyway, I
think it would be better to add a comment explaining why the
unlocking is not dangerous.

... Sorry, time's up for now.

regards,


--
Kyotaro Horiguchi
NTT Open Source Software Center


#16 Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Kyotaro Horiguchi (#15)
Re: Reduce pinning in btree indexes

Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:

The performance your test shows looks great. The gain might
come from the removal of buffer locking in _bt_next.

Yeah, I had been hoping that removing some buffer pins and locks
from the common path of scanning forward from one page to the next
might have some direct performance benefit; it's possible that this
will show up more dramatically at high concurrency levels. The
main goal, of course, was to improve performance more indirectly,
by not having autovacuum worker processes blocked for extended
periods in some situations where that can now happen.

In nbtutils.c, _bt_killitems:

- This is not the fault of this patch, but the comment for
_bt_killitems looks to have a trivial typo.

As you say, that problem was already there, but no reason not to
fix it in the patch as long as we're here. Reworded.

- The following comment looks somewhat wrong.

/* Since the else condition needs page, get it here, too. */

"the else condition needs page" means "the page of the buffer
is needed later"? But, I suppose the comment itself is not
necessary.

I guess that comment isn't really worth the space it takes up; what
it says is pretty obvious from the code. Removed.

- If (BTScanPosIsPinned(so->currPos)):

As I mention below for nbtsearch.c, the page acquired in the
if-then block may be vacuumed, so the LSN check done in the
if-else block is needed in the if-then block as well. That can be
accomplished simply by moving the end of the if-else block.

I'm not sure I agree on this. If the page is pinned it should have
been pinned continuously since we initially read it, so the line
pointers we have could not have been removed by any vacuum process.
The tuples may have been pruned away in the heap, but that doesn't
matter. Index entries may have been added and the index page may
have been split, but that was true before this patch and
_bt_killitems will deal with those things the same as it always
has. I don't see any benefit to doing the LSN compare in this
case; if we've paid the costs of holding the pin through to this
point, we might as well flag any dead entries we can.
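
(To make the two paths concrete, the structure under discussion in
_bt_killitems is roughly this -- condensed, not the verbatim patch:)

if (BTScanPosIsPinned(so->currPos))
{
    /* pin held continuously since the page was read: just lock it;
     * the pin prevented any TID on the page from being recycled,
     * so no LSN check is needed */
    LockBuffer(so->currPos.buf, BT_READ);
}
else
{
    /* pin was dropped: re-read the page and verify the saved LSN */
    Buffer  buf = _bt_getbuf(scan->indexRelation, so->currPos.currPage,
                             BT_READ);

    if (BufferGetLSNAtomic(buf) != so->currPos.lsn)
    {
        /* page changed since we read it; hinting would be unsafe */
        _bt_relbuf(scan->indexRelation, buf);
        return;
    }
    so->currPos.buf = buf;
}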

- _bt_killitems is called without a pin when rescanning, as
before, so I think the previous code already considered the
unpinned case (the "if (ItemPointerEquals(&ituple..." test does
that). Vacuum rearranges line pointers after deleting LP_DEAD
items, so the previous check seems sufficient for the purpose. The
previous code is also more efficient, because the matching items
will still be processed even after a vacuum.

I'm not following you on this one; could you rephrase it?

In nbtsearch.c

- _bt_first(): It now releases the share lock on the page before
the items to be killed are processed. This allows other backends
to insert items into the page concurrently. That seems dangerous
by itself, but _bt_killitems already considers the case. Anyway, I
think it would be better to add a comment explaining why the
unlocking is not dangerous.

Well, _bt_killitems does acquire a shared (read) lock to flag index
tuples as LP_DEAD, and it has always been OK for the index page to
accept inserts and allow page splits before calling _bt_killitems.
I don't really see anything new with the patch in this regard. I
do agree, though, that a lot of comments and at least two README
files refer to the pins blocking vacuum, and these must be found
and changed.

It doesn't seem worth posting to the list for the small changes
since the last version; I'll wait until I update the comments and
README files. If you want review or test the latest, you can peek
at: https://github.com/kgrittn/postgres/tree/btnopin

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#17 Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Kevin Grittner (#16)
Re: Reduce pinning in btree indexes

Kevin Grittner <kgrittn@ymail.com> wrote:

It doesn't seem worth posting to the list for the small changes
since the last version; I'll wait until I update the comments and
README files. If you want review or test the latest, you can
peek at:

https://github.com/kgrittn/postgres/tree/btnopin

Here is v3, with the promised README and code comment changes.

In the process of documenting the mark/restore area I noticed a
subtlety that I had missed (the case where there was a mark, an
advance to the next page, a restore, an advance within the page, and
another restore). I fixed that, and in the process gave the patched
code an additional direct performance edge over unpatched code. For
the 1M-mark benchmark, average timings are now:

master: 970.999 ms, stdev: 4.043
patched: 940.460 ms, stdev: 4.874

So, in the micro-benchmark showing the biggest benefit the direct
improvement is now just over 3%. It remains the case that the
primary motivation for the patch is to reduce blocking of vacuum
processes; but that's a nice side-benefit.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

bt-nopin-v3.patch (+387 -324)
#18 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kevin Grittner (#16)
Re: Reduce pinning in btree indexes

Hello,

I found no other problems, including the performance issue, in the
patch attached to the last mail, as far as I can tell. So I'll mark
this as Ready for Committer in a few days if there are no objections
once these comments are addressed.

- If (BTScanPosIsPinned(so->currPos)):

As I mention below for nbtsearch.c, the page acquired in the
if-then block may be vacuumed, so the LSN check done in the
if-else block is needed in the if-then block as well. That can be
accomplished simply by moving the end of the if-else block.

I'm not sure I agree on this.

Sorry, it is largely because of my poor composition.

If the page is pinned it should have
been pinned continuously since we initially read it, so the line
pointers we have could not have been removed by any vacuum process.
The tuples may have been pruned away in the heap, but that doesn't
matter.
Index entries may have been added and the index page may
have been split, but that was true before this patch and
_bt_killitems will deal with those things the same as it always
has.

Yes. I think so.

I don't see any benefit to doing the LSN compare in this
case; if we've paid the costs of holding the pin through to this
point, we might as well flag any dead entries we can.

I was thinking of the case where the buffer has been pinned by
another backend after this backend unpinned it. I looked again at the
definitions of BTScanPosIsPinned and BTScanPosUnpinIfPinned, and
understood that the pin must be mine if BTScanPosIsPinned.

Would you mind rewriting the comment there like this?

-  /* The buffer is still pinned, but not locked.  Lock it now. */
+  /* I still hold the pin on the buffer, but not locked.  Lock it now. */
|   LockBuffer(so->currPos.buf, BT_READ);

Or would you mind renaming the macro to "BTScanPosIsPinnedByMe"
or something like that -- anything to notify the reader that the pin
must be mine there?

- _bt_killitems is called without a pin when rescanning, as
before, so I think the previous code already considered the
unpinned case (the "if (ItemPointerEquals(&ituple..." test does
that). Vacuum rearranges line pointers after deleting LP_DEAD
items, so the previous check seems sufficient for the purpose. The
previous code is also more efficient, because the matching items
will still be processed even after a vacuum.

I'm not following you on this one; could you rephrase it?

Sorry, I misread btrescan as calling _bt_killitems() *after*
releasing the buffer. Please forget it.

Finally, I'd like to timidly comment on this:

+ To handle the cases where it is safe to release the pin after
+ reading the index page, the LSN of the index page is read along
+ with the index entries before the shared lock is released, and we
+ do not attempt to flag index tuples as dead if the page is not
+ pinned and the LSN does not match.

I suppose a sentence like the following describes the mechanism more
directly and makes it easier to find the corresponding code, if it is
correct.

To handle the cases where an index page has been unpinned when we
try to mark unused index tuples on the page as dead, the LSN of
the index page is remembered when reading the index page for index
tuples, and we do not attempt to flag index tuples as dead if the
page is not pinned and the LSN does not match.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


#19 Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Kyotaro Horiguchi (#18)
Re: Reduce pinning in btree indexes

Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:

I found no other problems, including the performance issue, in the
patch attached to the last mail, as far as I can tell. So I'll mark
this as Ready for Committer in a few days if there are no objections
once these comments are addressed.

Thanks for the reviews!

I don't see any benefit to doing the LSN compare in this
case; if we've paid the costs of holding the pin through to this
point, we might as well flag any dead entries we can.

I was thinking of the case where the buffer has been pinned by
another backend after this backend unpinned it. I looked again at the
definitions of BTScanPosIsPinned and BTScanPosUnpinIfPinned, and
understood that the pin must be mine if BTScanPosIsPinned.

Would you mind rewriting the comment there like this?

-  /* The buffer is still pinned, but not locked.  Lock it now. */
+  /* I still hold the pin on the buffer, but not locked.  Lock it now. */
|  LockBuffer(so->currPos.buf, BT_READ);

Or would you mind renaming the macro to "BTScanPosIsPinnedByMe"
or something like that -- anything to notify the reader that the pin
must be mine there?

I see your point, although those first person singular pronouns
used like that make me a little uncomfortable; I'll change the
comment and/or macro name, but I'll work on the name some more.

Finally, I'd like to timidly comment on this:

+ To handle the cases where it is safe to release the pin after
+ reading the index page, the LSN of the index page is read along
+ with the index entries before the shared lock is released, and we
+ do not attempt to flag index tuples as dead if the page is not
+ pinned and the LSN does not match.

I suppose a sentence like the following describes the mechanism more
directly and makes it easier to find the corresponding code, if it is
correct.

To handle the cases where an index page has been unpinned when we
try to mark unused index tuples on the page as dead, the LSN of
the index page is remembered when reading the index page for index
tuples, and we do not attempt to flag index tuples as dead if the
page is not pinned and the LSN does not match.

Will reword that part to try to make it clearer.

Thanks!

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#20 Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Kevin Grittner (#19)
Re: Reduce pinning in btree indexes

Kevin Grittner <kgrittn@ymail.com> wrote:

Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:

Would you mind rewriting the comment there like this?

- /* The buffer is still pinned, but not locked. Lock it now. */
+ /* I still hold the pin on the buffer, but not locked. Lock it now. */

Or would you mind renaming the macro to "BTScanPosIsPinnedByMe"
or something like that -- anything to notify the reader that the pin
must be mine there?

I see your point, although those first person singular pronouns
used like that make me a little uncomfortable; I'll change the
comment and/or macro name, but I'll work on the name some more.

After thinking it over, I think that the "BTScanPos" part of the
macro name is enough of a hint that it is concerned with the
actions of this scan; it is the comment that needs the change.
I went with:

/*
* We have held the pin on this page since we read the index tuples,
* so all we need to do is lock it. The pin will have prevented
* re-use of any TID on the page, so there is no need to check the
* LSN.
*/

+ To handle the cases where it is safe to release the pin after
+ reading the index page, the LSN of the index page is read along
+ with the index entries before the shared lock is released, and we
+ do not attempt to flag index tuples as dead if the page is not
+ pinned and the LSN does not match.

I suppose a sentence like the following describes the mechanism more
directly and makes it easier to find the corresponding code, if it is
correct.

To handle the cases where an index page has been unpinned when we
try to mark unused index tuples on the page as dead, the LSN of
the index page is remembered when reading the index page for index
tuples, and we do not attempt to flag index tuples as dead if the
page is not pinned and the LSN does not match.

Will reword that part to try to make it clearer.

That sentence was full of "passive voice", which didn't help any.
I changed it to:

| So that we can handle the cases where we attempt LP_DEAD flagging
| for a page after we have released its pin, we remember the LSN of
| the index page when we read the index tuples from it; we do not
| attempt to flag index tuples as dead if we didn't hold the
| pin the entire time and the LSN has changed.

Do these changes seem clear?

Because these changes are just to a comment and a README file, I'm
posting a patch-on-patch v3a (to be applied on top of v3).

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

bt-nopin-v3a.patch (+19 -19)
#21 Kevin Grittner, in reply to #20 and #1
#23 Kyotaro Horiguchi, in reply to Peter Geoghegan (#22)
#24 Simon Riggs, in reply to #14
#25 Simon Riggs, in reply to #1
#26 Robert Haas, in reply to #25
#27 Kevin Grittner, in reply to #23
#28 Kevin Grittner, in reply to #26
#29 Simon Riggs, in reply to #26
#30 Simon Riggs, in reply to #28
#31 Kyotaro Horiguchi, in reply to #27
#32 Kevin Grittner, in reply to #30
#33 Robert Haas, in reply to #32
#34 Kevin Grittner, in reply to #31
#35 Simon Riggs, in reply to #32
#36 Simon Riggs, in reply to #33
#37 Robert Haas, in reply to #36
#38 Jim Nasby, in reply to #37
#39 Simon Riggs, in reply to #38
#40 Bruce Momjian, in reply to #30
#41 Kevin Grittner, in reply to #34