Locking B-tree leaves immediately in exclusive mode
Hi!
Currently _bt_search() always locks the leaf buffer in shared mode (aka BT_READ),
while the caller can relock it later. However, I don't see what prevents _bt_search()
from locking the leaf immediately in exclusive mode (aka BT_WRITE) when required.
When traversing a downlink from a non-leaf page of level 1 (the lowest level of
non-leaf pages, just above the leaf pages), we know that the next page is going to
be a leaf. In this case, we can immediately lock the next page in exclusive mode if
required. For sure, it might happen that we can't take the exclusive leaf lock this
way, namely when _bt_getroot() points us directly to a leaf page. But that case can
be handled in _bt_search() by relocking. Please find an implementation of this
approach in the attached patch.
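To illustrate the idea (the attached patch is the authoritative version), here is a
rough sketch of what the descent loop in _bt_search() could look like; the names
follow nbtsearch.c, but the details may differ from the patch:

/*
 * Sketch only: descend with shared locks as usual, but once the current
 * internal page is at level 1, take the child (leaf) lock in the
 * caller-requested mode.
 */
page_access = BT_READ;
for (;;)
{
    page = BufferGetPage(*bufP);
    opaque = (BTPageOpaque) PageGetSpecialPointer(page);
    if (P_ISLEAF(opaque))
        break;
    /* ... binary-search the downlink, push the parent onto the stack ... */
    if (opaque->btpo.level == 1 && access == BT_WRITE)
        page_access = BT_WRITE;     /* the next page will be a leaf */
    *bufP = _bt_relandgetbuf(rel, *bufP, blkno, page_access);
}

/* The root was already a leaf: we only took BT_READ above, so relock. */
if (access == BT_WRITE && page_access == BT_READ)
{
    LockBuffer(*bufP, BUFFER_LOCK_UNLOCK);
    LockBuffer(*bufP, BT_WRITE);
    /* and re-check for concurrent splits after relocking */
}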
I've run the following simple test of this patch on a 72-core machine.
# postgresql.conf
max_connections = 300
shared_buffers = 32GB
fsync = off
synchronous_commit = off
# DDL:
CREATE UNLOGGED TABLE ordered (id serial primary key, value text not null);
CREATE UNLOGGED TABLE unordered (i integer not null, value text not null);
# script_ordered.sql
INSERT INTO ordered (value) VALUES ('abcdefghijklmnoprsqtuvwxyz');
# script_unordered.sql
\set i random(1, 1000000)
INSERT INTO unordered VALUES (:i, 'abcdefghijklmnoprsqtuvwxyz');
# commands
pgbench -T 60 -P 1 -M prepared -f script_ordered.sql -c 150 -j 150 postgres
pgbench -T 60 -P 1 -M prepared -f script_unordered.sql -c 150 -j 150 postgres
# results
ordered, master: 157473 TPS
ordered, patched: 231374 TPS
unordered, master: 232372 TPS
unordered, patched: 232535 TPS
As you can see, the difference in the unordered case is negligible: due to the
random inserts, concurrency on any particular leaf page is low. But ordered
insertion is almost 50% faster on the patched version.
I wonder how we could have missed such a simple optimization until now, but I also
don't see this patch breaking anything.
In the patched version, it might happen that we have to traverse rightlinks in
exclusive mode due to splits concurrent with the downlink traversal. However, the
same can happen on current master due to splits concurrent with the relock. So I
don't expect any performance regression to be caused by this patch.
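For reference, the rightlink traversal in question is what _bt_moveright() already
does on both versions; roughly (abbreviated from nbtsearch.c, ignoring
incomplete-split handling):

/*
 * Follow rightlinks while the high key shows that our key space has moved
 * right due to a concurrent split, reacquiring each lock in whatever access
 * mode was requested; at most one lock is held at a time.
 */
cmpval = nextkey ? 0 : 1;
while (!P_RIGHTMOST(opaque) &&
       (P_IGNORE(opaque) ||
        _bt_compare(rel, keysz, scankey, page, P_HIKEY) >= cmpval))
{
    buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
    page = BufferGetPage(buf);
    opaque = (BTPageOpaque) PageGetSpecialPointer(page);
}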
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
btree-search-write-lock-1.patch (application/octet-stream)
On Tue, Jun 5, 2018 at 7:15 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
Currently _bt_search() always locks the leaf buffer in shared mode (aka BT_READ),
while the caller can relock it later. However, I don't see what prevents _bt_search()
from locking the leaf immediately in exclusive mode (aka BT_WRITE) when required.
[...] So I don't expect any performance regression to be caused by this patch.
I had a look at this patch and it looks good to me.
With the performance improvement you mentioned, this optimisation looks worthwhile.
--
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/
On 5 June 2018 at 14:45, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
Currently _bt_search() always locks the leaf buffer in shared mode (aka BT_READ),
while the caller can relock it later. However, I don't see what prevents _bt_search()
from locking the leaf immediately in exclusive mode (aka BT_WRITE) when required.
[...] Please find an implementation of this approach in the attached patch.
It's a good idea. How does it perform with many duplicate entries?
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Jun 11, 2018 at 1:06 PM Simon Riggs <simon@2ndquadrant.com> wrote:
It's a good idea. How does it perform with many duplicate entries?
Our B-tree currently maintains duplicates unordered. So, during insertion we can
traverse rightlinks in order to find a page that would fit the new index tuple.
However, in that case we're traversing pages in exclusive lock mode, and it happens
only after the relock; _bt_search() doesn't change anything there. So I assume there
shouldn't be any degradation in the case of many duplicate entries.
But this case is definitely worth testing, and I'm planning to do it.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Mon, Jun 11, 2018 at 9:30 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
On Mon, Jun 11, 2018 at 1:06 PM Simon Riggs <simon@2ndquadrant.com> wrote:
It's a good idea. How does it perform with many duplicate entries?
I agree that this is a good idea. It independently occurred to me to
do this. The L&Y algorithm doesn't take a position on this at all,
supposing that *no* locks are needed at all (that's why I recommend
people skip L&Y and just read the Lanin & Shasha paper -- L&Y is
unnecessarily confusing). This idea seems relatively low risk.
Our B-tree currently maintains duplicates unordered. So, during insertion we can
traverse rightlinks in order to find a page that would fit the new index tuple.
[...] So I assume there shouldn't be any degradation in the case of many duplicate
entries.
BTW, I have a prototype patch that makes the keys at the leaf level
unique. It is actually an implementation of nbtree suffix truncation
that sometimes *adds* a new attribute in pivot tuples, rather than
truncating away non-distinguishing attributes -- it adds a special
heap TID attribute when it must. The patch typically increases fan-in,
of course, but the real goal was to make all entries at the leaf level
truly unique, as L&Y intended (we cannot do it without suffix
truncation because that will kill fan-in). My prototype has full
amcheck coverage, which really helped me gain confidence in my
approach.
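To make the mechanism concrete, here is a self-contained toy model (plain C, not
the prototype itself) of choosing a pivot: keep attributes up to the first one that
distinguishes the last tuple on the left page from the first tuple on the right
page, and fall back to the heap TID only when the two are complete duplicates:

#include <stdio.h>

#define NATTS 3

typedef struct
{
    int  attrs[NATTS];
    long heap_tid;          /* stands in for the full ItemPointer */
} TupleModel;

/*
 * Returns how many user attributes the pivot needs; sets *need_tid when the
 * heap TID must be added because no user attribute distinguishes the tuples.
 */
static int
pivot_natts(const TupleModel *lastleft, const TupleModel *firstright,
            int *need_tid)
{
    for (int i = 0; i < NATTS; i++)
    {
        if (lastleft->attrs[i] != firstright->attrs[i])
        {
            *need_tid = 0;
            return i + 1;   /* keep up to the first distinguishing one */
        }
    }
    *need_tid = 1;          /* complete duplicate: TID is the tiebreaker */
    return NATTS;
}

int
main(void)
{
    TupleModel a = {{1, 7, 7}, 100};
    TupleModel b = {{1, 8, 2}, 200};
    TupleModel c = {{1, 8, 2}, 300};
    int need_tid;

    printf("natts = %d\n", pivot_natts(&a, &b, &need_tid));   /* 2: truncated */
    printf("natts = %d, need_tid = %d\n",
           pivot_natts(&b, &c, &need_tid), need_tid);         /* 3, 1 */
    return 0;
}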
There are still big problems with my patch, though. It seems to hurt
performance more often than it helps when there is a lot of
contention, so as things stand I don't see a path to making it
committable. I am still going to clean it up some more and post it to
-hackers, though. I still think it's quite interesting. I am not
pursuing it much further now because it seems like my timing is bad --
not because it seems like a bad idea. I can imagine something like
zheap making this patch important again. I can also imagine someone
seeing something that I missed.
--
Peter Geoghegan
On Thu, Jun 14, 2018 at 1:01 AM Peter Geoghegan <pg@bowt.ie> wrote:
I agree that this is a good idea. It independently occurred to me to do this.
[...] This idea seems relatively low risk.
Great. Unfortunately, we don't have access to the 72-core machine anymore. But I'll
run benchmarks on a smaller 40-core machine, which we do have access to. I hope
there will be some win from this patch there too.
BTW, I have a prototype patch that makes the keys at the leaf level unique. It is
actually an implementation of nbtree suffix truncation that sometimes *adds* a new
attribute in pivot tuples, rather than truncating away non-distinguishing attributes
-- it adds a special heap TID attribute when it must. The patch typically increases
fan-in, of course, but the real goal was to make all entries at the leaf level truly
unique, as L&Y intended (we cannot do it without suffix truncation because that will
kill fan-in). [...]
Could you please clarify what you mean by "fan-in"? As I understand it, higher
"fan-in" means more children on a single non-leaf page, and in turn "better
branching". Is my understanding correct?
If so, then making the leaf level truly unique without suffix truncation would kill
fan-in, because the additional heap TID attribute would increase pivot tuple size.
However, that doesn't look like an inevitable shortcoming, because we could store
the heap TID in the t_tid of pivot index tuples. Yes, we've used the t_tid offset
for storing the number of attributes in truncated tuples. But the heap TID is going
to be virtually the last attribute of the tuple, so pivot tuples containing a heap
TID are not going to be suffix-truncated anyway. For such tuples we could just not
set INDEX_ALT_TID_MASK, and they would be assumed to have all the attributes.
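For context, the convention referred to here is (abbreviated from PG 11's
src/include/access/nbtree.h):

/*
 * Pivot tuples set INDEX_ALT_TID_MASK in t_info and reuse the t_tid offset
 * to store the number of attributes they actually carry; tuples without the
 * mask are read as having all index attributes, which is what the suggestion
 * above relies on.
 */
#define INDEX_ALT_TID_MASK      INDEX_AM_RESERVED_BIT
#define BT_N_KEYS_OFFSET_MASK   0x0FFF

#define BTreeTupleGetNAtts(itup, rel) \
    ( \
        ((itup)->t_info & INDEX_ALT_TID_MASK) ? \
        (ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK) : \
        IndexRelationGetNumberOfAttributes(rel) \
    )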
So, my point is that it's not necessary to couple suffix truncation with making leaf
tuples unique. Regarding just making leaf tuples unique, I understand that it
wouldn't always be a win. When we have a lot of duplicates, the current
implementation searches among the pages containing those duplicates for the first
page with enough free space, while with a unique index we would always have to
insert into one particular page. Thus, the current implementation gives us better
filling of pages and (probably) faster insertion. But a unique index would give us
much better performance for retail tuple delete and update (currently we support
neither), and I believe future pluggable table access methods will use both.
Therefore, my idea is that there is a tradeoff between making a btree index unique
or non-unique. I think we will need an option for that. I'm not yet sure whether
this option should be user-visible or an automatic decision based on the table
access method used.
There are still big problems with my patch, though. It seems to hurt performance
more often than it helps when there is a lot of contention, so as things stand I
don't see a path to making it committable. [...] I can also imagine someone seeing
something that I missed.
I think joint community effort is necessary when a single person doesn't see a path
to making a patch committable. So, I'm looking forward to seeing your patch posted.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Thu, Jun 14, 2018 at 9:44 AM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
However, that doesn't look like an inevitable shortcoming, because we could store
the heap TID in the t_tid of pivot index tuples. Yes, we've used the t_tid offset
for storing the number of attributes in truncated tuples. But the heap TID is going
to be virtually the last attribute of the tuple, so pivot tuples containing a heap
TID are not going to be suffix-truncated anyway. For such tuples we could just not
set INDEX_ALT_TID_MASK, and they would be assumed to have all the attributes.
I had a patch [1] that did pretty much just that. I saw no contention issues or
adverse effects, but my testing of it was rather limited. The patch has rotted
significantly since then, sadly, and it's quite complex.
So, my point is that it's not necessary to couple suffix truncation with making leaf
tuples unique.
No, but the work involved in one and the other overlaps so much that it lends itself
to being done in the same patch.
Regarding just making leaf tuples unique, I understand that it wouldn't always be a
win. When we have a lot of duplicates, the current implementation searches among the
pages containing those duplicates for the first page with enough free space, while
with a unique index we would always have to insert into one particular page. Thus,
the current implementation gives us better filling of pages and (probably) faster
insertion.
Not at all. Insertion cost in unique indexes with lots of duplicates (it happens:
dead duplicates) grows quadratically with the number of duplicates, since each new
insertion has to scan through the existing duplicates; that's improved by making the
index unique and sorted.
Therefore, my idea is that there is a tradeoff between making a btree index unique
or non-unique. I think we will need an option for that.
We will need a flag somewhere to avoid having to require index rebuilds during
pg_upgrade. My old patch maintained backward compatibility, but when using an
older-version index, the sort order of TIDs could not be assumed, and thus many
optimizations had to be disabled.
But it is totally doable, and necessary unless you accept a very painful pg_upgrade.
[1]: /messages/by-id/CAGTBQpZ-kTRQiAa13xG1GNe461YOwrA-s-ycCQPtyFrpKTaDBQ@mail.gmail.com
On Thu, Jun 14, 2018 at 4:56 PM Claudio Freire <klaussfreire@gmail.com> wrote:
Not at all. Insertion cost in unique indexes with lots of duplicates (it happens:
dead duplicates) grows quadratically with the number of duplicates, and that's
improved by making the index unique and sorted.
Sorry, I mixed up the terms. I actually compared current non-unique indexes with
non-unique indexes that keep duplicate entries ordered by TID (which makes them
somewhat unique). I didn't really consider indexes that enforce unique constraints.
For those, insertion cost grows quadratically (as you mentioned) regardless of
whether we keep duplicates ordered by TID or not.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Thu, Jun 14, 2018 at 5:43 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
Could you please clarify what you mean by "fan-in"? As I understand it, higher
"fan-in" means more children on a single non-leaf page, and in turn "better
branching". Is my understanding correct?
Yes, your understanding is correct.
If so, then making the leaf level truly unique without suffix truncation would kill
fan-in, because the additional heap TID attribute would increase pivot tuple size.
You're right that that's theoretically possible. However, in practice
we can almost always truncate away the heap TID attribute from pivot
tuples. We only need the heap TID attribute when there is no other
distinguishing attribute.
However, that doesn't look like an inevitable shortcoming, because we could store
the heap TID in the t_tid of pivot index tuples.
But the offset doesn't have enough space for an entire TID.
So, my point is that it's not necessary to couple suffix truncation with making leaf
tuples unique.
It seems like by far the simplest way. I don't really care about
suffix truncation, though. I care about making the leaf tuples unique.
Suffix truncation can help.
Regarding just making leaf tuples unique, I understand that it wouldn't always be a
win. When we have a lot of duplicates, the current implementation searches among the
pages containing those duplicates for the first page with enough free space, while
with a unique index we would always have to insert into one particular page. Thus,
the current implementation gives us better filling of pages and (probably) faster
insertion.
You're definitely right that that's the weakness of the design.
But a unique index would give us much better performance for retail tuple delete and
update (currently we support neither), and I believe future pluggable table access
methods will use both.
Right.
Therefore, my idea is that there is a tradeoff between making a btree index unique
or non-unique. I think we will need an option for that. I'm not yet sure whether
this option should be user-visible or an automatic decision based on the table
access method used.
I agree.
--
Peter Geoghegan
On Thu, Jun 14, 2018 at 6:32 PM Peter Geoghegan <pg@bowt.ie> wrote:
However, that doesn't look like an inevitable shortcoming, because we could store
the heap TID in the t_tid of pivot index tuples.
But the offset doesn't have enough space for an entire TID.
Sorry, my bad. It slipped my mind that we use t_tid.ip_blkid of pivot tuples to
store downlinks :)
So, yes, an additional attribute is required to store the heap TID in a pivot tuple.
Anyway, I'm looking forward to seeing your patch posted, even if it's not yet in
perfect shape.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Mon, June 11, 2018 at 4:31 PM, Alexander Korotkov wrote:
Our B-tree currently maintains duplicates unordered. So, during insertion we can
traverse rightlinks in order to find a page that would fit the new index tuple.
[...] So I assume there shouldn't be any degradation in the case of many duplicate
entries. But this case is definitely worth testing, and I'm planning to do it.
Hi, I'm reviewing this.
Firstly, I did performance tests on a 72-core machine (AWS c5.18xlarge), the same as
you did.
(Same postgresql.conf, DDL, scripts, and pgbench commands as above.)
# your results
ordered, master: 157473 TPS
ordered, patched: 231374 TPS
unordered, master: 232372 TPS
unordered, patched: 232535 TPS
# my results
ordered, master: 186045 TPS
ordered, patched: 265842 TPS
unordered, master: 313011 TPS
unordered, patched: 309636 TPS
I confirmed that this patch improves ordered insertion.
In addition to your tests, I did the following test with many duplicate entries.
# DDL
CREATE UNLOGGED TABLE duplicated (i integer not null, value text not null);
# script_duplicated.sql
INSERT INTO unordered VALUES (1, 'abcdefghijklmnoprsqtuvwxyz');
# commands
pgbench -T 60 -P 1 -M prepared -f script_duplicated.sql -c 150 -j 150 postgres
# results
duplicated, master: 315450 TPS
duplicated, patched: 311584 TPS
It seems there is almost no performance degradation in the case of many duplicated
entries.
I'm planning to do a code review and send it in the next mail.
Yoshikazu Imai
Hi!
On Mon, Jul 9, 2018 at 6:19 AM Imai, Yoshikazu
<imai.yoshikazu@jp.fujitsu.com> wrote:
Hi, I'm reviewing this.
Great. Thank you for giving attention to this patch.
Firstly, I did performance tests on a 72-core machine (AWS c5.18xlarge), the same as
you did.
OK. But note that c5.18xlarge is a 72-vCPU machine, which AFAIK is close to the
performance of 36 physical cores.
# your results
ordered, master: 157473 TPS
ordered, patched: 231374 TPS
unordered, master: 232372 TPS
unordered, patched: 232535 TPS
# my results
ordered, master: 186045 TPS
ordered, patched: 265842 TPS
unordered, master: 313011 TPS
unordered, patched: 309636 TPS
I confirmed that this patch improves ordered insertion.
Good, but it looks like a 1% regression in the unordered case.
In addition to your tests, I did the following test with many duplicate entries.
# DDL
CREATE UNLOGGED TABLE duplicated (i integer not null, value text not null);
# script_duplicated.sql
INSERT INTO unordered VALUES (1, 'abcdefghijklmnoprsqtuvwxyz');
# commands
pgbench -T 60 -P 1 -M prepared -f script_duplicated.sql -c 150 -j 150 postgres
# results
duplicated, master: 315450 TPS
duplicated, patched: 311584 TPS
It seems there is almost no performance degradation in the case of many duplicated
entries.
In this case it also looks like we observed a 1% regression. Though 1% may seem very
small, I think we should clarify whether it really exists. I have at least two
hypotheses about this.
1) There is no real regression; the observed difference in TPS is less than the
measurement error. In order to check that, we need to retry the experiment multiple
times. Also, if you run the benchmark on master before the patched version (or vice
versa), you should try swapping the order to make sure the order of the benchmarks
has no influence.
2) If we consider the relation between TPS and the number of clients, TPS typically
grows with the number of clients until it reaches some saturation value, after which
TPS degrades. If the patch lowers some latency, that may cause saturation to happen
earlier. In order to check that, we need to run benchmarks with various numbers of
clients and draw a graph of TPS depending on the number of clients.
So, may I ask you to run more experiments in order to clarify the observed
regression?
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Mon, Jul 9, 2018 at 9:43 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
In this case it also looks like we observed a 1% regression. Though 1% may seem very
small, I think we should clarify whether it really exists. I have at least two
hypotheses about this. [...] So, may I ask you to run more experiments in order to
clarify the observed regression?
It would be nice to actually see script_duplicated.sql. I don't know
exactly what the test case was.
Here is my wild guess: You may end up moving right more often within
_bt_findinsertloc(), which is actually worse than moving right within
_bt_moveright(), even when you _bt_moveright() in exclusive mode.
_bt_findinsertloc() couples/crabs exclusive buffer locks because the
unique case requires it, even when we're not inserting into a unique
index. Whereas _bt_moveright() holds at most one buffer lock at a
time.
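To spell out the contrast, here is a heavily abbreviated sketch based on PG 11's
nbtinsert.c and nbtsearch.c (error handling and dead-page checks omitted):

/*
 * _bt_findinsertloc(): the right sibling's exclusive lock is acquired
 * BEFORE the current page's lock is released (lock coupling/crabbing),
 * so concurrent inserters effectively queue behind one another.
 */
while (PageGetFreeSpace(page) < itemsz && !P_RIGHTMOST(lpageop))
{
    Buffer      rbuf = _bt_getbuf(rel, lpageop->btpo_next, BT_WRITE);

    _bt_relbuf(rel, buf);       /* current lock dropped only after next is held */
    buf = rbuf;
    page = BufferGetPage(buf);
    lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
}

/*
 * _bt_moveright(), by contrast, releases the current lock before taking
 * the next one, holding at most one buffer lock at a time:
 */
buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);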
--
Peter Geoghegan
On 9 July 2018 at 04:18, Imai, Yoshikazu <imai.yoshikazu@jp.fujitsu.com> wrote:
# script_ordered.sql
INSERT INTO ordered (value) VALUES ('abcdefghijklmnoprsqtuvwxyz');
# script_unordered.sql
\set i random(1, 1000000)
INSERT INTO unordered VALUES (:i, 'abcdefghijklmnoprsqtuvwxyz');
# your results
ordered, master: 157473 TPS
ordered, patched: 231374 TPS
unordered, master: 232372 TPS
unordered, patched: 232535 TPS
# my results
ordered, master: 186045 TPS
ordered, patched: 265842 TPS
unordered, master: 313011 TPS
unordered, patched: 309636 TPS
I confirmed that this patch improves ordered insertion.
Looks good.
Please can you check insertion with an index on 2 keys:
1st key has 10,000 values
2nd key has a monotonically increasing value within each 1st key value
So each session picks one 1st key value, and then each new INSERT uses a higher
value of the 2nd key: (1,1), then (1,2), then (1,3), etc.
Probably easier to do this with a table like this:
CREATE UNLOGGED TABLE ordered2 (id integer, logdate timestamp default
now(), value text not null, primary key (id, logdate));
# script_ordered2.sql
\set i random(1, 10000)
INSERT INTO ordered2 (id, value) VALUES (:i, 'abcdefghijklmnoprsqtuvwxyz');
Thanks
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Jul 9, 2018 at 8:18 PM Peter Geoghegan <pg@bowt.ie> wrote:
Here is my wild guess: You may end up moving right more often within
_bt_findinsertloc(), which is actually worse than moving right within
_bt_moveright(), even when you _bt_moveright() in exclusive mode.
_bt_findinsertloc() couples/crabs exclusive buffer locks because the unique case
requires it, even when we're not inserting into a unique index. Whereas
_bt_moveright() holds at most one buffer lock at a time.
I'm sorry, but I didn't understand this guess. I agree that moving right within
_bt_findinsertloc() might be worse than moving right within _bt_moveright(). But why
should it happen more often, if both with and without the patch it happens only
after _bt_moveright() in exclusive mode (with the patch, _bt_search() calls
_bt_moveright() in exclusive mode)?
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Mon, Jul 9, 2018 at 3:23 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
I'm sorry, but I didn't understand this guess. I agree that moving right within
_bt_findinsertloc() might be worse than moving right within _bt_moveright(). But why
should it happen more often, if both with and without the patch it happens only
after _bt_moveright() in exclusive mode (with the patch, _bt_search() calls
_bt_moveright() in exclusive mode)?
_bt_moveright() doesn't serialize access, because it doesn't couple
buffer locks. I think that it's possible that it's slower for that
reason -- not because it moves right more often.
"The Convoy Phenomenon" research report may be worth a read sometime,
though it's very old. It's co-authored by Jim Gray. As they put it,
sometimes "a non-FIFO strategy" can be faster. In simple terms,
sometimes it can be faster to randomly grant a lock, since in practice
nobody gets starved out. I'm guessing (and it is very much a guess)
that it could be something like that, since the behavior of
_bt_findinsertloc() can be FIFO-like, whereas the behavior of
_bt_moveright() may not be.
Again, the actual queries would help a lot. It's just a guess.
--
Peter Geoghegan
On Mon, July 9, 2018 at 4:44 PM, Alexander Korotkov wrote:
Firstly, I did performance tests on a 72-core machine (AWS c5.18xlarge), the same as
you did.
OK. But note that c5.18xlarge is a 72-vCPU machine, which AFAIK is close to the
performance of 36 physical cores.
Thanks for pointing that out. I referred to /proc/cpuinfo and confirmed there are 36
physical cores.
In this case it also looks like we observed a 1% regression. Though 1% may seem very
small, I think we should clarify whether it really exists. I have at least two
hypotheses about this. [...] So, may I ask you to run more experiments in order to
clarify the observed regression?
I ran experiment 2), varying the clients parameter over 18, 36, 54, and 72.
While doing the experiment, I noticed that results of pgbench with 36 clients
improve after executing pgbench with 72 clients. I don't know why this occurs, but
anyway, in this experiment, I executed pgbench with 72 clients before the other runs
(i.e. -c 72, -c 18, -c 36, -c 54, -c 72).
I ran the experiments against master and patched in interleaved order (e.g. master,
patched, patched, master, master, patched, patched, master).
# results of changing clients(18, 36, 54, 72 clients)
master, -c 18 -j 18: Ave 400410 TPS (407615,393942,401845,398241)
master, -c 36 -j 36: Ave 415616 TPS (411939,400742,424855,424926)
master, -c 54 -j 54: Ave 378734 TPS (401646,354084,408044,351163)
master, -c 72 -j 72: Ave 360864 TPS (367718,360029,366432,349277)
patched, -c 18 -j 18: Ave 393115 TPS (382854,396396,395530,397678)
patched, -c 36 -j 36: Ave 390328 TPS (376100,404873,383498,396840)
patched, -c 54 -j 54: Ave 364894 TPS (365533,373064,354250,366727)
patched, -c 72 -j 72: Ave 353982 TPS (355237,357601,345536,357553)
It seems saturation happens between 18 and 36 clients, so I also experimented with
27 clients.
# results of changing clients(27 clients)
master, -c 27 -j 27: Ave 416756 (423512,424241,399241,420030) TPS
patched, -c 27 -j 27: Ave 413568 (410187,404291,420152,419640) TPS
I created a graph and attached it to this mail ("detecting saturation.png").
Referring to the graph, the patched version's saturation happens earlier than
master's, as you expected.
But even the patched version's nearly saturated TPS value shows a small regression
from master's, I think.
Are there other experiments I should do about this?
Yoshikazu Imai
Attachments:
detecting saturation.png
On Mon, Jul 9, 2018 at 8:18 PM Peter Geoghegan <pg@bowt.ie> wrote:
It would be nice to actually see script_duplicated.sql. I don't know exactly what
the test case was.
BTW, the contents of script_duplicated.sql were posted by Imai, Yoshikazu in the
message body:
# script_duplicated.sql
INSERT INTO unordered VALUES (1, 'abcdefghijklmnoprsqtuvwxyz');
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Tue, Jul 10, 2018 at 2:19 PM Imai, Yoshikazu
<imai.yoshikazu@jp.fujitsu.com> wrote:
I ran experiment 2), varying the clients parameter over 18, 36, 54, and 72. [...]
I created a graph and attached it to this mail ("detecting saturation.png").
Referring to the graph, the patched version's saturation happens earlier than
master's, as you expected. But even the patched version's nearly saturated TPS value
shows a small regression from master's, I think. Are there other experiments I
should do about this?
Thank you for the experiments! It seems that there is a real regression here...
BTW, which script were you using in this benchmark: script_unordered.sql or
script_duplicated.sql?
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 2018/07/10 20:36, Alexander Korotkov wrote:
Thank you for the experiments! It seems that there is a real regression here...
BTW, which script were you using in this benchmark: script_unordered.sql or
script_duplicated.sql?
Sorry, I forgot to write that. I used script_unordered.sql.
Yoshikazu Imai