brininsert optimization opportunity

Started by Soumyadeep Chakraborty · almost 3 years ago · 38 messages · pgsql-hackers
#1 Soumyadeep Chakraborty
soumyadeep2007@gmail.com

Hello hackers,

My colleague, Ashwin, pointed out to me that brininsert's per-tuple init
of the revmap access struct can have non-trivial overhead.

Turns out he is right. We are saving 24 bytes of memory per-call for
the access struct, and a bit on buffer/locking overhead, with the
attached patch.

The implementation registers the revmap cleanup as a MemoryContext
callback on the IndexInfo struct's MemoryContext, as the index AM
provides no teardown function for end-of-insert-command.

Test setup (local Ubuntu workstation):

# Drop caches and restart between each run:
sudo sh -c "sync; echo 3 > /proc/sys/vm/drop_caches;"
pg_ctl -D /usr/local/pgsql/data/ -l /tmp/logfile restart

\timing
DROP TABLE heap;
CREATE TABLE heap(i int);
CREATE INDEX ON heap USING brin(i) WITH (pages_per_range=1);
INSERT INTO heap SELECT 1 FROM generate_series(1, 200000000);

Results:
We see an improvement for 100M tuples and an even bigger improvement for
200M tuples.

Master (29cf61ade3f245aa40f427a1d6345287ef77e622):

test=# INSERT INTO heap SELECT 1 FROM generate_series(1, 100000000);
INSERT 0 100000000
Time: 222762.159 ms (03:42.762)

-- 3 runs
test=# INSERT INTO heap SELECT 1 FROM generate_series(1, 200000000);
INSERT 0 200000000
Time: 471168.181 ms (07:51.168)
Time: 457071.883 ms (07:37.072)
Time: 486969.205 ms (08:06.969)

Branch:

test2=# INSERT INTO heap SELECT 1 FROM generate_series(1, 100000000);
INSERT 0 100000000
Time: 200046.519 ms (03:20.047)

-- 3 runs
test2=# INSERT INTO heap SELECT 1 FROM generate_series(1, 200000000);
INSERT 0 200000000
Time: 369041.832 ms (06:09.042)
Time: 365483.382 ms (06:05.483)
Time: 375506.144 ms (06:15.506)

# Profiled backend running INSERT of 100000000 rows
sudo perf record -p 11951 --call-graph fp sleep 180

Please see attached perf diff between master and branch. We see that we
save on a bit of overhead from brinRevmapInitialize(),
brinRevmapTerminate() and lock routines.

Regards,
Soumyadeep (VMware)

Attachments:

perf_diff.out (application/octet-stream)
v1-0001-Reuse-revmap-and-brin-desc-in-brininsert.patch (text/x-patch, +59 −12)
#2 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Soumyadeep Chakraborty (#1)
Re: brininsert optimization opportunity

On 2023-Jul-03, Soumyadeep Chakraborty wrote:

My colleague, Ashwin, pointed out to me that brininsert's per-tuple init
of the revmap access struct can have non-trivial overhead.

Turns out he is right. We are saving 24 bytes of memory per-call for
the access struct, and a bit on buffer/locking overhead, with the
attached patch.

Hmm, yeah, I remember being bit bothered by this repeated
initialization. Your patch looks reasonable to me. I would set
bistate->bs_rmAccess to NULL in the cleanup callback, just to be sure.
Also, please add comments atop these two new functions, to explain what
they are.

Nice results.

--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/

#3 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Alvaro Herrera (#2)
Re: brininsert optimization opportunity

On 7/4/23 13:23, Alvaro Herrera wrote:

On 2023-Jul-03, Soumyadeep Chakraborty wrote:

My colleague, Ashwin, pointed out to me that brininsert's per-tuple init
of the revmap access struct can have non-trivial overhead.

Turns out he is right. We are saving 24 bytes of memory per-call for
the access struct, and a bit on buffer/locking overhead, with the
attached patch.

Hmm, yeah, I remember being bit bothered by this repeated
initialization. Your patch looks reasonable to me. I would set
bistate->bs_rmAccess to NULL in the cleanup callback, just to be sure.
Also, please add comments atop these two new functions, to explain what
they are.

Nice results.

Yeah. I wonder how much of that runtime is the generate_series(),
though. What's the speedup if that part is subtracted. It's guaranteed
to be even more significant, but by how much?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#4 Soumyadeep Chakraborty
soumyadeep2007@gmail.com
In reply to: Tomas Vondra (#3)
Re: brininsert optimization opportunity

Thank you both for reviewing!

On Tue, Jul 4, 2023 at 4:24AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

Hmm, yeah, I remember being bit bothered by this repeated
initialization. Your patch looks reasonable to me. I would set
bistate->bs_rmAccess to NULL in the cleanup callback, just to be sure.
Also, please add comments atop these two new functions, to explain what
they are.

Done. Set bistate->bs_desc = NULL; as well. Added comments.

On Tue, Jul 4, 2023 at 4:59AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

Yeah. I wonder how much of that runtime is the generate_series(),
though. What's the speedup if that part is subtracted. It's guaranteed
to be even more significant, but by how much?

When trying COPY, I got tripped by the following:

We get a buffer leak WARNING for the meta page and a revmap page.

WARNING: buffer refcount leak: [094] (rel=base/156912/206068,
blockNum=1, flags=0x83000000, refcount=1 1)
WARNING: buffer refcount leak: [093] (rel=base/156912/206068,
blockNum=0, flags=0x83000000, refcount=1 1)

PrintBufferLeakWarning bufmgr.c:3240
ResourceOwnerReleaseInternal resowner.c:554
ResourceOwnerRelease resowner.c:494
PortalDrop portalmem.c:563
exec_simple_query postgres.c:1284

We release the buffer during this resowner release and then we crash
with:

TRAP: failed Assert("bufnum <= NBuffers"), File:
"../../../../src/include/storage/bufmgr.h", Line: 305, PID: 86833
postgres: pivotal test4 [local] COPY(ExceptionalCondition+0xbb)[0x5572b55bcc79]
postgres: pivotal test4 [local] COPY(+0x61ccfc)[0x5572b537dcfc]
postgres: pivotal test4 [local] COPY(ReleaseBuffer+0x19)[0x5572b5384db2]
postgres: pivotal test4 [local] COPY(brinRevmapTerminate+0x1e)[0x5572b4e3fd39]
postgres: pivotal test4 [local] COPY(+0xcfc44)[0x5572b4e30c44]
postgres: pivotal test4 [local] COPY(+0x89e7f2)[0x5572b55ff7f2]
postgres: pivotal test4 [local] COPY(MemoryContextDelete+0xd7)[0x5572b55ff683]
postgres: pivotal test4 [local] COPY(PortalDrop+0x374)[0x5572b5602dc7]

Unfortunately, when we do COPY, the MemoryContext where makeIndexInfo
gets called is PortalContext and that is what is set in ii_Context.
Furthermore, we clean up the resource owner stuff before we can clean
up the MemoryContexts in PortalDrop().

The CurrentMemoryContext when initialize_brin_insertstate() is called
varies: for CopyMultiInsertBufferFlush() -> ExecInsertIndexTuples() it
is PortalContext, and for CopyFrom() -> ExecInsertIndexTuples() it is
ExecutorState/ExprContext. We can't rely on it to register the callback
either.

What we can do is create a new MemoryContext for holding the
BrinInsertState, and we tie the callback to that so that cleanup is not
affected by all of these variables. See v2 patch attached. Passes make
installcheck-world and make installcheck -C src/test/modules/brin.

However, we do still have 1 issue with the v2 patch:
When we try to cancel (Ctrl-c) a running COPY command:
ERROR: buffer 151 is not owned by resource owner TopTransaction

#4 0x0000559cbc54a934 in ResourceOwnerForgetBuffer
(owner=0x559cbd6fcf28, buffer=143) at resowner.c:997
#5 0x0000559cbc2c45e7 in UnpinBuffer (buf=0x7f8d4a8f3f80) at bufmgr.c:2390
#6 0x0000559cbc2c7e49 in ReleaseBuffer (buffer=143) at bufmgr.c:4488
#7 0x0000559cbbd82d53 in brinRevmapTerminate (revmap=0x559cbd7a03b8)
at brin_revmap.c:105
#8 0x0000559cbbd73c44 in brininsertCleanupCallback
(arg=0x559cbd7a5b68) at brin.c:168
#9 0x0000559cbc54280c in MemoryContextCallResetCallbacks
(context=0x559cbd7a5a50) at mcxt.c:506
#10 0x0000559cbc54269d in MemoryContextDelete (context=0x559cbd7a5a50)
at mcxt.c:421
#11 0x0000559cbc54273e in MemoryContextDeleteChildren
(context=0x559cbd69ae90) at mcxt.c:457
#12 0x0000559cbc54625c in AtAbort_Portals () at portalmem.c:850

Haven't found a way to fix this ^ yet.

Maybe there is a better way of doing our cleanup? I'm not sure. Would
love your input!

The other alternative for all this is to introduce new AM callbacks for
insert_begin and insert_end. That might be a tougher sell?

Now, to finally answer your question about the speedup without
generate_series(). We do see an even higher speedup!

seq 1 200000000 > /tmp/data.csv
\timing
DROP TABLE heap;
CREATE TABLE heap(i int);
CREATE INDEX ON heap USING brin(i) WITH (pages_per_range=1);
COPY heap FROM '/tmp/data.csv';

-- 3 runs (master 29cf61ade3f245aa40f427a1d6345287ef77e622)
COPY 200000000
Time: 205072.444 ms (03:25.072)
Time: 215380.369 ms (03:35.380)
Time: 203492.347 ms (03:23.492)

-- 3 runs (branch v2)

COPY 200000000
Time: 135052.752 ms (02:15.053)
Time: 135093.131 ms (02:15.093)
Time: 138737.048 ms (02:18.737)

Regards,
Soumyadeep (VMware)

Attachments:

v2-0001-Reuse-revmap-and-brin-desc-in-brininsert.patch (text/x-patch, +78 −12)
perf_diff_v2.out (application/octet-stream)
#5 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Soumyadeep Chakraborty (#4)
Re: brininsert optimization opportunity

On 7/4/23 21:25, Soumyadeep Chakraborty wrote:

Thank you both for reviewing!

On Tue, Jul 4, 2023 at 4:24AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

Hmm, yeah, I remember being bit bothered by this repeated
initialization. Your patch looks reasonable to me. I would set
bistate->bs_rmAccess to NULL in the cleanup callback, just to be sure.
Also, please add comments atop these two new functions, to explain what
they are.

Done. Set bistate->bs_desc = NULL; as well. Added comments.

On Tue, Jul 4, 2023 at 4:59AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

Yeah. I wonder how much of that runtime is the generate_series(),
though. What's the speedup if that part is subtracted. It's guaranteed
to be even more significant, but by how much?

When trying COPY, I got tripped by the following:

We get a buffer leak WARNING for the meta page and a revmap page.

WARNING: buffer refcount leak: [094] (rel=base/156912/206068,
blockNum=1, flags=0x83000000, refcount=1 1)
WARNING: buffer refcount leak: [093] (rel=base/156912/206068,
blockNum=0, flags=0x83000000, refcount=1 1)

PrintBufferLeakWarning bufmgr.c:3240
ResourceOwnerReleaseInternal resowner.c:554
ResourceOwnerRelease resowner.c:494
PortalDrop portalmem.c:563
exec_simple_query postgres.c:1284

We release the buffer during this resowner release and then we crash
with:

TRAP: failed Assert("bufnum <= NBuffers"), File:
"../../../../src/include/storage/bufmgr.h", Line: 305, PID: 86833
postgres: pivotal test4 [local] COPY(ExceptionalCondition+0xbb)[0x5572b55bcc79]
postgres: pivotal test4 [local] COPY(+0x61ccfc)[0x5572b537dcfc]
postgres: pivotal test4 [local] COPY(ReleaseBuffer+0x19)[0x5572b5384db2]
postgres: pivotal test4 [local] COPY(brinRevmapTerminate+0x1e)[0x5572b4e3fd39]
postgres: pivotal test4 [local] COPY(+0xcfc44)[0x5572b4e30c44]
postgres: pivotal test4 [local] COPY(+0x89e7f2)[0x5572b55ff7f2]
postgres: pivotal test4 [local] COPY(MemoryContextDelete+0xd7)[0x5572b55ff683]
postgres: pivotal test4 [local] COPY(PortalDrop+0x374)[0x5572b5602dc7]

Unfortunately, when we do COPY, the MemoryContext where makeIndexInfo
gets called is PortalContext and that is what is set in ii_Context.
Furthermore, we clean up the resource owner stuff before we can clean
up the MemoryContexts in PortalDrop().

The CurrentMemoryContext when initialize_brin_insertstate() is called
varies: for CopyMultiInsertBufferFlush() -> ExecInsertIndexTuples() it
is PortalContext, and for CopyFrom() -> ExecInsertIndexTuples() it is
ExecutorState/ExprContext. We can't rely on it to register the callback
either.

What we can do is create a new MemoryContext for holding the
BrinInsertState, and we tie the callback to that so that cleanup is not
affected by all of these variables. See v2 patch attached. Passes make
installcheck-world and make installcheck -C src/test/modules/brin.

However, we do still have 1 issue with the v2 patch:
When we try to cancel (Ctrl-c) a running COPY command:
ERROR: buffer 151 is not owned by resource owner TopTransaction

I'm not sure if memory context callbacks are the right way to rely on
for this purpose. The primary purpose of memory contexts is to track
memory, so using them for this seems a bit weird.

There are cases that do something similar, like expandedrecord.c where
we track a refcounted tuple slot, but IMHO there's a big difference
between tracking one slot allocated right there, and an unknown number
of buffers allocated much later.

The fact that even with the extra context it still doesn't handle query
cancellations is another argument against that approach (I wonder how
expandedrecord.c handles that, but I haven't checked).

Maybe there is a better way of doing our cleanup? I'm not sure. Would
love your input!

The other alternative for all this is to introduce new AM callbacks for
insert_begin and insert_end. That might be a tougher sell?

That's the approach I wanted to suggest, more or less - to do the
cleanup from ExecCloseIndices() before index_close(). I wonder if it's
even correct to do that later, once we release the locks etc.

I don't think ii_AmCache was intended for stuff like this - GIN and GiST
only use it to cache stuff that can be just pfree-d, but for buffers
that's not enough. It's not surprising we need to improve this.

FWIW while debugging this (breakpoint on MemoryContextDelete) I was
rather annoyed the COPY keeps dropping and recreating the two BRIN
contexts - brininsert cxt / brin dtuple. I wonder if we could keep and
reuse those too, but I don't know how much it'd help.

Now, to finally answer your question about the speedup without
generate_series(). We do see an even higher speedup!

seq 1 200000000 > /tmp/data.csv
\timing
DROP TABLE heap;
CREATE TABLE heap(i int);
CREATE INDEX ON heap USING brin(i) WITH (pages_per_range=1);
COPY heap FROM '/tmp/data.csv';

-- 3 runs (master 29cf61ade3f245aa40f427a1d6345287ef77e622)
COPY 200000000
Time: 205072.444 ms (03:25.072)
Time: 215380.369 ms (03:35.380)
Time: 203492.347 ms (03:23.492)

-- 3 runs (branch v2)

COPY 200000000
Time: 135052.752 ms (02:15.053)
Time: 135093.131 ms (02:15.093)
Time: 138737.048 ms (02:18.737)

That's nice, but it still doesn't say how much of that is reading the
data. If you do just copy into a table without any indexes, how long
does it take?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#6 Soumyadeep Chakraborty
soumyadeep2007@gmail.com
In reply to: Tomas Vondra (#5)
Re: brininsert optimization opportunity

On Tue, Jul 4, 2023 at 2:54 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

I'm not sure if memory context callbacks are the right way to rely on
for this purpose. The primary purpose of memory contexts is to track
memory, so using them for this seems a bit weird.

Yeah, this just kept getting dirtier and dirtier.

There are cases that do something similar, like expandedrecord.c where
we track a refcounted tuple slot, but IMHO there's a big difference
between tracking one slot allocated right there, and an unknown number
of buffers allocated much later.

Yeah, the following code in ER_mc_callback is there, I think, to prevent
double-freeing the tupdesc (since it might be freed in
ResourceOwnerReleaseInternal()) (the part about: /* Ditto for tupdesc
references */).

if (tupdesc->tdrefcount > 0)
{
    if (--tupdesc->tdrefcount == 0)
        FreeTupleDesc(tupdesc);
}

Plus the above code doesn't try anything with the ResourceOwner stuff,
whereas
releasing a buffer means:
ReleaseBuffer() -> UnpinBuffer() ->
ResourceOwnerForgetBuffer(CurrentResourceOwner, b);

The fact that even with the extra context it still doesn't handle query
cancellations is another argument against that approach (I wonder how
expandedrecord.c handles that, but I haven't checked).

Maybe there is a better way of doing our cleanup? I'm not sure. Would
love your input!

The other alternative for all this is to introduce new AM callbacks for
insert_begin and insert_end. That might be a tougher sell?

That's the approach I wanted to suggest, more or less - to do the
cleanup from ExecCloseIndices() before index_close(). I wonder if it's
even correct to do that later, once we release the locks etc.

I'll try this out and introduce a couple of new index AM callbacks. I
think it's best to do it before releasing the locks - otherwise it might
be weird to manipulate buffers of an index relation without having some
sort of lock on it. I'll think about it some more.

I don't think ii_AmCache was intended for stuff like this - GIN and GiST
only use it to cache stuff that can be just pfree-d, but for buffers
that's not enough. It's not surprising we need to improve this.

Hmmm, yes, although the docs state:
"If the index AM wishes to cache data across successive index insertions within
an SQL statement, it can allocate space in indexInfo->ii_Context and
store a pointer
to the data in indexInfo->ii_AmCache (which will be NULL initially)."
they don't mention anything about buffer usage. Well we will fix it!

PS: It should be possible to make GIN and GiST use the new index AM APIs
as well.

FWIW while debugging this (breakpoint on MemoryContextDelete) I was
rather annoyed the COPY keeps dropping and recreating the two BRIN
contexts - brininsert cxt / brin dtuple. I wonder if we could keep and
reuse those too, but I don't know how much it'd help.

Interesting, I will investigate that.

Now, to finally answer your question about the speedup without
generate_series(). We do see an even higher speedup!

seq 1 200000000 > /tmp/data.csv
\timing
DROP TABLE heap;
CREATE TABLE heap(i int);
CREATE INDEX ON heap USING brin(i) WITH (pages_per_range=1);
COPY heap FROM '/tmp/data.csv';

-- 3 runs (master 29cf61ade3f245aa40f427a1d6345287ef77e622)
COPY 200000000
Time: 205072.444 ms (03:25.072)
Time: 215380.369 ms (03:35.380)
Time: 203492.347 ms (03:23.492)

-- 3 runs (branch v2)

COPY 200000000
Time: 135052.752 ms (02:15.053)
Time: 135093.131 ms (02:15.093)
Time: 138737.048 ms (02:18.737)

That's nice, but it still doesn't say how much of that is reading the
data. If you do just copy into a table without any indexes, how long
does it take?

So, I loaded the same heap table without any indexes and at the same
scale. I got:

postgres=# COPY heap FROM '/tmp/data.csv';
COPY 200000000
Time: 116161.545 ms (01:56.162)
Time: 114182.745 ms (01:54.183)
Time: 114975.368 ms (01:54.975)

perf diff also attached between the three: w/ no indexes (baseline),
master and v2.

Regards,
Soumyadeep (VMware)

Attachments:

perf_diff_3way.out (application/octet-stream)
#7 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Soumyadeep Chakraborty (#6)
Re: brininsert optimization opportunity

On 7/5/23 02:35, Soumyadeep Chakraborty wrote:

On Tue, Jul 4, 2023 at 2:54 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

I'm not sure if memory context callbacks are the right way to rely on
for this purpose. The primary purpose of memory contexts is to track
memory, so using them for this seems a bit weird.

Yeah, this just kept getting dirtier and dirtier.

There are cases that do something similar, like expandedrecord.c where
we track a refcounted tuple slot, but IMHO there's a big difference
between tracking one slot allocated right there, and an unknown number
of buffers allocated much later.

Yeah, the following code in ER_mc_callback is there, I think, to prevent
double-freeing the tupdesc (since it might be freed in
ResourceOwnerReleaseInternal()) (the part about: /* Ditto for tupdesc
references */).

if (tupdesc->tdrefcount > 0)
{
    if (--tupdesc->tdrefcount == 0)
        FreeTupleDesc(tupdesc);
}

Plus the above code doesn't try anything with the ResourceOwner stuff,
whereas
releasing a buffer means:
ReleaseBuffer() -> UnpinBuffer() ->
ResourceOwnerForgetBuffer(CurrentResourceOwner, b);

Well, my understanding is the expandedrecord.c code increments the
refcount exactly to prevent the resource owner from releasing the slot
too early. My assumption is we'd need to do something similar for the
revmap buffers by calling IncrBufferRefCount or something. But that's
going to be messy, because the buffers are read elsewhere.

The fact that even with the extra context it still doesn't handle query
cancellations is another argument against that approach (I wonder how
expandedrecord.c handles that, but I haven't checked).

Maybe there is a better way of doing our cleanup? I'm not sure. Would
love your input!

The other alternative for all this is to introduce new AM callbacks for
insert_begin and insert_end. That might be a tougher sell?

That's the approach I wanted to suggest, more or less - to do the
cleanup from ExecCloseIndices() before index_close(). I wonder if it's
even correct to do that later, once we release the locks etc.

I'll try this out and introduce a couple of new index AM callbacks. I
think it's best to do it before releasing the locks - otherwise it might
be weird to manipulate buffers of an index relation without having some
sort of lock on it. I'll think about it some more.

I don't understand why this would need more than just a callback to
release the cache.

I don't think ii_AmCache was intended for stuff like this - GIN and GiST
only use it to cache stuff that can be just pfree-d, but for buffers
that's not enough. It's not surprising we need to improve this.

Hmmm, yes, although the docs state:
"If the index AM wishes to cache data across successive index insertions within
an SQL statement, it can allocate space in indexInfo->ii_Context and
store a pointer
to the data in indexInfo->ii_AmCache (which will be NULL initially)."
they don't mention anything about buffer usage. Well we will fix it!

PS: It should be possible to make GIN and GiST use the new index AM APIs
as well.

Why should GIN/GiST use the new API? I think it's perfectly sensible to
only require the "cleanup callback" when just pfree() is not enough.

FWIW while debugging this (breakpoint on MemoryContextDelete) I was
rather annoyed the COPY keeps dropping and recreating the two BRIN
contexts - brininsert cxt / brin dtuple. I wonder if we could keep and
reuse those too, but I don't know how much it'd help.

Interesting, I will investigate that.

Now, to finally answer your question about the speedup without
generate_series(). We do see an even higher speedup!

seq 1 200000000 > /tmp/data.csv
\timing
DROP TABLE heap;
CREATE TABLE heap(i int);
CREATE INDEX ON heap USING brin(i) WITH (pages_per_range=1);
COPY heap FROM '/tmp/data.csv';

-- 3 runs (master 29cf61ade3f245aa40f427a1d6345287ef77e622)
COPY 200000000
Time: 205072.444 ms (03:25.072)
Time: 215380.369 ms (03:35.380)
Time: 203492.347 ms (03:23.492)

-- 3 runs (branch v2)

COPY 200000000
Time: 135052.752 ms (02:15.053)
Time: 135093.131 ms (02:15.093)
Time: 138737.048 ms (02:18.737)

That's nice, but it still doesn't say how much of that is reading the
data. If you do just copy into a table without any indexes, how long
does it take?

So, I loaded the same heap table without any indexes and at the same
scale. I got:

postgres=# COPY heap FROM '/tmp/data.csv';
COPY 200000000
Time: 116161.545 ms (01:56.162)
Time: 114182.745 ms (01:54.183)
Time: 114975.368 ms (01:54.975)

OK, so the baseline is 115 seconds. With the current code, an index
means +100 seconds. With the optimization, it's just +20. Nice.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#8 Soumyadeep Chakraborty
soumyadeep2007@gmail.com
In reply to: Tomas Vondra (#7)
Re: brininsert optimization opportunity

On Wed, Jul 5, 2023 at 3:16 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

I'll try this out and introduce a couple of new index AM callbacks. I
think it's best to do it before releasing the locks - otherwise it might
be weird to manipulate buffers of an index relation without having some
sort of lock on it. I'll think about it some more.

I don't understand why this would need more than just a callback to
release the cache.

We wouldn't. I thought that it would be slightly cleaner and slightly more
performant if we moved the (if !state) branches out of the XXXinsert()
functions.
But I guess, let's minimize the changes here. One cleanup callback is enough.

PS: It should be possible to make GIN and GiST use the new index AM APIs
as well.

Why should GIN/GiST use the new API? I think it's perfectly sensible to
only require the "cleanup callback" when just pfree() is not enough.

Yeah no need.

Attached v3 of the patch w/ a single index AM callback.

Regards,
Soumyadeep (VMware)

Attachments:

v3-0001-Reuse-revmap-and-brin-desc-in-brininsert.patch (text/x-patch, +108 −13)
#9 Soumyadeep Chakraborty
soumyadeep2007@gmail.com
In reply to: Soumyadeep Chakraborty (#8)
Re: brininsert optimization opportunity

Attached v4 of the patch, rebased against latest HEAD.

Regards,
Soumyadeep (VMware)

Attachments:

v4-0001-Reuse-revmap-and-brin-desc-in-brininsert.patch (text/x-patch, +108 −13)
#10 Soumyadeep Chakraborty
soumyadeep2007@gmail.com
In reply to: Soumyadeep Chakraborty (#9)
Re: brininsert optimization opportunity

Created an entry for the Sep CF: https://commitfest.postgresql.org/44/4484/

Regards,
Soumyadeep (VMware)

On Sat, Jul 29, 2023 at 9:28 AM Soumyadeep Chakraborty
<soumyadeep2007@gmail.com> wrote:


Attached v4 of the patch, rebased against latest HEAD.

Regards,
Soumyadeep (VMware)

#11 Soumyadeep Chakraborty
soumyadeep2007@gmail.com
In reply to: Soumyadeep Chakraborty (#10)
Re: brininsert optimization opportunity

Rebased against latest HEAD.

Regards,
Soumyadeep (VMware)

Attachments:

v5-0001-Reuse-revmap-and-brin-desc-in-brininsert.patch (text/x-patch, +108 −13)
#12 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Soumyadeep Chakraborty (#11)
Re: brininsert optimization opportunity

Hi,

I took a look at this patch today. I had to rebase the patch, due to
some minor bitrot related to 9f0602539d (but nothing major). I also did
a couple tiny cosmetic tweaks, but other than that the patch seems OK.
See the attached v6.

I did some simple performance tests too, similar to those in the initial
message:

CREATE UNLOGGED TABLE heap (i int);
CREATE INDEX ON heap USING brin(i) WITH (pages_per_range=1);
--
TRUNCATE heap;
INSERT INTO heap SELECT 1 FROM generate_series(1, 20000000);

And the results look like this (5 runs each):

master: 16448.338 16066.473 16039.166 16067.420 16080.066
patched: 13260.065 13229.800 13254.454 13265.479 13273.693

So that's a nice improvement, even though enabling WAL will make the
relative speedup somewhat smaller.

The one thing I'm not entirely sure about is adding new stuff to the
IndexAmRoutine. I don't think we want to end up with too many callbacks
that all AMs have to initialize etc. I can't think of a different/better
way to do this, though.

Barring objections, I'll try to push this early next week, after another
round of cleanup.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

v6-0001-Reuse-revmap-and-brin-desc-in-brininsert.patch (text/x-patch, +109 −13)
#13 Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Tomas Vondra (#12)
Re: brininsert optimization opportunity

On Fri, 3 Nov 2023 at 19:37, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

Hi,

I took a look at this patch today. I had to rebase the patch, due to
some minor bitrot related to 9f0602539d (but nothing major). I also did
a couple tiny cosmetic tweaks, but other than that the patch seems OK.
See the attached v6.
[...]
Barring objections, I'll try to push this early next week, after another
round of cleanup.

No hard objections: The principle looks fine.

I do think we should choose a better namespace than bs_* for the
fields of BrinInsertState, as BrinBuildState already uses the bs_*
namespace for its fields in the same file, but that's only cosmetic.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

#14 Soumyadeep Chakraborty
soumyadeep2007@gmail.com
In reply to: Matthias van de Meent (#13)
Re: brininsert optimization opportunity

On Fri, 3 Nov 2023 at 19:37, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

The one thing I'm not entirely sure about is adding new stuff to the
IndexAmRoutine. I don't think we want to end up with too many callbacks
that all AMs have to initialize etc. I can't think of a different/better
way to do this, though.

Yes, there is not really an alternative. Also, aminsertcleanup() is very
similar to amvacuumcleanup(), so it is not awkward. Why should vacuum be
an exclusive VIP? :)
And there are other indexam callbacks that not every AM implements. So
this addition is not unprecedented in that sense.

Barring objections, I'll try to push this early next week, after another
round of cleanup.

Many thanks for resurrecting this patch!

On Fri, Nov 3, 2023 at 12:16PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

I do think we should choose a better namespace than bs_* for the
fields of BrinInsertState, as BrinBuildState already uses the bs_*
namespace for its fields in the same file, but that's only cosmetic.

bis_* then.

Regards,
Soumyadeep (VMware)

#15 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Soumyadeep Chakraborty (#14)
Re: brininsert optimization opportunity

I've done a bit more cleanup on the last version of the patch (renamed
the fields to start with bis_ as agreed, rephrased the comments / docs /
commit message a bit) and pushed.

Thanks for the patch!

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#16 Ashwin Agrawal
aagrawal@pivotal.io
In reply to: Tomas Vondra (#15)
Re: brininsert optimization opportunity

On Sat, Nov 25, 2023 at 12:06 PM Tomas Vondra <tomas.vondra@enterprisedb.com>
wrote:

I've done a bit more cleanup on the last version of the patch (renamed
the fields to start with bis_ as agreed, rephrased the comments / docs /
commit message a bit) and pushed.

Thanks a lot Tomas for helping to drive the patch to completion iteratively
and realizing the benefits.

- Ashwin

#17 Soumyadeep Chakraborty
soumyadeep2007@gmail.com
In reply to: Ashwin Agrawal (#16)
Re: brininsert optimization opportunity

Thanks a lot for reviewing and pushing! :)

Regards,
Soumyadeep (VMware)

#18 Richard Guo
guofenglinux@gmail.com
In reply to: Tomas Vondra (#15)
Re: brininsert optimization opportunity

On Sun, Nov 26, 2023 at 4:06 AM Tomas Vondra <tomas.vondra@enterprisedb.com>
wrote:

I've done a bit more cleanup on the last version of the patch (renamed
the fields to start with bis_ as agreed, rephrased the comments / docs /
commit message a bit) and pushed.

It seems that we have an oversight in this commit. If no tuple has been
inserted, we won't have an insert state available in the cleanup phase.
So the Assert in brininsertcleanup() is not always right. For example:

regression=# update brin_summarize set value = brin_summarize.value;
server closed the connection unexpectedly

So I wonder if we should check 'bistate' and do the clean up only if
there is an available one, something like below.

--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -359,7 +359,9 @@ brininsertcleanup(IndexInfo *indexInfo)
 {
     BrinInsertState *bistate = (BrinInsertState *) indexInfo->ii_AmCache;
 
-    Assert(bistate);
+    /* We don't have an available insert state, nothing to do */
+    if (!bistate)
+        return;

Thanks
Richard

#19Soumyadeep Chakraborty
soumyadeep2007@gmail.com
In reply to: Richard Guo (#18)
Re: brininsert optimization opportunity

On Sun, Nov 26, 2023 at 9:28 PM Richard Guo <guofenglinux@gmail.com> wrote:

On Sun, Nov 26, 2023 at 4:06 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

I've done a bit more cleanup on the last version of the patch (renamed
the fields to start with bis_ as agreed, rephrased the comments / docs /
commit message a bit) and pushed.

It seems that we have an oversight in this commit. If no tuples have
been inserted, there is no insert state available during the cleanup
phase, so the Assert in brininsertcleanup() does not always hold. For
example:

regression=# update brin_summarize set value = brin_summarize.value;
server closed the connection unexpectedly

I wasn't able to repro the issue on
86b64bafc19c4c60136a4038d2a8d1e6eecc59f2 with UPDATE/INSERT:

postgres=# drop table a;
DROP TABLE
postgres=# create table a(i int);
CREATE TABLE
postgres=# create index on a using brin(i);
CREATE INDEX
postgres=# insert into a select 1 where 1!=1;
INSERT 0 0
postgres=# update a set i = 2 where i = 0;
UPDATE 0
postgres=# update a set i = a.i;
UPDATE 0

This could be because, since c5b7ba4e67aeb5d6f824b74f94114d99ed6e42b7,
ExecOpenIndices() has moved from ExecInitModifyTable() to ExecInsert().
Since we never open the indices if nothing is inserted, we never
attempt to close them with ExecCloseIndices() while ii_AmCache is NULL
(which is what causes this assertion failure).

However, it is possible to repro the issue with:
# create empty file
touch /tmp/a.csv
postgres=# create table a(i int);
CREATE TABLE
postgres=# create index on a using brin(i);
CREATE INDEX
postgres=# copy a from '/tmp/a.csv';
TRAP: failed Assert("bistate"), File: "brin.c", Line: 362, PID: 995511

So I wonder if we should check 'bistate' and do the cleanup only if
one is available, something like below.

Yes, this is the right way to go. Thanks for reporting!

Regards,
Soumyadeep (VMware)

#20Richard Guo
guofenglinux@gmail.com
In reply to: Soumyadeep Chakraborty (#19)
Re: brininsert optimization opportunity

On Mon, Nov 27, 2023 at 1:53 PM Soumyadeep Chakraborty <
soumyadeep2007@gmail.com> wrote:

On Sun, Nov 26, 2023 at 9:28 PM Richard Guo <guofenglinux@gmail.com>
wrote:

It seems that we have an oversight in this commit. If no tuples have
been inserted, there is no insert state available during the cleanup
phase, so the Assert in brininsertcleanup() does not always hold. For
example:

regression=# update brin_summarize set value = brin_summarize.value;
server closed the connection unexpectedly

I wasn't able to repro the issue on
86b64bafc19c4c60136a4038d2a8d1e6eecc59f2.
with UPDATE/INSERT:

This could be because, since c5b7ba4e67aeb5d6f824b74f94114d99ed6e42b7,
ExecOpenIndices() has moved from ExecInitModifyTable() to ExecInsert().
Since we never open the indices if nothing is inserted, we never
attempt to close them with ExecCloseIndices() while ii_AmCache is NULL
(which is what causes this assertion failure).

AFAICS we would also open the indices from ExecUpdate(). So if we
update the table in a way that no new tuples are inserted, we will have
this issue. As I showed previously, the query below crashes for me on
latest master (dc9f8a7983).

regression=# update brin_summarize set value = brin_summarize.value;
server closed the connection unexpectedly

There are other code paths that call ExecOpenIndices(), such as
ExecMerge(). I believe it's not hard to create queries that trigger this
Assert for those cases.
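
For instance (untested, just a plausible shape of such a query), a
MERGE whose only action is an UPDATE opens the target's indices but
never inserts an index tuple, so it might exercise the same cleanup
path with a NULL ii_AmCache:

```sql
CREATE TABLE t(i int);
CREATE INDEX ON t USING brin(i);
INSERT INTO t VALUES (1);
MERGE INTO t USING (VALUES (1)) AS s(i) ON t.i = s.i
    WHEN MATCHED THEN UPDATE SET i = s.i;
```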

Thanks
Richard

#21Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Richard Guo (#20)
#22Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#21)
#23Alexander Lakhin
exclusion@gmail.com
In reply to: Tomas Vondra (#15)
#24Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Alexander Lakhin (#23)
#25Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#24)
#26Soumyadeep Chakraborty
soumyadeep2007@gmail.com
In reply to: Tomas Vondra (#25)
#27Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Soumyadeep Chakraborty (#26)
#28James Wang
jwang25610@gmail.com
In reply to: Tomas Vondra (#27)
#29Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: James Wang (#28)
#30Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tomas Vondra (#25)
#31Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#30)
#32Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Alvaro Herrera (#30)
#33Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tomas Vondra (#32)
#34Michael Paquier
michael@paquier.xyz
In reply to: Alvaro Herrera (#33)
#35Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Michael Paquier (#34)
#36Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#35)
#37Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#36)
#38Michael Paquier
michael@paquier.xyz
In reply to: Tomas Vondra (#37)