GIN improvements
Improvements to GIN indexes were presented at PGCon 2008. Presentation:
http://www.sigaev.ru/gin/fastinsert_and_multicolumn_GIN.pdf
1) multicolumn GIN
This patch ( http://www.sigaev.ru/misc/multicolumn_gin-0.2.gz ) adds multicolumn
support to GIN. The basic idea is: keys (entries in GIN terminology) extracted
from values are stored in separate tuples along with their column number. In
that case, a multicolumn clause is just an AND of per-column clauses. Unlike
other indexes, search performance doesn't depend on which column of the index
(first, last, any subset) is used in the search clause. This property could be
used in gincostestimate, but I haven't looked at it yet.
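To illustrate the storage idea, here is a standalone model (the types and data
are made up for illustration; this is not the patch code): entries compare
first on column number, so a lookup on any column is the same kind of search.

/* Standalone model of multicolumn GIN entry ordering (not the patch code).
 * Each extracted key is stored with its column number; entries sort first
 * on column number, so every column's keys are equally reachable. */
#include <stdio.h>
#include <stdlib.h>

typedef struct GinEntryModel
{
    int   attnum;   /* index column the key was extracted from */
    int   key;      /* the key itself (an int, for simplicity) */
} GinEntryModel;

static int
entry_cmp(const void *a, const void *b)
{
    const GinEntryModel *ea = a, *eb = b;

    if (ea->attnum != eb->attnum)
        return ea->attnum - eb->attnum;
    return ea->key - eb->key;
}

int
main(void)
{
    /* keys extracted from two columns of the same indexed rows */
    GinEntryModel entries[] = { {2, 7}, {1, 7}, {2, 3}, {1, 1}, {1, 3} };
    int n = sizeof(entries) / sizeof(entries[0]);

    qsort(entries, n, sizeof(GinEntryModel), entry_cmp);

    /* a clause on column 2 binary-searches exactly like one on column 1:
     * the comparator handles (attnum, key) pairs uniformly */
    GinEntryModel probe = {2, 7};
    GinEntryModel *hit = bsearch(&probe, entries, n,
                                 sizeof(GinEntryModel), entry_cmp);
    printf("found: %s\n", hit ? "yes" : "no");
    return 0;
}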
2) fast insert into GIN
This patch ( http://www.sigaev.ru/misc/fast_insert_gin-0.4.gz ) implements an
idea of using the bulk insert technique that is used at index creation time.
Inserted rows are stored in a linked list of pending pages and inserted into
the regular GIN structure at vacuum time. The algorithm is shown in the
presentation, but the insert completion process (vacuum) was significantly
reworked to improve concurrency. Now the list of pending pages is locked for
much less time - only during deletion of pages from the list.
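A minimal standalone model of the scheme (names like fast_insert and
insert_cleanup are hypothetical, not the actual GIN functions):

/* Minimal model of fast insert via a pending list (hypothetical names,
 * not the actual GIN code): inserts append cheaply; cleanup drains the
 * list into the regular structure. */
#include <stdio.h>
#include <stdlib.h>

typedef struct PendingPage
{
    int                 item;   /* stand-in for the indexed tuple's keys */
    struct PendingPage *next;
} PendingPage;

static PendingPage *head = NULL, *tail = NULL;

/* fast path: no descent into the index, just append to the list */
static void
fast_insert(int item)
{
    PendingPage *p = malloc(sizeof(PendingPage));

    p->item = item;
    p->next = NULL;
    if (tail)
        tail->next = p;
    else
        head = p;
    tail = p;
}

/* cleanup (run at vacuum time): move everything into the main structure */
static void
insert_cleanup(void)
{
    while (head)
    {
        PendingPage *p = head;

        head = p->next;
        printf("inserting %d into regular structure\n", p->item);
        free(p);
    }
    tail = NULL;
}

int
main(void)
{
    fast_insert(42);
    fast_insert(7);
    insert_cleanup();   /* searches before this must also scan the list */
    return 0;
}

Searches that run before the cleanup must scan the list as well; that is what
the timings further down measure.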
Open item:
what is the right time to call insert completion? Currently it is called by
ginbulkdelete and ginvacuumcleanup; ginvacuumcleanup will call completion if
ginbulkdelete wasn't called. That's not good, but it works. The completion
process should be started before ginbulkdelete, because ginbulkdelete doesn't
look at pending pages at all.
Since insert completion (of any index, if that method exists, I think) runs
fast when the number of inserted tuples is small, because it doesn't go
through the whole index, IMHO the existing statistics fields should not be
changed. The idea discussed at PGCon was to have a trigger in vacuum which
fires when the number of inserted tuples becomes big. Now I don't think that
idea is useful, for two reasons: for a small number of tuples completion is
cheap, and it should be called before ginbulkdelete. IMHO, it's enough to add
an optional method to pg_am (aminsertcleanup, per Tom's suggestion). This
method would be called before ambulkdelete and amvacuumcleanup, roughly as in
the sketch below. Opinions, objections, suggestions?
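Sketch of that calling convention (the struct and functions are illustrative
stand-ins, not the real pg_am machinery):

/* Illustrative stand-ins, not the real pg_am structures: the point is
 * only the call order, with the optional insert cleanup running first
 * so that ambulkdelete sees every item pointer. */
typedef struct IndexAmModel
{
    void (*aminsertcleanup) (void);   /* optional; may be NULL */
    void (*ambulkdelete) (void);
    void (*amvacuumcleanup) (void);
} IndexAmModel;

static void
vacuum_one_index(IndexAmModel *am)
{
    if (am->aminsertcleanup)
        am->aminsertcleanup();        /* move pending tuples into the tree */
    am->ambulkdelete();               /* now sees all item pointers */
    am->amvacuumcleanup();
}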
At the presentation some people were interested in how our changes affect
search speed after rows are inserted. The tests are below: we use the same
tables as in the presentation and measure search times (after inserting some
rows) before and after vacuum. All times are in ms. Test tables contain 100000
rows; in the first table the number of elements in the array is 100 with
cardinality = 500, in the second - 100 and 500000, in the last - 1000 and 500.
Insert 10000 into table with 100000 rows (10%)
                 |    v && '{v1}'    | found
-----------------+---------+---------+-------
                 | novac-d |  vac-d  | rows
-----------------+---------+---------+-------
n:100, c:500     |     118 |    35   | 19909
n:100, c:500000  |      95 |     0.7 |    25
n:1000, c:500    |     380 |    79   | 95211
Insert 1000 into table with 100000 rows (1%)
                 |    v && '{v1}'    | found
-----------------+---------+---------+-------
                 | novac-d |  vac-d  | rows
-----------------+---------+---------+-------
n:100, c:500     |      40 |    31   | 18327
n:100, c:500000  |      13 |     0.5 |    26
n:1000, c:500    |     102 |    71   | 87499
Insert 100 into table with 100000 rows (0.1%)
                 |    v && '{v1}'    | found
-----------------+---------+---------+-------
                 | novac-d |  vac-d  | rows
-----------------+---------+---------+-------
n:100, c:500     |      32 |    31   | 18171
n:100, c:500000  |     1.7 |     0.5 |    20
n:1000, c:500    |      74 |    71   | 87499
Looking at the results, it's easy to conclude that:
- the time to search the pending list is O(number of inserted rows), i.e.,
search time is equal to (time of search in GIN) + K1 * (number of rows
inserted after the last vacuum).
- search time is O(average length of indexed columns). The observations made
above are also applicable here.
- a significant performance gap starts around 5-10% of inserts, or near
500-1000 inserts. This depends very much on the specific dataset.
Notice that insert performance into GIN was increased up to 10 times. See the
exact results in the presentation.
Do we need to add an option to control this (fast insertion) feature?
If so, what should the default value be? It's not clear to me.
Note: These patches are mutually exclusive because they touch the same pieces
of code and I'm too lazy to manage several dependent patches. I don't see any
problem with joining the patches into one, but IMHO it would be difficult to
review.
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
2) fast insert into GIN
New version:
http://www.sigaev.ru/misc/fast_insert_gin-0.6.gz
Changes:
- added option FASTUPDATE=(1|t|true|on|enable|0|f|false|disable) for
CREATE/ALTER command for GIN indexes
- Since there weren't any comments on the first email, the optional
pg_am.aminsertcleanup method was introduced.
- added documentation
I suppose the patch is ready to review/commit...
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
Teodor Sigaev wrote:
2) fast insert into GIN
New version:
http://www.sigaev.ru/misc/fast_insert_gin-0.6.gz
Changes:
- added option FASTUPDATE=(1|t|true|on|enable|0|f|false|disable) for
CREATE/ALTER command for GIN indexes
I think we should try to make it automatic. I'm not sure how.
How about having a constant sized "fastupdate" buffer, of say 100 rows
or a fixed number of pages, and when that becomes full, the next
inserter will have to pay the price of updating the index and flushing
the buffer. To keep that overhead out of the main codepath, we could
make autovacuum flush the buffers periodically.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
How about having a constant sized "fastupdate" buffer, of say 100 rows
or a fixed number of pages, and when that becomes full, the next
inserter will have to pay the price of updating the index and flushing
I'm not sure that is acceptable because flushing the pending list may take
several seconds at an unpredictable moment.
the buffer. To keep that overhead out of the main codepath, we could
make autovacuum flush the buffers periodically.
Do you mean that GIN sends a "smoke signal" to the autovacuum launcher process
to ask about vacuum?
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
Teodor Sigaev wrote:
How about having a constant sized "fastupdate" buffer, of say 100 rows
or a fixed number of pages, and when that becomes full, the next
inserter will have to pay the price of updating the index and flushing
I'm not sure that is acceptable because flushing the pending list may take
several seconds at an unpredictable moment.
Perhaps we could make the fixed-size buffer be of size X, and trigger
autovac on X/3 or some such, to give it enough slack so that it would be
very unlikely to be processed by user processes.
the buffer. To keep that overhead out of the main codepath, we could
make autovacuum flush the buffers periodically.
Do you mean that GIN sends a "smoke signal" to the autovacuum launcher
process to ask about vacuum?
Something like that, yes.
Currently, autovac only uses pgstats as a trigger for actions. Maybe we
could use something else (say, a flag in shared memory?), or just stash
the info that the index needs to be processed in pgstats and have
autovac examine it.
--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
Teodor Sigaev wrote:
2) fast insert into GIN
New version:
http://www.sigaev.ru/misc/fast_insert_gin-0.6.gz
Changes:
- added option FASTUPDATE=(1|t|true|on|enable|0|f|false|disable) for
CREATE/ALTER command for GIN indexes
- Since there weren't any comments on the first email, the optional
pg_am.aminsertcleanup method was introduced.
Hmm, this alters the semantics of amvacuumcleanup a bit. Currently in
btvacuumcleanup we invoke btvacuumscan only if btbulkdelete was not
called. This is noticed by observing whether the "stats" pointer is
NULL. However, the patch changes this a bit because
index_vacuum_cleanup is called with the results of index_insert_cleanup,
instead of a plain NULL.
Right now this is not a problem because there is no insert_cleanup
function for btree, but I wonder if we should clean it up.
FWIW there's a typo in catalogs.sgml (finction -> function)
What's the use of the FASTUPDATE parameter? Is there a case when a user
is interested in turning it off?
--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
Perhaps we could make the fixed-size buffer be of size X, and trigger
autovac on X/3 or some such, to give it enough slack so that it would be
very unlikely to be processed by user processes.
Do you mean that GIN sends a "smoke signal" to the autovacuum launcher
process to ask about vacuum?
Currently, autovac only uses pgstats as a trigger for actions. Maybe we
could use something else (say, a flag in shared memory?), or just stash
the info that the index needs to be processed in pgstats and have
autovac examine it.
A flag in pgstats or shared memory is the most reasonable solution. Using the
size of the buffers is not very good because other indexes might use different
techniques for delayed insert and different trigger criteria.
I suppose the best technique for GIN would be for the search procedure to set
the flag: if the time spent scanning pending pages is equal to the time of a
search in the regular structure, then it's time to call insert cleanup.
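In code, the heuristic would be something like this (a standalone sketch; all
names are hypothetical):

/* Standalone sketch, all names hypothetical: the scan measures how long
 * the pending-list part took versus the regular part and raises a flag
 * that autovacuum (or whoever) could poll. */
#include <stdbool.h>

static volatile bool gin_cleanup_needed = false;  /* imagine shared memory */

static void
after_gin_scan(double pending_ms, double regular_ms)
{
    if (pending_ms >= regular_ms)
        gin_cleanup_needed = true;    /* time to call insert cleanup */
}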
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
Right now this is not a problem because there is no insert_cleanup
function for btree, but I wonder if we should clean it up.
Look at gistbulkdelete and gistvacuumcleanup: the first function wants to send
a bool flag to the second one, so they use the GiSTBulkDelete structure
instead of the usual IndexBulkDeleteResult. When it's needed, btree may use
the same method.
FWIW there's a typo in catalogs.sgml (finction -> function)
Thank you, will fix.
What's the use of the FASTUPDATE parameter? Is there a case when a user
is interested in turning it off?
Yeah - when search time is much, much more important (or crucial) than
insertion time. Or when the table stores read-only values.
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
1) multicolumn GIN
Unlike other indexes, search performance
doesn't depend on which column of the index (first, last, any subset) is
used in the search clause. This property could be used in gincostestimate, but
I haven't looked at it yet.
After some playing around, I didn't find any mention in the *costestimate
functions of a difference in cost estimation between the first and any other
columns in clauses, so, IMHO, the issue above isn't an issue. :)
So, I didn't see any comments/objections, and I intend to commit this patch in
the next two days and synchronize the 'fast insert into GIN' patch with CVS.
Objections?
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
Teodor Sigaev <teodor@sigaev.ru> writes:
So, I didn't see any comments/objections, and I intend to commit this patch in
the next two days and synchronize the 'fast insert into GIN' patch with CVS.
Objections?
I think it hasn't really gotten reviewed at all (certainly not by me).
If you want other people to look it over you should wait for next
month's commit fest.
regards, tom lane
Synced with current CVS HEAD, and posted to -hackers too because -patches is
close to closing.
http://www.sigaev.ru/misc/fast_insert_gin-0.7.gz
http://www.sigaev.ru/misc/multicolumn_gin-0.3.gz
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
Dumb question:
What's the benefit of a multi-column GIN index over multiple
single-column GIN indexes?
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
What's the benefit of a multi-column GIN index over multiple
single-column GIN indexes?
Page 12 from the presentation at PGCon
(http://www.sigaev.ru/gin/fastinsert_and_multicolumn_GIN.pdf):
Multicolumn index vs. 2 single-column indexes:
         Multicolumn index | 2 single-column indexes
Size:    539 Mb            | 538 Mb
Speed:   *1.885* ms        | 4.994 ms
Index:   ~340 s            | ~200 s
Insert:  72 s/10000        | 66 s/10000
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
On Fri, 4 Jul 2008, Teodor Sigaev wrote:
What's the benefit of a multi-column GIN index over multiple
single-column GIN indexes?
Page 12 from the presentation at PGCon
(http://www.sigaev.ru/gin/fastinsert_and_multicolumn_GIN.pdf):
Multicolumn index vs. 2 single-column indexes:
         Multicolumn index | 2 single-column indexes
Size:    539 Mb            | 538 Mb
Speed:   *1.885* ms        | 4.994 ms
Index:   ~340 s            | ~200 s
Insert:  72 s/10000        | 66 s/10000
Well, another reason is index feature-completeness.
Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
Teodor Sigaev wrote:
Synced with current CVS HEAD, and posted to -hackers too because -patches is
close to closing.
I looked this over and it looks good in general. I was only wondering
about single-column indexes -- we're storing attribute numbers there too,
right? Would it be too difficult to strip them out in that case?
--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
I looked this over and it looks good in general. I was only wondering
about single-column indexes -- we're storing attribute numbers there too,
right?
No, the GinState->oneCol field tells GinFormTuple and
gin_index_getattr/gintuple_get_attrnum about the actual storage.
A single-column index is binary-compatible with the current index :)
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
Teodor Sigaev wrote:
I looked this over and it looks good in general. I was only wondering
about single-column indexes -- we're storing attribute numbers there too,
right?
No, the GinState->oneCol field tells GinFormTuple and
gin_index_getattr/gintuple_get_attrnum about the actual storage.
A single-column index is binary-compatible with the current index :)
Ah, neat!
--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
I looked this over and it looks good in general.
May I assume that the patch passed review, and commit it?
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
Teodor Sigaev <teodor@sigaev.ru> writes:
I looked this over and it looks good in general.
May I assume that the patch passed review, and commit it?
I'd still like to take a look.
regards, tom lane
On Tue, 2008-07-08 at 14:51 -0400, Tom Lane wrote:
I'd still like to take a look.
I was tasked with reviewing this for the current commit fest, although
so far I've just been working on grokking the rest of the GIN code. But
if you'd like to review it instead, that's fine with me.
-Neil
Neil,
I was tasked with reviewing this for the current commit fest, although
so far I've just been working on grokking the rest of the GIN code. But
if you'd like to review it instead, that's fine with me.
I have plenty of other stuff I could assign you if you're not needed on
GIN.
--
--Josh
Josh Berkus
PostgreSQL @ Sun
San Francisco
Teodor Sigaev <teodor@sigaev.ru> writes:
http://www.sigaev.ru/misc/fast_insert_gin-0.7.gz
http://www.sigaev.ru/misc/multicolumn_gin-0.3.gz
I've committed the multicolumn one with minor revisions (fix some poor
English in docs and comments, add regression-test coverage). Do you
need more review of fast_insert yet? It looked like a number of people
commented on it already.
regards, tom lane
I've committed the multicolumn one with minor revisions (fix some poor
English in docs and comments, add regression-test coverage). Do you
Thank you very much.
need more review of fast_insert yet? It looked like a number of people
commented on it already.
I should modify it to support/synchronize with multicolumn GIN - both patches
touch the same pieces of code, and I didn't make a single patch to simplify review.
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
Updated: http://www.sigaev.ru/misc/fast_insert_gin-0.9.gz
need more review of fast_insert yet? It looked like a number of people
commented on it already.
I still don't have clarity on whether the suggested aminsertcleanup calling
convention is acceptable.
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
Teodor Sigaev <teodor@sigaev.ru> writes:
Updated: http://www.sigaev.ru/misc/fast_insert_gin-0.9.gz
I still don't have clarity on whether the suggested aminsertcleanup calling
convention is acceptable.
I started to look at this. I don't understand why VACUUM does an insert
cleanup before starting to vacuum, but VACUUM FULL doesn't?
I don't particularly like the idea of adding aminsertcleanup calls
immediately before other AM operations such as ambulkdelete. It seems
to me that those operations ought to include the cleanup subroutine
themselves, if they need it; they shouldn't depend on callers to get
this right. Offhand it looks to me like the only new index AM call
needed is the one at vacuum startup, which tempts me to propose that
the new AM entry point should be called "amvacuumstartup", instead of
wiring in the assumption that what it's for is specifically cleanup
of insertions.
Comments? I can make the change if you think it's okay --- I'm busy
cleaning up docs and comments at the moment.
regards, tom lane
I started to look at this. I don't understand why VACUUM does an insert
cleanup before starting to vacuum, but VACUUM FULL doesn't?
Hmm. Maybe I missed something, but I don't understand where and what... I
tried to track all places where ambulkdelete is called. aminsertcleanup should
be called before any ambulkdelete, because ambulkdelete doesn't scan the
pending list, which can store items to be deleted, and hence the index would
keep item pointers to absent tuples.
needed is the one at vacuum startup, which tempts me to propose that
the new AM entry point should be called "amvacuumstartup", instead of
wiring in the assumption that what it's for is specifically cleanup
of insertions.
That's possible, but inserts into the index would have to be forbidden between
amvacuumstartup and the last call of ambulkdelete.
Comments? I can make the change if you think it's okay --- I'm busy
cleaning up docs and comments at the moment.
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
Teodor Sigaev <teodor@sigaev.ru> writes:
I started to look at this. I don't understand why VACUUM does an insert
cleanup before starting to vacuum, but VACUUM FULL doesn't?
Hmm. Maybe I missed something, but I don't understand where and what... I
tried to track all places where ambulkdelete is called. aminsertcleanup should
be called before any ambulkdelete, because ambulkdelete doesn't scan the
pending list, which can store items to be deleted, and hence the index would
keep item pointers to absent tuples.
needed is the one at vacuum startup, which tempts me to propose that
the new AM entry point should be called "amvacuumstartup", instead of
wiring in the assumption that what it's for is specifically cleanup
of insertions.
That's possible, but inserts into the index would have to be forbidden between
amvacuumstartup and the last call of ambulkdelete.
Well, if that is required to be true then this whole design is pretty
broken, because VACUUM doesn't hold any lock that would guarantee that
no insert happens between the two calls. If we fold the two AM calls
into one call then it'd be okay for the index AM to take such a lock
transiently during the single index-cleanup-plus-bulkdelete call.
For VACUUM FULL there's no such issue because the whole table is locked,
but I still don't see any real point in having two successive index AM
calls when the AM could perfectly well do all the work in one call.
Maybe it'd be better if ambulkdelete *did* scan the pending list?
You'd still need at least page-level locking but perhaps not anything
stronger.
regards, tom lane
Well, if that is required to be true then this whole design is pretty
broken, because VACUUM doesn't hold any lock that would guarantee that
no insert happens between the two calls. If we fold the two AM calls
into one call then it'd be okay for the index AM to take such a lock
transiently during the single index-cleanup-plus-bulkdelete call.
Actually, a lock isn't needed. bulkdelete just should not try to remove item
pointers that haven't yet been "insertcleanuped". That's easy because the
VacPageList is prepared before the insertcleanup call.
Maybe it'd be better if ambulkdelete *did* scan the pending list?
I don't like that idea because it requires adding a lot of code (concurrent
deletion of pages in the list); it's much simpler to call insertcleanup inside
ginbulkdelete/ginvacuumcleanup.
You'd still need at least page-level locking but perhaps not anything
stronger.
It's close to trivial to revert this piece and add the cleanup call to
ginbulkdelete/ginvacuumcleanup. Early variants did exactly that.
The reasons for the new variant were:
- determining whether insertcleanup needs to be called; the stats argument was
used for that in both functions. If it's NULL, then call cleanup.
- I thought about a statistics-based trigger for a separate call of
insertcleanup. The trigger should fire on massive insert/update events, very
similar to the trigger on massive deletes for ambulkdelete. I'm very sorry,
but I didn't do it yet, and I definitely need some help here.
Should I revert that piece?
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
Teodor Sigaev <teodor@sigaev.ru> writes:
It's close to trivial to revert this piece and add the cleanup call to
ginbulkdelete/ginvacuumcleanup. Early variants did exactly that.
Yeah, I think we should do it that way.
On reflection I don't think you even need the amvacuumstartup call,
because it is *not* safe to assume that an index cleanup operation
there will guarantee that vacuum won't try to remove pending tuples.
Remember that a tuple inserted by a transaction that later aborted
is DEAD and can be reclaimed instantly by VACUUM. So while in the
case of VACUUM FULL it might be okay to call index_cleanup only
once, for regular VACUUM I think you really have to call it within
each bulkdelete operation. There's probably no point in optimizing
it away in VACUUM FULL either, since surely it'll be fast to call
index_cleanup when there's nothing in the pending list?
- I thought about a statistics-based trigger for a separate call of
insertcleanup. The trigger should fire on massive insert/update events, very
similar to the trigger on massive deletes for ambulkdelete. I'm very sorry,
but I didn't do it yet, and I definitely need some help here.
Yeah, I was going to complain about that next :-). Autovacuum isn't
going to trigger as a result of INSERT operations; somehow we have
to teach it what to do for GIN indexes. I remember we discussed this
at PGCon but I don't think we decided exactly what to do...
Should I revert that piece?
I've already made a number of changes to the patch; let me keep working
on it and send it back to you later.
regards, tom lane
once, for regular VACUUM I think you really have to call it within
each bulkdelete operation.
Exactly what I did in the last patch.
There's probably no point in optimizing
it away in VACUUM FULL either, since surely it'll be fast to call
index_cleanup when there's nothing in the pending list?
Sure, with an empty pending list insertcleanup will just lock/unlock the metapage.
Yeah, I was going to complain about that next :-). Autovacuum isn't
going to trigger as a result of INSERT operations; somehow we have
to teach it what to do for GIN indexes. I remember we discussed this
at PGCon but I don't think we decided exactly what to do...
So, maybe we just move the insertcleanup call to ginbulkdelete/ginvacuumcleanup
but leave the aminsertcleanup field in pg_am for the future.
I've already made a number of changes to the patch; let me keep working
on it and send it back to you later.
ok
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
Teodor Sigaev <teodor@sigaev.ru> writes:
So, maybe we just move the insertcleanup call to ginbulkdelete/ginvacuumcleanup
but leave the aminsertcleanup field in pg_am for the future.
I'd be inclined not to add the extra AM call if we aren't going to use
it now. There's no very good reason to think that a definition we
settled on today would be exactly the right thing for whatever future
need might appear. Better to wait till we have a concrete example to
design around.
regards, tom lane
I wrote:
Yeah, I was going to complain about that next :-). Autovacuum isn't
going to trigger as a result of INSERT operations; somehow we have
to teach it what to do for GIN indexes. I remember we discussed this
at PGCon but I don't think we decided exactly what to do...
One simple idea is to call aminsertcleanup (probably renamed to
something else like amanalyzehook) during ANALYZE. This seems a bit
grotty, but it has the very attractive property that we don't need to
give the autovacuum control logic any special knowledge about GIN
indexes. Either inserts or updates will lead it to trigger either
auto-ANALYZE or auto-VACUUM, and either way GIN gets a cleanup
opportunity.
A possible argument against this is that if we later fix things so
that VACUUM and ANALYZE can happen concurrently on the same table,
amanalyzehook could get called concurrently with ambulkdelete or
other vacuum-support operations. So the AM author would have to
take care to interlock that safely. But this doesn't seem like
a big deal to me --- interlocks against regular inserts/updates
are probably a harder problem anyway.
Thoughts?
regards, tom lane
Teodor Sigaev <teodor@sigaev.ru> writes:
Here is the GIN fast-insert patch back again. Changes:
* Sync with CVS HEAD
* Clean up documentation and some of the code comments
* Fix up custom reloptions code
* Suppress some compiler warnings
I didn't get much further than that because I got discouraged after
looking at the locking issues around the pending-insertions list.
It's a mess:
* shiftList() holds an exclusive lock on metapage throughout its run,
which means that it's impossible for two of them to run concurrently.
So why bother with "concurrent deletion" detection?
* shiftList does LockBufferForCleanup, which means that it can be blocked
for an indefinitely long time by a concurrent scan, and since it's holding
exclusive lock on metapage no new scans or insertions can start meanwhile.
This is not only horrid from a performance standpoint but it very probably
can result in deadlocks --- which will be deadlocks on LWLocks and thus
not even detected by the system.
* GIN index scans release lock and pin on one pending-list page before
acquiring pin and lock on the next, which means there's a race condition:
shiftList could visit and delete the next page before we get to it,
because there's a window where we're holding no buffer lock at all.
I think this isn't fatal in itself, since presumably the data in the next
page has been moved into the main index and we can scan it later, but the
scan code isn't checking whether the page has been deleted out from under
it.
* It seems also possible that once a list page has been marked
GIN_DELETED, it could be re-used for some other purpose before a
scan-in-flight reaches it -- reused either as a regular index page or as a
new list page. Neither case is being defended against. It might be that
the new-list-page case isn't a problem, or it might not be.
* There is a bigger race condition, which is that after a scan has
returned a tuple from a pending page, vacuum could move the index entry
into the main index structure, and then that same scan could return that
same index entry a second time. This is a no-no, and I don't see any easy
fix.
I haven't really finished reviewing this code, but I'm going to bounce it
back to you to see if you can solve the locking problems. Unless that can
be made safe there is no point doing any more work on this patch.
regards, tom lane
Tom Lane wrote:
Teodor Sigaev <teodor@sigaev.ru> writes:
I didn't get much further than that because I got discouraged after
looking at the locking issues around the pending-insertions list.
It's a mess:
These are rather severe problems. Maybe there's a better solution, but
perhaps it would be good enough to lock out concurrent access to the
index while the bulkinsert procedure is working.
--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera <alvherre@commandprompt.com> writes:
Tom Lane wrote:
It's a mess:
These are rather severe problems. Maybe there's a better solution, but
perhaps it would be good enough to lock out concurrent access to the
index while the bulkinsert procedure is working.
Ugh...
The idea I was toying with was to not allow GIN scans to "stop" on
pending-insertion pages; rather, they should suck out all the matching
tuple IDs into backend-local memory as fast as they can, and then return
the TIDs to the caller one at a time from that internal array. Then,
when the scan is later visiting the main part of the index, it could
check each matching TID against that array to see if it'd already
returned the TID. (So it might be an idea to sort the TID array after
gathering it, to make those subsequent checks fast via binary search.)
This would cost in backend-local memory, of course, but hopefully not
very much. The advantages are the elimination of the deadlock risk
from scan-blocks-insertcleanup-blocks-insert, and fixing the race
condition when a TID previously seen in the pending list is moved to
the main index. There were still a number of locking issues to fix
but I think they're all relatively easy to deal with.
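Concretely, the scheme looks something like this standalone sketch (a plain C
model with made-up types, not the real GIN scan code):

/* Standalone sketch of the suck-out-and-filter idea (not real GIN code):
 * pending-list matches are collected and sorted up front; every TID found
 * later in the main structure is checked against that array so the same
 * tuple is never returned twice. */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct Tid { unsigned block; unsigned offset; } Tid;

static int
tid_cmp(const void *a, const void *b)
{
    const Tid *ta = a, *tb = b;

    if (ta->block != tb->block)
        return (ta->block > tb->block) - (ta->block < tb->block);
    return (ta->offset > tb->offset) - (ta->offset < tb->offset);
}

int
main(void)
{
    /* TIDs already returned from the pending list, sorted once up front */
    Tid pending[] = { {3, 1}, {1, 2}, {2, 5} };
    int npending = sizeof(pending) / sizeof(pending[0]);

    qsort(pending, npending, sizeof(Tid), tid_cmp);

    /* TIDs produced by the main-index scan; {1,2} was moved by vacuum */
    Tid from_main[] = { {1, 2}, {4, 7} };
    for (int i = 0; i < (int) (sizeof(from_main) / sizeof(from_main[0])); i++)
    {
        bool seen = bsearch(&from_main[i], pending, npending,
                            sizeof(Tid), tid_cmp) != NULL;

        if (!seen)
            printf("return (%u,%u)\n", from_main[i].block, from_main[i].offset);
    }
    return 0;
}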
regards, tom lane
* shiftList() holds an exclusive lock on metapage throughout its run,
which means that it's impossible for two of them to run concurrently.
So why bother with "concurrent deletion" detection?
Because the metapage is locked immediately before the shiftList call; while
the metapage is unlocked, another process could lock the metapage and execute
shiftList. So, when shiftList starts, it should check for already-deleted
pages. If shiftList sees an already-deleted page, it doesn't do anything and
reports that to the caller.
* shiftList does LockBufferForCleanup, which means that it can be blocked
for an indefinitely long time by a concurrent scan, and since it's holding
exclusive lock on metapage no new scans or insertions can start meanwhile.
This is not only horrid from a performance standpoint but it very probably
can result in deadlocks --- which will be deadlocks on LWLocks and thus
not even detected by the system.
Oops, I see a possible scenario: UPDATE tbl SET gin_indexed_field = ... WHERE
gin_indexed_field .... with a concurrent shiftList. Will fix. Thank you.
Nevertheless, shiftList should be fast in the typical scenario: it doesn't do
complicated work, but just marks as deleted pages which were already read
before.
* GIN index scans release lock and pin on one pending-list page before
acquiring pin and lock on the next, which means there's a race condition:
shiftList could visit and delete the next page before we get to it,
because there's a window where we're holding no buffer lock at all.
Agree, will fix.
* It seems also possible that once a list page has been marked
GIN_DELETED, it could be re-used for some other purpose before a
scan-in-flight reaches it -- reused either as a regular index page or as a
Impossible - because deletion runs from the head of the list, and so does a
scan. But deletion locks the metapage and locks pages for cleanup. So a scan
may start only from a not-yet-deleted page and will go through the list ahead
of the deletion process.
* There is a bigger race condition, which is that after a scan has
returned a tuple from a pending page, vacuum could move the index entry
into the main index structure, and then that same scan could return that
same index entry a second time. This is a no-no, and I don't see any easy
fix.
Hmm, isn't that allowed for indexes? At least GiST has had this behaviour
since its birth.
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
Teodor Sigaev <teodor@sigaev.ru> writes:
* There is a bigger race condition, which is that after a scan has
returned a tuple from a pending page, vacuum could move the index entry
into the main index structure, and then that same scan could return that
same index entry a second time. This is a no-no, and I don't see any easy
fix.
Hmm, isn't that allowed for indexes? At least GiST has had this behaviour
since its birth.
Really? Then GiST needs to be fixed too. Otherwise you risk having
queries return the same row twice. A bitmap indexscan plan would mask
such an index bug ... but a plain indexscan won't.
regards, tom lane
I wrote:
Really? Then GiST needs to be fixed too. Otherwise you risk having
queries return the same row twice. A bitmap indexscan plan would mask
such an index bug ... but a plain indexscan won't.
BTW, there's another issue I forgot about yesterday, which is that
the planner assumes that all index AMs work correctly for backwards
scan. The place where the rubber meets the road here is that
if you DECLARE SCROLL CURSOR for a plan implemented as a plain
indexscan, then FETCH BACKWARDS is supposed to reliably generate
results consistent with previous FETCH FORWARDS, to wit the same
tuples in the reverse order.
We can assume that the query is using an MVCC snapshot, which means
that at the index level it's okay for the index to return newly-inserted
entries that weren't returned in the previous forward scan, or to not
return entries that were removed meanwhile by VACUUM. But re-ordering
live tuples is bad news.
The idea of copying the pending-tuples list into local scan state would
make this work as expected as far as the proposed patch goes, but I'm
wondering whether the behavior isn't completely broken anyway by
operations such as page splits. Do we need to change the planner to
assume that this only works nicely for btree?
regards, tom lane
operations such as page splits. Do we need to change the planner to
assume that this only works nicely for btree?
It seems to me that direction (backward or forward) has meaning only for
indexes with amcanorder = true. With amcanorder = false, results will come in
arbitrary order in either direction.
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
Teodor Sigaev <teodor@sigaev.ru> writes:
operations such as page splits. Do we need to change the planner to
assume that this only works nicely for btree?
It seems to me that direction (backward or forward) has meaning only for
indexes with amcanorder = true. With amcanorder = false, results will come in
arbitrary order in either direction.
Well, no; amcanorder specifies that the index can return results that
are sorted according to some externally meaningful ordering. The
question at hand is just whether the results of a single indexscan
are self-consistent. That's a property that can reasonably be expected
to hold regardless of amcanorder; it does hold for hash indexes for
instance. (In the case of hash we have to forbid splitting a bucket
that's actively being scanned in order to make it true.)
regards, tom lane
queries return the same row twice. A bitmap indexscan plan would mask
such an index bug ... but a plain indexscan won't.
Fuh. :( Well. Will fix.
GiST:
- GiST already supports both scan directions in theory, but a page split may
change the order between forward and backward scans (the user-defined
pageSplit doesn't preserve the order of tuples). Holding off a split until the
end of a scan would produce an unacceptable concurrency level.
- GiST can return the same itempointer twice. It's fixable by storing the
content of the current page in memory instead of just keeping the page pinned.
Will do (backpatches too).
GIN:
- GIN doesn't support backward scan direction and will not in the near future.
- Right now GIN doesn't return the same itempointer twice, but with the
current fast_insert patch it might. So, I suppose, to fix that it's enough
just to remember the itempointers returned from the pending list and use them
as a filter for the results from the regular structure. Will do.
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
Teodor Sigaev <teodor@sigaev.ru> writes:
- GiST already supports both scan directions in theory, but a page split may
change the order between forward and backward scans (the user-defined
pageSplit doesn't preserve the order of tuples). Holding off a split until the
end of a scan would produce an unacceptable concurrency level.
- GIN doesn't support backward scan direction and will not in the near future.
Okay. I'll see about fixing the planner to not assume that GIST or GIN
indexscans are scrollable.
The cleanest way to do this is to introduce a new bool column in pg_am
rather than hard-wiring assumptions about which AMs can do it. However
(a) that's not back-patchable and (b) it'll create a merge conflict with
your patch, if you're still going to add a new AM function column.
I think that aminsertcleanup per se isn't needed, but if we want an
"amanalyze" there'd still be a conflict. Where are we on that?
regards, tom lane
(a) that's not back-patchable and (b) it'll create a merge conflict with
your patch, if you're still going to add a new AM function column.
I think that aminsertcleanup per se isn't needed, but if we want an
"amanalyze" there'd still be a conflict. Where are we on that?
I'll revert the aminsertcleanup framework but leave the gininsertcleanup
function as is, because I won't have enough time until the end of summer - I'd
like to finalize the patch and fixes first.
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
Reworked version of fast insertion patch for GIN.
* shiftList does LockBufferForCleanup, which means that it can be blocked
for an indefinitely long time by a concurrent scan, and since it's holding
exclusive lock on metapage no new scans or insertions can start meanwhile.
This is not only horrid from a performance standpoint but it very probably
can result in deadlocks --- which will be deadlocks on LWLocks and thus
not even detected by the system.
Oops, I see a possible scenario: UPDATE tbl SET gin_indexed_field = ...
WHERE gin_indexed_field .... with a concurrent shiftList. Will fix. Thank
you.
Fixed, see below
* GIN index scans release lock and pin on one pending-list page before
acquiring pin and lock on the next, which means there's a race condition:
shiftList could visit and delete the next page before we get to it,
because there's a window where we're holding no buffer lock at all.
Agree, will fix.
Fixed
* There is a bigger race condition, which is that after a scan has
returned a tuple from a pending page, vacuum could move the index entry
into the main index structure, and then that same scan could return that
same index entry a second time. This is a no-no, and I don't see any
easy
fix.
Fixed. A TIDBitmap is used for that and for preventing the deadlock mentioned
above. The TIDBitmap collects matched tuples from the pending pages, and after
that it is used as a filter for the results from the regular GIN scan.
The patch extends the TIDBitmap interface with 2 functions:
bool tbm_check_tuple(TIDBitmap *tbm, const ItemPointer tid);
returns true if tid already exists in bitmap
bool tbm_has_lossy(TIDBitmap *tbm);
returns true if bitmap becomes lossy
Also, the sequential scan of a pending page is replaced with a binary search.
It's not a big win, but it might improve performance.
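Schematically, the filter works like this (the two signatures are the patch's;
the surrounding flow below is only an illustrative summary, not the actual
scan code):

/* Illustrative flow only: phase 1 collects matches from pending pages
 * into the bitmap and returns them; phase 2 scans the regular structure
 * and skips anything the bitmap already returned. */
#include <stdbool.h>

typedef struct TIDBitmap TIDBitmap;           /* opaque, as in PostgreSQL */
typedef struct ItemPointerData *ItemPointer;

/* the two functions added by the patch (signatures quoted from above) */
extern bool tbm_check_tuple(TIDBitmap *tbm, const ItemPointer tid);
extern bool tbm_has_lossy(TIDBitmap *tbm);

static bool
return_from_regular_scan(TIDBitmap *pending_matches, ItemPointer tid)
{
    /* once the bitmap goes lossy the filter is no longer exact, so the
     * caller has to check tbm_has_lossy(pending_matches) and handle that
     * case separately */
    return !tbm_check_tuple(pending_matches, tid);
}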
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
There's a pretty fundamental issue with this patch, which is that while
buffering the inserts in the "list pages" makes the inserts fast, all
subsequent queries become slower until the tuples have been properly
inserted into the index. I'm sure it's a good tradeoff in many cases,
but there has got to be a limit to it. Currently, if you create an empty
table, and load millions of tuples into it using INSERTs, the index
degenerates into just a pile of "fast" tuples that every query needs to
grovel through. The situation will only be rectified at the next vacuum,
but if there's no deletes or updates on the table, just inserts,
autovacuum won't happen until the next anti-wraparound vacuum.
To make things worse, a query will fail if all the matching
fast-inserted tuples don't fit in the non-lossy tid bitmap. That's
another reason to limit the number of list pages; queries will start
failing otherwise.
Yet another problem is that if so much work is offloaded to autovacuum,
it can tie up autovacuum workers for a very long time. And the work can
happen on an unfortunate time, when the system is busy, and affect other
queries. There's no vacuum_delay_point()s in gininsertcleanup, so
there's no way to throttle that work.
I think we need a hard limit on the number of list pages, before we can
consider accepting this patch. After the limit is full, the next
inserter can flush the list, inserting the tuples in the list into the
tree, or just fall back to regular, slow, inserts.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
I think we need a hard limit on the number of list pages, before we can
consider accepting this patch. After the limit is full, the next inserter can
flush the list, inserting the tuples in the list into the tree, or just fall
back to regular, slow, inserts.
I do like the idea of having the work fall to vacuum though. Perhaps we need
some way for autovacuum to ask an access method what shape an index is in and
whether it needs vacuuming? Or more likely a separate command from vacuum that
specifically cleans up an index.
--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's On-Demand Production Tuning
Gregory Stark wrote:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
I think we need a hard limit on the number of list pages, before we can
consider accepting this patch. After the limit is full, the next inserter can
flush the list, inserting the tuples in the list into the tree, or just fall
back to regular, slow, inserts.
I do like the idea of having the work fall to vacuum though. Perhaps we need
some way for autovacuum to ask an access method what shape an index is in and
whether it needs vacuuming? Or more likely a separate command from vacuum that
specifically cleans up an index.
Yeah, this is what we agreed to in Ottawa. We need to collect some
different stats for the GIN indexes where this is active, and ensure
that autovacuum checks them.
--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
grovel through. The situation will only be rectified at the next vacuum,
but if there's no deletes or updates on the table, just inserts,
autovacuum won't happen until the next anti-wraparound vacuum.
There is no agreement here, see
http://archives.postgresql.org/message-id/2818.1216753264@sss.pgh.pa.us
Yet another problem is that if so much work is offloaded to autovacuum,
it can tie up autovacuum workers for a very long time. And the work can
happen on an unfortunate time, when the system is busy, and affect other
queries. There's no vacuum_delay_point()s in gininsertcleanup, so
there's no way to throttle that work.
Will insert.
I think we need a hard limit on the number of list pages, before we can
consider accepting this patch. After the limit is full, the next
inserter can flush the list, inserting the tuples in the list into the
tree, or just fall back to regular, slow, inserts.
A hard limit is not a very good decision:
- If it flushes when the limit is reached, then sometimes an insert or update
will take an unacceptable amount of time. A small limit is not very helpful; a
large one will take a lot of time. Although, if we calculate the limit from
the work_mem setting, then maybe it would be useful, since bulk insert
collects all pending pages in memory at once (see the sketch after this list).
- Falling back to regular inserts will take a long time for an update of the
whole table - and that was one of the reasons for this patch. Users forget to
drop the GIN index before a global update, and the query runs forever.
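For example (illustrative arithmetic only; whether work_mem is the right knob
is exactly the open question):

/* Sketch of a work_mem-derived cap on the pending list (illustrative
 * arithmetic only, not patch code): flush once the list outgrows what one
 * bulk-insert pass can hold in memory, assuming work_mem is in kilobytes. */
#include <stdbool.h>

#define BLCKSZ 8192                     /* PostgreSQL's default page size */

static bool
pending_list_over_limit(long n_pending_pages, long work_mem_kb)
{
    long max_pages = work_mem_kb * 1024L / BLCKSZ;

    return n_pending_pages > max_pages;
}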
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
Teodor Sigaev wrote:
- Falling back to regular inserts will take a long time for an update of
the whole table - and that was one of the reasons for this patch. Users
forget to drop the GIN index before a global update, and the query runs
forever.
If *that* is a use case we're interested in, the incoming tuples could
be accumulated in backend-private memory, and inserted into the index at
commit. That would be a lot simpler, with no need to worry about
concurrent inserts or vacuums.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
Teodor Sigaev wrote:
- Falling back to regular inserts will take a long time for an update of
the whole table - and that was one of the reasons for this patch. Users
forget to drop the GIN index before a global update, and the query runs
forever.
If *that* is a use case we're interested in, the incoming tuples could
be accumulated in backend-private memory, and inserted into the index at
commit. That would be a lot simpler, with no need to worry about
concurrent inserts or vacuums.
Doesn't work --- the index would yield wrong answers for later queries
in the same transaction.
regards, tom lane
Tom Lane wrote:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
If *that* is a use case we're interested in, the incoming tuples could
be accumulated in backend-private memory, and inserted into the index at
commit. That would be a lot simpler, with no need to worry about
concurrent inserts or vacuums.
Doesn't work --- the index would yield wrong answers for later queries
in the same transaction.
Queries would still need to check the backend-private list.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On 3 Dec 2008, at 06:57 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
Tom Lane wrote:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
If *that* is a use case we're interested in, the incoming tuples
could be accumulated in backend-private memory, and inserted into
the index at commit. That would be a lot simpler, with no need to
worry about concurrent inserts or vacuums.
Doesn't work --- the index would yield wrong answers for later queries
in the same transaction.
Queries would still need to check the backend-private list.
More to the point -- at least if I'm guessing right about Tom's
thoughts -- queries would still have to check the heap. That is, the
backend-private list would just be a proxy for buffered *index* tuples.
If we do this though it would be really nice to do it at a higher
level than the indexam. If we could do it for any indexam that
provides a kind of bulk insert method that would be great.
I'm just not sure how to support all the indexable operators for the
various indexams on the local buffered list.
Incidentally buffering btree index inserts was originally Heikki's idea.
Greg Stark <greg.stark@enterprisedb.com> writes:
If we do this though it would be really nice to do it at a higher
level than the indexam. If we could do it for any indexam that
provides a kind of bulk insert method that would be great.
I'm just not sure how to support all the indexable operators for the
various indexams on the local buffered list.
In principle, just return all those TIDs marked "lossy, please recheck".
This is a bit brute-force but I'm not sure any useful optimization is
possible.
regards, tom lane
Tom Lane wrote:
Greg Stark <greg.stark@enterprisedb.com> writes:
If we do this though it would be really nice to do it at a higher
level than the indexam. If we could do it for any indexam that
provides a kind of bulk insert method that would be great.
I'm just not sure how to support all the indexable operators for the
various indexams on the local buffered list.
In principle, just return all those TIDs marked "lossy, please recheck".
This is a bit brute-force but I'm not sure any useful optimization is
possible.
You could flush the local buffer to the index whenever the index is
queried. Not sure if it's better than returning them for recheck, though.
This wouldn't work for unique indexes, BTW, but that's not a problem for
GIN.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Changes:
- added vacuum_delay_point() in gininsertcleanup
- added a trigger to run vacuum based on the number of tuples inserted since
the last vacuum. The number of inserted tuples represents the number of
tuples really inserted into the index, and is calculated as
tuples_inserted + tuples_updated - tuples_hot_updated.
The trigger fires on instuples > vac_base_thresh because search time is
linear in the number of pending pages (tuples).
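The trigger condition amounts to this (a sketch; the variable names follow
the description above, not necessarily the patch code):

/* Sketch of the trigger condition just described (names follow the
 * description above, not necessarily the patch code). */
#include <stdbool.h>

static bool
gin_needs_autovacuum(long tuples_inserted, long tuples_updated,
                     long tuples_hot_updated, long vac_base_thresh)
{
    /* tuples really inserted into the index since the last vacuum */
    long instuples = tuples_inserted + tuples_updated - tuples_hot_updated;

    return instuples > vac_base_thresh;
}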
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
Attachment: fast_insert_gin-0.17.gz