GIN - Generalized Inverted iNdex.

Started by Teodor Sigaevalmost 20 years ago4 messages
#1Teodor Sigaev
teodor@sigaev.ru

We (me and Oleg) are glad to present GIN to PostgreSQL. If community will agree,
we will commit it to HEAD branch.

http://www.sigaev.ru/gin/gin.gz
http://www.sigaev.ru/gin/README

Install:
% cd pgsql
% zcat gin.gz | patch -p0
make and initdb

README:

Gin for PostgreSQL

Gin stands for Generalized Inverted iNdex and should be considered as a genie,
not drink.

Generalized means that index doesn't know what operation it accelerates, it
works with strategies, defined for specific data type (Index Method Strategies).
In that sense, gin is similar to GiST and differ from btree index,which has
predefined comparison based operations.

Inverted index is an index structure storing a (key,posting list) pairs, where
posting list is a set of documents, which contain this key (document, usually,
contains many keys). The primary goal of the Gin index is a scalable full text
search in PostgreSQL.

Gin is consists of a B-tree builded over entries (ET, entries tree), where
entry is an element of indexed value ( element of array, lexeme for tsvector)
and where each tuple in the leaf pages is either pointer to B-tree over item
pointers (PT, posting tree), or list of item pointers (PL, posting list) if
tuple is small enough.

Notes: There is no delete operation for ET. The reason for that is that from
our experience, a set of unique words of a whole collection changed very rare.
This greatly simplify code and concurrency algorithm.

Gin comes with built-in support for one dimensional arrays ( integer[], text[],
no support for NULL elements) and following operations:

* contains : value_array @ query_array
* overlap : value_array && query_array
* contained: value_array ~ query_array

Synopsis

=# create index txt_idx on aa using gin(a);

Features

* Concurrency
* WAL-logging (recoverable)
* user-defined opclass, the scheme is similar to GiST
* optimized index creation (make use of maintenance_work_mem to accumulate
postings in memory)

Limitations

* no support for multicolumn indices
* Gin doesn't uses scan->kill_prior_tuple & scan->ignore_killed_tuples
* Gin search entry only by equality matching, this may be improved in future

Gin interface

OpClass interface (pseudocode). Example for opclass is in ginarayproc.c.

Datum* extractValue(Datum inputValue, uint32* nentries)
Returns array of Datum of entries of value to be indexed, nentries should
contains number of returning entries
int compareEntry( Datum a, Datum b )
Compares two entries (not the indexing values!)
Datum* extractQuery(Datum query, uint32* nentries, StrategyNumber n)
Returns array of Datum of entries of query to be searched, n contains
Strategy number of operation.
bool consistent( bool[] check, StrategyNumber n, Datum query)
The size of check array is the same as sizeof of array returned by
extractQuery. Each element of check array is true if indexed value has
corresponding entry in the query, i.e. if check[i] == TRUE then i-th entry
of query is presented in indexed value. Function should returns true if
indexed value matches by StrategyNumber and query.

Open items (We appreciate any comments, help, advice and recommendations).

* teach optimizer/executor, that GIN is intrinsically clustered, i.e., it
always returns ItemPointer in ascending order.
* tweak gincostestimate
* GIN stores several ItemPointer to heap tuple, so vacuum full produces
warning message:
WARNING: index "idx" contains 88395 row versions, but table contains
51812 row versions
HINT: Rebuild the index with REINDEX.

TODO

Nearest future:

* add tsearch2 opclass
* create full scale test suite

Distant future:

* Replace entry btree to something like to GiST
* Add multicolumn support
* Optimize insert operation (use background index insertion)

Authors:
Teodor Sigaev <teodor@sigaev.ru>
Oleg Bartunov <oleg@sai.msu.su>

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/

#2Martijn van Oosterhout
kleptog@svana.org
In reply to: Teodor Sigaev (#1)
Re: GIN - Generalized Inverted iNdex.

On Fri, Apr 07, 2006 at 05:05:12PM +0400, Teodor Sigaev wrote:

We (me and Oleg) are glad to present GIN to PostgreSQL. If community will
agree, we will commit it to HEAD branch.

http://www.sigaev.ru/gin/gin.gz
http://www.sigaev.ru/gin/README

Fantastic! I was thinking about this a while ago and it is a new type
of index that perfectly complements the existing types.

Neat stuff,

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

Show quoted text

Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
tool for doing 5% of the work and then sitting around waiting for someone
else to do the other 95% so you can sue them.

#3Oleg Bartunov
oleg@sai.msu.su
In reply to: Teodor Sigaev (#1)
Re: GIN - Generalized Inverted iNdex.

One addition - some results of Gin testing is available:
http://www.sai.msu.su/~megera/oddmuse/index.cgi/GinTest

Oleg

On Fri, 7 Apr 2006, Teodor Sigaev wrote:

We (me and Oleg) are glad to present GIN to PostgreSQL. If community will
agree, we will commit it to HEAD branch.

http://www.sigaev.ru/gin/gin.gz
http://www.sigaev.ru/gin/README

Install:
% cd pgsql
% zcat gin.gz | patch -p0
make and initdb

README:

Gin for PostgreSQL

Gin stands for Generalized Inverted iNdex and should be considered as a
genie,
not drink.

Generalized means that index doesn't know what operation it accelerates, it
works with strategies, defined for specific data type (Index Method
Strategies).
In that sense, gin is similar to GiST and differ from btree index,which has
predefined comparison based operations.

Inverted index is an index structure storing a (key,posting list) pairs,
where
posting list is a set of documents, which contain this key (document,
usually,
contains many keys). The primary goal of the Gin index is a scalable full
text
search in PostgreSQL.

Gin is consists of a B-tree builded over entries (ET, entries tree), where
entry is an element of indexed value ( element of array, lexeme for tsvector)
and where each tuple in the leaf pages is either pointer to B-tree over item
pointers (PT, posting tree), or list of item pointers (PL, posting list) if
tuple is small enough.

Notes: There is no delete operation for ET. The reason for that is that from
our experience, a set of unique words of a whole collection changed very
rare.
This greatly simplify code and concurrency algorithm.

Gin comes with built-in support for one dimensional arrays ( integer[],
text[],
no support for NULL elements) and following operations:

* contains : value_array @ query_array
* overlap : value_array && query_array
* contained: value_array ~ query_array

Synopsis

=# create index txt_idx on aa using gin(a);

Features

* Concurrency
* WAL-logging (recoverable)
* user-defined opclass, the scheme is similar to GiST
* optimized index creation (make use of maintenance_work_mem to
accumulate
postings in memory)

Limitations

* no support for multicolumn indices
* Gin doesn't uses scan->kill_prior_tuple & scan->ignore_killed_tuples
* Gin search entry only by equality matching, this may be improved in
future

Gin interface

OpClass interface (pseudocode). Example for opclass is in ginarayproc.c.

Datum* extractValue(Datum inputValue, uint32* nentries)
Returns array of Datum of entries of value to be indexed, nentries should
contains number of returning entries
int compareEntry( Datum a, Datum b )
Compares two entries (not the indexing values!)
Datum* extractQuery(Datum query, uint32* nentries, StrategyNumber n)
Returns array of Datum of entries of query to be searched, n contains
Strategy number of operation.
bool consistent( bool[] check, StrategyNumber n, Datum query)
The size of check array is the same as sizeof of array returned by
extractQuery. Each element of check array is true if indexed value has
corresponding entry in the query, i.e. if check[i] == TRUE then i-th
entry
of query is presented in indexed value. Function should returns true if
indexed value matches by StrategyNumber and query.

Open items (We appreciate any comments, help, advice and recommendations).

* teach optimizer/executor, that GIN is intrinsically clustered, i.e., it
always returns ItemPointer in ascending order.
* tweak gincostestimate
* GIN stores several ItemPointer to heap tuple, so vacuum full produces
warning message:
WARNING: index "idx" contains 88395 row versions, but table contains
51812 row versions
HINT: Rebuild the index with REINDEX.

TODO

Nearest future:

* add tsearch2 opclass
* create full scale test suite

Distant future:

* Replace entry btree to something like to GiST
* Add multicolumn support
* Optimize insert operation (use background index insertion)

Authors:
Teodor Sigaev <teodor@sigaev.ru>
Oleg Bartunov <oleg@sai.msu.su>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

#4Oleg Bartunov
oleg@sai.msu.su
In reply to: Martijn van Oosterhout (#2)
Re: GIN - Generalized Inverted iNdex.

On Fri, 7 Apr 2006, Martijn van Oosterhout wrote:

On Fri, Apr 07, 2006 at 05:05:12PM +0400, Teodor Sigaev wrote:

We (me and Oleg) are glad to present GIN to PostgreSQL. If community will
agree, we will commit it to HEAD branch.

http://www.sigaev.ru/gin/gin.gz
http://www.sigaev.ru/gin/README

Fantastic! I was thinking about this a while ago and it is a new type
of index that perfectly complements the existing types.

it already outperforms intarray.

Neat stuff,

Have a nice day,

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83