Understanding and implementing a GiST Index

Started by Connor Wolfover 11 years ago4 messages

wolf@imaginaryindustries.com

over 11 years ago

Hi there!

I'm trying to implement a custom indexing system using the GiST index
framework, and I have a few questions.
Basically, I'm trying to implement a system that allows me to search
across a set of 64 bit integers by hamming distance. This is intended to
be used in searching for similar images, where the 64 bit integer is
actually a phash
<http://www.hackerfactor.com/blog/?/archives/432-Looks-Like-It.html> of
an image, and similarity between two images is reflected in the hamming
distance between two integers.

Anyways, The appropriate approach here is to use something called a BK
tree
<http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees>,
for which I've written some test code
<https://github.com/fake-name/MangaCMS/blob/master/deduplicator/hamDb.py#L27>
and I think I have a decent grip of (my test code seems to work, in any
event).

That said:

* Is there a /simple/ piece of example-code for implementing a GiST
index for a single index? I've been looking through the /contrib/
directory, particularly the /contrib/btree_gist/ files, but it looks
like 90% of the complexity there is related to supporting all the
column types Postgres has, rather then actually tied to producing a
functional index.
* Once I have something compiling, how can I check to be sure that I'm
actually using the indexing module I created? I can enable my
compiled extension using `CREATE EXTENSION`, but is there an option
for `CREATE INDEX test_index ON testing USING gist (val);` that lets
me specify *which* GiST index is actually employed? Is this even a
valid question?
This is probably something that's available in one of the system
tables somewhere, but I haven't had much luck with google finding
out where.
* Testing: What's the appropriate way to examine the generated tree
structure of the index? I certainly went through a few bugs with my
test tree system (and that was in python!). Is there any way to
examine the index structure for debugging purposes?
* Also, it looks like `ereport()` is the proper way to emit debugging
information. Is this correct?
* In that vein, is there any way to have information that is on the
operations of an entire query? Looking at the number of tree nodes
touched for a scan would be nice (and I would not be surprised if
there is already a facility for it).

Project code is here <https://github.com/fake-name/pg-spgist_hamming> if
anyone is interested, any help would be great. I have very little idea
what I'm doing.

Thanks,
Connor

Sergey Konoplev

gray.ru@gmail.com

over 11 years ago

In reply to: Connor Wolf (#1)

Re: Understanding and implementing a GiST Index

On Thu, Oct 9, 2014 at 12:09 AM, Connor Wolf
<wolf@imaginaryindustries.com> wrote:

I'm trying to implement a custom indexing system using the GiST index
framework, and I have a few questions.
Basically, I'm trying to implement a system that allows me to search across
a set of 64 bit integers by hamming distance. This is intended to be used in
searching for similar images, where the 64 bit integer is actually a phash
of an image, and similarity between two images is reflected in the hamming
distance between two integers.

Have you seen the smlar extension?

http://www.pgcon.org/2012/schedule/events/443.en.html
http://sigaev.ru/git/gitweb.cgi?p=smlar.git;a=blob;hb=HEAD;f=README
http://railsware.com/blog/2012/05/10/effective-similarity-search-in-postgresql/

Anyways, The appropriate approach here is to use something called a BK tree,
for which I've written some test code and I think I have a decent grip of
(my test code seems to work, in any event).

That said:

Is there a simple piece of example-code for implementing a GiST index for a
single index? I've been looking through the /contrib/ directory,
particularly the /contrib/btree_gist/ files, but it looks like 90% of the
complexity there is related to supporting all the column types Postgres has,
rather then actually tied to producing a functional index.
Once I have something compiling, how can I check to be sure that I'm
actually using the indexing module I created? I can enable my compiled
extension using `CREATE EXTENSION`, but is there an option for `CREATE INDEX
test_index ON testing USING gist (val);` that lets me specify *which* GiST
index is actually employed? Is this even a valid question?
This is probably something that's available in one of the system tables
somewhere, but I haven't had much luck with google finding out where.
Testing: What's the appropriate way to examine the generated tree structure
of the index? I certainly went through a few bugs with my test tree system
(and that was in python!). Is there any way to examine the index structure
for debugging purposes?
Also, it looks like `ereport()` is the proper way to emit debugging
information. Is this correct?
In that vein, is there any way to have information that is on the operations
of an entire query? Looking at the number of tree nodes touched for a scan
would be nice (and I would not be surprised if there is already a facility
for it).

Project code is here if anyone is interested, any help would be great. I
have very little idea what I'm doing.

Thanks,
Connor

--
Kind regards,
Sergey Konoplev
PostgreSQL Consultant and DBA

http://www.linkedin.com/in/grayhemp
+1 (415) 867-9984, +7 (499) 346-7196, +7 (988) 888-1979
gray.ru@gmail.com

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Connor Wolf

wolf@imaginaryindustries.com

over 11 years ago

In reply to: Sergey Konoplev (#2)

Re: Understanding and implementing a GiST Index

I had skimmed the presentation slides, but I hadn't looked that closely
because it appeared to be using cosine similarity metrics, and only
operated on matrices, neither of which are what I wanted.

On closer examination, I think I could explode my packed hash values to
a matrix. I'm not sure how the cosine similarity metric would work given
that the arrays would only contain the values either 0 or 1 (as my hash
value is fundamentally a integer of configurable length (you can alter
the hash size, I'm using 64 bits)).

I'll check out the code and poke it a bit.

I'll probably also move this discussion to the hackers mailing list (the
instructions say to ask here first, but Oleg suggested I go there, and I
generally agree).

Connor

On 10/9/2014 8:19 AM, Sergey Konoplev wrote:

On Thu, Oct 9, 2014 at 12:09 AM, Connor Wolf
<wolf@imaginaryindustries.com> wrote:

I'm trying to implement a custom indexing system using the GiST index
framework, and I have a few questions.
Basically, I'm trying to implement a system that allows me to search across
a set of 64 bit integers by hamming distance. This is intended to be used in
searching for similar images, where the 64 bit integer is actually a phash
of an image, and similarity between two images is reflected in the hamming
distance between two integers.

Have you seen the smlar extension?

http://www.pgcon.org/2012/schedule/events/443.en.html
http://sigaev.ru/git/gitweb.cgi?p=smlar.git;a=blob;hb=HEAD;f=README
http://railsware.com/blog/2012/05/10/effective-similarity-search-in-postgresql/

Anyways, The appropriate approach here is to use something called a BK tree,
for which I've written some test code and I think I have a decent grip of
(my test code seems to work, in any event).

That said:

Is there a simple piece of example-code for implementing a GiST index for a
single index? I've been looking through the /contrib/ directory,
particularly the /contrib/btree_gist/ files, but it looks like 90% of the
complexity there is related to supporting all the column types Postgres has,
rather then actually tied to producing a functional index.
Once I have something compiling, how can I check to be sure that I'm
actually using the indexing module I created? I can enable my compiled
extension using `CREATE EXTENSION`, but is there an option for `CREATE INDEX
test_index ON testing USING gist (val);` that lets me specify *which* GiST
index is actually employed? Is this even a valid question?
This is probably something that's available in one of the system tables
somewhere, but I haven't had much luck with google finding out where.
Testing: What's the appropriate way to examine the generated tree structure
of the index? I certainly went through a few bugs with my test tree system
(and that was in python!). Is there any way to examine the index structure
for debugging purposes?
Also, it looks like `ereport()` is the proper way to emit debugging
information. Is this correct?
In that vein, is there any way to have information that is on the operations
of an entire query? Looking at the number of tree nodes touched for a scan
would be nice (and I would not be surprised if there is already a facility
for it).

Project code is here if anyone is interested, any help would be great. I
have very little idea what I'm doing.

Thanks,
Connor

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Connor Wolf

wolf@imaginaryindustries.com

over 11 years ago

In reply to: Connor Wolf (#1)

Re: [GENERAL] Understanding and implementing a GiST Index

I'd be ok with either SP-GiST or GiST, though there are tree structures
(BK tree) that are particularly suited to this application that I
/think/ map to GiST better then SP-GiST.

Re: hamming in the RD-tree implementation: Where? I cloned the smlar git
repo, and poked around, and it looks like it only can operate on set
intersections, which is a non-starter for what I need to do. I spent a
while looking through the source, and I didn't see anything that looked
like hamming distance calculation (through I probably missed some stuff.
The indirection through `FCall2()` makes things very hard to follow).

Anyways, right now, smlar is not useful to me, because it completely
ignores the structure of incoming arrays:

postgres=# SELECT smlar(
postgres(# '{1,0,0,0, 1,1,1,1, 1,1,1,0, 0,1,0,0}'::int[],
postgres(# '{1,0,0,1, 0,1,0,0, 0,1,0,0, 0,1,1,0}'
postgres(# );
smlar
-------
1
(1 row)

For two radically different hashes (shortened for example purposes)
which compare as identical.

I spent some time trying to see if I could easily disable the array
uniquefication, and by disabling the calls to `qsort_arg`, and modifying
`numOfIntersect` so it just counts the number of times the array
elements do not match, and I'm getting the behaviour I want out of
`smlar()` at this point:

postgres=# SELECT smlar(
postgres(# '{1,0,0,0, 1,1,1,1, 1,1,1,0, 0,1,0,0}'::int[],
postgres(# '{1,0,0,1, 0,1,0,0, 0,1,0,0, 0,1,1,0}'
postgres(# );
smlar
--------
0.4375
(1 row)

postgres=# SELECT smlar(
'{0,1,1,0}'::int[],
'{1,0,1,0}'
);
smlar
-------
0.5
(1 row)

But the index does not seem to work after the changes I made, and the
debugging print statements (well, `elog()` statements) I inserted into
`cmpArrayElem()` and `numOfIntersect()` are not being hit, so I don't
understand how the index is even being built.

Anyways, I'm rambling a bit. Any input would be great.
Thanks,
Connor

Show quoted text

On 10/9/2014 8:11 AM, Oleg Bartunov wrote:

Just quick recommendation.
You need to ask -hackers mailing list for that. Also, do you want GiST
or SP-GiST ?
We already use hamming distance in rd-tree, which implemented in GiST,
see our presentation
http://www.sai.msu.su/~megera/postgres/talks/pgcon-2012.pdf
<http://www.sai.msu.su/%7Emegera/postgres/talks/pgcon-2012.pdf>, for
example.

Oleg

On Thu, Oct 9, 2014 at 11:09 AM, Connor Wolf
<wolf@imaginaryindustries.com <mailto:wolf@imaginaryindustries.com>>
wrote:

Hi there!

I'm trying to implement a custom indexing system using the GiST
index framework, and I have a few questions.
Basically, I'm trying to implement a system that allows me to
search across a set of 64 bit integers by hamming distance. This
is intended to be used in searching for similar images, where the
64 bit integer is actually a phash
<http://www.hackerfactor.com/blog/?/archives/432-Looks-Like-It.html>
of an image, and similarity between two images is reflected in the
hamming distance between two integers.

Anyways, The appropriate approach here is to use something called
a BK tree
<http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees>,
for which I've written some test code
<https://github.com/fake-name/MangaCMS/blob/master/deduplicator/hamDb.py#L27>
and I think I have a decent grip of (my test code seems to work,
in any event).

That said:

* Is there a /simple/ piece of example-code for implementing a
GiST index for a single index? I've been looking through the
/contrib/ directory, particularly the /contrib/btree_gist/
files, but it looks like 90% of the complexity there is
related to supporting all the column types Postgres has,
rather then actually tied to producing a functional index.
* Once I have something compiling, how can I check to be sure
that I'm actually using the indexing module I created? I can
enable my compiled extension using `CREATE EXTENSION`, but is
there an option for `CREATE INDEX test_index ON testing USING
gist (val);` that lets me specify *which* GiST index is
actually employed? Is this even a valid question?
This is probably something that's available in one of the
system tables somewhere, but I haven't had much luck with
google finding out where.
* Testing: What's the appropriate way to examine the generated
tree structure of the index? I certainly went through a few
bugs with my test tree system (and that was in python!). Is
there any way to examine the index structure for debugging
purposes?
* Also, it looks like `ereport()` is the proper way to emit
debugging information. Is this correct?
* In that vein, is there any way to have information that is on
the operations of an entire query? Looking at the number of
tree nodes touched for a scan would be nice (and I would not
be surprised if there is already a facility for it).

Project code is here
<https://github.com/fake-name/pg-spgist_hamming> if anyone is
interested, any help would be great. I have very little idea what
I'm doing.

Thanks,
Connor

Import Notes

Reply to msg id not found: CAF4Au4xP5Xvzq910nH2xHfPkk_q7gw1D8UbvtifwV+ijk0q+Ww@mail.gmail.com