Multi-entry indexes (with a view to XPath queries)

Started by John Grayover 24 years ago3 messages
#1John Gray
jgray@beansindustry.co.uk

Firstly, I appreciate this may be a hare-brained scheme, but I've been
thinking about indexes in which the tuple pointer is not unique.

The reason for my interest is storing XML documents in text fields in the
database. (It could also help with particular kinds of full-text search?)

I would like to be able to construct indexes on a collection of XML
documents, based on the "value" of certain "fields" within the document.
(In jargon terms, producing an index whose key is the CDATA content of a
particular XML element). This could tie in with the Xpath and XQuery
proposals

A simplified example (from an archaeological site classification system):

<site>
<name>Glebe Farm, Long Itchington</name>
<location scheme="osgb">SU41793684</location>
<feature>
<type>Agricultural:Stock Control</type>
<date scheme="code">med</date>
</feature>
<feature>
<type>Unassigned:Ditch</type>
<size type="depth" unit="m">1.5</size>
</feature>
</site>

I'd like to produce an index on feature types so that I could type
(roughly):

SELECT siteid, xpath(doc,'//site/name'), xpath(doc,'//site/location') FROM
documents WHERE xpath(doc,'feature/type') = 'Agricultural: Stock Control';

[create table documents (integer siteid, text doc)]

Obviously I need to write a basic XML parser that can support such an
xpath function, but it would also be good to index by the results of that
function-i.e. to have an index containing feature type values. As each
document could have any number of these instances, the number of index
tuples would differ from the number of heap tuples.

As far as I can see, there is no particular reason why a btree index could
not be used[*]. However, vacuum.c makes assertions about number of index
tuples == number of heap tuples. I realise this is a useful consistency
check, but would it be possible to have a field in pg_index
(indnoidentity?) that indicates that a given index hasn't got a 1:1
index:heap relationship.

I have tried the approach of decomposing documents into cdata, element and
attribute tables, and I can use joins to extract a list of feature types
etc. (and could use triggers to update this) but the idea of not having to
parse a document to enter it into the database and not requiring
application logic to reconstruct it again seems a potential win for a
system which might store complex documents but usually searches on limited
criteria.

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: John Gray (#1)
Re: Multi-entry indexes (with a view to XPath queries)

"John Gray" <jgray@beansindustry.co.uk> writes:

Firstly, I appreciate this may be a hare-brained scheme, but I've been
thinking about indexes in which the tuple pointer is not unique.

It sounds pretty hare-brained to me all right ;-). What's wrong with
the normal approach of one index tuple per heap tuple, ie, multiple
index tuples with the same key? It seems to me that your idea will just
make index maintenance a lot more difficult. For example, what happens
when one of the referenced rows is deleted? We'd have to actually
change, not just remove, the index tuple, since it'd also be pointing at
undeleted rows. That'll create a whole bunch of concurrency problems.

Obviously I need to write a basic XML parser that can support such an
xpath function, but it would also be good to index by the results of that
function-i.e. to have an index containing feature type values. As each
document could have any number of these instances, the number of index
tuples would differ from the number of heap tuples.

Why would you want multiple index entries for the same key (never mind
whether they are in a single index tuple or multiple tuples) pointing to
the same row?

Actually, after thinking a little more, I suspect the idea you are
really trying to describe here is index entries with finer-than-tuple
granularity. This is not silly, but it is sufficiently outside the
normal domain of SQL that I think you are fighting an uphill battle.
You'd be *much* better off creating a table that has one row per
indexable entity, whatever that is.

I have tried the approach of decomposing documents into cdata, element and
attribute tables, and I can use joins to extract a list of feature types
etc. (and could use triggers to update this) but the idea of not having to
parse a document to enter it into the database

How do you expect that to happen, when you will have to parse it to get
the index terms?

You might be able to address your problem with two tables, one holding
original documents and one with a row for each indexable entity
(document section). This second one would then have the field index
built on it.

regards, tom lane

#3John Gray
jgray@beansindustry.co.uk
In reply to: Tom Lane (#2)
Re: Multi-entry indexes (with a view to XPath queries)

In article <28692.993502132@sss.pgh.pa.us>, tgl@sss.pgh.pa.us (Tom Lane)
wrote:

"John Gray" <jgray@beansindustry.co.uk> writes:

Firstly, I appreciate this may be a hare-brained scheme, but I've been
thinking about indexes in which the tuple pointer is not unique.

It sounds pretty hare-brained to me all right ;-). What's wrong with
the normal approach of one index tuple per heap tuple, ie, multiple
index tuples with the same key? It seems to me that your idea will just
make index maintenance a lot more difficult. For example, what happens
when one of the referenced rows is deleted? We'd have to actually
change, not just remove, the index tuple, since it'd also be pointing at
undeleted rows. That'll create a whole bunch of concurrency problems.

Sorry, I fear my explanation has not been very clear. I'm not suggesting
that one index entry will point to more than one heap tuple, but that more
than one index entry may point to the same heap tuple. In this respect it
is the same as a full-text index (or an 'inverted index' as described by
Hannu). i.e. the data format and algorithm of the btree index need not
change.

If the tuple is changed, then a set of x index tuples which point to it will
become invalid, (and a new set of y index tuples will be created pointing
to then new version) but VACUUM will dispose of the outdated index tuples
readily enough.

In fact, maybe my question should be: Full-text indexing in contrib is
provided by an auxiliary table. Is there a reason why it couldn't be
performed using a functional btree index with (broadly) the same
format?:

word(key) document (heaptuplepointer)
apple 2
apple 3
Bob 1
cow 2
coward 1

Obviously I need to write a basic XML parser that can support such an
xpath function, but it would also be good to index by the results of
that function-i.e. to have an index containing feature type values. As
each document could have any number of these instances, the number of
index tuples would differ from the number of heap tuples.

Why would you want multiple index entries for the same key (never mind
whether they are in a single index tuple or multiple tuples) pointing to
the same row?

Muddled thinking :). I was trying to decide what to do if two identical
items appeared in the same record: should there be two index entries
or just one?

I had thought that this might help queries where you want to count the
number of instances of a particular element -but on reflection, the index
entries aren't a useful way to achieve that.

Actually, after thinking a little more, I suspect the idea you are
really trying to describe here is index entries with finer-than-tuple
granularity. This is not silly, but it is sufficiently outside the
normal domain of SQL that I think you are fighting an uphill battle.
You'd be *much* better off creating a table that has one row per
indexable entity, whatever that is.

I accept that the offset (and its use) is not so straightforward. It would
have benefits for indexing documents, but as SQL isn't built around
document entities, it may be something best left.

In general, I was trying to formulate something where the functionality
didn't rely on extra tables (because I reckoned the choice of extra tables
would depend on the type of documents I was trying to index. HOWEVER,
I realise that this is not true: I can use a table with

path string document
/feature/type Moat 3
/feature/type Cowshed 3

etc. and use a two-column index on path and string instead)

I have tried the approach of decomposing documents into cdata, element
and attribute tables, and I can use joins to extract a list of feature
types etc. (and could use triggers to update this) but the idea of not
having to parse a document to enter it into the database

How do you expect that to happen, when you will have to parse it to get
the index terms?

My feeling was that I wanted the database side to be document-
structure agnostic. In other words, I could use the database as a plain
document store and I would develop the xpath function which could be
used with a sequential scan at first. Then I started to ask whether I could
create a functional index using it -the point being that the xpath
function is like a 'words' function which returns all the words from a
string -it returns more than one entry for a given row. An index based on
this would need to have more than one entry relating to a given heap
tuple.

Thanks for your comments: given that actions speak louder than words
maybe I'll try and implement my fabled xpath operator first, then worry
about indexing or performance improvements :)

Regards

John