JSON vs Text + Regexp Index Searching
I am working on a project for which I require some assistance. Any input
would be appreciated.
We have a table with millions of records and dozens of columns. On some
systems, it takes up >10GB of disk space, and it runs on a single disk
which is rather slow (7200 RPM). We typically don't do any complicated
queries on it, so performance has not posed a problem previously. The table
is actually defined as 3 tables using inheritance. One table shows only the
last 7 days of data, and then a process moves the 8th day into the history
table for archiving. For performance reasons, we typically only query the
table with 7 days of data, which generally does not exceed 1 million rows.
In fact, it is typically under 50k rows for most systems. However, a few
systems may contain a few million records, at most. There are indices on
each of the fields we query, and we also put limits on the number of
records returned. The table is practically an append-only table.
We are looking at adding some additional columns to the table, and planning
for some future features. One of those features will allow cross
referencing the records in this table with a list of IDs. If the record's
list of IDs contains the one we are looking for, we want to include the
record in our query. I advocated creating a separate mapping table which
maps the ID of these records to the other ID we are searching for and
performing a JOIN on the two tables with appropriate foreign key
relationships and indices. However, I was asked to instead put the list into
a single column on each row to reduce implementation complexity.
Assuming the list of IDs is in a column on each row as TEXT in the format
of a JSON array, what is the best way to index the column so I can quickly
find the rows with the given ID?
Limitations:
We run version 9.0.x on these systems right now. We plan on updating later
this year, but for now we don't have the JSON type. Once we do, we plan to
make use of it. The column value will also be passed to the UI, which is
JavaScript-based, so a JSON format would be ideal.
We cannot modify the disks we run on to make them faster. It is cost
prohibitive in our particular scenario.
The systems in question are real-time systems, and are carefully optimized.
Any excessively lengthy queries which have high disk IO or CPU usage will
adversely affect the entire system.
My Thoughts:
Is it possible to use a REGEXP INDEX in this case? For example: column ~
'[\[,]67[,\]]'
so the ID matches whether it is the first, middle, last, or only element of
the array, without also matching longer IDs such as 671.
Will such an index perform well? How would I create it?
How badly would a simple LIKE operation perform? Is there any way to
optimize it?
Would using a JSON type in the future allow us to do any better type of
index on the column?
If we wanted to put an index on a JSON data type column whose value is a
simple array of IDs, what would the index look like, and how would we
construct the query to make use of it?
Thanks in advance for any input / insight on this.
On Tue, Feb 25, 2014 at 9:08 AM, Eliot Gable
<egable+pgsql-general@gmail.com> wrote:
If we wanted to put an index on a JSON data type column whose value is a
simple array of IDs, what would the index look like, and how would we
construct the query to make use of it?
For 9.0, your only option would be to use the intarray extension. If
your IDs are all integers, this may even be your best option. 9.4
hstore may get the ability to support similar operations in a more
general way. json type may be able to optimize this case in the
future (probably as jsonb), but as of 9.4 it can't.
merlin
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
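For reference, the intarray approach Merlin describes might look like the
following sketch (the table and column names here are illustrative, not from
the thread):

```sql
-- Hypothetical table holding the per-row list of cross-reference IDs.
-- Note: CREATE EXTENSION is 9.1+; on 9.0, install intarray by running the
-- contrib _int.sql script shipped with the server instead.
CREATE EXTENSION intarray;

CREATE TABLE records (
    id      serial PRIMARY KEY,
    id_xref int4[] NOT NULL DEFAULT '{}'
);

-- GIN index using the intarray operator class
CREATE INDEX records_id_xref_gin
    ON records USING gin (id_xref gin__int_ops);

-- "Does the list contain ID 67?" -- the @> containment operator
-- can use the GIN index above.
SELECT *
FROM records
WHERE id_xref @> ARRAY[67];
```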
Eliot Gable-4 wrote
I advocated creating a separate mapping table which
maps the ID of these records to the other ID we are searching for and
performing a JOIN on the two tables with appropriate foreign key
relationships and indices. However, I was asked to instead put the list into
a single column on each row to reduce implementation complexity.
Assuming the list of IDs is in a column on each row as TEXT in the format
of a JSON array, what is the best way to index the column so I can quickly
find the rows with the given ID?
I recommend benchmarking two implementations:
1) id_xref integer[] --on the same table, use "search_id = ANY(id_xref)" as
the WHERE condition
2) Your multi-table solution but use "EXISTS (SELECT 1 FROM xref_master
WHERE search_id = id_xref)"
And I'd politely respond that implementation complexity matters somewhat less
than performance in an embedded system. Not that either of these solutions is
particularly complex, and both can readily be encapsulated into functions to
hide any such complexity from the application.
I would not introduce the added indirection of storing the values as a
single JSON array - especially if the IDs are integers, but even if you
represent them as text.
The fact that you want to use LIKE/REGEX confuses me, but that may be because
you are limiting yourself to text. Most cross-ref searches know the exact ID
being looked for, so pattern matching is not required...
David J.
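The two variants above might be sketched concretely as follows (assuming a
table named "records" with a primary key "id"; all names are illustrative):

```sql
-- 1) Array column on the same table; note that "= ANY(arraycol)"
--    is evaluated per row and does not use a GIN index.
SELECT r.*
FROM records r
WHERE 67 = ANY (r.id_xref)
LIMIT 100;

-- 2) Separate mapping table with a semi-join
CREATE TABLE xref_master (
    record_id integer NOT NULL REFERENCES records (id),
    search_id integer NOT NULL
);
CREATE INDEX xref_master_search_idx ON xref_master (search_id, record_id);

SELECT r.*
FROM records r
WHERE EXISTS (SELECT 1
              FROM xref_master x
              WHERE x.search_id = 67
                AND x.record_id = r.id)
LIMIT 100;
```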
--
View this message in context: http://postgresql.1045698.n5.nabble.com/JSON-vs-Text-Regexp-Index-Searching-tp5793472p5793492.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.
David Johnston wrote
Eliot Gable-4 wrote
[...] what is the best way to index the column so I can quickly find the
rows with the given ID?
I recommend benchmarking two implementations:
1) id_xref integer[] -- on the same table, use "search_id = ANY(id_xref)"
as the WHERE condition [...]
#1 can be greatly expanded in usefulness by making use of the "intarray"
contrib/extension, as Merlin mentioned up-thread.
David J.
On Tue, Feb 25, 2014 at 11:17 AM, David Johnston <polobo@yahoo.com> wrote:
#1 can be greatly expanded in usefulness by making use of the "intarray"
contrib/extension, as Merlin mentioned up-thread.
David J.
Thank you both for the responses. I will benchmark the options you
suggested.
On Tue, Feb 25, 2014 at 3:13 PM, Eliot Gable
<egable+pgsql-general@gmail.com> wrote:
On Tue, Feb 25, 2014 at 11:17 AM, David Johnston <polobo@yahoo.com> wrote:
#1 can be greatly expanded in usefulness by making use of the "intarray"
contrib/extension, as Merlin mentioned up-thread.
Is there some way to make intarray work with int8 or some int8 variation of
intarray? The data type of the IDs is BIGINT.
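[Editor's note, not from the thread: intarray only supports int4, so it
cannot be used for BIGINT IDs directly. One avenue that may be worth
benchmarking is a plain bigint[] column with PostgreSQL's built-in GIN
operator class for arrays, which requires no extension. A sketch, with
illustrative names:]

```sql
-- Hypothetical bigint[] variant; uses the default GIN array operator
-- class rather than intarray (which is int4-only).
CREATE TABLE records (
    id      bigserial PRIMARY KEY,
    id_xref bigint[] NOT NULL DEFAULT '{}'
);

CREATE INDEX records_id_xref_gin ON records USING gin (id_xref);

-- Containment query that can use the GIN index; note the cast so the
-- searched value is bigint, matching the column's element type.
SELECT *
FROM records
WHERE id_xref @> ARRAY[67::bigint];
```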