Remove 1MB size limit in tsvector

Started by Ildus Kurbangaliev, over 8 years ago. 19 messages, pgsql-hackers.
#1 Ildus Kurbangaliev
i.kurbangaliev@postgrespro.ru

Hello, hackers!

Historically, the tsvector type can't hold more than 1MB of data.
I want to propose a patch that removes that limit.

The limit comes from the 'pos' field of WordEntry, which has only
20 bits of storage.

In the proposed patch I removed this field; instead, I keep offsets
only at every Nth item in the WordEntry array. N is currently 4,
because that gave the best results in my benchmarks. It can be
increased in the future without affecting data already stored in the
database. Removing the field also improves the compression of
tsvectors.

I also simplified the code by introducing functions for building
tsvectors, replacing the duplicated code fragments in the places
where tsvectors were built.

The new patch additionally frees some space in WordEntry that can be
used to store extra information about the saved words.
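The offset scheme described above can be sketched in C. This is a minimal illustration, not the patch's actual code: the struct layout, field names, `OFFSET_STRIDE`, and the `entry_offset` helper are hypothetical simplifications (the real WordEntry packs flags and lengths into bitfields, and stored offsets are SHORTALIGNed). It only shows the core idea: store an explicit offset at every Nth entry and reconstruct the others by summing lexeme lengths.

```c
#include <assert.h>

#define OFFSET_STRIDE 4         /* the patch's N: every 4th entry carries an offset */

/* Hypothetical simplified entry; the real WordEntry is bit-packed. */
typedef struct
{
    unsigned int stored_offset; /* valid only for every Nth entry */
    unsigned int len;           /* lexeme length in bytes */
} Entry;

/*
 * Recover the data offset of entry i: start from the nearest preceding
 * entry that carries an explicit offset, then add up lexeme lengths.
 * At most OFFSET_STRIDE - 1 additions are needed.
 */
static unsigned int
entry_offset(const Entry *entries, int i)
{
    int         base = i - (i % OFFSET_STRIDE);
    unsigned int off = entries[base].stored_offset;

    for (int j = base; j < i; j++)
        off += entries[j].len;
    return off;
}
```

Trading the per-entry 20-bit `pos` field for a short forward scan is what lifts the 1MB ceiling: offsets are no longer constrained to 20 bits, and the per-entry space saved also compresses better.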

--
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Attachments:

tsvector_stretched_v1.patch (text/x-patch, +999/-703)
#2 Robert Haas
robertmhaas@gmail.com
In reply to: Ildus Kurbangaliev (#1)
Re: Remove 1MB size limit in tsvector

On Tue, Aug 1, 2017 at 10:08 AM, Ildus Kurbangaliev
<i.kurbangaliev@postgrespro.ru> wrote:

Historically tsvector type can't hold more than 1MB data.
I want to propose a patch that removes that limit.

That limit is created by 'pos' field from WordEntry, which have only
20 bits for storage.

In the proposed patch I removed this field and instead of it I keep
offsets only at each Nth item in WordEntry's array.

So this would break pg_upgrade for tsvector columns?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3 Ildus Kurbangaliev
i.kurbangaliev@postgrespro.ru
In reply to: Robert Haas (#2)
Re: Remove 1MB size limit in tsvector

On Tue, 1 Aug 2017 14:56:54 -0400
Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Aug 1, 2017 at 10:08 AM, Ildus Kurbangaliev
<i.kurbangaliev@postgrespro.ru> wrote:

Historically tsvector type can't hold more than 1MB data.
I want to propose a patch that removes that limit.

That limit is created by 'pos' field from WordEntry, which have only
20 bits for storage.

In the proposed patch I removed this field and instead of it I keep
offsets only at each Nth item in WordEntry's array.

So this would break pg_upgrade for tsvector columns?

I added a function that will convert old tsvectors on the fly. It's
the same approach that was used in hstore before.

Regards,
Ildus Kurbangaliev


#4 Robert Haas
robertmhaas@gmail.com
In reply to: Ildus Kurbangaliev (#3)
Re: Remove 1MB size limit in tsvector

On Tue, Aug 1, 2017 at 3:10 PM, Ildus K <i.kurbangaliev@postgrespro.ru> wrote:

So this would break pg_upgrade for tsvector columns?

I added a function that will convert old tsvectors on the fly. It's the
approach used in hstore before.

Does that mean the answer to the question that I asked is "yes, but I
have a workaround" or does it mean that the answer is "no"?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#5 Ildus Kurbangaliev
i.kurbangaliev@postgrespro.ru
In reply to: Robert Haas (#4)
Re: Remove 1MB size limit in tsvector

On Tue, 1 Aug 2017 15:33:08 -0400
Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Aug 1, 2017 at 3:10 PM, Ildus K
<i.kurbangaliev@postgrespro.ru> wrote:

So this would break pg_upgrade for tsvector columns?

I added a function that will convert old tsvectors on the fly. It's
the approach used in hstore before.

Does that mean the answer to the question that I asked is "yes, but I
have a workaround" or does it mean that the answer is "no"?

It's a workaround. DatumGetTSVector and DatumGetTSVectorCopy will
upgrade a tsvector on the fly if it is in the old format.
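The on-the-fly upgrade idea can be sketched as follows. This is a hypothetical, heavily simplified model, not the patch's implementation: the header struct, the flag bit, and both function names are invented for illustration. It only shows the shape of the approach: a reader-side wrapper checks a format marker and rewrites the value into the new layout before use, so stored data never has to be migrated eagerly.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical simplified header; the real patch reuses spare header bits. */
typedef struct
{
    unsigned int size_and_flags;    /* top bit marks the new on-disk format */
} TsHeader;

#define TS_NEW_FORMAT 0x80000000u

static bool
ts_is_old_format(const TsHeader *h)
{
    return (h->size_and_flags & TS_NEW_FORMAT) == 0;
}

/*
 * Mimics the DatumGetTSVector idea: return the value unchanged when it is
 * already in the new format, otherwise convert it (here we only set the
 * flag; the real code would rebuild the entry array in the new layout).
 */
static TsHeader
ts_upgrade_if_needed(TsHeader h)
{
    if (ts_is_old_format(&h))
        h.size_and_flags |= TS_NEW_FORMAT;
    return h;
}
```

The cost of this design is the format check (and possible conversion) on every read of an old-format value, which is why the performance question raised later in the thread matters.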

Regards,
Ildus Kurbangaliev


#6 Torsten Zuehlsdorff
mailinglists@toco-domains.de
In reply to: Ildus Kurbangaliev (#5)
Re: Remove 1MB size limit in tsvector

On 01.08.2017 22:00, Ildus K wrote:

On Tue, 1 Aug 2017 15:33:08 -0400
Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Aug 1, 2017 at 3:10 PM, Ildus K
<i.kurbangaliev@postgrespro.ru> wrote:

So this would break pg_upgrade for tsvector columns?

I added a function that will convert old tsvectors on the fly. It's
the approach used in hstore before.

Does that mean the answer to the question that I asked is "yes, but I
have a workaround" or does it mean that the answer is "no"?

It's a workaround. DatumGetTSVector and
DatumGetTSVectorCopy will upgrade tsvector on the fly if it
has old format.

I'm not familiar with pg_upgrade, but want to ask: should this
workaround be part of pg_upgrade?

Greetings,
Torsten


#7 Ildus Kurbangaliev
i.kurbangaliev@postgrespro.ru
In reply to: Torsten Zuehlsdorff (#6)
Re: Remove 1MB size limit in tsvector

On Wed, 9 Aug 2017 09:01:44 +0200
Torsten Zuehlsdorff <mailinglists@toco-domains.de> wrote:

On 01.08.2017 22:00, Ildus K wrote:

On Tue, 1 Aug 2017 15:33:08 -0400
Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Aug 1, 2017 at 3:10 PM, Ildus K
<i.kurbangaliev@postgrespro.ru> wrote:

So this would break pg_upgrade for tsvector columns?

I added a function that will convert old tsvectors on the fly.
It's the approach used in hstore before.

Does that mean the answer to the question that I asked is "yes,
but I have a workaround" or does it mean that the answer is "no"?

It's a workaround. DatumGetTSVector and
DatumGetTSVectorCopy will upgrade tsvector on the fly if it
has old format.

I'm not familiar with pg_upgrade, but want to ask: should this
workaround be part of pg_upgrade?

Greetings,
Torsten

I chose the approach where the data remains unchanged until the user
decides to update it. I'm not very familiar with pg_upgrade myself,
and I don't currently see how the data would be converted with it,
but a conversion step would in any case increase downtime, and the
shorter the downtime the better.

--
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


#8 Robert Haas
robertmhaas@gmail.com
In reply to: Ildus Kurbangaliev (#5)
Re: Remove 1MB size limit in tsvector

On Tue, Aug 1, 2017 at 4:00 PM, Ildus K <i.kurbangaliev@postgrespro.ru> wrote:

It's a workaround. DatumGetTSVector and
DatumGetTSVectorCopy will upgrade tsvector on the fly if it
has old format.

Hmm, that seems like a real fix, not just a workaround. If you can
transparently read the old format, there's no problem. Not sure about
performance, though.

The patch doesn't really conform to our coding standards, though, so
you need to clean it up (or, if you're not sure what you need to do,
you need to have someone who knows how PostgreSQL code needs to look
review it for you).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#9 Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#8)
Re: Remove 1MB size limit in tsvector

On Wed, Aug 9, 2017 at 6:38 PM, Robert Haas <robertmhaas@gmail.com> wrote:

The patch doesn't really conform to our coding standards, though, so
you need to clean it up (or, if you're not sure what you need to do,
you need to have someone who knows how PostgreSQL code needs to look
review it for you).

The documentation has a couple of rules for coding conventions:
https://www.postgresql.org/docs/9.6/static/source.html
--
Michael


#10 Alexander Korotkov
aekorotkov@gmail.com
In reply to: Michael Paquier (#9)
Re: Remove 1MB size limit in tsvector

On Thu, Aug 10, 2017 at 7:37 AM, Michael Paquier <michael.paquier@gmail.com>
wrote:

On Wed, Aug 9, 2017 at 6:38 PM, Robert Haas <robertmhaas@gmail.com> wrote:

The patch doesn't really conform to our coding standards, though, so
you need to clean it up (or, if you're not sure what you need to do,
you need to have someone who knows how PostgreSQL code needs to look
review it for you).

The documentation has a couple of rules for coding conventions:
https://www.postgresql.org/docs/9.6/static/source.html

+1

Ildus, at first glance I see at least the following violations of
PostgreSQL coding standards in your code.

+/*
+ * Converts tsvector with the old structure to current.
+ * @orig - tsvector to convert,
+ * @copy - return copy of tsvector, it has a meaning when tsvector doensn't
+ * need to be converted.
+ */
This comment will be reflowed by pgindent.  Also we don't use '@' for
parameters description in comments.
https://www.postgresql.org/docs/9.6/static/source-format.html
+TSVector
+tsvector_upgrade(Datum orig, bool copy)
+{
+ int       i,
+           dataoff = 0,
+   datalen = 0,
+   totallen;
+ TSVector   in,
+   out;

You have random mix of tabs and spaces here.

+ {
+ stroff = SHORTALIGN(stroff); \
+ entry->hasoff = 0;
+ entry->len = lexeme_len;
+ entry->npos = npos;
+ }

What is this backslash doing here?

There are other similar (and probably different) violations of the
coding standard elsewhere in the code. Ildus, please check your
patches carefully before publishing.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#11 Alexander Korotkov
aekorotkov@gmail.com
In reply to: Robert Haas (#8)
Re: Remove 1MB size limit in tsvector

On Wed, Aug 9, 2017 at 7:38 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Aug 1, 2017 at 4:00 PM, Ildus K <i.kurbangaliev@postgrespro.ru>
wrote:

It's a workaround. DatumGetTSVector and
DatumGetTSVectorCopy will upgrade tsvector on the fly if it
has old format.

Hmm, that seems like a real fix, not just a workaround. If you can
transparently read the old format, there's no problem. Not sure about
performance, though.

+1
Ildus, I think we need to benchmark reading of the old format. There
would be a tradeoff between the performance of old-format reading and
the amount of extra code needed. Once we have benchmarks, we can
consider whether this is the solution we would like to buy.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#12 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alexander Korotkov (#10)
Re: Remove 1MB size limit in tsvector

Alexander Korotkov <a.korotkov@postgrespro.ru> writes:

...
You have random mix of tabs and spaces here.

It's worth running pgindent over your code before submitting. It should
be pretty easy to set that up nowadays, see src/tools/pgindent/README.
(If you find any portability problems while trying to install pgindent,
please let me know.)

regards, tom lane


#13 Ildus Kurbangaliev
i.kurbangaliev@postgrespro.ru
In reply to: Tom Lane (#12)
Re: Remove 1MB size limit in tsvector

On Thu, 10 Aug 2017 11:46:55 -0400
Tom Lane <tgl@sss.pgh.pa.us> wrote:

Alexander Korotkov <a.korotkov@postgrespro.ru> writes:

...
You have random mix of tabs and spaces here.

It's worth running pgindent over your code before submitting. It
should be pretty easy to set that up nowadays, see
src/tools/pgindent/README. (If you find any portability problems
while trying to install pgindent, please let me know.)

Attached a new version of the patch. It mostly contains cosmetic
changes. I rebased it to current master, ran pgindent and fixed
formatting errors.

--
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Attachments:

tsvector_mixed_positions_v2.patch (text/x-patch, +997/-690)
#14 Ildus Kurbangaliev
i.kurbangaliev@postgrespro.ru
In reply to: Alexander Korotkov (#11)
Re: Remove 1MB size limit in tsvector

On Thu, 10 Aug 2017 18:06:17 +0300
Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:

On Wed, Aug 9, 2017 at 7:38 PM, Robert Haas <robertmhaas@gmail.com>
wrote:

On Tue, Aug 1, 2017 at 4:00 PM, Ildus K
<i.kurbangaliev@postgrespro.ru> wrote:

It's a workaround. DatumGetTSVector and
DatumGetTSVectorCopy will upgrade tsvector on the fly if it
has old format.

Hmm, that seems like a real fix, not just a workaround. If you can
transparently read the old format, there's no problem. Not sure
about performance, though.

+1
Ildus, I think we need to benchmark reading of the old format. There
would be tradeoff between performance of old format reading and
amount of extra code needed. Once we will have benchmarks we can
consider whether this is the solution we would like to buy.

In my benchmarks, when the database fits into buffers (so this
measures the time required for the tsvector conversion), I get these
results:

Without conversion:

$ ./tsbench2 -database test1 -bench_time 300
2017/08/17 12:04:44 Number of connections: 4
2017/08/17 12:04:44 Database: test1
2017/08/17 12:09:44 Processed: 51419

With conversion:

$ ./tsbench2 -database test1 -bench_time 300
2017/08/17 12:14:31 Number of connections: 4
2017/08/17 12:14:31 Database: test1
2017/08/17 12:19:31 Processed: 43607

I ran a bunch of these tests, and the results are stable on my
machine. So in these specific tests the performance regression is
about 15%.

At the same time, I think this could be the worst case, because
usually the data is on disk and the conversion will not affect
performance as much.
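The ~15% figure follows directly from the two throughput numbers quoted above (51419 vs. 43607 transactions in equal 300-second runs); a quick check of the arithmetic:

```c
#include <assert.h>

/* Relative slowdown, in percent, of the converting run vs. the baseline. */
static double
regression_pct(double baseline, double converted)
{
    return 100.0 * (baseline - converted) / baseline;
}
```

For the quoted runs this gives roughly 15.2%, matching the "about 15%" claim.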

--
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


#15 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Ildus Kurbangaliev (#14)
Re: Remove 1MB size limit in tsvector

Hi,

On 08/17/2017 12:23 PM, Ildus Kurbangaliev wrote:

In my benchmarks when database fits into buffers (so it's measurement of
the time required for the tsvectors conversion) it gives me these
results:

Without conversion:

$ ./tsbench2 -database test1 -bench_time 300
2017/08/17 12:04:44 Number of connections: 4
2017/08/17 12:04:44 Database: test1
2017/08/17 12:09:44 Processed: 51419

With conversion:

$ ./tsbench2 -database test1 -bench_time 300
2017/08/17 12:14:31 Number of connections: 4
2017/08/17 12:14:31 Database: test1
2017/08/17 12:19:31 Processed: 43607

I ran a bunch of these tests, and these results are stable on my
machine. So in these specific tests performance regression about 15%.

Same time I think this could be the worst case, because usually data
is on disk and conversion will not affect so much to performance.

That seems like a fairly significant regression, TBH. I don't quite
agree we can simply assume in-memory workloads don't matter; plenty
of databases have a 99% cache hit ratio (particularly when
considering not just shared buffers, but also the page cache).

Can you share the benchmarks, so that others can retry running them?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#16 Ildus Kurbangaliev
i.kurbangaliev@postgrespro.ru
In reply to: Tomas Vondra (#15)
Re: Remove 1MB size limit in tsvector

On Thu, 7 Sep 2017 23:08:14 +0200
Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

Hi,

On 08/17/2017 12:23 PM, Ildus Kurbangaliev wrote:

In my benchmarks when database fits into buffers (so it's
measurement of the time required for the tsvectors conversion) it
gives me these results:

Without conversion:

$ ./tsbench2 -database test1 -bench_time 300
2017/08/17 12:04:44 Number of connections: 4
2017/08/17 12:04:44 Database: test1
2017/08/17 12:09:44 Processed: 51419

With conversion:

$ ./tsbench2 -database test1 -bench_time 300
2017/08/17 12:14:31 Number of connections: 4
2017/08/17 12:14:31 Database: test1
2017/08/17 12:19:31 Processed: 43607

I ran a bunch of these tests, and these results are stable on my
machine. So in these specific tests performance regression about
15%.

Same time I think this could be the worst case, because usually data
is on disk and conversion will not affect so much to performance.

That seems like a fairly significant regression, TBH. I don't quite
agree we can simply assume in-memory workloads don't matter, plenty of
databases have 99% cache hit ratio (particularly when considering not
just shared buffers, but also page cache).

I think part of this regression is caused by the better compression
of the new format. I can't give an exact percentage here; I need to
check with perf.

If you care about performance, you create indexes, which means the
tsvector itself will no longer be used for text search (except for
ORDER BY rank). The index machinery only peeks into the tsquery.
Moreover, a RUM index stores positions + lexemes, so it doesn't need
tsvectors for ranked search. As a result, tsvector becomes a storage
format for building indexes (an indexable type), not something that
is used at runtime. And the change of format doesn't affect index
creation time.

Can you share the benchmarks, so that others can retry running them?

The benchmarks are published on GitHub:
https://github.com/ildus/tsbench. I'm not sure they are easy to use,
though.

Best regards,
Ildus Kurbangaliev


#17 Robert Haas
robertmhaas@gmail.com
In reply to: Ildus Kurbangaliev (#16)
Re: Remove 1MB size limit in tsvector

On Mon, Sep 11, 2017 at 5:33 AM, Ildus Kurbangaliev
<i.kurbangaliev@postgrespro.ru> wrote:

Moreover, RUM index
stores positions + lexemes, so it doesn't need tsvectors for ranked
search. As a result, tsvector becomes a storage for
building indexes (indexable type), not something that should be used at
runtime. And the change of the format doesn't affect index creation
time.

RUM indexes, though, are not in core.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#18 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Robert Haas (#17)
Re: Remove 1MB size limit in tsvector

On 09/11/2017 01:54 PM, Robert Haas wrote:

On Mon, Sep 11, 2017 at 5:33 AM, Ildus Kurbangaliev
<i.kurbangaliev@postgrespro.ru> wrote:

Moreover, RUM index
stores positions + lexemes, so it doesn't need tsvectors for ranked
search. As a result, tsvector becomes a storage for
building indexes (indexable type), not something that should be used at
runtime. And the change of the format doesn't affect index creation
time.

RUM indexes, though, are not in core.

Yeah, but I think Ildus has a point that this should not really
matter for indexed tsvectors. So the question is how realistic that
benchmark actually is. How likely are we to run queries on fts
directly, not through a GIN/GiST index? Particularly in
performance-sensitive cases?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#19 Michael Paquier
michael@paquier.xyz
In reply to: Tomas Vondra (#18)
Re: [HACKERS] Remove 1MB size limit in tsvector

On Mon, Sep 11, 2017 at 9:51 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On 09/11/2017 01:54 PM, Robert Haas wrote:

On Mon, Sep 11, 2017 at 5:33 AM, Ildus Kurbangaliev
<i.kurbangaliev@postgrespro.ru> wrote:

Moreover, RUM index
stores positions + lexemes, so it doesn't need tsvectors for ranked
search. As a result, tsvector becomes a storage for
building indexes (indexable type), not something that should be used at
runtime. And the change of the format doesn't affect index creation
time.

RUM indexes, though, are not in core.

Yeah, but I think Ildus has a point that this should not really matter
on indexed tsvectors. So the question is how realistic that benchmark
actually is. How likely are we to do queries on fts directly, not
through a GIN/GiST index? Particularly in performance-sensitive cases?

So many questions unanswered... I am marking the patch as returned
with feedback as the thread has stalled for two months now.
--
Michael