pg_migrator and an 8.3-compatible tsvector data type

Started by Bruce Momjianabout 17 years ago59 messageshackers

bruce@momjian.us

about 17 years ago

I found out at PGCon that the internal format of tsvector changed
slightly from 8.3 to 8.4. Teodor gave me a conversion function and I
have adjusted pg_migrator to install a v8_3_tsvector data type to be
used during the load so the old user tables use that data type. You can
see the code here at the bottom of this file:

http://cvs.pgfoundry.org/cgi-bin/cvsweb.cgi/pg-migrator/pg_migrator/src/pg_migrator.c?rev=1.36&content-type=text/x-cvsweb-markup

I am not done yet because eventually v8_3_tsvector will be moved into a
separate schema and the tsvector data type restored.

The problem I have is that while I have created v8_3_tsvector (with
casts), I have not created index routines for it, so I get this error
when restoring a database that uses a GIN index on tsvector:

Restoring database schema
psql:/u/postgres/pg_migrator_dump_db.sql:7006: ERROR: data type
tsvector has no default operator class for access method "gin"
HINT: You must specify an operator class for the index or define a
default operator class for the data type.

So, how do I create a compatible set of hooks for v8_3_tsvector?
Because tsvector is a native data type I can't figure out how to set up
a similar type. Also, will the indexes be OK if they are created from
types that don't have the 8.3 format? What about new indexes created
after the migration is done?

Ideally the cast would just be called during index activity and the 8.4
tsvector index routines used. I am told by Teodor that GiST indexes
would be fine with the 8.3 data format, and we are forcing the
reindexing of GIN indexes anyway in 8.4.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

bruce@momjian.us

about 17 years ago

In reply to: Bruce Momjian (#1)

Re: pg_migrator and an 8.3-compatible tsvector data type

Because no one has responed

---------------------------------------------------------------------------

Bruce Momjian wrote:

I found out at PGCon that the internal format of tsvector changed
slightly from 8.3 to 8.4. Teodor gave me a conversion function and I
have adjusted pg_migrator to install a v8_3_tsvector data type to be
used during the load so the old user tables use that data type. You can
see the code here at the bottom of this file:

http://cvs.pgfoundry.org/cgi-bin/cvsweb.cgi/pg-migrator/pg_migrator/src/pg_migrator.c?rev=1.36&content-type=text/x-cvsweb-markup

I am not done yet because eventually v8_3_tsvector will be moved into a
separate schema and the tsvector data type restored.

The problem I have is that while I have created v8_3_tsvector (with
casts), I have not created index routines for it, so I get this error
when restoring a database that uses a GIN index on tsvector:

Restoring database schema
psql:/u/postgres/pg_migrator_dump_db.sql:7006: ERROR: data type
tsvector has no default operator class for access method "gin"
HINT: You must specify an operator class for the index or define a
default operator class for the data type.

So, how do I create a compatible set of hooks for v8_3_tsvector?
Because tsvector is a native data type I can't figure out how to set up
a similar type. Also, will the indexes be OK if they are created from
types that don't have the 8.3 format? What about new indexes created
after the migration is done?

Ideally the cast would just be called during index activity and the 8.4
tsvector index routines used. I am told by Teodor that GiST indexes
would be fine with the 8.3 data format, and we are forcing the
reindexing of GIN indexes anyway in 8.4.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

bruce@momjian.us

about 17 years ago

In reply to: Bruce Momjian (#1)

Re: pg_migrator and an 8.3-compatible tsvector data type

Because no one has responded, I am going to prevent pg_migrator from
working with a cluster that uses tsvector. I realize this limits
pg_migrator's usefulness, but I have to move on.

---------------------------------------------------------------------------

Bruce Momjian wrote:

I found out at PGCon that the internal format of tsvector changed
slightly from 8.3 to 8.4. Teodor gave me a conversion function and I
have adjusted pg_migrator to install a v8_3_tsvector data type to be
used during the load so the old user tables use that data type. You can
see the code here at the bottom of this file:

http://cvs.pgfoundry.org/cgi-bin/cvsweb.cgi/pg-migrator/pg_migrator/src/pg_migrator.c?rev=1.36&content-type=text/x-cvsweb-markup

I am not done yet because eventually v8_3_tsvector will be moved into a
separate schema and the tsvector data type restored.

The problem I have is that while I have created v8_3_tsvector (with
casts), I have not created index routines for it, so I get this error
when restoring a database that uses a GIN index on tsvector:

Restoring database schema
psql:/u/postgres/pg_migrator_dump_db.sql:7006: ERROR: data type
tsvector has no default operator class for access method "gin"
HINT: You must specify an operator class for the index or define a
default operator class for the data type.

So, how do I create a compatible set of hooks for v8_3_tsvector?
Because tsvector is a native data type I can't figure out how to set up
a similar type. Also, will the indexes be OK if they are created from
types that don't have the 8.3 format? What about new indexes created
after the migration is done?

Ideally the cast would just be called during index activity and the 8.4
tsvector index routines used. I am told by Teodor that GiST indexes
would be fine with the 8.3 data format, and we are forcing the
reindexing of GIN indexes anyway in 8.4.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

josh@agliodbs.com

about 17 years ago

In reply to: Bruce Momjian (#3)

Re: pg_migrator and an 8.3-compatible tsvector data type

On 5/28/09 2:30 PM, Bruce Momjian wrote:

Because no one has responded, I am going to prevent pg_migrator from
working with a cluster that uses tsvector. I realize this limits
pg_migrator's usefulness, but I have to move on.

I don't know how to fix the issue, certainly.

Why was there a format change to tsvector anyway? Was there an
important reason for this?

--
Josh Berkus
PostgreSQL Experts Inc.
www.pgexperts.com

bruce@momjian.us

about 17 years ago

In reply to: Josh Berkus (#4)

Re: pg_migrator and an 8.3-compatible tsvector data type

Josh Berkus wrote:

On 5/28/09 2:30 PM, Bruce Momjian wrote:

Because no one has responded, I am going to prevent pg_migrator from
working with a cluster that uses tsvector. I realize this limits
pg_migrator's usefulness, but I have to move on.

I don't know how to fix the issue, certainly.

Why was there a format change to tsvector anyway? Was there an
important reason for this?

The ordering of the lexems was changed:

/*
* Convert 8.3 tsvector to 8.4
*
* 8.3 sorts lexemes by its length and if lengths are the same then it uses
* alphabetic order; 8.4 sorts lexemes in lexicographical order, e.g.
*
* => SELECT 'c bb aaa'::tsvector;
* tsvector
* ----------------
* 'aaa' 'bb' 'c' -- 8.4
* 'c' 'bb' 'aaa' -- 8.3
*/

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

josh@agliodbs.com

about 17 years ago

In reply to: Bruce Momjian (#5)

Re: pg_migrator and an 8.3-compatible tsvector data type

Bruce,

The ordering of the lexems was changed:

/*
* Convert 8.3 tsvector to 8.4
*
* 8.3 sorts lexemes by its length and if lengths are the same then it uses
* alphabetic order; 8.4 sorts lexemes in lexicographical order, e.g.
*
* => SELECT 'c bb aaa'::tsvector;
* tsvector
* ----------------
* 'aaa' 'bb' 'c' -- 8.4
* 'c' 'bb' 'aaa' -- 8.3
*/

What does that get us in terms of performance etc.?

Given that it's going to prevent users of Tsearch from
upgrading-in-place, the format change ought to be giving us some serious
gains. Otherwise we should put it off until we need to make other
datatype changes.

--
Josh Berkus
PostgreSQL Experts Inc.
www.pgexperts.com

bruce@momjian.us

about 17 years ago

In reply to: Josh Berkus (#6)

Re: pg_migrator and an 8.3-compatible tsvector data type

Josh Berkus wrote:

Bruce,

The ordering of the lexems was changed:

/*
* Convert 8.3 tsvector to 8.4
*
* 8.3 sorts lexemes by its length and if lengths are the same then it uses
* alphabetic order; 8.4 sorts lexemes in lexicographical order, e.g.
*
* => SELECT 'c bb aaa'::tsvector;
* tsvector
* ----------------
* 'aaa' 'bb' 'c' -- 8.4
* 'c' 'bb' 'aaa' -- 8.3
*/

What does that get us in terms of performance etc.?

Given that it's going to prevent users of Tsearch from
upgrading-in-place, the format change ought to be giving us some serious
gains. Otherwise we should put it off until we need to make other
datatype changes.

No idea but now that we are in beta we would then be invalidating beta
tester data. The area I got stuck on is that there is no CAST behavior
when creating an index. We are already invalidating GIN indexes, but I
can't even create an index to support the old data type.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

tgl@sss.pgh.pa.us

about 17 years ago

In reply to: Josh Berkus (#6)

Re: pg_migrator and an 8.3-compatible tsvector data type

Josh Berkus <josh@agliodbs.com> writes:

Bruce,

The ordering of the lexems was changed:

What does that get us in terms of performance etc.?

It was changed to support partial-match tsvector queries. Without it,
a partial match query would have to scan entire tsvectors instead
of applying binary search. I don't know if Oleg and Teodor did any
actual performance tests on the size of the hit, but it seems like
it could be pretty awful for large documents.

regards, tom lane

tgl@sss.pgh.pa.us

about 17 years ago

In reply to: Bruce Momjian (#7)

Re: pg_migrator and an 8.3-compatible tsvector data type

Bruce Momjian <bruce@momjian.us> writes:

Josh Berkus wrote:

Given that it's going to prevent users of Tsearch from
upgrading-in-place, the format change ought to be giving us some serious
gains. Otherwise we should put it off until we need to make other
datatype changes.

No idea but now that we are in beta we would then be invalidating beta
tester data. The area I got stuck on is that there is no CAST behavior
when creating an index. We are already invalidating GIN indexes, but I
can't even create an index to support the old data type.

It's certainly doable. Bruce is just applying the strategy he mentioned
in our talk ;-)

regards, tom lane

josh@agliodbs.com

about 17 years ago

In reply to: Tom Lane (#8)

Re: pg_migrator and an 8.3-compatible tsvector data type

Tom,

It was changed to support partial-match tsvector queries. Without it,
a partial match query would have to scan entire tsvectors instead
of applying binary search. I don't know if Oleg and Teodor did any
actual performance tests on the size of the hit, but it seems like
it could be pretty awful for large documents.

Ah, ok, that's pretty significant. It makes a huge difference for some
users, one of whom will be quoted on the press release.

--
Josh Berkus
PostgreSQL Experts Inc.
www.pgexperts.com

bruce@momjian.us

about 17 years ago

In reply to: Tom Lane (#9)

Re: pg_migrator and an 8.3-compatible tsvector data type

Tom Lane wrote:

Bruce Momjian <bruce@momjian.us> writes:

Josh Berkus wrote:

Given that it's going to prevent users of Tsearch from
upgrading-in-place, the format change ought to be giving us some serious
gains. Otherwise we should put it off until we need to make other
datatype changes.

No idea but now that we are in beta we would then be invalidating beta
tester data. The area I got stuck on is that there is no CAST behavior
when creating an index. We are already invalidating GIN indexes, but I
can't even create an index to support the old data type.

It's certainly doable. Bruce is just applying the strategy he mentioned
in our talk ;-)

Kind of --- I am stuck because we don't auto-cast for index usage like
we do for function calls and operators, so I will just keep going and
disable tsvector.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

bruce@momjian.us

about 17 years ago

In reply to: Bruce Momjian (#11)

Re: pg_migrator and an 8.3-compatible tsvector data type

On Thu, May 28, 2009 at 11:26 PM, Bruce Momjian <bruce@momjian.us> wrote:

Kind of --- I am stuck because we don't auto-cast for index usage like
we do for function calls and operators, so I will just keep going and
disable tsvector.

I was pretty stumped by what you meant by "auto-cast" but now I'm
beginning to think what you're looking for are cross-type comparison
operators? You could define a bunch of such operators which manually
call the cast on one side and add them to the operator class. I'm not
sure how to set up the operator class entries for GIN though.

--
greg

tgl@sss.pgh.pa.us

about 17 years ago

In reply to: Bruce Momjian (#12)

Re: pg_migrator and an 8.3-compatible tsvector data type

Greg Stark <stark@enterprisedb.com> writes:

On Thu, May 28, 2009 at 11:26 PM, Bruce Momjian <bruce@momjian.us> wrote:

Kind of --- I am stuck because we don't auto-cast for index usage like
we do for function calls and operators, so I will just keep going and
disable tsvector.

I was pretty stumped by what you meant by "auto-cast" but now I'm
beginning to think what you're looking for are cross-type comparison
operators? You could define a bunch of such operators which manually
call the cast on one side and add them to the operator class. I'm not
sure how to set up the operator class entries for GIN though.

No, what we need is a GIN (and probably GIST) opclass to support
old-style tsvectors. The user-level operators aren't really an issue
because an implicit cast will get inserted in front of them when
necessary, but we don't support that for indexes.

People who want decent performance on partial match queries (or maybe
even *any* TS queries) are going to have to convert to the new format
anyway, though, so maybe this isn't something worth putting a whole lot
of work into. It might be sufficient if the database conversion
succeeds but leaves the tsvector columns unusable until they're
converted.

regards, tom lane

alvherre@2ndquadrant.com

about 17 years ago

In reply to: Tom Lane (#13)

Re: pg_migrator and an 8.3-compatible tsvector data type

Tom Lane wrote:

People who want decent performance on partial match queries (or maybe
even *any* TS queries) are going to have to convert to the new format
anyway, though, so maybe this isn't something worth putting a whole lot
of work into. It might be sufficient if the database conversion
succeeds but leaves the tsvector columns unusable until they're
converted.

There are so many caveats on pg_migrator (and things that need to be
done after the migration is complete) that one starts to wonder if
people is not better off just using parallel pg_restore. From Stefan's
reported timings I'm not sure that pg_migrator is that much of a benefit
in the first place ... unless copy mode can be made much faster. (On
link mode it is so much faster that it's worth it, but then you don't
have an escape hatch).

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

bruce@momjian.us

about 17 years ago

In reply to: Alvaro Herrera (#14)

Re: pg_migrator and an 8.3-compatible tsvector data type

Alvaro Herrera wrote:

Tom Lane wrote:

People who want decent performance on partial match queries (or maybe
even *any* TS queries) are going to have to convert to the new format
anyway, though, so maybe this isn't something worth putting a whole lot
of work into. It might be sufficient if the database conversion
succeeds but leaves the tsvector columns unusable until they're
converted.

There are so many caveats on pg_migrator (and things that need to be
done after the migration is complete) that one starts to wonder if
people is not better off just using parallel pg_restore. From Stefan's
reported timings I'm not sure that pg_migrator is that much of a benefit
in the first place ... unless copy mode can be made much faster. (On
link mode it is so much faster that it's worth it, but then you don't
have an escape hatch).

That is accurate. I doubt copy mode speed can be improved. I think
doing a backup, then using --link mode will be the most common usage
pattern.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

bruce@momjian.us

about 17 years ago

In reply to: Bruce Momjian (#15)

Re: pg_migrator and an 8.3-compatible tsvector data type

On Fri, May 29, 2009 at 12:53 AM, Bruce Momjian <bruce@momjian.us> wrote:

That is accurate. I doubt copy mode speed can be improved. I think
doing a backup, then using --link mode will be the most common usage
pattern.

Well doing a backup will take more or less the same time as copy mode, no?

Actually there's a big win for people who can take a snapshot using
their SAN or filesystem, but that's not everyone.

--
greg

bruce@momjian.us

about 17 years ago

In reply to: Bruce Momjian (#16)

Re: pg_migrator and an 8.3-compatible tsvector data type

Greg Stark wrote:

On Fri, May 29, 2009 at 12:53 AM, Bruce Momjian <bruce@momjian.us> wrote:

That is accurate. ?I doubt copy mode speed can be improved. ?I think
doing a backup, then using --link mode will be the most common usage
pattern.

Well doing a backup will take more or less the same time as copy mode, no?

Actually there's a big win for people who can take a snapshot using
their SAN or filesystem, but that's not everyone.

Well, you are up during the backup.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

alvherre@2ndquadrant.com

about 17 years ago

In reply to: Bruce Momjian (#15)

Re: pg_migrator and an 8.3-compatible tsvector data type

Bruce Momjian wrote:

Alvaro Herrera wrote:

There are so many caveats on pg_migrator (and things that need to be
done after the migration is complete) that one starts to wonder if
people is not better off just using parallel pg_restore. From Stefan's
reported timings I'm not sure that pg_migrator is that much of a benefit
in the first place ... unless copy mode can be made much faster. (On
link mode it is so much faster that it's worth it, but then you don't
have an escape hatch).

That is accurate. I doubt copy mode speed can be improved.

Why not? Right now it's single-threaded. Would it be faster if it ran
several copies in parallel?

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

bruce@momjian.us

about 17 years ago

In reply to: Alvaro Herrera (#18)

Re: pg_migrator and an 8.3-compatible tsvector data type

Alvaro Herrera wrote:

Bruce Momjian wrote:

Alvaro Herrera wrote:

There are so many caveats on pg_migrator (and things that need to be
done after the migration is complete) that one starts to wonder if
people is not better off just using parallel pg_restore. From Stefan's
reported timings I'm not sure that pg_migrator is that much of a benefit
in the first place ... unless copy mode can be made much faster. (On
link mode it is so much faster that it's worth it, but then you don't
have an escape hatch).

That is accurate. I doubt copy mode speed can be improved.

Why not? Right now it's single-threaded. Would it be faster if it ran
several copies in parallel?

Sure, but that assumes you have parallel I/O channels; I assume right
now it is I/O limited.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

alvherre@2ndquadrant.com

about 17 years ago

In reply to: Bruce Momjian (#19)

Re: pg_migrator and an 8.3-compatible tsvector data type

Bruce Momjian wrote:

Alvaro Herrera wrote:

Why not? Right now it's single-threaded. Would it be faster if it ran
several copies in parallel?

Sure, but that assumes you have parallel I/O channels; I assume right
now it is I/O limited.

But so does parallel pg_restore, no?

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

josh@agliodbs.com

about 17 years ago

In reply to: Bruce Momjian (#19)

gsmith@gregsmith.com

about 17 years ago

In reply to: Alvaro Herrera (#14)

andrew@dunslane.net

about 17 years ago

In reply to: Alvaro Herrera (#20)

Stefan Kaltenbrunner

stefan@kaltenbrunner.cc

about 17 years ago

In reply to: Alvaro Herrera (#18)

Zdenek.Kotala@Sun.COM

about 17 years ago

In reply to: Bruce Momjian (#5)

tgl@sss.pgh.pa.us

about 17 years ago

In reply to: Alvaro Herrera (#20)

tgl@sss.pgh.pa.us

about 17 years ago

In reply to: Zdenek Kotala (#25)

Zdenek.Kotala@Sun.COM

about 17 years ago

In reply to: Tom Lane (#27)

bruce@momjian.us

about 17 years ago

In reply to: Tom Lane (#8)

josh@agliodbs.com

about 17 years ago

In reply to: Bruce Momjian (#29)

tgl@sss.pgh.pa.us

about 17 years ago

In reply to: Josh Berkus (#30)

bruce@momjian.us

about 17 years ago

In reply to: Tom Lane (#31)

bruce@momjian.us

about 17 years ago

In reply to: Bruce Momjian (#32)

bruce@momjian.us

about 17 years ago

In reply to: Bruce Momjian (#33)

Dimitri Fontaine

dimitri@2ndQuadrant.fr

about 17 years ago

In reply to: Bruce Momjian (#34)

tgl@sss.pgh.pa.us

almost 17 years ago

In reply to: Bruce Momjian (#33)

bruce@momjian.us

almost 17 years ago

In reply to: Dimitri Fontaine (#35)

bruce@momjian.us

almost 17 years ago

In reply to: Tom Lane (#36)

bruce@momjian.us

almost 17 years ago

In reply to: Bruce Momjian (#37)

bruce@momjian.us

almost 17 years ago

In reply to: Bruce Momjian (#38)

tgl@sss.pgh.pa.us

almost 17 years ago

In reply to: Bruce Momjian (#38)

bruce@momjian.us

almost 17 years ago

In reply to: Bruce Momjian (#38)

bruce@momjian.us

almost 17 years ago

In reply to: Tom Lane (#41)

bruce@momjian.us

almost 17 years ago

In reply to: Bruce Momjian (#34)

bruce@momjian.us

almost 17 years ago

In reply to: Bruce Momjian (#42)

bruce@momjian.us

almost 17 years ago

In reply to: Bruce Momjian (#34)

bruce@momjian.us

almost 17 years ago

In reply to: Bruce Momjian (#38)

bruce@momjian.us

almost 17 years ago

In reply to: Bruce Momjian (#47)

bruce@momjian.us

almost 17 years ago

In reply to: Bruce Momjian (#47)

robertmhaas@gmail.com

almost 17 years ago

In reply to: Bruce Momjian (#45)

bruce@momjian.us

almost 17 years ago

In reply to: Robert Haas (#50)

tgl@sss.pgh.pa.us

almost 17 years ago

In reply to: Robert Haas (#50)

robertmhaas@gmail.com

almost 17 years ago

In reply to: Tom Lane (#52)

bruce@momjian.us

almost 17 years ago

In reply to: Robert Haas (#53)

tgl@sss.pgh.pa.us

almost 17 years ago

In reply to: Bruce Momjian (#54)

robertmhaas@gmail.com

almost 17 years ago

In reply to: Bruce Momjian (#54)

robertmhaas@gmail.com

almost 17 years ago

In reply to: Tom Lane (#55)

bruce@momjian.us

almost 17 years ago

In reply to: Robert Haas (#57)

bruce@momjian.us

almost 17 years ago

In reply to: Bruce Momjian (#54)