pg_migrator and an 8.3-compatible tsvector data type

Started by Bruce Momjianalmost 17 years ago59 messageshackers
Jump to latest
#1Bruce Momjian
bruce@momjian.us

I found out at PGCon that the internal format of tsvector changed
slightly from 8.3 to 8.4. Teodor gave me a conversion function and I
have adjusted pg_migrator to install a v8_3_tsvector data type to be
used during the load so the old user tables use that data type. You can
see the code here at the bottom of this file:

http://cvs.pgfoundry.org/cgi-bin/cvsweb.cgi/pg-migrator/pg_migrator/src/pg_migrator.c?rev=1.36&content-type=text/x-cvsweb-markup

I am not done yet because eventually v8_3_tsvector will be moved into a
separate schema and the tsvector data type restored.

The problem I have is that while I have created v8_3_tsvector (with
casts), I have not created index routines for it, so I get this error
when restoring a database that uses a GIN index on tsvector:

Restoring database schema
psql:/u/postgres/pg_migrator_dump_db.sql:7006: ERROR: data type
tsvector has no default operator class for access method "gin"
HINT: You must specify an operator class for the index or define a
default operator class for the data type.

So, how do I create a compatible set of hooks for v8_3_tsvector?
Because tsvector is a native data type I can't figure out how to set up
a similar type. Also, will the indexes be OK if they are created from
types that don't have the 8.3 format? What about new indexes created
after the migration is done?

Ideally the cast would just be called during index activity and the 8.4
tsvector index routines used. I am told by Teodor that GiST indexes
would be fine with the 8.3 data format, and we are forcing the
reindexing of GIN indexes anyway in 8.4.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#2Bruce Momjian
bruce@momjian.us
In reply to: Bruce Momjian (#1)
Re: pg_migrator and an 8.3-compatible tsvector data type

Because no one has responed

---------------------------------------------------------------------------

Bruce Momjian wrote:

I found out at PGCon that the internal format of tsvector changed
slightly from 8.3 to 8.4. Teodor gave me a conversion function and I
have adjusted pg_migrator to install a v8_3_tsvector data type to be
used during the load so the old user tables use that data type. You can
see the code here at the bottom of this file:

http://cvs.pgfoundry.org/cgi-bin/cvsweb.cgi/pg-migrator/pg_migrator/src/pg_migrator.c?rev=1.36&amp;content-type=text/x-cvsweb-markup

I am not done yet because eventually v8_3_tsvector will be moved into a
separate schema and the tsvector data type restored.

The problem I have is that while I have created v8_3_tsvector (with
casts), I have not created index routines for it, so I get this error
when restoring a database that uses a GIN index on tsvector:

Restoring database schema
psql:/u/postgres/pg_migrator_dump_db.sql:7006: ERROR: data type
tsvector has no default operator class for access method "gin"
HINT: You must specify an operator class for the index or define a
default operator class for the data type.

So, how do I create a compatible set of hooks for v8_3_tsvector?
Because tsvector is a native data type I can't figure out how to set up
a similar type. Also, will the indexes be OK if they are created from
types that don't have the 8.3 format? What about new indexes created
after the migration is done?

Ideally the cast would just be called during index activity and the 8.4
tsvector index routines used. I am told by Teodor that GiST indexes
would be fine with the 8.3 data format, and we are forcing the
reindexing of GIN indexes anyway in 8.4.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#3Bruce Momjian
bruce@momjian.us
In reply to: Bruce Momjian (#1)
Re: pg_migrator and an 8.3-compatible tsvector data type

Because no one has responded, I am going to prevent pg_migrator from
working with a cluster that uses tsvector. I realize this limits
pg_migrator's usefulness, but I have to move on.

---------------------------------------------------------------------------

Bruce Momjian wrote:

I found out at PGCon that the internal format of tsvector changed
slightly from 8.3 to 8.4. Teodor gave me a conversion function and I
have adjusted pg_migrator to install a v8_3_tsvector data type to be
used during the load so the old user tables use that data type. You can
see the code here at the bottom of this file:

http://cvs.pgfoundry.org/cgi-bin/cvsweb.cgi/pg-migrator/pg_migrator/src/pg_migrator.c?rev=1.36&amp;content-type=text/x-cvsweb-markup

I am not done yet because eventually v8_3_tsvector will be moved into a
separate schema and the tsvector data type restored.

The problem I have is that while I have created v8_3_tsvector (with
casts), I have not created index routines for it, so I get this error
when restoring a database that uses a GIN index on tsvector:

Restoring database schema
psql:/u/postgres/pg_migrator_dump_db.sql:7006: ERROR: data type
tsvector has no default operator class for access method "gin"
HINT: You must specify an operator class for the index or define a
default operator class for the data type.

So, how do I create a compatible set of hooks for v8_3_tsvector?
Because tsvector is a native data type I can't figure out how to set up
a similar type. Also, will the indexes be OK if they are created from
types that don't have the 8.3 format? What about new indexes created
after the migration is done?

Ideally the cast would just be called during index activity and the 8.4
tsvector index routines used. I am told by Teodor that GiST indexes
would be fine with the 8.3 data format, and we are forcing the
reindexing of GIN indexes anyway in 8.4.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#4Josh Berkus
josh@agliodbs.com
In reply to: Bruce Momjian (#3)
Re: pg_migrator and an 8.3-compatible tsvector data type

On 5/28/09 2:30 PM, Bruce Momjian wrote:

Because no one has responded, I am going to prevent pg_migrator from
working with a cluster that uses tsvector. I realize this limits
pg_migrator's usefulness, but I have to move on.

I don't know how to fix the issue, certainly.

Why was there a format change to tsvector anyway? Was there an
important reason for this?

--
Josh Berkus
PostgreSQL Experts Inc.
www.pgexperts.com

#5Bruce Momjian
bruce@momjian.us
In reply to: Josh Berkus (#4)
Re: pg_migrator and an 8.3-compatible tsvector data type

Josh Berkus wrote:

On 5/28/09 2:30 PM, Bruce Momjian wrote:

Because no one has responded, I am going to prevent pg_migrator from
working with a cluster that uses tsvector. I realize this limits
pg_migrator's usefulness, but I have to move on.

I don't know how to fix the issue, certainly.

Why was there a format change to tsvector anyway? Was there an
important reason for this?

The ordering of the lexems was changed:

/*
* Convert 8.3 tsvector to 8.4
*
* 8.3 sorts lexemes by its length and if lengths are the same then it uses
* alphabetic order; 8.4 sorts lexemes in lexicographical order, e.g.
*
* => SELECT 'c bb aaa'::tsvector;
* tsvector
* ----------------
* 'aaa' 'bb' 'c' -- 8.4
* 'c' 'bb' 'aaa' -- 8.3
*/

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#6Josh Berkus
josh@agliodbs.com
In reply to: Bruce Momjian (#5)
Re: pg_migrator and an 8.3-compatible tsvector data type

Bruce,

The ordering of the lexems was changed:

/*
* Convert 8.3 tsvector to 8.4
*
* 8.3 sorts lexemes by its length and if lengths are the same then it uses
* alphabetic order; 8.4 sorts lexemes in lexicographical order, e.g.
*
* => SELECT 'c bb aaa'::tsvector;
* tsvector
* ----------------
* 'aaa' 'bb' 'c' -- 8.4
* 'c' 'bb' 'aaa' -- 8.3
*/

What does that get us in terms of performance etc.?

Given that it's going to prevent users of Tsearch from
upgrading-in-place, the format change ought to be giving us some serious
gains. Otherwise we should put it off until we need to make other
datatype changes.

--
Josh Berkus
PostgreSQL Experts Inc.
www.pgexperts.com

#7Bruce Momjian
bruce@momjian.us
In reply to: Josh Berkus (#6)
Re: pg_migrator and an 8.3-compatible tsvector data type

Josh Berkus wrote:

Bruce,

The ordering of the lexems was changed:

/*
* Convert 8.3 tsvector to 8.4
*
* 8.3 sorts lexemes by its length and if lengths are the same then it uses
* alphabetic order; 8.4 sorts lexemes in lexicographical order, e.g.
*
* => SELECT 'c bb aaa'::tsvector;
* tsvector
* ----------------
* 'aaa' 'bb' 'c' -- 8.4
* 'c' 'bb' 'aaa' -- 8.3
*/

What does that get us in terms of performance etc.?

Given that it's going to prevent users of Tsearch from
upgrading-in-place, the format change ought to be giving us some serious
gains. Otherwise we should put it off until we need to make other
datatype changes.

No idea but now that we are in beta we would then be invalidating beta
tester data. The area I got stuck on is that there is no CAST behavior
when creating an index. We are already invalidating GIN indexes, but I
can't even create an index to support the old data type.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Josh Berkus (#6)
Re: pg_migrator and an 8.3-compatible tsvector data type

Josh Berkus <josh@agliodbs.com> writes:

Bruce,

The ordering of the lexems was changed:

What does that get us in terms of performance etc.?

It was changed to support partial-match tsvector queries. Without it,
a partial match query would have to scan entire tsvectors instead
of applying binary search. I don't know if Oleg and Teodor did any
actual performance tests on the size of the hit, but it seems like
it could be pretty awful for large documents.

regards, tom lane

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#7)
Re: pg_migrator and an 8.3-compatible tsvector data type

Bruce Momjian <bruce@momjian.us> writes:

Josh Berkus wrote:

Given that it's going to prevent users of Tsearch from
upgrading-in-place, the format change ought to be giving us some serious
gains. Otherwise we should put it off until we need to make other
datatype changes.

No idea but now that we are in beta we would then be invalidating beta
tester data. The area I got stuck on is that there is no CAST behavior
when creating an index. We are already invalidating GIN indexes, but I
can't even create an index to support the old data type.

It's certainly doable. Bruce is just applying the strategy he mentioned
in our talk ;-)

regards, tom lane

#10Josh Berkus
josh@agliodbs.com
In reply to: Tom Lane (#8)
Re: pg_migrator and an 8.3-compatible tsvector data type

Tom,

It was changed to support partial-match tsvector queries. Without it,
a partial match query would have to scan entire tsvectors instead
of applying binary search. I don't know if Oleg and Teodor did any
actual performance tests on the size of the hit, but it seems like
it could be pretty awful for large documents.

Ah, ok, that's pretty significant. It makes a huge difference for some
users, one of whom will be quoted on the press release.

--
Josh Berkus
PostgreSQL Experts Inc.
www.pgexperts.com

#11Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#9)
Re: pg_migrator and an 8.3-compatible tsvector data type

Tom Lane wrote:

Bruce Momjian <bruce@momjian.us> writes:

Josh Berkus wrote:

Given that it's going to prevent users of Tsearch from
upgrading-in-place, the format change ought to be giving us some serious
gains. Otherwise we should put it off until we need to make other
datatype changes.

No idea but now that we are in beta we would then be invalidating beta
tester data. The area I got stuck on is that there is no CAST behavior
when creating an index. We are already invalidating GIN indexes, but I
can't even create an index to support the old data type.

It's certainly doable. Bruce is just applying the strategy he mentioned
in our talk ;-)

Kind of --- I am stuck because we don't auto-cast for index usage like
we do for function calls and operators, so I will just keep going and
disable tsvector.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#12Bruce Momjian
bruce@momjian.us
In reply to: Bruce Momjian (#11)
Re: pg_migrator and an 8.3-compatible tsvector data type

On Thu, May 28, 2009 at 11:26 PM, Bruce Momjian <bruce@momjian.us> wrote:

Kind of --- I am stuck because we don't auto-cast for index usage like
we do for function calls and operators, so I will just keep going and
disable tsvector.

I was pretty stumped by what you meant by "auto-cast" but now I'm
beginning to think what you're looking for are cross-type comparison
operators? You could define a bunch of such operators which manually
call the cast on one side and add them to the operator class. I'm not
sure how to set up the operator class entries for GIN though.

--
greg

#13Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#12)
Re: pg_migrator and an 8.3-compatible tsvector data type

Greg Stark <stark@enterprisedb.com> writes:

On Thu, May 28, 2009 at 11:26 PM, Bruce Momjian <bruce@momjian.us> wrote:

Kind of --- I am stuck because we don't auto-cast for index usage like
we do for function calls and operators, so I will just keep going and
disable tsvector.

I was pretty stumped by what you meant by "auto-cast" but now I'm
beginning to think what you're looking for are cross-type comparison
operators? You could define a bunch of such operators which manually
call the cast on one side and add them to the operator class. I'm not
sure how to set up the operator class entries for GIN though.

No, what we need is a GIN (and probably GIST) opclass to support
old-style tsvectors. The user-level operators aren't really an issue
because an implicit cast will get inserted in front of them when
necessary, but we don't support that for indexes.

People who want decent performance on partial match queries (or maybe
even *any* TS queries) are going to have to convert to the new format
anyway, though, so maybe this isn't something worth putting a whole lot
of work into. It might be sufficient if the database conversion
succeeds but leaves the tsvector columns unusable until they're
converted.

regards, tom lane

#14Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#13)
Re: pg_migrator and an 8.3-compatible tsvector data type

Tom Lane wrote:

People who want decent performance on partial match queries (or maybe
even *any* TS queries) are going to have to convert to the new format
anyway, though, so maybe this isn't something worth putting a whole lot
of work into. It might be sufficient if the database conversion
succeeds but leaves the tsvector columns unusable until they're
converted.

There are so many caveats on pg_migrator (and things that need to be
done after the migration is complete) that one starts to wonder if
people is not better off just using parallel pg_restore. From Stefan's
reported timings I'm not sure that pg_migrator is that much of a benefit
in the first place ... unless copy mode can be made much faster. (On
link mode it is so much faster that it's worth it, but then you don't
have an escape hatch).

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

#15Bruce Momjian
bruce@momjian.us
In reply to: Alvaro Herrera (#14)
Re: pg_migrator and an 8.3-compatible tsvector data type

Alvaro Herrera wrote:

Tom Lane wrote:

People who want decent performance on partial match queries (or maybe
even *any* TS queries) are going to have to convert to the new format
anyway, though, so maybe this isn't something worth putting a whole lot
of work into. It might be sufficient if the database conversion
succeeds but leaves the tsvector columns unusable until they're
converted.

There are so many caveats on pg_migrator (and things that need to be
done after the migration is complete) that one starts to wonder if
people is not better off just using parallel pg_restore. From Stefan's
reported timings I'm not sure that pg_migrator is that much of a benefit
in the first place ... unless copy mode can be made much faster. (On
link mode it is so much faster that it's worth it, but then you don't
have an escape hatch).

That is accurate. I doubt copy mode speed can be improved. I think
doing a backup, then using --link mode will be the most common usage
pattern.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#16Bruce Momjian
bruce@momjian.us
In reply to: Bruce Momjian (#15)
Re: pg_migrator and an 8.3-compatible tsvector data type

On Fri, May 29, 2009 at 12:53 AM, Bruce Momjian <bruce@momjian.us> wrote:

That is accurate.  I doubt copy mode speed can be improved.  I think
doing a backup, then using --link mode will be the most common usage
pattern.

Well doing a backup will take more or less the same time as copy mode, no?

Actually there's a big win for people who can take a snapshot using
their SAN or filesystem, but that's not everyone.

--
greg

#17Bruce Momjian
bruce@momjian.us
In reply to: Bruce Momjian (#16)
Re: pg_migrator and an 8.3-compatible tsvector data type

Greg Stark wrote:

On Fri, May 29, 2009 at 12:53 AM, Bruce Momjian <bruce@momjian.us> wrote:

That is accurate. ?I doubt copy mode speed can be improved. ?I think
doing a backup, then using --link mode will be the most common usage
pattern.

Well doing a backup will take more or less the same time as copy mode, no?

Actually there's a big win for people who can take a snapshot using
their SAN or filesystem, but that's not everyone.

Well, you are up during the backup.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#18Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Bruce Momjian (#15)
Re: pg_migrator and an 8.3-compatible tsvector data type

Bruce Momjian wrote:

Alvaro Herrera wrote:

There are so many caveats on pg_migrator (and things that need to be
done after the migration is complete) that one starts to wonder if
people is not better off just using parallel pg_restore. From Stefan's
reported timings I'm not sure that pg_migrator is that much of a benefit
in the first place ... unless copy mode can be made much faster. (On
link mode it is so much faster that it's worth it, but then you don't
have an escape hatch).

That is accurate. I doubt copy mode speed can be improved.

Why not? Right now it's single-threaded. Would it be faster if it ran
several copies in parallel?

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#19Bruce Momjian
bruce@momjian.us
In reply to: Alvaro Herrera (#18)
Re: pg_migrator and an 8.3-compatible tsvector data type

Alvaro Herrera wrote:

Bruce Momjian wrote:

Alvaro Herrera wrote:

There are so many caveats on pg_migrator (and things that need to be
done after the migration is complete) that one starts to wonder if
people is not better off just using parallel pg_restore. From Stefan's
reported timings I'm not sure that pg_migrator is that much of a benefit
in the first place ... unless copy mode can be made much faster. (On
link mode it is so much faster that it's worth it, but then you don't
have an escape hatch).

That is accurate. I doubt copy mode speed can be improved.

Why not? Right now it's single-threaded. Would it be faster if it ran
several copies in parallel?

Sure, but that assumes you have parallel I/O channels; I assume right
now it is I/O limited.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#20Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Bruce Momjian (#19)
Re: pg_migrator and an 8.3-compatible tsvector data type

Bruce Momjian wrote:

Alvaro Herrera wrote:

Why not? Right now it's single-threaded. Would it be faster if it ran
several copies in parallel?

Sure, but that assumes you have parallel I/O channels; I assume right
now it is I/O limited.

But so does parallel pg_restore, no?

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#21Josh Berkus
josh@agliodbs.com
In reply to: Bruce Momjian (#19)
#22Greg Smith
gsmith@gregsmith.com
In reply to: Alvaro Herrera (#14)
#23Andrew Dunstan
andrew@dunslane.net
In reply to: Alvaro Herrera (#20)
#24Stefan Kaltenbrunner
stefan@kaltenbrunner.cc
In reply to: Alvaro Herrera (#18)
#25Zdenek Kotala
Zdenek.Kotala@Sun.COM
In reply to: Bruce Momjian (#5)
#26Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#20)
#27Tom Lane
tgl@sss.pgh.pa.us
In reply to: Zdenek Kotala (#25)
#28Zdenek Kotala
Zdenek.Kotala@Sun.COM
In reply to: Tom Lane (#27)
#29Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#8)
#30Josh Berkus
josh@agliodbs.com
In reply to: Bruce Momjian (#29)
#31Tom Lane
tgl@sss.pgh.pa.us
In reply to: Josh Berkus (#30)
#32Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#31)
#33Bruce Momjian
bruce@momjian.us
In reply to: Bruce Momjian (#32)
#34Bruce Momjian
bruce@momjian.us
In reply to: Bruce Momjian (#33)
#35Dimitri Fontaine
dimitri@2ndQuadrant.fr
In reply to: Bruce Momjian (#34)
#36Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#33)
#37Bruce Momjian
bruce@momjian.us
In reply to: Dimitri Fontaine (#35)
#38Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#36)
#39Bruce Momjian
bruce@momjian.us
In reply to: Bruce Momjian (#37)
#40Bruce Momjian
bruce@momjian.us
In reply to: Bruce Momjian (#38)
#41Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#38)
#42Bruce Momjian
bruce@momjian.us
In reply to: Bruce Momjian (#38)
#43Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#41)
#44Bruce Momjian
bruce@momjian.us
In reply to: Bruce Momjian (#34)
#45Bruce Momjian
bruce@momjian.us
In reply to: Bruce Momjian (#42)
#46Bruce Momjian
bruce@momjian.us
In reply to: Bruce Momjian (#34)
#47Bruce Momjian
bruce@momjian.us
In reply to: Bruce Momjian (#38)
#48Bruce Momjian
bruce@momjian.us
In reply to: Bruce Momjian (#47)
#49Bruce Momjian
bruce@momjian.us
In reply to: Bruce Momjian (#47)
#50Robert Haas
robertmhaas@gmail.com
In reply to: Bruce Momjian (#45)
#51Bruce Momjian
bruce@momjian.us
In reply to: Robert Haas (#50)
#52Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#50)
#53Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#52)
#54Bruce Momjian
bruce@momjian.us
In reply to: Robert Haas (#53)
#55Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#54)
#56Robert Haas
robertmhaas@gmail.com
In reply to: Bruce Momjian (#54)
#57Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#55)
#58Bruce Momjian
bruce@momjian.us
In reply to: Robert Haas (#57)
#59Bruce Momjian
bruce@momjian.us
In reply to: Bruce Momjian (#54)