[Fwd: pg_migrator: in-place upgrade tool at pgFoundry]

Started by Korry Douglas about 19 years ago, 7 messages
#1 Korry Douglas
korryd@enterprisedb.com

EnterpriseDB has created a new project at pgFoundry -
http://pgfoundry.org/projects/pg-migrator/

pg_migrator is a tool that can upgrade existing data in place, without
the usual dump/reload cycle.

The pg_migrator project site (at pgFoundry) contains a complete
implementation of the functionality described below as well as a copy of
the introductory document that I've included in this message.

We would welcome feedback on implementation details and ideas for
improvements.

-- Korry

--
Korry Douglas korryd@enterprisedb.com
EnterpriseDB http://www.enterprisedb.com

=====================================================================================================

------------------------------------------------------------------------------
PG_MIGRATOR: IN-PLACE UPGRADES FOR POSTGRESQL
------------------------------------------------------------------------------

Upgrading a PostgreSQL database from one release to another can be an
expensive process. For minor upgrades, you can simply install new executables
and forget about upgrading existing data. But for major upgrades, you have to
export all of your data (using pg_dump), install the new release, run initdb
to create a new cluster, and then import your old data. If you have a lot of
data, that can take a considerable amount of time (hours? days?). If you have
too much data, you may have to buy more storage since you need enough room to
hold the original data plus the exported data.

EnterpriseDB is contributing a new tool, pg_migrator, that can reduce the
amount of time (and disk space) required for many upgrades.

------------------------------------------------------------------------------
WHAT IT DOES
------------------------------------------------------------------------------

pg_migrator is a tool (not a complete solution) that performs an in-place
upgrade of existing data. For many upgrades, the data stored in user-defined
tables does not have to change when moving from one version of PostgreSQL to
another. Some upgrades require changes in the on-disk representation of data;
pg_migrator cannot help in those upgrades. However, many upgrades require no
changes to the on-disk representation of a user-defined table and, in those
cases, pg_migrator will move existing user-defined tables from the old
database cluster into the new cluster.

There are two factors that determine whether an in-place upgrade is practical.

First, every table in a cluster (actually, every table created by a given version)
shares the same infrastructure layout. By infrastructure, we mean the on-disk
representation of the table headers and trailers and the on-disk
representation of tuple headers. If the infrastructure changes between the old
version of PostgreSQL and the new version, pg_migrator cannot move existing
tables to the new cluster (you'll have to pg_dump the old data and then import
that data into the new cluster).

Second, a PostgreSQL release occasionally introduces a change to the on-disk
representation of a data type. For example, PostgreSQL version 8.2 changed the
layout for values of type INET and CIDR. If you are not storing any values of
type INET (or CIDR), pg_migrator can upgrade any table in your cluster. If you
are storing values of type INET (or CIDR) in some tables, you must
export/import those tables, but pg_migrator can in-place upgrade other tables
(a change in infrastructure means that you have to export/import every table).

If a new version of PostgreSQL does not change the infrastructure layout and
does not change the on-disk representation of a data type (that you are
using), you can use pg_migrator to save a tremendous amount of time (and disk
space).

------------------------------------------------------------------------------
HOW IT WORKS
------------------------------------------------------------------------------

To use pg_migrator during an upgrade, you start by installing a fresh cluster
(using the newest version) in a new directory. When you've finished installing
the new version, the new cluster will contain the new executables (postmaster,
pg_dump, ...) and the usual template0, template1, and postgres databases,
but no user-defined tables. At this point, you can shut down the new postmaster
(we presume that you shut down the old postmaster prior to creating the new
cluster) and invoke pg_migrator.
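
In practice, the preparation described above amounts to something like the
following (a minimal sketch; directories are placeholders and your install
scripts may already perform some of these steps):

#include <stdio.h>
#include <stdlib.h>

/*
 * Initialize the new cluster and make sure no postmaster is left running
 * on it before pg_migrator is invoked.  The old postmaster is assumed to
 * have been stopped already, as described above.
 */
static void
prepare_new_cluster(const char *new_bindir, const char *new_pgdata)
{
    char cmd[2048];

    /* create the fresh cluster with the new executables */
    snprintf(cmd, sizeof(cmd), "\"%s/initdb\" -D \"%s\"",
             new_bindir, new_pgdata);
    system(cmd);

    /* if the install scripts started a postmaster on the new cluster, stop it */
    snprintf(cmd, sizeof(cmd), "\"%s/pg_ctl\" -D \"%s\" stop",
             new_bindir, new_pgdata);
    system(cmd);
}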

When pg_migrator starts, it runs through a verification process that ensures
that all required executables (the old postmaster, the new postmaster,
pg_dump, pg_resetxlog, ...) are present and contain the expected version
numbers. The verification process also checks the old and new $PGDATA
directories to ensure that the expected files and subdirectories (base,
global, pg_clog, pg_xlog, ...) are in place. If the verification process
succeeds, pg_migrator starts the old postmaster and runs pg_dumpall
--schema-only to capture the metadata contained in the old cluster. The script
produced by pg_dumpall will be used in a later step to recreate the following
user-defined objects in the new cluster:

databases
roles (users)
privileges (grants)
schemas
tables
indexes
views
packages (EnterpriseDB clusters only)
synonyms (EnterpriseDB clusters only)
functions
procedures (EnterpriseDB clusters only)
data types
procedural languages
aggregate functions
operators
operator classes
encoding conversions
inheritance relationships
rewrite rules
type casts
constraints
triggers
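
The capture step itself is little more than a schema-only pg_dumpall run
against the old cluster, redirected to a script file. A minimal sketch
(the directory, port, and output file name are placeholders, not
pg_migrator's actual arguments):

#include <stdio.h>
#include <stdlib.h>

/* dump only the schema of every database in the (running) old cluster */
static int
dump_old_schema(const char *old_bindir, int old_port, const char *script_path)
{
    char cmd[2048];

    snprintf(cmd, sizeof(cmd),
             "\"%s/pg_dumpall\" -p %d --schema-only > \"%s\"",
             old_bindir, old_port, script_path);
    return system(cmd);
}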

Note that the script produced by pg_dumpall will only recreate user-defined
objects, not system-defined objects (most object types are considered
“system-defined” if they reside in a schema whose name begins with “pg_”, if
they reside in the information_schema or, in the case of an EnterpriseDB
cluster, reside in the sys or dbo schemas). The new cluster will contain the
system-defined objects created by the latest version of PostgreSQL.

Once pg_migrator has extracted the metadata from the old cluster, it performs
a number of bookkeeping tasks required to “sync up” the new cluster with the
existing data.

First, pg_migrator temporarily renames any tablespace directories (in the old
cluster) – the new cluster will point to the same tablespace directories and
will complain (ERROR: directory not empty) if those directories exist when
pg_migrator imports the metadata (in a later step).
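
A sketch of that temporary rename (the suffix is an arbitrary choice for
illustration, not necessarily what pg_migrator uses):

#include <stdio.h>

/* move an old tablespace directory out of the way; undone again later */
static int
park_tablespace(const char *tablespace_path)
{
    char parked[1024];

    snprintf(parked, sizeof(parked), "%s.migrator_old", tablespace_path);
    return rename(tablespace_path, parked);
}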

Next, pg_migrator copies the “next transaction ID” from the old cluster to the
new cluster. This is the first step in ensuring that the proper tuples are
visible (and other tuples are hidden) from the new cluster (remember,
pg_migrator does not export/import the content of user-defined tables, so the
transaction IDs in the new cluster must match the transaction IDs in the old
data). pg_migrator also copies the write-ahead log starting address from
the old cluster to the new cluster.
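
For illustration, the transaction counter can be carried over with
pg_resetxlog's -x option; the sketch below assumes the old value has already
been read (for example from pg_controldata output) and is not necessarily how
pg_migrator itself does it. The WAL starting address can be set in a similar
way with the -l option, whose argument format varies by version.

#include <stdio.h>
#include <stdlib.h>

/* stamp the new (stopped) cluster with the old cluster's next transaction ID */
static int
set_next_xid(const char *new_pgdata, unsigned int old_next_xid)
{
    char cmd[2048];

    snprintf(cmd, sizeof(cmd), "pg_resetxlog -x %u \"%s\"",
             old_next_xid, new_pgdata);
    return system(cmd);
}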

pg_migrator now shuts down the old postmaster and copies the commit logs
($PGDATA/pg_clog/*) from the old cluster to the new cluster. At this point,
the new cluster and the old cluster agree on the visibility of tuples in
user-defined tables.
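
A sketch of the commit-log copy, with error handling trimmed and the
directory layout assumed:

#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>

/* copy every pg_clog segment from the old cluster into the new one */
static void
copy_clog(const char *old_pgdata, const char *new_pgdata)
{
    char src_dir[1024], cmd[4096];
    DIR *dir;
    struct dirent *de;

    snprintf(src_dir, sizeof(src_dir), "%s/pg_clog", old_pgdata);

    dir = opendir(src_dir);
    if (dir == NULL)
        return;

    while ((de = readdir(dir)) != NULL)
    {
        if (de->d_name[0] == '.')
            continue;               /* skip "." and ".." */

        /* relying on cp(1) purely to keep the sketch short */
        snprintf(cmd, sizeof(cmd), "cp \"%s/%s\" \"%s/pg_clog/%s\"",
                 src_dir, de->d_name, new_pgdata, de->d_name);
        system(cmd);
    }
    closedir(dir);
}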

Now pg_migrator begins reconstructing the metadata (obtained from the old
cluster), running createdb for each database defined in the old cluster.

Once all of the databases have been created in the new cluster, pg_migrator
tackles the problem of naming toast relations. Toast tables are used to store
oversized data out-of-line (that is, in a separate file). When the server
decides to move a datum out of a tuple and into a toast table, it stores a
pointer in the original slot in the tuple. That pointer contains the
relfilenode (i.e. filename) of the toast table. That means that any table
which contains toasted data will contain the filename of the toast table (in
each toast pointer). Therefore, it is very important that toast tables retain
their old names when they are created in the new cluster. At the moment, the
CREATE TABLE command does not offer any explicit support for naming toast
tables (and it probably never will offer such a feature). To ensure that
toast tables retain their old names, pg_migrator “reserves” the name of
each toast table before importing the metadata from the old cluster. To
reserve a filename, pg_migrator simply creates an empty file (with the
appropriate name) and the server avoids that name when it detects a
collision. In a later step, pg_migrator removes the “fake” toast tables and
patches the proper reltoastrelids back into the pg_class table so the new
cluster can find the old toast tables.
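
A sketch of the reservation trick, assuming the standard
$PGDATA/base/<database-oid>/<relfilenode> layout (per-tablespace paths are
omitted; this is an illustration, not pg_migrator's actual code):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Create an empty file under the old toast table's relfilenode so the
 * server, which avoids names that already exist on disk, cannot hand that
 * relfilenode to some other relation while the schema script runs.
 */
static int
reserve_relfilenode(const char *new_pgdata, unsigned int db_oid,
                    unsigned int toast_relfilenode)
{
    char path[1024];
    int  fd;

    snprintf(path, sizeof(path), "%s/base/%u/%u",
             new_pgdata, db_oid, toast_relfilenode);

    /* O_EXCL: fail rather than clobber a file that already exists */
    fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0600);
    if (fd < 0)
        return -1;

    close(fd);
    return 0;
}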

Next, pg_migrator executes the script produced earlier by pg_dumpall – this
script effectively copies the complete user-defined metadata from the old
cluster to the new cluster. When that script completes, pg_migrator (after
shutting down the new postmaster) deletes the fake toast tables and patches
the old toast table names into the new cluster (pg_migrator also removes and
recreates the corresponding pg_class and pg_type entries).

Finally, pg_migrator links (or copies, depending on the command line supplied
by the user) each user-defined table (including data tables, index tables, and
toast tables) from the old cluster to the new cluster. This is the time-saver;
instead of exporting and importing every tuple, pg_migrator simply renames
each user-defined table (or, at worst, copies the entire table if the clusters
reside on different filesystems). Note that in this last step, pg_migrator
assigns a new name to each relation (the name of the relation matches the
relfilenode in the new cluster).
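
The following sketch shows the idea behind the link-or-copy step (paths are
placeholders and this is not pg_migrator's actual code): try to hard-link the
old relation file into the new cluster under its new relfilenode, and fall
back to a byte-for-byte copy when the clusters live on different filesystems.

#include <errno.h>
#include <stdio.h>
#include <unistd.h>

static int
transfer_relfile(const char *old_path, const char *new_path)
{
    FILE   *in, *out;
    char    buf[8192];
    size_t  n;

    if (link(old_path, new_path) == 0)
        return 0;                   /* same filesystem: effectively instantaneous */

    if (errno != EXDEV)
        return -1;                  /* a real error, not just a cross-device link */

    /* different filesystems: copy the file block by block */
    in = fopen(old_path, "rb");
    out = fopen(new_path, "wb");
    if (in == NULL || out == NULL)
    {
        if (in)
            fclose(in);
        if (out)
            fclose(out);
        return -1;
    }

    while ((n = fread(buf, 1, sizeof(buf), in)) > 0)
        fwrite(buf, 1, n, out);

    fclose(in);
    fclose(out);
    return 0;
}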

An important feature of the pg_migrator design is that it leaves the original
cluster intact – if a problem occurs during the upgrade, you can still run the
previous version.

------------------------------------------------------------------------------
WHERE IT FITS
------------------------------------------------------------------------------

We see pg_migrator as a component within a larger upgrade toolkit.

One of the consequences of this perspective is that pg_migrator relies on the
user to determine whether an upgrade-in-place is possible. pg_migrator does
not check the CATALOG_VERSION so it will happily perform an in-place upgrade
even though it would be sufficient to simply replace the PostgreSQL
executables. pg_migrator cannot detect changes in relation-file headers or
trailers or in tuple headers. pg_migrator cannot detect changes in the on-disk
representation of a data type. Instead, pg_migrator should be invoked by a
tool (or user) that has already determined that an in-place upgrade is
practical.

Since pg_migrator is expected to function as part of a larger toolkit, it is
designed to produce status messages (and error messages) that can be easily
processed by other tools.

------------------------------------------------------------------------------
IMPLEMENTATION NOTES
------------------------------------------------------------------------------
pg_migrator is written in C. To reduce source-code dependencies, pg_migrator
makes heavy use of existing tools such as pg_ctl, pg_dumpall, and pg_resetxlog.

pg_migrator can upgrade a cluster from one PostgreSQL release to another
PostgreSQL release, from one EnterpriseDB release to another, or from a
PostgreSQL cluster to an EnterpriseDB cluster.

#2 Kai-Uwe Sattler
kus@tu-ilmenau.de
In reply to: Noname (#1)
Re: [Fwd: Index Advisor]

Hi Gurjeet,
I include pgsql-hackers in this discussion ...

On 01.11.2006 at 17:38, Gurjeet Singh wrote:

Hi Kai,

I am working with Simon at EnterpriseDB, and am currently working on porting
your patch to 8.2 sources. I have done a quick hack to make it work on 8.2;
please find the modified patch attached (remember it's a quick-n-dirty hack;
it still needs a cleanup).

The only changes (as yet) that I have done are:
(1) Changed the code according to the change in the linked-list (List*)
handling across 7.4 and 8.2.
(2) Added support for BitmapAnd, BitmapOr, BitmapHeapScan and
BitmapIndexScan plan-nodes in scan_plan().

The outstanding issues, as of now, as I see them, are:
(1) There are left-over dependencies in pg_depend that stop the
table-being-analyzed from getting dropped (it probably affects many
other DDLs too; I haven't tested, though). I am investigating this.

(2) The intermediate indexes that are created are 'real' indexes, in the
sense that they are actually created on the disk before planning/analyzing
and are then dropped. I assume this since you are calling index_create(),
which in turn calls index_build(), which in turn calls the index's access
method's build function, which, I assume, will create the index on disk.
Please point out if this is not the case.

You are right - at least an empty index is created. I'm not sure if
the index appears on disk, but this was the easiest way in 7.4 to do
this. However, I agree - it has performance drawbacks and raises
concurrency issues.

This real-index build can be counterproductive where the underlying table
is huge. We were thinking of creating actual 'virtual' indexes by utilizing
the 'skip_build' parameter of index_create() (introduced in index.c:1.229).
But that would entail calculating the number of pages, average record size,
and probably some more stats, on our own, and then putting these values in
the catalog.

Actually, we did exactly this - if you have only an empty index then you
have to estimate these values.
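
For illustration, such an estimate boils down to something like the sketch
below; the constants are assumptions chosen for the example, not the values
used in the patch:

#include <math.h>

/* rough number of 8 kB pages a b-tree over the given columns would occupy */
static int
estimate_index_pages(double heap_tuples, double avg_key_width)
{
    const double block_size  = 8192.0;   /* BLCKSZ on a default build */
    const double entry_ohead = 12.0;     /* assumed per-entry overhead */
    const double fill        = 0.70;     /* assumed average leaf-page fill */
    double entries_per_page;

    entries_per_page = floor((block_size * fill) / (avg_key_width + entry_ohead));
    if (entries_per_page < 1)
        entries_per_page = 1;

    return (int) ceil(heap_tuples / entries_per_page);
}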

I see that you have declared the following two functions in
src/include/catalog/index.h:

extern int2 get_attrType(Oid attrTypId, int2 attrSize, int4 attrTypMod);

extern int4 estimate_indexPages(Relation heapRelation, IndexInfo *indexInfo);

Looking at these, I suppose that you also worked on some such calculations.
If that code is still with you, can you share it with us?

It should be part of the patch - but let me check this.

(3) (If you've lost track, this is the third in the list of outstanding
issues :). I am concerned about the visibility of the virtual indexes. If
these indexes are immediately visible to other sessions, then there is a
strong possibility that other backends that are in the planning stage of a
non-EXPLAIN query will pick up these indexes, develop their plan, and send
it for execution. And you know what hell will break loose if that happens;
there won't be any data in these indexes!!

Right - that's what I meant above by concurrency issues. Honestly, we
did not have enough knowledge about pgsql at the time to do this. I
suppose the right way would be to add a session or transaction id to
the virtual index and let the planner use only virtual indexes from
the same session as the query.

One more thing: we are looking at ways to make it easier for others, too,
to develop their own advisors. So we are looking at the possibility of
making it a plugin-based architecture, similar to how the edb-debugger for
pl/pgsql is being developed. This will allow others to develop and use
their own advisors for various SELECT/DML statements, in the least
invasive way.

That's a great idea - it could be helpful for other kinds of virtual
objects too, e.g. materialized views, partitions etc. There are several
extension points for plugins: the set of indexes which should be created
virtually, the profit assignment to the individual indexes, as well as the
way the recommendation is used. For example, we have a prototype of an
online advisor which collects the recommendations continuously and tries
to adapt the current set of real indexes (at least as an alerter for the
DBA).

Lastly, and most importantly, can we move this discussion to
pgsql-hackers?

Done.

So, let me know if there is anything that I can do.

Best regards,
--
gurjeet@EnterpriseDB.com
singh.gurjeet@{ gmail | hotmail | yahoo }.com

Best,
Kai

#3 Kai-Uwe Sattler
kus@tu-ilmenau.de
In reply to: Kai-Uwe Sattler (#2)
Re: [Fwd: Index Advisor]

Hi Gurjeet,
I will look at the pg_advise bug and will send a patch ASAP.
Best,
Kai

On 15.11.2006 at 15:34, Gurjeet Singh wrote:

BUGS:
=====
.) The SELECTs in the pg_advise are returning wrong results when the same
   index is suggested twice, because of the SUM() aggregates.
.) I doubt that, on a table t(a,b), for a suggestion of idx(b,a), pg_advise
   will suggest idx(a,b).

Wish-list:
==========
.) Make pg_indexadvisor a user table.
   Reason: a normal user cannot do "delete from pg_indexadvisor".
   Difficulty: Need to know how to do
       "insert into pg_indexadvisor values( 1, ...)"
   from within the backend; that is, need to study/invent RSI
   (Recursive SQL Interface).
   Trial code can be seen by searching for:
       exec_simple_query( "insert into index_advisor values ( 10 )",
                          "advisor" /*portal name*/ );

.) Make it plugin-based.
   Reason: so that someone else with a better idea can replace this
   advisor, without having to recompile the server.
   Difficulty: This code calls many internal functions: index_create(),
   index_drop(), planner(), etc. That makes it impossible to compile it
   standalone.

.) Remove the dependency on the global "index_candidates", used for
   communication between indexadvisor.c and plancat.c.
   Reason: Bad coding practice.
   Difficulty: Even though I was successful in updating pg_class.relpages
   for the virtual indexes, the planner is still calling smgr.c code to
   get the number of pages occupied by the index! Hence, I had to use
   the global the way I did.

Best regards,

--
gurjeet[.singh]@EnterpriseDB.com
singh.gurjeet @{ gmail | hotmail | yahoo }.com
<patch_and_other_files.tar.gz>

#4 Kai-Uwe Sattler
kus@tu-ilmenau.de
In reply to: Kai-Uwe Sattler (#2)
Re: [Fwd: Index Advisor]

Hi,
On 15.11.2006 at 15:34, Gurjeet Singh wrote:

.) The SELECTs in the pg_advise are returning wrong results when the same
index is suggested twice, because of the SUM() aggregates.

I don't think that this is a bug. If the same index is recommended
for two different queries it will appear two times in
pg_indexadvisor. So, if you want to calculate the overall benefit of
this index, then you have to sum up the local benefits for each query.

.) I doubt that, on a table t(a,b), for a suggestion of idx(b,a), pg_advise
will suggest idx(a,b).

?? Not sure if I understand you right. idx(b,a) and idx(a,b) are
completely different indexes. Why should pg_advise suggest idx(a,b)?
But there is another bug: if there are recommendations like idx(a,b,c),
idx(a,b) and idx(a), it would be a good idea to create just idx(a). I
will add this to pg_advise as an optional feature.

Best,
Kai

#5 Gurjeet Singh
singh.gurjeet@gmail.com
In reply to: Kai-Uwe Sattler (#4)
Re: [Fwd: Index Advisor]

On 11/19/06, Kai-Uwe Sattler <kus@tu-ilmenau.de> wrote:

Hi,

.) The SELECTs in the pg_advise are returning wrong results, when
the same index is suggested twice, because of the SUM() aggregates.

I don't think that this is a bug. If the same index is recommended
for two different queries it will appear two times in
pg_indexadvisor. So, if you want to calculate the overall benefit of
this index, then you have to sum up the local benefits for each query.

If this is the intended behaviour, then it's okay.

.) I doubt that, on a table t(a,b), for a suggestion of idx(b,a), pg_advise
will suggest idx(a,b).

?? Not sure, if I understand you right. idx(b,a) and idx(a,b) are
completely different indexes. Why should pg_advise suggest idx(a,b).

I am referring to the way get_column_names() is coded. First, the SQL for
the portal does not guarantee any order of the result; secondly, the 'for'
loops that follow will always output the columns in increasing order of
attribute number. Here's a small way to reproduce the bug that I cooked up
just now:

Change the SQL in read_advisor_output() to:

res = PQexec(conn, "DECLARE myportal CURSOR FOR "
                   "SELECT relname,"
                   "int2vector_to_string(index_attrs) AS colids,"
                   "MAX(index_pages) AS size_in_pages,"
                   "SUM(profit) AS benefit,"
                   "SUM(profit)/MAX(index_pages) AS gain "
                   "FROM pg_indexadvisor,"
                   "pg_class "
                   "WHERE backend_pid = 0 "
                   "AND rel_oid = pg_class.oid "
                   "GROUP BY relname, colids "
                   "ORDER BY gain DESC");

Notice the backend_pid = 0. Now insert the following into pg_indexadvisor:

insert into pg_indexadvisor values( 1259, '2 1', 2, 1000, 20,0,0 );

This should prompt the advisor to generate the statement:

create index advidx_1 on pg_class (relnamespace,relname);

But instead, it will output this:

create index advidx_1 on pg_class (relname,relnamespace);

Now run the advisor with any workload, and inspect the output.

$ pg_advise.exe -d postgres -h localhost -p 5432 -U gsk -o create_index.sql workload.sql

We should tokenize the list of attribute numbers (column_ids variable) in
get_column_names() and query them individually.
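
Something along these lines should do it (an untested sketch; the function
name and signature are illustrative, not the actual get_column_names() code):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <libpq-fe.h>

/*
 * Walk the int2vector string in its stored order (e.g. "2 1") and look up
 * each attribute individually, so that idx(b,a) keeps its column order.
 */
static void
print_column_names(PGconn *conn, const char *rel_oid, const char *column_ids)
{
    char  sql[256];
    char *ids = strdup(column_ids);
    char *tok = strtok(ids, " ");
    int   first = 1;

    while (tok != NULL)
    {
        snprintf(sql, sizeof(sql),
                 "SELECT attname FROM pg_attribute "
                 "WHERE attrelid = %s AND attnum = %s",
                 rel_oid, tok);

        PGresult *res = PQexec(conn, sql);
        if (PQresultStatus(res) == PGRES_TUPLES_OK && PQntuples(res) == 1)
        {
            printf("%s%s", first ? "" : ",", PQgetvalue(res, 0, 0));
            first = 0;
        }
        PQclear(res);
        tok = strtok(NULL, " ");
    }
    free(ids);
}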

But there is another bug: if there are recommendations like idx(a,b,c),
idx(a,b) and idx(a), it would be a good idea to create just idx(a). I will
add this to pg_advise as an optional feature.

I'd say it's a new feature request and not a bug :) But I don't understand
why you would want to not build idx(a,b,c) in such a situation. idx(a,b,c)
can be useful in places where idx(a,b) or idx(a) is required, but the same
can't be said about idx(a) or idx(a,b) being useful where idx(a,b,c) is
required!

Best regards,

--
gurjeet[.singh]@EnterpriseDB.com
singh.gurjeet@{ gmail | hotmail | yahoo }.com

#6 Gurjeet Singh
singh.gurjeet@gmail.com
In reply to: Gurjeet Singh (#5)
Re: [Fwd: Index Advisor]

On 11/20/06, Gurjeet Singh <singh.gurjeet@gmail.com> wrote:

We should tokenize the list of attribute numbers (column_ids variable) in
get_column_names() and query them individually.

Done and bug resolved.

Regards,
--
gurjeet[.singh]@EnterpriseDB.com
singh.gurjeet@{ gmail | hotmail | yahoo }.com

#7 Kai-Uwe Sattler
kus@tu-ilmenau.de
In reply to: Gurjeet Singh (#5)
Re: [Fwd: Index Advisor]

On 20.11.2006 at 06:35, Gurjeet Singh wrote:

But there is another bug: if there are recommendations like idx(a,b,c),
idx(a,b) and idx(a), it would be a good idea to create just idx(a). I will
add this to pg_advise as an optional feature.

I'd say it's a new feature request and not a bug :) But I don't understand
why you would want to not build idx(a,b,c) in such a situation. idx(a,b,c)
can be useful in places where idx(a,b) or idx(a) is required, but the same
can't be said about idx(a) or idx(a,b) being useful where idx(a,b,c) is
required!

You are right - that's what I actually meant...
Kai