COPY command details

Started by Benjamin Arai · about 19 years ago · 23 messages · general
#1Benjamin Arai
benjamin@araisoft.com

Hi,

If I have a PostgreSQL table with records and logical indexes already
created, and I use COPY to load additional data, does COPY update the
indexes during the load, after it, or not at all?

Benjamin

#2A. Kretschmer
andreas.kretschmer@schollglas.com
In reply to: Benjamin Arai (#1)
Re: COPY command details

On Thu, 29.03.2007 at 00:13:09 -0700, Benjamin Arai wrote the following:

Hi,

If I have a PostgreSQL table with records and logical indexes already
created, and I use COPY to load additional data, does COPY update the
indexes during the load, after it, or not at all?

after, i think.

test=# create table foo (id int primary key, name text);
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "foo_pkey" for table "foo"
CREATE TABLE
test=*# copy foo from stdin;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.

1 a
2 b
2 c
\.

ERROR: duplicate key violates unique constraint "foo_pkey"
CONTEXT: COPY foo, line 3: "2 c"
test=*#

I can type the wrong key and the error occurs later, with the final \.

Andreas
--
Andreas Kretschmer
Kontakt: Heynitz: 035242/47150, D1: 0160/7141639 (mehr: -> Header)
GnuPG-ID: 0x3FFF606C, privat 0x7F4584DA http://wwwkeys.de.pgp.net

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: A. Kretschmer (#2)
Re: COPY command details

"A. Kretschmer" <andreas.kretschmer@schollglas.com> writes:

On Thu, 29.03.2007 at 00:13:09 -0700, Benjamin Arai wrote the following:

If I have a PostgreSQL table with records and logical indexes already
created, and I use COPY to load additional data, does COPY update the
indexes during the load, after it, or not at all?

after, i think.

test=# create table foo (id int primary key, name text);
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "foo_pkey" for table "foo"
CREATE TABLE
test=*# copy foo from stdin;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
1 a
2 b
2 c
\.
ERROR: duplicate key violates unique constraint "foo_pkey"
CONTEXT: COPY foo, line 3: "2 c"
test=*#

I can type the wrong key and the error occurs later, with the final \.

No, "during" is the right answer. The above only demonstrates that
libpq buffers COPY data in larger-than-one-line units --- once the
data gets to the backend it's inserted and checked a row at a time.

regards, tom lane

#4Benjamin Arai
me@benjaminarai.com
In reply to: A. Kretschmer (#2)
Re: COPY command details

So, is there a way to defer the index updates until a later time? More
specifically, I would like to do several COPYs into a running database,
then afterward force an update of the index via a vacuum or something
similar.

Benjamin

On Mar 29, 2007, at 1:03 AM, A. Kretschmer wrote:


On Thu, 29.03.2007 at 00:13:09 -0700, Benjamin Arai wrote the following:

Hi,

If I have a PostgreSQL table with records and logical indexes already
created, and I use COPY to load additional data, does COPY update the
indexes during the load, after it, or not at all?

after, i think.

test=# create table foo (id int primary key, name text);
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index
"foo_pkey" for table "foo"
CREATE TABLE
test=*# copy foo from stdin;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.

1 a
2 b
2 c
\.

ERROR: duplicate key violates unique constraint "foo_pkey"
CONTEXT: COPY foo, line 3: "2 c"
test=*#

I can type the wrong key and the error occurs later, with the final \.

Andreas
--
Andreas Kretschmer
Kontakt: Heynitz: 035242/47150, D1: 0160/7141639 (mehr: -> Header)
GnuPG-ID: 0x3FFF606C, privat 0x7F4584DA http://wwwkeys.de.pgp.net

---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
       subscribe-nomail command to majordomo@postgresql.org so that your
       message can get through to the mailing list cleanly


#6A. Kretschmer
andreas.kretschmer@schollglas.com
In reply to: Benjamin Arai (#4)
Re: COPY command details

On Thu, 29.03.2007 at 10:02:49 -0700, Benjamin Arai wrote the following:

So, is there a way to defer the index updates until a later time? More
specifically, I would like to do several COPYs into a running database,
then afterward force an update of the index via a vacuum or something
similar.

Drop the index(es), do the COPIES, recreate the index(es). You can do this
within a transaction.
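A minimal sketch of that pattern (the table, index, and file names here are invented for illustration; note that a primary-key index cannot be dropped this way without dropping the constraint itself):

```sql
BEGIN;
-- drop the secondary index so COPY does plain heap inserts
DROP INDEX foo_name_idx;
COPY foo FROM '/tmp/new_rows.dat';
-- rebuild the index in a single pass over the table
CREATE INDEX foo_name_idx ON foo (name);
COMMIT;
```

DROP INDEX takes an exclusive lock on the table, so concurrent queries will block until the COMMIT; this is best done in a maintenance window.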

Andreas
--
Andreas Kretschmer
Kontakt: Heynitz: 035242/47150, D1: 0160/7141639 (mehr: -> Header)
GnuPG-ID: 0x3FFF606C, privat 0x7F4584DA http://wwwkeys.de.pgp.net

#7Bruce Momjian
bruce@momjian.us
In reply to: Benjamin Arai (#4)
Re: COPY command details

Benjamin Arai wrote:

So, is there a way to defer the index updates until a later time? More
specifically, I would like to do several COPYs into a running database,
then afterward force an update of the index via a vacuum or something
similar.

Sure, drop the index, do the COPY, and then recreate the index. That is
done often.

---------------------------------------------------------------------------

Benjamin

On Mar 29, 2007, at 1:03 AM, A. Kretschmer wrote:

On Thu, 29.03.2007 at 00:13:09 -0700, Benjamin Arai wrote the following:

Hi,

If I have a PostgreSQL table with records and logical indexes already
created, and I use COPY to load additional data, does COPY update the
indexes during the load, after it, or not at all?

after, i think.

test=# create table foo (id int primary key, name text);
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index
"foo_pkey" for table "foo"
CREATE TABLE
test=*# copy foo from stdin;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.

1 a
2 b
2 c
\.

ERROR: duplicate key violates unique constraint "foo_pkey"
CONTEXT: COPY foo, line 3: "2 c"
test=*#

I can type the wrong key and the error occurs later, with the final \.

Andreas
--
Andreas Kretschmer
Kontakt: Heynitz: 035242/47150, D1: 0160/7141639 (mehr: -> Header)
GnuPG-ID: 0x3FFF606C, privat 0x7F4584DA http://wwwkeys.de.pgp.net


---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#8Benjamin Arai
benjamin@araisoft.com
In reply to: Bruce Momjian (#7)
Re: COPY command details

I would prefer not to drop the index because the database is several
hundred gigs. I would prefer to incrementally add to the index.

Benjamin

Bruce Momjian wrote:


Benjamin Arai wrote:

So, is there a way to defer the index updates until a later time? More
specifically, I would like to do several COPYs into a running database,
then afterward force an update of the index via a vacuum or something
similar.

Sure, drop the index, do the COPY, and then recreate the index. That is
done often.

---------------------------------------------------------------------------

Benjamin

On Mar 29, 2007, at 1:03 AM, A. Kretschmer wrote:

On Thu, 29.03.2007 at 00:13:09 -0700, Benjamin Arai wrote the following:

Hi,

If I have a PostgreSQL table with records and logical indexes already
created, and I use COPY to load additional data, does COPY update the
indexes during the load, after it, or not at all?

after, i think.

test=# create table foo (id int primary key, name text);
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index
"foo_pkey" for table "foo"
CREATE TABLE
test=*# copy foo from stdin;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.

1 a
2 b
2 c
\.

ERROR: duplicate key violates unique constraint "foo_pkey"
CONTEXT: COPY foo, line 3: "2 c"
test=*#

I can type the wrong key and the error occurs later, with the final \.

Andreas
--
Andreas Kretschmer
Kontakt: Heynitz: 035242/47150, D1: 0160/7141639 (mehr: -> Header)
GnuPG-ID: 0x3FFF606C, privat 0x7F4584DA http://wwwkeys.de.pgp.net


#9Ron Johnson
ron.l.johnson@cox.net
In reply to: Bruce Momjian (#7)
Re: COPY command details


On 03/29/07 14:41, Bruce Momjian wrote:

Benjamin Arai wrote:

So, is there a way to defer the index updates until a later time? More
specifically, I would like to do several COPYs into a running database,
then afterward force an update of the index via a vacuum or something
similar.

Sure, drop the index, do the COPY, and then recreate the index. That is
done often.

ALTER INDEX xxxx MAINTENANCE IS DISABLED would be helpful, so that
you don't have to drag (possibly changing) SQL around "everywhere".

So an example evening load job would be:
ALTER INDEX foobar_idx MAINTENANCE IS DISABLED;
COPY foo FROM 'blarg';
REINDEX INDEX foobar_idx;

So if the DBA decides that foobar_idx needs different fields, you
don't have to edit the evening load job when the index changes.

--
Ron Johnson, Jr.
Jefferson LA USA

Give a man a fish, and he eats for a day.
Hit him with a fish, and he goes away for good!


#10Bruce Momjian
bruce@momjian.us
In reply to: Benjamin Arai (#8)
Re: COPY command details

Benjamin Arai wrote:

I would prefer not to drop the index because the database is several
hundred gigs. I would prefer to incrementally add to the index.

I know of no way to do that in a batch, unless you go with partitioned
tables.

---------------------------------------------------------------------------

Benjamin

Bruce Momjian wrote:

Benjamin Arai wrote:

So, is there a way to defer the index updates until a later time? More
specifically, I would like to do several COPYs into a running database,
then afterward force an update of the index via a vacuum or something
similar.

Sure, drop the index, do the COPY, and then recreate the index. That is
done often.

---------------------------------------------------------------------------

Benjamin

On Mar 29, 2007, at 1:03 AM, A. Kretschmer wrote:

On Thu, 29.03.2007 at 00:13:09 -0700, Benjamin Arai wrote the following:

Hi,

If I have a PostgreSQL table with records and logical indexes already
created, and I use COPY to load additional data, does COPY update the
indexes during the load, after it, or not at all?

after, i think.

test=# create table foo (id int primary key, name text);
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index
"foo_pkey" for table "foo"
CREATE TABLE
test=*# copy foo from stdin;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.

1 a
2 b
2 c
\.

ERROR: duplicate key violates unique constraint "foo_pkey"
CONTEXT: COPY foo, line 3: "2 c"
test=*#

I can type the wrong key and the error occurs later, with the final \.

Andreas
--
Andreas Kretschmer
Kontakt: Heynitz: 035242/47150, D1: 0160/7141639 (mehr: -> Header)
GnuPG-ID: 0x3FFF606C, privat 0x7F4584DA http://wwwkeys.de.pgp.net


--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#11Ron Johnson
ron.l.johnson@cox.net
In reply to: Benjamin Arai (#8)
Re: COPY command details


On 03/29/07 14:51, Benjamin Arai wrote:

I would prefer not to drop the index because the database is several
hundred gigs. I would prefer to incrementally add to the index.

Some RDBMSs (well, one that I know of) have the ability to defer
index updates during data load, and it actually works very well.

The down side is that if there's a unique value constraint failure,
you don't know which record it failed on.

--
Ron Johnson, Jr.
Jefferson LA USA

Give a man a fish, and he eats for a day.
Hit him with a fish, and he goes away for good!


#12Tom Lane
tgl@sss.pgh.pa.us
In reply to: Benjamin Arai (#8)
Re: COPY command details

Benjamin Arai <benjamin@araisoft.com> writes:

I would prefer not to drop the index because the database is several
hundred gigs. I would prefer to incrementally add to the index.

This may well be false economy. I don't have numbers at hand, but a
full rebuild can be substantially faster than adding a large number
of rows to the index incrementally. Also, you don't end up with a
physically disordered index, so there is some ongoing performance
benefit.

regards, tom lane

#13Ron Johnson
ron.l.johnson@cox.net
In reply to: Tom Lane (#12)
Re: COPY command details


On 03/29/07 18:35, Tom Lane wrote:

Benjamin Arai <benjamin@araisoft.com> writes:

I would prefer not to drop the index because the database is several
hundred gigs. I would prefer to incrementally add to the index.

This may well be false economy. I don't have numbers at hand, but a
full rebuild can be substantially faster than adding a large number
of rows to the index incrementally. Also, you don't end up with a
physically disordered index, so there is some ongoing performance
benefit.

But deferring the index updates allows you to play games with the
index input data, such as presorting it to take advantage of
locality of data.
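As a hypothetical sketch (the file names are invented), tab-separated COPY input keyed on an integer first column can be presorted with plain shell tools before loading, so index insertions touch pages in order:

```shell
# Sample tab-separated COPY input (invented data)
printf '2\tbeta\n1\talpha\n3\tgamma\n' > unsorted.tsv
# Sort numerically on the first (key) column
TAB="$(printf '\t')"
sort -t "$TAB" -k1,1n unsorted.tsv > sorted.tsv
cat sorted.tsv
# afterwards, in psql: COPY foo FROM '/path/to/sorted.tsv';
```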

--
Ron Johnson, Jr.
Jefferson LA USA

Give a man a fish, and he eats for a day.
Hit him with a fish, and he goes away for good!


#14Benjamin Arai
benjamin@araisoft.com
In reply to: Tom Lane (#12)
Re: COPY command details

I agree, this is true if I cannot defer index updates. But if it is
possible to defer index updates until the end, then I should be able to
achieve some sort of speedup. Rebuilding an index can't be the
PostgreSQL solution for all cases. I am dealing with databases in the
hundreds of gigs range and I am adding about 10 gigs of data a week. At
some point it's going to take longer than a week to rebuild all of the
indexes in the database.

On the other hand, if I am to partition the data into several tables
then it might not be such a big deal since I am only adding and never
deleting... This makes it a little more of a pain in the ass.

Benjamin

Tom Lane wrote:

Show quoted text

Benjamin Arai <benjamin@araisoft.com> writes:

I would prefer not to drop the index because the database is several
hundred gigs. I would prefer to incrementally add to the index.

This may well be false economy. I don't have numbers at hand, but a
full rebuild can be substantially faster than adding a large number
of rows to the index incrementally. Also, you don't end up with a
physically disordered index, so there is some ongoing performance
benefit.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

#15Gerald Timothy G Quimpo
gerald.quimpo@qualservcentral.com
In reply to: Benjamin Arai (#14)
Re: COPY command details

On Thu, 2007-03-29 at 21:30 -0700, Benjamin Arai wrote:

Rebuilding an index can't be the PostgreSQL solution for all
cases. I am dealing with databases in the hundreds of gigs
range and I am adding about 10 gigs of data a week. At
some point it's going to take longer than a week to rebuild
all of the indexes in the database.

On the other hand, if I am to partition the data into
several tables then it might not be such a big deal since
I am only adding and never deleting... This makes it a
little more of a pain in the ass.

I am leaning toward a policy of always partitioning large
tables. I haven't found the time to do it properly yet;
I'm thinking about it, hoping that someone who's done it will
chime in with their rules of thumb.

Like Benjamin, I have a database that is close to 600GB
for 2.25 years of data, and if I were to add the other
4 years of data that we have archived away, it would easily go
into the terabyte range. There are a few individual tables
which approach 100GB all by themselves.

As it is, I can't afford to do reindex or even backup
(pg_dump or any other method) or other administrative tasks
on those tables, since the processes take too long. (There are
workarounds: I could back up single tables at slack times,
which would allow me to do a complete backup, but not a
self-consistent one as a set, over the course of a week or so.)

So I'm leaning toward partitioning, perhaps selecting
partition rules so that no table will be larger than
around 5GB, at which point, reindex or admin procedures
that take exclusive locks now become only minor
inconveniences rather than showstoppers.

How do people take consistent backups of very large
databases on Linux/FreeBSD? I'm aware of PITR, but
might not be able to set aside a box with enough
drives for it. LVM Snapshot? performance issues with
LVM, etc?

tiger

--
Gerald Timothy Quimpo gerald.quimpo@qualservcentral.com
Business Systems Development, KFC/Mr Donut/Ramcar

There is nothing more dreadful than imagination without taste.
-- Johann Wolfgang von Goethe

#16Benjamin Arai
benjamin@araisoft.com
In reply to: Gerald Timothy G Quimpo (#15)
Re: COPY command details

I have one system in which I have used partitioning. For this particular
case I have tons of data spanning about 50 years. What I did was write a
small loader that breaks the data into tables based on date, so I have
tables like abc_2000, abc_2001, etc. The loading script is only a couple
hundred lines of code. The only part that was a little bit of work was
to allow for easy access to the data for the devs. I did this by writing
a few PL functions to automatically union the tables and produce
results. So a function like getData(startDate, endDate) would run a
union query for the respective date ranges.
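A hedged sketch of that layout; the column list is invented here, and a real loader would pick the year tables dynamically (e.g. in PL/pgSQL) rather than hard-coding them:

```sql
-- per-year tables produced by the loader (columns invented)
CREATE TABLE abc_2000 (event_date date, payload text);
CREATE TABLE abc_2001 (event_date date, payload text);

-- a function like getData(startDate, endDate) that unions the
-- relevant year tables for the requested range
CREATE FUNCTION getData(startDate date, endDate date)
RETURNS SETOF abc_2000 AS $$
    SELECT * FROM abc_2000 WHERE event_date BETWEEN $1 AND $2
    UNION ALL
    SELECT * FROM abc_2001 WHERE event_date BETWEEN $1 AND $2
$$ LANGUAGE sql;
```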

Benjamin

P.S. If I am doing anything that can be improved for the data access
portion please let me know and feel free to voice your opinions. I am
always looking for new ways to make this particular database faster.

Gerald Timothy G Quimpo wrote:


On Thu, 2007-03-29 at 21:30 -0700, Benjamin Arai wrote:

Rebuilding an index can't be the PostgreSQL solution for all
cases. I am dealing with databases in the hundreds of gigs
range and I am adding about 10 gigs of data a week. At
some point it's going to take longer than a week to rebuild
all of the indexes in the database.

On the other hand, if I am to partition the data into
several tables then it might not be such a big deal since
I am only adding and never deleting... This makes it a
little more of a pain in the ass.

I am leaning toward a policy of always partitioning large
tables. I haven't found the time to do it properly yet;
I'm thinking about it, hoping that someone who's done it will
chime in with their rules of thumb.

Like Benjamin, I have a database that is close to 600GB
for 2.25 years of data, and if I were to add the other
4 years of data that we have archived away, it would easily go
into the terabyte range. There are a few individual tables
which approach 100GB all by themselves.

As it is, I can't afford to do reindex or even backup
(pg_dump or any other method) or other administrative tasks
on those tables, since the processes take too long. (There are
workarounds: I could back up single tables at slack times,
which would allow me to do a complete backup, but not a
self-consistent one as a set, over the course of a week or so.)

So I'm leaning toward partitioning, perhaps selecting
partition rules so that no table will be larger than
around 5GB, at which point, reindex or admin procedures
that take exclusive locks now become only minor
inconveniences rather than showstoppers.

How do people take consistent backups of very large
databases on Linux/FreeBSD? I'm aware of PITR, but
might not be able to set aside a box with enough
drives for it. LVM Snapshot? performance issues with
LVM, etc?

tiger


#18Gerald Quimpo
bopolissimus.lists@gmail.com
In reply to: Benjamin Arai (#16)
Re: COPY command details

On Thu, 2007-03-29 at 22:15 -0700, Benjamin Arai wrote:

I have one system in which I have used partitioning. For this particular
case I have tons of data spanning about 50 years. What I did was write a
small loader that breaks the data into tables based on date, so I have
tables like abc_2000, abc_2001, etc. The loading script is only a couple
hundred lines of code. The only part that was a little bit of work was
to allow for easy access to the data for the devs. I did this by writing
a few PL functions to automatically union the tables and produce
results. So a function like getData(startDate, endDate) would run a
union query for the respective date ranges.

From reading on the list, I think the current recommendations are:

have a parent table and then create the partitions as
descendants (INHERITS) of the parent table.

Use constraints to ensure that nothing gets inserted
into the parents, and that only the correct data gets
inserted into the descendants (i.e., the constraints
enforce that 2002 data won't insert into the 2001 partition,
the insert would fail if you tried that).

Turn constraint_exclusion on in postgresql.conf (helps
the planner ignore partitions that don't qualify), so
it doesn't need to scan partitions where there will be no
matches anyway.

Use rules on the parent to redirect inserts/updates/deletes
to the right partition (or a trigger; I'm having some problems
with both rules and triggers, so I may have the loader
insert straight into the right partition instead, although
I'm still trying to get the rules right).

The advantage of this (parent table plus descendant partitions)
is that you can query from the parent, and the descendants will
be automatically queried, you'll get all matching rows from any
descendants that have them. No need for those unions or
pl/pgsql functions to do the unions for you.
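The recommendations above can be sketched as follows (table and column names are illustrative; this mirrors the inheritance-partitioning pattern in the PostgreSQL documentation of that era):

```sql
CREATE TABLE abc (event_date date NOT NULL, payload text);

-- descendants carry CHECK constraints so constraint_exclusion can
-- skip partitions that cannot match a query
CREATE TABLE abc_2001 (
    CHECK (event_date >= DATE '2001-01-01' AND event_date < DATE '2002-01-01')
) INHERITS (abc);
CREATE TABLE abc_2002 (
    CHECK (event_date >= DATE '2002-01-01' AND event_date < DATE '2003-01-01')
) INHERITS (abc);

-- redirect inserts on the parent into the right partition
CREATE RULE abc_insert_2002 AS ON INSERT TO abc
WHERE (NEW.event_date >= DATE '2002-01-01' AND NEW.event_date < DATE '2003-01-01')
DO INSTEAD INSERT INTO abc_2002 VALUES (NEW.*);

-- and in postgresql.conf: constraint_exclusion = on
```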

That said, I need to make my current code work with smaller
test data sets so I can isolate the problems I'm having with
rules, etc. E.g., my original plan was to detect automatically
(in a before trigger) whether a partition for the to-be-loaded
data set already exists, and if it doesn't, create the partitions
and the rules. I then load the data into the parent table and
expect the newly created rules to redirect the data into the
new partitions. Either my rules are wrong, or there's something
about transaction semantics that's making that fail, since the
rows are inserted into the parent table (no constraint there
yet to disallow inserts into the parent) instead of the right
partition.

To The List:
Are schema changes (such as adding rules and creating child
partitions) part of the same transaction, or do they happen
(magically) outside the current transaction, in a new
transaction (e.g., so that the inserts would fail because
they are running in an older transaction that can't see the
new rule or new partition table yet)?

As I said, I might just be doing something wrong there.
I've set that project aside for now since it's not
urgent (the current unpartitioned system works well enough;
I just avoid reindex, vacuum, vacuum full and pg_dump), but
I'll start working on it again when I get some free time in
a week or two (Holy Week, a holiday where I am :-).

tiger

#19Jaime Silvela
JSilvela@Bear.com
In reply to: Gerald Quimpo (#17)
time series query

The problem I'm trying to solve is pretty standard. I have a table that
records measurements of different types at different times.

CREATE TABLE measurements (
    obj_id int4,
    val_type_id int4 references lu_val_type(val_type_id),
    val numeric,
    observation_date date
);

I want a query as simple and fast as possible to return the latest
observation of each type for each object.
I sent a message to this list a while ago, and the suggestion I found to
be the best compromise of clarity and speed was:
a) create an index on (obj_id, val_type_id, observation_date)
b) the "obvious" query becomes fast thanks to the index.
select ms.*
from (
    select obj_id, val_type_id, max(observation_date) as observation_date
    from measurements
    group by obj_id, val_type_id
) ms_last
join measurements ms using (obj_id, val_type_id, observation_date);
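The index from step (a) would be something like (the index name is invented here):

```sql
CREATE INDEX measurements_latest_idx
    ON measurements (obj_id, val_type_id, observation_date);
```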

It still bugged me a bit that this requires a JOIN, especially since in
a procedural language, it would have been so easy to return the value
associated with the max(observation_date).
I think I've found a pretty good alternative. This at the moment works
if we keep track of time with an integer, rather than a date, but it
would be readily extensible.

The idea is, in fact, to associate the value with the
max(observation_date), like so:
select obj_id, val_type_id, max(array[observation_date, val])
from measurements
group by obj_id, val_type_id;

There are two caveats:
a) an array requires its elements to be of the same type, so
observation_date must be kept as a numeric "time from" value rather
than a date
b) a row constructor would be ideal here, but there is no max()
function for rowtypes.

If I did have a max() function for row types, it would be clean to do this:
select obj_id, val_type_id, max(row(observation_date, val))
from measurements
group by obj_id, val_type_id;

Now, it seems that since rowtype comparison is built in, it should be
pretty easy to build a max() aggregate for it. Has anybody done this?
I'd have looked at the code for max(anyarray) but I don't know how to
access it. Can someone point me in the right direction?

Also, has someone thought about this before? I'm wondering if there will
be a speed gain coming from this.

Thank you,
Jaime


#20Jaime Silvela
JSilvela@Bear.com
In reply to: Jaime Silvela (#19)
Re: time series query

In case anyone is interested, I was able to solve this, more or less.
Here's my new "Latest value" query:

select obj_id, val_type_id, (max(row(observation_date, val))).val
from measurements
group by obj_id, val_type_id

It was only necessary to define a new (date, numeric) type. Below is the
code. The performance is actually slower than using a JOIN between the
table and its GROUP-BY version. I guess for performance, I should code
the functions in C, but at the moment, the value for me is that it
simplifies a lot of my 12-way join queries!
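For reference, the JOIN formulation being compared against can be sketched
end to end with Python's sqlite3 module. This is a sketch only: SQLite
stands in for PostgreSQL, and the sample rows are hypothetical.

```python
import sqlite3

# In-memory stand-in for the measurements table; rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE measurements (
        obj_id INTEGER,
        val_type_id INTEGER,
        val NUMERIC,
        observation_date TEXT
    )
""")
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?, ?, ?)",
    [
        (1, 1, 10.25, "2007-01-01"),
        (1, 1, 12.5,  "2007-03-01"),  # latest for (1, 1)
        (1, 2, 99.5,  "2007-02-01"),  # latest for (1, 2)
        (2, 1, 7.25,  "2007-01-15"),  # latest for (2, 1)
    ],
)

# Group to find the latest date per (obj_id, val_type_id), then join
# back to recover the value observed on that date.
rows = conn.execute("""
    SELECT obj_id, val_type_id, ms.val, observation_date
    FROM (
        SELECT obj_id, val_type_id,
               MAX(observation_date) AS observation_date
        FROM measurements
        GROUP BY obj_id, val_type_id
    ) ms_last
    JOIN measurements ms USING (obj_id, val_type_id, observation_date)
    ORDER BY obj_id, val_type_id
""").fetchall()
print(rows)
```

ISO-formatted date strings compare correctly as text, which is why the
MAX over observation_date works here without a real date type.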

create type dtval as (
dt date,
val numeric
);

create function dtval_smaller(dtval, dtval) returns dtval as $$
select case when $1.dt < $2.dt then $1 else $2 end
$$ language sql;

create aggregate min (
sfunc = dtval_smaller,
basetype = dtval,
stype = dtval
);

create function dtval_larger(dtval, dtval) returns dtval as $$
select case when $1.dt > $2.dt then $1 else $2 end
$$ language sql;

create aggregate max (
sfunc = dtval_larger,
basetype = dtval,
stype = dtval
);
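Procedurally, the aggregate above is a fold that keeps whichever
(dt, val) pair has the later date. A minimal Python sketch of the same
reduction, with hypothetical sample rows:

```python
from datetime import date

# Hypothetical sample rows: (obj_id, val_type_id, val, observation_date)
measurements = [
    (1, 1, 10.0, date(2007, 1, 1)),
    (1, 1, 12.5, date(2007, 3, 1)),
    (1, 2, 99.0, date(2007, 2, 1)),
    (2, 1, 7.0,  date(2007, 1, 15)),
]

def dtval_larger(a, b):
    """Mirror of the SQL dtval_larger: keep the pair with the later date."""
    return a if a[0] > b[0] else b

# Group by (obj_id, val_type_id), folding each group with dtval_larger,
# just as max(row(observation_date, val)) would.
latest = {}
for obj_id, val_type_id, val, obs_date in measurements:
    key = (obj_id, val_type_id)
    pair = (obs_date, val)
    latest[key] = pair if key not in latest else dtval_larger(latest[key], pair)

print(latest)
```

Like the SQL version, ties on the date keep the later-seen row, since the
comparison uses a strict greater-than.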

Jaime Silvela wrote:

The problem I'm trying to solve is pretty standard. I have a table
that records measurements of different types at different times.

CREATE TABLE measurements (
obj_id int4,
val_type_id int4 references lu_val_type(val_type_id),
val numeric,
observation_date date
);

I want a query as simple and fast as possible to return the latest
observation of each type for each object.
I sent a message to this list a while ago, and the suggestion I found
to be the best compromise of clarity and speed was:
a) create an index on (obj_id, val_type_id, observation_date)
b) the "obvious" query becomes fast thanks to the index.
select ms.*
from (
select obj_id, val_type_id, max(observation_date) as
observation_date
from measurements
group by obj_id, val_type_id
) ms_last
join measurements ms using (obj_id, val_type_id, observation_date);

It still bugged me a bit that this requires a JOIN, especially since
in a procedural language, it would have been so easy to return the
value associated with the max(observation_date).
I think I've found a pretty good alternative. This at the moment works
if we keep track of time with an integer, rather than a date, but it
would be readily extensible.

The idea is to in fact, associate the value with the
max(observation_date) like so:
select obj_id, val_type_id, max(array[observation_date, val])
from measurements
group by obj_id, val_type_id;

There are two caveats:
a) an array requires its elements to be of the same type, so
observation_date must be kept as an integer ("time from"), not a date
b) a row constructor would be ideal here, but there is no max
function for rowtypes.

If I did have a max() function for row types, it would be clean to do
this:
select obj_id, val_type_id, max(row(observation_date, val))
from measurements
group by obj_id, val_type_id;

Now, it seems that since rowtype comparison is built in, it should be
pretty easy to build a max() aggregate for it. Has anybody done this?
I'd have looked at the code for max(anyarray) but I don't know how to
access it. Can someone point me in the right direction?

Also, has anyone thought about this before? I'm wondering whether
there will be a speed gain from this.

Thank you,
Jaime



#21William Garrison
postgres@mobydisk.com
In reply to: Jaime Silvela (#20)
#22Jaime Silvela
JSilvela@Bear.com
In reply to: William Garrison (#21)
#23Ron Johnson
ron.l.johnson@cox.net
In reply to: Gerald Timothy G Quimpo (#15)