Is stats update during COPY IN really a good idea?
We have a TODO item
* Update reltuples in COPY
I was just about to go do this when I realized that it may not be such
a hot idea after all. The problem is that updating pg_class.reltuples
means that concurrent COPY operations will block each other, because
they want to update the same row in pg_class. You can already see this
happen in CREATE INDEX:
create table foo(f1 int);
begin;
create index fooey on foo(f1);
-- in another psql do
create index fooey2 on foo(f1);
-- second backend blocks until first xact is committed or rolled back.
While this doesn't bother me for CREATE INDEX, it does bother me for
COPY, since people often use COPY to avoid per-tuple INSERT overhead.
It seems pretty likely that this will cause blocking problems for real
applications. I think that may be a bigger problem than the benefit of
not needing a VACUUM (or, now, ANALYZE) to get the stats updated.
regards, tom lane
We have a TODO item
* Update reltuples in COPYI was just about to go do this when I realized that it may not be such
a hot idea after all. The problem is that updating pg_class.reltuples
means that concurrent COPY operations will block each other, because
they want to update the same row in pg_class. You can already see this
happen in CREATE INDEX:
People are using COPY into the same table at the same time?
While this doesn't bother me for CREATE INDEX, it does bother me for
COPY, since people often use COPY to avoid per-tuple INSERT overhead.
It seems pretty likely that this will cause blocking problems for real
applications. I think that may be a bigger problem than the benefit of
not needing a VACUUM (or, now, ANALYZE) to get the stats updated.
Oh, well we can either decide to do it or remove the TODO item. Either
way we win!
My vote is to update pg_class. The VACUUM takes much more time than the
update, and we are only updating the pg_class row, right? Can't we just
start a new transaction and update the pg_class row, that way we don't
have to open it for writing during the copy.
FYI, I had a 100k deep directory that caused me problems this morning.
Just catching up.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Bruce Momjian <pgman@candle.pha.pa.us> writes:
People are using COPY into the same table at the same time?
Yes --- we had a message from someone who was doing that (and running
into unrelated performance issues) just last week.
My vote is to update pg_class. The VACUUM takes much more time than the
update, and we are only updating the pg_class row, right?
What? What does VACUUM have to do with this?
The reason this is a significant issue is that the first COPY could be
inside a transaction, in which case the lock will persist until that
transaction commits, which could be awhile.
Can't we just start a new transaction and update the pg_class row,
that way we don't have to open it for writing during the copy.
No, we cannot; requiring COPY to happen outside a transaction block is
not acceptable.
regards, tom lane
Bruce Momjian <pgman@candle.pha.pa.us> writes:
People are using COPY into the same table at the same time?
Yes --- we had a message from someone who was doing that (and running
into unrelated performance issues) just last week.
OK.
My vote is to update pg_class. The VACUUM takes much more time than the
update, and we are only updating the pg_class row, right?What? What does VACUUM have to do with this?
You have to VACUUM to get pg_class updated after COPY, right?
The reason this is a significant issue is that the first COPY could be
inside a transaction, in which case the lock will persist until that
transaction commits, which could be awhile.
Oh, I see. Can we disable the pg_class update if we are in a
multi-statement transaction?
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Bruce Momjian <pgman@candle.pha.pa.us> writes:
My vote is to update pg_class. The VACUUM takes much more time than the
update, and we are only updating the pg_class row, right?What? What does VACUUM have to do with this?
You have to VACUUM to get pg_class updated after COPY, right?
But doing this is only interesting if you need to update reltuples in
order to get the planner to generate reasonable plans. In reality, if
you've added enough data to cause the plans to shift, you probably ought
to do an ANALYZE anyway to update pg_statistic. Given that ANALYZE is a
lot cheaper than it used to be, I think that the notion of making COPY
do this looks fairly obsolete anyhow.
Oh, I see. Can we disable the pg_class update if we are in a
multi-statement transaction?
Ugh. Do you really want COPY's behavior to depend on context like that?
If you did want context-dependent behavior, a saner approach would be to
only try to update reltuples if the copy has more than, say, doubled the
old value. This would be likely to happen in bulk load and unlikely to
happen in concurrent-insertions-that-choose-to-use-COPY. But I'm not
convinced we need it at all.
regards, tom lane
We have a TODO item
* Update reltuples in COPYI was just about to go do this when I realized that it may not be such
a hot idea after all.
Imho it is not a good idea at all. The statistics are a very sensitive area,
that imho should only be calculated on request. I already don't like the
statistics that are implicitly created during create index.
Eighter you have online stats keeping or you don't.
For me this is a definite all or nothing issue. Anything inbetween is
only good for unpleasant surprises.
I have very strong feelings about this, because of bad experience.
I would be willing to go into detail.
A syntactic extension to copy ("with analyze") on the other hand would
be a feature.
Andreas
Import Notes
Resolved by subject fallback
You have to VACUUM to get pg_class updated after COPY, right?
But doing this is only interesting if you need to update reltuples in
order to get the planner to generate reasonable plans. In reality, if
you've added enough data to cause the plans to shift, you probably ought
to do an ANALYZE anyway to update pg_statistic. Given that ANALYZE is a
lot cheaper than it used to be, I think that the notion of making COPY
do this looks fairly obsolete anyhow.
Yes, but remember, we are trying to catch ignorant cases, not
experienced people.
Oh, I see. Can we disable the pg_class update if we are in a
multi-statement transaction?Ugh. Do you really want COPY's behavior to depend on context like that?
If you did want context-dependent behavior, a saner approach would be to
only try to update reltuples if the copy has more than, say, doubled the
old value. This would be likely to happen in bulk load and unlikely to
happen in concurrent-insertions-that-choose-to-use-COPY. But I'm not
convinced we need it at all.
Maybe not. The COPY/pg_class hack is just to quiet people who have done
COPY and forgotten VACUUM or ANALYZE. Maybe the user is only performing
a few operations before deleting the table. Updating pg_class does help
in that case.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
[ Charset ISO-8859-1 unsupported, converting... ]
We have a TODO item
* Update reltuples in COPYI was just about to go do this when I realized that it may not be such
a hot idea after all.Imho it is not a good idea at all. The statistics are a very sensitive area,
that imho should only be calculated on request. I already don't like the
statistics that are implicitly created during create index.
OK, if you feel strongly, and Tom does, I will remove the item.
However, just remember that pg_class already has a row count that we
force in there by default.
test=> create table test (x int);
CREATE
test=> select reltuples from pg_class where relname = 'test';
reltuples
-----------
1000
(1 row)
I was just suggesting we make that accurate if we can, even if we can
make it accurate only 80% of the time. Once we INSERT, it isn't
accurate anymore anyway. This is just an estimate, and in my mind, it
doesn't have to be accurate in all cases.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026