WAL bypass for INSERT, UPDATE and DELETE?
Having just optimized COPY to avoid writing WAL during the transaction
in which a table was first created, it seems worth considering whether
this should occur for INSERT, UPDATE and DELETE also.
It is fairly common to do data transformation using INSERT SELECTs and
UPDATEs. This is usually done with temporary tables however. (DELETE
would most efficiently be handled as an additional NOT clause on the
insert, so it is uncommonly used in this circumstance.)
However, CREATE TABLE AS SELECT (CTAS) does not allow inheritance, so a
new permanent partition has to be created using CREATE TABLE, followed
by an INSERT SELECT or COPY.
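For illustration, the two-step pattern being described might look like this (all table names are hypothetical):

```sql
-- CTAS cannot attach the result to an inheritance hierarchy, so the
-- new partition is created explicitly and then populated.
BEGIN;
CREATE TABLE sales_2005_12 () INHERITS (sales);  -- hypothetical names
INSERT INTO sales_2005_12 SELECT * FROM staging_sales;
COMMIT;
```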
Is that sufficient reason to optimise INSERT SELECT and UPDATE also? Or
should I not bother? Or should I try to teach CTAS to use inheritance
(which sounds harder and has a few gotchas)?
Currently, CTAS optimization requires a heap_sync during ExecEndPlan. It
would be easy enough to extend this so that it also works for INSERT,
UPDATE and DELETE.
Best Regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes:
Currently, CTAS optimization requires a heap_sync during ExecEndPlan. It
would be easy enough to extend this so that it also works for INSERT,
UPDATE and DELETE.
If you tried to do it that way you'd break the system completely. Not
all updates go through the executor.
I think it's a bad idea anyway; you'd be adding overhead to the lowest
level routines in order to support a feature that would be very seldom
used, at least in comparison to the number of times those routines are
executed.
regards, tom lane
On Thu, Dec 22, 2005 at 02:31:33PM +0000, Simon Riggs wrote:
Having just optimized COPY to avoid writing WAL during the transaction
in which a table was first created, it seems worth considering whether
this should occur for INSERT, UPDATE and DELETE also.

It is fairly common to do data transformation using INSERT SELECTs and
UPDATEs. This is usually done with temporary tables however. (DELETE
would most efficiently be handled as an additional NOT clause on the
insert, so it is uncommonly used in this circumstance.)

However, CREATE TABLE AS SELECT (CTAS) does not allow inheritance, so a
new permanent partition has to be created using CREATE TABLE, followed
by an INSERT SELECT or COPY.

Is that sufficient reason to optimise INSERT SELECT and UPDATE also? Or
should I not bother? Or should I try to teach CTAS to use inheritance
(which sounds harder and has a few gotchas).

Currently, CTAS optimization requires a heap_sync during ExecEndPlan. It
would be easy enough to extend this so that it also works for INSERT,
UPDATE and DELETE.
Well, both UPDATE and DELETE seem like pretty odd use cases to me;
typically I'd do any needed data manipulation during the INSERT SELECT.
But it doesn't make sense to me to set this up for INSERT and ignore
UPDATE and DELETE.
I do think this needs to be something that is made either completely
transparent or must be specifically enabled. As described, I believe
this would break PITR, so users should have to specifically request that
behavior (and they should probably get a WARNING message, too).
Though, what's interesting is that theoretically it should be possible
to do this and still protect PITR, by logging the statements (but not
the actual data) to WAL. This isn't very practical with WAL (though it
might be worth looking at storing a compressed version of what's being
fed in to COPY), but in this case if we end up in a recovery situation
the data that the insert is pulling from should exist in the database in
the same state, so it should be possible to re-create the table. There's
still an issue of how to handle the pages from the new table that will
end up in WAL on subsequent transactions, since presumably they might be
identical, but someone might be able to come up with a clever solution
for that. In the meantime, breaking WAL recovery needs to be something
that users must specifically request, via something like UPDATE NOWAL.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
BTW, this should also probably be moved over to -hackers...
On Thu, 2005-12-22 at 12:12 -0500, Tom Lane wrote:
Simon Riggs <simon@2ndquadrant.com> writes:
Currently, CTAS optimization requires a heap_sync during ExecEndPlan. It
would be easy enough to extend this so that it also works for INSERT,
UPDATE and DELETE.

If you tried to do it that way you'd break the system completely. Not
all updates go through the executor.
Wow, didn't know that.
I think it's a bad idea anyway; you'd be adding overhead to the lowest
level routines in order to support a feature that would be very seldom
used, at least in comparison to the number of times those routines are
executed.
Agreed.
Maybe just INSERT SELECT then. That's easy enough to test for without
altering the main code path in the executor too much. If anybody is
going to say they want it?
Best Regards, Simon Riggs
On Thu, Dec 22, 2005 at 12:37:51PM -0600, Jim C. Nasby wrote:
I do think this needs to be something that is made either completely
transparent or must be specifically enabled. As described, I believe
this would break PITR, so users should have to specifically request that
behavior (and they should probably get a WARNING message, too).
This reminds me of a friend who used MSSQL and had replication going
that broke every time you did a certain statement. It may have been
SELECT INTO [1]. His main problem was that the replication would
stop working silently. We need to be waving red flags if we break
someone's backup procedure.
Considering "WAL bypass" is code for "breaks PITR", I think we really
need to make sure people realise that running such a command breaks
their backups/replication/whatever people are doing.
[1]: http://msdn.microsoft.com/library/default.asp?url=/library/en-us/trblsql/tr_reslsyserr_1_1r94.asp
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
tool for doing 5% of the work and then sitting around waiting for someone
else to do the other 95% so you can sue them.
On Thu, 2005-12-22 at 21:18 +0100, Martijn van Oosterhout wrote:
Considering "WAL bypass" is code for "breaks PITR"
No it isn't. All of the WAL bypass logic does *not* operate when PITR is
active. The WAL bypass logic is aimed at Data Warehouses, which
typically never operate in PITR mode for performance reasons, however
the choice is yours.
Best Regards, Simon Riggs
* Simon Riggs (simon@2ndquadrant.com) wrote:
On Thu, 2005-12-22 at 21:18 +0100, Martijn van Oosterhout wrote:
Considering "WAL bypass" is code for "breaks PITR"
No it isn't. All of the WAL bypass logic does *not* operate when PITR is
active. The WAL bypass logic is aimed at Data Warehouses, which
typically never operate in PITR mode for performance reasons, however
the choice is yours.
Eh? PITR mode is bad for performance? Maybe I missed something but I
wouldn't have thought PITR would degrade regular performance all that
badly. So long as it doesn't take 15 minutes or some such to move the
WAL to somewhere else (and I'm not sure that'd even slow things down..).
For a Data Warehouse, have you got a better way of doing backups such
that you don't lose at minimum most of a day's work? I'm not exactly a
big fan of doing a pg_dump every night either, given that the database is
360GB. Much nicer to take a weekly dump of the database and then do
PITR for a week or two before taking another dump of the db.
I like the idea of making COPY go faster, but please don't break my
backup system while you're at it. I'm honestly kind of nervous about
what you mean by checking if PITR is active - how is that done, exactly?
Check if you have a script set to rotate the logs elsewhere? Or is it
checking if you're in the taking-a-full-database-backup stage? Or what?
What's the performance decrease when using PITR, and what's it from? Is
it just that COPY isn't as fast? Honestly, I could live with COPY being
not as fast as it could be if my backups work. :)
Sorry for sounding concerned but, well, backups are very important and
so is performance and I'm afraid either I've not read all the
documentation about the issues being discussed here or there isn't
enough out there to make sense of it all yet. :)
Thanks,
Stephen
"Simon Riggs" <simon@2ndquadrant.com> wrote
No it isn't. All of the WAL bypass logic does *not* operate when PITR is
active. The WAL bypass logic is aimed at Data Warehouses, which
typically never operate in PITR mode for performance reasons, however
the choice is yours.
To make things simpler, is it possible to add a GUC to let the user disable *all* the
xlogs?
Regards,
Qingqing
"Qingqing Zhou" <zhouqq@cs.toronto.edu> wrote
To make things, is it possible to add a GUC to let user disable *all* the
xlogs?
It may work like this:
BEGIN TRANSACTION WITHOUT XLOG; /* forbid vacuum, PITR etc */
BEGIN
... /* no xlog during this period */
END; /* don't mark this transaction committed */
BEGIN
...
END;
END TRANSACTION DO COMMIT; /* at this time issue checkpoint && mark all
transactions committed */
So during this period, if any transaction fails, the only consequence is
some invisible garbage data. When everything goes well, END
TRANSACTION DO COMMIT will mark these transactions permanent. Also, there
seems to be no problem even with XLOG_NO_TRAN updates.
Regards,
Qingqing
"Qingqing Zhou" <zhouqq@cs.toronto.edu> writes:
BEGIN TRANSACTION WITHOUT XLOG; /* forbid vacuum, PITR etc */
So during this period, if any transaction fails, the only consequence is
some invisible garbage data.
No, the likely consequence is irretrievable corruption of any table or
index page touched by the transaction. You're going to have a very hard
time selling this as a good idea.
regards, tom lane
"Tom Lane" <tgl@sss.pgh.pa.us> wrote
"Qingqing Zhou" <zhouqq@cs.toronto.edu> writes:
BEGIN TRANSACTION WITHOUT XLOG; /* forbid vacuum, PITR etc */
So during this period, if any transaction fails, the only consequence is
some invisible garbage data.

No, the likely consequence is irretrievable corruption of any table or
index page touched by the transaction.
I guess I know (at least part) of what you mean. This is because we rely on
replaying all the xlog, no matter whether it belongs to a committed transaction
or not. Why? Because a failed transaction is not totally useless, since a later
transaction may rely on some physical thing it creates - for example, a new
page and its links in a btree. So for heap, there is (95% sure) no such
problem: our heap redo algorithm can automatically add empty pages to a
heap. For indexes, there are problems, but I suspect they are solvable by not
bypassing those records ... if this is not a total non-starter, I will
investigate the details of how to do it.
Regards,
Qingqing
"Qingqing Zhou" <zhouqq@cs.toronto.edu> writes:
I guess I know (at least part) of what you mean. This is because we rely on
replaying all the xlog, no matter whether it belongs to a committed transaction
or not. Why? Because a failed transaction is not totally useless, since a later
transaction may rely on some physical thing it creates - for example, a new
page and its links in a btree. So for heap, there is (95% sure) no such
problem.
Torn pages (partial page write) are still a problem.
regards, tom lane
On Thu, 2005-12-22 at 17:36 -0500, Stephen Frost wrote:
* Simon Riggs (simon@2ndquadrant.com) wrote:
On Thu, 2005-12-22 at 21:18 +0100, Martijn van Oosterhout wrote:
Considering "WAL bypass" is code for "breaks PITR"
No it isn't. All of the WAL bypass logic does *not* operate when PITR is
active. The WAL bypass logic is aimed at Data Warehouses, which
typically never operate in PITR mode for performance reasons, however
the choice is yours.
OK, thanks for saying all of that; you probably speak for many in
raising these concerns. I'll answer each bit as we come to it. Suffice
to say, your concerns are good and so are the answers:
Eh? PITR mode is bad for performance? Maybe I missed something but I
wouldn't have thought PITR would degrade regular performance all that
badly.
PITR mode is *not* bad for performance. On a very heavily loaded
write-intensive test system, the general PITR overhead on regular
performance was around 1% - so almost negligible.
We have been discussing a number of optimizations to specific commands
that would allow them to avoid writing WAL and thus speed up their
performance. If archive_command is set then WAL will always be written;
if it is not set then these commands will (or could) go faster:
- CREATE TABLE AS SELECT (in 8.1)
- COPY LOCK (patch submitted)
- COPY in same transaction as CREATE TABLE (patch submitted)
- INSERT SELECT in same transaction as CREATE TABLE (this discussion)
(There are a number of other conditions also, such as that there must be no
indexes on the table; all of these are now documented with the patch.)
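As a sketch, the COPY-in-same-transaction case would look like this (file and table names are hypothetical); per the conditions above, archive_command must be unset and the table must have no indexes for the optimization to apply:

```sql
BEGIN;
CREATE TABLE facts_load (id int, payload text);  -- hypothetical table
-- The table was created in this same transaction, so COPY can skip
-- writing WAL, relying on a heap_sync at commit instead.
COPY facts_load FROM '/tmp/facts.dat';
COMMIT;
```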
So long as it doesn't take 15 minutes or some such to move the
WAL to somewhere else (and I'm not sure that'd even slow things down..).
For a Data Warehouse, have you got a better way of doing backups such
that you don't lose at minimum most of a day's work?
Yes. Don't just use the backup facilities on their own. Think about how
the architecture of your systems will work and see if there is a better
way when you look at very large systems.
I'm not exactly a
big fan of doing a pg_dump every night either, given that the database is
360GB. Much nicer to take a weekly dump of the database and then do
PITR for a week or two before taking another dump of the db.
e.g. Keep your reference data (low volume) in an Operational Data Store
(ODS) database, protected by archiving. Keep your main fact data (high
volume) in the Data Warehouse, but save the data in slices as you load
it, so that a recovery is simply a reload of the database: no PITR or
pg_dump required, so high performance data transformation and load work
is possible. This is a commonly used architectural design pattern.
I like the idea of making COPY go faster, but please don't break my
backup system while you're at it.
On a personal note, I would only add that I spent a long time working on
PITR and I would never design anything that would intentionally break it
(nor would patches be accepted that did that). That probably gives me
the confidence to approach designs that might look like I'm doing that,
but without actually straying over the edge.
I'm honestly kind of nervous about
what you mean by checking if PITR is active - how is that done, exactly?
Check if you have a script set to rotate the logs elsewhere? Or is it
checking if you're in the taking-a-full-database-backup stage? Or what?
Internally, we use XLogArchivingActive(). Externally this will be set
when the admin sets archive_command to a particular value.
My original preference was for a parameter called archive_mode= ON | OFF
which would allow us to more easily discuss this, but this does not
currently exist.
What's the performance decrease when using PITR, and what's it from? Is
it just that COPY isn't as fast? Honestly, I could live with COPY being
not as fast as it could be if my backups work. :)
These commands will not be optimized for speed when archive_command is set:
- CREATE TABLE AS SELECT (in 8.1)
- COPY LOCK (patch submitted)
Sorry for sounding concerned but, well, backups are very important and
so is performance and I'm afraid either I've not read all the
documentation about the issues being discussed here or there isn't
enough out there to make sense of it all yet. :)
If you choose PITR, then you are safe. If you do not, the crash recovery
of the database is not endangered by these optimizations.
Hope that covers all of your concerns?
I'm just writing a course that explains many of these techniques,
available in the New Year.
Best Regards, Simon Riggs
Torn pages (partial page write) are still a problem.
I revised the idea with MINIMAL XLOG (instead of WITHOUT XLOG) as shown
below. I think in this way, we can always guarantee its correctness and
can always improve it.
To Use It
----------
A "BEGIN TRANSACTION MINIMAL XLOG/END" block is a special "BEGIN/END"
transaction block. It tries to avoid unnecessary xlogs but still preserves
transaction semantics. It is good for the situation where the user wants to
do a big data load. It is issued like this:
1. BEGIN TRANSACTION MINIMAL XLOG
2. ... /* statements */
3. END;
From the user's view, it is almost the same as an ordinary transaction: if
everything runs smoothly from step 1 to 3, the transaction will be made
durable. If any step fails (including ABORT, a transaction error, or a system
crash), it looks like nothing happened. To make life easier, no
subtransactions are allowed.
To Implement It
----------------
At step 1, we will disallow some operations, including vacuum and PITR.
At step 2, only minimal xlog entries are logged. If anything inside
fails, handle it like an ordinary transaction.
At step 3, we issue a checkpoint, then mark the transaction committed. If
step 3 itself fails, handle it like an ordinary transaction.
The correctness is easy to see: if we let "minimal xlog" equal "all xlog",
then it is exactly the same as an ordinary transaction plus a checkpoint
inside the transaction block.
Based on the above proof, we can have the following implementation steps:
1. We first make the framework without revising any XLogInsert() - thus
the implementation is correct;
2. Examine each XLogInsert() and differentiate its content depending on
whether MINIMAL XLOG is set or not.
The importance of the above steps is that there is no need to pin down
all of the MINIMAL XLOG content up front; we can do it gradually in a
non-invasive way.
Minimal Xlog
-------------
The xlog of a failed transaction is not totally useless, since a later
transaction may rely on something it creates - for example, a new page
and its links in a btree. We have to pick up these xlogs.
RM_HEAP_ID: The problem for heap is torn page prevention. We currently copy
the whole page into xlog the first time it is touched after a checkpoint,
so we always have this copy to replace a data file page that might have been
torn on write. I haven't come up with any good method to handle this so far,
so we keep it. (We can possibly avoid copying a P_NEW page, but that's
another story.) So the xlog we can avoid at least includes the
inserts/updates/deletes that happen on a page that does not need to be
copied, which will give us a 50% xlog volume/contention reduction, I think.
RM_BTREE_ID/RM_HASH_ID/RM_GIST_ID: For indexes, things get more complex. We
need the xlogs to maintain the structure of the btree index, like the
pointers, high key etc, but the content is not necessarily needed. Need
more research here.
RM_XLOG_ID/RM_XACT_ID/RM_SMGR_ID/RM_CLOG_ID/RM_DBASE_ID/RM_TBLSPC_ID/RM_MULTIXACT_ID/RM_SEQ_ID:
It is hard to avoid much here, but these are not an important contributor
to xlog volume.
Regards,
Qingqing
Qingqing Zhou <zhouqq@cs.toronto.edu> writes:
I revised the idea with MINIMAL XLOG (instead of WITHOUT XLOG) as shown
below. I think in this way, we can always guarantee its correctness and
can always improve it.
I think the entire idea is a waste of time anyway. If we have the COPY
case covered then we have done the only case that has any practical use
AFAICS --- you don't use UPDATE or DELETE for bulk loads, and you don't
use retail INSERTs either if you are striving for speed.
I don't want to see us mucking with the WAL logic without a *whole* lot
better rationale than has been presented so far.
regards, tom lane
On Fri, 23 Dec 2005, Tom Lane wrote:
Qingqing Zhou <zhouqq@cs.toronto.edu> writes:
I revised the idea with MINIMAL XLOG (instead of WITHOUT XLOG) as shown
below. I think in this way, we can always guarantee its correctness and
can always improve it.

I think the entire idea is a waste of time anyway. If we have the COPY
case covered then we have done the only case that has any practical use
AFAICS --- you don't use UPDATE or DELETE for bulk loads, and you don't
use retail INSERTs either if you are striving for speed.
There are several posts on the list asking about the NOLOGGING option
provided by Oracle. Users may need it to do bulk updates against the
database. I don't think we plan to support it since it does not guarantee
transaction semantics. But MINIMAL XLOG is something that we can afford to
do, and no invasive change is needed AFAICS.
Regards,
Qingqing
Tom Lane <tgl@sss.pgh.pa.us> writes:
Qingqing Zhou <zhouqq@cs.toronto.edu> writes:
I revised the idea with MINIMAL XLOG (instead of WITHOUT XLOG) as shown
below. I think in this way, we can always guarantee its correctness and
can always improve it.

I think the entire idea is a waste of time anyway. If we have the COPY
case covered then we have done the only case that has any practical use
AFAICS --- you don't use UPDATE or DELETE for bulk loads, and you don't
use retail INSERTs either if you are striving for speed.
Well the "industry standard" approach for loading data is called "ETL". I
forget what the E stands for, but the idea is to first load the data into a
table exactly as it appears in the input data.
Then the "T" stands for "transform". This could include just normalizing the
input data into the correct format but it could also include looking up
foreign key values for input strings, checking for duplicate records, etc.
This would be a series of UPDATEs and DELETEs.
I guess the "L" stands for load, where you insert it into the real tables.
Up until the point where you load it into the real tables you don't really
care about the transactional integrity of the data. If it's lost you can
always just restart the process.
In many cases you could use temporary tables, but sometimes you might want
multiple processes or multiple transactions to be able to see the data.
Consider for example a loading process that includes a human operator
reviewing the data before the data is approved to be loaded into the final
tables.
But I don't see turning on and off the WAL on a per-transaction basis to be
useful. Every transaction in the system is affected by the WAL status of every
other transaction working with the same tables. It doesn't serve any purpose
to have one transaction bypassing the WAL while everyone else does WAL logging
for the same table; they're all going to lose if the system crashes.
It seems to me the only rational way to approach this is to have a per-table
flag that sets that table to be non-logged. Essentially changing a table's
behaviour to that of a temporary table except that other transactions can see
it. If the system crashes the table is truncated on system restore.
The only problem I have with this is that it smells too much like MySQL MyISAM
tables...
--
greg
"Greg Stark" <gsstark@mit.edu> wrote
But I don't see turning on and off the WAL on a per-transaction basis to be
useful. Every transaction in the system is affected by the WAL status of every
other transaction working with the same tables. It doesn't serve any purpose
to have one transaction bypassing the WAL while everyone else does WAL logging
for the same table; they're all going to lose if the system crashes.
Sure, so a minimal amount of xlog is required. And to make a finished
transaction durable, issue a checkpoint.
It seems to me the only rational way to approach this is to have a per-table
flag that sets that table to be non-logged. Essentially changing a table's
behaviour to that of a temporary table except that other transactions can see
it. If the system crashes the table is truncated on system restore.

The only problem I have with this is that it smells too much like MySQL MyISAM
tables...
Tables are related - say table A references table B - so setting a per-table
flag is hard to use or doesn't work.
Regards,
Qingqing
Greg Stark <gsstark@mit.edu> writes:
It seems to me the only rational way to approach this is to have a per-table
flag that sets that table to be non-logged. Essentially changing a table's
behaviour to that of a temporary table except that other transactions can see
it.
But what's the point? Nowhere in your scenario is there a reason why
we need to have multiple sessions working on the data being loaded.
So a temp table solves the problem perfectly. (Temp tables do span
transactions, just not sessions.)
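For example, a temp table does survive across transactions within the session that created it (table and file names hypothetical):

```sql
CREATE TEMP TABLE load_stage (id int, val text);  -- session-local, not WAL-logged

BEGIN;
COPY load_stage FROM '/tmp/batch1.dat';
COMMIT;

-- A later transaction in the same session still sees the staged rows.
BEGIN;
INSERT INTO facts SELECT id, val FROM load_stage;
COMMIT;
```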
I've got a fundamental philosophical objection to proposals in this
line, which I suppose I'd better state for the record. I don't like
inventing nonstandard SQL extensions or peculiar semantics just to gain
performance. It imposes extra complexity on users that they could do
without; the first time someone loses critical data because he didn't
fully understand the tradeoffs involved, he'll have a justifiable gripe.
I also don't like playing Oracle's or MySQL's game by inventing
proprietary syntax. We claim to believe in standards compliance, so we
should have a pretty high bar for inventing nonstandard syntax. When
the proposed change has a narrow use-case and is only going to yield
marginal improvements even in that case, I think we should just say no.
Bottom line: if we can't figure out how to do it transparently, I think
we haven't thought hard enough.
regards, tom lane