WAL bypass for CTAS

Started by Simon Riggsalmost 21 years ago14 messageshackers
Jump to latest
#1Simon Riggs
simon@2ndQuadrant.com

I enclose a complete patch for avoiding WAL usage for CREATE TABLE AS
SELECT, when not in archive mode (PITR). The main use case for this is
large BI environments that create summary tables or prejoined tables,
though there are many general applications.

There is no user interface for this. The speed gain is automatic, when
archiving is not enabled.

This contains all the lower level machinery required to do the same
thing for COPY, as discussed on hackers. The machinery includes some
additional freespace thinkery, aimed mainly at the forthcoming COPY
patch, which solely needs to be integrated with Alon's work.

Patch is diff -c format, compiles and make checks on cvstip as of now.

No performance tests *on this patch*, though the general principle has
already been proven via a similar prototype patch not published on list.

Best Regards, Simon Riggs

Attachments:

ctas.patchtext/x-patch; charset=UTF-8; name=ctas.patchDownload+172-101
#2Bruce Momjian
bruce@momjian.us
In reply to: Simon Riggs (#1)
Re: WAL bypass for CTAS

Simon Riggs wrote:

I enclose a complete patch for avoiding WAL usage for CREATE TABLE AS
SELECT, when not in archive mode (PITR). The main use case for this is
large BI environments that create summary tables or prejoined tables,
though there are many general applications.

There is no user interface for this. The speed gain is automatic, when
archiving is not enabled.

Could we do your NOLOGGING automatically in COPY if we test to see if
anyone else is connected to our current database? I would _love_ to see
pg_dump loads use this automatically, without having to add clauses to
pg_dump output.

I think we decided we can't do it automatically for all zero-row COPYs
because of locking concerns.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#3Neil Conway
neilc@samurai.com
In reply to: Bruce Momjian (#2)
Re: WAL bypass for CTAS

Bruce Momjian wrote:

Could we do your NOLOGGING automatically in COPY if we test to see if
anyone else is connected to our current database?

That seems pretty fragile -- what happens if someone connects after the
COPY has started? Considering that many COPY operations can take many
minutes or hours, I don't think it is wise to make assumptions based on
the initial state of the system.

I would _love_ to see pg_dump loads use this automatically, without
having to add clauses to pg_dump output.

What's wrong with adding clauses to the pg_dump output?

-Neil

#4Bruce Momjian
bruce@momjian.us
In reply to: Neil Conway (#3)
Re: WAL bypass for CTAS

Neil Conway wrote:

Bruce Momjian wrote:

Could we do your NOLOGGING automatically in COPY if we test to see if
anyone else is connected to our current database?

That seems pretty fragile -- what happens if someone connects after the
COPY has started? Considering that many COPY operations can take many
minutes or hours, I don't think it is wise to make assumptions based on
the initial state of the system.

I would _love_ to see pg_dump loads use this automatically, without
having to add clauses to pg_dump output.

What's wrong with adding clauses to the pg_dump output?

Well, it isn't going to help us for 8.1 because 8.0 will not have it,
and if we add the clause we make loading the data into previous releases
harder.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#5Neil Conway
neilc@samurai.com
In reply to: Bruce Momjian (#4)
Re: WAL bypass for CTAS

Bruce Momjian wrote:

Well, it isn't going to help us for 8.1 because 8.0 will not have it,
and if we add the clause we make loading the data into previous releases
harder.

pg_dump output in general is not compatible with prior releases. It
would be a nice feature to have, but until we have it, I don't see that
changing or not changing the COPY syntax will make a major difference to
dump backward compatibility.

-Neil

#6Bruce Momjian
bruce@momjian.us
In reply to: Neil Conway (#5)
Re: [PATCHES] WAL bypass for CTAS

Neil Conway wrote:

Bruce Momjian wrote:

Well, it isn't going to help us for 8.1 because 8.0 will not have it,
and if we add the clause we make loading the data into previous releases
harder.

pg_dump output in general is not compatible with prior releases. It
would be a nice feature to have, but until we have it, I don't see that
changing or not changing the COPY syntax will make a major difference to
dump backward compatibility.

Right, usually the schema changes are not backward compatibible, but the
COPY commands are. But now that I look at this example:

COPY test (x) FROM stdin;
1
\.

The column name "(x)" actually broke backward compatibility when we
added it, so yea, we could add a new option now too. No one complained
when we added the column names, so another option would be fine.

I suppose no one would like adding an option to turn off locking during
COPY, so the non-WAL logging would become the default? (Just asking.
You know me, I like automatic.)

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#6)
Re: [PATCHES] WAL bypass for CTAS

Bruce Momjian <pgman@candle.pha.pa.us> writes:

I suppose no one would like adding an option to turn off locking during
COPY, so the non-WAL logging would become the default?

When and if we add LOCK or some equivalent option to COPY, I'm sure
we'll change pg_dump to specify that option in its output. But trying
to get that behavior by default for existing dumps seems to me to be
far more dangerous than it's worth. Not every performance improvement
has to automatically apply to existing dumps...

regards, tom lane

#8Russell Smith
mr-russ@pws.com.au
In reply to: Neil Conway (#5)
Re: WAL bypass for CTAS

On Sun, 5 Jun 2005 10:29 am, Neil Conway wrote:

Bruce Momjian wrote:

Well, it isn't going to help us for 8.1 because 8.0 will not have it,
and if we add the clause we make loading the data into previous releases
harder.

pg_dump output in general is not compatible with prior releases. It
would be a nice feature to have, but until we have it, I don't see that
changing or not changing the COPY syntax will make a major difference to
dump backward compatibility.

Don't we usually suggest using the new pg_dump to dump the old database anyway?

If that's the case, then we just add the locking options in there. Otherwise, yes you are
stuck with the original locking mechanism. But if people are smart and want faster loading
they will play with sed and friends to make it work.

Even if people for 8.1 just get the supposed 500% speed increase because of a better parser,
lots of people will be happy.

Regards

Russell Smith

#9Simon Riggs
simon@2ndQuadrant.com
In reply to: Neil Conway (#3)
Re: WAL bypass for CTAS

On Sun, 2005-06-05 at 10:20 +1000, Neil Conway wrote:

Bruce Momjian wrote:

Could we do your NOLOGGING automatically in COPY if we test to see if
anyone else is connected to our current database?

Remember that this patch doe NOT yet handle COPY, but that is planned...

That seems pretty fragile -- what happens if someone connects after the
COPY has started? Considering that many COPY operations can take many
minutes or hours, I don't think it is wise to make assumptions based on
the initial state of the system.

Agreed.

Best Regards, Simon Riggs

#10Bruce Momjian
bruce@momjian.us
In reply to: Simon Riggs (#1)
Re: WAL bypass for CTAS

Tom has applied this patch. Thanks.

---------------------------------------------------------------------------

Simon Riggs wrote:

I enclose a complete patch for avoiding WAL usage for CREATE TABLE AS
SELECT, when not in archive mode (PITR). The main use case for this is
large BI environments that create summary tables or prejoined tables,
though there are many general applications.

There is no user interface for this. The speed gain is automatic, when
archiving is not enabled.

This contains all the lower level machinery required to do the same
thing for COPY, as discussed on hackers. The machinery includes some
additional freespace thinkery, aimed mainly at the forthcoming COPY
patch, which solely needs to be integrated with Alon's work.

Patch is diff -c format, compiles and make checks on cvstip as of now.

No performance tests *on this patch*, though the general principle has
already been proven via a similar prototype patch not published on list.

Best Regards, Simon Riggs

[ Attachment, skipping... ]

---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to majordomo@postgresql.org)

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#1)
Re: WAL bypass for CTAS

Simon Riggs <simon@2ndquadrant.com> writes:

I enclose a complete patch for avoiding WAL usage for CREATE TABLE AS
SELECT, when not in archive mode (PITR). The main use case for this is
large BI environments that create summary tables or prejoined tables,
though there are many general applications.

Applied after heavy corrections --- there were a number of things wrong
with this "simple" patch, starting with having gotten the tests
backwards :-(, and extending to not having actually flushed the data
before commit (smgrimmedsync isn't enough, you have to
FlushRelationBuffers).

A consideration we had all missed in the original discussions is that
if the transaction doesn't emit any WAL records at all,
RecordTransactionCommit will think that it need not WAL-log the
transaction commit, leading to the possibility that the commit is lost
even though all the data is preserved :-(

This is not a hazard for CREATE TABLE AS, since it will certainly have
emitted WAL records while creating the table's catalog entries. It will
be a very real hazard for COPY however. The cleanest solution I can
think of is that the COPY code should emit a WAL record for the first
tuple copied in, but not for later ones. To this end, I separated the
"use_wal" and "use_fsm" aspects of what the patch was doing.

I didn't apply the freespace.c changes either; that struck me as a
serious kluge with no real benefit. We can just omit updating the FSM's
running average, if it even has one. (ISTM there's a reasonable
argument to be made that the tuple sizes during CREATE/COPY might not be
representative of later requests anyway.)

Patch as applied is attached.

regards, tom lane

#12Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#11)
Re: WAL bypass for CTAS

On Mon, 2005-06-20 at 14:50 -0400, Tom Lane wrote:

Simon Riggs <simon@2ndquadrant.com> writes:

I enclose a complete patch for avoiding WAL usage for CREATE TABLE AS
SELECT, when not in archive mode (PITR). The main use case for this is
large BI environments that create summary tables or prejoined tables,
though there are many general applications.

Applied

Thanks

after heavy corrections --- there were a number of things wrong
with this "simple" patch, starting with having gotten the tests
backwards :-(

Sorry, I thought I had corrected that error before submission. I was
aware that I had made that error earlier.

and extending to not having actually flushed the data
before commit (smgrimmedsync isn't enough, you have to
FlushRelationBuffers).

I followed the logic as seen in nbtsort.c as you suggested. That code
doesn't perform a FlushRelationBuffers and it looks like I fooled myself
into thinking the CTAS/SELECT INTO case was also in local.

Perhaps we should be building CTAS/SELECT INTO in local buffers anyway?
It looks like we could save time by avoiding shared_buffers completely
and build up a whole page before writing it anywhere. (But thats a story
for another day).

Perhaps this is also related to metapage errors, since the metapage is
always the last page to be written?

A consideration we had all missed in the original discussions is that
if the transaction doesn't emit any WAL records at all,
RecordTransactionCommit will think that it need not WAL-log the
transaction commit, leading to the possibility that the commit is lost
even though all the data is preserved :-(

This is not a hazard for CREATE TABLE AS, since it will certainly have
emitted WAL records while creating the table's catalog entries. It will
be a very real hazard for COPY however.

OK, but I haven't written that patch yet!

The cleanest solution I can
think of is that the COPY code should emit a WAL record for the first
tuple copied in, but not for later ones. To this end, I separated the
"use_wal" and "use_fsm" aspects of what the patch was doing.

Not very clean, but will do as you suggest.

I didn't apply the freespace.c changes either; that struck me as a
serious kluge with no real benefit. We can just omit updating the FSM's
running average, if it even has one. (ISTM there's a reasonable
argument to be made that the tuple sizes during CREATE/COPY might not be
representative of later requests anyway.)

I was striving for completeness only. I was doubtful about that part of
the patch, but thought I'd add that rather than have you say I hadn't
thought about the FSM avg_request_size.

I put those changes in mainly for COPY. If you don't make any request at
all to FSM then a relation never gets to the MRU relation FSM list. I
agree that it is not strictly necessary, but leaving it off would be a
change in behaviour, since COPY did previously cause the relation to get
to the MRU. That could be a problem, since a relation might not then be
allocated any FSM pages following a vacuum.

Best Regards, Simon Riggs

#13Alvaro Herrera
alvherre@surnet.cl
In reply to: Simon Riggs (#12)
Re: WAL bypass for CTAS

On Mon, Jun 20, 2005 at 09:55:12PM +0100, Simon Riggs wrote:

I put those changes in mainly for COPY. If you don't make any request at
all to FSM then a relation never gets to the MRU relation FSM list. I
agree that it is not strictly necessary, but leaving it off would be a
change in behaviour, since COPY did previously cause the relation to get
to the MRU. That could be a problem, since a relation might not then be
allocated any FSM pages following a vacuum.

Is that a problem? If the pages don't fit in FSM, then maybe the system
is misconfigured anyway. The person running the DW should just increase
the FSM settings, which is hardly a costly thing because it uses so
little memory.

--
Alvaro Herrera (<alvherre[a]surnet.cl>)
"No renuncies a nada. No te aferres a nada."

#14Simon Riggs
simon@2ndQuadrant.com
In reply to: Alvaro Herrera (#13)
Re: WAL bypass for CTAS

On Mon, 2005-06-20 at 17:09 -0400, Alvaro Herrera wrote:

On Mon, Jun 20, 2005 at 09:55:12PM +0100, Simon Riggs wrote:

I put those changes in mainly for COPY. If you don't make any request at
all to FSM then a relation never gets to the MRU relation FSM list. I
agree that it is not strictly necessary, but leaving it off would be a
change in behaviour, since COPY did previously cause the relation to get
to the MRU. That could be a problem, since a relation might not then be
allocated any FSM pages following a vacuum.

Is that a problem?

Not for me, but I wanted to explain the change in behaviour that
implies.

If the pages don't fit in FSM, then maybe the system
is misconfigured anyway. The person running the DW should just increase
the FSM settings, which is hardly a costly thing because it uses so
little memory.

If you aren't on the relation list you don't get any more pages than the
minimum. No matter how many fsm_pages you allocate. If fsm_pages covers
everything, then you are right, there is no problem.

Best Regards, Simon Riggs