PostgreSQL data loads - turn off WAL

Started by hartrcover 13 years ago4 messagesgeneral
Jump to latest
#1hartrc
rhart2@mt.gov

My version: PostgreSQL v9.1.5
Version string: "PostgreSQL 9.1.5 on x86_64-unknown-linux-gnu, compiled by
gcc (SUSE Linux) 4.3.4 [gcc-4_3-branch revision 152973], 64-bit"

Basically my question is:
Is there currently any way to avoid wal generation during data load for
given tables and then have point in time recovery after that?

Background and blurb
The situation i'm referring to here is for a development environment. I
require point in time recovery because if there is crash etc I don't want to
lose up to a days work for 12 developers. I'm fairly new to PostgreSQL so
please forgive any gaps in my knowledge.

A developer did a data load yesterday of approximately 5GB of data into a
new schema. This generated approximately 7GB of wal. The situation arises
where if something is incorrect in the data load the data load may need to
be repeated 2 or 3 times (thus generating 20GB +of WAL). For a data load i
don't want wal to be generated. I accept the fact before there was nothing
and from the point of the next pg_basebackup there was everything. It is
from the point i say ok that is everything (the next backup) that i want
point in time recovery to apply to that table.
It is doesn't seem practical, and appears very risky to turn off wal_archive
during the data load.

I'd appreciate your thoughts and suggestions,
Thanks,
Rob

--
View this message in context: http://postgresql.1045698.n5.nabble.com/PostgreSQL-data-loads-turn-off-WAL-tp5725374.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.

#2Jayadevan M
Jayadevan.Maymala@ibsplc.com
In reply to: hartrc (#1)
Re: PostgreSQL data loads - turn off WAL

Hi,

Basically my question is:
Is there currently any way to avoid wal generation during data load for given
tables and then have point in time recovery after that?

Please have a look at unlogged and temporary options here -
http://www.postgresql.org/docs/9.1/static/sql-createtable.html
I don't think they are crash safe and point in time recovery may not be possible for these tables.
If this is something similar to a daily load in a data warehouse, you could consider using temporary tables for all the processing/aggregation and then move data to the target tables (real, logged tables). This url might also help -
http://www.postgresql.org/docs/9.1/static/populate.html
Regards,
Jayadevan

DISCLAIMER: "The information in this e-mail and any attachment is intended only for the person to whom it is addressed and may contain confidential and/or privileged material. If you have received this e-mail in error, kindly contact the sender and destroy all copies of the original communication. IBS makes no warranty, express or implied, nor guarantees the accuracy, adequacy or completeness of the information contained in this email or any attachment and is not liable for any errors, defects, omissions, viruses or for resultant loss or damage, if any, direct or indirect."

#3Laurenz Albe
laurenz.albe@cybertec.at
In reply to: hartrc (#1)
Re: PostgreSQL data loads - turn off WAL

hartrc wrote:

Basically my question is:
Is there currently any way to avoid wal generation during data load

for

given tables and then have point in time recovery after that?

Background and blurb
The situation i'm referring to here is for a development environment.

I

require point in time recovery because if there is crash etc I don't

want to

lose up to a days work for 12 developers. I'm fairly new to PostgreSQL

so

please forgive any gaps in my knowledge.

A developer did a data load yesterday of approximately 5GB of data

into a

new schema. This generated approximately 7GB of wal. The situation

arises

where if something is incorrect in the data load the data load may

need to

be repeated 2 or 3 times (thus generating 20GB +of WAL). For a data

load i

don't want wal to be generated. I accept the fact before there was

nothing

and from the point of the next pg_basebackup there was everything. It

is

from the point i say ok that is everything (the next backup) that i

want

point in time recovery to apply to that table.
It is doesn't seem practical, and appears very risky to turn off

wal_archive

during the data load.

But that's exactly what I'd recommend.

Set archive_mode=off, restart the server, load your data,
set archive_mode=on, restart the server, perform a backup.

If something goes wrong during data load, restore the
previous backup and PITR to the end.

Turning off WAL archiving is no data corruption risk.
The server will still be able to recover from crashes.

Yours,
Laurenz Albe

#4Jeff Janes
jeff.janes@gmail.com
In reply to: hartrc (#1)
Re: PostgreSQL data loads - turn off WAL

On Tue, Sep 25, 2012 at 9:09 AM, hartrc <rhart2@mt.gov> wrote:

My version: PostgreSQL v9.1.5
Version string: "PostgreSQL 9.1.5 on x86_64-unknown-linux-gnu, compiled by
gcc (SUSE Linux) 4.3.4 [gcc-4_3-branch revision 152973], 64-bit"

Basically my question is:
Is there currently any way to avoid wal generation during data load for
given tables and then have point in time recovery after that?

Background and blurb
The situation i'm referring to here is for a development environment. I
require point in time recovery because if there is crash etc I don't want to
lose up to a days work for 12 developers.

What is it that is being developed? If you are developing
applications that use postgres for storage, then with a crash of that
database you should not lose the developers' work, unless you are
using the same cluster as both the source-code repository and the
database to which the test applications being developed connect (which
seems like a bad idea)

Or are you not documenting your database schema and code anywhere
other than in the database itself? That too seems like a bad idea, at
least for a busy multi-person development shop.

A developer did a data load yesterday of approximately 5GB of data into a
new schema. This generated approximately 7GB of wal. The situation arises
where if something is incorrect in the data load the data load may need to
be repeated 2 or 3 times (thus generating 20GB +of WAL).

Is the problem the IO generated by this, or the short-term storage, or
the long term storage of it? If it is the storage, you could do a
specially-scheduled backup as soon as the load is done, and then
delete the WAL soon. Are you compressing your WAL? If not, then
doing that might be enough to solve the problem.

It sounds like the main thing your group "produces" is code
development, and from what you describe the database is a "production"
database since losing it loses your product. If that is the case, you
really need a "development" database as well. This could run in
archive_mode=off, and then the 5GB would not go to "production" until
they are fairly confident it will work and not have to be repeated.
So that would cut out 2/3 of the WAL.

Or, they could create the table as "unlogged" until they know the bulk
load works, then recreate it as logged and repeat the load. (I think
that currently there is no way to make an unlogged table then promote
it to logged, that would probably be ideal for you, but someday....)

For a data load i
don't want wal to be generated. I accept the fact before there was nothing
and from the point of the next pg_basebackup there was everything. It is
from the point i say ok that is everything (the next backup) that i want
point in time recovery to apply to that table.
It is doesn't seem practical, and appears very risky to turn off wal_archive
during the data load.

How long does the bulk load plus one backup plus two database restarts
take? That would probably be far less than one day, so the risk of
turning off archive_mode temporarily seems like it should be minimal.

If your hard drives were to die horribly, how long would it take you
replace them and restore the database from the last backup and roll it
forward? That could calibrate how much past work you are willing to
lose in the event it comes to that.

Cheers,

Jeff