Recovery Assistance

Started by Brian Millsover 9 years ago14 messagesgeneral
Jump to latest
#1Brian Mills
brian@trybooking.com

Hi,

I have a Atlassian Confluence Wiki that depends on postgres, but I haven't
much experience with postgres other than for this purpose.

A few days ago, the hard disk filled, so all services stopped working.
When the admin realised this he increased the disk size (its in a cloud, so
that was easy to do) however I think the way it shutdown left Postgres in
an inconsistent state for some reason.
Now when I start it up I get this message in the log over and over again:
"FATAL: the database system is starting up"

I have a backup, which I have successfully recovered, but it is 24 hours
old, the next backup was the cause of the disk filling. So I'm using this
as exercise in learning a bit more about postgres.

I did some research and found a number of options. I took a file level
backup with the service not running then tried 2 things. I haven't found
much else on what to do though.

*Attempt 1 - Reset Log *

It sounded like this shouldn't be my first option (it wasn't) but it did
sound like what I needed to do.
I ran this command
./pg_resetxlog /var/lib/postgresql/9.3/main -f
It worked a treat, the database did startup ok.
However when I tried to dump the DB:
root@atlassian:/home/myuser# sudo -u postgres pg_dump confluencedb >
$now-confluencedb.sql
pg_dump: Dumping the contents of table "bodycontent" failed: PQgetResult()
failed.
pg_dump: Error message from server: ERROR: unexpected chunk size 104
(expected 1996) in chunk 3 of 22 for toast value 48862 in pg_toast_20028
pg_dump: The command was: COPY public.bodycontent (bodycontentid, body,
contentid, bodytypeid) TO stdout;
The dump failed, so I assume this did leave my database in an inconsistent
state.

*Attempt 2 - startup manually and let it try recovery*

I restored my file level backup and started again.
This time I tried to startup manually on the command line to see the output
(I'd done it as a service startup a number of times to nearly the same
effect too)

postgres@atlassian:/usr/lib/postgresql/9.3/bin$ ./pg_ctl -D
/etc/postgresql/9.3/main start
server starting
postgres@atlassian:/usr/lib/postgresql/9.3/bin$ 2017-01-27 20:36:08 AEDT
LOG: database system was interrupted while in recovery at 2017-01-27
20:13:26 AEDT
2017-01-27 20:36:08 AEDT HINT: This probably means that some data is
corrupted and you will have to use the last backup for recovery.
2017-01-27 20:36:08 AEDT LOG: database system was not properly shut down;
automatic recovery in progress
2017-01-27 20:36:08 AEDT LOG: redo starts at 5/528B4558
2017-01-27 20:36:18 AEDT LOG: redo done at 5/A3FFF9E8
2017-01-27 20:36:18 AEDT LOG: last completed transaction was at log time
2017-01-24 02:08:00.023064+11
2017-01-27 23:00:01 AEDT FATAL: the database system is starting up
2017-01-27 23:00:01 AEDT FATAL: the database system is starting up

I've left it that way for 12 hours, and its still not allowing connections.
I assume its doing some kind of consistency check?

Does anyone have any suggestions on what I should be doing to try and
restore this database?

- The amount of change is minimal in the DB after 6pm it should be
basically no change overnight.
- The log above seems to suggest it has completed a redo ok, is that right?
- The last completed transaction time 2017-01-24 02:08:00.023064+11 the log
mentions would be fine to use, so loosing even a few hours before that
would be more than adequate.

I'm just not clear what the database is doing now, or how I should be
trying to recover it.

Any help anyone can suggest would be great! I've given myself this weekend
to attempt to recover more than the last backup, after that I need to
restore the service for my team to use and suck up the lost last day of
updates.

Thanks,
Brian

#2Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Brian Mills (#1)
Re: Recovery Assistance

On 01/27/2017 01:31 PM, Brian Mills wrote:

Hi,

I have a Atlassian Confluence Wiki that depends on postgres, but I
haven't much experience with postgres other than for this purpose.

A few days ago, the hard disk filled, so all services stopped working.
When the admin realised this he increased the disk size (its in a cloud,
so that was easy to do) however I think the way it shutdown left
Postgres in an inconsistent state for some reason.
Now when I start it up I get this message in the log over and over again:
"FATAL: the database system is starting up"

I have a backup, which I have successfully recovered, but it is 24 hours
old, the next backup was the cause of the disk filling. So I'm using
this as exercise in learning a bit more about postgres.

I did some research and found a number of options. I took a file level
backup with the service not running then tried 2 things. I haven't found
much else on what to do though.

*Attempt 1 - Reset Log *

It sounded like this shouldn't be my first option (it wasn't) but it did
sound like what I needed to do.
I ran this command
./pg_resetxlog /var/lib/postgresql/9.3/main -f
It worked a treat, the database did startup ok.
However when I tried to dump the DB:
root@atlassian:/home/myuser# sudo -u postgres pg_dump confluencedb >
$now-confluencedb.sql
pg_dump: Dumping the contents of table "bodycontent" failed:
PQgetResult() failed.
pg_dump: Error message from server: ERROR: unexpected chunk size 104
(expected 1996) in chunk 3 of 22 for toast value 48862 in pg_toast_20028
pg_dump: The command was: COPY public.bodycontent (bodycontentid, body,
contentid, bodytypeid) TO stdout;
The dump failed, so I assume this did leave my database in an
inconsistent state.

*Attempt 2 - startup manually and let it try recovery*

I restored my file level backup and started again.
This time I tried to startup manually on the command line to see the
output (I'd done it as a service startup a number of times to nearly the
same effect too)

postgres@atlassian:/usr/lib/postgresql/9.3/bin$ ./pg_ctl -D
/etc/postgresql/9.3/main start
server starting
postgres@atlassian:/usr/lib/postgresql/9.3/bin$ 2017-01-27 20:36:08 AEDT
LOG: database system was interrupted while in recovery at 2017-01-27
20:13:26 AEDT
2017-01-27 20:36:08 AEDT HINT: This probably means that some data is
corrupted and you will have to use the last backup for recovery.
2017-01-27 20:36:08 AEDT LOG: database system was not properly shut
down; automatic recovery in progress
2017-01-27 20:36:08 AEDT LOG: redo starts at 5/528B4558
2017-01-27 20:36:18 AEDT LOG: redo done at 5/A3FFF9E8
2017-01-27 20:36:18 AEDT LOG: last completed transaction was at log
time 2017-01-24 02:08:00.023064+11
2017-01-27 23:00:01 AEDT FATAL: the database system is starting up
2017-01-27 23:00:01 AEDT FATAL: the database system is starting up

What does ps ax | grep post show?

Is the cluster set up to log to a file, if so what does it show?

Does the system log show anything relevant?

I've left it that way for 12 hours, and its still not allowing connections.
I assume its doing some kind of consistency check?

What does it say when you attempt a connection?

Does anyone have any suggestions on what I should be doing to try and
restore this database?

- The amount of change is minimal in the DB after 6pm it should be
basically no change overnight.
- The log above seems to suggest it has completed a redo ok, is that right?
- The last completed transaction time 2017-01-24 02:08:00.023064+11 the
log mentions would be fine to use, so loosing even a few hours before
that would be more than adequate.

I'm just not clear what the database is doing now, or how I should be
trying to recover it.

Any help anyone can suggest would be great! I've given myself this
weekend to attempt to recover more than the last backup, after that I
need to restore the service for my team to use and suck up the lost last
day of updates.

Thanks,
Brian

--
Adrian Klaver
adrian.klaver@aklaver.com

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#3Brian Mills
brian@trybooking.com
In reply to: Adrian Klaver (#2)
Re: Recovery Assistance

First of all, Thank you for your time to assist me learning. I really
appreciate it.

root# ps ax | grep post
1364 ? Ss 0:00 /usr/lib/postfix/master
5198 pts/3 S 0:00 su postgres
5221 pts/3 S 0:00 /usr/lib/postgresql/9.3/bin/postgres -D
/etc/postgresql/9.3/main
5222 ? Ss 0:10 postgres: startup process recovering
0000000100000005000000A3
7930 pts/4 S+ 0:00 grep --color=auto post

Its a single machine postgres database server, so I'm assuming there is no
cluster log. If there is one, where would I look for it?
The only log in /var/log/postgres is postgresql-9.3-main.log
which shows (tail):

2017-01-27 20:27:01 AEDT LOG: database system was shut down at 2017-01-27
20:26:29 AEDT
2017-01-27 20:27:01 AEDT LOG: MultiXact member wraparound protections are
now enabled
2017-01-27 20:27:01 AEDT LOG: autovacuum launcher started
2017-01-27 20:27:01 AEDT LOG: database system is ready to accept
connections
2017-01-27 20:27:02 AEDT LOG: incomplete startup packet
2017-01-27 20:28:54 AEDT ERROR: unexpected chunk size 104 (expected 1996)
in chunk 3 of 22 for toast value 48862 in pg_toast_20028
2017-01-27 20:28:54 AEDT STATEMENT: COPY public.bodycontent
(bodycontentid, body, contentid, bodytypeid) TO stdout;
2017-01-27 20:30:13 AEDT LOG: received fast shutdown request
2017-01-27 20:30:13 AEDT LOG: aborting any active transactions
2017-01-27 20:30:13 AEDT LOG: autovacuum launcher shutting down
2017-01-27 20:30:13 AEDT LOG: shutting down
2017-01-27 20:30:13 AEDT LOG: database system is shut down

I ran the screen utility so I could leave the db started using the pg_ctl
command. The later logs in that session have not progressed, its last entry
is still
2017-01-27 23:00:01 AEDT FATAL: the database system is starting up
Which is still later datetime than the /var/log/postgres... log.

Connection attempt shows the same log
postgres@atlassian:/home/myuser$ psql
psql: FATAL: the database system is starting up

Nothing in the syslog that seems connected.

*Brian Mills*
CTO

*Mob: *0410660003
*Melbourne* 03 9012 3460 or 03 8376 6327 *|* *Sydney* 02 8064 3600 *|*
*Brisbane* 07 3173 1570
Level 1 *|* 600 Chapel Street *|* South Yarra*|* VIC *|* 3141 *|*
Australia

<https://www.facebook.com/TryBooking/&gt; <https://twitter.com/trybooking&gt;
<https://www.linkedin.com/company/trybooking-com&gt;

On 28 January 2017 at 12:05, Adrian Klaver <adrian.klaver@aklaver.com>
wrote:

Show quoted text

On 01/27/2017 01:31 PM, Brian Mills wrote:

Hi,

I have a Atlassian Confluence Wiki that depends on postgres, but I
haven't much experience with postgres other than for this purpose.

A few days ago, the hard disk filled, so all services stopped working.
When the admin realised this he increased the disk size (its in a cloud,
so that was easy to do) however I think the way it shutdown left
Postgres in an inconsistent state for some reason.
Now when I start it up I get this message in the log over and over again:
"FATAL: the database system is starting up"

I have a backup, which I have successfully recovered, but it is 24 hours
old, the next backup was the cause of the disk filling. So I'm using
this as exercise in learning a bit more about postgres.

I did some research and found a number of options. I took a file level
backup with the service not running then tried 2 things. I haven't found
much else on what to do though.

*Attempt 1 - Reset Log *

It sounded like this shouldn't be my first option (it wasn't) but it did
sound like what I needed to do.
I ran this command
./pg_resetxlog /var/lib/postgresql/9.3/main -f
It worked a treat, the database did startup ok.
However when I tried to dump the DB:
root@atlassian:/home/myuser# sudo -u postgres pg_dump confluencedb >
$now-confluencedb.sql
pg_dump: Dumping the contents of table "bodycontent" failed:
PQgetResult() failed.
pg_dump: Error message from server: ERROR: unexpected chunk size 104
(expected 1996) in chunk 3 of 22 for toast value 48862 in pg_toast_20028
pg_dump: The command was: COPY public.bodycontent (bodycontentid, body,
contentid, bodytypeid) TO stdout;
The dump failed, so I assume this did leave my database in an
inconsistent state.

*Attempt 2 - startup manually and let it try recovery*

I restored my file level backup and started again.
This time I tried to startup manually on the command line to see the
output (I'd done it as a service startup a number of times to nearly the
same effect too)

postgres@atlassian:/usr/lib/postgresql/9.3/bin$ ./pg_ctl -D
/etc/postgresql/9.3/main start
server starting
postgres@atlassian:/usr/lib/postgresql/9.3/bin$ 2017-01-27 20:36:08 AEDT
LOG: database system was interrupted while in recovery at 2017-01-27
20:13:26 AEDT
2017-01-27 20:36:08 AEDT HINT: This probably means that some data is
corrupted and you will have to use the last backup for recovery.
2017-01-27 20:36:08 AEDT LOG: database system was not properly shut
down; automatic recovery in progress
2017-01-27 20:36:08 AEDT LOG: redo starts at 5/528B4558
2017-01-27 20:36:18 AEDT LOG: redo done at 5/A3FFF9E8
2017-01-27 20:36:18 AEDT LOG: last completed transaction was at log
time 2017-01-24 02:08:00.023064+11
2017-01-27 23:00:01 AEDT FATAL: the database system is starting up
2017-01-27 23:00:01 AEDT FATAL: the database system is starting up

What does ps ax | grep post show?

Is the cluster set up to log to a file, if so what does it show?

Does the system log show anything relevant?

I've left it that way for 12 hours, and its still not allowing
connections.
I assume its doing some kind of consistency check?

What does it say when you attempt a connection?

Does anyone have any suggestions on what I should be doing to try and
restore this database?

- The amount of change is minimal in the DB after 6pm it should be
basically no change overnight.
- The log above seems to suggest it has completed a redo ok, is that
right?
- The last completed transaction time 2017-01-24 02:08:00.023064+11 the
log mentions would be fine to use, so loosing even a few hours before
that would be more than adequate.

I'm just not clear what the database is doing now, or how I should be
trying to recover it.

Any help anyone can suggest would be great! I've given myself this
weekend to attempt to recover more than the last backup, after that I
need to restore the service for my team to use and suck up the lost last
day of updates.

Thanks,
Brian

--
Adrian Klaver
adrian.klaver@aklaver.com

#4Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Brian Mills (#3)
Re: Recovery Assistance

On 01/27/2017 05:40 PM, Brian Mills wrote:

First of all, Thank you for your time to assist me learning. I really
appreciate it.

root# ps ax | grep post
1364 ? Ss 0:00 /usr/lib/postfix/master
5198 pts/3 S 0:00 su postgres
5221 pts/3 S 0:00 /usr/lib/postgresql/9.3/bin/postgres -D
/etc/postgresql/9.3/main
5222 ? Ss 0:10 postgres: startup process recovering
0000000100000005000000A3
7930 pts/4 S+ 0:00 grep --color=auto post

So if you check back does the recovering XXXXXXXX part change?

If so Postgres is walking through the WAL files as it should.

Its a single machine postgres database server, so I'm assuming there is
no cluster log. If there is one, where would I look for it?
The only log in /var/log/postgres is postgresql-9.3-main.log

That would be it. A single Postgres instance has multiple databases in
it, by default it starts with template0, template1 and postgres
databases. Then add whatever databases you created and you have a
cluster of databases.

which shows (tail):

2017-01-27 20:27:01 AEDT LOG: database system was shut down at
2017-01-27 20:26:29 AEDT
2017-01-27 20:27:01 AEDT LOG: MultiXact member wraparound protections
are now enabled
2017-01-27 20:27:01 AEDT LOG: autovacuum launcher started
2017-01-27 20:27:01 AEDT LOG: database system is ready to accept
connections
2017-01-27 20:27:02 AEDT LOG: incomplete startup packet
2017-01-27 20:28:54 AEDT ERROR: unexpected chunk size 104 (expected
1996) in chunk 3 of 22 for toast value 48862 in pg_toast_20028
2017-01-27 20:28:54 AEDT STATEMENT: COPY public.bodycontent
(bodycontentid, body, contentid, bodytypeid) TO stdout;
2017-01-27 20:30:13 AEDT LOG: received fast shutdown request
2017-01-27 20:30:13 AEDT LOG: aborting any active transactions
2017-01-27 20:30:13 AEDT LOG: autovacuum launcher shutting down
2017-01-27 20:30:13 AEDT LOG: shutting down
2017-01-27 20:30:13 AEDT LOG: database system is shut down

That would be your Attempt 1 log.

I ran the screen utility so I could leave the db started using
the pg_ctl command. The later logs in that session have not progressed,
its last entry is still
2017-01-27 23:00:01 AEDT FATAL: the database system is starting up
Which is still later datetime than the /var/log/postgres... log.

So it is just logging to stdout and not to the log file.

Connection attempt shows the same log
postgres@atlassian:/home/myuser$ psql
psql: FATAL: the database system is starting up

Nothing in the syslog that seems connected.

*Brian Mills*
CTO

*Mob: *0410660003 <tel:0410660003>
*Melbourne* 03 9012 3460 <tel:03%209012%203460> or 03 8376 6327
<tel:03%208376%206327> *|* * **Sydney* 02 8064 3600
<tel:02%208064%203600> *|* *Brisbane* 07 3173 1570 <tel:07%203173%201570>
Level 1 *|* 600 Chapel Street *|* South
Yarra*|* VIC *|* 3141 *|* Australia

<https://www.facebook.com/TryBooking/&gt; <https://twitter.com/trybooking&gt; <https://www.linkedin.com/company/trybooking-com&gt;

On 28 January 2017 at 12:05, Adrian Klaver <adrian.klaver@aklaver.com
<mailto:adrian.klaver@aklaver.com>> wrote:

On 01/27/2017 01:31 PM, Brian Mills wrote:

Hi,

I have a Atlassian Confluence Wiki that depends on postgres, but I
haven't much experience with postgres other than for this purpose.

A few days ago, the hard disk filled, so all services stopped
working.
When the admin realised this he increased the disk size (its in
a cloud,
so that was easy to do) however I think the way it shutdown left
Postgres in an inconsistent state for some reason.
Now when I start it up I get this message in the log over and
over again:
"FATAL: the database system is starting up"

I have a backup, which I have successfully recovered, but it is
24 hours
old, the next backup was the cause of the disk filling. So I'm using
this as exercise in learning a bit more about postgres.

I did some research and found a number of options. I took a file
level
backup with the service not running then tried 2 things. I
haven't found
much else on what to do though.

*Attempt 1 - Reset Log *

It sounded like this shouldn't be my first option (it wasn't)
but it did
sound like what I needed to do.
I ran this command
./pg_resetxlog /var/lib/postgresql/9.3/main -f
It worked a treat, the database did startup ok.
However when I tried to dump the DB:
root@atlassian:/home/myuser# sudo -u postgres pg_dump confluencedb >
$now-confluencedb.sql
pg_dump: Dumping the contents of table "bodycontent" failed:
PQgetResult() failed.
pg_dump: Error message from server: ERROR: unexpected chunk
size 104
(expected 1996) in chunk 3 of 22 for toast value 48862 in
pg_toast_20028
pg_dump: The command was: COPY public.bodycontent
(bodycontentid, body,
contentid, bodytypeid) TO stdout;
The dump failed, so I assume this did leave my database in an
inconsistent state.

*Attempt 2 - startup manually and let it try recovery*

I restored my file level backup and started again.
This time I tried to startup manually on the command line to see the
output (I'd done it as a service startup a number of times to
nearly the
same effect too)

postgres@atlassian:/usr/lib/postgresql/9.3/bin$ ./pg_ctl -D
/etc/postgresql/9.3/main start
server starting
postgres@atlassian:/usr/lib/postgresql/9.3/bin$ 2017-01-27
20:36:08 AEDT
LOG: database system was interrupted while in recovery at
2017-01-27
20:13:26 AEDT
2017-01-27 20:36:08 AEDT HINT: This probably means that some
data is
corrupted and you will have to use the last backup for recovery.
2017-01-27 20:36:08 AEDT LOG: database system was not properly shut
down; automatic recovery in progress
2017-01-27 20:36:08 AEDT LOG: redo starts at 5/528B4558
2017-01-27 20:36:18 AEDT LOG: redo done at 5/A3FFF9E8
2017-01-27 20:36:18 AEDT LOG: last completed transaction was at log
time 2017-01-24 02:08:00.023064+11
2017-01-27 23:00:01 AEDT FATAL: the database system is starting up
2017-01-27 23:00:01 AEDT FATAL: the database system is starting up

What does ps ax | grep post show?

Is the cluster set up to log to a file, if so what does it show?

Does the system log show anything relevant?

I've left it that way for 12 hours, and its still not allowing
connections.
I assume its doing some kind of consistency check?

What does it say when you attempt a connection?

Does anyone have any suggestions on what I should be doing to
try and
restore this database?

- The amount of change is minimal in the DB after 6pm it should be
basically no change overnight.
- The log above seems to suggest it has completed a redo ok, is
that right?
- The last completed transaction time 2017-01-24
02:08:00.023064+11 the
log mentions would be fine to use, so loosing even a few hours
before
that would be more than adequate.

I'm just not clear what the database is doing now, or how I
should be
trying to recover it.

Any help anyone can suggest would be great! I've given myself this
weekend to attempt to recover more than the last backup, after
that I
need to restore the service for my team to use and suck up the
lost last
day of updates.

Thanks,
Brian

--
Adrian Klaver
adrian.klaver@aklaver.com <mailto:adrian.klaver@aklaver.com>

--
Adrian Klaver
adrian.klaver@aklaver.com

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#5Brian Mills
brian@trybooking.com
In reply to: Adrian Klaver (#4)
Re: Recovery Assistance

Hi,

No, it hasn't changed since the first time I looked at it.

root@atlassian:/home/tbadmin# ps ax | grep post
1364 ? Ss 0:00 /usr/lib/postfix/master
5198 pts/3 S 0:00 su postgres
5221 pts/3 S 0:00 /usr/lib/postgresql/9.3/bin/postgres -D
/etc/postgresql/9.3/main
5222 ? Ss 0:10 postgres: startup process recovering
0000000100000005000000A3
11161 pts/4 S+ 0:00 grep --color=auto post

The /var/log/postgres logs haven't got anything new in them.
Also the screen session hasn't got anything new in them, except it adds a
line every time I attempt to connect to it using psql. The last attempt
below was my connection attempt for this email, a bit over 8 hours after my
last connection attempt.

Screen session:
2017-01-28 12:38:35 AEDT FATAL: the database system is starting up
2017-01-28 23:00:01 AEDT FATAL: the database system is starting up
2017-01-28 23:00:01 AEDT FATAL: the database system is starting up
2017-01-29 07:14:00 AEDT FATAL: the database system is starting up

I also still can't connect.
postgres@atlassian:/home/tbadmin$ psql
psql: FATAL: the database system is starting up

*Brian Mills*
CTO

*Mob: *0410660003
*Melbourne* 03 9012 3460 or 03 8376 6327 *|* *Sydney* 02 8064 3600 *|*
*Brisbane* 07 3173 1570
Level 1 *|* 600 Chapel Street *|* South Yarra*|* VIC *|* 3141 *|*
Australia

<https://www.facebook.com/TryBooking/&gt; <https://twitter.com/trybooking&gt;
<https://www.linkedin.com/company/trybooking-com&gt;

On 28 January 2017 at 16:08, Adrian Klaver <adrian.klaver@aklaver.com>
wrote:

Show quoted text

On 01/27/2017 05:40 PM, Brian Mills wrote:

First of all, Thank you for your time to assist me learning. I really
appreciate it.

root# ps ax | grep post
1364 ? Ss 0:00 /usr/lib/postfix/master
5198 pts/3 S 0:00 su postgres
5221 pts/3 S 0:00 /usr/lib/postgresql/9.3/bin/postgres -D
/etc/postgresql/9.3/main
5222 ? Ss 0:10 postgres: startup process recovering
0000000100000005000000A3
7930 pts/4 S+ 0:00 grep --color=auto post

So if you check back does the recovering XXXXXXXX part change?

If so Postgres is walking through the WAL files as it should.

Its a single machine postgres database server, so I'm assuming there is
no cluster log. If there is one, where would I look for it?
The only log in /var/log/postgres is postgresql-9.3-main.log

That would be it. A single Postgres instance has multiple databases in it,
by default it starts with template0, template1 and postgres databases. Then
add whatever databases you created and you have a cluster of databases.

which shows (tail):

2017-01-27 20:27:01 AEDT LOG: database system was shut down at
2017-01-27 20:26:29 AEDT
2017-01-27 20:27:01 AEDT LOG: MultiXact member wraparound protections
are now enabled
2017-01-27 20:27:01 AEDT LOG: autovacuum launcher started
2017-01-27 20:27:01 AEDT LOG: database system is ready to accept
connections
2017-01-27 20:27:02 AEDT LOG: incomplete startup packet
2017-01-27 20:28:54 AEDT ERROR: unexpected chunk size 104 (expected
1996) in chunk 3 of 22 for toast value 48862 in pg_toast_20028
2017-01-27 20:28:54 AEDT STATEMENT: COPY public.bodycontent
(bodycontentid, body, contentid, bodytypeid) TO stdout;
2017-01-27 20:30:13 AEDT LOG: received fast shutdown request
2017-01-27 20:30:13 AEDT LOG: aborting any active transactions
2017-01-27 20:30:13 AEDT LOG: autovacuum launcher shutting down
2017-01-27 20:30:13 AEDT LOG: shutting down
2017-01-27 20:30:13 AEDT LOG: database system is shut down

That would be your Attempt 1 log.

I ran the screen utility so I could leave the db started using
the pg_ctl command. The later logs in that session have not progressed,
its last entry is still
2017-01-27 23:00:01 AEDT FATAL: the database system is starting up
Which is still later datetime than the /var/log/postgres... log.

So it is just logging to stdout and not to the log file.

Connection attempt shows the same log
postgres@atlassian:/home/myuser$ psql
psql: FATAL: the database system is starting up

Nothing in the syslog that seems connected.

*Brian Mills*
CTO

*Mob: *0410660003 <tel:0410660003>
*Melbourne* 03 9012 3460 <tel:03%209012%203460> or 03 8376 6327
<tel:03%208376%206327> *|* * **Sydney* 02 8064 3600
<tel:02%208064%203600> *|* *Brisbane* 07 3173 1570
<tel:07%203173%201570>
Level 1 *|* 600 Chapel Street *|* South
Yarra*|* VIC *|* 3141 *|* Australia

<https://www.facebook.com/TryBooking/&gt; <https://twitter.com/trybooking&gt; <
https://www.linkedin.com/company/trybooking-com&gt;

On 28 January 2017 at 12:05, Adrian Klaver <adrian.klaver@aklaver.com
<mailto:adrian.klaver@aklaver.com>> wrote:

On 01/27/2017 01:31 PM, Brian Mills wrote:

Hi,

I have a Atlassian Confluence Wiki that depends on postgres, but I
haven't much experience with postgres other than for this purpose.

A few days ago, the hard disk filled, so all services stopped
working.
When the admin realised this he increased the disk size (its in
a cloud,
so that was easy to do) however I think the way it shutdown left
Postgres in an inconsistent state for some reason.
Now when I start it up I get this message in the log over and
over again:
"FATAL: the database system is starting up"

I have a backup, which I have successfully recovered, but it is
24 hours
old, the next backup was the cause of the disk filling. So I'm
using
this as exercise in learning a bit more about postgres.

I did some research and found a number of options. I took a file
level
backup with the service not running then tried 2 things. I
haven't found
much else on what to do though.

*Attempt 1 - Reset Log *

It sounded like this shouldn't be my first option (it wasn't)
but it did
sound like what I needed to do.
I ran this command
./pg_resetxlog /var/lib/postgresql/9.3/main -f
It worked a treat, the database did startup ok.
However when I tried to dump the DB:
root@atlassian:/home/myuser# sudo -u postgres pg_dump
confluencedb >
$now-confluencedb.sql
pg_dump: Dumping the contents of table "bodycontent" failed:
PQgetResult() failed.
pg_dump: Error message from server: ERROR: unexpected chunk
size 104
(expected 1996) in chunk 3 of 22 for toast value 48862 in
pg_toast_20028
pg_dump: The command was: COPY public.bodycontent
(bodycontentid, body,
contentid, bodytypeid) TO stdout;
The dump failed, so I assume this did leave my database in an
inconsistent state.

*Attempt 2 - startup manually and let it try recovery*

I restored my file level backup and started again.
This time I tried to startup manually on the command line to see
the
output (I'd done it as a service startup a number of times to
nearly the
same effect too)

postgres@atlassian:/usr/lib/postgresql/9.3/bin$ ./pg_ctl -D
/etc/postgresql/9.3/main start
server starting
postgres@atlassian:/usr/lib/postgresql/9.3/bin$ 2017-01-27
20:36:08 AEDT
LOG: database system was interrupted while in recovery at
2017-01-27
20:13:26 AEDT
2017-01-27 20:36:08 AEDT HINT: This probably means that some
data is
corrupted and you will have to use the last backup for recovery.
2017-01-27 20:36:08 AEDT LOG: database system was not properly
shut
down; automatic recovery in progress
2017-01-27 20:36:08 AEDT LOG: redo starts at 5/528B4558
2017-01-27 20:36:18 AEDT LOG: redo done at 5/A3FFF9E8
2017-01-27 20:36:18 AEDT LOG: last completed transaction was at
log
time 2017-01-24 02:08:00.023064+11
2017-01-27 23:00:01 AEDT FATAL: the database system is starting
up
2017-01-27 23:00:01 AEDT FATAL: the database system is starting
up

What does ps ax | grep post show?

Is the cluster set up to log to a file, if so what does it show?

Does the system log show anything relevant?

I've left it that way for 12 hours, and its still not allowing
connections.
I assume its doing some kind of consistency check?

What does it say when you attempt a connection?

Does anyone have any suggestions on what I should be doing to
try and
restore this database?

- The amount of change is minimal in the DB after 6pm it should be
basically no change overnight.
- The log above seems to suggest it has completed a redo ok, is
that right?
- The last completed transaction time 2017-01-24
02:08:00.023064+11 the
log mentions would be fine to use, so loosing even a few hours
before
that would be more than adequate.

I'm just not clear what the database is doing now, or how I
should be
trying to recover it.

Any help anyone can suggest would be great! I've given myself this
weekend to attempt to recover more than the last backup, after
that I
need to restore the service for my team to use and suck up the
lost last
day of updates.

Thanks,
Brian

--
Adrian Klaver
adrian.klaver@aklaver.com <mailto:adrian.klaver@aklaver.com>

--
Adrian Klaver
adrian.klaver@aklaver.com

#6rob stone
floriparob@gmail.com
In reply to: Brian Mills (#5)
Re: Recovery Assistance

Hello Brian,
On Sun, 2017-01-29 at 07:16 +1100, Brian Mills wrote:

Hi,

No, it hasn't changed since the first time I looked at it. 

root@atlassian:/home/tbadmin# ps ax | grep post
 1364 ?        Ss     0:00 /usr/lib/postfix/master
 5198 pts/3    S      0:00 su postgres
 5221 pts/3    S      0:00 /usr/lib/postgresql/9.3/bin/postgres -D
/etc/postgresql/9.3/main
 5222 ?        Ss     0:10 postgres: startup process   recovering
0000000100000005000000A3
11161 pts/4    S+     0:00 grep --color=auto post

Does this WAL file exist "0000000100000005000000A3"?

Cheers,
Rob

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#7Brian Mills
brian@trybooking.com
In reply to: rob stone (#6)
Re: Recovery Assistance

Yes, its the last one in the directory, pg_xlog directory

...more files...
-rw------- 1 postgres postgres 16777216 Jan 21 10:05
0000000100000005000000A1
-rw------- 1 postgres postgres 16777216 Jan 22 21:29
0000000100000005000000A2
-rw------- 1 postgres postgres 16777216 Jan 24 02:08
0000000100000005000000A3

*Brian Mills*
CTO

*Mob: *0410660003
*Melbourne* 03 9012 3460 or 03 8376 6327 *|* *Sydney* 02 8064 3600 *|*
*Brisbane* 07 3173 1570
Level 1 *|* 600 Chapel Street *|* South Yarra*|* VIC *|* 3141 *|*
Australia

<https://www.facebook.com/TryBooking/&gt; <https://twitter.com/trybooking&gt;
<https://www.linkedin.com/company/trybooking-com&gt;

On 29 January 2017 at 08:18, rob stone <floriparob@gmail.com> wrote:

Show quoted text

Hello Brian,
On Sun, 2017-01-29 at 07:16 +1100, Brian Mills wrote:

Hi,

No, it hasn't changed since the first time I looked at it.

root@atlassian:/home/tbadmin# ps ax | grep post
1364 ? Ss 0:00 /usr/lib/postfix/master
5198 pts/3 S 0:00 su postgres
5221 pts/3 S 0:00 /usr/lib/postgresql/9.3/bin/postgres -D
/etc/postgresql/9.3/main
5222 ? Ss 0:10 postgres: startup process recovering
0000000100000005000000A3
11161 pts/4 S+ 0:00 grep --color=auto post

Does this WAL file exist "0000000100000005000000A3"?

Cheers,
Rob

#8Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Brian Mills (#7)
Re: Recovery Assistance

On 01/28/2017 01:55 PM, Brian Mills wrote:

Yes, its the last one in the directory, pg_xlog directory

...more files...
-rw------- 1 postgres postgres 16777216 Jan 21 10:05
0000000100000005000000A1
-rw------- 1 postgres postgres 16777216 Jan 22 21:29
0000000100000005000000A2
-rw------- 1 postgres postgres 16777216 Jan 24 02:08
0000000100000005000000A3

Best guess is the last WAL is not complete.

From your original post:
"Attempt 2 - startup manually and let it try recovery

I restored my file level backup and started again. "

How was the file level backup done?

*Brian Mills*
CTO

*Mob: *0410660003 <tel:0410660003>
*Melbourne* 03 9012 3460 <tel:03%209012%203460> or 03 8376 6327
<tel:03%208376%206327> *|* * **Sydney* 02 8064 3600
<tel:02%208064%203600> *|* *Brisbane* 07 3173 1570 <tel:07%203173%201570>
Level 1 *|* 600 Chapel Street *|* South
Yarra*|* VIC *|* 3141 *|* Australia

<https://www.facebook.com/TryBooking/&gt; <https://twitter.com/trybooking&gt; <https://www.linkedin.com/company/trybooking-com&gt;

On 29 January 2017 at 08:18, rob stone <floriparob@gmail.com
<mailto:floriparob@gmail.com>> wrote:

Hello Brian,
On Sun, 2017-01-29 at 07:16 +1100, Brian Mills wrote:

Hi,

No, it hasn't changed since the first time I looked at it.

root@atlassian:/home/tbadmin# ps ax | grep post
1364 ? Ss 0:00 /usr/lib/postfix/master
5198 pts/3 S 0:00 su postgres
5221 pts/3 S 0:00 /usr/lib/postgresql/9.3/bin/postgres -D
/etc/postgresql/9.3/main
5222 ? Ss 0:10 postgres: startup process recovering
0000000100000005000000A3
11161 pts/4 S+ 0:00 grep --color=auto post

Does this WAL file exist "0000000100000005000000A3"?

Cheers,
Rob

--
Adrian Klaver
adrian.klaver@aklaver.com

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#9Brian Mills
brian@trybooking.com
In reply to: Adrian Klaver (#8)
Re: Recovery Assistance

I have a consistent sql dump from 24 hour previous.

The file level backup was done with rsync -a of full data directory after
the issue occurred so could reset as I learned.

Brian

On Sun, 29 Jan 2017 at 9:18 am, Adrian Klaver <adrian.klaver@aklaver.com>
wrote:

On 01/28/2017 01:55 PM, Brian Mills wrote:

Yes, its the last one in the directory, pg_xlog directory

...more files...
-rw------- 1 postgres postgres 16777216 Jan 21 10:05
0000000100000005000000A1
-rw------- 1 postgres postgres 16777216 Jan 22 21:29
0000000100000005000000A2
-rw------- 1 postgres postgres 16777216 Jan 24 02:08
0000000100000005000000A3

Best guess is the last WAL is not complete.

From your original post:
"Attempt 2 - startup manually and let it try recovery

I restored my file level backup and started again. "

How was the file level backup done?

*Brian Mills*
CTO

*Mob: *0410660003 <tel:0410660003>
*Melbourne* 03 9012 3460 <tel:03%209012%203460> or 03 8376 6327
<tel:03%208376%206327> *|* * **Sydney* 02 8064 3600
<tel:02%208064%203600> *|* *Brisbane* 07 3173 1570

<tel:07%203173%201570>

Level 1 *|* 600 Chapel Street *|* South
Yarra*|* VIC *|* 3141 *|* Australia

<https://www.facebook.com/TryBooking/&gt; <https://twitter.com/trybooking&gt;

<https://www.linkedin.com/company/trybooking-com&gt;

On 29 January 2017 at 08:18, rob stone <floriparob@gmail.com
<mailto:floriparob@gmail.com>> wrote:

Hello Brian,
On Sun, 2017-01-29 at 07:16 +1100, Brian Mills wrote:

Hi,

No, it hasn't changed since the first time I looked at it.

root@atlassian:/home/tbadmin# ps ax | grep post
1364 ? Ss 0:00 /usr/lib/postfix/master
5198 pts/3 S 0:00 su postgres
5221 pts/3 S 0:00 /usr/lib/postgresql/9.3/bin/postgres -D
/etc/postgresql/9.3/main
5222 ? Ss 0:10 postgres: startup process recovering
0000000100000005000000A3
11161 pts/4 S+ 0:00 grep --color=auto post

Does this WAL file exist "0000000100000005000000A3"?

Cheers,
Rob

--
Adrian Klaver
adrian.klaver@aklaver.com

--
*Brian Mills*
CTO

*Mob: *0410660003 <javascript:void(0);>
*Melbourne* 03 9012 3460 <javascript:void(0);> or 03 8376 6327
<javascript:void(0);> *|* *Sydney* 02 8064 3600 <javascript:void(0);> *|*
*Brisbane* 07 3173 1570 <javascript:void(0);>
Level 1 *|* 600 Chapel Street *|* South Yarra*|* VIC *|* 3141 *|*
Australia

<https://www.facebook.com/TryBooking/&gt; <https://twitter.com/trybooking&gt;
<https://www.linkedin.com/company/trybooking-com&gt;

#10Brian Mills
brian@trybooking.com
In reply to: Brian Mills (#9)
Re: Recovery Assistance

I presume this is a binary log file for the database.

Am I able to recover to a point in time using this log file?

What I would do in SQL Server would be recover to a point in time, say a
bit before the last completed transaction time the log mentions, then take
a backup. Is that possible in postgres?

The log mentions this:
2017-01-27 20:36:18 AEDT LOG: last completed transaction was at log time
2017-01-24 02:08:00.023064+11

(which is moments before, or possibly as the disk filled up doing a db
backup dump)

*Brian Mills*
CTO

*Mob: *0410660003
*Melbourne* 03 9012 3460 or 03 8376 6327 *|* *Sydney* 02 8064 3600 *|*
*Brisbane* 07 3173 1570
Level 1 *|* 600 Chapel Street *|* South Yarra*|* VIC *|* 3141 *|*
Australia

<https://www.facebook.com/TryBooking/&gt; <https://twitter.com/trybooking&gt;
<https://www.linkedin.com/company/trybooking-com&gt;

On 29 January 2017 at 12:58, Brian Mills <brian@trybooking.com> wrote:

Show quoted text

I have a consistent sql dump from 24 hour previous.

The file level backup was done with rsync -a of full data directory after
the issue occurred so could reset as I learned.

Brian

On Sun, 29 Jan 2017 at 9:18 am, Adrian Klaver <adrian.klaver@aklaver.com>
wrote:

On 01/28/2017 01:55 PM, Brian Mills wrote:

Yes, its the last one in the directory, pg_xlog directory

...more files...
-rw------- 1 postgres postgres 16777216 Jan 21 10:05
0000000100000005000000A1
-rw------- 1 postgres postgres 16777216 Jan 22 21:29
0000000100000005000000A2
-rw------- 1 postgres postgres 16777216 Jan 24 02:08
0000000100000005000000A3

Best guess is the last WAL is not complete.

From your original post:
"Attempt 2 - startup manually and let it try recovery

I restored my file level backup and started again. "

How was the file level backup done?

*Brian Mills*
CTO

*Mob: *0410660003 <0410%20660%20003> <tel:0410660003 <0410%20660%20003>

*Melbourne* 03 9012 3460 <(03)%209012%203460> <tel:03%209012%203460>

or 03 8376 6327 <(03)%208376%206327>

<tel:03%208376%206327> *|* * **Sydney* 02 8064 3600

<(02)%208064%203600>

<tel:02%208064%203600> *|* *Brisbane* 07 3173 1570

<(07)%203173%201570> <tel:07%203173%201570>

Level 1 *|* 600 Chapel Street *|* South
Yarra*|* VIC *|* 3141 *|* Australia

<https://www.facebook.com/TryBooking/&gt; <https://twitter.com/trybooking&gt;

<https://www.linkedin.com/company/trybooking-com&gt;

On 29 January 2017 at 08:18, rob stone <floriparob@gmail.com
<mailto:floriparob@gmail.com>> wrote:

Hello Brian,
On Sun, 2017-01-29 at 07:16 +1100, Brian Mills wrote:

Hi,

No, it hasn't changed since the first time I looked at it.

root@atlassian:/home/tbadmin# ps ax | grep post
1364 ? Ss 0:00 /usr/lib/postfix/master
5198 pts/3 S 0:00 su postgres
5221 pts/3 S 0:00 /usr/lib/postgresql/9.3/bin/postgres

-D

/etc/postgresql/9.3/main
5222 ? Ss 0:10 postgres: startup process recovering
0000000100000005000000A3
11161 pts/4 S+ 0:00 grep --color=auto post

Does this WAL file exist "0000000100000005000000A3"?

Cheers,
Rob

--
Adrian Klaver
adrian.klaver@aklaver.com

--
*Brian Mills*
CTO

*Mob: *0410660003
*Melbourne* 03 9012 3460 or 03 8376 6327 *|* *Sydney* 02 8064 3600 *|*
*Brisbane* 07 3173 1570
Level 1 *|* 600 Chapel Street *|* South Yarra*|* VIC *|* 3141 *|*
Australia

<https://www.facebook.com/TryBooking/&gt; <https://twitter.com/trybooking&gt;
<https://www.linkedin.com/company/trybooking-com&gt;

#11Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Brian Mills (#10)
Re: Recovery Assistance

On 01/28/2017 11:23 PM, Brian Mills wrote:

I presume this is a binary log file for the database.

Am I able to recover to a point in time using this log file?

What I would do in SQL Server would be recover to a point in time, say a
bit before the last completed transaction time the log mentions, then
take a backup. Is that possible in postgres?

Yes, though I am not sure you have the setup to do it. I would suggest
reading the below to see how much of it applies:

https://www.postgresql.org/docs/9.3/static/continuous-archiving.html

In particular:
24.3.4. Recovering Using a Continuous Archive Backup

https://www.postgresql.org/docs/9.3/static/recovery-target-settings.html

The log mentions this:
2017-01-27 20:36:18 AEDT LOG: last completed transaction was at log
time 2017-01-24 02:08:00.023064+11

(which is moments before, or possibly as the disk filled up doing a db
backup dump)

*Brian Mills*
CTO

*Mob: *0410660003 <tel:0410660003>
*Melbourne* 03 9012 3460 <tel:03%209012%203460> or 03 8376 6327
<tel:03%208376%206327> *|* * **Sydney* 02 8064 3600
<tel:02%208064%203600> *|* *Brisbane* 07 3173 1570 <tel:07%203173%201570>
Level 1 *|* 600 Chapel Street *|* South
Yarra*|* VIC *|* 3141 *|* Australia

<https://www.facebook.com/TryBooking/&gt; <https://twitter.com/trybooking&gt; <https://www.linkedin.com/company/trybooking-com&gt;

On 29 January 2017 at 12:58, Brian Mills <brian@trybooking.com
<mailto:brian@trybooking.com>> wrote:

I have a consistent sql dump from 24 hour previous.

The file level backup was done with rsync -a of full data directory
after the issue occurred so could reset as I learned.

Brian

On Sun, 29 Jan 2017 at 9:18 am, Adrian Klaver
<adrian.klaver@aklaver.com <mailto:adrian.klaver@aklaver.com>> wrote:

On 01/28/2017 01:55 PM, Brian Mills wrote:

Yes, its the last one in the directory, pg_xlog directory

...more files...
-rw------- 1 postgres postgres 16777216 Jan 21 10:05
0000000100000005000000A1
-rw------- 1 postgres postgres 16777216 Jan 22 21:29
0000000100000005000000A2
-rw------- 1 postgres postgres 16777216 Jan 24 02:08
0000000100000005000000A3

Best guess is the last WAL is not complete.

From your original post:
"Attempt 2 - startup manually and let it try recovery

I restored my file level backup and started again. "

How was the file level backup done?

*Brian Mills*
CTO

*Mob: *0410660003 <tel:0410%20660%20003> <tel:0410660003

<tel:0410%20660%20003>>

*Melbourne* 03 9012 3460 <tel:(03)%209012%203460>

<tel:03%209012%203460> or 03 8376 6327 <tel:(03)%208376%206327>

<tel:03%208376%206327> *|* * **Sydney* 02 8064 3600

<tel:(02)%208064%203600>

<tel:02%208064%203600> *|* *Brisbane* 07 3173 1570

<tel:(07)%203173%201570> <tel:07%203173%201570>

Level 1 *|* 600 Chapel Street *|* South
Yarra*|* VIC *|* 3141 *|* Australia

<https://www.facebook.com/TryBooking/

<https://www.facebook.com/TryBooking/&gt;&gt;
<https://twitter.com/trybooking
<https://twitter.com/trybooking&gt;&gt;
<https://www.linkedin.com/company/trybooking-com
<https://www.linkedin.com/company/trybooking-com&gt;&gt;

On 29 January 2017 at 08:18, rob stone <floriparob@gmail.com

<mailto:floriparob@gmail.com>

<mailto:floriparob@gmail.com <mailto:floriparob@gmail.com>>>

wrote:

Hello Brian,
On Sun, 2017-01-29 at 07:16 +1100, Brian Mills wrote:

Hi,

No, it hasn't changed since the first time I looked at it.

root@atlassian:/home/tbadmin# ps ax | grep post
1364 ? Ss 0:00 /usr/lib/postfix/master
5198 pts/3 S 0:00 su postgres
5221 pts/3 S 0:00

/usr/lib/postgresql/9.3/bin/postgres -D

/etc/postgresql/9.3/main
5222 ? Ss 0:10 postgres: startup process

recovering

0000000100000005000000A3
11161 pts/4 S+ 0:00 grep --color=auto post

Does this WAL file exist "0000000100000005000000A3"?

Cheers,
Rob

--
Adrian Klaver
adrian.klaver@aklaver.com <mailto:adrian.klaver@aklaver.com>

--
*Brian Mills*
CTO

*Mob: *0410660003
*Melbourne* 03 9012 3460 or 03 8376 6327 *|* * **Sydney* 02 8064
3600 *|* *Brisbane* 07 3173 1570
Level 1 *|* 600 Chapel Street *|* South
Yarra*|* VIC *|* 3141 *|* Australia

<https://www.facebook.com/TryBooking/&gt; <https://twitter.com/trybooking&gt; <https://www.linkedin.com/company/trybooking-com&gt;

--
Adrian Klaver
adrian.klaver@aklaver.com

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#12Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Brian Mills (#10)
Re: Recovery Assistance

On 01/28/2017 11:23 PM, Brian Mills wrote:

I presume this is a binary log file for the database.

Am I able to recover to a point in time using this log file?

What I would do in SQL Server would be recover to a point in time, say a
bit before the last completed transaction time the log mentions, then
take a backup. Is that possible in postgres?

Had another thought. If I remember correctly you are using this as an
exercise in Postgres recovery. If that is indeed the case you might try:

1) Stop the Postgres instance you have running now.

2) Move the WAL file that Postgres is currently stalled on,
0000000100000005000000A3, out of pg_xlog.

3) Restart the Postgres instance.

My guess it it will not bring it back to the exact point you want, but
close. You can get a general idea by running(before and after removing
the WAL), as the postgres user:

pg_controldata -D /etc/postgresql/9.3/main

The log mentions this:
2017-01-27 20:36:18 AEDT LOG: last completed transaction was at log
time 2017-01-24 02:08:00.023064+11

(which is moments before, or possibly as the disk filled up doing a db
backup dump)

*Brian Mills*
CTO

--
Adrian Klaver
adrian.klaver@aklaver.com

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#13Brian Mills
brian@trybooking.com
In reply to: Adrian Klaver (#12)
Re: Recovery Assistance

OK. I think I'm on to something here, I first reset back to my file level
backup.
I created a recovery.conf file in the root of the data directory like this:
---------------------
restore_command = 'cp /mnt/archive/%f %p'
recovery_target_time = '2017-01-24 02:08:00.023064+11'
recovery_target_inclusive = 'true'
pause_at_recovery_target = 'false'
---------------------
Note, the archive directory had no files in it, I left the WAL files in the
pg_xlog directory.

Then I started up the database again:
postgres@atlassian:~/9.3/main$ /usr/lib/postgresql/9.3/bin/pg_ctl -D
/etc/postgresql/9.3/main start
server starting
postgres@atlassian:~/9.3/main$ 2017-01-30 10:07:28 AEDT LOG: database
system was interrupted while in recovery at 2017-01-27 20:13:26 AEDT
2017-01-30 10:07:28 AEDT HINT: This probably means that some data is
corrupted and you will have to use the last backup for recovery.
2017-01-30 10:07:28 AEDT LOG: starting point-in-time recovery to
2017-01-24 02:08:00.023064+11
2017-01-30 10:07:28 AEDT LOG: database system was not properly shut down;
automatic recovery in progress
2017-01-30 10:07:28 AEDT WARNING: WAL was generated with
wal_level=minimal, data may be missing
2017-01-30 10:07:28 AEDT HINT: This happens if you temporarily set
wal_level=minimal without taking a new base backup.
2017-01-30 10:07:28 AEDT LOG: redo starts at 5/528B4558
2017-01-30 10:07:40 AEDT LOG: consistent recovery state reached at
5/A3FFFA30
cp: cannot stat ‘/mnt/archive/0000000100000005000000A3’: No such file or
directory
cp: cannot stat ‘/mnt/archive/0000000100000005000000A4’: No such file or
directory
2017-01-30 10:07:40 AEDT LOG: redo done at 5/A3FFF9E8
2017-01-30 10:07:40 AEDT LOG: last completed transaction was at log time
2017-01-24 02:08:00.023064+11
cp: cannot stat ‘/mnt/archive/0000000100000005000000A3’: No such file or
directory
cp: cannot stat ‘/mnt/archive/00000002.history’: No such file or directory
2017-01-30 10:07:40 AEDT LOG: selected new timeline ID: 2
cp: cannot stat ‘/mnt/archive/00000001.history’: No such file or directory
2017-01-30 10:07:40 AEDT LOG: archive recovery complete
2017-01-30 10:08:55 AEDT FATAL: the database system is starting up
2017-01-30 10:08:57 AEDT FATAL: the database system is starting up

This time it looks like it has actually finished the startup and recovery.
However I think I might have something wrong about the process.
Any thoughts on the above log?

*Brian Mills*
CTO

*Mob: *0410660003
*Melbourne* 03 9012 3460 or 03 8376 6327 *|* *Sydney* 02 8064 3600 *|*
*Brisbane* 07 3173 1570
Level 1 *|* 600 Chapel Street *|* South Yarra*|* VIC *|* 3141 *|*
Australia

<https://www.facebook.com/TryBooking/&gt; <https://twitter.com/trybooking&gt;
<https://www.linkedin.com/company/trybooking-com&gt;

On 30 January 2017 at 04:49, Adrian Klaver <adrian.klaver@aklaver.com>
wrote:

Show quoted text

On 01/28/2017 11:23 PM, Brian Mills wrote:

I presume this is a binary log file for the database.

Am I able to recover to a point in time using this log file?

What I would do in SQL Server would be recover to a point in time, say a
bit before the last completed transaction time the log mentions, then
take a backup. Is that possible in postgres?

Had another thought. If I remember correctly you are using this as an
exercise in Postgres recovery. If that is indeed the case you might try:

1) Stop the Postgres instance you have running now.

2) Move the WAL file that Postgres is currently stalled on,
0000000100000005000000A3, out of pg_xlog.

3) Restart the Postgres instance.

My guess it it will not bring it back to the exact point you want, but
close. You can get a general idea by running(before and after removing the
WAL), as the postgres user:

pg_controldata -D /etc/postgresql/9.3/main

The log mentions this:
2017-01-27 20:36:18 AEDT LOG: last completed transaction was at log
time 2017-01-24 02:08:00.023064+11

(which is moments before, or possibly as the disk filled up doing a db
backup dump)

*Brian Mills*
CTO

--
Adrian Klaver
adrian.klaver@aklaver.com

#14Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Brian Mills (#13)
Re: Recovery Assistance

On 01/29/2017 03:12 PM, Brian Mills wrote:

OK. I think I'm on to something here, I first reset back to my file
level backup.
I created a recovery.conf file in the root of the data directory like this:
---------------------
restore_command = 'cp /mnt/archive/%f %p'
recovery_target_time = '2017-01-24 02:08:00.023064+11'
recovery_target_inclusive = 'true'
pause_at_recovery_target = 'false'
---------------------
Note, the archive directory had no files in it, I left the WAL files in
the pg_xlog directory.

Then I started up the database again:
postgres@atlassian:~/9.3/main$ /usr/lib/postgresql/9.3/bin/pg_ctl -D
/etc/postgresql/9.3/main start
server starting
postgres@atlassian:~/9.3/main$ 2017-01-30 10:07:28 AEDT LOG: database
system was interrupted while in recovery at 2017-01-27 20:13:26 AEDT
2017-01-30 10:07:28 AEDT HINT: This probably means that some data is
corrupted and you will have to use the last backup for recovery.
2017-01-30 10:07:28 AEDT LOG: starting point-in-time recovery to
2017-01-24 02:08:00.023064+11
2017-01-30 10:07:28 AEDT LOG: database system was not properly shut
down; automatic recovery in progress
2017-01-30 10:07:28 AEDT WARNING: WAL was generated with
wal_level=minimal, data may be missing

This would be a problem:

https://www.postgresql.org/docs/9.3/static/runtime-config-wal.html#RUNTIME-CONFIG-WAL-SETTINGS

"But minimal WAL does not contain enough information to reconstruct the
data from a base backup and the WAL logs, so either archive or
hot_standby level must be used to enable WAL archiving (archive_mode)
and streaming replication. "

2017-01-30 10:07:28 AEDT HINT: This happens if you temporarily set
wal_level=minimal without taking a new base backup.
2017-01-30 10:07:28 AEDT LOG: redo starts at 5/528B4558
2017-01-30 10:07:40 AEDT LOG: consistent recovery state reached at
5/A3FFFA30
cp: cannot stat ‘/mnt/archive/0000000100000005000000A3’: No such file or
directory
cp: cannot stat ‘/mnt/archive/0000000100000005000000A4’: No such file or
directory
2017-01-30 10:07:40 AEDT LOG: redo done at 5/A3FFF9E8
2017-01-30 10:07:40 AEDT LOG: last completed transaction was at log
time 2017-01-24 02:08:00.023064+11
cp: cannot stat ‘/mnt/archive/0000000100000005000000A3’: No such file or
directory
cp: cannot stat ‘/mnt/archive/00000002.history’: No such file or directory
2017-01-30 10:07:40 AEDT LOG: selected new timeline ID: 2
cp: cannot stat ‘/mnt/archive/00000001.history’: No such file or directory
2017-01-30 10:07:40 AEDT LOG: archive recovery complete
2017-01-30 10:08:55 AEDT FATAL: the database system is starting up
2017-01-30 10:08:57 AEDT FATAL: the database system is starting up

This time it looks like it has actually finished the startup and
recovery. However I think I might have something wrong about the process.
Any thoughts on the above log?

See above.

*Brian Mills*
CTO

*Mob: *0410660003 <tel:0410660003>
*Melbourne* 03 9012 3460 <tel:03%209012%203460> or 03 8376 6327
<tel:03%208376%206327> *|* * **Sydney* 02 8064 3600
<tel:02%208064%203600> *|* *Brisbane* 07 3173 1570 <tel:07%203173%201570>
Level 1 *|* 600 Chapel Street *|* South
Yarra*|* VIC *|* 3141 *|* Australia

<https://www.facebook.com/TryBooking/&gt; <https://twitter.com/trybooking&gt; <https://www.linkedin.com/company/trybooking-com&gt;

On 30 January 2017 at 04:49, Adrian Klaver <adrian.klaver@aklaver.com
<mailto:adrian.klaver@aklaver.com>> wrote:

On 01/28/2017 11:23 PM, Brian Mills wrote:

I presume this is a binary log file for the database.

Am I able to recover to a point in time using this log file?

What I would do in SQL Server would be recover to a point in
time, say a
bit before the last completed transaction time the log mentions,
then
take a backup. Is that possible in postgres?

Had another thought. If I remember correctly you are using this as
an exercise in Postgres recovery. If that is indeed the case you
might try:

1) Stop the Postgres instance you have running now.

2) Move the WAL file that Postgres is currently stalled on,
0000000100000005000000A3, out of pg_xlog.

3) Restart the Postgres instance.

My guess it it will not bring it back to the exact point you want,
but close. You can get a general idea by running(before and after
removing the WAL), as the postgres user:

pg_controldata -D /etc/postgresql/9.3/main

The log mentions this:
2017-01-27 20:36:18 AEDT LOG: last completed transaction was at log
time 2017-01-24 02:08:00.023064+11

(which is moments before, or possibly as the disk filled up
doing a db
backup dump)

*Brian Mills*
CTO

--
Adrian Klaver
adrian.klaver@aklaver.com <mailto:adrian.klaver@aklaver.com>

--
Adrian Klaver
adrian.klaver@aklaver.com

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general