pg_dump strangeness
I'm having an issue with pg_dump crashing one of my servers. I was
running PG 7.2.1 and am now running 7.2.4 on RedHat 7.3 with up-to-date
patches. It happens when I'm dumping a largish (for me) database. The
database has two tables, one with 1.2 million entries and the other with
3.5 million entries; there are also about 700,000 blobs with signatures.
The exact command I'm using is:
pg_dump -Fc -b docarc >docarc.cust
It usually doesn't happen on the first iteration; it's the second that
brings the box down. I ran it by hand on the console Saturday and it
slowly destabilized the system. I lost the title bars on the windows and
then the gnome task bar. Only the mouse cursor moved, but it did not
respond to clicks or the keyboard. I was able to restart the box from a
telnet session.
I added more memory to the box and that seems to be helping. It now
takes four runs to kill the box.
Any clue to the root of the problem? OS, hardware, postgresql, something
misconfigured???
Thanks,
Lane
From the system logfile -
Mar 10 02:34:08 internal kernel: Unable to handle kernel NULL pointer
dereference at virtual address 00000020
Mar 10 02:34:08 internal kernel: printing eip:
Mar 10 02:34:08 internal kernel: c013bbee
Mar 10 02:34:08 internal kernel: *pde = 00000000
Mar 10 02:34:08 internal kernel: Oops: 0000
Mar 10 02:34:08 internal kernel: sis sisfb agpgart 8139too mii usb-ohci
usbcore ext3 jbd dpt_i2o sd_mod scsi_mod
Mar 10 02:34:08 internal kernel: CPU: 0
Mar 10 02:34:08 internal kernel: EIP: 0010:[<c013bbee>] Not
tainted
Mar 10 02:34:08 internal kernel: EFLAGS: 00010286
Mar 10 02:34:08 internal kernel:
Mar 10 02:34:08 internal kernel: EIP is at block_read_full_page [kernel]
0xe (2.4.18-26.7.x)
Mar 10 02:34:08 internal kernel: eax: 00000000 ebx: e1025d34 ecx:
00000000 edx: 00000000
Mar 10 02:34:08 internal kernel: esi: c15d46f0 edi: c02d4a24 ebp:
c15d470c esp: e2261d90
Mar 10 02:34:08 internal kernel: ds: 0018 es: 0018 ss: 0018
Mar 10 02:34:08 internal kernel: Process pg_dump (pid: 13593,
stackpage=e2261000)
Mar 10 02:34:08 internal kernel: Stack: 00000001 ded15500 e1043540
c01cc410 dd36b600 c020dd17 e2260000 0000000c
Mar 10 02:34:08 internal kernel: e2261eb0 0000000c e1043540
00000282 c01cc431 dd36b600 00000000 00000000
Mar 10 02:34:08 internal kernel: c01cd39b 00000283 0000000c
e1025d34 c15d46f0 c02d4a24 00001417 c0128a23
Mar 10 02:34:08 internal kernel: Call Trace: [<c01cc410>] sock_wfree
[kernel] 0x0 (0xe2261d9c))
Mar 10 02:34:08 internal kernel: [<c020dd17>] unix_write_space [kernel]
0x37 (0xe2261da4))
Mar 10 02:34:08 internal kernel: [<c01cc431>] sock_wfree [kernel] 0x21
(0xe2261dc0))
Mar 10 02:34:08 internal kernel: [<c01cd39b>] kfree_skbmem [kernel] 0xb
(0xe2261dd0))
Mar 10 02:34:08 internal kernel: [<c0128a23>] __remove_inode_page
[kernel] 0x33(0xe2261dec))
Mar 10 02:34:08 internal kernel: [<e7946a20>] ext3_get_block [ext3] 0x0
(0xe2261df4))
Mar 10 02:34:08 internal kernel: [<c012fdac>] reclaim_page [kernel]
0x1ec (0xe2261dfc))
Mar 10 02:34:08 internal kernel: [<c0132171>] __alloc_pages_limit
[kernel] 0x71(0xe2261e1c))
Mar 10 02:34:08 internal kernel: [<c0132239>] __alloc_pages [kernel]
0x99 (0xe2261e30))
Mar 10 02:34:08 internal kernel: [<c0126cb0>] do_anonymous_page [kernel]
0x50 (0xe2261e64))
Mar 10 02:34:08 internal kernel: [<e7948e65>] ext3_mark_iloc_dirty
[ext3] 0x35 (0xe2261e68))
Mar 10 02:34:08 internal kernel: [<c0126da3>] do_no_page [kernel] 0x33
(0xe2261e88))
Mar 10 02:34:08 internal kernel: [<c01cb02c>] sys_recvfrom [kernel] 0xec
(0xe2261eac))
Mar 10 02:34:08 internal kernel: [<c0126fea>] handle_mm_fault [kernel]
0xca (0xe2261ec0))
Mar 10 02:34:08 internal kernel: [<c01324a0>] __get_free_pages [kernel]
0x10 (0xe2261ee0))
Mar 10 02:34:08 internal kernel: [<c0146b83>] __pollwait [kernel] 0x33
(0xe2261ee4))
Mar 10 02:34:08 internal kernel: [<c011456a>] do_page_fault [kernel]
0x12a (0xe2261f08))
Mar 10 02:34:08 internal kernel: [<c01286e9>] do_brk [kernel] 0x249
(0xe2261f44))
Mar 10 02:34:08 internal kernel: [<c01cb05d>] sys_recv [kernel] 0x1d
(0xe2261f6c))
Mar 10 02:34:08 internal kernel: [<c0127452>] sys_brk [kernel] 0xb2
(0xe2261f94))
Mar 10 02:34:08 internal kernel: [<c0114440>] do_page_fault [kernel] 0x0
(0xe2261fb0))
Mar 10 02:34:08 internal kernel: [<c0108a4c>] error_code [kernel] 0x34
(0xe2261fb8))
Lane Rollins wrote:
Do you notice a lot of memory being allocated? Use the free command. Is uptime high?
What is the output of this command (assuming the db is being run by user 'postgres')?
ps -w -w -o pid,rss,size,args --sort size -u postgres
I tried stopping and starting postmaster to see if it would release any
memory and it didn't.
I'll try doing the ps later tonight and see what happens. But here is
some info from top if that helps at all.
The machine in a quiet state
5:02pm up 1:30, 1 user, load average: 0.00, 0.00, 0.00
77 processes: 74 sleeping, 3 running, 0 zombie, 0 stopped
CPU states: 0.0% user, 0.1% system, 0.0% nice, 99.8% idle
Mem: 1015028K av, 205732K used, 809296K free, 0K shrd, 38036K buff
Swap: 136512K av, 0K used, 136512K free 86884K cached
After 3 consecutive pg_dumps
3:15pm up 5:47, 4 users, load average: 0.06, 0.13, 0.21
111 processes: 108 sleeping, 3 running, 0 zombie, 0 stopped
CPU states: 3.5% user, 1.7% system, 0.3% nice, 94.2% idle
Mem: 1015028K av, 997768K used, 17260K free, 0K shrd, 76016K buff
Swap: 136512K av, 10484K used, 126028K free 801560K cached
Last update before it died
3:27pm up 6:00, 4 users, load average: 1.65, 1.90, 1.23
110 processes: 106 sleeping, 4 running, 0 zombie, 0 stopped
CPU states: 20.0% user, 20.2% system, 0.1% nice, 59.4% idle
Mem: 1015028K av, 1006508K used, 8520K free, 0K shrd, 76968K buff
Swap: 136512K av, 10484K used, 126028K free 726256K cached
  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME COMMAND
 3641 postgres  16   0  110M 110M 26840 R    29.8 11.1   5:20 postmaster
 1227 root       5 -10 26152 6128  3008 S <   3.3  0.6  40:35 X
 1538 laner     15   0 12096  11M  6124 S     1.5  1.1   2:02 rhn-applet
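One rough way to read the three top snapshots above is to subtract buffers and page cache from "used", which gives an estimate of what the processes themselves are holding. The sketch below is only an illustration and was not posted in the thread: `app_used` is a hypothetical helper name, and the figures are copied from the Mem and Swap lines above.

```shell
#!/bin/sh
# app_used is a hypothetical helper (not from the thread). It estimates the
# memory held by processes as: used - buffers - cached, all in kilobytes.
app_used() {
    echo $(( $1 - $2 - $3 ))
}

# Arguments: "used", "buff" and "cached" values from each top snapshot.
echo "quiet state:       $(app_used 205732 38036 86884)K"
echo "after three dumps: $(app_used 997768 76016 801560)K"
echo "last update:       $(app_used 1006508 76968 726256)K"
```

By this estimate the three snapshots come to roughly 80812K, 120192K and 203284K, so most of the jump in "used" is page cache rather than process memory. The kernel normally gives cache back when it is needed, so a nearly full "used" figure on its own may not point to a leak.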
Thanks again,
Lane
-----Original Message-----
From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Joseph Shraibman
Sent: Monday, March 10, 2003 4:38 PM
To: Lane Rollins
Lane Rollins wrote:
Do you notice a lot of memory being allocated? Use the free command. Is uptime high?
What is the output of this command (assuming the db is being run by user 'postgres')?
ps -w -w -o pid,rss,size,args --sort size -u postgres
On Mon, 2003-03-10 at 19:15, Lane Rollins wrote:
Mar 10 02:34:08 internal kernel: Unable to handle kernel NULL pointer
dereference at virtual address 00000020
Looks like a kernel bug -- there's not much we can do to help, AFAIK.
Have you tried applying any errata that RH have put out for your kernel,
and/or reporting the problem to the appropriate source? (lkml, RH, etc.)
Cheers,
Neil
--
Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC
On Tue, 2003-03-11 at 16:57, Neil Conway wrote:
On Mon, 2003-03-10 at 19:15, Lane Rollins wrote:
Mar 10 02:34:08 internal kernel: Unable to handle kernel NULL pointer
dereference at virtual address 00000020
Looks like a kernel bug -- there's not much we can do to help, AFAIK.
Have you tried applying any errata that RH have put out for your kernel,
and/or reporting the problem to the appropriate source? (lkml, RH, etc.)
Cheers,
Neil
It could also be bad memory - try memtest86 for an hour or two.
Stephen
The problem seems to be either a bad mainboard or bad memory. The machine
decided it was going to start crashing very regularly and finally
stopped even booting. I ended up moving the raid board, drives and one
of the sticks of memory to another box and so far it's still running.
I'll try running the memory tests tomorrow, when I'm not stuck in
meetings all day.
Thanks for the help and suggestions,
-Lane
-----Original Message-----
From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Stephen Robert Norris
Sent: Tuesday, March 11, 2003 9:19 PM
To: Neil Conway
Cc: Lane Rollins; PostgreSQL General
Subject: Re: [GENERAL] pg_dump strangeness
On Tue, 2003-03-11 at 16:57, Neil Conway wrote:
On Mon, 2003-03-10 at 19:15, Lane Rollins wrote:
Mar 10 02:34:08 internal kernel: Unable to handle kernel NULL pointer
dereference at virtual address 00000020
Looks like a kernel bug -- there's not much we can do to help, AFAIK.
Have you tried applying any errata that RH have put out for your kernel,
and/or reporting the problem to the appropriate source? (lkml, RH, etc.)
Cheers,
Neil
It could also be bad memory - try memtest86 for an hour or two.
Stephen