pg_dump strangeness

Started by Lane Rollins, about 23 years ago, 6 messages, general
#1Lane Rollins
laner@boyds.com

I'm having an issue with pg_dump crashing one of my servers. I was
running PG 7.2.1 and am now running 7.2.4 on RedHat 7.3 with up-to-date
patches. It happens when I'm dumping a largish (for me) database. The
database has two tables, one with 1.2 million entries and the other with
3.5 million entries; there are also about 700,000 blobs with signatures.
The exact command I'm using is:

pg_dump -Fc -b docarc >docarc.cust

It usually doesn't happen on the first iteration; it's the second that
brings the box down. I ran it by hand on the console Saturday and it
slowly destabilized the system. I lost the title bars on the windows and
then the gnome task bar. Only the mouse cursor moved, but it did not
respond to clicks or the keyboard. I was able to restart the box from a
telnet session.

I added more memory to the box and that seems to be helping. It now
takes four runs to kill the box.

Any clue to the root of the problem? OS, hardware, postgresql, something
misconfigured???

Thanks,
Lane

From the system logfile -

Mar 10 02:34:08 internal kernel: Unable to handle kernel NULL pointer
dereference at virtual address 00000020

Mar 10 02:34:08 internal kernel: printing eip:

Mar 10 02:34:08 internal kernel: c013bbee

Mar 10 02:34:08 internal kernel: *pde = 00000000

Mar 10 02:34:08 internal kernel: Oops: 0000

Mar 10 02:34:08 internal kernel: sis sisfb agpgart 8139too mii usb-ohci
usbcore ext3 jbd dpt_i2o sd_mod scsi_mod

Mar 10 02:34:08 internal kernel: CPU: 0

Mar 10 02:34:08 internal kernel: EIP: 0010:[<c013bbee>] Not
tainted

Mar 10 02:34:08 internal kernel: EFLAGS: 00010286

Mar 10 02:34:08 internal kernel:

Mar 10 02:34:08 internal kernel: EIP is at block_read_full_page [kernel]
0xe (2.4.18-26.7.x)

Mar 10 02:34:08 internal kernel: eax: 00000000 ebx: e1025d34 ecx:
00000000 edx: 00000000

Mar 10 02:34:08 internal kernel: esi: c15d46f0 edi: c02d4a24 ebp:
c15d470c esp: e2261d90

Mar 10 02:34:08 internal kernel: ds: 0018 es: 0018 ss: 0018

Mar 10 02:34:08 internal kernel: Process pg_dump (pid: 13593,
stackpage=e2261000)

Mar 10 02:34:08 internal kernel: Stack: 00000001 ded15500 e1043540
c01cc410 dd36b600 c020dd17 e2260000 0000000c

Mar 10 02:34:08 internal kernel: e2261eb0 0000000c e1043540
00000282 c01cc431 dd36b600 00000000 00000000

Mar 10 02:34:08 internal kernel: c01cd39b 00000283 0000000c
e1025d34 c15d46f0 c02d4a24 00001417 c0128a23

Mar 10 02:34:08 internal kernel: Call Trace: [<c01cc410>] sock_wfree
[kernel] 0x0 (0xe2261d9c))

Mar 10 02:34:08 internal kernel: [<c020dd17>] unix_write_space [kernel]
0x37 (0xe2261da4))

Mar 10 02:34:08 internal kernel: [<c01cc431>] sock_wfree [kernel] 0x21
(0xe2261dc0))

Mar 10 02:34:08 internal kernel: [<c01cd39b>] kfree_skbmem [kernel] 0xb
(0xe2261dd0))

Mar 10 02:34:08 internal kernel: [<c0128a23>] __remove_inode_page
[kernel] 0x33(0xe2261dec))

Mar 10 02:34:08 internal kernel: [<e7946a20>] ext3_get_block [ext3] 0x0
(0xe2261df4))

Mar 10 02:34:08 internal kernel: [<c012fdac>] reclaim_page [kernel]
0x1ec (0xe2261dfc))

Mar 10 02:34:08 internal kernel: [<c0132171>] __alloc_pages_limit
[kernel] 0x71(0xe2261e1c))

Mar 10 02:34:08 internal kernel: [<c0132239>] __alloc_pages [kernel]
0x99 (0xe2261e30))

Mar 10 02:34:08 internal kernel: [<c0126cb0>] do_anonymous_page [kernel]
0x50 (0xe2261e64))

Mar 10 02:34:08 internal kernel: [<e7948e65>] ext3_mark_iloc_dirty
[ext3] 0x35 (0xe2261e68))

Mar 10 02:34:08 internal kernel: [<c0126da3>] do_no_page [kernel] 0x33
(0xe2261e88))

Mar 10 02:34:08 internal kernel: [<c01cb02c>] sys_recvfrom [kernel] 0xec
(0xe2261eac))

Mar 10 02:34:08 internal kernel: [<c0126fea>] handle_mm_fault [kernel]
0xca (0xe2261ec0))

Mar 10 02:34:08 internal kernel: [<c01324a0>] __get_free_pages [kernel]
0x10 (0xe2261ee0))

Mar 10 02:34:08 internal kernel: [<c0146b83>] __pollwait [kernel] 0x33
(0xe2261ee4))

Mar 10 02:34:08 internal kernel: [<c011456a>] do_page_fault [kernel]
0x12a (0xe2261f08))

Mar 10 02:34:08 internal kernel: [<c01286e9>] do_brk [kernel] 0x249
(0xe2261f44))

Mar 10 02:34:08 internal kernel: [<c01cb05d>] sys_recv [kernel] 0x1d
(0xe2261f6c))

Mar 10 02:34:08 internal kernel: [<c0127452>] sys_brk [kernel] 0xb2
(0xe2261f94))

Mar 10 02:34:08 internal kernel: [<c0114440>] do_page_fault [kernel] 0x0
(0xe2261fb0))

Mar 10 02:34:08 internal kernel: [<c0108a4c>] error_code [kernel] 0x34
(0xe2261fb8))

#2Joseph Shraibman
jks@selectacast.net
In reply to: Lane Rollins (#1)
Re: pg_dump strangeness

Lane Rollins wrote:

Do you notice a lot of memory being allocated? Use the free command. Is uptime high?

What is the output of this command (assuming the db is being run by user 'postgres')?
ps -w -w -o pid,rss,size,args --sort size -u postgres

#3Lane Rollins
laner@boyds.com
In reply to: Joseph Shraibman (#2)
Re: pg_dump strangeness

I tried stopping and starting postmaster to see if it would release any
memory and it didn't.

I'll try doing the ps later tonight and see what happens. But here is
some info from top if that helps at all.

The machine in a quiet state
5:02pm up 1:30, 1 user, load average: 0.00, 0.00, 0.00
77 processes: 74 sleeping, 3 running, 0 zombie, 0 stopped
CPU states: 0.0% user, 0.1% system, 0.0% nice, 99.8% idle
Mem: 1015028K av, 205732K used, 809296K free, 0K shrd, 38036K buff
Swap: 136512K av, 0K used, 136512K free 86884K cached

After 3 consecutive pg_dumps
3:15pm up 5:47, 4 users, load average: 0.06, 0.13, 0.21
111 processes: 108 sleeping, 3 running, 0 zombie, 0 stopped
CPU states: 3.5% user, 1.7% system, 0.3% nice, 94.2% idle
Mem: 1015028K av, 997768K used, 17260K free, 0K shrd, 76016K buff
Swap: 136512K av, 10484K used, 126028K free 801560K cached

Last update before it died
3:27pm up 6:00, 4 users, load average: 1.65, 1.90, 1.23
110 processes: 106 sleeping, 4 running, 0 zombie, 0 stopped
CPU states: 20.0% user, 20.2% system, 0.1% nice, 59.4% idle
Mem: 1015028K av, 1006508K used, 8520K free, 0K shrd, 76968K buff
Swap: 136512K av, 10484K used, 126028K free 726256K cached

PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME COMMAND
3641 postgres 16 0 110M 110M 26840 R 29.8 11.1 5:20 postmaster
1227 root 5 -10 26152 6128 3008 S < 3.3 0.6 40:35 X
1538 laner 15 0 12096 11M 6124 S 1.5 1.1 2:02 rhn-applet

Thanks again,
Lane
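
A rough way to read the three top snapshots above: in this top output, "used" is simply total minus free, so it includes buffers and page cache. Subtracting "buff" and "cached" gives an estimate of what processes are actually holding. The sketch below does that arithmetic with the numbers from the snapshots. The function name app_used_kb is made up for illustration, and the subtraction is only an approximation (shared memory, for instance, can show up under "cached").

```shell
#!/bin/sh
# Hypothetical helper: estimate memory held by processes, in KB.
# Arguments are top's "used", "buff" and "cached" figures, in that order.
app_used_kb() {
    echo $(( $1 - $2 - $3 ))
}

app_used_kb 205732 38036 86884     # quiet machine: prints 80812
app_used_kb 997768 76016 801560    # after 3 pg_dumps: prints 120192
app_used_kb 1006508 76968 726256   # last update: prints 203284
```

By this estimate, most of the growth in "used" between the quiet snapshot and the one after three dumps is buffers and page cache, which the kernel would normally give back on demand, rather than memory held by postmaster or pg_dump.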

-----Original Message-----
From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Joseph Shraibman
Sent: Monday, March 10, 2003 4:38 PM
To: Lane Rollins

Lane Rollins wrote:

Do you notice a lot of memory being allocated? Use the free command. Is uptime high?

What is the output of this command (assuming the db is being run by user 'postgres')?
ps -w -w -o pid,rss,size,args --sort size -u postgres

#4Neil Conway
neilc@samurai.com
In reply to: Lane Rollins (#1)
Re: pg_dump strangeness

On Mon, 2003-03-10 at 19:15, Lane Rollins wrote:

Mar 10 02:34:08 internal kernel: Unable to handle kernel NULL pointer
dereference at virtual address 00000020

Looks like a kernel bug -- there's not much we can do to help, AFAIK.
Have you tried applying any errata that RH have put out for your kernel,
and/or reporting the problem to the appropriate source? (lkml, RH, etc.)

Cheers,

Neil

--
Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC

#5Stephen Robert Norris
srn@commsecure.com.au
In reply to: Neil Conway (#4)
Re: pg_dump strangeness

On Tue, 2003-03-11 at 16:57, Neil Conway wrote:

On Mon, 2003-03-10 at 19:15, Lane Rollins wrote:

Mar 10 02:34:08 internal kernel: Unable to handle kernel NULL pointer
dereference at virtual address 00000020

Looks like a kernel bug -- there's not much we can do to help, AFAIK.
Have you tried applying any errata that RH have put out for your kernel,
and/or reporting the problem to the appropriate source? (lkml, RH, etc.)

Cheers,

Neil

It could also be bad memory - try memtest86 for an hour or two.

Stephen

#6Lane Rollins
laner@boyds.com
In reply to: Stephen Robert Norris (#5)
Re: pg_dump strangeness

The problem seems to be either a bad mainboard or bad memory. The machine
started crashing very regularly and finally stopped even booting. I ended
up moving the RAID board, the drives and one of the sticks of memory to
another box, and so far it's still running.

I'll try running the memory tests tomorrow. I'm not stuck in meetings
all day.

Thanks for the help and suggestions,
-Lane

-----Original Message-----
From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Stephen Robert Norris
Sent: Tuesday, March 11, 2003 9:19 PM
To: Neil Conway
Cc: Lane Rollins; PostgreSQL General
Subject: Re: [GENERAL] pg_dump strangeness

On Tue, 2003-03-11 at 16:57, Neil Conway wrote:

On Mon, 2003-03-10 at 19:15, Lane Rollins wrote:

Mar 10 02:34:08 internal kernel: Unable to handle kernel NULL pointer
dereference at virtual address 00000020

Looks like a kernel bug -- there's not much we can do to help, AFAIK.
Have you tried applying any errata that RH have put out for your kernel,
and/or reporting the problem to the appropriate source? (lkml, RH, etc.)

Cheers,

Neil

It could also be bad memory - try memtest86 for an hour or two.

Stephen