BUG #15438: Standby corruption after "Too many open files in system" error

Started by PG Bug reporting form · over 7 years ago · 5 messages · bugs
#1 PG Bug reporting form
noreply@postgresql.org

The following bug has been logged on the website:

Bug reference: 15438
Logged by: Juanjo Santamaría
Email address: juanjo.santamaria@gmail.com
PostgreSQL version: 10.4
Operating system: Redhat 7.3
Description:

After some time running under poor conditions (the open file limit was too
low for the postgres user), the streaming replica ended up in a corrupted
state.

The primary was working the whole time, and recreating the standby replica
after raising the user limits seems to solve the issue.

The log is quite verbose, so I have tried to summarize at least one
occurrence of every error:

Oct 15 10:47:54 postgres[9507]: [9-1] LOG: started streaming WAL from
primary at 6/BE000000 on timeline 3
Oct 15 10:48:38 postgres[9507]: [10-1] FATAL: could not receive data from
WAL stream: ERROR: epoll_create1 failed: Too many open files in system
Oct 15 10:48:38 postgres[40098]: [15-1] LOG: record with incorrect
prev-link 6/A1A993C8 at 6/BEA99400
Oct 15 10:48:38 postgres[9550]: [9-1] FATAL: could not connect to the
primary server: server closed the connection unexpectedly
Oct 15 10:48:38 postgres[9550]: [9-2] #011#011This probably means the server
terminated abnormally
Oct 15 10:48:38 postgres[9550]: [9-3] #011#011before or while processing the
request.

Oct 15 10:48:53 postgres[9586]: [9-1] LOG: started streaming WAL from
primary at 6/BE000000 on timeline 3
Oct 15 10:49:32 postgres[9586]: [10-1] FATAL: could not receive data from
WAL stream: ERROR: epoll_create1 failed: Too many open files in system
Oct 15 10:49:32 postgres[40098]: [16-1] LOG: invalid record length at
6/BEABD0F8: wanted 24, got 0

Oct 15 10:50:31 postgres[9765]: [9-1] LOG: started streaming WAL from
primary at 6/BE000000 on timeline 3
Oct 15 10:50:39 postgres[9765]: [10-1] FATAL: could not receive data from
WAL stream: ERROR: epoll_create1 failed: Too many open files in system
Oct 15 10:50:39 postgres[40098]: [19-1] LOG: invalid resource manager ID
101 at 6/BEAE3708
Oct 15 10:51:01 postgres[9767]: [9-1] FATAL: could not connect to the
primary server: server closed the connection unexpectedly
Oct 15 10:51:01 postgres[9767]: [9-2] #011#011This probably means the server
terminated abnormally
Oct 15 10:51:01 postgres[9767]: [9-3] #011#011before or while processing the
request.

Oct 15 10:51:06 postgres[9783]: [9-1] LOG: started streaming WAL from
primary at 6/BE000000 on timeline 3
Oct 15 10:51:39 postgres[9783]: [10-1] FATAL: could not receive data from
WAL stream: ERROR: epoll_create1 failed: Too many open files in system
Oct 15 10:51:39 postgres[40098]: [20-1] LOG: record with incorrect
prev-link 3E/746F6F72 at 6/BEC3D5E8

Oct 15 11:03:39 postgres[10850]: [9-1] FATAL: could not connect to the
primary server: FATAL: could not open relation mapping file
"global/pg_filenode.map": Too many open files in system

Oct 15 12:00:10 postgres[40101]: [3732-1] LOG: restartpoint complete: wrote
61 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled;
write=6.159 s, sync=0.018 s, total=6.223 s; sync files=16,
longest=0.006 s, average=0.001 s; distance=451 kB, estimate=2215 kB
Oct 15 12:01:40 postgres[15182]: [10-1] FATAL: could not receive data from
WAL stream: ERROR: epoll_create1 failed: Too many open files in system
Oct 15 12:01:40 postgres[40098]: [80-1] LOG: invalid resource manager ID 64
at 6/BEE34F00
Oct 15 12:01:42 postgres[15711]: [9-1] LOG: started streaming WAL from
primary at 6/BE000000 on timeline 3
Oct 15 12:05:38 postgres[15711]: [10-1] FATAL: could not receive data from
WAL stream: ERROR: epoll_create1 failed: Too many open files in system
Oct 15 12:05:38 postgres[40098]: [81-1] LOG: unexpected pageaddr 6/A1E6E000
in log segment 0000000300000006000000BE, offset 15130624

Oct 16 01:34:15 postgres[18890]: [9-1] LOG: started streaming WAL from
primary at 6/C3000000 on timeline 3
Oct 16 01:34:15 postgres[18890]: [10-1] FATAL: could not receive data from
WAL stream: ERROR: could not open file "pg_wal/0000000300000006000000C3":
Too many open files in system

Oct 16 17:58:44 postgres[38030]: [9-1] FATAL: the database system is
starting up
Oct 16 17:58:47 postgres[38029]: [9-1] LOG: entering standby mode
Oct 16 17:58:47 postgres[38029]: [10-1] LOG: redo starts at 6/CF9B3598
Oct 16 17:58:47 postgres[38029]: [11-1] LOG: consistent recovery state
reached at 6/CF9C53B8
Oct 16 17:58:47 postgres[38026]: [8-1] LOG: database system is ready to
accept read only connections
Oct 16 17:58:48 postgres[38029]: [12-1] FATAL: invalid memory alloc request
size 1715044352
Oct 16 17:58:48 postgres[38026]: [9-1] LOG: startup process (PID 38029)
exited with exit code 1
Oct 16 17:58:48 postgres[38026]: [10-1] LOG: terminating any other active
server processes
Oct 16 17:58:48 postgres[38026]: [11-1] LOG: database system is shut down

#2 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: PG Bug reporting form (#1)
Re: BUG #15438: Standby corruption after "Too many open files in system" error

On 2018-Oct-18, PG Bug reporting form wrote:

The primary was working the whole time and recreating the standby replica,
after configuring the user limits, seems to solve the issue.

Hmm, we recently fixed a bug where file descriptors were being leaked,
after 10.5 was tagged (commit e00f4b68dc87 [1]), but that was in logical
decoding, and the symptoms seem completely different anyway. I'm not
sure it would affect you. The epoll_create stuff in latch.c is pretty
young though.

[1]: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=e00f4b68dc878dcee46833a742844346daa1e3c8
[2]: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=fa31b6f4e9696f3c9777bf4ec2faea822826ce9f

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#3 Andres Freund
andres@anarazel.de
In reply to: Alvaro Herrera (#2)
Re: BUG #15438: Standby corruption after "Too many open files in system" error

Hi,

On 2018-10-22 14:45:04 -0300, Alvaro Herrera wrote:

On 2018-Oct-18, PG Bug reporting form wrote:

The primary was working the whole time and recreating the standby replica,
after configuring the user limits, seems to solve the issue.

When you say corruption, do you mean that the data was corrupted
afterwards? Or just that the standby spewed errors until you restarted
with adjusted limits?

Hmm, we recently fixed a bug where file descriptors were being leaked,
after 10.5 was tagged (commit e00f4b68dc87 [1]) but that was in logical
decoding, and the symptoms seem completely different anyway. I'm not
sure it would affect you. The epoll_create stuff in latch.c is pretty
young though.

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=e00f4b68dc878dcee46833a742844346daa1e3c8
[2] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=fa31b6f4e9696f3c9777bf4ec2faea822826ce9f

I'd assume the fact that this hit epoll_create is more owing to the fact
that the file descriptor used for epoll isn't going through fd.c. That
means we don't previously release an fd if we're at our limit. That's
supposed to be catered for by a reserve of fds, but that reserve isn't
that large.

There previously have been discussions whether we can make fd.c handle
fds that need to be created outside of fd.c's remit (by making sure
there's space, and then tracking that similar to OpenTransientFile()).

Greetings,

Andres Freund

#4 Juan José Santamaría Flecha
juanjo.santamaria@gmail.com
In reply to: Andres Freund (#3)
Re: BUG #15438: Standby corruption after "Too many open files in system" error

Hi,

When you say corruption, do you mean that the data was corrupted

afterwards? Or just that the standby spewed errors until you restarted
with adjusted limits?

We didn't do a proper low-level check to verify that the database was
corrupted, but the symptoms led us to believe so: after a restart the
system crashed almost immediately with an 'invalid memory alloc request
size', no matter the user limits.

There is a backup still available if any test can be of interest.

#5 Andres Freund
andres@anarazel.de
In reply to: Juan José Santamaría Flecha (#4)
Re: BUG #15438: Standby corruption after "Too many open files in system" error

Hi,

On 2018-10-22 20:18:32 +0200, Juan José Santamaría Flecha wrote:

When you say corruption, do you mean that the data was corrupted

afterwards? Or just that the standby spewed errors until you restarted
with adjusted limits?

We didn't do a proper low-level check to verify that the database was
corrupted, but the symptoms led us to believe so: after a restart the
system crashed almost immediately with an 'invalid memory alloc request
size', no matter the user limits.

There is a backup still available if any test can be of interest.

Yea, I think looking at that would be worthwhile. That an FD leak or
such could cause replay to get stuck is not that surprising, but it
shouldn't cause corruption. That hints that something further is borked.

Greetings,

Andres Freund