Strange failures on chipmunk
Hi,
chipmunk (an armv6l-powered original Raspberry Pi model 1?) has failed
in a couple of weird ways recently on 14 and master.
On 14 I see what appears to be a corrupted log file name:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2022-06-16%2006%3A48%3A07
cp: cannot stat
\342\200\230/home/pgbfarm/buildroot/REL_14_STABLE/pgsql.build/src/test/recovery/tmp_check/t_002_archiving_primary_data/archives/000000010000000000000003\342\200\231:
No such file or directory
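A note on the escapes in that message: `\342\200\230` and `\342\200\231` are octal byte values, and decoded as UTF-8 they are the left and right single quotation marks (U+2018 and U+2019). These are most likely just the typographic quotes that cp puts around the file name in a UTF-8 locale, so they do not explain why the file was missing. The sketch below shows the decoding; the helper name `decode_octal_escapes` is made up for illustration, and the path is shortened.

```python
import re


def decode_octal_escapes(text: str) -> str:
    """Turn \\NNN octal escapes into raw bytes, then decode the result as UTF-8."""
    raw = re.sub(
        rb"\\([0-7]{3})",
        lambda m: bytes([int(m.group(1), 8)]),  # e.g. b"342" -> 0xE2
        text.encode("utf-8"),
    )
    return raw.decode("utf-8")


if __name__ == "__main__":
    # Shortened form of the path from the buildfarm log.
    msg = r"\342\200\230archives/000000010000000000000003\342\200\231"
    print(decode_octal_escapes(msg))
```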
On master, you can ignore this failure, because it was addressed by 93759c66:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2022-05-11%2015%3A26%3A01
Then there's this one-off, that smells like WAL corruption:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2022-06-13%2015%3A12%3A44
2022-06-13 23:02:06.988 EEST [30121:5] LOG: incorrect resource
manager data checksum in record at 0/79B4FE0
Hmmm. I suppose it's remotely possible that Linux/armv6l ext4 suffers
from concurrency bugs like Linux/sparc. In that particular kernel
bug's case it's zeroes, so I guess it'd be easier to speculate about
if the log message included the checksum when it fails like that...
On Thu, Jun 30, 2022 at 10:07:18AM +1200, Thomas Munro wrote:
> Then there's this one-off, that smells like WAL corruption:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2022-06-13%2015%3A12%3A44
> 2022-06-13 23:02:06.988 EEST [30121:5] LOG: incorrect resource
> manager data checksum in record at 0/79B4FE0
> Hmmm. I suppose it's remotely possible that Linux/armv6l ext4 suffers
> from concurrency bugs like Linux/sparc.
Running sparc64-ext4-zeros.c from
https://marc.info/?l=linux-sparc&m=164539269632667&w=2 could confirm that
possibility.
On 30/06/2022 09:31, Noah Misch wrote:
> On Thu, Jun 30, 2022 at 10:07:18AM +1200, Thomas Munro wrote:
>> Then there's this one-off, that smells like WAL corruption:
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2022-06-13%2015%3A12%3A44
>> 2022-06-13 23:02:06.988 EEST [30121:5] LOG: incorrect resource
>> manager data checksum in record at 0/79B4FE0
>> Hmmm. I suppose it's remotely possible that Linux/armv6l ext4 suffers
>> from concurrency bugs like Linux/sparc.
> Running sparc64-ext4-zeros.c from
> https://marc.info/?l=linux-sparc&m=164539269632667&w=2 could confirm that
> possibility.
I ran sparc64-ext4-zeros on chipmunk for 10 minutes, and it didn't print
anything.
It's possible that the SD card on chipmunk is simply wearing out and
flipping bits. I can try to replace it. Anyone have suggestions on a
test program I could run on the SD card, after replacing it, to verify
if it was indeed worn out?
- Heikki
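One simple form such a test program could take is a write-then-verify pattern check: fill a file on the card with reproducible pseudo-random blocks, sync, then read it back and compare. The following is a minimal sketch, not a vetted tool. The function names, block size, and seed are arbitrary choices for illustration. It also assumes the read-back actually reaches the card; on Linux the page cache may serve the reads unless the file is larger than RAM or the caches are dropped first.

```python
import hashlib
import os
import tempfile

BLOCK_SIZE = 64 * 1024  # bytes per block; arbitrary choice for this sketch


def expected_block(seed: bytes, index: int, size: int = BLOCK_SIZE) -> bytes:
    """Deterministic pseudo-random block, reproducible from (seed, index)."""
    return hashlib.shake_256(seed + index.to_bytes(8, "big")).digest(size)


def write_pattern(path: str, blocks: int, seed: bytes = b"chipmunk") -> None:
    """Write `blocks` pattern blocks and ask the OS to flush them to the device."""
    with open(path, "wb") as f:
        for i in range(blocks):
            f.write(expected_block(seed, i))
        f.flush()
        os.fsync(f.fileno())


def verify_pattern(path: str, blocks: int, seed: bytes = b"chipmunk") -> list:
    """Return the indexes of blocks whose contents differ from the pattern."""
    bad = []
    with open(path, "rb") as f:
        for i in range(blocks):
            if f.read(BLOCK_SIZE) != expected_block(seed, i):
                bad.append(i)
    return bad


if __name__ == "__main__":
    # Hypothetical usage: point `target` at a file on the card under test.
    target = os.path.join(tempfile.gettempdir(), "sdcard_pattern.bin")
    write_pattern(target, blocks=16)
    print("mismatched blocks:", verify_pattern(target, blocks=16))
    os.remove(target)
```

Because each block is derived from the seed and its index, the verify pass needs no second copy of the data, and a mismatch points at a specific offset in the file. A clean pass only says that this run found nothing. It does not prove the card is healthy.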
On Thu, Jun 30, 2022 at 8:21 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I ran sparc64-ext4-zeros on chipmunk for 10 minutes, and it didn't print
> anything.
Thanks for checking.
> It's possible that the SD card on chipmunk is simply wearing out and
> flipping bits. I can try to replace it. Anyone have suggestions on a
> test program I could run on the SD card, after replacing it, to verify
> if it was indeed worn out?
BTW its disk is full.
FWIW I run RPi4 build bots on higher end USB3.x sticks (SanDisk
Extreme Pro, I'm sure there are others), and the performance is orders
of magnitude higher and more consistent than the micro SD and
cheap/random USB sticks I tried. Admittedly they cost more than the
RPi4 boards themselves (back when you could get them).
I noticed another (presumed) Raspberry Pi apparently behaving
strangely at the storage level (guessing it's a Pi by the armv7l
architecture): dangomushi appears to get files mixed up. Here it is
trying to compile a log file last week:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dangomushi&dt=2022-07-14%2017%3A58%3A38
And the week before it tried to compile some Perl:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dangomushi&dt=2022-07-09%2015%3A30%3A07
On Fri, Jul 22, 2022 at 04:35:30PM +1200, Thomas Munro wrote:
> I noticed another (presumed) Raspberry Pi apparently behaving
> strangely at the storage level (guessing it's a Pi by the armv7l
> architecture): dangomushi appears to get files mixed up. Here it is
> trying to compile a log file last week:
This is a PI2.
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dangomushi&dt=2022-07-14%2017%3A58%3A38
> And the week before it tried to compile some Perl:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dangomushi&dt=2022-07-09%2015%3A30%3A07
The buildfarm runs happen on an SD card that's been running for a
couple of years now, so I would not be surprised if the issue comes
from those years of use. A couple of fsck runs did not show anything,
though, but I am keeping an eye on it.
--
Michael