[BUG] PostgreSQL 14.5: Data corruption in table tb_workstation_message (possible fstrim-related on eMMC)

Started by 第108次明天4 months ago3 messagesbugs

Jump to latest

第108次明天

87326549@qq.com

4 months ago

To: pgsql-bugs@postgresql.org

Subject: [BUG] PostgreSQL 14.5: Data corruption in table tb_workstation_message (possible fstrim-related on eMMC)

Environment Information

PostgreSQL Version: 14.5

OS: Debian GNU/Linux 11 (bullseye)

Architecture: aarch64 (8-core CPU)

Memory: 8GB

Storage Medium: eMMC

Problem Description
Data corruption occurred in the tb_workstation_message table (record guid: ce04bd3f-232d-4c7f-9a91-540d0e581649). Queries targeting this record fail with:

plaintext

ERROR: invalid memory alloc request size 18446744073709551613

Corruption details of the record:

Only the guid field remains valid;

All int-type fields (e.g., upload_status) are set to 0;

All timestamp-type fields (e.g., create_time) are reset to 2000-01-01 00:00:00;

String-type fields (e.g., operator, content) are inaccessible (queryable only after forcing UPDATE to null).

Timeline

Record insertion: 2025-11-25 21:23:51

Last modification: 2025-11-27 09:41:10

Anomaly window: 2025-12-01 03:00:00 ~ 09:02:31 (record was intact before 03:00:00)

First error detected: 2025-12-01 09:02:31 (triggered by a 30s-interval business cron job querying upload_status = 0)

Error Output (Query Example)
sql

postgres=# select * from tb_workstation_message where guid = 'ce04bd3f-232d-4c7f-9a91-540d0e581649'; ERROR: invalid memory alloc request size 18446744073709551613

Investigation Details

Database Logs: No output between 2025-12-01 05:37:05.932 ~ 09:02:31.533 (no abnormalities found).

Dynamic Library Check: libpq.so had environment variable misconfiguration (not pointing to PG install dir), but the business app does NOT depend on this library — low corruption probability.

eMMC Storage Analysis:

Used dd to extract the corrupted block; hex analysis shows only guid bytes are valid (others are 0, confirming physical data corruption).

fstrim Check:

fstrim executed at 2025-12-01 00:27 (asynchronous, runs on disk idle); anomaly occurred after this.

Reproduction attempt: Simulated data insertion + 10+ fstrim runs — corruption not reproduced.

Impact

Business failure: Query errors trigger business alerts.

Backup failure: Database backups are affected by corrupted data.

Additional Notes

No system/app/PG restarts during the anomaly window.

No writes to tb_workstation_message after 2025-12-01 05:34:00.

No online operations/terminals between 05:34:00 ~ 09:02:31.

Request:

We suspect this may relate to fstrim on eMMC (with PG 14.5) but cannot reproduce it. Could the community help analyze the root cause, or guide further troubleshooting?

第108次明天
87326549@qq.com

从QQ邮箱发来的超大附件

postgresql_2025-12-01_000000(1).csv (739.5KB, 2026年1月17日 14:42) 进入下载页面：https://wx.mail.qq.com/ftn/download?func=3&k=c6cd4a393619073df4be1b3961663338e7bf3e3963663338111d585f4c530b5e0e05060d534b57090f54180d025650155805500f4e5f030f0153575d5a5302590b1135490c15475f4b5746480f3901080b071808514b03096602050953560310081b1b5a1010272c6b8d6cf59bcb17826e22a5c9c7abe665fb3a453863663338393235&key=c6cd4a393619073df4be1b3961663338e7bf3e3963663338111d585f4c530b5e0e05060d534b57090f54180d025650155805500f4e5f030f0153575d5a5302590b1135490c15475f4b5746480f3901080b071808514b03096602050953560310081b1b5a1010272c6b8d6cf59bcb17826e22a5c9c7abe665fb3a453863663338393235&code=9259cf38&from=

Laurenz Albe

laurenz.albe@cybertec.at

4 months ago

In reply to: 第108次明天 (#1)

Re: [BUG] PostgreSQL 14.5: Data corruption in table tb_workstation_message (possible fstrim-related on eMMC)

On Thu, 2025-12-18 at 14:42 +0800, 第108次明天 wrote:

Environment Information
* PostgreSQL Version: 14.5
* OS: Debian GNU/Linux 11 (bullseye)
* Architecture: aarch64 (8-core CPU)
* Memory: 8GB
* Storage Medium: eMMC
Problem Description
Data corruption occurred in the tb_workstation_message table (record guid: ce04bd3f-232d-4c7f-9a91-540d0e581649).
Queries targeting this record fail with:
ERROR: invalid memory alloc request size 18446744073709551613
Corruption details of the record:
* Only the guid field remains valid;
* All int-type fields (e.g., upload_status) are set to 0;
* All timestamp-type fields (e.g., create_time) are reset to 2000-01-01 00:00:00;
* String-type fields (e.g., operator, content) are inaccessible (queryable only after forcing UPDATE to null).
Investigation Details 1. Database Logs: No output between 2025-12-01 05:37:05.932 ~ 09:02:31.533 (no abnormalities found).
   2. Dynamic Library Check: libpq.so had environment variable misconfiguration (not pointing to PG install dir),
but the business app does NOT depend on this library — low corruption probability.
   3. eMMC Storage Analysis:
       - Used dd to extract the corrupted block; hex analysis shows only guid bytes are valid (others are 0,
confirming physical data corruption).
   4. fstrim Check:
       - fstrim executed at 2025-12-01 00:27 (asynchronous, runs on disk idle); anomaly occurred after this.
       - Reproduction attempt: Simulated data insertion + 10+ fstrim runs — corruption not reproduced.
Impact * Business failure: Query errors trigger business alerts.
* Backup failure: Database backups are affected by corrupted data.
Additional Notes * No system/app/PG restarts during the anomaly window.
* No writes to tb_workstation_message after 2025-12-01 05:34:00.
* No online operations/terminals between 05:34:00 ~ 09:02:31.
Request:
We suspect this may relate to fstrim on eMMC (with PG 14.5) but cannot reproduce it. Could the community
help analyze the root cause, or guide further troubleshooting?

"fstrim" should not destroy data...

I am wondering how this problem can lead to backup failure - unless you are using "pg_dump" for backups.

You probably figured out that you will have to delete the row to get rid of the problem.
In addition, you should dump and restore the database to a new cluster - who knows what
other data corruption is lurking there.

I don't know what caused your problem, but I would suspect hardware failure.
Perhaps you should check or replace the disk.

Yours,
Laurenz Albe

Greg Sabino Mullane

greg@turnstep.com

4 months ago

In reply to: 第108次明天 (#1)

Re: [BUG] PostgreSQL 14.5: Data corruption in table tb_workstation_message (possible fstrim-related on eMMC)

- PostgreSQL Version: 14.5

That is way too old. Upgrade to 14.20

or guide further troubleshooting?

Run the pg_checksums program to get a handle on the extent of the corruption

https://www.postgresql.org/docs/14/app-pgchecksums.html

Database backups are affected by corrupted data.

I assume you are not using a good backup system then. pgBackRest, for
example, will not only check for corrupted pages during the backup, but
report on it and allow the backup to proceed.

[BUG] PostgreSQL 14.5: Data corruption in table tb_workstation_message (possible fstrim-related on eMMC)

Attachments: