postgres 9.5 DB corruption: invalid byte sequence for encoding "UTF8"

Started by Thomas Tignor · about 7 years ago · 10 messages · general
#1 Thomas Tignor <tptignor@yahoo.com>

Hoping someone may be able to offer some guidance on this recurring problem. I am providing this "problem report" to the general list as I understand the bugs list requires a set of reproduction steps we do not yet have. Please advise if I have the process wrong. I have tried to provide all known relevant info here. Thanks in advance for any insights.

--> A description of what you are trying to achieve and what results you expect.:
We are experiencing intermittent DB corruption in postgres 9.5.14. We are trying to identify and eliminate all sources. We are using two independent services for data replication, Slony-I v2.2.6 and a custom service developed in-house. Both are based on COPY operations. DB corruption is observed when COPY operations fail with an error of the form: 'invalid byte sequence for encoding "UTF8"'. This occurs at intervals ranging from a few days to several weeks. Each incident is followed by a race to find and repair or remove corrupted data, which we are getting good at. Across well over a dozen incidents, the great majority originally showed corruption in a single VARCHAR(2000) column (value) in a single table (alert_attribute). In that time, we read about suspected and real problems with TOAST functionality, and so decided to change alert_attribute.value to PLAIN storage. Since that change was made, most new incidents show corruption in the alert_attribute.name column instead (VARCHAR(200)). Another table (alert_instance) has been impacted as well. See below for their schemas.

We have looked high and low through system logs and device reporting utility output for any sign of hardware failures. We haven't turned up anything yet. We also tried rebuilding an entire DB from scratch. That did not seem to help.

We have not been performing routine reindexing. This is a problem we are working to correct. Normally our master DB serves for an 8-12 week period without reindexing before we fail over to a peer. Before assuming the master role, the peer always begins by truncating the alert_instance and alert_attribute tables and loading all data from the current master.
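For context, a routine reindexing pass over the two affected tables would be something like the following (a minimal sketch; note that on 9.5 REINDEX blocks writes to the table while it runs, so it would have to happen in a maintenance window or on the peer before failover):

REINDEX TABLE ams.alert_attribute;
REINDEX TABLE ams.alert_instance;
-- or, when a single index is suspect:
REINDEX INDEX ams.pk_alert_attributes;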

Hardware specs are listed below. For storage, we have 8 INTEL SSDSA2BW12 direct-attached disks. We can provide additional info as needed.

ams=# \d ams.alert_attribute
Table "ams.alert_attribute"
Column | Type | Modifiers
-------------------+-------------------------+-----------
alert_instance_id | integer | not null
name | character varying(200) | not null
data_type | smallint | not null
value | character varying(2000) |
Indexes:
"pk_alert_attributes" PRIMARY KEY, btree (alert_instance_id, name), tablespace "tbls5"
"idx_aa_aval" btree (name, value)
Foreign-key constraints:
"fk_alert_attr_instance_id" FOREIGN KEY (alert_instance_id) REFERENCES ams.alert_instance(alert_instance_id) ON DELETE CASCADE
Triggers:
_ams_cluster_logtrigger AFTER INSERT OR DELETE OR UPDATE ON ams.alert_attribute FOR EACH ROW EXECUTE PROCEDURE _ams_cluster.logtrigger('_ams_cluster', '2', 'kk')
_ams_cluster_truncatetrigger BEFORE TRUNCATE ON ams.alert_attribute FOR EACH STATEMENT EXECUTE PROCEDURE _ams_cluster.log_truncate('2')
Disabled user triggers:
_ams_cluster_denyaccess BEFORE INSERT OR DELETE OR UPDATE ON ams.alert_attribute FOR EACH ROW EXECUTE PROCEDURE _ams_cluster.denyaccess('_ams_cluster')
_ams_cluster_truncatedeny BEFORE TRUNCATE ON ams.alert_attribute FOR EACH STATEMENT EXECUTE PROCEDURE _ams_cluster.deny_truncate()

ams=#
ams=# \d ams.alert_instance
Table "ams.alert_instance"
Column | Type | Modifiers
---------------------+--------------------------------+-----------
alert_instance_id | integer | not null
alert_definition_id | integer | not null
alert_instance_key | character varying(500) | not null
start_active_date | timestamp(0) without time zone | not null
stop_active_date | timestamp(0) without time zone |
active | smallint | not null
acknowledged | smallint | not null
ack_clear_time | timestamp(0) without time zone |
user_set_clear_time | smallint |
category_id | integer | not null
condition_start | timestamp(0) without time zone | not null
unack_reason | character varying(1) |
viewer_visible | smallint | not null
Indexes:
"pk_alert_instance" PRIMARY KEY, btree (alert_instance_id), tablespace "tbls5"
"idx_alert_inst_1" btree (alert_instance_key, alert_definition_id, alert_instance_id, active, acknowledged, ack_clear_time), tablespace "tbls5"
"idx_alert_inst_cat_id" btree (category_id), tablespace "tbls5"
"idx_alert_inst_def_id" btree (alert_definition_id), tablespace "tbls5"
Check constraints:
"ck_alert_inst_acked" CHECK (acknowledged = 0 OR acknowledged = 1)
"ck_alert_inst_active" CHECK (active = 0 OR active = 1)
"ck_alert_inst_set_cl_tm" CHECK (user_set_clear_time = 0 OR user_set_clear_time = 1)
"ck_alert_inst_viewer_vis" CHECK (viewer_visible = 0 OR viewer_visible = 1)
Foreign-key constraints:
"fk_alert_inst_cat_id" FOREIGN KEY (category_id) REFERENCES ams.category(category_id)
"fk_alert_inst_def_id" FOREIGN KEY (alert_definition_id) REFERENCES ams.alert_definition(alert_definition_id)
"fk_alert_inst_unack_reason" FOREIGN KEY (unack_reason) REFERENCES ams.unack_reason(unack_reason)
Referenced by:
TABLE "ams.alert_attribute" CONSTRAINT "fk_alert_attr_instance_id" FOREIGN KEY (alert_instance_id) REFERENCES ams.alert_instance(alert_instance_id) ON DELETE CASCADE
Triggers:
_ams_cluster_logtrigger AFTER INSERT OR DELETE OR UPDATE ON ams.alert_instance FOR EACH ROW EXECUTE PROCEDURE _ams_cluster.logtrigger('_ams_cluster', '1', 'k')
_ams_cluster_truncatetrigger BEFORE TRUNCATE ON ams.alert_instance FOR EACH STATEMENT EXECUTE PROCEDURE _ams_cluster.log_truncate('1')
Disabled user triggers:
_ams_cluster_denyaccess BEFORE INSERT OR DELETE OR UPDATE ON ams.alert_instance FOR EACH ROW EXECUTE PROCEDURE _ams_cluster.denyaccess('_ams_cluster')
_ams_cluster_truncatedeny BEFORE TRUNCATE ON ams.alert_instance FOR EACH STATEMENT EXECUTE PROCEDURE _ams_cluster.deny_truncate()

ams=#

--> PostgreSQL version number you are running:
9.5.14

--> How you installed PostgreSQL:
We use a customized installer. It copies a “make install”-based distribution into place on a target install host.
Our compiler is gcc 5.4.0.

CFLAGS: -m32 -march=opteron -mno-3dnow -ggdb -O2 -Wall

$P/CONFIGURE_OPTIONS := \
--with-perl \
--with-openssl \
--with-libxml \
--with-libxslt \
--with-includes="$(COMMON)/include/perl $(COMMON)/include" \
--with-libs="$(COMMON)/lib" --prefix=$($P/PREFIX) \
--enable-debug \
--disable-nls

--> Changes made to the settings in the postgresql.conf file: see Server Configuration for a quick way to list them all.
ams=# SELECT version();
version
------------------------------------------------------------------------------------------------------------------------
PostgreSQL 9.5.14 on i686-pc-linux-gnu, compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.9aka9.0.3) 5.4.0 20160609, 32-bit
(1 row)

ams=# SELECT name, current_setting(name), SOURCE
ams-# FROM pg_settings
ams-# WHERE SOURCE NOT IN ('default', 'override');
              name               |  current_setting   |        source
---------------------------------+--------------------+----------------------
 application_name                | psql               | client
 autovacuum_vacuum_cost_delay    | 100ms              | configuration file
 checkpoint_completion_target    | 0.9                | configuration file
 client_encoding                 | SQL_ASCII          | client
 cpu_index_tuple_cost            | 0.001              | configuration file
 cpu_operator_cost               | 0.0005             | configuration file
 cpu_tuple_cost                  | 0.002              | configuration file
 DateStyle                       | ISO, MDY           | configuration file
 default_statistics_target       | 150                | configuration file
 default_text_search_config      | pg_catalog.english | configuration file
 effective_cache_size            | 8007MB             | configuration file
 lc_messages                     | C                  | configuration file
 lc_monetary                     | C                  | configuration file
 lc_numeric                      | C                  | configuration file
 lc_time                         | C                  | configuration file
 listen_addresses                | 0.0.0.0            | configuration file
 log_autovacuum_min_duration     | 0                  | configuration file
 log_checkpoints                 | on                 | configuration file
 log_connections                 | on                 | configuration file
 log_disconnections              | on                 | configuration file
 log_line_prefix                 | %m %p %u %h %d     | configuration file
 log_min_duration_statement      | 1min               | configuration file
 log_min_messages                | notice             | configuration file
 log_temp_files                  | 0                  | configuration file
 maintenance_work_mem            | 512MB              | configuration file
 max_connections                 | 600                | configuration file
 max_stack_depth                 | 2MB                | environment variable
 max_wal_size                    | 15360TB            | configuration file
 random_page_cost                | 2.5                | configuration file
 restart_after_crash             | off                | configuration file
 shared_buffers                  | 2GB                | configuration file
 ssl                             | on                 | configuration file
 ssl_ca_file                     | root.crt           | configuration file
 standard_conforming_strings     | on                 | configuration file
 statement_timeout               | 1min               | configuration file
 superuser_reserved_connections  | 10                 | configuration file
 syslog_ident                    | postgresql9.5-1.12 | configuration file
 tcp_keepalives_idle             | 0                  | configuration file
 unix_socket_permissions         | 0770               | configuration file
 work_mem                        | 30MB               | configuration file
(40 rows)

ams=#

--> Operating system and version:
Ubuntu 16.04.5 LTS

--> What program you're using to connect to PostgreSQL:
Most DML is performed by a standalone JVM using the PostgreSQL JDBC Driver v42.0.0.

--> Is there anything relevant or unusual in the PostgreSQL server logs?:
We have sometimes but not always seen "invalid memory alloc request size" errors around the time of these events. Our experience suggests that "invalid memory alloc request size" errors are a symptom specifically of index corruption and are remediated by reindexing.
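When we suspect index corruption, a quick way to separate the index from the heap is to force a sequential scan and compare (a sketch; the settings below only steer the planner, but for a simple full-table read they reliably avoid the indexes):

SET enable_indexscan = off;
SET enable_bitmapscan = off;
SET enable_indexonlyscan = off;
-- if this heap-only read succeeds where index-backed reads error out,
-- the index rather than the table data is the likely culprit:
SELECT count(*) FROM ams.alert_attribute;
RESET enable_indexscan;
RESET enable_bitmapscan;
RESET enable_indexonlyscan;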

--> For questions about any kind of error:
--> What you were doing when the error happened / how to cause the error:
Periodic DML (inserts and updates) focused primarily on two DB tables (ams.alert_instance and ams.alert_attribute) with hourly batched transfers of inactive alerts to history tables (deletes from ams.alert_instance, ams.alert_attribute and corresponding inserts to ams.alert_instance_hist, ams.alert_attribute_hist within the same transaction).

--> The EXACT TEXT of the error message you're getting, if there is one: (Copy and paste the message to the email, do not send a screenshot)
2018-12-27 07:25:24.527 GMT 30462 appuser 127.0.0.1 ams ERROR: invalid byte sequence for encoding "UTF8": 0x86
The specific identified byte varies with each incident.
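To localize the damaged rows once an incident starts, something like this works server-side (a sketch with a hypothetical helper function; convert_to() re-verifies the stored bytes against UTF8 and raises the same "invalid byte sequence" error on corrupt values, which the function traps):

CREATE OR REPLACE FUNCTION ams.is_valid_utf8(t text) RETURNS boolean AS $$
BEGIN
    -- re-validate the stored bytes; raises on corrupt values
    PERFORM convert_to(t, 'UTF8');
    RETURN true;
EXCEPTION WHEN others THEN
    RETURN false;
END $$ LANGUAGE plpgsql;

SELECT alert_instance_id, name
FROM ams.alert_attribute
WHERE NOT ams.is_valid_utf8(name)
   OR NOT ams.is_valid_utf8(value);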

--> CPU Info
2*4-core Intel(R) Xeon(R) CPU E31270 @ 3.40GHz

--> Mem Info
16GB RAM

--> Storage Info
root@ams4.dfw.netmgmt:~# lsscsi -l

[0:0:0:0]    disk    ATA      INTEL SSDSA2BW12 0362  /dev/sda
  state=running queue_depth=32 scsi_level=7 type=0 device_blocked=0 timeout=30
[0:0:1:0]    disk    ATA      INTEL SSDSA2BW12 0362  /dev/sdb
  state=running queue_depth=32 scsi_level=7 type=0 device_blocked=0 timeout=30
[0:0:2:0]    disk    ATA      INTEL SSDSA2BW12 0362  /dev/sdc
  state=running queue_depth=32 scsi_level=7 type=0 device_blocked=0 timeout=30
[0:0:3:0]    disk    ATA      INTEL SSDSA2BW12 0362  /dev/sdd
  state=running queue_depth=32 scsi_level=7 type=0 device_blocked=0 timeout=30
[0:0:4:0]    disk    ATA      INTEL SSDSA2BW12 0362  /dev/sde
  state=running queue_depth=32 scsi_level=7 type=0 device_blocked=0 timeout=30
[0:0:5:0]    disk    ATA      INTEL SSDSA2BW12 0362  /dev/sdf
  state=running queue_depth=32 scsi_level=7 type=0 device_blocked=0 timeout=30
[0:0:6:0]    disk    ATA      INTEL SSDSA2BW12 0362  /dev/sdg
  state=running queue_depth=32 scsi_level=7 type=0 device_blocked=0 timeout=30
[0:0:7:0]    disk    ATA      INTEL SSDSA2BW12 0362  /dev/sdh
  state=running queue_depth=32 scsi_level=7 type=0 device_blocked=0 timeout=30
root@ams4.dfw.netmgmt:~#

root@ams4.dfw.netmgmt:~# modinfo md_mod
filename:       /lib/modules/4.14.68-4.14.3-amd64-3adf3675665129fa/kernel/drivers/md/md-mod.ko
alias:          block-major-9-*
alias:          md
description:    MD RAID framework
license:        GPL
depends:
retpoline:      Y
intree:         Y
name:           md_mod
vermagic:       4.14.68-4.14.3-amd64-3adf3675665129fa SMP mod_unload modversions
parm:           start_dirty_degraded:int
parm:           create_on_open:bool
root@ams4.dfw.netmgmt:~#
Tom    :-)

#2 Vijaykumar Jain <vjain@opentable.com>
In reply to: Thomas Tignor (#1)
Re: [External] postgres 9.5 DB corruption: invalid byte sequence for encoding "UTF8"

I think the experts will chime in soon, but why do you think this is DB corruption and not just bad input?
https://github.com/postgres/postgres/blob/master/src/pl/plperl/expected/plperl_lc_1.out

Or it may also be an encoding issue.
https://pganalyze.com/docs/log-insights/app-errors/U137

Can you do a pg_dump and restore on a parallel instance? Does it result in failure?

We also ask the app to log inserted data (temporarily) so that we can determine directly whether there was bad data upstream, or add validations to prevent inserts of bad data.

Also, in our case the query was stuck at "PARSE" (visible via ps aux | grep postgres) and in some cases resulted in OOM, but upgrading the client and using session-mode pooling in pgbouncer worked for us.

Regards,
Vijay


#3 Brad Nicholson <bradn@ca.ibm.com>
In reply to: Vijaykumar Jain (#2)
Re: [External] postgres 9.5 DB corruption: invalid byte sequence for encoding "UTF8"

Vijaykumar Jain <vjain@opentable.com> wrote on 03/25/2019 03:07:19 PM:

but why do you think this is DB corruption and not just bad input?
https://github.com/postgres/postgres/blob/master/src/pl/plperl/expected/plperl_lc_1.out

This looked interesting to me in the settings below:

client_encoding | SQL_ASCII | client

Unless you have set this explicitly, it will use the default encoding for the database. If it hasn't been explicitly set, then the source database (assuming that output was from the source) is SQL_ASCII.

Double check the database encoding for the source database and target database. I'm wondering if you have SQL_ASCII for the source, and UTF8 for the target. If that is the case, you can take invalid UTF8 characters into the source, and they will fail to replicate to the target. That's not a Postgres problem, but an encoding mismatch.

Brad

#4 Thomas Tignor <tptignor@yahoo.com>
In reply to: Vijaykumar Jain (#2)
Re: [External] postgres 9.5 DB corruption: invalid byte sequence for encoding "UTF8"

Hi Vijay,

Thanks for writing. To your questions:

I think the experts will chime in soon, but why do you think this is DB corruption and not just bad input?
https://github.com/postgres/postgres/blob/master/src/pl/plperl/expected/plperl_lc_1.out

tptignor: Our JVM logs data before the insert and update ops. It all looks fine there, and is usually straight ASCII.

Or it may also be an encoding issue.
https://pganalyze.com/docs/log-insights/app-errors/U137

tptignor: There are a limited number of settings we apply to our driver, and encoding isn't one of them. I expect it's using the server-configured SQL_ASCII encoding, which I understand is like having no rules: it just takes data as is.
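For what it's worth, whatever encoding the driver actually negotiated is easy to confirm from any session it opens:

SELECT name, setting, source
FROM pg_settings
WHERE name = 'client_encoding';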

Can you do a pg_dump and restore on a parallel instance? Does it result in failure?
tptignor: We can do better than that. When this occurs, we isolate and COPY the problem data to file (which always works fine), then COPY it back in. (Without the COPY in ops, the corruption might even go unnoticed...) The error on COPY in shows us a line number that takes us right to the corrupted data.
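Roughly, that round-trip looks like this (a sketch; the scratch table and file path are illustrative):

\copy ams.alert_attribute TO '/tmp/alert_attribute.copy'
-- re-load into a scratch table of the same shape; COPY in re-validates
-- the encoding and reports the line of the first corrupt row:
CREATE TEMP TABLE alert_attribute_check (LIKE ams.alert_attribute);
\copy alert_attribute_check FROM '/tmp/alert_attribute.copy'
-- ERROR:  invalid byte sequence for encoding "UTF8": 0x86
-- CONTEXT:  COPY alert_attribute_check, line 123456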

We also ask the app to log inserted data (temporarily) so that we can determine directly whether there was bad data upstream, or add validations to prevent inserts of bad data.

Also, in our case the query was stuck at "PARSE" (visible via ps aux | grep postgres) and in some cases resulted in OOM, but upgrading the client and using session-mode pooling in pgbouncer worked for us.

tptignor: We depend on a JVM that works through the v42.0.0 postgres driver. We haven't taken a hard look at the driver yet, but certainly would if there were a reason. I would have expected the server to catch bad data on input from the driver the same way it catches it on COPY in.
Tom    :-)


#5 Thomas Tignor <tptignor@yahoo.com>
In reply to: Brad Nicholson (#3)
Re: [External] postgres 9.5 DB corruption: invalid byte sequence for encoding "UTF8"

Hi Brad,

Thanks for writing. As I mentioned to Vijay, the "source" is a JVM using the postgres v42.0.0 JDBC driver. I do not believe we have any explicit encoding set, and so I expect the client encoding is SQL_ASCII. The DB is most definitely UTF8. Our log shows no issue with the input data we've discovered (at the time it's logged). If the data is somehow corrupted before inserting, won't the server encoding kick in and generate an error? We can certainly test that.
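That should be easy to demonstrate in isolation (a sketch; the 0x86 byte mirrors our error logs):

-- with a UTF8 database, an invalid byte should be rejected at input time:
SELECT convert_from('\x86'::bytea, 'UTF8');
-- ERROR:  invalid byte sequence for encoding "UTF8": 0x86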
Tom    :-)


#6 Laurenz Albe <laurenz.albe@cybertec.at>
In reply to: Thomas Tignor (#1)
Re: postgres 9.5 DB corruption: invalid byte sequence for encoding "UTF8"

Thomas Tignor wrote:

We are experiencing intermittent DB corruption in postgres 9.5.14. We are trying to
identify and eliminate all sources. We are using two independent services for data
replication, Slony-I v2.2.6 and a custom service developed in-house. Both are based
on COPY operations. DB corruption is observed when COPY operations fail with an error
of the form: 'invalid byte sequence for encoding "UTF8"'.
This occurs at intervals ranging from a few days to several weeks.

It might be a storage problem, but it could also be a problem with how the data
get into the database in the first place.

I would consider it unlikely that there is a bug in the TOAST code that changes
just one byte.

You could use block checksums to see if the storage is at fault, but unfortunately
it requires an "initdb", dump and restore to switch to checksums.

Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com

#7 Thomas Tignor <tptignor@yahoo.com>
In reply to: Laurenz Albe (#6)
Re: postgres 9.5 DB corruption: invalid byte sequence for encoding "UTF8"

Hi Laurenz. Thanks for writing. I can tell you that while the error message usually identifies just one byte (though sometimes up to three), inspection of the data has frequently shown several bytes impacted. It often seems that the corruption begins at one point in the row and continues to the end.
At this point, we've taken a few mitigating steps: we have switched all the VARCHAR columns in the alert_instance and alert_attribute tables to use "plain" storage. Also, we've dropped a (name, value) index we were using for the alert_attribute table.
I looked at the checksums feature sometime earlier. This is something we could try, with some effort, but it appears a sum is applied to a page leaving shared_buffers and then checked when the page is read back. Our Slony-I replication service is pulling the latest changes into subscriber DBs every couple seconds. Normally, when data corruption occurs, it appears very quickly with a Slony-I COPY failure. It seems there may not be time to write a checksum.
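For the record, roughly what the checksum route involves (a sketch; on 9.5 checksums can only be enabled when the cluster is initialized, e.g. with initdb --data-checksums, hence the dump and restore):

-- on a cluster initialized with checksums enabled:
SHOW data_checksums;
--  data_checksums
-- ----------------
--  on
-- a failed page verification then surfaces in the server log as something like:
-- WARNING:  page verification failed, calculated checksum ... but expected ...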
Tom    :-)


#8 Brad Nicholson <bradn@ca.ibm.com>
In reply to: Thomas Tignor (#5)
Re: [External] postgres 9.5 DB corruption: invalid byte sequence for encoding "UTF8"

Thomas Tignor <tptignor@yahoo.com> wrote on 03/25/2019 08:25:49 PM:

Hi Brad,
Thanks for writing. As I mentioned to Vijay, the "source" is a JVM
using the postgres v42.0.0 JDBC driver. I do not believe we have any
explicit encoding set, and so I expect the client encoding is
SQL_ASCII. The DB is most definitely UTF8.

These statements are contradictory.

The value of client_encoding from your select on pg_settings is SQL_ASCII.

The docs clearly state:
https://www.postgresql.org/docs/9.5/runtime-config-client.html
"Sets the client-side encoding (character set). The default is to use the
database encoding. "

If you haven't set client_encoding explicitly, then it is using the database encoding.

Run this on both the primary and replica to confirm the DBs are indeed both in UTF8 encoding:

SELECT datname, pg_encoding_to_char(encoding) FROM pg_database ;

Brad

#9 Tom Lane <tgl@sss.pgh.pa.us>
In reply to: Brad Nicholson (#8)
Re: [External] postgres 9.5 DB corruption: invalid byte sequence for encoding "UTF8"

"Brad Nicholson" <bradn@ca.ibm.com> writes:

Thomas Tignor <tptignor@yahoo.com> wrote on 03/25/2019 08:25:49 PM:

Thanks for writing. As I mentioned to Vijay, the "source" is a JVM
using the postgres v42.0.0 JDBC driver. I do not believe we have any
explicit encoding set, and so I expect the client encoding is
SQL_ASCII. The DB is most definitely UTF8.

These statements are contradictory.

The value of client_encoding from your select on pg_settings is SQL_ASCII.

The docs clearly state:
https://www.postgresql.org/docs/9.5/runtime-config-client.html
"Sets the client-side encoding (character set). The default is to use the
database encoding. "

If you don't have client_encoding explicitly, then it is using the database
encoding.

Umm ... not necessarily. That bit in runtime-config-client.html
correctly states what the *server's* default for client_encoding is,
but an awful lot of client-side code will immediately override that.
psql will try to set it based on its LANG/LC_CTYPE environment,
for example. I don't know what JDBC does; it might be different.

But in any case, yes, it'd be more reliable to check pg_database.encoding
explicitly.

regards, tom lane

#10 Thomas Tignor <tptignor@yahoo.com>
In reply to: Tom Lane (#9)
Re: [External] postgres 9.5 DB corruption: invalid byte sequence for encoding "UTF8"

Hello again Brad, and hello Tom. Thanks for writing, and Tom I'll add some extra thanks for the many Google searches over the years that have been answered by your posts. :)

It seems there's been some confusion on the data flow, so I'll try to explain. The original "source" I referred to was our JVM generating alert data and performing DML on the DB via the v42.0.0 driver. I do not know of any encoding setting for the driver itself.
The driver writes directly to our primary (Slony-I master) DB. Changes are then pulled by our replica (Slony-I subscriber) DBs. Each of these DBs has a "UTF8" server encoding. COPY out ops (performed by Slony-I or anything else) always work fine, and COPY in ops fail if there is data corruption.

ams@ams3.dfw.netmgmt:/a/ams/lib/ext/database$ sum postgresql-42.0.0.jar
12476   691
ams@ams3.dfw.netmgmt:/a/ams/lib/ext/database$

ams@ams3.dfw.netmgmt:~$ echo ---- Slony-I Master ----
---- Slony-I Master ----
ams@ams3.dfw.netmgmt:~$ psql -U akamai -d ams -c "SELECT datname, pg_encoding_to_char(encoding) FROM pg_database WHERE datname = 'ams'"
 datname | pg_encoding_to_char
---------+---------------------
 ams     | UTF8
(1 row)

ams@ams3.dfw.netmgmt:~$

root@ams-repl7.sjc.netmgmt:~# echo ---- Slony-I Replica ----
---- Slony-I Replica ----
root@ams-repl7.sjc.netmgmt:~# psql -U akamai -d ams -c "SELECT datname, pg_encoding_to_char(encoding) FROM pg_database WHERE datname = 'ams'"
 datname | pg_encoding_to_char
---------+---------------------
 ams     | UTF8
(1 row)

root@ams-repl7.sjc.netmgmt:~#

Tom    :-)
