[PATCH] Report the query string that caused a memory error under Valgrind
We use Valgrind --together with the suppression file provided in Postgres repo-- to test Citus extension against memory errors.
We replace /bin/postgres executable with a simple bash script that executes the original postgres executable under Valgrind and then we run our usual regression tests.
However, it is quite hard to understand which query caused a memory error in the stack traces that has been dumped into valgrind logfile.
For this reason, we propose the attached patch to allow Valgrind to report the query string that caused a memory error right after the relevant stack trace.
I belive this would not only be useful for Citus but also for Postgres and other extensions in their valgrind-testing process.
An example piece of valgrind test output for a memory error found in Citus is as follows:
==67222== VALGRINDERROR-BEGIN
==67222== Invalid write of size 8
==67222== at 0x7A6F040: dlist_delete (home/pguser/postgres-installation/include/postgresql/server/lib/ilist.h:360)
==67222== by 0x7A6F040: ResetRemoteTransaction (home/pguser/citus/src/backend/distributed/transaction/remote_transaction.c:872)
==67222== by 0x79CF606: AfterXactHostConnectionHandling (home/pguser/citus/src/backend/distributed/connection/connection_management.c:1468)
==67222== by 0x79CF65E: AfterXactConnectionHandling (home/pguser/citus/src/backend/distributed/connection/connection_management.c:175)
==67222== by 0x7A6FEDA: CoordinatedTransactionCallback (home/pguser/citus/src/backend/distributed/transaction/transaction_management.c:309)
==67222== by 0x544F30: CallXactCallbacks (home/pguser/postgres-source/postgresql-15.1/src/backend/access/transam/xact.c:3661)
==67222== by 0x548E12: CommitTransaction (home/pguser/postgres-source/postgresql-15.1/src/backend/access/transam/xact.c:2298)
==67222== by 0x549BBC: CommitTransactionCommand (home/pguser/postgres-source/postgresql-15.1/src/backend/access/transam/xact.c:3048)
==67222== by 0x832C30: finish_xact_command (home/pguser/postgres-source/postgresql-15.1/src/backend/tcop/postgres.c:2750)
==67222== by 0x8352AF: exec_simple_query (home/pguser/postgres-source/postgresql-15.1/src/backend/tcop/postgres.c:1279)
==67222== by 0x837312: PostgresMain (home/pguser/postgres-source/postgresql-15.1/src/backend/tcop/postgres.c:4595)
==67222== by 0x79F7B5: BackendRun (home/pguser/postgres-source/postgresql-15.1/src/backend/postmaster/postmaster.c:4504)
==67222== by 0x7A24E6: BackendStartup (home/pguser/postgres-source/postgresql-15.1/src/backend/postmaster/postmaster.c:4232)
==67222== Address 0x7486378 is 3,512 bytes inside a recently re-allocated block of size 8,192 alloc'd
==67222== at 0x484486F: malloc (builddir/build/BUILD/valgrind-3.19.0/coregrind/m_replacemalloc/vg_replace_malloc.c:381)
==67222== by 0x98B6EB: AllocSetContextCreateInternal (home/pguser/postgres-source/postgresql-15.1/src/backend/utils/mmgr/aset.c:469)
==67222== by 0x79CEABA: InitializeConnectionManagement (home/pguser/citus/src/backend/distributed/connection/connection_management.c:107)
==67222== by 0x799FE9F: _PG_init (home/pguser/citus/src/backend/distributed/shared_library_init.c:464)
==67222== by 0x96AE6B: internal_load_library (home/pguser/postgres-source/postgresql-15.1/src/backend/utils/fmgr/dfmgr.c:289)
==67222== by 0x96B09A: load_file (home/pguser/postgres-source/postgresql-15.1/src/backend/utils/fmgr/dfmgr.c:156)
==67222== by 0x973122: load_libraries (home/pguser/postgres-source/postgresql-15.1/src/backend/utils/init/miscinit.c:1668)
==67222== by 0x974680: process_shared_preload_libraries (home/pguser/postgres-source/postgresql-15.1/src/backend/utils/init/miscinit.c:1686)
==67222== by 0x7A336A: PostmasterMain (home/pguser/postgres-source/postgresql-15.1/src/backend/postmaster/postmaster.c:1026)
==67222== by 0x6F303C: main (home/pguser/postgres-source/postgresql-15.1/src/backend/main/main.c:202)
==67222==
==67222== VALGRINDERROR-END
**67222** The query for which valgrind reported a memory error was: REFRESH MATERIALIZED VIEW other_schema.mat_view;
Attachments:
v1-0001-Report-the-query-string-that-caused-a-mem-error.patchapplication/octet-stream; name=v1-0001-Report-the-query-string-that-caused-a-mem-error.patchDownload
From 5eaca7af9a228ac5a675877362ca3be382e11431 Mon Sep 17 00:00:00 2001
From: Onur Tirtir <onurcantirtir@gmail.com>
Date: Tue, 31 Jan 2023 15:40:49 +0300
Subject: [PATCH v1] Report the query string that caused a memory error
If we run multiple queries under Valgrind, it becomes quite hard to
understand which query caused a memory error by looking into the stack
traces that Valgrind reported.
For this reason, this patch introduces some changes to
exec_simple_query() to report the queries that cause memory errors. At
the high-level:
* We compare number of memory errors reported by Valgrind before and
after this query-run by using VALGRIND_COUNT_ERRORS macro provided in
valgrind.h.
* If the query caused a memory error, then we report this query to
Valgrind log file (right after the relevant stack trace) by using
VALGRIND_PRINTF macro provided in valgrind.h.
---
src/backend/tcop/postgres.c | 15 +++++++++++++++
1 file changed, 15 insertions(+)
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 470b734e9e..e0032165ae 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -73,6 +73,9 @@
#include "utils/snapmgr.h"
#include "utils/timeout.h"
#include "utils/timestamp.h"
+#ifdef USE_VALGRIND
+#include <valgrind/valgrind.h>
+#endif
/* ----------------
* global variables
@@ -986,6 +989,9 @@ exec_simple_query(const char *query_string)
bool was_logged = false;
bool use_implicit_block;
char msec_str[32];
+#ifdef USE_VALGRIND
+ unsigned valgrind_errors_before_query = VALGRIND_COUNT_ERRORS;
+#endif
/*
* Report query to various monitoring facilities.
@@ -1343,6 +1349,15 @@ exec_simple_query(const char *query_string)
TRACE_POSTGRESQL_QUERY_DONE(query_string);
debug_query_string = NULL;
+
+#ifdef USE_VALGRIND
+ if (VALGRIND_COUNT_ERRORS > valgrind_errors_before_query)
+ {
+ VALGRIND_PRINTF("The query for which valgrind reported a "
+ "memory error was: %s\n",
+ query_string);
+ }
+#endif
}
/*
--
2.25.1
On 31.01.23 15:00, Onur Tirtir wrote:
We use Valgrind --together with the suppression file provided in
Postgres repo-- to test Citus extension against memory errors.We replace /bin/postgres executable with a simple bash script that
executes the original postgres executable under Valgrind and then we run
our usual regression tests.However, it is quite hard to understand which query caused a memory
error in the stack traces that has been dumped into valgrind logfile.For this reason, we propose the attached patch to allow Valgrind to
report the query string that caused a memory error right after the
relevant stack trace.I belive this would not only be useful for Citus but also for Postgres
and other extensions in their valgrind-testing process.
I can see how this could be useful. But this only handles queries using
the simple protocol. At least the extended protocol should be handled
as well. Maybe it would be better to move this up to PostgresMain() and
handle all protocol messages?
Hey Peter,
Thank you for reviewing the patch and for your feedback. I believe the v2 patch should be able to handle other protocol messages too.
-----Original Message-----
From: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Sent: Wednesday, March 22, 2023 7:00 PM
To: Onur Tirtir <Onur.Tirtir@microsoft.com>; pgsql-hackers@lists.postgresql.org
Subject: [EXTERNAL] Re: [PATCH] Report the query string that caused a memory error under Valgrind
[You don't often get email from peter.eisentraut@enterprisedb.com. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ]
On 31.01.23 15:00, Onur Tirtir wrote:
We use Valgrind --together with the suppression file provided in
Postgres repo-- to test Citus extension against memory errors.We replace /bin/postgres executable with a simple bash script that
executes the original postgres executable under Valgrind and then we
run our usual regression tests.However, it is quite hard to understand which query caused a memory
error in the stack traces that has been dumped into valgrind logfile.For this reason, we propose the attached patch to allow Valgrind to
report the query string that caused a memory error right after the
relevant stack trace.I belive this would not only be useful for Citus but also for Postgres
and other extensions in their valgrind-testing process.
I can see how this could be useful. But this only handles queries using the simple protocol. At least the extended protocol should be handled as well. Maybe it would be better to move this up to PostgresMain() and handle all protocol messages?
Attachments:
v2-0001-Report-the-query-string-that-caused-a-memory-erro.patchapplication/octet-stream; name=v2-0001-Report-the-query-string-that-caused-a-memory-erro.patchDownload
From 1d1aafdbe6b640ab66ec2fcf77852ba1bd7df084 Mon Sep 17 00:00:00 2001
From: Onur Tirtir <onurcantirtir@gmail.com>
Date: Thu, 23 Mar 2023 20:04:59 +0300
Subject: [PATCH v2] Report the query string that caused a memory error
If we run multiple queries under Valgrind, it becomes quite hard to
understand which query caused a memory error by looking into the stack
traces that Valgrind reported.
For this reason, this patch introduces some changes to
PostgresMain() to report the messages that cause memory errors. At
the high-level:
* We compare number of memory errors reported by Valgrind before and
after this message-run by using VALGRIND_COUNT_ERRORS macro provided
in valgrind.h.
* If the message caused a memory error, then we report this query to
Valgrind log file (right after the relevant stack trace) by
using VALGRIND_PRINTF macro provided in valgrind.h.
---
src/backend/tcop/postgres.c | 26 ++++++++++++++++++++++++++
1 file changed, 26 insertions(+)
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index cab709b07b..aeb8ad34d7 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -73,6 +73,9 @@
#include "utils/snapmgr.h"
#include "utils/timeout.h"
#include "utils/timestamp.h"
+#ifdef USE_VALGRIND
+#include <valgrind/valgrind.h>
+#endif
/* ----------------
* global variables
@@ -4070,6 +4073,9 @@ PostgresMain(const char *dbname, const char *username)
volatile bool send_ready_for_query = true;
bool idle_in_transaction_timeout_enabled = false;
bool idle_session_timeout_enabled = false;
+#ifdef USE_VALGRIND
+ unsigned valgrind_errors_before_message = 0;
+#endif
Assert(dbname != NULL);
Assert(username != NULL);
@@ -4365,6 +4371,10 @@ PostgresMain(const char *dbname, const char *username)
for (;;)
{
+#ifdef USE_VALGRIND
+ valgrind_errors_before_message = VALGRIND_COUNT_ERRORS;
+#endif
+
/*
* At top of loop, reset extended-query-message flag, so that any
* errors encountered in "idle" state don't provoke skip.
@@ -4802,6 +4812,22 @@ PostgresMain(const char *dbname, const char *username)
errmsg("invalid frontend message type %d",
firstchar)));
}
+
+#ifdef USE_VALGRIND
+
+ /*
+ * Report the message (and its type) that caused a memory error if
+ * it's not something that resulted in a backend exit or an ereport()
+ * call.
+ */
+ if (VALGRIND_COUNT_ERRORS > valgrind_errors_before_message)
+ {
+ VALGRIND_PRINTF("The message for which valgrind reported a "
+ "memory error was (of type '%c'):\n%s\n\n",
+ firstchar,
+ input_message.len ? input_message.data : "(empty)");
+ }
+#endif
} /* end of input-reading loop */
}
--
2.25.1
Onur Tirtir <Onur.Tirtir@microsoft.com> writes:
Thank you for reviewing the patch and for your feedback. I believe the v2 patch should be able to handle other protocol messages too.
I like the concept here, but the reporting that the v2 patch provides
would be seriously horrid: it's trying to print stuff that isn't
necessarily text, and for bind and execute messages it's substantially
dumber than the existing debug_query_string infrastructure. Another
thing that is not great is that if Postgres itself throws an error
later in the query, nothing will be reported since we don't reach the
bottom of the processing loop.
I suggest that we need something closer to the attached. Some
bikeshedding is possible on the specific printouts, but I'm not
sure it's worth working harder than this.
regards, tom lane
Attachments:
v3-0001-Report-the-query-string-that-caused-a-mem-error.patchtext/x-diff; charset=us-ascii; name*0=v3-0001-Report-the-query-string-that-caused-a-mem-error.pat; name*1=chDownload
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index cab709b07b..a10ecbaf50 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -27,6 +27,10 @@
#include <sys/socket.h>
#include <sys/time.h>
+#ifdef USE_VALGRIND
+#include <valgrind/valgrind.h>
+#endif
+
#include "access/parallel.h"
#include "access/printtup.h"
#include "access/xact.h"
@@ -191,6 +195,36 @@ static void enable_statement_timeout(void);
static void disable_statement_timeout(void);
+/* ----------------------------------------------------------------
+ * infrastructure for valgrind debugging
+ * ----------------------------------------------------------------
+ */
+#ifdef USE_VALGRIND
+/* This variable should be set at the top of the main loop. */
+static unsigned int old_valgrind_error_count;
+
+/*
+ * If Valgrind detected any errors since old_valgrind_error_count was updated,
+ * report the current query as the cause. This should be called at the end
+ * of message processing.
+ */
+static void
+valgrind_report_error_query(const char *query)
+{
+ unsigned int valgrind_error_count = VALGRIND_COUNT_ERRORS;
+
+ if (unlikely(valgrind_error_count != old_valgrind_error_count) &&
+ query != NULL)
+ VALGRIND_PRINTF("Valgrind detected %u error(s) during execution of \"%s\"\n",
+ valgrind_error_count - old_valgrind_error_count,
+ query);
+}
+
+#else /* !USE_VALGRIND */
+#define valgrind_report_error_query(query) ((void) 0)
+#endif /* USE_VALGRIND */
+
+
/* ----------------------------------------------------------------
* routines to obtain user input
* ----------------------------------------------------------------
@@ -2041,6 +2075,8 @@ exec_bind_message(StringInfo input_message)
if (save_log_statement_stats)
ShowUsage("BIND MESSAGE STATISTICS");
+ valgrind_report_error_query(debug_query_string);
+
debug_query_string = NULL;
}
@@ -2292,6 +2328,8 @@ exec_execute_message(const char *portal_name, long max_rows)
if (save_log_statement_stats)
ShowUsage("EXECUTE MESSAGE STATISTICS");
+ valgrind_report_error_query(debug_query_string);
+
debug_query_string = NULL;
}
@@ -4287,6 +4325,12 @@ PostgresMain(const char *dbname, const char *username)
/* Report the error to the client and/or server log */
EmitErrorReport();
+ /*
+ * If Valgrind noticed something during the erroneous query, print the
+ * query string, assuming we have one.
+ */
+ valgrind_report_error_query(debug_query_string);
+
/*
* Make sure debug_query_string gets reset before we possibly clobber
* the storage it points at.
@@ -4371,6 +4415,13 @@ PostgresMain(const char *dbname, const char *username)
*/
doing_extended_query_message = false;
+ /*
+ * For valgrind reporting purposes, the "current query" begins here.
+ */
+#ifdef USE_VALGRIND
+ old_valgrind_error_count = VALGRIND_COUNT_ERRORS;
+#endif
+
/*
* Release storage left over from prior query cycle, and create a new
* query input buffer in the cleared MessageContext.
@@ -4571,6 +4622,8 @@ PostgresMain(const char *dbname, const char *username)
else
exec_simple_query(query_string);
+ valgrind_report_error_query(query_string);
+
send_ready_for_query = true;
}
break;
@@ -4600,6 +4653,8 @@ PostgresMain(const char *dbname, const char *username)
exec_parse_message(query_string, stmt_name,
paramTypes, numParams);
+
+ valgrind_report_error_query(query_string);
}
break;
@@ -4614,6 +4669,8 @@ PostgresMain(const char *dbname, const char *username)
* the field extraction out-of-line
*/
exec_bind_message(&input_message);
+
+ /* exec_bind_message does valgrind_report_error_query */
break;
case 'E': /* execute */
@@ -4631,6 +4688,8 @@ PostgresMain(const char *dbname, const char *username)
pq_getmsgend(&input_message);
exec_execute_message(portal_name, max_rows);
+
+ /* exec_execute_message does valgrind_report_error_query */
}
break;
@@ -4664,6 +4723,8 @@ PostgresMain(const char *dbname, const char *username)
/* commit the function-invocation transaction */
finish_xact_command();
+ valgrind_report_error_query("fastpath function call");
+
send_ready_for_query = true;
break;
@@ -4708,6 +4769,8 @@ PostgresMain(const char *dbname, const char *username)
if (whereToSendOutput == DestRemote)
pq_putemptymessage('3'); /* CloseComplete */
+
+ valgrind_report_error_query("CLOSE message");
}
break;
@@ -4740,6 +4803,8 @@ PostgresMain(const char *dbname, const char *username)
describe_type)));
break;
}
+
+ valgrind_report_error_query("DESCRIBE message");
}
break;
@@ -4752,6 +4817,7 @@ PostgresMain(const char *dbname, const char *username)
case 'S': /* sync */
pq_getmsgend(&input_message);
finish_xact_command();
+ valgrind_report_error_query("SYNC message");
send_ready_for_query = true;
break;
Hey Tom,
Thank you for sharing your proposal as a patch. It looks much nicer and useful than mine.
I've also tested it for a few cases --by injecting a memory error on purpose-- and seen that it helps reporting the problematic query.
Should we move forward with v3 then?
==13210== VALGRINDERROR-BEGIN
==13210== Conditional jump or move depends on uninitialised value(s)
==13210== at 0x75B88C: exec_simple_query (home/onurctirtir/postgres/src/backend/tcop/postgres.c:1070)
==13210== by 0x760ABB: PostgresMain (home/onurctirtir/postgres/src/backend/tcop/postgres.c:4624)
==13210== by 0x688F1A: BackendRun (home/onurctirtir/postgres/src/backend/postmaster/postmaster.c:4461)
==13210== by 0x688801: BackendStartup (home/onurctirtir/postgres/src/backend/postmaster/postmaster.c:4189)
==13210== by 0x684D21: ServerLoop (home/onurctirtir/postgres/src/backend/postmaster/postmaster.c:1779)
==13210== by 0x6845F6: PostmasterMain (home/onurctirtir/postgres/src/backend/postmaster/postmaster.c:1463)
==13210== by 0x540351: main (home/onurctirtir/postgres/src/backend/main/main.c:200)
==13210== Uninitialised value was created by a heap allocation
==13210== at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==13210== by 0x75B812: exec_simple_query (home/onurctirtir/postgres/src/backend/tcop/postgres.c:1023)
==13210== by 0x760ABB: PostgresMain (home/onurctirtir/postgres/src/backend/tcop/postgres.c:4624)
==13210== by 0x688F1A: BackendRun (home/onurctirtir/postgres/src/backend/postmaster/postmaster.c:4461)
==13210== by 0x688801: BackendStartup (home/onurctirtir/postgres/src/backend/postmaster/postmaster.c:4189)
==13210== by 0x684D21: ServerLoop (home/onurctirtir/postgres/src/backend/postmaster/postmaster.c:1779)
==13210== by 0x6845F6: PostmasterMain (home/onurctirtir/postgres/src/backend/postmaster/postmaster.c:1463)
==13210== by 0x540351: main (home/onurctirtir/postgres/src/backend/main/main.c:200)
==13210==
==13210== VALGRINDERROR-END
**13210** Valgrind detected 1 error(s) during execution of "select 1;"
**13210** Valgrind detected 1 error(s) during execution of "select 1;"
Best, Onur
-----Original Message-----
From: Tom Lane <tgl@sss.pgh.pa.us>
Sent: Sunday, April 2, 2023 11:14 PM
To: Onur Tirtir <Onur.Tirtir@microsoft.com>
Cc: peter.eisentraut@enterprisedb.com; pgsql-hackers@lists.postgresql.org
Subject: Re: [EXTERNAL] Re: [PATCH] Report the query string that caused a memory error under Valgrind
[You don't often get email from tgl@sss.pgh.pa.us. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ]
Onur Tirtir <Onur.Tirtir@microsoft.com> writes:
Thank you for reviewing the patch and for your feedback. I believe the v2 patch should be able to handle other protocol messages too.
I like the concept here, but the reporting that the v2 patch provides would be seriously horrid: it's trying to print stuff that isn't necessarily text, and for bind and execute messages it's substantially dumber than the existing debug_query_string infrastructure. Another thing that is not great is that if Postgres itself throws an error later in the query, nothing will be reported since we don't reach the bottom of the processing loop.
I suggest that we need something closer to the attached. Some bikeshedding is possible on the specific printouts, but I'm not sure it's worth working harder than this.
regards, tom lane
Onur Tirtir <Onur.Tirtir@microsoft.com> writes:
Thank you for sharing your proposal as a patch. It looks much nicer and useful than mine.
I've also tested it for a few cases --by injecting a memory error on purpose-- and seen that it helps reporting the problematic query.
Should we move forward with v3 then?
OK, I pushed v3 as-is. We can refine it later if anyone has suggestions.
Thanks for the contribution!
regards, tom lane