PATCH: backtraces for error messages
Hi folks
I recently needed a way to get backtraces from errors in a convenient,
non-interactive and indiscriminate way. The attached patch is the result.
It teaches Pg to use libunwind to self-backtrace in elog/ereport.
Anyone think this is useful/interesting/worth pursuing?
(Patch is currently against pg10, so this is a PoC only).
As written it emits a backtrace when log_error_verbosity=verbose or,
unconditionally, on PANIC. A bt can be hidden by errhidestack() but is
otherwise shown. That's ridiculously, excessively spammy, so it's not
viable for core as-is. Before playing with it too much I thought I'd ask
for ideas on if anyone thinks it's useful, and if so, how it'd work best.
My goal is to allow capture of extra diagnostic info from key locations in
production without needing to attach gdb. It's imperfect since sometimes
there's no convenient message, and other times you can't afford to set up
logging with enough detail. So this would be most useful when combined with
one of the occasionally discussed patches to allow for selective logging
verbosity on a module- or file- level. But I think it's still handy without
that.
I briefly looked into Windows too. Roughly the same approach could be used
to plug in dbghelp.dll support for Windows self-backtracing, it's just
rather uglier; see
https://jpassing.com/2008/03/12/walking-the-stack-of-the-current-thread/ .
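Roughly, a self-contained sketch of that approach, using CaptureStackBackTrace/SymFromAddr rather than the full StackWalk dance from that article; this is illustrative only and not part of the patch (function name is made up, link against Dbghelp.lib):

/* Hypothetical sketch of dbghelp-based self-backtracing on Windows. */
#include <windows.h>
#include <dbghelp.h>
#include <stdio.h>

static void
print_win32_backtrace(void)
{
    void       *frames[60];
    HANDLE      proc = GetCurrentProcess();
    USHORT      nframes;
    USHORT      i;

    /* Real code would call SymInitialize() once at startup, not per trace. */
    SymInitialize(proc, NULL, TRUE);

    /* Skip our own frame. */
    nframes = CaptureStackBackTrace(1, 60, frames, NULL);

    for (i = 0; i < nframes; i++)
    {
        char        symbuf[sizeof(SYMBOL_INFO) + 256];
        SYMBOL_INFO *sym = (SYMBOL_INFO *) symbuf;
        DWORD64     offset = 0;

        sym->SizeOfStruct = sizeof(SYMBOL_INFO);
        sym->MaxNameLen = 255;

        if (SymFromAddr(proc, (DWORD64) (ULONG_PTR) frames[i], &offset, sym))
            fprintf(stderr, "FRAME %2u: %s +%llu\n",
                    i, sym->Name, (unsigned long long) offset);
        else
            fprintf(stderr, "FRAME %2u: < ?? > ip=%p\n", i, frames[i]);
    }
}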
BTW, Álvaro posted a simpler patch at
/messages/by-id/20180410213203.nl645o5vj5ap66sl@alvherre.pgsql.
It uses backtrace() from glibc, and requires each site you want to bt to be
annotated explicitly. I forgot about backtrace() completely when I wrote
mine, and I prefer the control libunwind gives me anyway, but the reduced
dependency would be nice. Especially since backtrace() is in FreeBSD too.
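For comparison, the backtrace() approach boils down to something like this minimal sketch (not taken from Álvaro's patch; a real elog integration would probably want backtrace_symbols_fd() or similar, since backtrace_symbols() allocates with malloc):

/* Minimal sketch of glibc backtrace(); illustrative only. */
#include <execinfo.h>
#include <stdio.h>
#include <stdlib.h>

static void
print_glibc_backtrace(void)
{
    void   *frames[32];
    char  **symbols;
    int     nframes;
    int     i;

    nframes = backtrace(frames, 32);
    symbols = backtrace_symbols(frames, nframes);
    if (symbols == NULL)
        return;

    /* Skip frame 0, which is this function itself. */
    for (i = 1; i < nframes; i++)
        fprintf(stderr, "FRAME %2d: %s\n", i - 1, symbols[i]);

    free(symbols);
}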
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
v2-0001-Support-generating-backtraces-in-logs-using-libun.patch (text/x-patch; charset=US-ASCII)
From d9ff6bd8112b31d087d8442f25ffd430df794771 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Wed, 20 Jun 2018 10:57:31 +0800
Subject: [PATCH v2] Support generating backtraces in logs using libunwind
---
configure | 349 ++++++++++++++++++++++++++++-------------
configure.in | 22 +++
src/Makefile.global.in | 6 +-
src/backend/Makefile | 4 +-
src/backend/utils/error/elog.c | 112 +++++++++++++
src/include/pg_config.h.in | 6 +
src/include/pg_config_manual.h | 2 +-
src/include/utils/elog.h | 12 ++
8 files changed, 399 insertions(+), 114 deletions(-)
diff --git a/configure b/configure
index 2d5375d51c..87a7d60f3f 100755
--- a/configure
+++ b/configure
@@ -655,6 +655,8 @@ LIBOBJS
UUID_LIBS
LDAP_LIBS_BE
LDAP_LIBS_FE
+LIBUNWIND_LIBS
+LIBUNWIND_CFLAGS
PTHREAD_CFLAGS
PTHREAD_LIBS
PTHREAD_CC
@@ -716,12 +718,12 @@ with_perl
with_tcl
ICU_LIBS
ICU_CFLAGS
-PKG_CONFIG_LIBDIR
-PKG_CONFIG_PATH
-PKG_CONFIG
with_icu
enable_thread_safety
INCLUDES
+PKG_CONFIG_LIBDIR
+PKG_CONFIG_PATH
+PKG_CONFIG
autodepend
TAS
GCC
@@ -846,6 +848,7 @@ with_uuid
with_ossp_uuid
with_libxml
with_libxslt
+with_libunwind
with_system_tzdata
with_zlib
with_gnu_ld
@@ -868,7 +871,9 @@ PKG_CONFIG_LIBDIR
ICU_CFLAGS
ICU_LIBS
LDFLAGS_EX
-LDFLAGS_SL'
+LDFLAGS_SL
+LIBUNWIND_CFLAGS
+LIBUNWIND_LIBS'
# Initialize some variables set by options.
@@ -1543,6 +1548,7 @@ Optional Packages:
--with-ossp-uuid obsolete spelling of --with-uuid=ossp
--with-libxml build with XML support
--with-libxslt use XSLT support when building contrib/xml2
+ --with-libunwind use libunwind for enhanced error context stacks
--with-system-tzdata=DIR
use system time zone data in DIR
--without-zlib do not use Zlib
@@ -1566,6 +1572,10 @@ Some influential environment variables:
ICU_LIBS linker flags for ICU, overriding pkg-config
LDFLAGS_EX extra linker flags for linking executables only
LDFLAGS_SL extra linker flags for linking shared libraries only
+ LIBUNWIND_CFLAGS
+ override cflags used when building with libunwind
+ LIBUNWIND_LIBS
+ override libraries linked when building with libunwind
Use these variables to override the choices made by `configure' or to help
it to find libraries and programs with nonstandard names/locations.
@@ -5359,112 +5369,8 @@ fi
#
-# Include directories
+# Look for pkg-config
#
-ac_save_IFS=$IFS
-IFS="${IFS}${PATH_SEPARATOR}"
-# SRCH_INC comes from the template file
-for dir in $with_includes $SRCH_INC; do
- if test -d "$dir"; then
- INCLUDES="$INCLUDES -I$dir"
- else
- { $as_echo "$as_me:${as_lineno-$LINENO}: WARNING: *** Include directory $dir does not exist." >&5
-$as_echo "$as_me: WARNING: *** Include directory $dir does not exist." >&2;}
- fi
-done
-IFS=$ac_save_IFS
-
-
-
-#
-# Library directories
-#
-ac_save_IFS=$IFS
-IFS="${IFS}${PATH_SEPARATOR}"
-# LIBRARY_DIRS comes from command line, SRCH_LIB from template file.
-for dir in $LIBRARY_DIRS $SRCH_LIB; do
- if test -d "$dir"; then
- LIBDIRS="$LIBDIRS -L$dir"
- else
- { $as_echo "$as_me:${as_lineno-$LINENO}: WARNING: *** Library directory $dir does not exist." >&5
-$as_echo "$as_me: WARNING: *** Library directory $dir does not exist." >&2;}
- fi
-done
-IFS=$ac_save_IFS
-
-#
-# Enable thread-safe client libraries
-#
-{ $as_echo "$as_me:${as_lineno-$LINENO}: checking allow thread-safe client libraries" >&5
-$as_echo_n "checking allow thread-safe client libraries... " >&6; }
-
-
-# Check whether --enable-thread-safety was given.
-if test "${enable_thread_safety+set}" = set; then :
- enableval=$enable_thread_safety;
- case $enableval in
- yes)
- :
- ;;
- no)
- :
- ;;
- *)
- as_fn_error $? "no argument expected for --enable-thread-safety option" "$LINENO" 5
- ;;
- esac
-
-else
- enable_thread_safety=yes
-
-fi
-
-
-if test "$enable_thread_safety" = yes; then
-
-$as_echo "#define ENABLE_THREAD_SAFETY 1" >>confdefs.h
-
-fi
-{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $enable_thread_safety" >&5
-$as_echo "$enable_thread_safety" >&6; }
-
-
-#
-# ICU
-#
-{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether to build with ICU support" >&5
-$as_echo_n "checking whether to build with ICU support... " >&6; }
-
-
-
-# Check whether --with-icu was given.
-if test "${with_icu+set}" = set; then :
- withval=$with_icu;
- case $withval in
- yes)
-
-$as_echo "#define USE_ICU 1" >>confdefs.h
-
- ;;
- no)
- :
- ;;
- *)
- as_fn_error $? "no argument expected for --with-icu option" "$LINENO" 5
- ;;
- esac
-
-else
- with_icu=no
-
-fi
-
-
-{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $with_icu" >&5
-$as_echo "$with_icu" >&6; }
-
-
-if test "$with_icu" = yes; then
@@ -5584,8 +5490,116 @@ $as_echo "yes" >&6; }
$as_echo "no" >&6; }
PKG_CONFIG=""
fi
+fi;
+
+#
+# Include directories
+#
+ac_save_IFS=$IFS
+IFS="${IFS}${PATH_SEPARATOR}"
+# SRCH_INC comes from the template file
+for dir in $with_includes $SRCH_INC; do
+ if test -d "$dir"; then
+ INCLUDES="$INCLUDES -I$dir"
+ else
+ { $as_echo "$as_me:${as_lineno-$LINENO}: WARNING: *** Include directory $dir does not exist." >&5
+$as_echo "$as_me: WARNING: *** Include directory $dir does not exist." >&2;}
+ fi
+done
+IFS=$ac_save_IFS
+
+
+
+#
+# Library directories
+#
+ac_save_IFS=$IFS
+IFS="${IFS}${PATH_SEPARATOR}"
+# LIBRARY_DIRS comes from command line, SRCH_LIB from template file.
+for dir in $LIBRARY_DIRS $SRCH_LIB; do
+ if test -d "$dir"; then
+ LIBDIRS="$LIBDIRS -L$dir"
+ else
+ { $as_echo "$as_me:${as_lineno-$LINENO}: WARNING: *** Library directory $dir does not exist." >&5
+$as_echo "$as_me: WARNING: *** Library directory $dir does not exist." >&2;}
+ fi
+done
+IFS=$ac_save_IFS
+
+#
+# Enable thread-safe client libraries
+#
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking allow thread-safe client libraries" >&5
+$as_echo_n "checking allow thread-safe client libraries... " >&6; }
+
+
+# Check whether --enable-thread-safety was given.
+if test "${enable_thread_safety+set}" = set; then :
+ enableval=$enable_thread_safety;
+ case $enableval in
+ yes)
+ :
+ ;;
+ no)
+ :
+ ;;
+ *)
+ as_fn_error $? "no argument expected for --enable-thread-safety option" "$LINENO" 5
+ ;;
+ esac
+
+else
+ enable_thread_safety=yes
+
fi
+
+if test "$enable_thread_safety" = yes; then
+
+$as_echo "#define ENABLE_THREAD_SAFETY 1" >>confdefs.h
+
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $enable_thread_safety" >&5
+$as_echo "$enable_thread_safety" >&6; }
+
+
+#
+# ICU
+#
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether to build with ICU support" >&5
+$as_echo_n "checking whether to build with ICU support... " >&6; }
+
+
+
+# Check whether --with-icu was given.
+if test "${with_icu+set}" = set; then :
+ withval=$with_icu;
+ case $withval in
+ yes)
+
+$as_echo "#define USE_ICU 1" >>confdefs.h
+
+ ;;
+ no)
+ :
+ ;;
+ *)
+ as_fn_error $? "no argument expected for --with-icu option" "$LINENO" 5
+ ;;
+ esac
+
+else
+ with_icu=no
+
+fi
+
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $with_icu" >&5
+$as_echo "$with_icu" >&6; }
+
+
+if test "$with_icu" = yes; then
+
pkg_failed=no
{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for icu-uc icu-i18n" >&5
$as_echo_n "checking for icu-uc icu-i18n... " >&6; }
@@ -6402,6 +6416,35 @@ fi
+#
+# libunwind
+#
+
+
+
+# Check whether --with-libunwind was given.
+if test "${with_libunwind+set}" = set; then :
+ withval=$with_libunwind;
+ case $withval in
+ yes)
+
+$as_echo "#define USE_LIBUNWIND 1" >>confdefs.h
+
+ ;;
+ no)
+ :
+ ;;
+ *)
+ as_fn_error $? "no argument expected for --with-libunwind option" "$LINENO" 5
+ ;;
+ esac
+
+else
+ with_libunwind=no
+
+fi
+
+
@@ -10375,6 +10418,94 @@ fi
fi
+if test "x$with_libunwind" = xyes; then :
+
+
+
+
+pkg_failed=no
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for libunwind" >&5
+$as_echo_n "checking for libunwind... " >&6; }
+
+if test -n "$LIBUNWIND_CFLAGS"; then
+ pkg_cv_LIBUNWIND_CFLAGS="$LIBUNWIND_CFLAGS"
+ elif test -n "$PKG_CONFIG"; then
+ if test -n "$PKG_CONFIG" && \
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: \$PKG_CONFIG --exists --print-errors \"libunwind\""; } >&5
+ ($PKG_CONFIG --exists --print-errors "libunwind") 2>&5
+ ac_status=$?
+ $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+ test $ac_status = 0; }; then
+ pkg_cv_LIBUNWIND_CFLAGS=`$PKG_CONFIG --cflags "libunwind" 2>/dev/null`
+ test "x$?" != "x0" && pkg_failed=yes
+else
+ pkg_failed=yes
+fi
+ else
+ pkg_failed=untried
+fi
+if test -n "$LIBUNWIND_LIBS"; then
+ pkg_cv_LIBUNWIND_LIBS="$LIBUNWIND_LIBS"
+ elif test -n "$PKG_CONFIG"; then
+ if test -n "$PKG_CONFIG" && \
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: \$PKG_CONFIG --exists --print-errors \"libunwind\""; } >&5
+ ($PKG_CONFIG --exists --print-errors "libunwind") 2>&5
+ ac_status=$?
+ $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+ test $ac_status = 0; }; then
+ pkg_cv_LIBUNWIND_LIBS=`$PKG_CONFIG --libs "libunwind" 2>/dev/null`
+ test "x$?" != "x0" && pkg_failed=yes
+else
+ pkg_failed=yes
+fi
+ else
+ pkg_failed=untried
+fi
+
+
+
+if test $pkg_failed = yes; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
+$as_echo "no" >&6; }
+
+if $PKG_CONFIG --atleast-pkgconfig-version 0.20; then
+ _pkg_short_errors_supported=yes
+else
+ _pkg_short_errors_supported=no
+fi
+ if test $_pkg_short_errors_supported = yes; then
+ LIBUNWIND_PKG_ERRORS=`$PKG_CONFIG --short-errors --print-errors --cflags --libs "libunwind" 2>&1`
+ else
+ LIBUNWIND_PKG_ERRORS=`$PKG_CONFIG --print-errors --cflags --libs "libunwind" 2>&1`
+ fi
+ # Put the nasty error message in config.log where it belongs
+ echo "$LIBUNWIND_PKG_ERRORS" >&5
+
+
+ as_fn_error $? "library 'libunwind' is required for enhanced error stack support, check for libunwind.pc" "$LINENO" 5
+
+elif test $pkg_failed = untried; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
+$as_echo "no" >&6; }
+
+ as_fn_error $? "library 'libunwind' is required for enhanced error stack support, check for libunwind.pc" "$LINENO" 5
+
+else
+ LIBUNWIND_CFLAGS=$pkg_cv_LIBUNWIND_CFLAGS
+ LIBUNWIND_LIBS=$pkg_cv_LIBUNWIND_LIBS
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: yes" >&5
+$as_echo "yes" >&6; }
+
+
+fi
+
+else
+ { $as_echo "$as_me:${as_lineno-$LINENO}: libunwind not configured" >&5
+$as_echo "$as_me: libunwind not configured" >&6;}
+fi
+
+
+
# Note: We can test for libldap_r only after we know PTHREAD_LIBS
if test "$with_ldap" = yes ; then
_LIBS="$LIBS"
diff --git a/configure.in b/configure.in
index 05dce2808f..181f776fd4 100644
--- a/configure.in
+++ b/configure.in
@@ -573,6 +573,11 @@ PGAC_ARG_BOOL(enable, cassert, no, [enable assertion checks (for debugging)],
[Define to 1 to build with assertion checks. (--enable-cassert)])])
+#
+# Look for pkg-config
+#
+PKG_PROG_PKG_CONFIG;
+
#
# Include directories
#
@@ -839,6 +844,11 @@ AC_SUBST(with_libxml)
PGAC_ARG_BOOL(with, libxslt, no, [use XSLT support when building contrib/xml2],
[AC_DEFINE([USE_LIBXSLT], 1, [Define to 1 to use XSLT support when building contrib/xml2. (--with-libxslt)])])
+#
+# libunwind
+#
+PGAC_ARG_BOOL(with, libunwind, no, [use libunwind for enhanced error context stacks],
+ [AC_DEFINE([USE_LIBUNWIND], 1, [Define to 1 to use libunwind for more error details. (--with-libunwind)])])
AC_SUBST(with_libxslt)
@@ -1120,6 +1130,18 @@ if test "$with_libxslt" = yes ; then
AC_CHECK_LIB(xslt, xsltCleanupGlobals, [], [AC_MSG_ERROR([library 'xslt' is required for XSLT support])])
fi
+AS_IF([test "x$with_libunwind" = xyes], [
+ AC_ARG_VAR(LIBUNWIND_CFLAGS, [override cflags used when building with libunwind])
+ AC_ARG_VAR(LIBUNWIND_LIBS, [override libraries linked when building with libunwind])
+ PKG_CHECK_MODULES(LIBUNWIND, libunwind,
+ [
+ ], [
+ AC_MSG_ERROR([library 'libunwind' is required for enhanced error stack support, check for libunwind.pc])
+ ])
+], [AC_MSG_NOTICE([libunwind not configured]) ])
+AC_SUBST([LIBUNWIND_CFLAGS])
+AC_SUBST([LIBUNWIND_LIBS])
+
# Note: We can test for libldap_r only after we know PTHREAD_LIBS
if test "$with_ldap" = yes ; then
_LIBS="$LIBS"
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index e8b3a519cb..23f1da398a 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -222,6 +222,9 @@ TCL_SHLIB_LD_LIBS = @TCL_SHLIB_LD_LIBS@
PTHREAD_CFLAGS = @PTHREAD_CFLAGS@
PTHREAD_LIBS = @PTHREAD_LIBS@
+LIBUNWIND_CFLAGS = @LIBUNWIND_CFLAGS@
+LIBUNWIND_LIBS = @LIBUNWIND_LIBS@
+
##########################################################################
#
@@ -232,7 +235,7 @@ PTHREAD_LIBS = @PTHREAD_LIBS@
CPP = @CPP@
CPPFLAGS = @CPPFLAGS@
-override CPPFLAGS := $(ICU_CFLAGS) $(CPPFLAGS)
+override CPPFLAGS := $(ICU_CFLAGS) $(LIBUNWIND_CFLAGS) $(CPPFLAGS)
ifdef PGXS
override CPPFLAGS := -I$(includedir_server) -I$(includedir_internal) $(CPPFLAGS)
@@ -623,7 +626,6 @@ ifdef PROFILE
LDFLAGS += $(PROFILE)
endif
-
##########################################################################
#
# substitute implementations of C library routines (see src/port/)
diff --git a/src/backend/Makefile b/src/backend/Makefile
index 2640834d5f..eaab459918 100644
--- a/src/backend/Makefile
+++ b/src/backend/Makefile
@@ -39,8 +39,8 @@ OBJS = $(SUBDIROBJS) $(LOCALOBJS) $(top_builddir)/src/port/libpgport_srv.a \
$(top_builddir)/src/common/libpgcommon_srv.a
# We put libpgport and libpgcommon into OBJS, so remove it from LIBS; also add
-# libldap and ICU
-LIBS := $(filter-out -lpgport -lpgcommon, $(LIBS)) $(LDAP_LIBS_BE) $(ICU_LIBS)
+# libldap, ICU and libunwind
+LIBS := $(filter-out -lpgport -lpgcommon, $(LIBS)) $(LDAP_LIBS_BE) $(ICU_LIBS) $(LIBUNWIND_LIBS)
# The backend doesn't need everything that's in LIBS, however
LIBS := $(filter-out -lz -lreadline -ledit -ltermcap -lncurses -lcurses, $(LIBS))
diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c
index d10a658e3d..6ac5b8457f 100644
--- a/src/backend/utils/error/elog.c
+++ b/src/backend/utils/error/elog.c
@@ -217,6 +217,20 @@ err_gettext(const char *str)
#endif
}
+static inline void
+errsavestack(ErrorData *edata)
+{
+ /*
+ * We need to capture the stack unwinding context now because not everybody
+ * reaches errfinish within the ereport(...) macro, lots of call paths call
+ * EmitErrorReport directly.
+ */
+#if USE_LIBUNWIND
+ if (!edata->hide_stack)
+ if (unw_getcontext((unw_context_t*)&edata->unwind_context) == 0)
+ edata->unwind_context_valid = true;
+#endif
+}
/*
* errstart --- begin an error-reporting cycle
@@ -389,6 +403,7 @@ errstart(int elevel, const char *filename, int lineno,
edata->sqlerrcode = ERRCODE_WARNING;
else
edata->sqlerrcode = ERRCODE_SUCCESSFUL_COMPLETION;
+ errsavestack(edata);
/* errno is saved here so that error parameter eval can't change it */
edata->saved_errno = errno;
@@ -1096,6 +1111,24 @@ errhidecontext(bool hide_ctx)
return 0; /* return value does not matter */
}
+/*
+ * errhidestack --- optionally suppress STACK: field of log entry
+ *
+ * This should only be used for verbose debugging messages where the repeated
+ * inclusion of context would bloat the log volume too much.
+ */
+int
+errhidestack(bool hide_stack)
+{
+ ErrorData *edata = &errordata[errordata_stack_depth];
+
+ /* we don't bother incrementing recursion_depth */
+ CHECK_STACK_DEPTH();
+
+ edata->hide_stack = hide_stack;
+
+ return 0; /* return value does not matter */
+}
/*
* errfunction --- add reporting function name to the current error
@@ -1334,6 +1367,7 @@ elog_start(const char *filename, int lineno, const char *funcname)
edata->filename = filename;
edata->lineno = lineno;
edata->funcname = funcname;
+ errsavestack(edata);
/* errno is saved now so that error parameter eval can't change it */
edata->saved_errno = errno;
@@ -2619,6 +2653,63 @@ log_line_prefix(StringInfo buf, ErrorData *edata)
}
}
+static void
+append_unwind_backtrace(StringInfo buf, ErrorData *edata, const char *missing_str)
+{
+#if USE_LIBUNWIND
+#define MAX_FUNCNAME_LENGTH 100
+ unw_cursor_t cursor;
+
+ Assert(edata->unwind_context_valid);
+
+ if (unw_init_local(&cursor, (unw_context_t*)&edata->unwind_context) == 0)
+ {
+ int frameno = 0;
+
+ /* We don't want to see errstart or elog_start in the stack */
+ unw_step(&cursor);
+
+ while (unw_step(&cursor) > 0)
+ {
+ unw_word_t offp;
+ char frame_funcname[MAX_FUNCNAME_LENGTH];
+ int ret;
+ ret = unw_get_proc_name(&cursor, &frame_funcname[0], MAX_FUNCNAME_LENGTH, &offp);
+ if (ret == 0)
+ {
+ // StartupXLOG is 12000 bytes. So question offsets lots bigger than it,
+ // we might've failed to resolve a symbol. This helps catch things like
+ // FRAME 8: _fini +0x1d4737
+ //
+ // but we must make an exception for _yy_ symbols, since we know the parser
+ // is huge. We mainly want to catch things like _fini.
+ bool symbol_suspect = offp > 20000 && strncmp(&frame_funcname[0], "_yy", 3) != 0;
+ appendStringInfo(buf, "\n\tFRAME %4d: %s%s +%-4ld",
+ frameno, symbol_suspect ? "<suspect> " : "",
+ frame_funcname, (long)offp);
+ }
+ else
+ {
+ unw_word_t ip, sp;
+ unw_get_reg(&cursor, UNW_REG_IP, &ip);
+ unw_get_reg(&cursor, UNW_REG_SP, &sp);
+ appendStringInfo(buf, "\n\tFRAME %4d: < ?? > ip=0x%lx sp=0x%lx", frameno, (long)ip, (long)sp);
+ }
+ frameno ++;
+ // Abbreviate stacks at known endpoints
+ if (strcmp(&frame_funcname[0], "PostmasterMain") == 0
+ || strcmp(&frame_funcname[0], "PostgresMain") == 0
+ || strcmp(&frame_funcname[0], "StartupProcessMain") == 0
+ || strcmp(&frame_funcname[0], "AuxiliaryProcessMain") == 0)
+ break;
+ }
+ appendStringInfoString(buf, "\n");
+ }
+ else
+ appendStringInfoString(buf, missing_str);
+#endif
+}
+
/*
* append a CSV'd version of a string to a StringInfo
* We use the PostgreSQL defaults for CSV, i.e. quote = escape = '"'
@@ -2830,6 +2921,18 @@ write_csvlog(ErrorData *edata)
if (application_name)
appendCSVLiteral(&buf, application_name);
+ /* Call stack if recorded */
+ if ((Log_error_verbosity >= PGERROR_VERBOSE
+ || edata->elevel == PANIC)
+ && !edata->hide_stack && edata->unwind_context_valid)
+ {
+ StringInfoData bt;
+ initStringInfo(&bt);
+ append_unwind_backtrace(&bt, edata, "");
+ appendCSVLiteral(&buf, bt.data);
+ pfree(bt.data);
+ }
+
appendStringInfoChar(&buf, '\n');
/* If in the syslogger process, try to write messages direct to file */
@@ -2949,6 +3052,15 @@ send_message_to_server_log(ErrorData *edata)
edata->filename, edata->lineno);
}
}
+
+ if ((Log_error_verbosity >= PGERROR_VERBOSE
+ || edata->elevel == PANIC)
+ && !edata->hide_stack && edata->unwind_context_valid)
+ {
+ log_line_prefix(&buf, edata);
+ appendStringInfoString(&buf, _("STACK: "));
+ append_unwind_backtrace(&buf, edata, _("< no stack >"));
+ }
}
/*
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 79986e9241..4537644e63 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -339,6 +339,9 @@
/* Define to 1 if you have the `xslt' library (-lxslt). */
#undef HAVE_LIBXSLT
+/* Define to 1 if you have the `unwind' library (-lunwind). */
+#undef HAVE_LIBUNWIND
+
/* Define to 1 if you have the `z' library (-lz). */
#undef HAVE_LIBZ
@@ -834,6 +837,9 @@
(--with-libxslt) */
#undef USE_LIBXSLT
+/* Define to 1 to use libunwind support */
+#undef USE_LIBUNWIND
+
/* Define to select named POSIX semaphores. */
#undef USE_NAMED_POSIX_SEMAPHORES
diff --git a/src/include/pg_config_manual.h b/src/include/pg_config_manual.h
index f3b35297d1..8612a029e9 100644
--- a/src/include/pg_config_manual.h
+++ b/src/include/pg_config_manual.h
@@ -256,7 +256,7 @@
* You should normally use MEMORY_CONTEXT_CHECKING with USE_VALGRIND;
* instrumentation of repalloc() is inferior without it.
*/
-/* #define USE_VALGRIND */
+#define USE_VALGRIND
/*
* Define this to cause pfree()'d memory to be cleared immediately, to
diff --git a/src/include/utils/elog.h b/src/include/utils/elog.h
index 7bfd25a9e9..6157280252 100644
--- a/src/include/utils/elog.h
+++ b/src/include/utils/elog.h
@@ -16,6 +16,11 @@
#include <setjmp.h>
+#ifdef USE_LIBUNWIND
+#define UNW_LOCAL_ONLY
+#include <libunwind.h>
+#endif
+
/* Error level codes */
#define DEBUG5 10 /* Debugging messages, in categories of
* decreasing detail. */
@@ -169,6 +174,7 @@ extern int errcontext_msg(const char *fmt,...) pg_attribute_printf(1, 2);
extern int errhidestmt(bool hide_stmt);
extern int errhidecontext(bool hide_ctx);
+extern int errhidestack(bool hide_stack);
extern int errfunction(const char *funcname);
extern int errposition(int cursorpos);
@@ -334,6 +340,7 @@ typedef struct ErrorData
bool show_funcname; /* true to force funcname inclusion */
bool hide_stmt; /* true to prevent STATEMENT: inclusion */
bool hide_ctx; /* true to prevent CONTEXT: inclusion */
+ bool hide_stack; /* true to prevent STACK: inclusion */
const char *filename; /* __FILE__ of ereport() call */
int lineno; /* __LINE__ of ereport() call */
const char *funcname; /* __func__ of ereport() call */
@@ -358,6 +365,11 @@ typedef struct ErrorData
/* context containing associated non-constant strings */
struct MemoryContextData *assoc_context;
+
+#ifdef USE_LIBUNWIND
+ unw_context_t unwind_context; /* call context STACK info for libunwind */
+ bool unwind_context_valid;
+#endif
} ErrorData;
extern void EmitErrorReport(void);
--
2.14.3
On 2018-Jun-20, Craig Ringer wrote:
Hi folks
I recently needed a way to get backtraces from errors in a convenient,
non-interactive and indescriminate way. The attached patch is the result.
It teaches Pg to use libunwind to self-backtrace in elog/ereport.
Anyone think this is useful/interesting/worth pursuing?
I think we sorely need some mechanism to optionally get backtraces in
error messages. I think having backtraces in all messages is definitely
not a good thing, but requiring an explicit marker (such as in my patch)
means the code has to be recompiled, which is not a solution in
production systems. I kind of lean towards your approach, but it has to be
something that's easily enabled/disabled at runtime.
I have no idea how expensive backtrace() and libunwind are, but I doubt
we want to pay the cost for every message before we know that error
requires the backtrace to be collected. Something like PGC_USERSET
server_min_backtraces=PANIC
might be a possible interface.
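For illustration, such a GUC could gate the capture itself in the patch's errsavestack(); this is a rough sketch only, and backtrace_min_level is a hypothetical variable, not something in the posted patch:

/* Hypothetical GUC, settable like log_min_messages; default PANIC. */
int         backtrace_min_level = PANIC;

static inline void
errsavestack(ErrorData *edata)
{
#ifdef USE_LIBUNWIND
    /* Only pay for unw_getcontext() when the message is severe enough. */
    if (edata->elevel >= backtrace_min_level && !edata->hide_stack)
    {
        if (unw_getcontext((unw_context_t *) &edata->unwind_context) == 0)
            edata->unwind_context_valid = true;
    }
#endif
}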
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 06/20/2018 06:10 PM, Craig Ringer wrote:
I recently needed a way to get backtraces from errors in a convenient,
non-interactive and indescriminate way. The attached patch is the
result. It teaches Pg to use libunwind to self-backtrace in elog/ereport.
Anyone think this is useful/interesting/worth pursuing?
This would be a great thing to have. I often need to add stack traces to
the logs, and for now I just link with libunwind and add some ad-hoc
function to grab the traces. Having libunwind support in core would make
this simpler.
--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Hi,
On 2018-06-20 11:20:49 -0400, Alvaro Herrera wrote:
I recently needed a way to get backtraces from errors in a convenient,
non-interactive and indescriminate way. The attached patch is the result.
It teaches Pg to use libunwind to self-backtrace in elog/ereport.
Anyone think this is useful/interesting/worth pursuing?
Generally interesting, yes.
I think we sorely need some mechanism to optionally get backtraces in
error messages. I think having backtraces in all messages is definitely
not a good thing, but requiring an explicit marker (such as in my patch)
means the code has to be recompiled, which is not a solution in
production systems. I kind lean towards your approach, but it has to be
something that's easily enabled/disabled at runtime.
I have no idea how expensive backtrace() and libunwind are, but I doubt
we want to pay the cost for every message before we know that error
requires the backtrace to be collected. Something like PGC_USERSET
server_min_backtraces=PANIC
might be a possible interface.
Yes, most definitely. We can't do this everywhere. It's quite expensive
to collect / build them. So I think we'd probably need another guc that
controls when the information is collected, perhaps defaulting to PANIC.
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
On 2018-06-20 11:20:49 -0400, Alvaro Herrera wrote:
I have no idea how expensive backtrace() and libunwind are, but I doubt
we want to pay the cost for every message before we know that error
requires the backtrace to be collected. Something like PGC_USERSET
server_min_backtraces=PANIC
might be a possible interface.
Yes, most definitely. We can't do this everywhere. It's quite expensive
to collect / build them. So I think we'd probably need another guc that
controls when the information is collected, perhaps defaulting to PANIC.
The cost is a problem, and the dependencies on various additional
libraries are too. I wonder whether anything could be gained by
putting this stuff in an extension? Then we could have different
extensions for different backtrace libraries, and loading the extension
or not would be one avenue to control whether you were paying the cost.
(Though some control GUCs might be needed anyway.)
regards, tom lane
On 2018-06-20 12:04:51 -0400, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
On 2018-06-20 11:20:49 -0400, Alvaro Herrera wrote:
I have no idea how expensive backtrace() and libunwind are, but I doubt
we want to pay the cost for every message before we know that error
requires the backtrace to be collected. Something like PGC_USERSET
server_min_backtraces=PANIC
might be a possible interface.
Yes, most definitely. We can't do this everywhere. It's quite expensive
to collect / build them. So I think we'd probably need another guc that
controls when the information is collected, perhaps defaulting to PANIC.
The cost is a problem, and the dependencies on various additional
libraries are too. I wonder whether anything could be gained by
putting this stuff in an extension? Then we could have different
extensions for different backtrace libraries, and loading the extension
or not would be one avenue to control whether you were paying the cost.
(Though some control GUCs might be needed anyway.)
I think the problem is that this most frequently is an issue when
investigating an issue for customers. Often enough it'll legally not be
possible to gain direct access to the system, and remotely instructing
people to install an extension and configure it isn't great. So if we
could, by default, include something better for PANICs etc, it'd be a
great boon - I think that's even clear from conversations on the pg
lists (which often will be the more knowledgeable people): how often did
we try hard to get a backtrace from a crash?
If we instead had a backtrace enabled for all PANICs and some FATALs by
default (and perhaps a SIGSEGV handler too), we'd be in a better
place. That'd obviously only work when compiled with support for
libraries, on platforms where we added support for that. But I think
that'd be quite the improvement already.
Greetings,
Andres Freund
On Wed, Jun 20, 2018 at 12:10 PM, Andres Freund <andres@anarazel.de> wrote:
I think the problem is that this most frequently is an issue when
investigating an issue for customers. Often enough it'll legally not be
possible to gain direct access to the system, and remotely instructing
people to install an extension and configure it isn't great. So if we
could, by default, include something better for PANICs etc, it'd be a
great boon - I think that's even clear from conversations on the pg
lists (which often will be the more knowledgeable people): how often did
we try hard to get a backtrace from a crash?
+1 to all of that. This is a real nuisance, and making it less of a
nuisance would be great.
If we instead had a backtrace enabled for all PANICs and some FATALs by
default (and perhaps a SIGSEGV handler too), we'd be in a better
place. That'd obviously only work when compiled with support for
libraries, on platforms where we added support for that. But I think
that'd be quite the improvement already.
I think doing it on PANIC would be phenomenal. SIGSEGV would be great
if we can make it safe enough, which I'm not sure about, but then I
suppose we're crashing anyway. Instead of making the ERROR behavior
conditional on log_error_verbosity as Craig has it now, how about
doing it whenever the error code is ERRCODE_INTERNAL_ERROR? That's
pretty much the cases that aren't supposed to happen, so if we see
those happening a lot, it's either a bug we need to fix or we should
supply a better error code. Also, a lot of those messages are
duplicated in many places and/or occur inside fairly generic functions
inside lsyscache.c, so the actual error message text tends not to be
enough to know what happened.
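In terms of the posted patch, that would mean something along these lines in send_message_to_server_log(), reusing the patch's fields and helpers (a sketch only, not a tested change):

    /* Sketch: emit the stack for PANIC and for "shouldn't happen" errors,
     * instead of keying on log_error_verbosity. */
    if ((edata->elevel == PANIC ||
         edata->sqlerrcode == ERRCODE_INTERNAL_ERROR)
        && !edata->hide_stack && edata->unwind_context_valid)
    {
        log_line_prefix(&buf, edata);
        appendStringInfoString(&buf, _("STACK: "));
        append_unwind_backtrace(&buf, edata, _("< no stack >"));
    }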
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2018-06-20 13:10:57 -0400, Robert Haas wrote:
On Wed, Jun 20, 2018 at 12:10 PM, Andres Freund <andres@anarazel.de> wrote:
If we instead had a backtrace enabled for all PANICs and some FATALs by
default (and perhaps a SIGSEGV handler too), we'd be in a better
place. That'd obviously only work when compiled with support for
libraries, on platforms where we added support for that. But I think
that'd be quite the improvement already.
I think doing it on PANIC would be phenomenal. SIGSEGV would be great
if we can make it safe enough, which I'm not sure about, but then I
suppose we're crashing anyway.
Yea, I think that's pretty much why it'd be ok.
Instead of making the ERROR behavior conditional on
log_error_verbosity as Craig has it now, how about doing it whenever
the error code is ERRCODE_INTERNAL_ERROR? That's pretty much the
cases that aren't supposed to happen, so if we see those happening a
lot, it's either a bug we need to fix or we should supply a better
error code. Also, a lot of those messages are duplicated in many
places and/or occur inside fairly generic functions inside
lsyscache.c, so the actual error message text tends not to be enough
to know what happened.
I don't think that's ok. It's perfectly possible to hit
ERRCODE_INTERNAL_ERROR at a high frequency in some situations, and
there's plenty cases that aren't ERRCODE_INTERNAL_ERROR where we'd want
this. E.g. a lot of generic filesystem errors have
errcode_for_file_access(), but are too generic messages to debug. So I
think we'll just need a separate GUC for things that aren't PANIC and
haven't explicitly opt-ed in.
Greetings,
Andres Freund
On 20 Jun 2018, at 17:10, Craig Ringer <craig@2ndquadrant.com> wrote:
BTW, Álvaro posted a simpler patch at /messages/by-id/20180410213203.nl645o5vj5ap66sl@alvherre.pgsql. It uses backtrace() from glibc, and requires each site you want to bt to be annotated explicitly. I forgot about backtrace() completely when I wrote mine, and I prefer the control libunwind gives me anyway, but the reduced dependency would be nice. Especially since backtrace() is in FreeBSD too.
Just as a datapoint regarding this: backtrace() is not available in OpenBSD,
but there is a library in ports which implements it, libexecinfo.
cheers ./daniel
On Wed, Jun 20, 2018 at 1:15 PM, Andres Freund <andres@anarazel.de> wrote:
I don't think that's ok. It's perfectly possible to hit
ERRCODE_INTERNAL_ERROR at a high frequency in some situations,
Really? How?
and
there's plenty cases that aren't ERRCODE_INTERNAL_ERROR where we'd want
this. E.g. a lot of generic filesystem errors have
errcode_for_file_access(), but are too generic messages to debug. So I
think we'll just need a separate GUC for things that aren't PANIC and
haven't explicitly opt-ed in.
Mmph. I don't like that much. I mean I can hold my nose, but I hope
we can find a way to do better.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From: Alvaro Herrera [mailto:alvherre@2ndquadrant.com]
I have no idea how expensive backtrace() and libunwind are, but I doubt
we want to pay the cost for every message before we know that error
requires the backtrace to be collected. Something like PGC_USERSET
server_min_backtraces=PANIC
might be a possible interface.
+1
In addition, it would be nicer if we could have some filtering based on the error condition. Sometimes I wanted to know where the "out of memory" or "invalid memory alloc request size ..." was thrown.
How about adding a GUC like the -e option of strace? strace allows you to specify which system calls, and groups of system calls, to capture. We can use the SQL state and class code/name to list the conditions to capture. In the out-of-memory case, I want to specify 53200, but don't want to get stack traces for other errors in class 53.
https://www.postgresql.org/docs/devel/static/errcodes-appendix.html
Regards
Takayuki Tsunakawa
On 21 June 2018 at 01:15, Andres Freund <andres@anarazel.de> wrote:
On 2018-06-20 13:10:57 -0400, Robert Haas wrote:
On Wed, Jun 20, 2018 at 12:10 PM, Andres Freund <andres@anarazel.de>
wrote:
If we instead had a backtrace enabled for all PANICs and some FATALs by
default (and perhaps a SIGSEGV handler too), we'd be in a better
place. That'd obviously only work when compiled with support for
libraries, on platforms where we added support for that. But I think
that'd be quite the improvement already.
I think doing it on PANIC would be phenomenal. SIGSEGV would be great
if we can make it safe enough, which I'm not sure about, but then I
suppose we're crashing anyway.
Yea, I think that's pretty much why it'd be ok.
Yep. At worst we crash again while trying to generate a bt. We're not doing
anything particularly exciting, and it should be sensible enough.
Instead of making the ERROR behavior conditional on
log_error_verbosity as Craig has it now, how about doing it whenever
the error code is ERRCODE_INTERNAL_ERROR? That's pretty much the
cases that aren't supposed to happen, so if we see those happening a
lot, it's either a bug we need to fix or we should supply a better
error code. Also, a lot of those messages are duplicated in many
places and/or occur inside fairly generic functions inside
lsyscache.c, so the actual error message text tends not to be enough
to know what happened.
I don't think that's ok. It's perfectly possible to hit
ERRCODE_INTERNAL_ERROR at a high frequency in some situations, and
there's plenty cases that aren't ERRCODE_INTERNAL_ERROR where we'd want
this.
Perhaps we should fix those, but it might be a game of whack-a-mole as the
code changes, and inevitably you'll want to generate stacks for some other
errcode while getting frustrated at all the ERRCODE_INTERNAL_ERROR. Not
sure it's worth it.
However, maybe a GUC like
log_error_stacks_errcodes = 'XX000, 55000'
would work. It'd be somewhat expensive to evaluate, but we'd only be doing
it where we'd already decided to emit an error. And it'd fit in even if we
later added smarter selective logging.
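A rough sketch of what the matching side might look like; log_error_stacks_errcodes is hypothetical, the function name is made up, and the unpacking just mirrors what elog.c's unpack_sql_state() does with PGUNSIXBIT (assumes the usual backend includes, postgres.h and utils/elog.h):

/* Sketch: decide whether a packed sqlerrcode matches a hypothetical
 * comma-separated log_error_stacks_errcodes GUC such as 'XX000, 55000'.
 * Cheap enough to run here because we only get this far once an error
 * is actually being emitted. */
static bool
stack_wanted_for_errcode(int sqlerrcode, const char *errcode_list)
{
    char        sqlstate[6];
    int         i;

    if (errcode_list == NULL || errcode_list[0] == '\0')
        return false;

    /* Unpack the packed errcode into its five-character SQLSTATE form. */
    for (i = 0; i < 5; i++)
    {
        sqlstate[i] = PGUNSIXBIT(sqlerrcode);
        sqlerrcode >>= 6;
    }
    sqlstate[5] = '\0';

    /* Naive substring match; real code would parse the list once at GUC
     * assignment time and handle two-character class prefixes too. */
    return strstr(errcode_list, sqlstate) != NULL;
}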
BTW, it's worth noting that these backtraces are very limited. They don't
report arguments or locals. So it's still no substitute for suitable
errcontext callbacks, and sometimes it's still necessary to fall back to
gdb or messing around with perf userspace tracepoints.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
This is what the stacks look like btw
[2018-06-21 12:26:45.309 AWST] [7293] [] [] [:0] DEBUG: 00000:
find_in_dynamic_libpath: trying
"/home/craig/pg/10/lib/postgresql/pglogical.so"
[2018-06-21 12:26:45.309 AWST] [7293] [] [] [:0] LOCATION:
find_in_dynamic_libpath, dfmgr.c:639
[2018-06-21 12:26:45.309 AWST] [7293] [] [] [:0] STACK:
FRAME 0: find_in_dynamic_libpath +628
FRAME 1: expand_dynamic_library_name +205
FRAME 2: load_external_function +38
FRAME 3: LookupBackgroundWorkerFunction +197
FRAME 4: StartBackgroundWorker +549
FRAME 5: do_start_bgworker +466
FRAME 6: maybe_start_bgworkers +382
FRAME 7: reaper +895
FRAME 8: funlockfile +80
FRAME 9: __select +23
FRAME 10: ServerLoop +394
FRAME 11: PostmasterMain +4499
I wrote it because I got sick of Assert(false) debugging, and I was chasing
down some "ERROR: 08P01: insufficient data left in message" errors. Then I
got kind of caught up in it... you know how it is.
It also goes to show there are plenty of places you might want to get a
stack where you don't have an internal errcode or panic. I don't think that
idea will fly.
On Thu, Jun 21, 2018 at 12:35:10PM +0800, Craig Ringer wrote:
I wrote it because I got sick of Assert(false) debugging, and I was chasing
down some "ERROR: 08P01: insufficient data left in message" errors. Then I
got kind of caught up in it... you know how it is.
Yes, I know that feeling! I have been using as well the Assert(false)
and the upgrade-to-PANIC tricks a couple of times, so being able to get
more easily backtraces would be really nice.
It also goes to show there are plenty of places you might want to get a
stack where you don't have an internal errcode or panic. I don't think that
idea will fly.
Yep. Error message uniqueness can help, but getting the call stack
can trace back to more understanding of a problem.
--
Michael
Hello, I basically like this.
At Thu, 21 Jun 2018 12:35:10 +0800, Craig Ringer <craig@2ndquadrant.com> wrote in <CAMsr+YGFMVnFLRSf09GGgjUyxv3C0ytep9ftM8N2X1kLM-SkKw@mail.gmail.com>
This is what the stacks look like btw
[2018-06-21 12:26:45.309 AWST] [7293] [] [] [:0] DEBUG: 00000:
find_in_dynamic_libpath: trying
"/home/craig/pg/10/lib/postgresql/pglogical.so"
[2018-06-21 12:26:45.309 AWST] [7293] [] [] [:0] LOCATION:
find_in_dynamic_libpath, dfmgr.c:639
[2018-06-21 12:26:45.309 AWST] [7293] [] [] [:0] STACK:
FRAME 0: find_in_dynamic_libpath +628
FRAME 1: expand_dynamic_library_name +205
FRAME 2: load_external_function +38
FRAME 3: LookupBackgroundWorkerFunction +197
FRAME 4: StartBackgroundWorker +549
FRAME 5: do_start_bgworker +466
FRAME 6: maybe_start_bgworkers +382
FRAME 7: reaper +895
FRAME 8: funlockfile +80
FRAME 9: __select +23
FRAME 10: ServerLoop +394
FRAME 11: PostmasterMain +4499
I wrote it because I got sick of Assert(false) debugging, and I was chasing
down some "ERROR: 08P01: insufficient data left in message" errors. Then I
got kind of caught up in it... you know how it is.
It also goes to show there are plenty of places you might want to get a
stack where you don't have an internal errcode or panic. I don't think that
idea will fly.
I think this for assertion failure is no problem but I'm not sure
for other cases. We could set proper context description or other
additional information in error messages before just dumping a
trace for known cases.
Since PG9.5, RPMs are compiled with --enable-dtrace, so we could
use SystemTap to insert the stack-trace stuff. An additional
built-in probe in the error reporting system or in assertions would
make this pluggable.
However, SystemTap doesn't work for me, and I'm not sure how it
is broken. (stap version is 3.2/0.170 and uname -r shows
3.10.0-862.)
$ sudo stap -e 'probe process(".../bin/postgres").mark("transaction__start"){}'
...
LOG: autovacuum launcher process (PID 10592) was terminated by signal 4: Illegal instruction
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Thu, Jun 21, 2018 at 11:02 AM, Michael Paquier <michael@paquier.xyz>
wrote:
On Thu, Jun 21, 2018 at 12:35:10PM +0800, Craig Ringer wrote:
I wrote it because I got sick of Assert(false) debugging, and I was
chasing
down some "ERROR: 08P01: insufficient data left in message" errors.
Then I
got kind of caught up in it... you know how it is.
Yes, I know that feeling! I have been using as well the Assert(false)
and the upgrade-to-PANIC tricks a couple of times, so being able to get
more easily backtraces would be really nice.
Sometime back I'd suggested an idea to be able to dynamically manage log
levels for elog messages [1]. That did not evoke much response, but I
still believe some such facility will be very useful for debugging
production systems. Now I concede that the POC patch that I sent is
horribly written and has a bad UI, but those can be improved if there is
interest. Given the lack of response, though, I suspect there isn't enough
interest in the feature.
Thanks,
Pavan
[1]: /messages/by-id/CABOikdMvx_Kr_b4ELhJEoeGcLTZKrDma+fPZpoZVdpL7Zc0bVw@mail.gmail.com
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu., 21 Jun. 2018, 19:26 Pavan Deolasee, <pavan.deolasee@gmail.com>
wrote:
On Thu, Jun 21, 2018 at 11:02 AM, Michael Paquier <michael@paquier.xyz>
wrote:
On Thu, Jun 21, 2018 at 12:35:10PM +0800, Craig Ringer wrote:
I wrote it because I got sick of Assert(false) debugging, and I was
chasing
down some "ERROR: 08P01: insufficient data left in message" errors.
Then I
got kind of caught up in it... you know how it is.
Yes, I know that feeling! I have been using as well the Assert(false)
and the upgrade-to-PANIC tricks a couple of times, so being able to get
more easily backtraces would be really nice.
Sometime back I'd suggested an idea to be able to dynamically manage log
levels for elog messages [1].
Huge +1 from me on being able to selectively manage logging on a
module/subsystem, file, or line level.
I don't think I saw the post.
Such a thing would obviously make built in backtrace support much more
useful.
On 2018-06-22 08:24:45 +0800, Craig Ringer wrote:
On Thu., 21 Jun. 2018, 19:26 Pavan Deolasee, <pavan.deolasee@gmail.com>
wrote:On Thu, Jun 21, 2018 at 11:02 AM, Michael Paquier <michael@paquier.xyz>
wrote:
On Thu, Jun 21, 2018 at 12:35:10PM +0800, Craig Ringer wrote:
I wrote it because I got sick of Assert(false) debugging, and I was
chasing
down some "ERROR: 08P01: insufficient data left in message" errors.
Then I
got kind of caught up in it... you know how it is.
Yes, I know that feeling! I have been using as well the Assert(false)
and the upgrade-to-PANIC tricks a couple of times, so being able to get
more easily backtraces would be really nice.
Sometime back I'd suggested an idea to be able to dynamically manage log
levels for elog messages [1].
Huge +1 from me on being able to selectively manage logging on a
module/subsystem, file, or line level.
I don't think I saw the post.
Such a thing would obviously make built in backtrace support much more
useful.
I strongly suggest keeping these as separate as possible. Either is
useful without the other, and both are nontrivial. Tackling them
together imo makes it much more likely to get nowhere.
- Andres
On Fri, Jun 22, 2018 at 6:18 AM, Andres Freund <andres@anarazel.de> wrote:
On 2018-06-22 08:24:45 +0800, Craig Ringer wrote:
Huge +1 from me on being able to selectively manage logging on a
module/subsystem, file, or line level.
I don't think I saw the post.
Such a thing would obviously make built in backtrace support much more
useful.
I strongly suggest keeping these as separate as possible. Either is
useful without the other, and both are nontrivial. Tackling them
together imo makes it much more likely to get nowhere.
Sorry, I did not mean to mix up two patches. I brought it up just in case
it provides another idea about when and how to log the backtrace. So yeah,
let's discuss that patch separately.
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 22 June 2018 at 08:48, Andres Freund <andres@anarazel.de> wrote:
On 2018-06-22 08:24:45 +0800, Craig Ringer wrote:
On Thu., 21 Jun. 2018, 19:26 Pavan Deolasee, <pavan.deolasee@gmail.com>
wrote:On Thu, Jun 21, 2018 at 11:02 AM, Michael Paquier <michael@paquier.xyz
wrote:
On Thu, Jun 21, 2018 at 12:35:10PM +0800, Craig Ringer wrote:
I wrote it because I got sick of Assert(false) debugging, and I was
chasing
down some "ERROR: 08P01: insufficient data left in message" errors.
Then I
got kind of caught up in it... you know how it is.
Yes, I know that feeling! I have been using as well the Assert(false)
and the upgrade-to-PANIC tricks a couple of times, so being able to get
more easily backtraces would be really nice.
Sometime back I'd suggested an idea to be able to dynamically manage
log
levels for elog messages [1].
Huge +1 from me on being able to selectively manage logging on a
module/subsystem, file, or line level.
I don't think I saw the post.
Such a thing would obviously make built in backtrace support much more
useful.
I strongly suggest keeping these as separate as possible. Either is
useful without the other, and both are nontrivial. Tackling them
together imo makes it much more likely to get nowhere.
Totally agree, and it's why I raised this as its own thing.
Thanks.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
A few misc comments:
In general, +1.
It might be nice to move the stack trace code into a separate function (not
static to elog.c) - I have often found it useful to attach backtraces to
objects so I can debug complex allocation/deallocation problems.
The major expense in capturing a stack trace is in the symbolization of the
stack addresses, not the stack walk itself. You typically want to
symbolize the addresses at the time you generate the trace, but that's not
always the case. For example, if you want to record the stack trace of
every call to AllocSetAlloc() (and attach the trace to the AllocChunk)
performance gets unbearable if you symbolize every trace. So a flag that
specifies whether to symbolize would be nice too.
If you don't symbolize, you need a way to capture the address range of each
dynamically loaded shared object (so that you can later symbolize using
something like addr2line).
(The above suggestions relate to server debugging, not application
debugging).
End of wish list...
-- Korry
On 22 June 2018 at 23:26, Korry Douglas <korry.douglas@enterprisedb.com>
wrote:
A few misc comments:
In general, +1.
It might be nice to move the stack trace code into a separate function
(not static to elog.c) - I have often found it useful to attach backtraces
to objects so I can debug complex allocation/deallocation problems.
Good suggestion.
The major expense in capturing a stack trace is in the symbolization of
the stack addresses, not the stack walk itself. You typically want to
symbolize the addresses at the time you generate the trace, but that's not
always the case. For example, if you want to record the stack trace of
every call to AllocSetAlloc() (and attach the trace to the AllocChunk)
performance gets unbearable if you symbolize every trace. So a flag that
specifies whether to symbolize would be nice too.
libunwind has some nifty caching that makes that a _lot_ cheaper; that's
part of why I went with it despite the extra lib requirement.
If you don't symbolize, you need a way to capture the address range of
each dynamically loaded shared object (so that you can later symbolize
using something like addr2line).
Yeah. libunwind doesn't expose any interface to get that information, so
you'd have to use platform APIs, like glibc's dl_iterate_phdr or dladdr, or
capture /proc/self/maps. You have to make sure that info makes it to the
log, and is re-output often enough not to be lost to log rotation, and is
invalidated and re-output if mappings change due to new lib loads etc. But
you don't want to print it all the time either.
Then you have to walk the stack and print the instruction pointers and
stack pointers and spit out raw traces for the user to reassemble.
I'd argue that if you're doing the sort of thing where you want a stack of
every AllocSetAlloc, you shouldn't be trying to do that in-process. That's
where tools like perf, systemtap, etc really come into their own. I'm
focused on making additional diagnostic info for errors and key log
messages collectable for systems that aren't easily accessed, like users
who have 12-hour read/response latencies and security setups as strict as
they are insane and nonsensical.
I'd have no objection to the option to disable symbolic back traces and
print the raw call stack. It's trivial to do; in fact, I only removed the
ip/sp info to keep the output more concise.
I'm not interested in writing anything to handle the library load address
mapping collection etc though. I don't see a really sensible way to do that
in a log-based system.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 23 June 2018 at 20:22, Craig Ringer <craig@2ndquadrant.com> wrote:
Yeah. libunwind doesn't expose any interface to get that information, so
you'd have to use platform APIs, like glibc's dl_iterate_phdr or dladdr, or
capture /proc/self/maps.
Ahem, I take that part back.
https://stackoverflow.com/a/21584356/398670
see /usr/include/link.h
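For the record, dl_iterate_phdr() makes that pretty simple. A standalone sketch of logging module load addresses for later addr2line use (not something I'm proposing to add to the patch; the function names are just illustrative):

/* Sketch: log the base address and name of each loaded object so that raw
 * instruction pointers can be symbolized later with addr2line. */
#define _GNU_SOURCE
#include <link.h>
#include <stdio.h>

static int
log_phdr_callback(struct dl_phdr_info *info, size_t size, void *data)
{
    fprintf(stderr, "MODULE: base=%#lx name=%s\n",
            (unsigned long) info->dlpi_addr,
            info->dlpi_name[0] ? info->dlpi_name : "(main executable)");
    return 0;               /* 0 means keep iterating */
}

static void
log_loaded_modules(void)
{
    dl_iterate_phdr(log_phdr_callback, NULL);
}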
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 21 June 2018 at 19:09, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp
wrote:
I think this for assertion failure is no problem but I'm not sure
for other cases.
I think it's pretty strongly desirable for PANIC.
We could set proper context description or other
additional information in error messages before just dumping a
trace for known cases.
Yeah. The trouble there is that there are a _lot_ of places to touch for
such things, and inevitably the one you really want to see will be
something that didn't get suitably annotated.
Since PG9.5 RPMs are complied with --enable-dtrace so we could
use system tap to insert the stack-trace stuff. Additional
built-in probe in error reporting system or assertions would make
this pluggable.
I don't bother with SystemTAP anymore; perf does the job for most purposes.
You can use --enable-dtrace SDTs with Perf fine. I wrote it up for Pg at
https://wiki.postgresql.org/wiki/Profiling_with_perf#PostgreSQL_pre-defined_tracepoint_events
.
Dynamic tracepoints are also very useful.
Both require debuginfo, but so does outputting of symbolic stacks per my
patch.
(That reminds me, I need to chat with Devrim about creating a longer lived
debuginfo + old-versions RPM repo for Pg itself, if not the accessory
bits and pieces. I'm so constantly frustrated by not being able to get
needed debuginfo packages to investigate some core or running system
problem because they've been purged from the PGDG yum repo as soon as a new
version comes out.)
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi.
At Mon, 25 Jun 2018 09:32:36 +0800, Craig Ringer <craig@2ndquadrant.com> wrote in <CAMsr+YGBw9tgKRGxyihVeMzmjQx_2t8D17tE7t5-0gMdW7S6UA@mail.gmail.com>
On 21 June 2018 at 19:09, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp
wrote:
I think this for assertion failure is no problem but I'm not sure
for other cases.
I think it's pretty strongly desirable for PANIC.
Ah, I forgot about that. I agree with that. The cost to collect the
information is not a problem on PANIC. Still, I don't think a stack
trace is valuable on all PANIC messages. I can accept a GUC to control
it, but it is preferable that this works fine without such extra setup.
We could set proper context description or other
additional information in error messages before just dumping a
trace for known cases.
Yeah. The trouble there is that there are a _lot_ of places to touch for
such things, and inevitably the one you really want to see will be
something that didn't get suitably annotated.
Agreed, that is the reality. Instead, can't we make new error
classes PANIC_STACKDUMP and ERROR_STACKDUMP to explicitly
restrict stack dumps in elog()? Or elog_stackdump() alongside elog()
is also fine with me. Using those is easier than proper
annotation. It would be perfect if we could invent an automated
way, but I don't think that is realistic.
Since PG9.5 RPMs are complied with --enable-dtrace so we could
use system tap to insert the stack-trace stuff. Additional
built-in probe in error reporting system or assertions would make
this pluggable.
I don't bother with SystemTAP anymore; perf does the job for most purposes.
You can use --enable-dtrace SDTs with Perf fine. I wrote it up for Pg at
https://wiki.postgresql.org/wiki/Profiling_with_perf#PostgreSQL_pre-defined_tracepoint_events
.
Dynamic tracepoints are also very useful.
Both require debuginfo, but so does outputting of symbolic stacks per my
patch.
Mmm. If I understand you correctly, I mean that perf doesn't dump
a backtrace on a probe point but trace points are usable to take
a symbolic backtrace. (I'm sorry that I cannot provide an example
since stap doesn't work in my box..)
If your intention is to take backtraces without any setup (which I
think is preferable), it should be restricted to the required
points. That can be achieved by the additional error classes or
substitute error output functions.
Just as an idea: can't we use a definition file in which the
LOCATIONs of error messages that need a backtrace dump are
listed? That list would usually be empty and should be very short
if not. The LOCATION information is easily obtained from a verbose
error message once it has been shown, but is a bit hard to find
otherwise.
(That reminds me, I need to chat with Devrim about creating a longer lived
debuginfo + old versions rpms repo for Pg its self, if not the accessory
bits and pieces. I'm so constantly frustrated by not being able to get
needed debuginfo packages to investigate some core or running system
problem because they've been purged from the PGDG yum repo as soon as a new
version comes out.)
In our department we take care to preserve them ourselves, out of the
necessity of supporting older systems. I sometimes feel it would be
very helpful if they were available on the official site.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On 25 June 2018 at 14:21, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Hi.
At Mon, 25 Jun 2018 09:32:36 +0800, Craig Ringer <craig@2ndquadrant.com>
wrote in <CAMsr+YGBw9tgKRGxyihVeMzmjQx_2t8D17tE7t5-0gMdW7S6UA@mail.gmail.com>
On 21 June 2018 at 19:09, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>
wrote:
I think this is no problem for assertion failures, but I'm not sure
about other cases.
I think it's pretty strongly desirable for PANIC.
Ah, I forgot about that. I agree with that. The cost of collecting the
information is not a problem on PANIC. However, I still don't
think a stack trace is valuable on all PANIC messages. I can accept
a GUC to control it, but it is preferable that this works fine
without such extra setup.
Places such as?
We could set proper context description or other
additional information in error messages before just dumping a
trace for known cases.
Yeah. The trouble there is that there are a _lot_ of places to touch for
such things, and inevitably the one you really want to see will be
something that didn't get suitably annotated.
Agreed, that is the reality. Instead, couldn't we add new error
classes PANIC_STACKDUMP and ERROR_STACKDUMP to explicitly
restrict stack dumps in elog()? Or elog_stackdump() alongside elog()
would also be fine with me. Using those is easier than properly
annotating every site. It would be perfect if we could invent an automated
way, but I don't think that is realistic.
That needlessly complicates error severity levels with information not
really related to the severity. -1 from me.
Mmm. If I understand you correctly: I meant that perf doesn't dump
a backtrace at a probe point, but that tracepoints are usable for taking
a symbolic backtrace. (I'm sorry that I cannot provide an example,
since stap doesn't work on my box.)
perf record --call-graph dwarf -e sdt_postgresql:checkpoint__start -u
postgres
perf report -g
If your intention is to take backtraces without any extra setup (which I
think is preferable), it should be restricted to the points that require it.
That can be achieved with the additional error classes or with
substitute error output functions.
Who's classifying all the possible points?
Which PANICs or assertion failures do you want to exempt?
I definitely do not want to emit stacks for everything, like my patch
currently does. It's just a proof of concept. Later on I'll want control on
a fine grained level at runtime of when that happens, but that's out of
scope for this. For now the goal is to emit stacks at times when it's obviously
pretty sensible to have a stack, and do it in a way that doesn't require
per-error-site maintenance/changes or create backport hassle.
Just an idea, but couldn't we use a definition file in which the
LOCATIONs of error messages that need to dump a backtrace are
listed? That list would usually be empty and should be very short if
not. The LOCATION information is easily obtained from a verbose
error message once it has been shown, but is a bit hard to find
otherwise.
That's again veering into selective logging control territory. Rather than
doing it for stack dump control only, it should be part of a broader
control over dynamic and scoped verbosity, selective logging, and log
options, like Pavan raised. I see stacks as just one knob that can be
turned on/off here.
(That reminds me, I need to chat with Devrim about creating a longer lived
debuginfo + old-versions RPM repo for Pg itself, if not the accessory
bits and pieces. I'm so constantly frustrated by not being able to get
needed debuginfo packages to investigate some core or running system
problem because they've been purged from the PGDG yum repo as soon as a new
version comes out.)
In our department we take care to preserve them ourselves, out of the
necessity of supporting older systems. I sometimes feel it would be
very helpful if they were available on the official site.
Maybe if I can get some interest in that, you might be willing to
contribute your archives as a starter so we have them for back-versions?
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Sat, Jun 23, 2018 at 8:22 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
On 22 June 2018 at 23:26, Korry Douglas <korry.douglas@enterprisedb.com>
wrote:
A few misc comments:
In general, +1.
It might be nice to move the stack trace code into a separate function
(not static to elog.c) - I have often found it useful to attach backtraces
to objects so I can debug complex allocation/deallocation problems.
Good suggestion.
The major expense in capturing a stack trace is in the symbolization of
the stack addresses, not the stack walk itself. You typically want to
symbolize the addresses at the time you generate the trace, but that's not
always the case. For example, if you want to record the stack trace of
every call to AllocSetAlloc() (and attach the trace to the AllocChunk)
performance gets unbearable if you symbolize every trace. So a flag that
specifies whether to symbolize would be nice too.
libunwind has some nifty caching that makes that a _lot_ cheaper; that's
part of why I went with it despite the extra lib requirement.
I've not used libunwind before - looking through the docs now. It does
seem much more flexible than backtrace(), thanks.
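From a first skim of the docs, the basic local unwind loop seems to be
roughly this (an untested sketch against the documented unw_* API, not the
code from the posted patch):

    #define UNW_LOCAL_ONLY
    #include <libunwind.h>
    #include <stdio.h>

    static void
    print_stack(void)
    {
        unw_context_t ctx;
        unw_cursor_t cursor;
        char        func[128];
        unw_word_t  ip;
        unw_word_t  off;

        /* enable libunwind's per-address-space caching of unwind info */
        unw_set_caching_policy(unw_local_addr_space, UNW_CACHE_GLOBAL);

        unw_getcontext(&ctx);
        unw_init_local(&cursor, &ctx);
        while (unw_step(&cursor) > 0)
        {
            unw_get_reg(&cursor, UNW_REG_IP, &ip);
            if (unw_get_proc_name(&cursor, func, sizeof(func), &off) != 0)
                snprintf(func, sizeof(func), "??");
            fprintf(stderr, "  %#016lx %s+0x%lx\n",
                    (unsigned long) ip, func, (unsigned long) off);
        }
    }

(built with -lunwind; the caching policy call is optional but appears to be
where much of the cheapness comes from)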
If you don't symbolize, you need a way to capture the address range of
each dynamically loaded shared object (so that you can later symbolize
using something like addr2line).
Yeah. libunwind doesn't expose any interface to get that information, so
you'd have to use platform APIs, like glibc's dl_iterate_phdr or dladdr, or
capture /proc/self/maps. You have to make sure that info makes it to the
log, and is re-output often enough not to be lost to log rotation, and is
invalidated and re-output if mappings change due to new lib loads etc. But
you don't want to print it all the time either.
Then you have to walk the stack and print the instruction pointers and
stack pointers and spit out raw traces for the user to reassemble.
I'd argue that if you're doing the sort of thing where you want a stack of
every AllocSetAlloc, you shouldn't be trying to do that in-process. That's
where tools like perf, systemtap, etc really come into their own. I'm
focused on making additional diagnostic info for errors and key log
messages collectable for systems that aren't easily accessed, like users
who have 12-hour read/response latencies and security setups as strict as
they are insane and nonsensical.
Understood - I realized after I replied that instrumenting AllocSetAlloc()
is a bit more complex than I had suggested. When I need to capture the
call stack of each allocation, I store the stack trace in malloc'ed
buffers; otherwise the capture recurses back into AllocSetAlloc(). I suspect that's
unreasonable for a general-purpose solution.
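Roughly what I mean, as a minimal illustration rather than my actual code:
grab raw return addresses with glibc's backtrace() into a malloc'ed buffer,
so the capture itself never re-enters the allocator being instrumented:

    #include <execinfo.h>
    #include <stdlib.h>

    typedef struct CapturedStack
    {
        int         depth;
        void       *frames[32];
    } CapturedStack;

    static CapturedStack *
    capture_stack(void)
    {
        /* plain malloc, not palloc, to avoid recursing into AllocSetAlloc */
        CapturedStack *cs = malloc(sizeof(CapturedStack));

        if (cs != NULL)
            cs->depth = backtrace(cs->frames, 32);  /* no symbolization here */
        return cs;
    }

Symbolization, when it is wanted at all, can then happen later with
backtrace_symbols() or offline with addr2line.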
I'd have no objection to the option to disable symbolic back traces and
print the raw call stack. It's trivial to do; in fact, I only removed the
ip/sp info to keep the output more concise.
I'm not interested in writing anything to handle the library load address
mapping collection etc though. I don't see a really sensible way to do that
in a log-based system.
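(If someone did want it, the collection side at least is simple with glibc's
dl_iterate_phdr; a rough, untested sketch, with print_mapping being just an
illustrative name:

    #define _GNU_SOURCE
    #include <link.h>
    #include <stdio.h>

    /* Emit the load address of every mapped object so that raw instruction
     * pointers can later be symbolized offline with addr2line. */
    static int
    print_mapping(struct dl_phdr_info *info, size_t size, void *data)
    {
        fprintf(stderr, "mapping: base=%#lx object=%s\n",
                (unsigned long) info->dlpi_addr,
                info->dlpi_name[0] != '\0' ? info->dlpi_name : "[main]");
        return 0;               /* nonzero would stop the iteration */
    }

    /* call dl_iterate_phdr(print_mapping, NULL) whenever the set of loaded
     * libraries may have changed */

The hard part is everything around it: getting that output into the log
reliably and keeping it current, as above.)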
Agreed - I tend to use (saved, not logged) stack traces to find memory
leaks that valgrind can't see (or more precisely, excessive resource
consumption that is not actually leaked).
Thanks for your consideration.
-- Korry
Hello. Thanks for the discussion.
At Mon, 25 Jun 2018 17:08:41 +0800, Craig Ringer <craig@2ndquadrant.com> wrote in <CAMsr+YGdJTGygmGhRhfn+8DLi6o+Teq+tcA-Dr3kK+8vYqwzCA@mail.gmail.com>
On 25 June 2018 at 14:21, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
I think it's pretty strongly desirable for PANIC.
Ah, I forgot about that. I agree with that. The cost of collecting the
information is not a problem on PANIC. However, I still don't
think a stack trace is valuable on all PANIC messages. I can accept
a GUC to control it, but it is preferable that this works fine
without such extra setup.
Places such as?
Sorry but I didn't get this.
Agreed, that is the reality. Instead, couldn't we add new error
classes PANIC_STACKDUMP and ERROR_STACKDUMP to explicitly
restrict stack dumps in elog()? Or elog_stackdump() alongside elog()
would also be fine with me. Using those is easier than properly
annotating every site. It would be perfect if we could invent an automated
way, but I don't think that is realistic.
That needlessly complicates error severity levels with information not
really related to the severity. -1 from me.
Ok.
Mmm. If I understand you correctly: I meant that perf doesn't dump
a backtrace at a probe point, but that tracepoints are usable for taking
a symbolic backtrace. (I'm sorry that I cannot provide an example,
since stap doesn't work on my box.)
perf record --call-graph dwarf -e sdt_postgresql:checkpoint__start -u
postgres
perf report -g
Good to know, but unfortunately it doesn't work for me. It
(USDT support) is still on the perf wiki's todo list. Some googling
told me that it is available from Linux 4.4; CentOS 7's kernel is
3.10. It seems to be a cutting-edge feature. However, I suppose that
what it emits is different from what is wanted here.
But regardless of that, I agree that it is a bit hard to use
on-site. (stap doesn't work for me for some unclear reason.)
If your intention is to take backtraces without any extra setup (which I
think is preferable), it should be restricted to the points that require it.
That can be achieved with the additional error classes or with
substitute error output functions.
Who's classifying all the possible points?
I didn't mean classifying all points. I meant that covering only the
points where it is needed would be enough for a first step.
Which PANICs or assertion failures do you want to exempt?
I didn't intend to cover all the required cases. I intended to cover
just a few or several obviously valuable cases, where we
gave up adding adequate context information for some reason and a
backtrace gives valuable information. Most xlog stuff wouldn't
need it. Lock stuff might. Heapam doesn't seem to need it,
since problems there seem to be caused by calling order, and so on. But such
control is out of scope for this patch, I know.
I definitely do not want to emit stacks for everything, like my patch
currently does. It's just a proof of concept. Later on I'll want control on
a fine grained level at runtime of when that happens, but that's out of
scope for this. For now the goal is to emit stacks at times when it's obviously
pretty sensible to have a stack, and do it in a way that doesn't require
per-error-site maintenance/changes or create backport hassle.
OK. The PoC is focusing on the means of emitting backtraces, and
controlling the logging is a separate issue, as Andres said
upthread.
I think that backtrace() generates somewhat strange-looking
output and we can get a cleaner one using libunwind. So I vote for
libunwind for implementing this. The cost doesn't matter as long as we
intend to emit this only for Asserts or PANICs. Choosing some
ERRORs by a GUC is also fine with me.
Just an idea, but couldn't we use a definition file in which the
LOCATIONs of error messages that need to dump a backtrace are
listed? That list would usually be empty and should be very short if
not. The LOCATION information is easily obtained from a verbose
error message once it has been shown, but is a bit hard to find
otherwise.
That's again veering into selective logging control territory. Rather than
doing it for stack dump control only, it should be part of a broader
control over dynamic and scoped verbosity, selective logging, and log
options, like Pavan raised. I see stacks as just one knob that can be
turned on/off here.
(That reminds me, I need to chat with Devrim about creating a longer lived
debuginfo + old-versions RPM repo for Pg itself, if not the accessory
bits and pieces. I'm so constantly frustrated by not being able to get
needed debuginfo packages to investigate some core or running system
problem because they've been purged from the PGDG yum repo as soon as a new
version comes out.)
In our department we take care to preserve them ourselves, out of the
necessity of supporting older systems. I sometimes feel it would be
very helpful if they were available on the official site.
Maybe if I can get some interest in that, you might be willing to
contribute your archives as a starter so we have them for back-versions?
Unfortunately I think we cannot do that, for several reasons. Apart
from those reasons, the official site removes older
versions as a matter of policy, and I'm not sure whether working around
that elsewhere would be a good deed.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center