Timing events WIP v1
Attached is a first WIP patch saving Timing Events via a new hook, grabbed by
a new pg_timing_events contrib module. This only implements a small
portion of the RFC spec I presented earlier this month, and is by no
means finished code looking for a regular review. This is just enough
framework to do something useful while looking for the parts that I
suspect have tricky details.
Right now this saves just a single line of text and inserts the hook
into the checkpoint code, so output looks like this:
$ psql -c "select * from pg_timing_events" -x
-[ RECORD 1 ]-
seq | 0
event | check/restartpoint complete: wrote 0 buffers (0.0%); 0
transaction log file(s) added, 0 removed, 0 recycled; write=0.000 s,
sync=0.000 s, total=0.001 s; sync files=0, longest=0.000 s, average=0.000 s
Extending this to save the key/value set and most of the other data I
mentioned before is pretty straightforward. There is a fair amount of
debris in here from the pg_stat_statements code this started as. I've
left a lot of that here but commented out because it gives examples of
grabbing things like the user and database I'll need soon.
There's a modest list of things I've already done, or am facing soon,
that involve less obvious choices, and I would appreciate feedback on
these in particular:
-This will eventually aggregate data from clients running queries, the
checkpointer process, and who knows what else in the future. The most
challenging part of this so far is deciding how to handle the memory
management side. I've tried just dumping everything into
TopMemoryContext, and that not working as hoped is the biggest known bug
right now. I haven't decided whether to continue doing that and resolve
the bug, whether there is an alternate context that makes more sense for
a contrib module like this, or whether I should fix the size of the data
and allocate everything at load time (a rough sketch of that last option
follows this list). The size of the data from log_lock_waits in
particular will be hard to estimate in advance.
-The event queue goes into a simple array accessed in circular fashion.
After realizing that output needed to navigate in the opposite order of
element addition, I ended up just putting all the queue navigation code
directly into the add/output routines. I'd be happy to use a more
formal Postgres type here instead--I looked at SHM_QUEUE for
example--but I haven't found something yet that makes this any simpler.
-I modeled the hook here on the logging one that went into 9.2. It's
defined in its own include file and it gets initialized by the logging
system. No strong justification for putting it there, it was just
similar to the existing hook and that seemed reasonable. This is in
some respects an alternate form of logging after all. The data sent by
the hook itself needs to be more verbose in order to handle the full
spec; right now I'm just asking about placement.
-I'm growing increasingly worried about allowing concurrent reads of
this data (which might be large and take a while to return to the
client) without blocking insertions. The way pg_stat_statements split
its locking to improve concurrency doesn't carry over naturally to this
data. The serial number field in here is part of one idea I had. I
might grab the header with an exclusive lock for only a moment and look
up the serial number of the last record I should display. Then I could
drop to a share lock looping over the elements, stopping if I find one
with a serial number over that. A rough sketch of that is below as well.
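To make the load-time allocation idea concrete, here's a rough sketch of
what I'm leaning toward--not working code, and the entry layout, the size
limit, and the new names here are just placeholders:

/*
 * Sketch only: fixed-size queue entries allocated in shared memory at
 * load time, instead of the MemoryContextStrdup into TopMemoryContext
 * that the attached patch does now.
 */
#define PGTE_DESC_LEN 1024

typedef struct pgteFixedEntry
{
	int64		serial_num;
	TimestampTz	entry_time;
	char		description[PGTE_DESC_LEN];	/* truncated if longer */
} pgteFixedEntry;

/* PGTE_memsize() could then stop guessing with hash_estimate_size() */
static Size
PGTE_memsize(void)
{
	Size		size = MAXALIGN(sizeof(pgteSharedState));

	size = add_size(size, mul_size(PGTE_max, sizeof(pgteFixedEntry)));
	return size;
}

/* And PGTE_store(), still under the exclusive lock, copies in place */
static void
PGTE_store_entry(pgteFixedEntry *e, const char *desc)
{
	e->serial_num = pgte->serial_num++;
	e->entry_time = GetCurrentTimestamp();
	strlcpy(e->description, desc, PGTE_DESC_LEN);
}

The part I'm unsure is acceptable is that this truncates the tail of
oversized events, the log_lock_waits ones in particular.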
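And here is the read side of the serial number idea from the last item,
with the circular navigation still written out by hand (again only a
sketch; last_serial is a made-up local variable):

	int64		last_serial;
	int			position;
	int			count;

	/* Hold the header exclusively just long enough to snapshot the bound */
	LWLockAcquire(pgte->lock, LW_EXCLUSIVE);
	last_serial = pgte->serial_num - 1;		/* newest entry we'll display */
	LWLockRelease(pgte->lock);

	/* Then walk the queue from oldest to newest under a share lock */
	LWLockAcquire(pgte->lock, LW_SHARED);
	if (pgte->full)
	{
		position = (pgte->front + 1) % PGTE_max;	/* oldest entry */
		count = PGTE_max;
	}
	else
	{
		position = 0;
		count = pgte->front + 1;
	}

	for (; count > 0; count--)
	{
		pgteEntry  *entry = &PGTE_queue[position];

		/* Anything newer than the snapshot came from a concurrent writer */
		if (entry->serial_num > last_serial)
			break;

		/* ... build and emit the output tuple for this entry here ... */

		if (++position == PGTE_max)
			position = 0;
	}
	LWLockRelease(pgte->lock);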
--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
Attachments:
timing-events-v1.diff (text/plain)
diff --git a/contrib/pg_timing_events/Makefile b/contrib/pg_timing_events/Makefile
new file mode 100644
index 0000000..3e7f1ab
--- /dev/null
+++ b/contrib/pg_timing_events/Makefile
@@ -0,0 +1,18 @@
+# contrib/pg_timing_events/Makefile
+
+MODULE_big = pg_timing_events
+OBJS = pg_timing_events.o
+
+EXTENSION = pg_timing_events
+DATA = pg_timing_events--1.0.sql pg_timing_events--unpackaged--1.0.sql
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/pg_timing_events
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/pg_timing_events/pg_timing_events--1.0.sql b/contrib/pg_timing_events/pg_timing_events--1.0.sql
new file mode 100644
index 0000000..ccf6372
--- /dev/null
+++ b/contrib/pg_timing_events/pg_timing_events--1.0.sql
@@ -0,0 +1,43 @@
+/* contrib/pg_timing_events/pg_timing_events--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION pg_timing_events" to load this file. \quit
+
+-- Register functions.
+CREATE FUNCTION pg_timing_events_reset()
+RETURNS void
+AS 'MODULE_PATHNAME'
+LANGUAGE C;
+
+CREATE FUNCTION pg_timing_events(
+ OUT seq int8,
+ OUT event text
+-- OUT occured timestamp
+
+-- OUT userid oid,
+-- OUT dbid oid,
+-- OUT query text,
+-- OUT calls int8,
+-- OUT total_time float8,
+-- OUT rows int8,
+-- OUT shared_blks_hit int8,
+-- OUT shared_blks_read int8,
+-- OUT shared_blks_written int8,
+-- OUT local_blks_hit int8,
+-- OUT local_blks_read int8,
+-- OUT local_blks_written int8,
+-- OUT temp_blks_read int8,
+-- OUT temp_blks_written int8
+)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME'
+LANGUAGE C;
+
+-- Register a view on the function for ease of use.
+CREATE VIEW pg_timing_events AS
+ SELECT * FROM pg_timing_events();
+
+GRANT SELECT ON pg_timing_events TO PUBLIC;
+
+-- Don't want this to be available to non-superusers.
+REVOKE ALL ON FUNCTION pg_timing_events_reset() FROM PUBLIC;
diff --git a/contrib/pg_timing_events/pg_timing_events--unpackaged--1.0.sql b/contrib/pg_timing_events/pg_timing_events--unpackaged--1.0.sql
new file mode 100644
index 0000000..0c9b1d8
--- /dev/null
+++ b/contrib/pg_timing_events/pg_timing_events--unpackaged--1.0.sql
@@ -0,0 +1,8 @@
+/* contrib/pg_timing_events/pg_timing_events--unpackaged--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION pg_timing_events" to load this file. \quit
+
+ALTER EXTENSION pg_timing_events ADD function pg_timing_events_reset();
+ALTER EXTENSION pg_timing_events ADD function pg_timing_events();
+ALTER EXTENSION pg_timing_events ADD view pg_timing_events;
diff --git a/contrib/pg_timing_events/pg_timing_events.c b/contrib/pg_timing_events/pg_timing_events.c
new file mode 100644
index 0000000..d7558e6
--- /dev/null
+++ b/contrib/pg_timing_events/pg_timing_events.c
@@ -0,0 +1,716 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_timing_events.c
+ * Track statement execution times across a whole database cluster.
+ *
+ * Note about locking issues: to create or delete an entry in the shared
+ * hashtable, one must hold pgte->lock exclusively. Modifying any field
+ * in an entry requires the same. To look up an entry, one must hold
+ * the lock shared.
+ *
+ * Portions copyright (c) 2012, Salesforce.com
+ * Portions copyright (c) 2008-2012, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * contrib/pg_timing_events/pg_timing_events.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <unistd.h>
+
+#include "access/hash.h"
+#include "catalog/pg_type.h"
+#include "executor/executor.h"
+#include "executor/instrument.h"
+#include "funcapi.h"
+#include "mb/pg_wchar.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/spin.h"
+#include "tcop/utility.h"
+#include "utils/builtins.h"
+#include "utils/hsearch.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+#include "utils/timing_event.h"
+
+
+PG_MODULE_MAGIC;
+
+/* Location of stats file */
+#define PGTE_DUMP_FILE "global/pg_timing_events.stat"
+
+/* This constant defines the magic number in the stats file header */
+static const uint32 PGTE_FILE_HEADER = 0x20100108;
+
+/*
+ * Statistics per statement
+ *
+ * NB: see the file read/write code before changing field order here.
+ */
+typedef struct pgteEntry
+{
+ /* TODO Remove key - it's just to cut down on compiler errors while refactoring to eliminate */
+ int key;
+
+ Timestamp entry_time;
+ char *description;
+ int serial_num;
+
+ /* Old entries just to use as examples */
+ Oid userid; /* user OID */
+ Oid dbid; /* database OID */
+ int encoding; /* query encoding */
+ int query_len; /* # of valid bytes in query string */
+ const char *query_ptr; /* query string proper */
+
+ slock_t mutex; /* protects the counters only */
+ char query[1]; /* VARIABLE LENGTH ARRAY - MUST BE LAST */
+ /* Note: the allocated length of query[] is actually pgte->query_size */
+} pgteEntry;
+
+/*
+ * Global shared state
+ */
+typedef struct pgteSharedState
+{
+ LWLockId lock; /* protects queue lookup/modification */
+ int serial_num;
+ int front;
+ unsigned char full; /* really a boolean */
+ char last_query[1024];
+ int queue_size;
+ int query_size; /* max query length in bytes */
+} pgteSharedState;
+
+/*---- Local variables ----*/
+
+/* Saved hook values in case of unload */
+static shmem_startup_hook_type prev_shmem_startup_hook = NULL;
+static timing_event_hook_type prev_timing_event_hook = NULL;
+
+/* Links to shared memory state */
+static pgteSharedState *pgte = NULL;
+static pgteEntry *PGTE_queue = NULL;
+
+/*---- GUC variables ----*/
+
+static int PGTE_max; /* max # statements to track */
+static int PGTE_track = 1; /* tracking level */
+static bool PGTE_save; /* whether to save stats across shutdown */
+
+
+/* TODO Add another GUC for disabling collection */
+#define PGTE_enabled() \
+ (PGTE_track == 1 )
+
+/*---- Function declarations ----*/
+
+void _PG_init(void);
+void _PG_fini(void);
+
+Datum pg_timing_events_reset(PG_FUNCTION_ARGS);
+Datum pg_timing_events(PG_FUNCTION_ARGS);
+
+PG_FUNCTION_INFO_V1(pg_timing_events_reset);
+PG_FUNCTION_INFO_V1(pg_timing_events);
+
+static void PGTE_shmem_startup(void);
+static void PGTE_shmem_shutdown(int code, Datum arg);
+static void PGTE_timing_event(const char *desc);
+static void PGTE_store(const char *query);
+static Size PGTE_memsize(void);
+static void entry_reset(void);
+
+
+/*
+ * Module load callback
+ */
+void
+_PG_init(void)
+{
+ /*
+ * In order to create our shared memory area, we have to be loaded via
+ * shared_preload_libraries. If not, fall out without hooking into any of
+ * the main system. (We don't throw error here because it seems useful to
+ * allow the pg_timing_events functions to be created even when the
+ * module isn't active. The functions must protect themselves against
+ * being called then, however.)
+ */
+ if (!process_shared_preload_libraries_in_progress)
+ return;
+
+ /*
+ * Define (or redefine) custom GUC variables.
+ */
+ DefineCustomIntVariable("pg_timing_events.max",
+ "Sets the maximum number of events tracked by pg_timing_events.",
+ NULL,
+ &PGTE_max,
+ 1000,
+ 100,
+ INT_MAX,
+ PGC_POSTMASTER,
+ 0,
+ NULL,
+ NULL,
+ NULL);
+
+ DefineCustomBoolVariable("pg_timing_events.save",
+ "Save pg_timing_events statistics across server shutdowns.",
+ NULL,
+ &PGTE_save,
+ true,
+ PGC_SIGHUP,
+ 0,
+ NULL,
+ NULL,
+ NULL);
+
+ EmitWarningsOnPlaceholders("pg_timing_events");
+
+ /*
+ * Request additional shared resources. (These are no-ops if we're not in
+ * the postmaster process.) We'll allocate or attach to the shared
+ * resources in PGTE_shmem_startup().
+ */
+ RequestAddinShmemSpace(PGTE_memsize());
+ RequestAddinLWLocks(1);
+
+ /*
+ * Install hooks.
+ */
+ prev_shmem_startup_hook = shmem_startup_hook;
+ shmem_startup_hook = PGTE_shmem_startup;
+
+ prev_timing_event_hook = timing_event_hook;
+ timing_event_hook = PGTE_timing_event;
+}
+
+/*
+ * Module unload callback
+ */
+void
+_PG_fini(void)
+{
+ /* Uninstall hooks. */
+ shmem_startup_hook = prev_shmem_startup_hook;
+ timing_event_hook = prev_timing_event_hook;
+
+}
+
+/*
+ * shmem_startup hook: allocate or attach to shared memory,
+ * then load any pre-existing statistics from file.
+ */
+static void
+PGTE_shmem_startup(void)
+{
+ bool found;
+ FILE *file;
+ uint32 header;
+ int32 num;
+ int32 i;
+ int query_size;
+ int buffer_size;
+ char *buffer = NULL;
+
+ if (prev_shmem_startup_hook)
+ prev_shmem_startup_hook();
+
+ /* reset in case this is a restart within the postmaster */
+ pgte = NULL;
+
+ /*
+ * Create or attach to the shared memory state, including hash table
+ */
+ LWLockAcquire(AddinShmemInitLock, LW_EXCLUSIVE);
+
+ pgte = (pgteSharedState *) ShmemInitStruct("pg_timing_events",
+ sizeof(pgteSharedState),
+ &found);
+
+ if (!found)
+ {
+ /* First time through ... */
+ pgte->lock = LWLockAssign();
+ pgte->query_size = pgstat_track_activity_query_size;
+ pgte->front = -1;
+ pgte->full = 0;
+ }
+
+ /* Be sure everyone agrees on the table entry sizes */
+ query_size = pgte->query_size;
+ pgte->queue_size = PGTE_max;
+
+ PGTE_queue = (pgteEntry *)
+ ShmemInitStruct("Timing Event Queue",
+ PGTE_max * (Size) sizeof(pgteEntry *), &found);
+ if (!found)
+ {
+ /*
+ * TODO If the queue logic is right, there should be no need
+ * to initialize any of its elements here?
+ */
+ }
+
+ LWLockRelease(AddinShmemInitLock);
+
+ /*
+ * If we're in the postmaster (or a standalone backend...), set up a shmem
+ * exit hook to dump the statistics to disk.
+ */
+ if (!IsUnderPostmaster)
+ on_shmem_exit(PGTE_shmem_shutdown, (Datum) 0);
+
+ /*
+ * Attempt to load old statistics from the dump file, if this is the first
+ * time through and we weren't told not to.
+ */
+ if (found || !PGTE_save)
+ return;
+
+ /*
+ * TODO Below is the old pg_stat_statements reading code, but since we've
+ * disabled writes we should never find a file to read yet for it to break
+ * on. Return anyway just to be safe.
+ */
+ return;
+
+ /*
+ * Note: we don't bother with locks here, because there should be no other
+ * processes running when this code is reached.
+ */
+ file = AllocateFile(PGTE_DUMP_FILE, PG_BINARY_R);
+ if (file == NULL)
+ {
+ if (errno == ENOENT)
+ return; /* ignore not-found error */
+ goto error;
+ }
+
+ buffer_size = query_size;
+ buffer = (char *) palloc(buffer_size);
+
+ if (fread(&header, sizeof(uint32), 1, file) != 1 ||
+ header != PGTE_FILE_HEADER ||
+ fread(&num, sizeof(int32), 1, file) != 1)
+ goto error;
+
+ for (i = 0; i < num; i++)
+ {
+ pgteEntry temp;
+/*
+ pgteEntry *entry;
+*/
+ if (fread(&temp, offsetof(pgteEntry, mutex), 1, file) != 1)
+ goto error;
+
+ /* Encoding is the only field we can easily sanity-check */
+/*
+ if (!PG_VALID_BE_ENCODING(temp.key.encoding))
+ goto error;
+*/
+ /* Previous incarnation might have had a larger query_size */
+/*
+ if (temp.key.query_len >= buffer_size)
+ {
+ buffer = (char *) repalloc(buffer, temp.key.query_len + 1);
+ buffer_size = temp.key.query_len + 1;
+ }
+
+ if (fread(buffer, 1, temp.key.query_len, file) != temp.key.query_len)
+ goto error;
+ buffer[temp.key.query_len] = '\0';
+*/
+ /* Clip to available length if needed */
+/*
+ if (temp.key.query_len >= query_size)
+ temp.key.query_len = pg_encoding_mbcliplen(temp.key.encoding,
+ buffer,
+ temp.key.query_len,
+ query_size - 1);
+ temp.key.query_ptr = buffer;
+*/
+ /* make the hashtable entry (discards old entries if too many) */
+/*
+ entry = entry_alloc(&temp);
+*/
+ }
+
+ pfree(buffer);
+ FreeFile(file);
+
+ /*
+ * Remove the file so it's not included in backups/replication
+ * slaves, etc. A new file will be written on next shutdown.
+ */
+ unlink(PGTE_DUMP_FILE);
+
+ return;
+
+error:
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not read pg_stat_statement file \"%s\": %m",
+ PGTE_DUMP_FILE)));
+ if (buffer)
+ pfree(buffer);
+ if (file)
+ FreeFile(file);
+ /* If possible, throw away the bogus file; ignore any error */
+ unlink(PGTE_DUMP_FILE);
+}
+
+/*
+ * shmem_shutdown hook: Dump statistics into file.
+ *
+ * Note: we don't bother with acquiring lock, because there should be no
+ * other processes running when this is called.
+ */
+static void
+PGTE_shmem_shutdown(int code, Datum arg)
+{
+ FILE *file;
+
+/*
+ TODO Don't need to store state until code is rewritten
+ HASH_SEQ_STATUS hash_seq;
+ int32 num_entries;
+ pgteEntry *entry;
+*/
+
+ /* Don't try to dump during a crash. */
+ if (code)
+ return;
+
+ /* Safety check ... shouldn't get here unless shmem is set up. */
+ if (!pgte || !PGTE_queue)
+ return;
+
+ /* Don't dump if told not to. */
+ if (!PGTE_save)
+ return;
+
+ /* TODO Cannot execute the below until it's corrected to write queue entries */
+ return;
+
+ file = AllocateFile(PGTE_DUMP_FILE ".tmp", PG_BINARY_W);
+ if (file == NULL)
+ goto error;
+
+ if (fwrite(&PGTE_FILE_HEADER, sizeof(uint32), 1, file) != 1)
+ goto error;
+
+/* TODO Replace with a queue iterator
+ num_entries = queue_get_num_entries(pgteSharedState);
+ if (fwrite(&num_entries, sizeof(int32), 1, file) != 1)
+ goto error;
+*/
+
+/*
+ hash_seq_init(&hash_seq, PGTE_hash);
+ while ((entry = hash_seq_search(&hash_seq)) != NULL)
+ {
+ int len = entry->key.query_len;
+
+ if (fwrite(entry, offsetof(pgteEntry, mutex), 1, file) != 1 ||
+ fwrite(entry->query, 1, len, file) != len)
+ goto error;
+ }
+*/
+
+ if (FreeFile(file))
+ {
+ file = NULL;
+ goto error;
+ }
+
+ /*
+ * Rename file into place, so we atomically replace the old one.
+ */
+ if (rename(PGTE_DUMP_FILE ".tmp", PGTE_DUMP_FILE) != 0)
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not rename pg_stat_statement file \"%s\": %m",
+ PGTE_DUMP_FILE ".tmp")));
+
+ return;
+
+error:
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not write pg_stat_statement file \"%s\": %m",
+ PGTE_DUMP_FILE ".tmp")));
+ if (file)
+ FreeFile(file);
+ unlink(PGTE_DUMP_FILE ".tmp");
+}
+
+/*
+ * Timing event hook: save the event
+ */
+static void
+PGTE_timing_event(const char *desc)
+{
+ if (PGTE_enabled())
+ {
+ PGTE_store(desc);
+ }
+}
+
+/*
+ * Store some statistics for a statement.
+ */
+static void
+PGTE_store(const char *query)
+{
+ Assert(query != NULL);
+
+ /* Safety check... */
+ if (!pgte || !PGTE_queue)
+ return;
+
+ /* Must acquire exclusive lock to add a new entry. */
+ LWLockAcquire(pgte->lock, LW_EXCLUSIVE);
+
+ pgte->front++;
+ if (pgte->front == PGTE_max)
+ {
+ pgte->front = 0;
+ /*
+	 * Noting that the queue has filled is redundant but harmless on passes after
+ * the first time it happens.
+ */
+ pgte->full = 1;
+ }
+
+ /* Using a plain strdup here results in junk events from the background writer */
+ PGTE_queue[pgte->front].description=MemoryContextStrdup(TopMemoryContext, query);
+ PGTE_queue[pgte->front].serial_num=pgte->serial_num;
+ pgte->serial_num++;
+
+ /*
+ * TODO Cheat and save a copy in a static buffer. This must be removed once the
+ * memory is correctly moved to the output via the MemoryContextStrdup above
+ * (or its replacement)
+ */
+ strncpy(pgte->last_query,query,1023);
+
+ /* TODO Examples of how to grab some state data we'll need to add later
+ key.userid = GetUserId();
+ key.dbid = MyDatabaseId;
+ key.encoding = GetDatabaseEncoding();
+ key.query_len = strlen(query);
+ if (key.query_len >= pgte->query_size)
+ key.query_len = pg_encoding_mbcliplen(key.encoding,
+ query,
+ key.query_len,
+ pgte->query_size - 1);
+ key.query_ptr = query;
+ */
+
+ LWLockRelease(pgte->lock);
+}
+
+/*
+ * Reset all statement statistics.
+ */
+Datum
+pg_timing_events_reset(PG_FUNCTION_ARGS)
+{
+ if (!pgte || !PGTE_queue)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("pg_timing_events must be loaded via shared_preload_libraries")));
+ entry_reset();
+ PG_RETURN_VOID();
+}
+
+#define pg_timing_events_COLS 2
+
+/*
+ * Retrieve statement statistics.
+ */
+Datum
+pg_timing_events(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ TupleDesc tupdesc;
+ Tuplestorestate *tupstore;
+ MemoryContext per_query_ctx;
+ MemoryContext oldcontext;
+ /*
+ Oid userid = GetUserId();
+ bool is_superuser = superuser();
+ */
+ int count;
+ int position;
+ int offset;
+
+ if (!pgte || !PGTE_queue)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("pg_timing_events must be loaded via shared_preload_libraries")));
+
+ /* check to see if caller supports us returning a tuplestore */
+ if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("set-valued function called in context that cannot accept a set")));
+ if (!(rsinfo->allowedModes & SFRM_Materialize))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("materialize mode required, but it is not " \
+ "allowed in this context")));
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+ oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+ tupstore = tuplestore_begin_heap(true, false, work_mem);
+ rsinfo->returnMode = SFRM_Materialize;
+ rsinfo->setResult = tupstore;
+ rsinfo->setDesc = tupdesc;
+
+ MemoryContextSwitchTo(oldcontext);
+
+ /* TODO Fake data insertion, remove once real events are being stored */
+ /*PGTE_store("dummy data");*/
+
+ /*
+ * We can co-exist with other readers, but will block concurrent additions
+ */
+ LWLockAcquire(pgte->lock, LW_SHARED);
+
+ /*
+ * Determine start and how many active entries are in the queue.
+ * Once the array holding the queue is full, the next position after
+ * the front of the queue holds the oldest entry. But before the queue
+ * fills, the oldest entry is at position 0.
+ */
+ if (pgte->full)
+ {
+ position=(pgte->front + 1) % PGTE_max;
+ count = PGTE_max;
+ }
+ else
+ {
+ position=0;
+ count = pgte->front + 1;
+ }
+
+ offset=0; /* Just for debug printing, not navigation */
+
+ for (; count>0; count--)
+ {
+ Datum values[pg_timing_events_COLS];
+ bool nulls[pg_timing_events_COLS];
+ int col = 0;
+ pgteEntry *entry = &PGTE_queue[position];
+
+ memset(values, 0, sizeof(values));
+ memset(nulls, 0, sizeof(nulls));
+
+ /* printf("position[%d]='%s' (element %d)\n",offset,PGTE_queue[position].description,i); */
+
+ /*
+ * TODO Quietly upcasting the temporary 32 bit serial_num to 64 bits.
+ * Eventually the serial number needs to be a true 64 bit value if
+ * it is kept in the final design.
+ */
+ values[col++] = Int64GetDatumFast(entry->serial_num);
+
+/* Debugging hack if we want to see how long the description string is */
+/* values[col++] = Int64GetDatumFast(strlen(entry->description));*/
+
+/* Code that should work but doesn't yet */
+/* values[col++] = CStringGetTextDatum(entry->description); */
+
+ values[col++] = CStringGetTextDatum(pgte->last_query);
+
+/*
+ values[col++] = ObjectIdGetDatum(entry->key.userid);
+ values[col++] = ObjectIdGetDatum(entry->key.dbid);
+*/
+
+/*
+ if (is_superuser || entry->userid == userid)
+ {
+ char *qstr;
+
+ qstr = (char *)
+ pg_do_encoding_conversion((unsigned char *) entry->query,
+ entry->query_len,
+ entry->encoding,
+ GetDatabaseEncoding());
+ values[col++] = CStringGetTextDatum(qstr);
+ if (qstr != entry->query)
+ pfree(qstr);
+ }
+ else
+ values[col++] = CStringGetTextDatum("<insufficient privilege>");
+*/
+ Assert(col == pg_timing_events_COLS);
+
+ tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+
+ /* Move to next queue position, wrapping around the array at the end */
+ position++;
+ if (position == PGTE_max)
+ position=0;
+ offset++;
+ }
+
+ LWLockRelease(pgte->lock);
+
+ /* clean up and return the tuplestore */
+ tuplestore_donestoring(tupstore);
+
+ return (Datum) 0;
+}
+
+/*
+ * Estimate shared memory space needed.
+ */
+static Size
+PGTE_memsize(void)
+{
+ Size size;
+ Size entrysize;
+
+ size = MAXALIGN(sizeof(pgteSharedState));
+ entrysize = offsetof(pgteEntry, query) +pgstat_track_activity_query_size;
+ size = add_size(size, hash_estimate_size(PGTE_max, entrysize));
+
+ return size;
+}
+
+/*
+ * Release all entries.
+ */
+static void
+entry_reset(void)
+{
+ pgteEntry *entry;
+
+ LWLockAcquire(pgte->lock, LW_EXCLUSIVE);
+ entry=PGTE_queue;
+
+/* TODO Iterate over queue array
+ while ((entry = hash_seq_search(&hash_seq)) != NULL)
+ {
+ hash_search(PGTE_hash, &entry->key, HASH_REMOVE, NULL);
+ }
+*/
+
+ LWLockRelease(pgte->lock);
+
+}
diff --git a/contrib/pg_timing_events/pg_timing_events.control b/contrib/pg_timing_events/pg_timing_events.control
new file mode 100644
index 0000000..7db2d12
--- /dev/null
+++ b/contrib/pg_timing_events/pg_timing_events.control
@@ -0,0 +1,5 @@
+# pg_timing_events extension
+comment = 'track execution statistics of all SQL statements executed'
+default_version = '1.0'
+module_pathname = '$libdir/pg_timing_events'
+relocatable = true
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index bf76f6d..8abce9a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -58,6 +58,7 @@
#include "utils/relmapper.h"
#include "utils/snapmgr.h"
#include "utils/timestamp.h"
+#include "utils/timing_event.h"
#include "pg_trace.h"
@@ -6730,6 +6731,7 @@ LogCheckpointEnd(bool restartpoint)
longest_usecs,
average_usecs;
uint64 average_sync_time;
+ StringInfoData timing_descr;
CheckpointStats.ckpt_end_t = GetCurrentTimestamp();
@@ -6774,6 +6776,31 @@ LogCheckpointEnd(bool restartpoint)
average_secs = (long) (average_sync_time / 1000000);
average_usecs = average_sync_time - (uint64) average_secs *1000000;
+ if (timing_event_hook)
+ {
+ initStringInfo(&timing_descr);
+ appendStringInfo(&timing_descr,
+ "check/restartpoint complete: wrote %d buffers (%.1f%%); "
+ "%d transaction log file(s) added, %d removed, %d recycled; "
+ "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s; "
+ "sync files=%d, longest=%ld.%03d s, average=%ld.%03d s",
+ CheckpointStats.ckpt_bufs_written,
+ (double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
+ CheckpointStats.ckpt_segs_added,
+ CheckpointStats.ckpt_segs_removed,
+ CheckpointStats.ckpt_segs_recycled,
+ write_secs, write_usecs / 1000,
+ sync_secs, sync_usecs / 1000,
+ total_secs, total_usecs / 1000,
+ CheckpointStats.ckpt_sync_rels,
+ longest_secs, longest_usecs / 1000,
+ average_secs, average_usecs / 1000);
+
+ elog(LOG,"Timing event message: %s",timing_descr.data);
+
+ (*timing_event_hook) (timing_descr.data);
+ }
+
if (restartpoint)
elog(LOG, "restartpoint complete: wrote %d buffers (%.1f%%); "
"%d transaction log file(s) added, %d removed, %d recycled; "
diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c
index 68b7ab3..1deb31c 100644
--- a/src/backend/utils/error/elog.c
+++ b/src/backend/utils/error/elog.c
@@ -77,6 +77,7 @@
#include "utils/guc.h"
#include "utils/memutils.h"
#include "utils/ps_status.h"
+#include "utils/timing_event.h"
#undef _
@@ -104,6 +105,11 @@ extern bool redirection_done;
*/
emit_log_hook_type emit_log_hook = NULL;
+/*
+ * Hook for notifying that a timing event happened.
+ */
+timing_event_hook_type timing_event_hook = NULL;
+
/* GUC parameters */
int Log_error_verbosity = PGERROR_VERBOSE;
char *Log_line_prefix = NULL; /* format for extra log line info */
diff --git a/src/include/utils/timing_event.h b/src/include/utils/timing_event.h
new file mode 100644
index 0000000..7c516f7
--- /dev/null
+++ b/src/include/utils/timing_event.h
@@ -0,0 +1,20 @@
+/*-------------------------------------------------------------------------
+ *
+ * timing_event.h
+ * Timing event definitions.
+ *
+ * Portions Copyright (c) 1996-2012, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/utils/timing_event.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef TIMEEVENT_H
+#define TIMEEVENT_H
+
+/* Hook for sending timing event messages */
+typedef void (*timing_event_hook_type) (const char *event_text);
+extern PGDLLIMPORT timing_event_hook_type timing_event_hook;
+
+#endif /* TIMEEVENT_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index a831a1e..9c8356f 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2234,6 +2234,7 @@ timeKEY
time_t
timerCA
timezone_extra
+timing_event_hook_type
tlist_vinfo
transferMode
trgm
Extending this to save the key/value set and most of the other data I
mentioned before is pretty straightforward.
Why not use Hstore? Seriously?
It would require merging Hstore into core, but I don't necessarily see
that as a bad thing.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Greg Smith wrote:
-The event queue goes into a simple array accessed in circular fashion.
After realizing that output needed to navigate in the opposite order
of element addition, I ended up just putting all the queue
navigation code directly into the add/output routines. I'd be happy
to use a more formal Postgres type here instead--I looked at
SHM_QUEUE for example--but I haven't found something yet that makes
this any simpler.
SHM_QUEUE is on death row. Try a dlist from src/backend/lib/ilist.c
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 11/15/12 10:26 AM, Josh Berkus wrote:
Extending this to save the key/value set and most of the other data I
mentioned before is pretty straightforward.
Why not use Hstore? Seriously?
I just haven't done that part yet. The fact that hstore is the likely
way to do it is why I said it's straightforward, and why I haven't
gotten stressed about not having that finished. I wanted to try and
shake out some feedback on hook location and how best to deal with the
memory context, both of which might impact this. Extending the current
hook setup I've got here to pass data via hstore would mean pulling
hstore into core. If there's a better approach for adding a hook that
avoids that, I'd be happy to adopt it; I just haven't thought of one.
--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
On 11/16/2012 02:26 AM, Josh Berkus wrote:
Extending this to save the key/value set and most of the other data I
mentioned before is pretty straightforward.
Why not use Hstore? Seriously?
The only issue I see is that hstore's text format is non-standard, not
widely understood and a pain to work with in applications. You can
always write queries against it in Pg and extract normal row sets of
key/value pairs, so it's not a big problem - but I'd want to think about
adding a built-in `hstore_to_json` if we were going to use hstore to
manage the data, since json is so much more widely understood by client
applications.
Not that implementing `hstore_to_json` is exactly hard, mind you, as a
quick search shows: https://gist.github.com/2318757
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 11/16/12 12:20 AM, Craig Ringer wrote:
Not that implementing `hstore_to_json` is exactly hard, mind you, as a
quick search shows: https://gist.github.com/2318757
Both pulling hstore more firmly into core and adopting something like a
hstore_to_json call as the preferred UI for timing event data are all
directions I'd be happy to go. That's why I stopped before trying to
even implement that part. I think the general direction to go is clear,
but the details on how to present the resulting data is more of a tarp
than even a bikeshed at this point. It's not a hard problem though.
--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
2012/11/20 Greg Smith <greg@2ndquadrant.com>:
On 11/16/12 12:20 AM, Craig Ringer wrote:
Not that implementing `hstore_to_json` is exactly hard, mind you, as a
quick search shows: https://gist.github.com/2318757
Both pulling hstore more firmly into core and adopting something like a
hstore_to_json call as the preferred UI for timing event data are all
directions I'd be happy to go. That's why I stopped before trying to even
implement that part. I think the general direction to go is clear, but the
details on how to present the resulting data are more of a tarp than even a
bikeshed at this point. It's not a hard problem though.
I don't like to see current hstore in core - It doesn't support
nesting, it doesn't support different datatypes, it is not well
supported by plpgsql
regards
Pavel
--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
On 11/20/2012 04:48 PM, Pavel Stehule wrote:
2012/11/20 Greg Smith <greg@2ndquadrant.com>:
On 11/16/12 12:20 AM, Craig Ringer wrote:
Not that implementing `hstore_to_json` is exactly hard, mind you, as a
quick search shows: https://gist.github.com/2318757
Both pulling hstore more firmly into core and adopting something like a
hstore_to_json call as the preferred UI for timing event data are all
directions I'd be happy to go. That's why I stopped before trying to even
implement that part. I think the general direction to go is clear, but the
details on how to present the resulting data is more of a tarp than even a
bikeshed at this point. It's not a hard problem though.
I don't like to see current hstore in core - It doesn't support
nesting, it doesn't support different datatypes, it is not well
supported by plpgsql
... or by many client libraries, though converting hstore values to json
notation for output usually takes care of that.
I can't help but suspect that hstore will be subsumed by native storage
and indexing of json objects, since these are already widely supported
in KV or document stores. Of course, that requires that we come to some
agreement on how to represent literal scalars for interaction with json,
which hasn't seen much progress.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 11/20/12 8:02 PM, Craig Ringer wrote:
On 11/20/2012 04:48 PM, Pavel Stehule wrote:
I don't like to see current hstore in core - It doesn't support
nesting, it doesn't support different datatypes, it is not well
supported by plpgsql
... or by many client libraries, though converting hstore values to json
notation for output usually takes care of that.
The more I look at this, the less comfortable I am moving forward with
hand waving this issue away. The obvious but seemingly all troublesome
options:
-Store things internally using hstore. This seems to require pulling
hstore fully into core, which has this list of issues. Outputting
through a JSON filter seems like a lot of overhead. And even if the
output concerns are addressed, if there's a problem processing the raw
data with PL/pgSQL that would be bad. I didn't think about that at all
until Pavel brought it up.
-Go back to a simpler composite type idea for storing this data. It
feels wrong to create something that looks a lot like hstore data, but
isn't. There's probably subtle display/client/processing issues in
there I haven't thought of too.
-Try to store the data in a JSON friendly way in the first place.
That last one seems to lead right to...
I can't help but suspect that hstore will be subsumed by native storage
and indexing of json objects, since these are already widely supported
in KV or document stores. Of course, that requires that we come to some
agreement on how to represent literal scalars for interaction with json,
which hasn't seen much progress.
If this is the Right Way to solve this problem, that puts a serious
snag in doing something useful with Timing Events in the near future. I
know better than to try and fight against following the correct path
once it's identified. I can't claim any good expertise on client
encoding/transfer issues though; that's not a strong area for me.
--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
2012/12/12 Greg Smith <greg@2ndquadrant.com>:
On 11/20/12 8:02 PM, Craig Ringer wrote:
On 11/20/2012 04:48 PM, Pavel Stehule wrote:
I don't like to see current hstore in core - It doesn't support
nesting, it doesn't support different datatypes, it is not well
supported by plpgsql
... or by many client libraries, though converting hstore values to json
notation for output usually takes care of that.
The more I look at this, the less comfortable I am moving forward with hand
waving this issue away. The obvious but seemingly all troublesome options:
-Store things internally using hstore. This seems to require pulling hstore
fully into core, which has this list of issues. Outputting through a JSON
filter seems like a lot of overhead. And even if the output concerns are
addressed, if there's a problem processing the raw data with PL/pgSQL that
would be bad. I didn't think about that at all until Pavel brought it up.
-Go back to a simpler composite type idea for storing this data. It feels
wrong to create something that looks a lot like hstore data, but isn't.
There's probably subtle display/client/processing issues in there I haven't
thought of too.
-Try to store the data in a JSON friendly way in the first place.
That last one seems to lead right to...
I can't help but suspect that hstore will be subsumed by native storage
and indexing of json objects, since these are already widely supported
in KV or document stores. Of course, that requires that we come to some
agreement on how to represent literal scalars for interaction with json,
which hasn't seen much progress.
If this is the Right Way to solve this problem, that puts a serious snag
in doing something useful with Timing Events in the near future. I know
better than to try and fight against following the correct path once it's
identified. I can't claim any good expertise on client encoding/transfer
issues though; that's not a strong area for me.
I didn't follow this whole discussion, but I don't understand why
you need complex datatypes here.
Maybe you can normalize the data, and keep the complex data types only
in memory.
Regards
Pavel
--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
I decided to take a look at this.
On 15 November 2012 09:56, Greg Smith <greg@2ndquadrant.com> wrote:
-I modeled the hook here on the logging one that went into 9.2. It's
defined in its own include file and it gets initialized by the logging
system. No strong justification for putting it there, it was just similar
to the existing hook and that seemed reasonable. This is in some respects
an alternate form of logging after all. The data sent by the hook itself
needs to be more verbose in order to handle the full spec; right now I'm
just asking about placement.
I noticed that when !log_checkpoints, control never reaches the site
where the hook is called, and thus the checkpoint info is not stored.
Is that the intended behaviour of the patch? I don't feel strongly
either way, but it did give me pause for a second.
I don't like the design of this patch. It seems like we should be
moving as far away from an (internal) textual representation as
possible - that's basically the main reason why logging is bad.
Textual representations make easy, correct querying hard, are bloated,
and are really no better than manual logging.
The first question I'd ask is "what is true of all timing events?".
You address this in your original RFC [1]. However, this isn't
reflected in the datastructures that the patch uses, and I suspect
that it should be.
What I'm envisioning is new NodeTag infrastructure. You'd have an
"abstract base class". This is sort of like the one used by plan nodes
(i.e. the Plan struct). This would contain the information that is
essential to all timing events (in the case of Plan, this includes
startup and total costs for the node in question). You'd refactor
structs like CheckpointStatsData, to compose them such that their
first field was this parent struct (i.e. logically, they'd inherit
from, say, TimingEvent, which itself would inherit from Node, in the
style of Plan's children). That'd entail normalising things like
inconsistent use of datatypes to represent time intervals and whatnot
among these sorts of structs in the extant code, which is painful, but
long term it seems like the way forward.
With this commonality and variability analysis performed, and having
refactored the code, you'd be able to use type punning to pass the
essential information to your new hook. You'd have a way to usefully
restrict statistics to only one class of timing event, because you
could expose the NodeTag in your view (or perhaps a prettified version
thereof), by putting something in a query predicate. You can perform
aggregation, too. Maybe this abstract base class would look something
like:
typedef struct TimingEvent
{
NodeTag type;
/*
* Common structural data for all TimingEvent types.
*/
double duration; /* Duration of event in milliseconds (per project
policy for views) */
double naffected; /* Number of objects (of variable type) affected */
char *string; /* Non-standardised event string */
} TimingEvent;
However, my failure to come up with more fields here leads me to
believe that we'd be better off with a slightly different approach.
Let's forget about storing text altogether, and make the unstructured
"union" field JSON (while making fields common to all timing events
plain-old scalar datums, for ease of use).
It would become the job of PGTE_memsize() to make each entry in the
shared hashtable wide enough to accommodate the widest of all possible
TimingEvent-inheriting structs. We'd just store heterogeneous
TimingEvent-inheriting structs in the hash table. It would be
pg_timing_events()'s job to call a JSON conversion routine directly, with
baked-in knowledge, within a NodeTag case statement.
It actually wouldn't be hard to have the view spit out some JSON from
the internal, compact struct representation. Currently, explain.c
manages to generate JSON representations of plans with very little
fuss, and without using any of the JSON datatype stuff. Doing this
with hstore is just way too controversial, and not particularly worth
it, IMHO. With a "where event_type = x" in the query predicate, the
JSON datums would have predictable, consistent structure, facilitating
machine reading and aggregation.
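To make the dispatch concrete, pg_timing_events() might do something like
this per entry (only a sketch; T_CheckpointTimingEvent, the
CheckpointTimingEvent struct and its fields are invented for illustration):

	TimingEvent *te = (TimingEvent *) entry;
	StringInfoData detail;

	initStringInfo(&detail);

	/* Fields common to every TimingEvent become ordinary columns */
	values[col++] = Float8GetDatum(te->duration);

	switch (nodeTag(te))
	{
		case T_CheckpointTimingEvent:
			{
				CheckpointTimingEvent *cte = (CheckpointTimingEvent *) te;

				/* Baked-in knowledge of this event type's fields */
				appendStringInfo(&detail,
								 "{\"buffers_written\": %d, \"sync_files\": %d}",
								 cte->bufs_written, cte->sync_rels);
			}
			break;
		default:
			elog(ERROR, "unrecognized timing event node type: %d",
				 (int) nodeTag(te));
	}

	values[col++] = CStringGetTextDatum(detail.data);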
[1]: Original RFC e-mail: /messages/by-id/509300F7.5000803@2ndQuadrant.com
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
On 1/14/13 11:19 AM, Peter Geoghegan wrote:
I noticed that when !log_checkpoints, control never reaches the site
where the hook is called, and thus the checkpoint info is not stored.
Is that the intended behaviour of the patch?
I was aware and considered it a defensible situation--that turning off
checkpoint logging takes that data out everywhere. But it was not
necessarily the right thing to do.
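If we decide the data should flow to the hook whenever a consumer is
attached, I think the change is small: whatever currently short-circuits
that code path when log_checkpoints is off could also check for a hook,
roughly (untested sketch):

	/* Bail out only if there is neither a log nor a hook consumer */
	if (!log_checkpoints && timing_event_hook == NULL)
		return;

The existing elog(LOG) output further down would then need its own
log_checkpoints guard so the server log doesn't get noisier.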
Currently, explain.c
manages to generate JSON representations of plans with very little
fuss, and without using any of the JSON datatype stuff. Doing this
with hstore is just way too controversial, and not particularly worth
it, IMHO.
I wasn't optimistic after seeing the number of bugs that scurried out
when the hstore rock was turned over for this job. On the
implementation side, the next round of code I've been playing with here
has struggled with the problem of rendering to strings earlier than I'd
like. I'd like to delay that as long as possible; certainly not do it
during storage, and preferably only when someone asks for the
timing event.
With a "where event_type = x" in the query predicate, the
JSON datums would have predictable, consistent structure, facilitating
machine reading and aggregation.
Filtering on a range of timestamps or on the serial number field is the
main thing that I imagined, as something that should limit results
before even producing tuples. The expected and important case where
someone wants "all timing events after #123" after persisting #123 to
disk, I'd like that to be efficient. All of the fields I'd want to see
filtering on are part of the fixed set of columns every entry will have.
To summarize, your suggestion is to build an in-memory structure capable
of holding the timing event data. The Datum approach will be used to
cope with different types of events having different underlying types.
The output format for queries against this data set will be JSON,
rendered directly from the structure similarly to how EXPLAIN (FORMAT
JSON) outputs query trees. The columns every row contains, like a
serial number, timestamp, and pid, can be filtered on by something
operating at the query executor level. Doing something useful with the
more generic, "dynamic schema" parts will likely require parsing the
JSON output.
Those are all great ideas I think people could live with.
It looks to me like the hook definition itself would need the entire
data structure defined, and known to work, before its API could be
nailed down. I was hoping that we might get a hook for diverting this
data committed into 9.3 even if the full extension to expose it wasn't
nailed down. That was based on similarity to the generic logging hook
that went into 9.2. This new implementation idea reminds me more of the
query length decoration needed for normalized pg_stat_statements
though--something that wasn't easy to just extract out from the consumer
at all.
--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com