POC: Extension for adding distributed tracing - pg_tracing
Hi hackers,
Currently, most distributed tracing solutions only provide details from the
client driver's point of view: we have no visibility into what happens once
the query reaches the database and how it is processed.
The goal of this project is to provide, as an extension, distributed
tracing within the database for more insight into query execution. The
idea and architecture are heavily inspired by pg_stat_statements and
auto_explain.
I have a working prototype of a pg_tracing extension and wanted some
feedback on the design and architecture.
Some definitions: (paraphrased from opentelemetry
https://opentelemetry.io/docs/concepts/signals/traces/)
A trace groups multiple spans, and a span represents a single specific
operation. It is defined by:
- a trace id
- a span id (randomly generated int64)
- a name
- a parent id (a span can have children)
- a start timestamp
- a duration
- attributes (key/value metadata)
We will use attributes to propagate the query's instrumentation details:
buffer usage, jit, wal usage, first tuple, plan cost...
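For concreteness, a span with those fields could be sketched in C like this
(a hypothetical layout for illustration only, not the struct actually used
by the extension):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical span layout; field names and sizes are illustrative only. */
typedef struct Span
{
    uint8_t     trace_id[16];   /* 128-bit trace id shared by all spans of a trace */
    uint64_t    span_id;        /* randomly generated 64-bit id */
    uint64_t    parent_id;      /* span_id of the parent, 0 for a root span */
    const char *name;           /* operation name, e.g. "Planner" or "SeqScan" */
    uint64_t    start_us;       /* start timestamp, microseconds since epoch */
    uint64_t    duration_ns;    /* span duration in nanoseconds */
    /* attributes (buffer usage, jit, wal usage...) would be attached here */
} Span;
```

The real structure would also need fixed offsets into the external text
file for the variable-sized fields, as discussed below.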
Triggering a trace:
===============
We rely on https://google.github.io/sqlcommenter/ to propagate trace
information.
A query with sqlcommenter will look like:
/*dddbs='postgres.db',traceparent='00-00000000000000000000000000000009-0000000000000005-01'*/
select 1;
The traceparent fields are detailed in
https://www.w3.org/TR/trace-context/#traceparent-header-field-values
The 3 fields are
00000000000000000000000000000009: trace id
0000000000000005: parent id
01: trace flags (01 == sampled)
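As a sketch, extracting those three fields from a traceparent value could
look like the following (illustrative code, not the extension's actual
parser):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Minimal sketch of extracting the three traceparent fields from a
 * "00-<32 hex trace id>-<16 hex parent id>-<2 hex flags>" value.
 * Names are illustrative; this is not the extension's actual code.
 */
typedef struct TraceContext
{
    char trace_id[33];   /* 32 hex chars + NUL */
    char parent_id[17];  /* 16 hex chars + NUL */
    bool sampled;        /* true when the flags byte has bit 0 set */
} TraceContext;

static bool
parse_traceparent(const char *value, TraceContext *ctx)
{
    /* expected length: 2 + 1 + 32 + 1 + 16 + 1 + 2 = 55 */
    if (strlen(value) != 55 ||
        value[2] != '-' || value[35] != '-' || value[52] != '-')
        return false;
    memcpy(ctx->trace_id, value + 3, 32);
    ctx->trace_id[32] = '\0';
    memcpy(ctx->parent_id, value + 36, 16);
    ctx->parent_id[16] = '\0';
    /* "01" means sampled; only the low bit of the last hex digit matters */
    ctx->sampled = (value[54] - '0') & 1;
    return true;
}
```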
If the query is sampled, we instrument the query and generate spans for the
following operations:
- Query Span: The top span for a query. It is created after extracting the
trace id from traceparent or to represent a nested query. We always have
at least one and may have more with nested queries.
- Planner: We track the time spent in the planner and report the planner
counters
- Executor: We trace the different steps of the Executor: Start, Run,
Finish and End
- Node Span: Created from the planstate. The name is extracted from the
node type (IndexScan, SeqScan...). For the operation name, we generate
something similar to the EXPLAIN output.
Storing spans:
===============
Spans are stored in a shared memory buffer. The buffer size is fixed and if
it is full, no further spans are generated until the buffer is freed.
Spans contain several variable-sized text fields: names, parameter
values...
To avoid keeping those in shared memory, we store them in an external
file (similar to pg_stat_statements' query text).
Processing spans:
===============
Traces need to be sent to a trace collector. We offload this logic to an
external application which will:
- Collect the spans through the output of a new pg_tracing_spans() function
- Build spans in any format specific to the targeted trace collector
(opentelemetry, datadog...)
- Send it to the targeted trace collector using the appropriate protocol
(gRPC, http...)
I have an example of a span forwarder that I've been using in my tests
available here:
https://gist.github.com/bonnefoa/6ed24520bdac026d6a6a6992d308bd50.
This setup permits a lot of flexibility:
- The postgres extension doesn't need any vendor-specific knowledge
- The trace forwarder can reuse existing libraries and any language to
send spans
Buffer management:
===============
I've tried to keep the memory management simple. Creating spans will add
new elements to the shared_spans buffer.
Once a span is forwarded, there's no need to keep it in the shared memory.
Thus, pg_tracing_spans() has a "consume" parameter which completely
empties the shared buffer when called by the forwarder.
If the buffer is regularly full, we can:
- increase the forwarder's frequency
- increase the shared buffer size
- decrease the amount of generated spans
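The consume behavior described above can be modeled with a toy fixed-size
buffer (names, sizes, and the int stand-in for span entries are
illustrative, not the extension's API):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy model of the shared span buffer: writers drop new spans when the
 * buffer is full, and the forwarder empties it in a single consume call.
 */
#define MAX_SPANS 4

typedef struct SpanBuffer
{
    int    spans[MAX_SPANS];  /* stand-in for real span entries */
    size_t count;
    size_t dropped;           /* tracked for pg_tracing_info-style stats */
} SpanBuffer;

static bool
add_span(SpanBuffer *buf, int span)
{
    if (buf->count == MAX_SPANS)
    {
        buf->dropped++;       /* buffer full: stop storing until consumed */
        return false;
    }
    buf->spans[buf->count++] = span;
    return true;
}

/* Returns the number of spans handed to the forwarder; empties the buffer. */
static size_t
consume_spans(SpanBuffer *buf)
{
    size_t n = buf->count;
    buf->count = 0;
    return n;
}
```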
Statistics are available through pg_tracing_info to keep track of the
number of generated spans and dropped events.
Overhead:
===============
I haven't run benchmarks yet, however, we will have multiple ways to
control the overhead.
Traces rely heavily on sampling to keep the overhead low: we want extensive
information on representative samples of our queries.
For now, we leave the sampling decision to the caller (through the sampled
flag), but I will add a parameter to apply further sampling if the rate of
traced queries is too high.
Stopping tracing when the buffer is full is also a good safeguard: if an
application is misconfigured and starts sampling every query, this will
stop all tracing and won't impact query performance.
Query Instrumentation:
===============
I have been able to rely on most of the existing query instrumentation
except for one missing piece of information: I needed the start time of a
node, as we currently only have the first tuple time and the duration. I
added firsttime to Instrumentation. This is the only change made outside
the extension's code.
Current status:
===============
The current code is able to generate spans which, once forwarded to a
trace collector, produce traces with execution details carrying a similar
amount of information to what an explain (analyze, buffers) would provide.
I have provided screenshots of the result.
There's a lot of work left in cleaning, commenting, handling edge cases and
adding tests.
Thoughts and feedback are welcome.
Regards,
Anthonin Bonnefoy
Hi Anthonin,
I have a working prototype of a pg_tracing extension and wanted some feedback on the design and architecture.
The patch looks very interesting, thanks for working on it and for
sharing. The facts that the patch doesn't change the core except for
two lines in instrument.{c,h} and that it uses a pull-based model:
- Collect the spans through a new pg_tracing_spans() function output
... IMO were the right design decisions. The patch lacks documentation,
but this is OK for a PoC.
I added the patch to the nearest CF [1]. Let's see what the rest of
the community thinks.
[1]: https://commitfest.postgresql.org/44/4456/
--
Best regards,
Aleksander Alekseev
Hi!
This patch looks very interesting, I'm working on the same subject too. But
I've used
another approach - I'm using C wrapper for C++ API library from
OpenTelemetry, and
handle span storage and output to this library. There are some nuances
though, but it
is possible. Have you tried to use OpenTelemetry APIs instead of
implementing all
functionality around spans?
--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/
Nikita,
This patch looks very interesting, I'm working on the same subject too. But I've used
another approach - I'm using C wrapper for C++ API library from OpenTelemetry, and
handle span storage and output to this library. There are some nuances though, but it
is possible. Have you tried to use OpenTelemetry APIs instead of implementing all
functionality around spans?
I don't think that PostgreSQL accepts that kind of C++ code, not to
mention the fact that the PostgreSQL license is not necessarily
compatible with Apache 2.0 (I'm not a lawyer; this is not legal
advice). Such a design decision would probably require separate
compile flags, since the user doesn't necessarily have the corresponding
dependency installed, similarly to how we do with LLVM, OpenSSL, etc.
So -1 to the OpenTelemetry C++ library and +1 to the properly licensed
C implementation without 3rd party dependencies from me. Especially
considering the fact that the implementation seems to be rather
simple.
--
Best regards,
Aleksander Alekseev
I've initially thought of sending the spans from PostgreSQL since this is
the usual behavior of tracing libraries.
However, this created a lot of potential issues:
- Protocol support and differences between trace collectors. OpenTelemetry
seems to use gRPC, others use http, and those would require additional
libraries (plus gRPC support in C doesn't look good); any change in the
protobuf definition would require updating the extension.
- Do we send the spans within the query hooks? This means we could
block the process if the trace collector is slow to answer or we can't
connect. Sending spans from a background process sounded rather complex
and resource-heavy.
Moving to a pull model fixed those issues and felt more natural as this is
the way PostgreSQL exposes its metrics.
Hi!
I've tried to test the extension, but got errors calling make check and
cannot install the extension.
I've applied the patch onto current master and configured it as:
./configure --enable-debug --enable-cassert --enable-depend
--enable-tap-tests
Could you please advise if I'm doing something wrong?
For make check please see attached log.
When installing the extension after setting the shared preload library,
I got an error in the Postgres logfile:
2023-07-27 13:41:52.324 MSK [12738] LOG: database system is shut down
2023-07-27 13:41:52.404 MSK [14126] FATAL: could not load library
"/usr/local/pgsql/lib/pg_tracing.so": /usr/local/pgsql/lib/pg_tracing.so:
undefined symbol: get_operation_name
Thank you!
--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/
Hi,
Also, FYI: there are build warnings because the functions
const char * get_span_name(const Span * span, const char *qbuffer)
and
const char * get_operation_name(const Span * span, const char *qbuffer)
do not have a default case inside the switch and no return outside of the
switch.
--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/
Hi,
Also, FYI: there are build warnings because the functions
const char * get_span_name(const Span * span, const char *qbuffer)
and
const char * get_operation_name(const Span * span, const char *qbuffer)
do not have a default case inside the switch and no return outside of the
switch.
You are right, there are a few warnings:
```
[1566/1887] Compiling C object contrib/pg_tracing/pg_tracing.so.p/span.c.o
../contrib/pg_tracing/span.c: In function ‘get_span_name’:
../contrib/pg_tracing/span.c:210:1: warning: control reaches end of
non-void function [-Wreturn-type]
210 | }
| ^
../contrib/pg_tracing/span.c: In function ‘get_operation_name’:
../contrib/pg_tracing/span.c:249:1: warning: control reaches end of
non-void function [-Wreturn-type]
249 | }
| ^
```
Here is the patch v2 with a quick fix.
but got errors calling make check and cannot install the extension
Agree, something goes wrong when using Autotools (but not Meson) on
both Linux and macOS. I didn't investigate the issue though.
--
Best regards,
Aleksander Alekseev
Attachments:
v2-0001-TODO-FIXME-write-a-proper-commit-message-this-is-.patch
Hi,
I've fixed the Autotools build, please check patch below (v2).
--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/
Attachments:
pg-tracing-v2.patch
Agree, something goes wrong when using Autotools (but not Meson) on
both Linux and MacOS. I didn't investigate the issue though.
I was only using Meson and forgot to keep the Automake files up to date
when I split pg_tracing.c into multiple files (span.c, explain.c...).
Hi,
I'd keep the Autotools build up to date, because Meson is very limited
on not-very-up-to-date OS versions. Anyway, it's OK now. I'm currently
playing with the patch and reviewing the sources; if you need any
cooperation, please let us know. I'm +1 for committing this patch after
review.
--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/
Hi,
I'd keep Autotools build up to date
Definitely. The patch is still in a rough state and updating Autotools fell
through the cracks.
I'm currently playing with the patch and
reviewing sources, if you need any cooperation - please let us know.
The goal for me was to get validation on the design and to check whether
anything would be a blocker before committing more time to the project.
There are already several things I was planning on improving:
- The parse span is something I've added recently. I've realised that I
could rely on stmtStartTimestamp to get the time spent in parsing in the
post parse hook, so the code around this is still rough.
- There are probably race conditions on the reads and writes of the query
file that need to be fixed
- I'm using the same Span structure for everything while Executor spans
only need a small subset which is a bit wasteful. I've tried to use two
different shared buffers (base spans and spans with counters) but it was
making things more confusing.
- Finish commenting code and start writing the doc
So any comments, missing features and reviews on the current state of the
project are welcome.
Regards,
Anthonin
Hi!
What do you think about using INSTR_TIME_SET_CURRENT, INSTR_TIME_SUBTRACT
and INSTR_TIME_GET_MILLISEC
macros for timing calculations?
Also, have you thought about a way to trace existing (running) queries
without directly instrumenting them? It would be much more usable for
maintenance and support personnel, because in production environments you
can rarely change the query text directly. For the current state, the
simplest solution is to switch tracing on and off by calling an SQL
function, and possibly to switch tracing for a prepared statement the same
way, but not for an arbitrary query.
I'll check the patch for the race conditions.
--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/
What do you think about using INSTR_TIME_SET_CURRENT, INSTR_TIME_SUBTRACT
and INSTR_TIME_GET_MILLISEC
macros for timing calculations?
If you're talking about the two instances where I'm modifying the
instr_time's ticks, it's because I can't use the macros there.
The first case is the parse span. I only have the start timestamp from
GetCurrentStatementStartTimestamp and don't have access to the start
instr_time, so I need to build the duration from two timestamps.
The second case is when building node spans from the planstate, where I
directly have the duration from Instrumentation.
I guess one fix would be to use an int64 for the span duration to directly
store nanoseconds instead of an instr_time, but I do use the
instrumentation macros outside of those two cases to get the duration of
other spans.
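The timestamp-based case can be sketched as follows, assuming PostgreSQL's
TimestampTz representation as an int64 microsecond count (the typedef here
and the helper name are illustrative, not the patch's code):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch of building a span duration in nanoseconds from two timestamps,
 * as for the parse span where no starting instr_time is available.
 * PostgreSQL's TimestampTz is an int64 count of microseconds, so the
 * conversion is a plain multiplication.
 */
typedef int64_t TimestampTz;     /* microseconds, as in PostgreSQL */

static int64_t
duration_ns_from_timestamps(TimestampTz start, TimestampTz end)
{
    return (end - start) * 1000; /* us -> ns */
}
```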
Also, have you thought about a way to trace existing (running) queries
without directly instrumenting them?
That's a good point. I was focusing on leaving the sampling decision to
the caller through the sampled flag and only recently added the
pg_tracing_sample_rate parameter to give more control. It should be
straightforward to add an option to create standalone traces based on the
sample rate alone. This way, setting the sample rate to 1 would force all
queries running in the session to be traced.
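A minimal sketch of such a combined sampling decision (function and
parameter names are hypothetical):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Sketch of the combined decision: a query is traced either because the
 * caller propagated the sampled flag through traceparent, or because it
 * is picked up by the standalone sample rate. With sample_rate = 1.0,
 * every query in the session is traced.
 */
static bool
should_trace(bool caller_sampled, double sample_rate, double random01)
{
    if (caller_sampled)
        return true;               /* caller already decided via traceparent */
    return random01 < sample_rate; /* standalone sampling, no trace context */
}
```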
Hi,
Here's a new patch with changes from the previous discussion:
- I'm now directly storing the nanosecond duration in the span instead of
an instr_time. Using the instr_time macros was a bit awkward as the
durations I generate don't necessarily have a starting and ending
instr_time. Moving to straight nanoseconds made things clearer from my
point of view.
- I've added an additional sample rate pg_tracing.sample_rate (on top of
the pg_tracing.caller_sample_rate). This one will allow queries to be
sampled even without trace information propagated from the caller.
Setting this sample rate to 1 will basically trace everything. For now,
this will only work when going through the post parse hook. I will add
support for prepared statements and cached plans for the next patch.
- I've improved how parse spans are created. It's a bit challenging to get
the start of a parse, as there's no pre-parse hook or instrumentation
around parsing, so it's only an estimation.
Regards,
Anthonin
Attachments:
pg-tracing-v2.patch
Hi!
Thanks for the improvements!
Here's a new patch with changes from the previous discussion:
- I'm now directly storing nanoseconds duration in the span instead of the
instr_time. Using the instr_time macros was a bit awkward as the
durations I generate don't necessarily have a starting and ending
instr_time.
Moving to straight nanoseconds made things clearer from my point of view.
Cool. It could easily be cast to ms for the user.
- I've added an additional sample rate pg_tracing.sample_rate (on top of
the pg_tracing.caller_sample_rate). This one will allow queries to be
sampled even without trace information propagated from the caller.
Setting this sample rate to 1 will basically trace everything. For now,
this will only work when going through the post parse hook. I will add
support for prepared statements and cached plans for the next patch.
Cool, I've just made the same improvement and wanted to send a patch a bit
later after tests.
- I've improved how parse spans are created. It's a bit challenging to get
the start of a parse as there's no pre parse hook or instrumentation around
parse so it's only an estimation.
I've also added a query id field to the span, and made a table and an SQL
function that flushes spans to this table instead of returning a set of
records; it is more convenient for maintenance to query the table.
--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/
Hi!
Please check some suggested improvements -
1) query_id added to the span, to be able to join it with pg_stat_activity
and pg_stat_statements;
2) a table for storing spans added, to flush the span buffer for
maintenance reasons (to keep track of spans), with an SQL function that
flushes the buffer into the table instead of returning a recordset;
3) a setter function added for the sampling_rate GUC, to tweak it
on-the-fly without restart.
--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/
Attachments:
0001_pg_tracing_v2_1.patch
Hi!
1) query_id added so span to be able to join it with pg_stat_activity and
pg_stat_statements;
Sounds good, I've added your changes with my code.
2) table for storing spans added, to flush spans buffer
I'm not sure about this. It means that this would only be available on
primaries, as replicas won't be able to write data to the table. It would
also make version updates and migrations much more complex, and I haven't
seen a similar pattern in other extensions.
3) added setter function for sampling_rate GUC to tweak it on-the-fly
without restart
ok, I've added this in my branch.
On my side, I've made the following changes:
1) All spans are now kept in palloc'd buffers and only copied to the
shared buffer during end_tracing. This way, we limit contention on the
shared_spans lock.
2) I've added a pg_tracing.drop_on_full_buffer parameter to drop all spans
when the buffer is full. This can be useful to always keep the latest
spans when the consuming app is not fast enough. It is also useful for
testing.
3) I'm testing more complex queries. Most of my previous tests were using
the simple query protocol, but the extended protocol introduces
differences that break some assumptions I made. For example, with a
multi-statement transaction like
BEGIN;
SELECT 1;
SELECT 2;
the parse of SELECT 2 will happen before the ExecutorEnd (and the
end_tracing) of SELECT 1. For now, I'm skipping the post parse hook if we
still have an ongoing trace.
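That workaround can be sketched as a simple guard flag around the hooks
(the flag and function names are illustrative, not the extension's actual
code):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Sketch of the guard: with the extended protocol, the parse of the next
 * statement can arrive before ExecutorEnd of the previous one, so the
 * post-parse hook is skipped while a trace is still ongoing.
 */
static bool tracing_ongoing = false;

/* Returns true when tracing was started for this statement. */
static bool
post_parse_hook(void)
{
    if (tracing_ongoing)
        return false;           /* a previous statement is still traced */
    tracing_ongoing = true;     /* start tracing this statement */
    return true;
}

static void
end_tracing(void)
{
    tracing_ongoing = false;    /* ExecutorEnd: tracing finished */
}
```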
I've also started running https://github.com/anse1/sqlsmith on a db with
full sampling; it's currently failing some assertions and I'm working to
fix those.
Attachments:
pg-tracing-v3.patch
Hi!
Storing spans is a tricky question. Monitoring systems often use a pull
model with probes or an SQL API for pulling data, so from this point of
view it is much more convenient to keep spans in a separate table. But
then we come to another issue: how to flush this data into the table?
Automatic flushing on a full buffer would randomly (from the user's point
of view) significantly affect query performance. I'd rather make a GUC,
turned off by default, to store spans in the table instead of the buffer.
3) I'm testing more complex queries. Most of my previous tests were using
simple query protocol but extended protocol introduces
differences that break some assumptions I did. For example, with multi
statement transaction like
BEGIN;
SELECT 1;
SELECT 2;
The parse of SELECT 2 will happen before the ExecutorEnd (and the
end_tracing) of SELECT 1. For now, I'm skipping the post parse
hook if we still have an ongoing tracing.
I've checked this behavior before and haven't noticed the case you
mentioned, but for
loops like
for i in 1..2 loop
StartTime := clock_timestamp();
insert into t2 values (i, a_data);
EndTime := clock_timestamp();
Delta := 1000 * ( extract(epoch from EndTime) - extract(epoch from
StartTime) );
end loop;
I've got the following call sequence:
psql:/home/postgres/tests/trace.sql:52: NOTICE:
pg_tracing_post_parse_analyze hook <1>
psql:/home/postgres/tests/trace.sql:52: NOTICE: pg_tracing_planner_hook <1>
psql:/home/postgres/tests/trace.sql:52: NOTICE:
pg_tracing_post_parse_analyze hook <2>
psql:/home/postgres/tests/trace.sql:52: NOTICE: pg_tracing_planner_hook <2>
psql:/home/postgres/tests/trace.sql:52: NOTICE:
pg_tracing_post_parse_analyze hook <StartTime := clock_timestamp()>
psql:/home/postgres/tests/trace.sql:52: NOTICE: pg_tracing_planner_hook
<StartTime := clock_timestamp()>
psql:/home/postgres/tests/trace.sql:52: NOTICE:
pg_tracing_post_parse_analyze hook <insert into t2 values (i, a_data)>
psql:/home/postgres/tests/trace.sql:52: NOTICE: pg_tracing_planner_hook
<insert into t2 values (i, a_data)>
psql:/home/postgres/tests/trace.sql:52: NOTICE: pg_tracing_ExecutorStart
<insert into t2 values (i, a_data)>
psql:/home/postgres/tests/trace.sql:52: NOTICE: pg_tracing_ExecutorRun
<insert into t2 values (i, a_data)>
psql:/home/postgres/tests/trace.sql:52: NOTICE: pg_tracing_ExecutorFinish
<insert into t2 values (i, a_data)>
psql:/home/postgres/tests/trace.sql:52: NOTICE: pg_tracing_ExecutorEnd
<insert into t2 values (i, a_data)>
psql:/home/postgres/tests/trace.sql:52: NOTICE:
pg_tracing_post_parse_analyze hook <EndTime := clock_timestamp()>
psql:/home/postgres/tests/trace.sql:52: NOTICE: pg_tracing_planner_hook
<EndTime := clock_timestamp()>
psql:/home/postgres/tests/trace.sql:52: NOTICE:
pg_tracing_post_parse_analyze hook <Delta := 1000 * ( extract(epoch from
EndTime) - extract(epoch from StartTime) )>
psql:/home/postgres/tests/trace.sql:52: NOTICE: pg_tracing_planner_hook
<Delta := 1000 * ( extract(epoch from EndTime) - extract(epoch from
StartTime) )>
psql:/home/postgres/tests/trace.sql:52: NOTICE: pg_tracing_planner_hook
<insert into t2 values (i, a_data)>
psql:/home/postgres/tests/trace.sql:52: NOTICE: pg_tracing_ExecutorStart
<insert into t2 values (i, a_data)>
psql:/home/postgres/tests/trace.sql:52: NOTICE: pg_tracing_ExecutorRun
<insert into t2 values (i, a_data)>
psql:/home/postgres/tests/trace.sql:52: NOTICE: pg_tracing_ExecutorFinish
<insert into t2 values (i, a_data)>
psql:/home/postgres/tests/trace.sql:52: NOTICE: pg_tracing_ExecutorEnd
<insert into t2 values (i, a_data)>
--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/
Hi!
Storing spans is a tricky question. Monitoring systems often use a pull
model, with probes or a SQL API for pulling data, so from this point of view
it is much more convenient to keep spans in a separate table. But then we
come to another issue: how to flush this data into the table? Automatic
flushing on a full buffer would (from the user's point of view) randomly and
significantly affect query performance. I'd rather have a GUC, off by
default, to store spans in a table instead of the buffer.
My main issue with having the extension flush spans to a table is that
it would only work on primaries. Replicas can't write to tables, and a
flush function that only works on primaries is likely to introduce more
confusion.
From my point of view, processing the spans should be done by an
external application. Similar to my initial example which forwards spans
to a trace collector
(https://gist.github.com/bonnefoa/6ed24520bdac026d6a6a6992d308bd50#file-main-go),
this application could instead store spans in a dedicated table. This
way, the creation of the table is outside pg_tracing's scope, and the
span_store application would be able to store spans on both replicas and
primaries.
How frequently this application should pull data to avoid a full buffer
and dropped spans is the tricky part. We have stats, so we can monitor
whether drops are happening and either adjust the span buffer size or
increase the application's pull frequency. Another possibility (though
I'm not familiar with it) could be to use notifications: the application
would listen on a channel and pg_tracing would notify it when a
configurable threshold is reached (e.g. when the buffer reaches 50%
usage, send a notification).
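To make the pull-frequency idea concrete, here is a minimal sketch of the loop such a span_store application could run. The `pg_tracing_consume_spans()` function name is an assumption about an eventual SQL API, not something the extension provides today; the database call is injected as a callable so only the adaptive logic is shown.

```python
# Sketch of a span-consumer loop for a hypothetical span_store application.
# `run_query` stands in for a real database call through a driver; `forward`
# is whatever sink stores or exports the spans. pg_tracing_consume_spans()
# is an assumed, not confirmed, pg_tracing API.

def drain_spans(run_query, forward):
    """Pull all buffered spans and hand them to the sink.
    Returns the number of spans consumed."""
    rows = run_query("SELECT * FROM pg_tracing_consume_spans()")
    for row in rows:
        forward(row)
    return len(rows)

def next_interval(current_interval, buffer_usage, high=0.5, low=0.1):
    """Adapt the pull interval to buffer pressure: pull twice as often
    when the buffer is above `high`, back off when it is below `low`."""
    if buffer_usage > high:
        return current_interval / 2
    if buffer_usage < low:
        return current_interval * 2
    return current_interval
```

With the drop/usage stats exposed by the extension, the application could shrink its interval before drops start happening instead of only reacting to them.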
I've checked this behavior before and haven't noticed the case you
mentioned, but for loops like

for i in 1..2 loop
    StartTime := clock_timestamp();
    insert into t2 values (i, a_data);
    EndTime := clock_timestamp();
    Delta := 1000 * ( extract(epoch from EndTime) - extract(epoch from StartTime) );
end loop;
Was this run through psql? In that case, you would be using the simple
protocol, which always drops the portal at the end of the statement and
does not trigger the issue.
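For what it's worth, recent psql can exercise the extended protocol directly: since PostgreSQL 16, the \bind meta-command sends the query through Parse/Bind/Execute, so something like the following (untested sketch) should make the overlap reproducible from psql:

```sql
BEGIN;
SELECT 1 \bind \g
SELECT 2 \bind \g
COMMIT;
```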
With the extended protocol and a multi-statement transaction, I get the
following backtrace:
* frame #0: 0x0000000104581318 postgres`ExecutorEnd(queryDesc=0x00000001420083b0) at execMain.c:471:6
  frame #1: 0x00000001044e1644 postgres`PortalCleanup(portal=0x0000000152031d00) at portalcmds.c:299:4
  frame #2: 0x0000000104b0e77c postgres`PortalDrop(portal=0x0000000152031d00, isTopCommit=false) at portalmem.c:503:3
  frame #3: 0x0000000104b0e3e8 postgres`CreatePortal(name="", allowDup=true, dupSilent=true) at portalmem.c:194:3
  frame #4: 0x000000010487a308 postgres`exec_bind_message(input_message=0x000000016bb4b398) at postgres.c:1745:12
  frame #5: 0x000000010487846c postgres`PostgresMain(dbname="postgres", username="postgres") at postgres.c:4685:5
  frame #6: 0x0000000104773144 postgres`BackendRun(port=0x0000000141704080) at postmaster.c:4433:2
  frame #7: 0x000000010477044c postgres`BackendStartup(port=0x0000000141704080) at postmaster.c:4161:3
  frame #8: 0x000000010476d6fc postgres`ServerLoop at postmaster.c:1778:6
  frame #9: 0x000000010476c260 postgres`PostmasterMain(argc=3, argv=0x0000600001cf72c0) at postmaster.c:1462:11
  frame #10: 0x0000000104625ca8 postgres`main(argc=3, argv=0x0000600001cf72c0) at main.c:198:3
  frame #11: 0x00000001a61dbf28 dyld`start + 2236
At this point, the new statement is already parsed, and it is only
during the bind that the previous statement's portal is dropped and
ExecutorEnd is called. Thus, we have overlapping statements, which are
tricky to handle.
On my side, I've made the following changes:
- Skip nodes that are not executed
- Fix query_id propagation; it should now be associated with all the
spans of a query when available
I had forgotten to add the span buffer in the shmem hook, so it would
crash when pg_tracing.max_span was too high. Now it's fixed and we can
request more than 2000 spans. Currently, a Span is 320 bytes, so 50000
spans will take ~15MB of memory.
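The arithmetic behind that estimate:

```python
# Memory footprint of the span buffer: span size times pg_tracing.max_span.
SPAN_SIZE_BYTES = 320   # current size of the Span struct, as reported above
MAX_SPANS = 50_000      # pg_tracing.max_span setting

total_bytes = SPAN_SIZE_BYTES * MAX_SPANS
total_mib = total_bytes / (1024 * 1024)
print(f"{total_bytes} bytes ≈ {total_mib:.1f} MiB")  # 16000000 bytes ≈ 15.3 MiB
```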
I've added the subplan/initplan spans and started experimenting with
deparsing the plan to add more details on the nodes. Instead of
'NestedLoop', we will have something like 'NestedLoop|Join Filter: (oid =
relnamespace)'. One consequence is that this creates a high number of
distinct operation names, which is something we want to avoid with trace
collectors. I will probably move this out of the operation name or find
a way to make it more stable.
I've started initial benchmarks and profiling. On a default
installation, running a default `pgbench -T15` gives the following:
Without pg_tracing: tps = 953.118759 (without initial connection time)
With pg_tracing: tps = 652.953858 (without initial connection time)
Most of the time is spent in process_query_desc. I haven't tried to
optimize performance yet, so there's probably a lot of room for
improvement.
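For context, those two numbers work out to roughly a 31% throughput drop:

```python
# Relative overhead of pg_tracing on the pgbench run quoted above.
baseline_tps = 953.118759  # pgbench -T15 without pg_tracing
traced_tps = 652.953858    # same run with pg_tracing enabled

overhead = (baseline_tps - traced_tps) / baseline_tps
print(f"throughput drop: {overhead:.1%}")  # throughput drop: 31.5%
```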
On Mon, Aug 14, 2023 at 11:17 AM Nikita Malakhov <hukutoc@gmail.com> wrote,
quoting my earlier report of the extended protocol issue:
3) I'm testing more complex queries. Most of my previous tests were using
the simple query protocol, but the extended protocol introduces differences
that break some assumptions I made. For example, with a multi-statement
transaction like
BEGIN;
SELECT 1;
SELECT 2;
the parse of SELECT 2 will happen before the ExecutorEnd (and the
end_tracing) of SELECT 1. For now, I'm skipping the post parse hook if we
still have an ongoing trace.