Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi,
Currently using EXPLAIN (ANALYZE) without TIMING OFF regularly changes
the resulting timing enough that the times aren't meaningful. E.g.
CREATE TABLE lotsarows(key int not null);
INSERT INTO lotsarows SELECT generate_series(1, 50000000);
VACUUM FREEZE lotsarows;
-- best of three:
SELECT count(*) FROM lotsarows;
Time: 1923.394 ms (00:01.923)
-- best of three:
EXPLAIN (ANALYZE, TIMING OFF) SELECT count(*) FROM lotsarows;
Time: 2319.830 ms (00:02.320)
-- best of three:
EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows;
Time: 4202.649 ms (00:04.203)
That's nearly *double* the execution time compared to without TIMING.
Looking at a profile of this shows that we spend a good bit of cycles
"normalizing" timestamps etc. That seems pretty unnecessary, just forced
on us due to struct timespec. So the first attached patch just turns
instr_time into a 64-bit integer, counting nanoseconds.
That helps, a tiny bit:
EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows;
Time: 4179.302 ms (00:04.179)
but obviously doesn't move the needle.
Looking at a profile it's easy to confirm that we spend a lot of time
acquiring time:
- 95.49% 0.00% postgres postgres [.] agg_retrieve_direct (inlined)
- agg_retrieve_direct (inlined)
- 79.27% fetch_input_tuple
- ExecProcNode (inlined)
- 75.72% ExecProcNodeInstr
+ 25.22% SeqNext
- 21.74% InstrStopNode
+ 17.80% __GI___clock_gettime (inlined)
- 21.44% InstrStartNode
+ 19.23% __GI___clock_gettime (inlined)
+ 4.06% ExecScan
+ 13.09% advance_aggregates (inlined)
1.06% MemoryContextReset
And that's even though linux avoids a syscall (in most cases) etc to
acquire the time. Unless the kernel detects there's a reason not to do
so, linux does this by using 'rdtscp' and multiplying it by kernel
provided factors to turn the cycles into time.
Some of the time is spent doing function calls, dividing into struct
timespec, etc. But most of it is just the rdtscp instruction:
65.30 │1 63: rdtscp
The reason for that is largely that rdtscp waits until all prior
instructions have finished (but it allows later instructions to already
start). Multiple times for each tuple.
In the second attached prototype patch I've changed instr_time to count
in cpu cycles instead of nanoseconds, and then just turned the cycles
into seconds in INSTR_TIME_GET_DOUBLE() (more about that part later).
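The core of that change can be sketched like this (illustrative only; the placeholder frequency and the cycles_to_sec name are assumptions — in the real patch the factor is determined at startup):

```c
#include <stdint.h>
#include <x86intrin.h>		/* __rdtsc() */

/* instr_time now counts TSC reference cycles instead of nanoseconds */
typedef int64_t instr_time;

/* conversion factor, computed once at startup (placeholder value here) */
static double cycles_to_sec = 1.0 / 2800000000.0;

/* reading the counter is a single non-serializing instruction */
#define INSTR_TIME_SET_CURRENT(t) ((t) = (instr_time) __rdtsc())

/* conversion to seconds happens only when the result is displayed */
#define INSTR_TIME_GET_DOUBLE(t)  ((double) (t) * cycles_to_sec)
```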
When using rdtsc that results in *vastly* lower overhead:
┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Aggregate (cost=846239.20..846239.21 rows=1 width=8) (actual time=2610.235..2610.235 rows=1 loops=1) │
│ -> Seq Scan on lotsarows (cost=0.00..721239.16 rows=50000016 width=0) (actual time=0.006..1512.886 rows=50000000 loops=1) │
│ Planning Time: 0.028 ms │
│ Execution Time: 2610.256 ms │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(4 rows)
Time: 2610.589 ms (00:02.611)
And there's still some smaller improvements that could be made on top of
that.
As a comparison, here's the time when using rdtscp directly in
instr_time, instead of going through clock_gettime:
Time: 3481.162 ms (00:03.481)
That shows pretty well how big the cost of the added pipeline stalls
are, and how important out-of-order execution is for decent
performance...
In my opinion, for the use in InstrStartNode(), InstrStopNode() etc, we
do *not* want to wait for prior instructions to finish, since that
actually leads to the timing being less accurate, rather than
more. There are other cases where that'd be different, e.g. measuring
how long an entire query takes or such (but there it's probably
irrelevant which to use).
I've above skipped a bit over the details of how to turn the cycles
returned by rdtsc into time:
On x86 CPUs of the last ~12 years rdtsc doesn't return the cycles that
have actually been run, but instead returns the number of 'reference
cycles'. That's important because otherwise things like turbo mode and
lower power modes would lead to completely bogus times.
Thus, knowing the "base frequency" of the CPU allows us to turn the
difference between two rdtsc return values into seconds.
In the attached prototype I just determined the number of cycles using
cpuid(0x16). That's only available since Skylake (I think). On older
CPUs we'd have to look at /proc/cpuinfo or
/sys/devices/system/cpu/cpu0/cpufreq/base_frequency.
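A hedged sketch of the cpuid(0x16) probe (GCC/Clang on x86; the function name is illustrative, and a 0 return means the leaf is unsupported, in which case the fallbacks above would apply):

```c
#include <stdint.h>
#include <cpuid.h>

/*
 * Try to read the CPU base frequency in MHz via cpuid leaf 0x16.
 * Returns 0 if the leaf is unsupported (pre-Skylake CPUs, some
 * hypervisors), signalling the caller to try another source.
 */
static uint32_t
base_frequency_mhz(void)
{
	unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

	/* leaf 0x16: EAX = base MHz, EBX = max MHz, ECX = bus MHz */
	if (__get_cpuid(0x16, &eax, &ebx, &ecx, &edx) == 0)
		return 0;
	return eax;
}
```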
There's also other issues with using rdtsc directly: On older CPUs, in
particular older multi-socket systems, the tsc will not be synchronized
in detail across cores. There's bits that'd let us check whether tsc is
suitable or not. The more current issue is that things like
virtual machines being migrated can lead to rdtsc suddenly returning a
different value / the frequency differing. But that is supposed to be
solved these days, by having virtualization technologies set frequency
multipliers and offsets which then cause rdtsc[p] to return something
meaningful, even after migration.
The attached patches are really just a prototype. I'm also not really
planning to work on getting this into a "production ready" patchset
anytime soon. I developed it primarily because I found the overhead
made it too hard to nail down in which part of a query tree performance
changed. If somebody else wants to continue from here...
I do think it'd be a pretty significant improvement if we could reduce
the timing overhead of EXPLAIN ANALYZE by this much. Even if it requires a
bunch of low-level code.
Greetings,
Andres Freund
On Sat, Jun 13, 2020 at 1:28 AM Andres Freund <andres@anarazel.de> wrote:
+1
Pavel
On Sat, Jun 13, 2020 at 11:28 AM Andres Freund <andres@anarazel.de> wrote:
[PATCH v1 1/2] WIP: Change instr_time to just store nanoseconds, that's cheaper.
Makes a lot of sense. If we do this, I'll need to update pgbench,
which just did something similar locally. If I'd been paying
attention to this thread I might not have committed that piece of the
recent pgbench changes, but it's trivial stuff and I'll be happy to
tidy that up when the time comes.
[PATCH v1 2/2] WIP: Use cpu reference cycles, via rdtsc, to measure time for instrumentation.
Some of the time is spent doing function calls, dividing into struct
timespec, etc. But most of it just the rdtscp instruction:
65.30 │1 63: rdtscp
The reason for that is largely that rdtscp waits until all prior
instructions have finished (but it allows later instructions to already
start). Multiple times for each tuple.
Yeah, after reading a bit about this, I agree that there is no reason
to think that the stalling version makes the answer better in any way.
It might make sense if you use it once at the beginning of a large
computation, but it makes no sense if you sprinkle it around inside
blocks that will run multiple times. It destroys your
instructions-per-cycle, turning your fancy superscalar Pentium
into a 486. It does raise some interesting questions about what
exactly you're measuring, though: I don't know enough to have a good
grip on how far out of order the TSC could be read!
There's also other issues with using rdtsc directly: On older CPUs, in
particular older multi-socket systems, the tsc will not be synchronized
in detail across cores. There's bits that'd let us check whether tsc is
suitable or not. The more current issue of that is that things like
virtual machines being migrated can lead to rdtsc suddenly returning a
different value / the frequency differening. But that is supposed to be
solved these days, by having virtualization technologies set frequency
multipliers and offsets which then cause rdtsc[p] to return something
meaningful, even after migration.
Googling tells me that Nehalem (2008) introduced "invariant TSC"
(clock rate independent) and also socket synchronisation at the same
time, so systems without it are already pretty long in the tooth.
A quick peek at an AMD manual[1] (page 373) tells me that a similar change
happened in 15H/Bulldozer/Piledriver/Steamroller/Excavator (2011),
identified with the same CPUID test.
My first reaction is that it seems like TSC would be the least of your
worries if you're measuring a VM that's currently migrating between
hosts, but maybe the idea is just that you have to make sure you don't
assume it can't ever go backwards or something like that...
Google Benchmark has some clues about how to spell this on MSVC, what
some instructions might be to research on ARM, etc.
[1]: https://www.amd.com/system/files/TechDocs/47414_15h_sw_opt_guide.pdf (page 373)
[2]: https://github.com/google/benchmark/blob/master/src/cycleclock.h
On Fri, Jun 12, 2020 at 4:28 PM Andres Freund <andres@anarazel.de> wrote:
The attached patches are really just a prototype. I'm also not really
planning to work on getting this into a "production ready" patchset
anytime soon. I developed it primarily because I found it the overhead
made it too hard to nail down in which part of a query tree performance
changed. If somebody else wants to continue from here...

I do think it'd be a pretty significant improvement if we could reduce
the timing overhead of EXPLAIN ANALYZE by this much. Even if it requires a
bunch of low-level code.
Based on an off-list conversation with Andres, I decided to dust off this
old patch for using rdtsc directly. The significant EXPLAIN ANALYZE
performance improvements (especially when using rdtsc instead of rdtsc*p*)
seem to warrant giving this a more thorough look.
See attached an updated patch (adding it to the July commitfest), with a few
changes:
- Keep using clock_gettime() as a fallback if we decide to not use rdtsc
- Fallback to /proc/cpuinfo for clock frequency, if cpuid(0x16) doesn't work
- The decision to use rdtsc (or not) is made at runtime, in the new
INSTR_TIME_INITIALIZE() -- we can't make this decision at compile time
because this is dependent on the specific CPU in use, amongst other things
- In an abundance of caution, for now I've decided to only enable this if we
are on Linux/x86, and the current kernel clocksource is TSC (the kernel
has quite sophisticated logic around making this decision, see [1])
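The clocksource check described above can be done by reading sysfs; a sketch (the path is the standard Linux location, the function name is illustrative):

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Return true only if the kernel currently uses the TSC as clocksource. */
static bool
kernel_clocksource_is_tsc(void)
{
	FILE	   *fp;
	char		buf[128] = {0};
	bool		result = false;

	fp = fopen("/sys/devices/system/clocksource/clocksource0/current_clocksource", "r");
	if (fp == NULL)
		return false;			/* not Linux, or sysfs unavailable */
	if (fgets(buf, sizeof(buf), fp) != NULL)
		result = (strncmp(buf, "tsc", 3) == 0);
	fclose(fp);
	return result;
}
```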
Note that if we implemented the decision logic ourselves (instead of relying
on the Linux kernel), I'd be most worried about older virtualization
technology. In my understanding getting this right is notably more
complicated than just checking cpuid, see [2].
Known WIP problems with this patch version:
* There appears to be a timing discrepancy I haven't yet worked out, where
the \timing data reported by psql doesn't match what EXPLAIN ANALYZE is
reporting. With Andres' earlier test case, I'm seeing a consistent ~700ms
higher time for \timing than for the EXPLAIN ANALYZE time reported on the
server side, only when rdtsc measurement is used -- it's likely there is a
problem somewhere with how we perform the cycles to time conversion
* Possibly related, the floating point handling for the cycles_to_sec
variable is problematic in terms of precision (see FIXME, taken over from
Andres' POC)
Open questions from me:
1) Do we need to account for different TSC offsets on different CPUs in SMP
systems? (the Linux kernel certainly has logic to that extent, but [3]
suggests this is no longer a problem on Nehalem and newer chips, i.e.
those having an invariant TSC)
2) Should we have a setting "--with-tsc" for configure? (instead of always
enabling it when on Linux/x86 with a TSC clocksource)
3) Are there cases where we actually want to use rdtsc*p*? (i.e. wait for
current instructions to finish -- the prior discussion seemed to suggest
we don't want it for node instrumentation measurements, but possibly we
do want this in other cases?)
4) Should we support using the "mrs" instruction on ARM? (which is similar
to rdtsc, see [4])
Thanks,
Lukas
[1]: https://github.com/torvalds/linux/blob/master/arch/x86/kernel/tsc.c
[2]: http://oliveryang.net/2015/09/pitfalls-of-TSC-usage/
[3]: https://stackoverflow.com/a/11060619/1652607
[4]: https://cpufun.substack.com/p/fun-with-timers-and-cpuid
--
Lukas Fittl
Attachments:
v2-0002-WIP-Use-cpu-reference-cycles-via-rdtsc-to-measure.patch
v2-0001-WIP-Change-instr_time-to-just-store-nanoseconds-t.patch
Hi,
On 2022-07-01 01:23:01 -0700, Lukas Fittl wrote:
On Fri, Jun 12, 2020 at 4:28 PM Andres Freund <andres@anarazel.de> wrote:
The attached patches are really just a prototype. I'm also not really
planning to work on getting this into a "production ready" patchset
anytime soon. I developed it primarily because I found it the overhead
made it too hard to nail down in which part of a query tree performance
changed. If somebody else wants to continue from here...

I do think it'd be a pretty significant improvement if we could reduce
the timing overhead of EXPLAIN ANALYZE by this much. Even if it requires a
bunch of low-level code.

Based on an off-list conversation with Andres, I decided to dust off this
old patch for using rdtsc directly. The significant EXPLAIN ANALYZE
performance improvements (especially when using rdtsc instead of rdtsc*p*)
seem to warrant giving this a more thorough look.

See attached an updated patch (adding it to the July commitfest), with a
few changes:

- Keep using clock_gettime() as a fallback if we decide to not use rdtsc
Yep.
- Fallback to /proc/cpuinfo for clock frequency, if cpuid(0x16) doesn't work
I suspect that this might not be needed anymore. Seems like it'd be ok to just
fall back to clock_gettime() in that case.
- In an abundance of caution, for now I've decided to only enable this if we
are on Linux/x86, and the current kernel clocksource is TSC (the kernel
has quite sophisticated logic around making this decision, see [1])
I think our requirements are a bit lower than the kernel's - we're not
tracking wall clock over an extended period...
Note that if we implemented the decision logic ourselves (instead of
relying on the Linux kernel), I'd be most worried about older
virtualization technology. In my understanding getting this right is
notably more complicated than just checking cpuid, see [2].
Known WIP problems with this patch version:
* There appears to be a timing discrepancy I haven't yet worked out, where
the \timing data reported by psql doesn't match what EXPLAIN ANALYZE is
reporting. With Andres' earlier test case, I'm seeing a consistent ~700ms
higher time for \timing than for the EXPLAIN ANALYZE time reported on the
server side, only when rdtsc measurement is used -- it's likely there is a
problem somewhere with how we perform the cycles to time conversion
Could you explain a bit more what you're seeing? I just tested your patches
and didn't see that here.
* Possibly related, the floating point handling for the cycles_to_sec
variable is problematic in terms of precision (see FIXME, taken over from
Andres' POC)
And probably also performance...
Open questions from me:
1) Do we need to account for different TSC offsets on different CPUs in SMP
systems? (the Linux kernel certainly has logic to that extent, but [3]
suggests this is no longer a problem on Nehalem and newer chips, i.e.
those having an invariant TSC)
I don't think we should cater to systems where we need that.
2) Should we have a setting "--with-tsc" for configure? (instead of always
enabling it when on Linux/x86 with a TSC clocksource)
Probably not worth it.
3) Are there cases where we actually want to use rdtsc*p*? (i.e. wait for
current instructions to finish -- the prior discussion seemed to suggest
we don't want it for node instrumentation measurements, but possibly we
do want this in other cases?)
I was wondering about that too... Perhaps we should add a
INSTR_TIME_SET_CURRENT_BARRIER() or such?
4) Should we support using the "mrs" instruction on ARM? (which is similar
to rdtsc, see [4])
I'd leave that for later personally.
#define NS_PER_S INT64CONST(1000000000)
#define US_PER_S INT64CONST(1000000)
#define MS_PER_S INT64CONST(1000)
@@ -95,17 +104,37 @@ typedef int64 instr_time;

 #define INSTR_TIME_SET_ZERO(t)	((t) = 0)

-static inline instr_time pg_clock_gettime_ns(void)
+extern double cycles_to_sec;
+
+bool use_rdtsc;

This should be extern and inside the ifdef below.

+#if defined(__x86_64__) && defined(__linux__)
+extern void pg_clock_gettime_initialize_rdtsc(void);
+#endif
+
+static inline instr_time pg_clock_gettime_ref_cycles(void)
 {
 	struct timespec tmp;
+
+#if defined(__x86_64__) && defined(__linux__)
+	if (use_rdtsc)
+		return __rdtsc();
+#endif
+
 	clock_gettime(PG_INSTR_CLOCK, &tmp);
 	return tmp.tv_sec * NS_PER_S + tmp.tv_nsec;
 }
Greetings,
Andres Freund
I ran that original test case with and without the patch. Here are the
numbers I'm seeing:
master (best of three):
postgres=# SELECT count(*) FROM lotsarows;
Time: 582.423 ms
postgres=# EXPLAIN (ANALYZE, TIMING OFF) SELECT count(*) FROM lotsarows;
Time: 616.102 ms
postgres=# EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows;
Time: 1068.700 ms (00:01.069)
patched (best of three):
postgres=# SELECT count(*) FROM lotsarows;
Time: 550.822 ms
postgres=# EXPLAIN (ANALYZE, TIMING OFF) SELECT count(*) FROM lotsarows;
Time: 612.572 ms
postgres=# EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows;
Time: 690.875 ms
On Fri, Jul 1, 2022 at 10:26 AM Andres Freund <andres@anarazel.de> wrote:
On 2022-07-01 01:23:01 -0700, Lukas Fittl wrote:
...Known WIP problems with this patch version:
* There appears to be a timing discrepancy I haven't yet worked out, where
the \timing data reported by psql doesn't match what EXPLAIN ANALYZE is
reporting. With Andres' earlier test case, I'm seeing a consistent ~700ms
higher time for \timing than for the EXPLAIN ANALYZE time reported on the
server side, only when rdtsc measurement is used -- it's likely there is a
problem somewhere with how we perform the cycles to time conversion

Could you explain a bit more what you're seeing? I just tested your patches
and didn't see that here.
I did not see this either, but I did see that the execution time
reported by \timing is (for this test case) consistently 0.5-1ms
*lower* than the Execution Time reported by EXPLAIN. I did not see
that on master. Is that expected?
Thanks,
Maciek
On Fri, Jul 15, 2022 at 11:22 PM Maciek Sakrejda <m.sakrejda@gmail.com>
wrote:
The patch requires a rebase; please rebase the patch with the latest code.
Hunk #5 succeeded at 147 with fuzz 2 (offset -3 lines).
Hunk #6 FAILED at 170.
Hunk #7 succeeded at 165 (offset -69 lines).
2 out of 7 hunks FAILED -- saving rejects to file
src/include/portability/instr_time.h.rej
patching file src/tools/msvc/Mkvcbuild.pm
--
Ibrar Ahmed
On Tue, Sep 06, 2022 at 11:32:18AM +0500, Ibrar Ahmed wrote:
Hunk #5 succeeded at 147 with fuzz 2 (offset -3 lines).
Hunk #6 FAILED at 170.
Hunk #7 succeeded at 165 (offset -69 lines).
2 out of 7 hunks FAILED -- saving rejects to file
src/include/portability/instr_time.h.rej
patching file src/tools/msvc/Mkvcbuild.pm
No rebased version has been sent since this update, so this patch has
been marked as RwF.
--
Michael
I think it would be great to get this patch committed. Beyond the
reasons already mentioned, the significant overhead also tends to skew
the reported runtimes in ways that makes it difficult to compare them.
For example, if two nodes are executed equally often but one needs twice
the time to process the rows: in such a case EXPLAIN ANALYZE should
report timings that are 2x apart. However, currently, the high overhead
of clock_gettime() tends to skew the relative runtimes.
On 10/12/22 10:33, Michael Paquier wrote:
No rebased version has been sent since this update, so this patch has
been marked as RwF.
I've rebased the patch set on latest master and fixed a few compiler
warnings. Beyond that some findings and thoughts:
You're only using RDTSC if the clock source is 'tsc'. Great idea to not
bother caring about a lot of hairy TSC details. Looking at the kernel
code this seems to imply that the TSC is frequency invariant. I don't
think though that this implies that Linux is not running under a
hypervisor; which is good because I assume PostgreSQL is used a lot in
VMs. However, when running under a hypervisor (at least with VMWare)
CPUID leaf 0x16 is not available. In my tests __get_cpuid() indicated
success but the returned values were garbage. Instead of using leaf
0x16, we should then use the hypervisor interface to obtain the TSC
frequency. Checking if a hypervisor is active can be done via:
bool
IsHypervisorActive(void)
{
	uint32 cpuinfo[4] = {0};
	int res = __get_cpuid(0x1, &cpuinfo[0], &cpuinfo[1],
						  &cpuinfo[2], &cpuinfo[3]);

	/* the hypervisor-present bit is CPUID.1:ECX bit 31 */
	return res > 0 && (cpuinfo[2] & (1u << 31));
}
Obtaining the TSC frequency via the hypervisor interface can be done
with the following code. See https://lwn.net/Articles/301888/ for more
details.
/*
 * Under hypervisors (tested with VMWare) leaf 0x16 is not available, even
 * though __get_cpuid() succeeds.  Hence, if running under a hypervisor,
 * use the hypervisor interface to obtain the TSC frequency.
 */
uint32 cpuinfo[4] = {0};

if (IsHypervisorActive() &&
	__get_cpuid(0x40000001, &cpuinfo[0], &cpuinfo[1],
				&cpuinfo[2], &cpuinfo[3]) > 0)
	cycles_to_sec = 1.0 / ((double) cpuinfo[0] * 1000 * 1000);
Given that we anyway switch between RDTSC and clock_gettime() with a
global variable, what about exposing the clock source as a GUC? That way
the user can switch back to a working clock source in case we miss a
detail around activating or reading the TSC.
I'm happy to update the patches accordingly.
--
David Geier
(ServiceNow)
I missed attaching the patches.
--
David Geier
(ServiceNow)
Attachments:
0001-WIP-Change-instr_time-to-just-store-nanoseconds-v3.patch
0002-WIP-Use-cpu-reference-cycles-via-rdtsc-to-measure-v3.patch
Hi,
I re-based again on master and applied the following changes:
I removed the fallback for obtaining the TSC frequency from /proc/cpuinfo
as suggested by Andres. Worst-case we fall back to clock_gettime().
I added code to obtain the TSC frequency via CPUID when under a
hypervisor. I had to use __cpuid() directly instead of __get_cpuid(),
because __get_cpuid() returns an error if the leaf is > 0x80000000
(probably the implementation pre-dates the hypervisor timing leafs).
Unfortunately, while testing my implementation under VMWare, I found
that RDTSC runs awfully slow there (like 30x slower). [1] (page 11
"Virtual TSC") indicates that
we cannot generally rely on RDTSC being actually fast on VMs. However,
the same applies to clock_gettime(). It runs as slow as RDTSC on my
VMWare setup. Hence, using RDTSC is not at a disadvantage. I'm not
entirely sure if there aren't cases where e.g. clock_gettime() is
actually faster than RDTSC and it would be advantageous to use
clock_gettime(). We could add a GUC so that the user can decide which
clock source to use. Any thoughts?
I also somewhat improved the accuracy of the cycles to milli- and
microseconds conversion functions by having two more multipliers with
higher precision. For microseconds we could also keep the computation
integer-only. I'm wondering what best to do for seconds and
milliseconds. I'm currently leaning towards just keeping it as is,
because the durations measured and converted are usually long enough
that precision shouldn't be a problem.
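One way to keep the microsecond conversion integer-only while retaining precision is a fixed-point multiplier computed once at startup; a sketch with an assumed 2.8 GHz TSC (the value and names are illustrative, not taken from the patch):

```c
#include <stdint.h>

#define TSC_HZ 2800000000ULL	/* assumed TSC frequency for this sketch */

/* microseconds per cycle as a 32.32 fixed-point constant */
static const uint64_t usec_per_cycle_q32 =
	(uint64_t) ((1000000.0 / (double) TSC_HZ) * 4294967296.0);

static inline uint64_t
cycles_to_usec(uint64_t cycles)
{
	/* 128-bit intermediate avoids overflow for large cycle counts */
	return (uint64_t) (((unsigned __int128) cycles * usec_per_cycle_q32) >> 32);
}
```

This replaces a floating-point multiply per conversion with an integer multiply and shift, at the cost of a tiny, bounded truncation error.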
In vacuumlazy.c we do if (INSTR_TIME_GET_MICROSEC(elapsed) / 1000). I
changed that to use INSTR_TIME_GET_MILLISECS() instead. Additionally, I
initialized a few variables of type instr_time which otherwise resulted
in warnings due to use of potentially uninitialized variables.
I also couldn't reproduce the reported timing discrepancy. For me the
runtime reported by \timing is just slightly higher than the time
reported by EXPLAIN ANALYZE, which is expected.
Beyond that:
What about renaming INSTR_TIME_GET_DOUBLE() to INSTR_TIME_GET_SECS() so
that it's consistent with the _MILLISEC() and _MICROSEC() variants?
The INSTR_TIME_GET_MICROSEC() returns a uint64 while the other variants
return double. This seems error prone. What about renaming the function
or also have the function return a double and cast where necessary at
the call site?
If no one objects I would also re-register this patch in the commit fest.
[1]: https://vmware.com/content/dam/digitalmarketing/vmware/en/pdf/techpaper/Timekeeping-In-VirtualMachines.pdf (page 11 "Virtual TSC")
--
David Geier
(ServiceNow)
Attachments:
- 0001-Change-instr_time-to-just-store-nanoseconds-v4.patch (text/x-patch; +26/-37)
- 0002-Use-CPU-reference-cycles-via-RDTSC-v4.patch (text/x-patch; +175/-12)
- 0003-Refactor-some-instr_time-related-code-v4.patch (text/x-patch; +9/-11)
Hi David,
Thanks for continuing to work on this patch, and my apologies for silence
on the patch.
It's been hard to make time, and especially so because I typically develop
on an ARM-based macOS system where I can't test this directly - hence my
tests with virtualized EC2 instances, where I ran into the timing oddities.
On Mon, Jan 2, 2023 at 5:28 AM David Geier <geidav.pg@gmail.com> wrote:
The INSTR_TIME_GET_MICROSEC() returns a uint64 while the other variants
return double. This seems error prone. What about renaming the function
or also have the function return a double and cast where necessary at
the call site?
Minor note, but in my understanding using a uint64 (where we can) is faster
for any simple arithmetic we do with the values.
If no one objects I would also re-register this patch in the commit fest.
+1, and feel free to carry this patch forward - I'll try to make an effort
to review my earlier testing issues again, as well as your later
improvements to the patch.
Also, FYI, I just posted an alternate idea for speeding up EXPLAIN ANALYZE
with timing over in [0], using a sampling-based approach to reduce the
timing overhead.
[0]: /messages/by-id/CAP53PkxXMk0j-+0=YwRti2pFR5UB2Gu4v2Oyk8hhZS0DRART6g@mail.gmail.com
Thanks,
Lukas
--
Lukas Fittl
On Fri, Jul 15, 2022 at 11:21 AM Maciek Sakrejda <m.sakrejda@gmail.com> wrote:
On Fri, Jul 1, 2022 at 10:26 AM Andres Freund <andres@anarazel.de> wrote:
On 2022-07-01 01:23:01 -0700, Lukas Fittl wrote:
...Known WIP problems with this patch version:
* There appears to be a timing discrepancy I haven't yet worked out, where
the \timing data reported by psql doesn't match what EXPLAIN ANALYZE is
reporting. With Andres' earlier test case, I'm seeing a consistent ~700ms
higher for \timing than for the EXPLAIN ANALYZE time reported on the
server
side, only when rdtsc measurement is used -- it's likely there is a problem
somewhere with how we perform the cycles to time conversion

Could you explain a bit more what you're seeing? I just tested your patches
and didn't see that here.

I did not see this either, but I did see that the execution time
reported by \timing is (for this test case) consistently 0.5-1ms
*lower* than the Execution Time reported by EXPLAIN. I did not see
that on master. Is that expected?
For what it's worth, I can no longer reproduce this. In fact, I went
back to master-as-of-around-then and applied Lukas' v2 patches again,
and I still can't reproduce that. I do remember it happening
consistently across several executions, but now \timing consistently
shows 0.5-1ms slower, as expected. This does not explain the different
timing issue Lukas was seeing in his tests, but I think we can assume
what I reported originally here is not an issue.
Hi Lukas,
On 1/2/23 20:50, Lukas Fittl wrote:
Thanks for continuing to work on this patch, and my apologies for
silence on the patch.
It would be great if you could review it.
Please also share your thoughts around exposing the used clock source as
GUC and renaming INSTR_TIME_GET_DOUBLE() to _SECS().
I rebased again on master because of [1]. Patches attached.
It's been hard to make time, and especially so because I typically
develop on an ARM-based macOS system where I can't test this directly
- hence my tests with virtualized EC2 instances, where I ran into the
timing oddities.
That's good and bad: bad for getting the development done, but good for
testing the implementation on more virtualized setups, given that I also
encountered "interesting" behavior on VMware (see my previous mails).
On Mon, Jan 2, 2023 at 5:28 AM David Geier <geidav.pg@gmail.com> wrote:
The INSTR_TIME_GET_MICROSEC() returns a uint64 while the other
variants
return double. This seems error prone. What about renaming the
function
or also have the function return a double and cast where necessary at
the call site?

Minor note, but in my understanding using a uint64 (where we can) is
faster for any simple arithmetic we do with the values.
That's true. So the argument could be that for seconds and milliseconds
we want the extra precision while microseconds are precise enough.
Still, we could also make the seconds and milliseconds conversion code
integer-only and e.g. return two integers holding the integral and
fractional parts. From what I can see, the functions aren't used anywhere
in performance-critical code, so it doesn't really make a difference
performance-wise.
+1, and feel free to carry this patch forward - I'll try to make an
effort to review my earlier testing issues again, as well as your
later improvements to the patch.
Moved to the current commit fest. Will you sign up as a reviewer?
Also, FYI, I just posted an alternate idea for speeding up EXPLAIN
ANALYZE with timing over in [0], using a sampling-based approach to
reduce the timing overhead.
Interesting idea. I'll reply with some thoughts on the corresponding thread.
[1]: /messages/by-id/CALDaNm3kRBGPhndujr9JcjjbDCG3anhj0vW8b9YtbXrBDMSvvw@mail.gmail.com
--
David Geier
(ServiceNow)
Attachments:
- 0001-Change-instr_time-to-just-store-nanoseconds-v5.patch (text/x-patch; +26/-37)
- 0002-Use-CPU-reference-cycles-via-RDTSC-v5.patch (text/x-patch; +175/-12)
- 0003-Refactor-some-instr_time-related-code-v5.patch (text/x-patch; +9/-11)
On Tue, 3 Jan 2023 at 14:08, David Geier <geidav.pg@gmail.com> wrote:
CFBot shows some compilation errors as in [1], please post an updated
version for the same:
[09:08:12.525] /usr/bin/ld:
src/bin/pg_test_timing/pg_test_timing.p/pg_test_timing.c.o: warning:
relocation against `cycles_to_sec' in read-only section `.text'
[09:08:12.525] /usr/bin/ld:
src/bin/pg_test_timing/pg_test_timing.p/pg_test_timing.c.o: in
function `pg_clock_gettime_ref_cycles':
[09:08:12.525] /tmp/cirrus-ci-build/build/../src/include/portability/instr_time.h:119:
undefined reference to `use_rdtsc'
[09:08:12.525] /usr/bin/ld:
src/bin/pg_test_timing/pg_test_timing.p/pg_test_timing.c.o: in
function `test_timing':
[09:08:12.525] /tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:135:
undefined reference to `pg_clock_gettime_initialize_rdtsc'
[09:08:12.525] /usr/bin/ld:
/tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:137:
undefined reference to `cycles_to_us'
[09:08:12.525] /usr/bin/ld:
/tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:146:
undefined reference to `cycles_to_us'
[09:08:12.525] /usr/bin/ld:
/tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:169:
undefined reference to `cycles_to_us'
[09:08:12.525] /usr/bin/ld:
/tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:176:
undefined reference to `cycles_to_sec'
[09:08:12.525] /usr/bin/ld: warning: creating DT_TEXTREL in a PIE
[09:08:12.525] collect2: error: ld returned 1 exit status
[1]: https://cirrus-ci.com/task/5375312565895168
Regards,
Vignesh
I fixed the compilation error on CFBot.
I missed adding instr_time.c to the Meson makefile.
New patch set attached.
--
David Geier
(ServiceNow)
Attachments:
- 0001-Change-instr_time-to-just-store-nanoseconds-v6.patch (text/x-patch; +26/-37)
- 0002-Use-CPU-reference-cycles-via-RDTSC-to-measure-time-v6.patch (text/x-patch; +179/-15)
- 0003-Refactor-some-instr_time-related-code-v6.patch (text/x-patch; +9/-11)
Hi,
On 2023-01-04 13:02:05 +0100, David Geier wrote:
From be18633d4735f680c7910fcb4e8ac90c4eada131 Mon Sep 17 00:00:00 2001
From: David Geier <geidav.pg@gmail.com>
Date: Thu, 17 Nov 2022 10:22:01 +0100
Subject: [PATCH 1/3] Change instr_time to just store nanoseconds, that's
cheaper.
Does anybody see a reason to not move forward with this aspect? We do a fair
amount of INSTR_TIME_ACCUM_DIFF() etc, and that gets a good bit cheaper by
just using nanoseconds. We'd also save memory in BufferUsage (144 -> 112 bytes),
Instrumentation (16 bytes saved in Instrumentation itself, 32 via
BufferUsage).
While the range of instr_time storing nanoseconds wouldn't be good enough for
a generic timestamp facility (hence using microsecs for Timestamp), the range
seems plenty for its use of measuring runtime:
((2 ** 63) - 1) / ((10 ** 9) * 60 * 60 * 24 * 365) = ~292 years
Of course, when using CLOCK_REALTIME, this is relative to 1970-01-01, so just
239 years.
It could theoretically be a different story, if we stored instr_time's on
disk. But we don't, they're ephemeral.
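The range claim is easy to sanity-check with plain integer arithmetic (nothing patch-specific here):

```c
#include <stdint.h>

/* A signed 64-bit nanosecond counter covers almost three centuries,
 * which is plenty for measuring runtimes. */
static int64_t
int64_ns_range_in_years(void)
{
	int64_t		ns_per_year = (int64_t) 1000000000 * 60 * 60 * 24 * 365;

	return INT64_MAX / ns_per_year;
}
```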
This doesn't buy a whole lot of performance - the bottleneck is the actual
timestamp computation. But in a query with not much else going on, it's
visible and reproducible. It's, unsurprisingly, a lot easier to see when using
BUFFERS.
For both timespec and nanosecond, I measured three server starts, and for each
started server three executions of
pgbench -n -Mprepared -c1 -P5 -T15 -f <(echo "EXPLAIN (ANALYZE, BUFFERS) SELECT generate_series(1, 10000000) OFFSET 10000000;")
the best result is:
timespec: 1073.431
nanosec: 957.532
a ~10% difference
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
On 2023-01-04 13:02:05 +0100, David Geier wrote:
Subject: [PATCH 1/3] Change instr_time to just store nanoseconds, that's
cheaper.
Does anybody see a reason to not move forward with this aspect? We do a fair
amount of INSTR_TIME_ACCUM_DIFF() etc, and that gets a good bit cheaper by
just using nanoseconds.
Cheaper, and perhaps more accurate too? Don't recall if we have any code
paths where the input timestamps are likely to be better-than-microsecond,
but surely that's coming someday.
I'm unsure that we want to deal with rdtsc's vagaries in general, but
no objection to changing instr_time.
regards, tom lane
Hi,
On 2023-01-13 15:25:16 -0500, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
Does anybody see a reason to not move forward with this aspect? We do a fair
amount of INSTR_TIME_ACCUM_DIFF() etc, and that gets a good bit cheaper by
just using nanoseconds.

Cheaper, and perhaps more accurate too? Don't recall if we have any code
paths where the input timestamps are likely to be better-than-microsecond,
but surely that's coming someday.
instr_time on !WIN32 use struct timespec, so we already should have nanosecond
precision available. IOW, we could add a INSTR_TIME_GET_NANOSEC today. Or am I
misunderstanding what you mean?
I'm unsure that we want to deal with rdtsc's vagaries in general, but
no objection to changing instr_time.
Cool.
Looking at the instr_time.h part of the change, I think it should go further,
and basically do the same thing in the WIN32 path. The only part that needs to
be be win32 specific is INSTR_TIME_SET_CURRENT(). That'd reduce duplication a
good bit.
Greetings,
Andres Freund
On Wed, 4 Jan 2023 at 17:32, David Geier <geidav.pg@gmail.com> wrote:
I fixed the compilation error on CFBot.
I missed adding instr_time.c to the Meson makefile.
New patch set attached.
The patch does not apply on top of HEAD as in [1], please post a rebased patch:
=== Applying patches on top of PostgreSQL commit ID
ff23b592ad6621563d3128b26860bcb41daf9542 ===
=== applying patch
./0002-Use-CPU-reference-cycles-via-RDTSC-to-measure-time-v6.patch
....
patching file src/tools/msvc/Mkvcbuild.pm
Hunk #1 FAILED at 135.
1 out of 1 hunk FAILED -- saving rejects to file src/tools/msvc/Mkvcbuild.pm.rej
[1]: http://cfbot.cputube.org/patch_41_3751.log
Regards,
Vignesh