Re: [HACKERS] S_LOCK reduced contention

Started by Tom Lane over 27 years ago, 2 messages
#1 Tom Lane
tgl@sss.pgh.pa.us

dg@illustra.com (David Gould) writes:

> [ Much careful testing snipped ]

Nice job, David!

> gprof introduces severe experimental error for small quick functions.
> Perhaps the spinlock function is so short and quick compared to the
> mcount overhead added to the function prolog that the overhead dominates
> the measurement. gprof remains a useful tool for larger functions with
> longer runtimes, but must be considered very suspect for tiny functions.

Right. gprof is a fine example of Heisenberg's Uncertainty Principle
applied to software ;-). You can't measure something without affecting it.
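
For anyone reproducing the measurements, the mcount overhead David
mentions comes from the profiling instrumentation itself: compiling with
-pg makes the compiler emit a call to mcount in every function prologue,
which for a routine as small as tas() can easily cost more than the
routine's body. An illustrative workflow (the source file name here is
just a placeholder):

    gcc -pg -O2 s_lock_test.c -o s_lock_test
    ./s_lock_test
    gprof s_lock_test gmon.out > profile.txt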

As Bruce Momjian just pointed out in another followup, running the test
function a lot of times in a tight loop isn't a perfect answer either:
you find out what happens when the function's code and referenced data are
all in cache, but you can't tell much about what happens when they are
not; and you can't tell whether the whole application's memory usage
patterns are such that the function will remain in cache.
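
As a concrete illustration of the tight-loop approach and its warm-cache
bias, here is a minimal sketch of such a harness. The loop body is only
a stand-in (a pair of volatile stores) for the real acquire/release
calls, and the iteration count is arbitrary:

    #include <stdio.h>
    #include <sys/time.h>

    static volatile unsigned char dummy_lock;   /* stand-in for the lock word */

    int
    main(void)
    {
        const long n = 10 * 1000 * 1000;
        struct timeval t0, t1;
        long i;
        double secs;

        gettimeofday(&t0, NULL);
        for (i = 0; i < n; i++)
        {
            /* Substitute the real lock/unlock calls here when measuring. */
            dummy_lock = 1;
            dummy_lock = 0;
        }
        gettimeofday(&t1, NULL);

        secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
        printf("%.1f ns per iteration (warm-cache case only)\n",
               secs * 1e9 / n);
        return 0;
    }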

I'm guessing that the backend uses tas() enough that it will probably
stay in cache, but that is *strictly* a guess with no evidence.
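
For readers who haven't looked at the S_LOCK code, here is a rough
sketch of what a test-and-set spinlock of this kind looks like on x86.
This is only an illustration assuming gcc-style inline assembly, not the
actual implementation:

    typedef volatile unsigned char slock_t;

    /* Atomically swap 1 into *lock and return the previous value,
     * so zero means we acquired the lock. xchg is implicitly locked
     * on x86. */
    static int
    tas(slock_t *lock)
    {
        unsigned char res = 1;

        __asm__ __volatile__("xchgb %0, %1"
                             : "+q"(res), "+m"(*lock)
                             :
                             : "memory");
        return (int) res;
    }

    static void
    s_lock(slock_t *lock)
    {
        while (tas(lock))
            ;                   /* a real lock would back off or sleep here */
    }

    static void
    s_unlock(slock_t *lock)
    {
        *lock = 0;              /* releasing is just clearing the flag */
    }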

> In some of the test cases there was significant timing variation from
> run to run even though the test conditions were apparently identical.
> Even more strangely, the variation in time was not random but appeared
> to represent two different modes. And the variation was itself
> repeatable.
> [snip]
> I have no explanation for this variation. Possibly it is some interaction
> of where the program is loaded and the state of the memory hierarchy, but
> even this is hard to sustain. I would be very curious to hear of any
> plausible explanations.

After chewing on this for a while I think that your speculation is
right. You were using a Pentium, you said. The Pentium has a two-way
set associative cache, which means that any given main-memory address
has exactly two cache lines it could be loaded into. Main-memory
addresses that are 1000H apart contend for the same pair of cache lines.
Thus, if your program happens to repeatedly hit three locations that are
exactly 1000H apart, it will suffer a cache miss every time. Change the
address spacing, and no miss occurs. The cache miss takes forty-some
clock cycles, versus one if the target location is in cache.
(BTW, I'm getting this info out of Rick Booth's "Inner Loops", a fine
reference book if you are into hand-optimized assembly coding for Intel
processors.)
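
To make the 1000H figure concrete: the original Pentium's data cache is
8KB, two-way set associative with 32-byte lines, i.e. 128 sets, so each
way covers 4KB. A quick back-of-the-envelope check (the geometry is
hard-coded below; adjust it for other chips):

    #include <stdio.h>

    #define LINE_SIZE 32
    #define NUM_SETS  128       /* 8KB / (2 ways * 32-byte lines) */

    static unsigned
    set_index(unsigned long addr)
    {
        return (addr / LINE_SIZE) % NUM_SETS;
    }

    int
    main(void)
    {
        unsigned long a = 0x08048100UL;     /* arbitrary example address */

        /* Three addresses exactly 0x1000 apart land in the same set,
         * which a 2-way cache cannot hold simultaneously. */
        printf("%u %u %u\n",
               set_index(a), set_index(a + 0x1000), set_index(a + 0x2000));
        return 0;
    }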

So what I think you were seeing is that on some runs, the loop involved
hitting three addresses that contend for the same cache line pair, while
on other runs there was no cache contention. This could be explained
by varying load addresses for the program, if it touched both its own
addresses (variable) and some non-varying addresses --- say, C library
routines executed from a shared library that remained loaded throughout.
If you can link with a non-shared C library then you should get more
consistent runtimes, because the offsets between all the locations
touched by your loop would be fixed, and thus cache hits or misses ought
to be consistent from run to run.
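
Concretely, something along these lines (again with a placeholder file
name) should pin the library routines at fixed offsets relative to the
rest of the program:

    gcc -static -O2 s_lock_test.c -o s_lock_test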

The bottom line, however, is that this behavior is too unpredictable
to be worth worrying about in a production program. A cache miss in
a tight loop could be created or eliminated by unrelated changes in
distant parts of the code ... and in any case the behavior will be
different on different CPUs. The 486, Pentium, and Pentium Pro all
have radically different cache layouts, let alone non-Intel CPUs.

regards, tom lane

#2 David Gould
dg@illustra.com
In reply to: Tom Lane (#1)

Tom Lane writes:

> dg@illustra.com (David Gould) writes:

> [ Much careful testing snipped ]
>> In some of the test cases there was significant timing variation from
>> run to run even though the test conditions were apparently identical.
>> Even more strangely, the variation in time was not random but appeared
>> to represent two different modes. And the variation was itself
>> repeatable.
>> [snip]
>> I have no explanation for this variation. Possibly it is some interaction
>> of where the program is loaded and the state of the memory hierarchy, but
>> even this is hard to sustain. I would be very curious to hear of any
>> plausible explanations.

> After chewing on this for a while I think that your speculation is
> right. You were using a Pentium, you said. The Pentium has a two-way
> set associative cache, which means that any given main-memory address
> has exactly two cache lines it could be loaded into. Main-memory
> addresses that are 1000H apart contend for the same pair of cache lines.
> Thus, if your program happens to repeatedly hit three locations that are
> exactly 1000H apart, it will suffer a cache miss every time. Change the
> address spacing, and no miss occurs. The cache miss takes forty-some
> clock cycles, versus one if the target location is in cache.
> (BTW, I'm getting this info out of Rick Booth's "Inner Loops", a fine
> reference book if you are into hand-optimized assembly coding for Intel
> processors.)
>
> So what I think you were seeing is that on some runs, the loop involved
> hitting three addresses that contend for the same cache line pair, while
> on other runs there was no cache contention. This could be explained
> by varying load addresses for the program, if it touched both its own
> addresses (variable) and some non-varying addresses --- say, C library
> routines executed from a shared library that remained loaded throughout.
> If you can link with a non-shared C library then you should get more
> consistent runtimes, because the offsets between all the locations
> touched by your loop would be fixed, and thus cache hits or misses ought
> to be consistent from run to run.

This is in line with my own speculation. However, I am not convinced.

First, the test loop and the function it calls are only about 100 bytes
in total. And no system calls or library calls are made during
the test. This tends to rule out "locations 1000H apart" and shared library
effects. Also, I would expect the system to load programs and libraries
on VM page boundaries. Unless there is some cacheability difference from one
page to the next, I am at a loss to account for this.
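
One way to chase this down would be to print the cache-set index of each
address the loop actually touches (code, lock word, stack) on both a
"fast" and a "slow" run and look for a collision. A rough sketch, with
made-up names and the same Pentium geometry assumed as before:

    #include <stdio.h>

    #define LINE_SIZE 32
    #define NUM_SETS  128

    static volatile unsigned char test_lock;    /* the lock word under test */

    static unsigned
    set_of(unsigned long addr)
    {
        return (addr / LINE_SIZE) % NUM_SETS;
    }

    int
    main(void)
    {
        int on_stack = 0;

        /* Casting a function pointer to an integer is a common
         * (non-standard) trick; good enough for a diagnostic printout. */
        printf("code:  set %u\n", set_of((unsigned long) main));
        printf("lock:  set %u\n", set_of((unsigned long) &test_lock));
        printf("stack: set %u\n", set_of((unsigned long) &on_stack));
        return on_stack;
    }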

> The bottom line, however, is that this behavior is too unpredictable
> to be worth worrying about in a production program. A cache miss in
> a tight loop could be created or eliminated by unrelated changes in
> distant parts of the code ... and in any case the behavior will be
> different on different CPUs. The 486, Pentium, and Pentium Pro all
> have radically different cache layouts, let alone non-Intel CPUs.

Agreed, the reasonable thing to do is to try to be sensitive to cache
effects and accept that there are mysteries.

thanks

-dg

David Gould dg@illustra.com 510.628.3783 or 510.305.9468
Informix Software (No, really) 300 Lakeside Drive Oakland, CA 94612
"Don't worry about people stealing your ideas. If your ideas are any
good, you'll have to ram them down people's throats." -- Howard Aiken