Why does the call latency on clock_gettime(CLOCK_REALTIME, ..) vary so much?










6















I'm trying to time how long clock_gettime(CLOCK_REALTIME,...) takes to call. "Back in the day" I used to call it once at the top of a loop since it was a fairly expensive call. But now, I was hoping that with vDSO and some clock improvements, it might not be so slow.



I wrote some test code that used __rdtscp to time repeated calls to clock_gettime (the rdtscp calls went around a loop that called clock_gettime and added the results together, just so the compiler wouldn't optimize too much away).



If I call clock_gettime() in quick succession, the length of time goes from about 45k clock cycles down to 500 cycles. Some of this I thought could be contributed to the first call having to load the vDSO code (still doesn't full make sense to me), but how it takes a few calls to get 500 I cannot explain at all, and this behavior seems to be constant regardless of how I test it:



42467
1114
1077
496
455


However, if I sleep (for a second or ten, doesn't matter) between calls to clock_gettime, it only reaches a steady state of about 4.7k cycles:



Here at 10 second sleeps:



28293
1093
4729
4756
4736


Here at 1 second sleeps:



61578
855
4753
4741
5645
4753
4732


Cache behavior wouldn't seem to describe this (on a desktop system not doing much of anything). How much should I budget for a call to clock_gettime? Why does it get progressively faster to call? Why does sleeping a small amount of time matter so much?



tl;dr I'm trying to understand the time it takes to call clock_gettime(CLOCK_REALTIME,...) don't understand why it runs faster when called in quick succession as opposed to a second between calls.



Update: Here's the cpuinfo on proc 0



processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 158
model name : Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
stepping : 9
microcode : 0x84
cpu MHz : 2800.000
cache size : 6144 KB
physical id : 0
siblings : 8
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 22
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp
bugs :
bogomips : 5616.00
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:


Here is the recreated test code:



#include <time.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <x86intrin.h>

// compiled gcc -Wall -O3 -o clockt clockt.cpp
// called glockt sleeptime trials loops

unsigned long long now()
struct timespec s;
clock_gettime(CLOCK_REALTIME, &s);
return (s.tv_sec * 1000000000ull) + s.tv_nsec;


int main(int argc, char **argv)
int sleeptime = atoi(argv[1]);
int trials = atoi(argv[2]);
int loops = atoi(argv[3]);

unsigned long long x, y, n = 0;
unsigned int d;


x = __rdtscp(&d);
n = now();
asm volatile("": "+r" (n));
y = __rdtscp(&d);

printf("init run %lldn", (y-x));

for(int t = 0; t < trials; ++t)
if(sleeptime > 0) sleep(sleeptime);
x = __rdtscp(&d);
for(int l = 0; l < loops; ++l)
n = now();
asm volatile("": "+r" (n));

y = __rdtscp(&d);
printf("trial %d took %lldn", t, (y-x));


exit(0);










share|improve this question



















  • 1





    You’re probably misinterpreting your results: you assume that your time units are the same in all cases. You have to, at the very least, validate this assumption first before you depend on it. And without the test code, there isn’t much we can do to investigate. Finally, at the very least one needs a /prof/cpuinfo dump for 1st cpu to ensure apple’s-to-apples comparison. I also assume that this is not a NUMA machine nor a server motherboard, right?

    – Kuba Ober
    Nov 11 '18 at 19:08







  • 1





    vDSO doesn’t guarantee anything either: you’d need to know that the glibc you’re using actually uses the vDSO interfaces. I

    – Kuba Ober
    Nov 11 '18 at 19:18











  • @Kuba glibc-2.26-30.fc27 and I've updated the questions with cpuinfo. What do you mean by time units not the same? These results persist over many tune, and the machine is single socket using rtdsc, so there shouldn't be any measurement issues.

    – JasonN
    Nov 11 '18 at 21:17












  • @kuba strace verifies that the clock_gettime calls are not syscalls

    – JasonN
    Nov 11 '18 at 23:20






  • 1





    @terry why not? I've seen plenty of code timing calls like gettimeofday for the specific purpose of timing the vDSO vs syscall overhead. I'm not even sure it is the same hardware components on the chip (I don't think clock_gettime uses rdtsc, but I could be wrong).

    – JasonN
    Nov 13 '18 at 23:06















6















I'm trying to time how long clock_gettime(CLOCK_REALTIME,...) takes to call. "Back in the day" I used to call it once at the top of a loop since it was a fairly expensive call. But now, I was hoping that with vDSO and some clock improvements, it might not be so slow.



I wrote some test code that used __rdtscp to time repeated calls to clock_gettime (the rdtscp calls went around a loop that called clock_gettime and added the results together, just so the compiler wouldn't optimize too much away).



If I call clock_gettime() in quick succession, the length of time goes from about 45k clock cycles down to 500 cycles. Some of this I thought could be contributed to the first call having to load the vDSO code (still doesn't full make sense to me), but how it takes a few calls to get 500 I cannot explain at all, and this behavior seems to be constant regardless of how I test it:



42467
1114
1077
496
455


However, if I sleep (for a second or ten, doesn't matter) between calls to clock_gettime, it only reaches a steady state of about 4.7k cycles:



Here at 10 second sleeps:



28293
1093
4729
4756
4736


Here at 1 second sleeps:



61578
855
4753
4741
5645
4753
4732


Cache behavior wouldn't seem to describe this (on a desktop system not doing much of anything). How much should I budget for a call to clock_gettime? Why does it get progressively faster to call? Why does sleeping a small amount of time matter so much?



tl;dr I'm trying to understand the time it takes to call clock_gettime(CLOCK_REALTIME,...) don't understand why it runs faster when called in quick succession as opposed to a second between calls.



Update: Here's the cpuinfo on proc 0



processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 158
model name : Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
stepping : 9
microcode : 0x84
cpu MHz : 2800.000
cache size : 6144 KB
physical id : 0
siblings : 8
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 22
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp
bugs :
bogomips : 5616.00
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:


Here is the recreated test code:



#include <time.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <x86intrin.h>

// compiled gcc -Wall -O3 -o clockt clockt.cpp
// called glockt sleeptime trials loops

unsigned long long now()
struct timespec s;
clock_gettime(CLOCK_REALTIME, &s);
return (s.tv_sec * 1000000000ull) + s.tv_nsec;


int main(int argc, char **argv)
int sleeptime = atoi(argv[1]);
int trials = atoi(argv[2]);
int loops = atoi(argv[3]);

unsigned long long x, y, n = 0;
unsigned int d;


x = __rdtscp(&d);
n = now();
asm volatile("": "+r" (n));
y = __rdtscp(&d);

printf("init run %lldn", (y-x));

for(int t = 0; t < trials; ++t)
if(sleeptime > 0) sleep(sleeptime);
x = __rdtscp(&d);
for(int l = 0; l < loops; ++l)
n = now();
asm volatile("": "+r" (n));

y = __rdtscp(&d);
printf("trial %d took %lldn", t, (y-x));


exit(0);










share|improve this question



















  • 1





    You’re probably misinterpreting your results: you assume that your time units are the same in all cases. You have to, at the very least, validate this assumption first before you depend on it. And without the test code, there isn’t much we can do to investigate. Finally, at the very least one needs a /prof/cpuinfo dump for 1st cpu to ensure apple’s-to-apples comparison. I also assume that this is not a NUMA machine nor a server motherboard, right?

    – Kuba Ober
    Nov 11 '18 at 19:08







  • 1





    vDSO doesn’t guarantee anything either: you’d need to know that the glibc you’re using actually uses the vDSO interfaces. I

    – Kuba Ober
    Nov 11 '18 at 19:18











  • @Kuba glibc-2.26-30.fc27 and I've updated the questions with cpuinfo. What do you mean by time units not the same? These results persist over many tune, and the machine is single socket using rtdsc, so there shouldn't be any measurement issues.

    – JasonN
    Nov 11 '18 at 21:17












  • @kuba strace verifies that the clock_gettime calls are not syscalls

    – JasonN
    Nov 11 '18 at 23:20






  • 1





    @terry why not? I've seen plenty of code timing calls like gettimeofday for the specific purpose of timing the vDSO vs syscall overhead. I'm not even sure it is the same hardware components on the chip (I don't think clock_gettime uses rdtsc, but I could be wrong).

    – JasonN
    Nov 13 '18 at 23:06













6












6








6


1






I'm trying to time how long clock_gettime(CLOCK_REALTIME,...) takes to call. "Back in the day" I used to call it once at the top of a loop since it was a fairly expensive call. But now, I was hoping that with vDSO and some clock improvements, it might not be so slow.



I wrote some test code that used __rdtscp to time repeated calls to clock_gettime (the rdtscp calls went around a loop that called clock_gettime and added the results together, just so the compiler wouldn't optimize too much away).



If I call clock_gettime() in quick succession, the length of time goes from about 45k clock cycles down to 500 cycles. Some of this I thought could be contributed to the first call having to load the vDSO code (still doesn't full make sense to me), but how it takes a few calls to get 500 I cannot explain at all, and this behavior seems to be constant regardless of how I test it:



42467
1114
1077
496
455


However, if I sleep (for a second or ten, doesn't matter) between calls to clock_gettime, it only reaches a steady state of about 4.7k cycles:



Here at 10 second sleeps:



28293
1093
4729
4756
4736


Here at 1 second sleeps:



61578
855
4753
4741
5645
4753
4732


Cache behavior wouldn't seem to describe this (on a desktop system not doing much of anything). How much should I budget for a call to clock_gettime? Why does it get progressively faster to call? Why does sleeping a small amount of time matter so much?



tl;dr I'm trying to understand the time it takes to call clock_gettime(CLOCK_REALTIME,...) don't understand why it runs faster when called in quick succession as opposed to a second between calls.



Update: Here's the cpuinfo on proc 0



processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 158
model name : Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
stepping : 9
microcode : 0x84
cpu MHz : 2800.000
cache size : 6144 KB
physical id : 0
siblings : 8
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 22
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp
bugs :
bogomips : 5616.00
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:


Here is the recreated test code:



#include <time.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <x86intrin.h>

// compiled gcc -Wall -O3 -o clockt clockt.cpp
// called glockt sleeptime trials loops

unsigned long long now()
struct timespec s;
clock_gettime(CLOCK_REALTIME, &s);
return (s.tv_sec * 1000000000ull) + s.tv_nsec;


int main(int argc, char **argv)
int sleeptime = atoi(argv[1]);
int trials = atoi(argv[2]);
int loops = atoi(argv[3]);

unsigned long long x, y, n = 0;
unsigned int d;


x = __rdtscp(&d);
n = now();
asm volatile("": "+r" (n));
y = __rdtscp(&d);

printf("init run %lldn", (y-x));

for(int t = 0; t < trials; ++t)
if(sleeptime > 0) sleep(sleeptime);
x = __rdtscp(&d);
for(int l = 0; l < loops; ++l)
n = now();
asm volatile("": "+r" (n));

y = __rdtscp(&d);
printf("trial %d took %lldn", t, (y-x));


exit(0);










share|improve this question
















I'm trying to time how long clock_gettime(CLOCK_REALTIME,...) takes to call. "Back in the day" I used to call it once at the top of a loop since it was a fairly expensive call. But now, I was hoping that with vDSO and some clock improvements, it might not be so slow.



I wrote some test code that used __rdtscp to time repeated calls to clock_gettime (the rdtscp calls went around a loop that called clock_gettime and added the results together, just so the compiler wouldn't optimize too much away).



If I call clock_gettime() in quick succession, the length of time goes from about 45k clock cycles down to 500 cycles. Some of this I thought could be contributed to the first call having to load the vDSO code (still doesn't full make sense to me), but how it takes a few calls to get 500 I cannot explain at all, and this behavior seems to be constant regardless of how I test it:



42467
1114
1077
496
455


However, if I sleep (for a second or ten, doesn't matter) between calls to clock_gettime, it only reaches a steady state of about 4.7k cycles:



Here at 10 second sleeps:



28293
1093
4729
4756
4736


Here at 1 second sleeps:



61578
855
4753
4741
5645
4753
4732


Cache behavior wouldn't seem to describe this (on a desktop system not doing much of anything). How much should I budget for a call to clock_gettime? Why does it get progressively faster to call? Why does sleeping a small amount of time matter so much?



tl;dr I'm trying to understand the time it takes to call clock_gettime(CLOCK_REALTIME,...) don't understand why it runs faster when called in quick succession as opposed to a second between calls.



Update: Here's the cpuinfo on proc 0



processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 158
model name : Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
stepping : 9
microcode : 0x84
cpu MHz : 2800.000
cache size : 6144 KB
physical id : 0
siblings : 8
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 22
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp
bugs :
bogomips : 5616.00
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:


Here is the recreated test code:



#include <time.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <x86intrin.h>

// compiled gcc -Wall -O3 -o clockt clockt.cpp
// called glockt sleeptime trials loops

unsigned long long now()
struct timespec s;
clock_gettime(CLOCK_REALTIME, &s);
return (s.tv_sec * 1000000000ull) + s.tv_nsec;


int main(int argc, char **argv)
int sleeptime = atoi(argv[1]);
int trials = atoi(argv[2]);
int loops = atoi(argv[3]);

unsigned long long x, y, n = 0;
unsigned int d;


x = __rdtscp(&d);
n = now();
asm volatile("": "+r" (n));
y = __rdtscp(&d);

printf("init run %lldn", (y-x));

for(int t = 0; t < trials; ++t)
if(sleeptime > 0) sleep(sleeptime);
x = __rdtscp(&d);
for(int l = 0; l < loops; ++l)
n = now();
asm volatile("": "+r" (n));

y = __rdtscp(&d);
printf("trial %d took %lldn", t, (y-x));


exit(0);







linux performance optimization timing code-timing






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Nov 13 '18 at 21:05







JasonN

















asked Nov 11 '18 at 18:53









JasonNJasonN

750720




750720







  • 1





    You’re probably misinterpreting your results: you assume that your time units are the same in all cases. You have to, at the very least, validate this assumption first before you depend on it. And without the test code, there isn’t much we can do to investigate. Finally, at the very least one needs a /prof/cpuinfo dump for 1st cpu to ensure apple’s-to-apples comparison. I also assume that this is not a NUMA machine nor a server motherboard, right?

    – Kuba Ober
    Nov 11 '18 at 19:08







  • 1





    vDSO doesn’t guarantee anything either: you’d need to know that the glibc you’re using actually uses the vDSO interfaces. I

    – Kuba Ober
    Nov 11 '18 at 19:18











  • @Kuba glibc-2.26-30.fc27 and I've updated the questions with cpuinfo. What do you mean by time units not the same? These results persist over many tune, and the machine is single socket using rtdsc, so there shouldn't be any measurement issues.

    – JasonN
    Nov 11 '18 at 21:17












  • @kuba strace verifies that the clock_gettime calls are not syscalls

    – JasonN
    Nov 11 '18 at 23:20






  • 1





    @terry why not? I've seen plenty of code timing calls like gettimeofday for the specific purpose of timing the vDSO vs syscall overhead. I'm not even sure it is the same hardware components on the chip (I don't think clock_gettime uses rdtsc, but I could be wrong).

    – JasonN
    Nov 13 '18 at 23:06












  • 1





    You’re probably misinterpreting your results: you assume that your time units are the same in all cases. You have to, at the very least, validate this assumption first before you depend on it. And without the test code, there isn’t much we can do to investigate. Finally, at the very least one needs a /prof/cpuinfo dump for 1st cpu to ensure apple’s-to-apples comparison. I also assume that this is not a NUMA machine nor a server motherboard, right?

    – Kuba Ober
    Nov 11 '18 at 19:08







  • 1





    vDSO doesn’t guarantee anything either: you’d need to know that the glibc you’re using actually uses the vDSO interfaces. I

    – Kuba Ober
    Nov 11 '18 at 19:18











  • @Kuba glibc-2.26-30.fc27 and I've updated the questions with cpuinfo. What do you mean by time units not the same? These results persist over many tune, and the machine is single socket using rtdsc, so there shouldn't be any measurement issues.

    – JasonN
    Nov 11 '18 at 21:17












  • @kuba strace verifies that the clock_gettime calls are not syscalls

    – JasonN
    Nov 11 '18 at 23:20






  • 1





    @terry why not? I've seen plenty of code timing calls like gettimeofday for the specific purpose of timing the vDSO vs syscall overhead. I'm not even sure it is the same hardware components on the chip (I don't think clock_gettime uses rdtsc, but I could be wrong).

    – JasonN
    Nov 13 '18 at 23:06







1




1





You’re probably misinterpreting your results: you assume that your time units are the same in all cases. You have to, at the very least, validate this assumption first before you depend on it. And without the test code, there isn’t much we can do to investigate. Finally, at the very least one needs a /prof/cpuinfo dump for 1st cpu to ensure apple’s-to-apples comparison. I also assume that this is not a NUMA machine nor a server motherboard, right?

– Kuba Ober
Nov 11 '18 at 19:08






You’re probably misinterpreting your results: you assume that your time units are the same in all cases. You have to, at the very least, validate this assumption first before you depend on it. And without the test code, there isn’t much we can do to investigate. Finally, at the very least one needs a /prof/cpuinfo dump for 1st cpu to ensure apple’s-to-apples comparison. I also assume that this is not a NUMA machine nor a server motherboard, right?

– Kuba Ober
Nov 11 '18 at 19:08





1




1





vDSO doesn’t guarantee anything either: you’d need to know that the glibc you’re using actually uses the vDSO interfaces. I

– Kuba Ober
Nov 11 '18 at 19:18





vDSO doesn’t guarantee anything either: you’d need to know that the glibc you’re using actually uses the vDSO interfaces. I

– Kuba Ober
Nov 11 '18 at 19:18













@Kuba glibc-2.26-30.fc27 and I've updated the questions with cpuinfo. What do you mean by time units not the same? These results persist over many tune, and the machine is single socket using rtdsc, so there shouldn't be any measurement issues.

– JasonN
Nov 11 '18 at 21:17






@Kuba glibc-2.26-30.fc27 and I've updated the questions with cpuinfo. What do you mean by time units not the same? These results persist over many tune, and the machine is single socket using rtdsc, so there shouldn't be any measurement issues.

– JasonN
Nov 11 '18 at 21:17














@kuba strace verifies that the clock_gettime calls are not syscalls

– JasonN
Nov 11 '18 at 23:20





@kuba strace verifies that the clock_gettime calls are not syscalls

– JasonN
Nov 11 '18 at 23:20




1




1





@terry why not? I've seen plenty of code timing calls like gettimeofday for the specific purpose of timing the vDSO vs syscall overhead. I'm not even sure it is the same hardware components on the chip (I don't think clock_gettime uses rdtsc, but I could be wrong).

– JasonN
Nov 13 '18 at 23:06





@terry why not? I've seen plenty of code timing calls like gettimeofday for the specific purpose of timing the vDSO vs syscall overhead. I'm not even sure it is the same hardware components on the chip (I don't think clock_gettime uses rdtsc, but I could be wrong).

– JasonN
Nov 13 '18 at 23:06












2 Answers
2






active

oldest

votes


















2





+250









The very first time clock_gettime is called, a page fault occurs on the page that contains the instructions of that function. On my system, this is a soft page fault and it takes a few thousands of cycles to handle (up to 10,000 cycles). My CPU is running at 3.4GHz. I think that your CPU is running at a much lower frequency, so handling the page fault on your system would take more time. But the point here is that the first call to clock_gettime will take much more time than later calls, which is what you are observing.



The second major effect that your code is exhibiting is significant stalls due to instruction cache misses. It might appear that you are only calling two functions, namely now and printf, but these functions call other functions and they all compete on the L1 instruction cache. Overall, it depends on how all of these functions are aligned in the physical address space. When the sleep time is zero seconds, the stall time due to instruction cache misses is actually relatively small (you can measure this using the ICACHE.IFETCH_STALL performance counter). However, when the sleep time is larger than zero seconds, this stall time becomes significantly larger because the OS will schedule some other thread to run on the same core and that thread would different instructions and data. This explains why when you sleep, clock_gettime takes more time to execute.



Now regarding the second and later measurements. From the question:



42467
1114
1077
496
455


I have observed on my system that the second measurement is not necessarily larger than later measurements. I believe this is also true on your system. In fact, this seems to be the case when you sleep for 10 seconds or 1 second. In the outer loop, the two functions now and printf contain thousands of dynamic instructions and they also access the L1 data cache. The variability that you see between the second and later measurements is reproducible. So it's inherent in the functions themselves. Note that the execution time of rdtscp instruction itself may vary by 4 cycles. See also this.



In practice, the clock_gettime is useful when the desired precision is at most a million cycles. Otherwise, it can be misleading.






share|improve this answer

























  • thanks for looking at this, but I don't understand what you are trying to say here: "When the sleep time is zero seconds, the stall time due to instruction cache misses is actually relatively small.... However, when the sleep time is larger than zero seconds, this stall time becomes significantly larger."

    – JasonN
    Nov 18 '18 at 23:23












  • if the icache theory is correct, I should see it in the perf counters, but also if I write the timing values to an array and printf them at the end, i should see the time go down to that ~500 cycle level. I'll test it in the next couple days, thanks.

    – JasonN
    Nov 18 '18 at 23:26






  • 1





    @JasonN This explains the impact of calling sleep and passing an argument larger than zero seconds. It takes more time to call clock_gettime (about 4k vs. less than 1k) because of instruction cache misses. You can verify using the ICACHE.IFETCH_STALL performance counter.

    – Hadi Brais
    Nov 18 '18 at 23:26












  • Well, perf does show an increase of 22k to 50k L1I cache misses when sleep goes from 0 to 1, and along with a few other test that seem to show that the sleep 1 is causing more i-cache misses. I'm still not fully understanding why and need to think about it more. How do I get at that perf counter? I've been suing perf -e L1-icache-load-misses. That would mean that about 4.5k cycles is the time it would take to call an vdso function during regular operation. ugh.

    – JasonN
    Nov 20 '18 at 16:10







  • 2





    @JasonN But sleeping is not a regular operation. My guess is that the OS is scheduling some other thread to run on the same core when going into sleep. That other thread seems to be evicting the instruction cache lines that contain now and/or the functions it calls. Thus, when the thread wakes up, it may either run on the same core or some other core. In both cases, it would exhibit significant stalls due to instruction misses.

    – Hadi Brais
    Nov 20 '18 at 18:48



















0














I could not reproduce your results. Even with large sleep times (10 seconds), and a small number of loops (100), I always get timing of less than 100 clocks (less than 38 ns on my 2.6 GHz system).



For example:



./clockt 10 10 100
init run 14896
trial 0 took 8870 (88 cycles per call)
trial 1 took 8316 (83 cycles per call)
trial 2 took 8384 (83 cycles per call)
trial 3 took 8796 (87 cycles per call)
trial 4 took 9424 (94 cycles per call)
trial 5 took 9054 (90 cycles per call)
trial 6 took 8394 (83 cycles per call)
trial 7 took 8346 (83 cycles per call)
trial 8 took 8868 (88 cycles per call)
trial 9 took 8930 (89 cycles per call)


Outside of measurement or user-error (always the most likely reason), the most likely explanation is that your system is not using rdtsc as a time source, so a system call is made. You can configure the clock source explicitly yourself, or otherwise some heuristic is used which will select rdtsc-based clock_gettime only if it seems suitable on the current system.



The second most likely reason is that clock_gettime(CLOCK_REALTIME) doesn't go through the VDSO on your system, so is a system call even if rdtsc is ultimately being used. I guess that could be due to an old libc version or something like that.



The third most likely reason is that rdtsc on your system is slow, perhaps because it is virtualized or disabled on your system, and implement via a VM exit or OS trap.



Single Loop Results



Trying with a single clock_gettime call per loop, I still get "fast" results after the first few trials. For example, ./clockt 0 20 1 gives:



init run 15932
trial 0 took 352 (352 cycles per call)
trial 1 took 196 (196 cycles per call)
trial 2 took 104 (104 cycles per call)
trial 3 took 104 (104 cycles per call)
trial 4 took 104 (104 cycles per call)
trial 5 took 104 (104 cycles per call)
trial 6 took 102 (102 cycles per call)
trial 7 took 104 (104 cycles per call)
...


Note that I made one modification to the test program, to print the time per call, which seems more useful than the total time. The printf line was modified to:



printf("trial %d took %lld (%lld cycles per call)n", t, (y-x), (y-x)/loops);





share|improve this answer

























  • When you average across mutiple calls per trial, you get the numbers you see because the first call takes about 4.7k cycles and the rest are about 50 cycles, so the trial time for 100 calls is 4700 + 99*50 which is 96.5 average, close to your numbers. I know there isn't a syscall happening because strace shows none, and I double checked the call address too in gdb. There might be a misconfiguration on my machine, but I'm in full control of it (Fedora 27), and I'm not using a VM. Still a little confused by this whole thing. The I-cache answer still leaves a few questions.

    – JasonN
    Nov 26 '18 at 0:03











  • @JasonN - yes my answer wasn't so much based on the transient behavior, but the long run behavior which converges around 25 ns. You mentioned in your question "the length of time goes from about 45k clock cycles down to 500 cycles" so I took 500 cycles (more than 100 ns) to be your "long run, frequency calls" measurement, which is very different than mine. If you are still getting that time, something is unusual. The transient behavior when you have been sleeping etc is mostly an entirely different matter from the best case time, and Hadi had some good ideas there.

    – BeeOnRope
    Nov 27 '18 at 17:40











  • You might want to check if the 500 cycle time is still accurate for 1000+ back-to-back calls (no sleeping). If it is, one could certainly dig into that discrepancy.

    – BeeOnRope
    Nov 27 '18 at 17:41






  • 1





    You can reproduce the pattern by passing a value of 1 to the third argument loops. If you pass a larger value, then you'll get the execution time of the whole inner loop, not individual calls. The timings mentioned in the question are for the case where loops is 1. But note that init run still measures the time for the very first call (which is outside the loop). This particular number varies significantly from one run to another and from one system to another (which makes sense). But the whole pattern is reproducible always.

    – Hadi Brais
    Nov 27 '18 at 20:15












  • @HadiBrais - in that case, the results quickly stabilize to ~100 cycles on my system, not 500. Since it includes rdtsc overhead, this seems within the range of the results with more iterations per loop. So I'm not seeing long results even with only one call every trial. I should clarify that I modified the original code to print time per call as well as total time (see output at the end of my answer), so when I say 65 cycles it is per call.

    – BeeOnRope
    Nov 27 '18 at 21:34











Your Answer






StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);



);













draft saved

draft discarded


















StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53252050%2fwhy-does-the-call-latency-on-clock-gettimeclock-realtime-vary-so-much%23new-answer', 'question_page');

);

Post as a guest















Required, but never shown

























2 Answers
2






active

oldest

votes








2 Answers
2






active

oldest

votes









active

oldest

votes






active

oldest

votes









2





+250









The very first time clock_gettime is called, a page fault occurs on the page that contains the instructions of that function. On my system, this is a soft page fault and it takes a few thousands of cycles to handle (up to 10,000 cycles). My CPU is running at 3.4GHz. I think that your CPU is running at a much lower frequency, so handling the page fault on your system would take more time. But the point here is that the first call to clock_gettime will take much more time than later calls, which is what you are observing.



The second major effect that your code is exhibiting is significant stalls due to instruction cache misses. It might appear that you are only calling two functions, namely now and printf, but these functions call other functions and they all compete on the L1 instruction cache. Overall, it depends on how all of these functions are aligned in the physical address space. When the sleep time is zero seconds, the stall time due to instruction cache misses is actually relatively small (you can measure this using the ICACHE.IFETCH_STALL performance counter). However, when the sleep time is larger than zero seconds, this stall time becomes significantly larger because the OS will schedule some other thread to run on the same core and that thread would different instructions and data. This explains why when you sleep, clock_gettime takes more time to execute.



Now regarding the second and later measurements. From the question:



42467
1114
1077
496
455


I have observed on my system that the second measurement is not necessarily larger than later measurements. I believe this is also true on your system. In fact, this seems to be the case when you sleep for 10 seconds or 1 second. In the outer loop, the two functions now and printf contain thousands of dynamic instructions and they also access the L1 data cache. The variability that you see between the second and later measurements is reproducible. So it's inherent in the functions themselves. Note that the execution time of rdtscp instruction itself may vary by 4 cycles. See also this.



In practice, the clock_gettime is useful when the desired precision is at most a million cycles. Otherwise, it can be misleading.






share|improve this answer

























  • thanks for looking at this, but I don't understand what you are trying to say here: "When the sleep time is zero seconds, the stall time due to instruction cache misses is actually relatively small.... However, when the sleep time is larger than zero seconds, this stall time becomes significantly larger."

    – JasonN
    Nov 18 '18 at 23:23












  • if the icache theory is correct, I should see it in the perf counters, but also if I write the timing values to an array and printf them at the end, i should see the time go down to that ~500 cycle level. I'll test it in the next couple days, thanks.

    – JasonN
    Nov 18 '18 at 23:26






  • 1





    @JasonN This explains the impact of calling sleep and passing an argument larger than zero seconds. It takes more time to call clock_gettime (about 4k vs. less than 1k) because of instruction cache misses. You can verify using the ICACHE.IFETCH_STALL performance counter.

    – Hadi Brais
    Nov 18 '18 at 23:26












  • Well, perf does show an increase of 22k to 50k L1I cache misses when sleep goes from 0 to 1, and along with a few other test that seem to show that the sleep 1 is causing more i-cache misses. I'm still not fully understanding why and need to think about it more. How do I get at that perf counter? I've been suing perf -e L1-icache-load-misses. That would mean that about 4.5k cycles is the time it would take to call an vdso function during regular operation. ugh.

    – JasonN
    Nov 20 '18 at 16:10







  • 2





    @JasonN But sleeping is not a regular operation. My guess is that the OS is scheduling some other thread to run on the same core when going into sleep. That other thread seems to be evicting the instruction cache lines that contain now and/or the functions it calls. Thus, when the thread wakes up, it may either run on the same core or some other core. In both cases, it would exhibit significant stalls due to instruction misses.

    – Hadi Brais
    Nov 20 '18 at 18:48
















2





+250









The very first time clock_gettime is called, a page fault occurs on the page that contains the instructions of that function. On my system, this is a soft page fault and it takes a few thousands of cycles to handle (up to 10,000 cycles). My CPU is running at 3.4GHz. I think that your CPU is running at a much lower frequency, so handling the page fault on your system would take more time. But the point here is that the first call to clock_gettime will take much more time than later calls, which is what you are observing.



The second major effect that your code is exhibiting is significant stalls due to instruction cache misses. It might appear that you are only calling two functions, namely now and printf, but these functions call other functions and they all compete on the L1 instruction cache. Overall, it depends on how all of these functions are aligned in the physical address space. When the sleep time is zero seconds, the stall time due to instruction cache misses is actually relatively small (you can measure this using the ICACHE.IFETCH_STALL performance counter). However, when the sleep time is larger than zero seconds, this stall time becomes significantly larger because the OS will schedule some other thread to run on the same core and that thread would different instructions and data. This explains why when you sleep, clock_gettime takes more time to execute.



Now regarding the second and later measurements. From the question:



42467
1114
1077
496
455


I have observed on my system that the second measurement is not necessarily larger than later measurements. I believe this is also true on your system. In fact, this seems to be the case when you sleep for 10 seconds or 1 second. In the outer loop, the two functions now and printf contain thousands of dynamic instructions and they also access the L1 data cache. The variability that you see between the second and later measurements is reproducible. So it's inherent in the functions themselves. Note that the execution time of rdtscp instruction itself may vary by 4 cycles. See also this.



In practice, the clock_gettime is useful when the desired precision is at most a million cycles. Otherwise, it can be misleading.






share|improve this answer

























  • thanks for looking at this, but I don't understand what you are trying to say here: "When the sleep time is zero seconds, the stall time due to instruction cache misses is actually relatively small.... However, when the sleep time is larger than zero seconds, this stall time becomes significantly larger."

    – JasonN
    Nov 18 '18 at 23:23












  • if the icache theory is correct, I should see it in the perf counters, but also if I write the timing values to an array and printf them at the end, i should see the time go down to that ~500 cycle level. I'll test it in the next couple days, thanks.

    – JasonN
    Nov 18 '18 at 23:26






  • 1





    @JasonN This explains the impact of calling sleep and passing an argument larger than zero seconds. It takes more time to call clock_gettime (about 4k vs. less than 1k) because of instruction cache misses. You can verify using the ICACHE.IFETCH_STALL performance counter.

    – Hadi Brais
    Nov 18 '18 at 23:26












  • Well, perf does show an increase of 22k to 50k L1I cache misses when sleep goes from 0 to 1, and along with a few other test that seem to show that the sleep 1 is causing more i-cache misses. I'm still not fully understanding why and need to think about it more. How do I get at that perf counter? I've been suing perf -e L1-icache-load-misses. That would mean that about 4.5k cycles is the time it would take to call an vdso function during regular operation. ugh.

    – JasonN
    Nov 20 '18 at 16:10







  • 2





    @JasonN But sleeping is not a regular operation. My guess is that the OS is scheduling some other thread to run on the same core when going into sleep. That other thread seems to be evicting the instruction cache lines that contain now and/or the functions it calls. Thus, when the thread wakes up, it may either run on the same core or some other core. In both cases, it would exhibit significant stalls due to instruction misses.

    – Hadi Brais
    Nov 20 '18 at 18:48














2





+250







2





+250



2




+250





The very first time clock_gettime is called, a page fault occurs on the page that contains the instructions of that function. On my system, this is a soft page fault and it takes a few thousands of cycles to handle (up to 10,000 cycles). My CPU is running at 3.4GHz. I think that your CPU is running at a much lower frequency, so handling the page fault on your system would take more time. But the point here is that the first call to clock_gettime will take much more time than later calls, which is what you are observing.



The second major effect that your code is exhibiting is significant stalls due to instruction cache misses. It might appear that you are only calling two functions, namely now and printf, but these functions call other functions and they all compete on the L1 instruction cache. Overall, it depends on how all of these functions are aligned in the physical address space. When the sleep time is zero seconds, the stall time due to instruction cache misses is actually relatively small (you can measure this using the ICACHE.IFETCH_STALL performance counter). However, when the sleep time is larger than zero seconds, this stall time becomes significantly larger because the OS will schedule some other thread to run on the same core and that thread would different instructions and data. This explains why when you sleep, clock_gettime takes more time to execute.



Now regarding the second and later measurements. From the question:



42467
1114
1077
496
455


I have observed on my system that the second measurement is not necessarily larger than later measurements. I believe this is also true on your system. In fact, this seems to be the case when you sleep for 10 seconds or 1 second. In the outer loop, the two functions now and printf contain thousands of dynamic instructions and they also access the L1 data cache. The variability that you see between the second and later measurements is reproducible. So it's inherent in the functions themselves. Note that the execution time of rdtscp instruction itself may vary by 4 cycles. See also this.



In practice, the clock_gettime is useful when the desired precision is at most a million cycles. Otherwise, it can be misleading.






share|improve this answer















The very first time clock_gettime is called, a page fault occurs on the page that contains the instructions of that function. On my system, this is a soft page fault and it takes a few thousands of cycles to handle (up to 10,000 cycles). My CPU is running at 3.4GHz. I think that your CPU is running at a much lower frequency, so handling the page fault on your system would take more time. But the point here is that the first call to clock_gettime will take much more time than later calls, which is what you are observing.



The second major effect that your code is exhibiting is significant stalls due to instruction cache misses. It might appear that you are only calling two functions, namely now and printf, but these functions call other functions and they all compete on the L1 instruction cache. Overall, it depends on how all of these functions are aligned in the physical address space. When the sleep time is zero seconds, the stall time due to instruction cache misses is actually relatively small (you can measure this using the ICACHE.IFETCH_STALL performance counter). However, when the sleep time is larger than zero seconds, this stall time becomes significantly larger because the OS will schedule some other thread to run on the same core and that thread would different instructions and data. This explains why when you sleep, clock_gettime takes more time to execute.



Now regarding the second and later measurements. From the question:



42467
1114
1077
496
455


I have observed on my system that the second measurement is not necessarily larger than later measurements. I believe this is also true on your system. In fact, this seems to be the case when you sleep for 10 seconds or 1 second. In the outer loop, the two functions now and printf contain thousands of dynamic instructions and they also access the L1 data cache. The variability that you see between the second and later measurements is reproducible. So it's inherent in the functions themselves. Note that the execution time of rdtscp instruction itself may vary by 4 cycles. See also this.



In practice, the clock_gettime is useful when the desired precision is at most a million cycles. Otherwise, it can be misleading.







share|improve this answer














share|improve this answer



share|improve this answer








edited Nov 19 '18 at 16:22

























answered Nov 18 '18 at 23:16









Hadi BraisHadi Brais

9,44411838




9,44411838












  • thanks for looking at this, but I don't understand what you are trying to say here: "When the sleep time is zero seconds, the stall time due to instruction cache misses is actually relatively small.... However, when the sleep time is larger than zero seconds, this stall time becomes significantly larger."

    – JasonN
    Nov 18 '18 at 23:23












  • if the icache theory is correct, I should see it in the perf counters, but also if I write the timing values to an array and printf them at the end, i should see the time go down to that ~500 cycle level. I'll test it in the next couple days, thanks.

    – JasonN
    Nov 18 '18 at 23:26






  • 1





    @JasonN This explains the impact of calling sleep and passing an argument larger than zero seconds. It takes more time to call clock_gettime (about 4k vs. less than 1k) because of instruction cache misses. You can verify using the ICACHE.IFETCH_STALL performance counter.

    – Hadi Brais
    Nov 18 '18 at 23:26












  • Well, perf does show an increase of 22k to 50k L1I cache misses when sleep goes from 0 to 1, and along with a few other test that seem to show that the sleep 1 is causing more i-cache misses. I'm still not fully understanding why and need to think about it more. How do I get at that perf counter? I've been suing perf -e L1-icache-load-misses. That would mean that about 4.5k cycles is the time it would take to call an vdso function during regular operation. ugh.

    – JasonN
    Nov 20 '18 at 16:10







  • 2





    @JasonN But sleeping is not a regular operation. My guess is that the OS is scheduling some other thread to run on the same core when going into sleep. That other thread seems to be evicting the instruction cache lines that contain now and/or the functions it calls. Thus, when the thread wakes up, it may either run on the same core or some other core. In both cases, it would exhibit significant stalls due to instruction misses.

    – Hadi Brais
    Nov 20 '18 at 18:48


















  • thanks for looking at this, but I don't understand what you are trying to say here: "When the sleep time is zero seconds, the stall time due to instruction cache misses is actually relatively small.... However, when the sleep time is larger than zero seconds, this stall time becomes significantly larger."

    – JasonN
    Nov 18 '18 at 23:23












  • if the icache theory is correct, I should see it in the perf counters, but also if I write the timing values to an array and printf them at the end, i should see the time go down to that ~500 cycle level. I'll test it in the next couple days, thanks.

    – JasonN
    Nov 18 '18 at 23:26






  • 1





    @JasonN This explains the impact of calling sleep and passing an argument larger than zero seconds. It takes more time to call clock_gettime (about 4k vs. less than 1k) because of instruction cache misses. You can verify using the ICACHE.IFETCH_STALL performance counter.

    – Hadi Brais
    Nov 18 '18 at 23:26












  • Well, perf does show an increase of 22k to 50k L1I cache misses when sleep goes from 0 to 1, and along with a few other test that seem to show that the sleep 1 is causing more i-cache misses. I'm still not fully understanding why and need to think about it more. How do I get at that perf counter? I've been suing perf -e L1-icache-load-misses. That would mean that about 4.5k cycles is the time it would take to call an vdso function during regular operation. ugh.

    – JasonN
    Nov 20 '18 at 16:10







  • 2





    @JasonN But sleeping is not a regular operation. My guess is that the OS is scheduling some other thread to run on the same core when going into sleep. That other thread seems to be evicting the instruction cache lines that contain now and/or the functions it calls. Thus, when the thread wakes up, it may either run on the same core or some other core. In both cases, it would exhibit significant stalls due to instruction misses.

    – Hadi Brais
    Nov 20 '18 at 18:48

















thanks for looking at this, but I don't understand what you are trying to say here: "When the sleep time is zero seconds, the stall time due to instruction cache misses is actually relatively small.... However, when the sleep time is larger than zero seconds, this stall time becomes significantly larger."

– JasonN
Nov 18 '18 at 23:23






thanks for looking at this, but I don't understand what you are trying to say here: "When the sleep time is zero seconds, the stall time due to instruction cache misses is actually relatively small.... However, when the sleep time is larger than zero seconds, this stall time becomes significantly larger."

– JasonN
Nov 18 '18 at 23:23














if the icache theory is correct, I should see it in the perf counters, but also if I write the timing values to an array and printf them at the end, i should see the time go down to that ~500 cycle level. I'll test it in the next couple days, thanks.

– JasonN
Nov 18 '18 at 23:26





if the icache theory is correct, I should see it in the perf counters, but also if I write the timing values to an array and printf them at the end, i should see the time go down to that ~500 cycle level. I'll test it in the next couple days, thanks.

– JasonN
Nov 18 '18 at 23:26




1




1





@JasonN This explains the impact of calling sleep and passing an argument larger than zero seconds. It takes more time to call clock_gettime (about 4k vs. less than 1k) because of instruction cache misses. You can verify using the ICACHE.IFETCH_STALL performance counter.

– Hadi Brais
Nov 18 '18 at 23:26






@JasonN This explains the impact of calling sleep and passing an argument larger than zero seconds. It takes more time to call clock_gettime (about 4k vs. less than 1k) because of instruction cache misses. You can verify using the ICACHE.IFETCH_STALL performance counter.

– Hadi Brais
Nov 18 '18 at 23:26














Well, perf does show an increase of 22k to 50k L1I cache misses when sleep goes from 0 to 1, and along with a few other test that seem to show that the sleep 1 is causing more i-cache misses. I'm still not fully understanding why and need to think about it more. How do I get at that perf counter? I've been suing perf -e L1-icache-load-misses. That would mean that about 4.5k cycles is the time it would take to call an vdso function during regular operation. ugh.

– JasonN
Nov 20 '18 at 16:10






Well, perf does show an increase of 22k to 50k L1I cache misses when sleep goes from 0 to 1, and along with a few other test that seem to show that the sleep 1 is causing more i-cache misses. I'm still not fully understanding why and need to think about it more. How do I get at that perf counter? I've been suing perf -e L1-icache-load-misses. That would mean that about 4.5k cycles is the time it would take to call an vdso function during regular operation. ugh.

– JasonN
Nov 20 '18 at 16:10





2




2





@JasonN But sleeping is not a regular operation. My guess is that the OS is scheduling some other thread to run on the same core when going into sleep. That other thread seems to be evicting the instruction cache lines that contain now and/or the functions it calls. Thus, when the thread wakes up, it may either run on the same core or some other core. In both cases, it would exhibit significant stalls due to instruction misses.

– Hadi Brais
Nov 20 '18 at 18:48






@JasonN But sleeping is not a regular operation. My guess is that the OS is scheduling some other thread to run on the same core when going into sleep. That other thread seems to be evicting the instruction cache lines that contain now and/or the functions it calls. Thus, when the thread wakes up, it may either run on the same core or some other core. In both cases, it would exhibit significant stalls due to instruction misses.

– Hadi Brais
Nov 20 '18 at 18:48














0














I could not reproduce your results. Even with large sleep times (10 seconds), and a small number of loops (100), I always get timing of less than 100 clocks (less than 38 ns on my 2.6 GHz system).



For example:



./clockt 10 10 100
init run 14896
trial 0 took 8870 (88 cycles per call)
trial 1 took 8316 (83 cycles per call)
trial 2 took 8384 (83 cycles per call)
trial 3 took 8796 (87 cycles per call)
trial 4 took 9424 (94 cycles per call)
trial 5 took 9054 (90 cycles per call)
trial 6 took 8394 (83 cycles per call)
trial 7 took 8346 (83 cycles per call)
trial 8 took 8868 (88 cycles per call)
trial 9 took 8930 (89 cycles per call)


Outside of measurement or user-error (always the most likely reason), the most likely explanation is that your system is not using rdtsc as a time source, so a system call is made. You can configure the clock source explicitly yourself, or otherwise some heuristic is used which will select rdtsc-based clock_gettime only if it seems suitable on the current system.



The second most likely reason is that clock_gettime(CLOCK_REALTIME) doesn't go through the VDSO on your system, so is a system call even if rdtsc is ultimately being used. I guess that could be due to an old libc version or something like that.



The third most likely reason is that rdtsc on your system is slow, perhaps because it is virtualized or disabled on your system, and implement via a VM exit or OS trap.



Single Loop Results



Trying with a single clock_gettime call per loop, I still get "fast" results after the first few trials. For example, ./clockt 0 20 1 gives:



init run 15932
trial 0 took 352 (352 cycles per call)
trial 1 took 196 (196 cycles per call)
trial 2 took 104 (104 cycles per call)
trial 3 took 104 (104 cycles per call)
trial 4 took 104 (104 cycles per call)
trial 5 took 104 (104 cycles per call)
trial 6 took 102 (102 cycles per call)
trial 7 took 104 (104 cycles per call)
...


Note that I made one modification to the test program, to print the time per call, which seems more useful than the total time. The printf line was modified to:



printf("trial %d took %lld (%lld cycles per call)n", t, (y-x), (y-x)/loops);





share|improve this answer

























  • When you average across mutiple calls per trial, you get the numbers you see because the first call takes about 4.7k cycles and the rest are about 50 cycles, so the trial time for 100 calls is 4700 + 99*50 which is 96.5 average, close to your numbers. I know there isn't a syscall happening because strace shows none, and I double checked the call address too in gdb. There might be a misconfiguration on my machine, but I'm in full control of it (Fedora 27), and I'm not using a VM. Still a little confused by this whole thing. The I-cache answer still leaves a few questions.

    – JasonN
    Nov 26 '18 at 0:03











  • @JasonN - yes my answer wasn't so much based on the transient behavior, but the long run behavior which converges around 25 ns. You mentioned in your question "the length of time goes from about 45k clock cycles down to 500 cycles" so I took 500 cycles (more than 100 ns) to be your "long run, frequency calls" measurement, which is very different than mine. If you are still getting that time, something is unusual. The transient behavior when you have been sleeping etc is mostly an entirely different matter from the best case time, and Hadi had some good ideas there.

    – BeeOnRope
    Nov 27 '18 at 17:40











  • You might want to check if the 500 cycle time is still accurate for 1000+ back-to-back calls (no sleeping). If it is, one could certainly dig into that discrepancy.

    – BeeOnRope
    Nov 27 '18 at 17:41






  • 1





    You can reproduce the pattern by passing a value of 1 to the third argument loops. If you pass a larger value, then you'll get the execution time of the whole inner loop, not individual calls. The timings mentioned in the question are for the case where loops is 1. But note that init run still measures the time for the very first call (which is outside the loop). This particular number varies significantly from one run to another and from one system to another (which makes sense). But the whole pattern is reproducible always.

    – Hadi Brais
    Nov 27 '18 at 20:15












  • @HadiBrais - in that case, the results quickly stabilize to ~100 cycles on my system, not 500. Since it includes rdtsc overhead, this seems within the range of the results with more iterations per loop. So I'm not seeing long results even with only one call every trial. I should clarify that I modified the original code to print time per call as well as total time (see output at the end of my answer), so when I say 65 cycles it is per call.

    – BeeOnRope
    Nov 27 '18 at 21:34
















0














I could not reproduce your results. Even with large sleep times (10 seconds), and a small number of loops (100), I always get timing of less than 100 clocks (less than 38 ns on my 2.6 GHz system).



For example:



./clockt 10 10 100
init run 14896
trial 0 took 8870 (88 cycles per call)
trial 1 took 8316 (83 cycles per call)
trial 2 took 8384 (83 cycles per call)
trial 3 took 8796 (87 cycles per call)
trial 4 took 9424 (94 cycles per call)
trial 5 took 9054 (90 cycles per call)
trial 6 took 8394 (83 cycles per call)
trial 7 took 8346 (83 cycles per call)
trial 8 took 8868 (88 cycles per call)
trial 9 took 8930 (89 cycles per call)


Outside of measurement or user-error (always the most likely reason), the most likely explanation is that your system is not using rdtsc as a time source, so a system call is made. You can configure the clock source explicitly yourself, or otherwise some heuristic is used which will select rdtsc-based clock_gettime only if it seems suitable on the current system.



The second most likely reason is that clock_gettime(CLOCK_REALTIME) doesn't go through the VDSO on your system, so is a system call even if rdtsc is ultimately being used. I guess that could be due to an old libc version or something like that.



The third most likely reason is that rdtsc on your system is slow, perhaps because it is virtualized or disabled on your system, and implement via a VM exit or OS trap.



Single Loop Results



Trying with a single clock_gettime call per loop, I still get "fast" results after the first few trials. For example, ./clockt 0 20 1 gives:



init run 15932
trial 0 took 352 (352 cycles per call)
trial 1 took 196 (196 cycles per call)
trial 2 took 104 (104 cycles per call)
trial 3 took 104 (104 cycles per call)
trial 4 took 104 (104 cycles per call)
trial 5 took 104 (104 cycles per call)
trial 6 took 102 (102 cycles per call)
trial 7 took 104 (104 cycles per call)
...


Note that I made one modification to the test program, to print the time per call, which seems more useful than the total time. The printf line was modified to:



printf("trial %d took %lld (%lld cycles per call)n", t, (y-x), (y-x)/loops);





share|improve this answer

























  • When you average across mutiple calls per trial, you get the numbers you see because the first call takes about 4.7k cycles and the rest are about 50 cycles, so the trial time for 100 calls is 4700 + 99*50 which is 96.5 average, close to your numbers. I know there isn't a syscall happening because strace shows none, and I double checked the call address too in gdb. There might be a misconfiguration on my machine, but I'm in full control of it (Fedora 27), and I'm not using a VM. Still a little confused by this whole thing. The I-cache answer still leaves a few questions.

    – JasonN
    Nov 26 '18 at 0:03











  • @JasonN - yes my answer wasn't so much based on the transient behavior, but the long run behavior which converges around 25 ns. You mentioned in your question "the length of time goes from about 45k clock cycles down to 500 cycles" so I took 500 cycles (more than 100 ns) to be your "long run, frequency calls" measurement, which is very different than mine. If you are still getting that time, something is unusual. The transient behavior when you have been sleeping etc is mostly an entirely different matter from the best case time, and Hadi had some good ideas there.

    – BeeOnRope
    Nov 27 '18 at 17:40











  • You might want to check if the 500 cycle time is still accurate for 1000+ back-to-back calls (no sleeping). If it is, one could certainly dig into that discrepancy.

    – BeeOnRope
    Nov 27 '18 at 17:41






  • 1





    You can reproduce the pattern by passing a value of 1 to the third argument loops. If you pass a larger value, then you'll get the execution time of the whole inner loop, not individual calls. The timings mentioned in the question are for the case where loops is 1. But note that init run still measures the time for the very first call (which is outside the loop). This particular number varies significantly from one run to another and from one system to another (which makes sense). But the whole pattern is reproducible always.

    – Hadi Brais
    Nov 27 '18 at 20:15












  • @HadiBrais - in that case, the results quickly stabilize to ~100 cycles on my system, not 500. Since it includes rdtsc overhead, this seems within the range of the results with more iterations per loop. So I'm not seeing long results even with only one call every trial. I should clarify that I modified the original code to print time per call as well as total time (see output at the end of my answer), so when I say 65 cycles it is per call.

    – BeeOnRope
    Nov 27 '18 at 21:34














0












0








0







I could not reproduce your results. Even with large sleep times (10 seconds), and a small number of loops (100), I always get timing of less than 100 clocks (less than 38 ns on my 2.6 GHz system).



For example:



./clockt 10 10 100
init run 14896
trial 0 took 8870 (88 cycles per call)
trial 1 took 8316 (83 cycles per call)
trial 2 took 8384 (83 cycles per call)
trial 3 took 8796 (87 cycles per call)
trial 4 took 9424 (94 cycles per call)
trial 5 took 9054 (90 cycles per call)
trial 6 took 8394 (83 cycles per call)
trial 7 took 8346 (83 cycles per call)
trial 8 took 8868 (88 cycles per call)
trial 9 took 8930 (89 cycles per call)


Outside of measurement or user-error (always the most likely reason), the most likely explanation is that your system is not using rdtsc as a time source, so a system call is made. You can configure the clock source explicitly yourself, or otherwise some heuristic is used which will select rdtsc-based clock_gettime only if it seems suitable on the current system.



The second most likely reason is that clock_gettime(CLOCK_REALTIME) doesn't go through the VDSO on your system, so is a system call even if rdtsc is ultimately being used. I guess that could be due to an old libc version or something like that.



The third most likely reason is that rdtsc on your system is slow, perhaps because it is virtualized or disabled on your system, and implement via a VM exit or OS trap.



Single Loop Results



Trying with a single clock_gettime call per loop, I still get "fast" results after the first few trials. For example, ./clockt 0 20 1 gives:



init run 15932
trial 0 took 352 (352 cycles per call)
trial 1 took 196 (196 cycles per call)
trial 2 took 104 (104 cycles per call)
trial 3 took 104 (104 cycles per call)
trial 4 took 104 (104 cycles per call)
trial 5 took 104 (104 cycles per call)
trial 6 took 102 (102 cycles per call)
trial 7 took 104 (104 cycles per call)
...


Note that I made one modification to the test program, to print the time per call, which seems more useful than the total time. The printf line was modified to:



printf("trial %d took %lld (%lld cycles per call)n", t, (y-x), (y-x)/loops);





share|improve this answer















I could not reproduce your results. Even with large sleep times (10 seconds), and a small number of loops (100), I always get timing of less than 100 clocks (less than 38 ns on my 2.6 GHz system).



For example:



./clockt 10 10 100
init run 14896
trial 0 took 8870 (88 cycles per call)
trial 1 took 8316 (83 cycles per call)
trial 2 took 8384 (83 cycles per call)
trial 3 took 8796 (87 cycles per call)
trial 4 took 9424 (94 cycles per call)
trial 5 took 9054 (90 cycles per call)
trial 6 took 8394 (83 cycles per call)
trial 7 took 8346 (83 cycles per call)
trial 8 took 8868 (88 cycles per call)
trial 9 took 8930 (89 cycles per call)


Outside of measurement or user-error (always the most likely reason), the most likely explanation is that your system is not using rdtsc as a time source, so a system call is made. You can configure the clock source explicitly yourself, or otherwise some heuristic is used which will select rdtsc-based clock_gettime only if it seems suitable on the current system.



The second most likely reason is that clock_gettime(CLOCK_REALTIME) doesn't go through the VDSO on your system, so is a system call even if rdtsc is ultimately being used. I guess that could be due to an old libc version or something like that.



The third most likely reason is that rdtsc on your system is slow, perhaps because it is virtualized or disabled on your system, and implement via a VM exit or OS trap.



Single Loop Results



Trying with a single clock_gettime call per loop, I still get "fast" results after the first few trials. For example, ./clockt 0 20 1 gives:



init run 15932
trial 0 took 352 (352 cycles per call)
trial 1 took 196 (196 cycles per call)
trial 2 took 104 (104 cycles per call)
trial 3 took 104 (104 cycles per call)
trial 4 took 104 (104 cycles per call)
trial 5 took 104 (104 cycles per call)
trial 6 took 102 (102 cycles per call)
trial 7 took 104 (104 cycles per call)
...


Note that I made one modification to the test program, to print the time per call, which seems more useful than the total time. The printf line was modified to:



printf("trial %d took %lld (%lld cycles per call)n", t, (y-x), (y-x)/loops);






share|improve this answer














share|improve this answer



share|improve this answer








edited Nov 27 '18 at 21:33

























answered Nov 25 '18 at 23:08









BeeOnRopeBeeOnRope

25.2k875172




25.2k875172












  • When you average across mutiple calls per trial, you get the numbers you see because the first call takes about 4.7k cycles and the rest are about 50 cycles, so the trial time for 100 calls is 4700 + 99*50 which is 96.5 average, close to your numbers. I know there isn't a syscall happening because strace shows none, and I double checked the call address too in gdb. There might be a misconfiguration on my machine, but I'm in full control of it (Fedora 27), and I'm not using a VM. Still a little confused by this whole thing. The I-cache answer still leaves a few questions.

    – JasonN
    Nov 26 '18 at 0:03











  • @JasonN - yes my answer wasn't so much based on the transient behavior, but the long run behavior which converges around 25 ns. You mentioned in your question "the length of time goes from about 45k clock cycles down to 500 cycles" so I took 500 cycles (more than 100 ns) to be your "long run, frequency calls" measurement, which is very different than mine. If you are still getting that time, something is unusual. The transient behavior when you have been sleeping etc is mostly an entirely different matter from the best case time, and Hadi had some good ideas there.

    – BeeOnRope
    Nov 27 '18 at 17:40











  • You might want to check if the 500 cycle time is still accurate for 1000+ back-to-back calls (no sleeping). If it is, one could certainly dig into that discrepancy.

    – BeeOnRope
    Nov 27 '18 at 17:41






  • 1





    You can reproduce the pattern by passing a value of 1 to the third argument loops. If you pass a larger value, then you'll get the execution time of the whole inner loop, not individual calls. The timings mentioned in the question are for the case where loops is 1. But note that init run still measures the time for the very first call (which is outside the loop). This particular number varies significantly from one run to another and from one system to another (which makes sense). But the whole pattern is reproducible always.

    – Hadi Brais
    Nov 27 '18 at 20:15












  • @HadiBrais - in that case, the results quickly stabilize to ~100 cycles on my system, not 500. Since it includes rdtsc overhead, this seems within the range of the results with more iterations per loop. So I'm not seeing long results even with only one call every trial. I should clarify that I modified the original code to print time per call as well as total time (see output at the end of my answer), so when I say 65 cycles it is per call.

    – BeeOnRope
    Nov 27 '18 at 21:34


















  • When you average across mutiple calls per trial, you get the numbers you see because the first call takes about 4.7k cycles and the rest are about 50 cycles, so the trial time for 100 calls is 4700 + 99*50 which is 96.5 average, close to your numbers. I know there isn't a syscall happening because strace shows none, and I double checked the call address too in gdb. There might be a misconfiguration on my machine, but I'm in full control of it (Fedora 27), and I'm not using a VM. Still a little confused by this whole thing. The I-cache answer still leaves a few questions.

    – JasonN
    Nov 26 '18 at 0:03











  • @JasonN - yes my answer wasn't so much based on the transient behavior, but the long run behavior which converges around 25 ns. You mentioned in your question "the length of time goes from about 45k clock cycles down to 500 cycles" so I took 500 cycles (more than 100 ns) to be your "long run, frequency calls" measurement, which is very different than mine. If you are still getting that time, something is unusual. The transient behavior when you have been sleeping etc is mostly an entirely different matter from the best case time, and Hadi had some good ideas there.

    – BeeOnRope
    Nov 27 '18 at 17:40











  • You might want to check if the 500 cycle time is still accurate for 1000+ back-to-back calls (no sleeping). If it is, one could certainly dig into that discrepancy.

    – BeeOnRope
    Nov 27 '18 at 17:41






  • 1





    You can reproduce the pattern by passing a value of 1 to the third argument loops. If you pass a larger value, then you'll get the execution time of the whole inner loop, not individual calls. The timings mentioned in the question are for the case where loops is 1. But note that init run still measures the time for the very first call (which is outside the loop). This particular number varies significantly from one run to another and from one system to another (which makes sense). But the whole pattern is reproducible always.

    – Hadi Brais
    Nov 27 '18 at 20:15












  • @HadiBrais - in that case, the results quickly stabilize to ~100 cycles on my system, not 500. Since it includes rdtsc overhead, this seems within the range of the results with more iterations per loop. So I'm not seeing long results even with only one call every trial. I should clarify that I modified the original code to print time per call as well as total time (see output at the end of my answer), so when I say 65 cycles it is per call.

    – BeeOnRope
    Nov 27 '18 at 21:34

















When you average across mutiple calls per trial, you get the numbers you see because the first call takes about 4.7k cycles and the rest are about 50 cycles, so the trial time for 100 calls is 4700 + 99*50 which is 96.5 average, close to your numbers. I know there isn't a syscall happening because strace shows none, and I double checked the call address too in gdb. There might be a misconfiguration on my machine, but I'm in full control of it (Fedora 27), and I'm not using a VM. Still a little confused by this whole thing. The I-cache answer still leaves a few questions.

– JasonN
Nov 26 '18 at 0:03





When you average across mutiple calls per trial, you get the numbers you see because the first call takes about 4.7k cycles and the rest are about 50 cycles, so the trial time for 100 calls is 4700 + 99*50 which is 96.5 average, close to your numbers. I know there isn't a syscall happening because strace shows none, and I double checked the call address too in gdb. There might be a misconfiguration on my machine, but I'm in full control of it (Fedora 27), and I'm not using a VM. Still a little confused by this whole thing. The I-cache answer still leaves a few questions.

– JasonN
Nov 26 '18 at 0:03













@JasonN - yes my answer wasn't so much based on the transient behavior, but the long run behavior which converges around 25 ns. You mentioned in your question "the length of time goes from about 45k clock cycles down to 500 cycles" so I took 500 cycles (more than 100 ns) to be your "long run, frequency calls" measurement, which is very different than mine. If you are still getting that time, something is unusual. The transient behavior when you have been sleeping etc is mostly an entirely different matter from the best case time, and Hadi had some good ideas there.

– BeeOnRope
Nov 27 '18 at 17:40





@JasonN - yes my answer wasn't so much based on the transient behavior, but the long run behavior which converges around 25 ns. You mentioned in your question "the length of time goes from about 45k clock cycles down to 500 cycles" so I took 500 cycles (more than 100 ns) to be your "long run, frequency calls" measurement, which is very different than mine. If you are still getting that time, something is unusual. The transient behavior when you have been sleeping etc is mostly an entirely different matter from the best case time, and Hadi had some good ideas there.

– BeeOnRope
Nov 27 '18 at 17:40













You might want to check if the 500 cycle time is still accurate for 1000+ back-to-back calls (no sleeping). If it is, one could certainly dig into that discrepancy.

– BeeOnRope
Nov 27 '18 at 17:41





You might want to check if the 500 cycle time is still accurate for 1000+ back-to-back calls (no sleeping). If it is, one could certainly dig into that discrepancy.

– BeeOnRope
Nov 27 '18 at 17:41




1




1





You can reproduce the pattern by passing a value of 1 to the third argument loops. If you pass a larger value, then you'll get the execution time of the whole inner loop, not individual calls. The timings mentioned in the question are for the case where loops is 1. But note that init run still measures the time for the very first call (which is outside the loop). This particular number varies significantly from one run to another and from one system to another (which makes sense). But the whole pattern is reproducible always.

– Hadi Brais
Nov 27 '18 at 20:15






You can reproduce the pattern by passing a value of 1 to the third argument loops. If you pass a larger value, then you'll get the execution time of the whole inner loop, not individual calls. The timings mentioned in the question are for the case where loops is 1. But note that init run still measures the time for the very first call (which is outside the loop). This particular number varies significantly from one run to another and from one system to another (which makes sense). But the whole pattern is reproducible always.

– Hadi Brais
Nov 27 '18 at 20:15














@HadiBrais - in that case, the results quickly stabilize to ~100 cycles on my system, not 500. Since it includes rdtsc overhead, this seems within the range of the results with more iterations per loop. So I'm not seeing long results even with only one call every trial. I should clarify that I modified the original code to print time per call as well as total time (see output at the end of my answer), so when I say 65 cycles it is per call.

– BeeOnRope
Nov 27 '18 at 21:34






@HadiBrais - in that case, the results quickly stabilize to ~100 cycles on my system, not 500. Since it includes rdtsc overhead, this seems within the range of the results with more iterations per loop. So I'm not seeing long results even with only one call every trial. I should clarify that I modified the original code to print time per call as well as total time (see output at the end of my answer), so when I say 65 cycles it is per call.

– BeeOnRope
Nov 27 '18 at 21:34


















draft saved

draft discarded
















































Thanks for contributing an answer to Stack Overflow!


  • Please be sure to answer the question. Provide details and share your research!

But avoid


  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53252050%2fwhy-does-the-call-latency-on-clock-gettimeclock-realtime-vary-so-much%23new-answer', 'question_page');

);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







Popular posts from this blog

Top Tejano songwriter Luis Silva dead of heart attack at 64

ReactJS Fetched API data displays live - need Data displayed static

政党