Vectorization on nonconsecutive elements via index set
The standard template for vectorization seems to be this:

    #define N 100
    double arr[N];
    double func(int i);
    for (int i = 0; i < N; i++)
        arr[i] = func(i);

where all of the indices are accessed consecutively.
However, I have a situation where not all N elements of arr need updating. The template I have is as follows:

    #define N 100
    double arr[N];
    double func(int i);
    int indexset[N];         // the indices of arr that get updated
    int number_in_index_set; // e.g., if only arr[4] and arr[10] need updating,
                             // number_in_index_set = 2, indexset[0] = 4, indexset[1] = 10
    for (int i = 0; i < number_in_index_set; i++)
        arr[indexset[i]] = func(indexset[i]);

In this case, Intel Advisor reports that this loop was not vectorized because the loop iteration count cannot be computed before executing the loop. In my application, this loop is executed for different subsets of indices, and for each such subset, number_in_index_set and indexset change correspondingly.
I have two questions:
(1) What does it mean for the problematic loop to even be vectorized? The array indices are not consecutive, so how would the compiler go about vectorizing the loop?
(2) Assuming vectorization is possible, as Intel Advisor seems to suggest, how can the loop in question be safely vectorized? The recommendations from Intel Advisor are as follows:
For Example 1, where the loop iteration count is not available before the loop executes: if the loop iteration count and the iterations' lower bound can be calculated for the whole loop:
- Move the calculation outside the loop using an additional variable.
- Rewrite the loop to avoid goto statements or other early exits from the loop.
- Identify the loop iterations' lower bound using a constant.
For example, introduce a new limit variable:

    void foo(float *A)
    {
        int i;
        int OuterCount = 90;
        int limit = bar(int(A[0]));
        while (OuterCount > 0)
        {
            for (i = 1; i < limit; i++)
                A[i] = i + 4;
            OuterCount--;
        }
    }
For Example 2, where the compiler cannot determine if there is aliasing between all the pointers used inside the loop and the loop boundaries: assign the loop boundary value to a local variable. In most cases, this is enough for the compiler to determine that aliasing may not occur. You can use a directive to accomplish the same thing automatically:
- For the source loop (ICL/ICC/ICPC): #pragma simd or #pragma omp simd
Do not use global variables or indirect accesses as loop boundaries unless you also use one of the following:
- a directive to ignore vector dependencies for the source loop (ICL/ICC/ICPC): #pragma ivdep
- the restrict keyword.
Specifically, which of the above recommendations are guaranteed to ensure safe vectorization?
c++ compiler-optimization
In general, memory operations cannot be vectorized on non-contiguous elements. However, you may be able to vectorize the execution of the function func and then store its scattered results; it depends on what your function func does. Recent CPUs have introduced vector instructions which allow gather/scatter operations, but these are quite slow; only worth it if func can be vectorized and performs a substantial number of operations. If you want good vectorization, you have to do it manually using intrinsics. In my experience, compilers' auto-vectorization capabilities are still rudimentary. – Fabio Nov 11 at 6:32
@Fabio in my case, func is a function that uses the square root operator. What would it mean for a function (func in this case) to be vectorized? – Tryer Nov 11 at 7:18
The function takes an integer. What does it do with it? Is it arr[i] = sqrt(arr[i])? – Fabio Nov 11 at 7:27
No, it is a somewhat more complicated operation. It takes an int i and then computes the Euclidean distance between the point (x_i, y_i) and some other globally available point (X*, Y*). – Tryer Nov 11 at 7:30
Yes, that is exactly what I have. The architecture is localhost: Intel x86-64, Pentium. Is that what you are referring to? – Tryer Nov 11 at 9:21
asked Nov 11 at 5:31 – Tryer
1 Answer
accepted
Based on the additional comments, assuming the function you want to optimize is the following:

    void euclidean(double x0, double y0, const double *x, const double *y,
                   const size_t *index, size_t index_size, double *result)
    {
        for (size_t i = 0; i < index_size; ++i)
        {
            double dx = x0 - x[index[i]];
            double dy = y0 - y[index[i]];
            result[index[i]] = sqrt(dx*dx + dy*dy);
        }
    }

Since your target is Pentium, only SSE2 SIMD instructions are available. You can try the following optimized function (also available here):

    void euclidean_sse2(double x0, double y0, const double *x, const double *y,
                        const size_t *index, size_t index_size, double *result)
    {
        __m128d vx0 = _mm_set1_pd(x0);
        __m128d vy0 = _mm_set1_pd(y0);
        size_t i;
        for (i = 0; i + 1 < index_size; i += 2)  // process 2 at a time
        {
            size_t i0 = index[i];
            size_t i1 = index[i+1];
            __m128d vx = _mm_set_pd(x[i1], x[i0]);  // load 2 scattered elements
            __m128d vy = _mm_set_pd(y[i1], y[i0]);  // load 2 scattered elements
            __m128d vdx = _mm_sub_pd(vx, vx0);      // sign flipped vs. the scalar version,
            __m128d vdy = _mm_sub_pd(vy, vy0);      // but it is squared below, so no difference
            __m128d vr = _mm_sqrt_pd(_mm_add_pd(_mm_mul_pd(vdx, vdx), _mm_mul_pd(vdy, vdy)));
            _mm_store_sd(result + i0, vr);   // store low lane
            _mm_storeh_pd(result + i1, vr);  // store high lane
        }
        if (index_size & 1)  // if index_size is odd, there is one more element to process
        {
            size_t i0 = index[index_size-1];
            double dx = x0 - x[i0];
            double dy = y0 - y[i0];
            result[i0] = sqrt(dx*dx + dy*dy);
        }
    }
Here the load and store operations are not vectorized, i.e. we are loading x and y one element at a time, and we are storing into result one element at a time. All other operations are vectorized.
With gcc, the assembly output for the body of the main loop is below.
// scalar load operations
mov r10, QWORD PTR [rax]
mov r9, QWORD PTR [rax+8]
add rax, 16
movsd xmm3, QWORD PTR [rdi+r10*8]
movsd xmm2, QWORD PTR [rsi+r10*8]
movhpd xmm3, QWORD PTR [rdi+r9*8]
movhpd xmm2, QWORD PTR [rsi+r9*8]
// vectorial operations
subpd xmm3, xmm4
subpd xmm2, xmm5
mulpd xmm3, xmm3
mulpd xmm2, xmm2
addpd xmm2, xmm3
sqrtpd xmm2, xmm2
// scalar store operations
movlpd QWORD PTR [r8+r10*8], xmm2
movhpd QWORD PTR [r8+r9*8], xmm2
You could also do some loop unrolling (this is probably what a compiler vectorizer would do). But in this case the loop body is quite substantial, and probably memory I/O bound, so I do not think it would help much.
If the meaning of these functions is not clear, you can read this.
A comprehensive discussion of vectorization is here. In brief, modern CPUs offer vector registers in addition to the regular ones. Depending on the machine architecture, we have 128-bit registers (SSE instructions), 256-bit registers (AVX instructions), or 512-bit registers (AVX-512 instructions). Vectorizing means using these registers at their full capability. For instance, if we have a vector of double x0, x1, x2, …, xn and we want to add a constant k to it, obtaining x0+k, x1+k, x2+k, …, xn+k, then in scalar mode the operation is decomposed into: read x[i], add k, store the result. In vector mode, with SSE, it would look like: read x[i] and x[i+1] simultaneously, add k to both simultaneously, store the 2 results simultaneously.
If the elements are not contiguous in memory, we cannot load x[i] and x[i+1] simultaneously, nor can we store the results simultaneously, but at least we can perform the two additions simultaneously.
Thank you for the detailed code answer. I will go through the links. Could you briefly explain (or add it into the answer) what it means to vectorize in the context of nonconsecutive indices? – Tryer Nov 11 at 11:16
Also, x[i+1] and x[i] are not what the function should access. It should be x[index[i+1]] and x[index[i]], and so on. – Tryer Nov 11 at 11:37
Tryer, if you do not post the exact function you want to optimize, I am stabbing in the dark here. I have added some explanation on vectorization. – Fabio Nov 11 at 12:27
I am new to intrinsics, but slowly I am able to figure out what is going on above. Do you have any suggestion for learning it faster/better? E.g., to generate this intrinsic-based code, did you use or take help from the assembly instructions on the godbolt site? I notice that it uses the xmm0 register, for instance. Or is it practice, writing your own code, debugging, etc.? – Tryer Nov 11 at 15:37
I very rarely write assembler directly. Most of the time there is no need; the intrinsics are sufficient. AFAIK there are no tools which help you write this type of code. Godbolt is useful because it shows immediately how your code will be translated into assembler. For simple loops, writing it manually is not needed; the compiler will vectorize automatically. It is an expensive type of optimization and takes time to write. The Intel manual is the best way to learn. Or search the web for "simd intrinsic tutorials". For instance: software.intel.com/en-us/articles/how-to-use-intrinsics – Fabio Nov 11 at 15:43
|
show 1 more comment
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
1
down vote
accepted
Based on the additional comments, assuming the function you want to optimize is the following:
void euclidean(double x0, double y0, const double *x, const double* y,
const size_t *index, size_t index_size, double *result)
for (size_t i = 0; i < index_size; ++i)
double dx = x0 - x[index[i]];
double dy = y0 - y[index[i]];
result[index[i]] = sqrt(dx*dx + dy * dy);
Since your target is Pentium, only SSE2 SIMD instructions are available. You can try the following optimized function (also available here):
void euclidean_sse2(double x0, double y0, const double *x, const double* y,
const size_t *index, size_t index_size, double *result)
__m128d vx0 = _mm_set1_pd(x0);
__m128d vy0 = _mm_set1_pd(y0);
for (size_t i = 0; i + 1 < index_size; i += 2) // process 2 at a time
size_t i0 = index[i];
size_t i1 = index[i+1];
__m128d vx = _mm_set_pd(x[i1], x[i0]); // load 2 scattered elements
__m128d vy = _mm_set_pd(y[i1], y[i0]); // load 2 scattered elements
__m128d vdx = _mm_sub_pd(vx, vx0);
__m128d vdy = _mm_sub_pd(vy, vy0);
__m128d vr = _mm_sqrt_pd(_mm_add_pd(_mm_mul_pd(vdx, vdx), _mm_mul_pd(vdy, vdy)));
_mm_store_sd(result + i0, vr);
_mm_storeh_pd(result + i1, vr);
if (index_size & 1) // if index_size is odd, there is one more element to process
size_t i0 = index[index_size-1];
double dx = x0 - x[i0];
double dy = y0 - y[i0];
result[i0] = sqrt(dx*dx + dy * dy);
Here the load and store operations are not vectorized, i.e. we are loading x
and y
one by one and we are storing into result
one by one. All other operations are vectorized.
With gcc the assembler output of the body of the main loop is below.
// scalar load operations
mov r10, QWORD PTR [rax]
mov r9, QWORD PTR [rax+8]
add rax, 16
movsd xmm3, QWORD PTR [rdi+r10*8]
movsd xmm2, QWORD PTR [rsi+r10*8]
movhpd xmm3, QWORD PTR [rdi+r9*8]
movhpd xmm2, QWORD PTR [rsi+r9*8]
// vectorial operations
subpd xmm3, xmm4
subpd xmm2, xmm5
mulpd xmm3, xmm3
mulpd xmm2, xmm2
addpd xmm2, xmm3
sqrtpd xmm2, xmm2
// scalar store operations
movlpd QWORD PTR [r8+r10*8], xmm2
movhpd QWORD PTR [r8+r9*8], xmm2
You could also do some loop unrolling (this is probably what a compiler vectorizer would do). But in this case the loop body is quite substantial, and probably memory I/O bound, so I do not think it would help much.
If the meaning of these functions is not clear, you can read this.
A comprehensive discussion on vectorization is here. In brief, modern cpu offer vectorial registers I addition to the regular ones. Depending on the machine architecture, we have 128 bit registers (SSE instructions), 256 bit registers (AVX instructions), 512 bit registers (AVX512 instructions). Vectorizing means using these register at their full capability. For instance if we have a vector of double
x0, x1, x2, …. xn
and we want to add it to a constant k
, obtaining x0+k, x1+k, x2+k, …. xn+k
, in scalar mode the operation is decomposed in read x(i)
, add k
, store result. In vectorial mode, with SSE, it would look like read x[i]
and x[i+1]
simultaneously, add k
simultaneously, store 2 results simultaneously.
If the elements are not contiguous in memory, we cannot load x[i] and x[i+1] simultaneously, nor we can store the results simultaneously, but at least we can perform the two additions simultaneously.
Thank you for the detailed code answer. I will go through the links. Could you briefly explain (or add it into the answer) what is means for vectorization in the context of nonconsecutive indices?
– Tryer
Nov 11 at 11:16
Also,x[i+1]
andx[i]
are not what the function should access. It should bex[index[i+1]]
andx[index[i]]
and so on.
– Tryer
Nov 11 at 11:37
Tryer, if you do not post the exact function you want to optimize, I am stabbing in the dark here. I have added some explanation on vectorization.
– Fabio
Nov 11 at 12:27
I am new to intrinsics, but slowly, I am able to figure out what is going on above. Do you have any suggestion to learn it faster/better? For e.g., to generate this intrinsic-based code, did you use or take help from the assembly instructions on the godbolt site? I notice that it uses the xmm0 register, for instance. Or is it practice, writing your own code, debugging, etc.?
– Tryer
Nov 11 at 15:37
1
I very rarely write assembler directly. Most of the time there is no need, the intrinsic are sufficient. AFAIK there are no tools which help you write this type of code. God bolt is useful, because it shows immediately how your code will be translated in assembler. For simple loops, writing manually it is not needed, the compiler will vectorize automatically. It is an expensive type of optimization, takes time to write. The Intel manual is the best way to learn. Or search the web for "simd intrinsic tutorials". For instance software.intel.com/en-us/articles/how-to-use-intrinsics
– Fabio
Nov 11 at 15:43
|
show 1 more comment
up vote
1
down vote
accepted
Based on the additional comments, assuming the function you want to optimize is the following:
void euclidean(double x0, double y0, const double *x, const double* y,
const size_t *index, size_t index_size, double *result)
for (size_t i = 0; i < index_size; ++i)
double dx = x0 - x[index[i]];
double dy = y0 - y[index[i]];
result[index[i]] = sqrt(dx*dx + dy * dy);
Since your target is Pentium, only SSE2 SIMD instructions are available. You can try the following optimized function (also available here):
void euclidean_sse2(double x0, double y0, const double *x, const double* y,
const size_t *index, size_t index_size, double *result)
__m128d vx0 = _mm_set1_pd(x0);
__m128d vy0 = _mm_set1_pd(y0);
for (size_t i = 0; i + 1 < index_size; i += 2) // process 2 at a time
size_t i0 = index[i];
size_t i1 = index[i+1];
__m128d vx = _mm_set_pd(x[i1], x[i0]); // load 2 scattered elements
__m128d vy = _mm_set_pd(y[i1], y[i0]); // load 2 scattered elements
__m128d vdx = _mm_sub_pd(vx, vx0);
__m128d vdy = _mm_sub_pd(vy, vy0);
__m128d vr = _mm_sqrt_pd(_mm_add_pd(_mm_mul_pd(vdx, vdx), _mm_mul_pd(vdy, vdy)));
_mm_store_sd(result + i0, vr);
_mm_storeh_pd(result + i1, vr);
if (index_size & 1) // if index_size is odd, there is one more element to process
size_t i0 = index[index_size-1];
double dx = x0 - x[i0];
double dy = y0 - y[i0];
result[i0] = sqrt(dx*dx + dy * dy);
Here the load and store operations are not vectorized, i.e. we are loading x
and y
one by one and we are storing into result
one by one. All other operations are vectorized.
With gcc the assembler output of the body of the main loop is below.
// scalar load operations
mov r10, QWORD PTR [rax]
mov r9, QWORD PTR [rax+8]
add rax, 16
movsd xmm3, QWORD PTR [rdi+r10*8]
movsd xmm2, QWORD PTR [rsi+r10*8]
movhpd xmm3, QWORD PTR [rdi+r9*8]
movhpd xmm2, QWORD PTR [rsi+r9*8]
// vectorial operations
subpd xmm3, xmm4
subpd xmm2, xmm5
mulpd xmm3, xmm3
mulpd xmm2, xmm2
addpd xmm2, xmm3
sqrtpd xmm2, xmm2
// scalar store operations
movlpd QWORD PTR [r8+r10*8], xmm2
movhpd QWORD PTR [r8+r9*8], xmm2
You could also do some loop unrolling (this is probably what a compiler vectorizer would do). But in this case the loop body is quite substantial, and probably memory I/O bound, so I do not think it would help much.
If the meaning of these functions is not clear, you can read this.
A comprehensive discussion on vectorization is here. In brief, modern cpu offer vectorial registers I addition to the regular ones. Depending on the machine architecture, we have 128 bit registers (SSE instructions), 256 bit registers (AVX instructions), 512 bit registers (AVX512 instructions). Vectorizing means using these register at their full capability. For instance if we have a vector of double
x0, x1, x2, …. xn
and we want to add it to a constant k
, obtaining x0+k, x1+k, x2+k, …. xn+k
, in scalar mode the operation is decomposed in read x(i)
, add k
, store result. In vectorial mode, with SSE, it would look like read x[i]
and x[i+1]
simultaneously, add k
simultaneously, store 2 results simultaneously.
If the elements are not contiguous in memory, we cannot load x[i] and x[i+1] simultaneously, nor we can store the results simultaneously, but at least we can perform the two additions simultaneously.
Thank you for the detailed code answer. I will go through the links. Could you briefly explain (or add it into the answer) what is means for vectorization in the context of nonconsecutive indices?
– Tryer
Nov 11 at 11:16
Also,x[i+1]
andx[i]
are not what the function should access. It should bex[index[i+1]]
andx[index[i]]
and so on.
– Tryer
Nov 11 at 11:37
Tryer, if you do not post the exact function you want to optimize, I am stabbing in the dark here. I have added some explanation on vectorization.
– Fabio
Nov 11 at 12:27
I am new to intrinsics, but slowly, I am able to figure out what is going on above. Do you have any suggestion to learn it faster/better? For e.g., to generate this intrinsic-based code, did you use or take help from the assembly instructions on the godbolt site? I notice that it uses the xmm0 register, for instance. Or is it practice, writing your own code, debugging, etc.?
– Tryer
Nov 11 at 15:37
1
I very rarely write assembler directly. Most of the time there is no need, the intrinsic are sufficient. AFAIK there are no tools which help you write this type of code. God bolt is useful, because it shows immediately how your code will be translated in assembler. For simple loops, writing manually it is not needed, the compiler will vectorize automatically. It is an expensive type of optimization, takes time to write. The Intel manual is the best way to learn. Or search the web for "simd intrinsic tutorials". For instance software.intel.com/en-us/articles/how-to-use-intrinsics
– Fabio
Nov 11 at 15:43
|
show 1 more comment
up vote
1
down vote
accepted
up vote
1
down vote
accepted
Based on the additional comments, assuming the function you want to optimize is the following:
void euclidean(double x0, double y0, const double *x, const double *y,
               const size_t *index, size_t index_size, double *result)
{
    for (size_t i = 0; i < index_size; ++i)
    {
        double dx = x0 - x[index[i]];
        double dy = y0 - y[index[i]];
        result[index[i]] = sqrt(dx*dx + dy*dy);
    }
}
Since your target is Pentium, only SSE2 SIMD instructions are available. You can try the following optimized function (also available here):
void euclidean_sse2(double x0, double y0, const double *x, const double *y,
                    const size_t *index, size_t index_size, double *result)
{
    __m128d vx0 = _mm_set1_pd(x0);
    __m128d vy0 = _mm_set1_pd(y0);
    for (size_t i = 0; i + 1 < index_size; i += 2) // process 2 at a time
    {
        size_t i0 = index[i];
        size_t i1 = index[i+1];
        __m128d vx = _mm_set_pd(x[i1], x[i0]); // load 2 scattered elements
        __m128d vy = _mm_set_pd(y[i1], y[i0]); // load 2 scattered elements
        __m128d vdx = _mm_sub_pd(vx, vx0);
        __m128d vdy = _mm_sub_pd(vy, vy0);
        __m128d vr = _mm_sqrt_pd(_mm_add_pd(_mm_mul_pd(vdx, vdx), _mm_mul_pd(vdy, vdy)));
        _mm_store_sd(result + i0, vr);
        _mm_storeh_pd(result + i1, vr);
    }
    if (index_size & 1) // if index_size is odd, there is one more element to process
    {
        size_t i0 = index[index_size-1];
        double dx = x0 - x[i0];
        double dy = y0 - y[i0];
        result[i0] = sqrt(dx*dx + dy*dy);
    }
}
Here the load and store operations are not vectorized, i.e. we are loading x and y one by one, and we are storing into result one by one. All other operations are vectorized.
With gcc the assembler output of the body of the main loop is below.
// scalar load operations
mov r10, QWORD PTR [rax]
mov r9, QWORD PTR [rax+8]
add rax, 16
movsd xmm3, QWORD PTR [rdi+r10*8]
movsd xmm2, QWORD PTR [rsi+r10*8]
movhpd xmm3, QWORD PTR [rdi+r9*8]
movhpd xmm2, QWORD PTR [rsi+r9*8]
// vectorial operations
subpd xmm3, xmm4
subpd xmm2, xmm5
mulpd xmm3, xmm3
mulpd xmm2, xmm2
addpd xmm2, xmm3
sqrtpd xmm2, xmm2
// scalar store operations
movlpd QWORD PTR [r8+r10*8], xmm2
movhpd QWORD PTR [r8+r9*8], xmm2
You could also do some loop unrolling (this is probably what a compiler vectorizer would do). But in this case the loop body is quite substantial, and probably memory I/O bound, so I do not think it would help much.
If the meaning of these functions is not clear, you can read this.
A comprehensive discussion on vectorization is here. In brief, modern CPUs offer vector registers in addition to the regular ones. Depending on the machine architecture, we have 128-bit registers (SSE instructions), 256-bit registers (AVX instructions), or 512-bit registers (AVX512 instructions). Vectorizing means using these registers at their full capability. For instance, if we have a vector of double x0, x1, x2, …, xn and we want to add a constant k to it, obtaining x0+k, x1+k, x2+k, …, xn+k, in scalar mode the operation is decomposed into: read x[i], add k, store the result. In vectorial mode, with SSE, it would look like: read x[i] and x[i+1] simultaneously, add k simultaneously, store 2 results simultaneously.
If the elements are not contiguous in memory, we cannot load x[i] and x[i+1] simultaneously, nor can we store the results simultaneously, but at least we can perform the two additions simultaneously.
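The contiguous case just described can be sketched with SSE2 intrinsics (a minimal illustration; add_k_sse2 is a made-up name, not from the answer):

```c
#include <stddef.h>
#include <emmintrin.h> // SSE2

// Add a constant k to every element of x: one load/add/store per PAIR of doubles.
static void add_k_sse2(double *x, size_t n, double k)
{
    __m128d vk = _mm_set1_pd(k);                  // broadcast k into both lanes
    size_t i = 0;
    for (; i + 1 < n; i += 2)
    {
        __m128d vx = _mm_loadu_pd(x + i);         // read x[i] and x[i+1] together
        _mm_storeu_pd(x + i, _mm_add_pd(vx, vk)); // one add, one store for 2 results
    }
    for (; i < n; ++i)                            // scalar remainder if n is odd
        x[i] += k;
}
```

With scattered indices, only the _mm_add_pd step survives as-is; the contiguous _mm_loadu_pd and _mm_storeu_pd must be replaced by per-element loads and stores, as in euclidean_sse2 above.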
edited Nov 12 at 10:34
answered Nov 11 at 9:34
Fabio
823318
Thank you for the detailed code answer. I will go through the links. Could you briefly explain (or add it into the answer) what it means for vectorization in the context of nonconsecutive indices?
– Tryer
Nov 11 at 11:16
Also, x[i+1] and x[i] are not what the function should access. It should be x[index[i+1]] and x[index[i]] and so on.
– Tryer
Nov 11 at 11:37
Tryer, if you do not post the exact function you want to optimize, I am stabbing in the dark here. I have added some explanation on vectorization.
– Fabio
Nov 11 at 12:27
I am new to intrinsics, but slowly, I am able to figure out what is going on above. Do you have any suggestion to learn it faster/better? For e.g., to generate this intrinsic-based code, did you use or take help from the assembly instructions on the godbolt site? I notice that it uses the xmm0 register, for instance. Or is it practice, writing your own code, debugging, etc.?
– Tryer
Nov 11 at 15:37
1
I very rarely write assembler directly. Most of the time there is no need; the intrinsics are sufficient. AFAIK there are no tools which help you write this type of code. Godbolt is useful because it shows immediately how your code will be translated into assembler. For simple loops, writing manually is not needed; the compiler will vectorize automatically. It is an expensive type of optimization and takes time to write. The Intel manual is the best way to learn. Or search the web for "simd intrinsic tutorials". For instance software.intel.com/en-us/articles/how-to-use-intrinsics
– Fabio
Nov 11 at 15:43
1
In general, memory operations cannot be vectorized on non-contiguous elements. However, you may be able to vectorize the execution of the function foo, then store its scattered results. It depends on what your function foo does. Recent CPUs have introduced vector instructions which allow gather-scatter operations, but these are quite slow. They are only worth it if foo can be vectorized and performs a substantial number of operations. If you want good vectorization you have to do it manually using intrinsics. In my experience compilers' auto-vectorization capabilities are still rudimentary.
– Fabio
Nov 11 at 6:32
@Fabio in my case, func is a function that uses the square root operator. What would it mean for a function (func in this case) to be vectorized?
– Tryer
Nov 11 at 7:18
The function takes an integer. What does it do with it? Is it arr[i] = sqrt(arr[i])?
– Fabio
Nov 11 at 7:27
1
No, it is a somewhat more complicated operation. It takes an int i and then computes the Euclidean distance between the point (x_i, y_i) and some other globally available point (X*, Y*).
– Tryer
Nov 11 at 7:30
1
Yes, that is exactly what I have. The architecture is localhost: Intel x86-64, Pentium. Is that what you are referring to?
– Tryer
Nov 11 at 9:21