Vectorization on nonconsecutive elements via index set









up vote
4
down vote

favorite












The standard template for vectorization seems to be thus:



#define N 100
double arr[N];
double func(int i);

for(int i = 0; i <N; i++)
arr[i] = func(i);


where all of the indices are consecutively accessed.



However, I have a situation where not all Nelements of arr need updation. The template I have is as follows:



#define N 100
double arr[N];
double func(int i);

int indexset[N];//this indexset has the indices of arr that get updated
int number_in_index_set;
//E.g., if I only need to update arr[4] and arr[10], number_in_index_set = 2
//indexset[0] = 4 and indexset[1] = 10

for(int i = 0; i <number_in_index_set; i++)
arr[indexset[i]] = func(indexset[i]);


In this case, Intel Advisor reports that this loop was not vectorized because loop iteration count cannot be computed before executing the loop. In my application, this loop is executed for different subsets of indices, and for each such subset, number_in_index_set and indexset would change correspondingly.



I have two questions:



(1)What does it mean for the problematic loop to even be vectorized? The array indices are not consecutive, so how would the compiler even go about vectorizing the loop?



(2)Assuming vectorization is possible, as Intel Advisor seems to suggest, how can the loop in question be safely vectorized? The recommendation from Intel Advisor are thus:




For Example 1, where the loop iteration count is not available before the loop 

executes: If the loop iteration count and iterations lower bound can be calculated

for the whole loop:
Move the calculation outside the loop using an additional variable.
Rewrite the loop to avoid goto statements or other early exits from the loop.
Identify the loop iterations lower bound using a constant.
For example, introduce the new limit variable:
void foo(float *A)
int i;
int OuterCount = 90;
int limit = bar(int(A[0]));
while (OuterCount > 0)
for (i = 1; i < limit; i++)
A[i] = i + 4;

OuterCount--;


For Example 2, where the compiler cannot determine if there is aliasing between all

the pointers used inside the loop and loop boundaries: Assign the loop boundary value

to a local variable. In most cases, this is enough for the compiler to determine

aliasing may not occur.
You can use a directive to accomplish the same thing automatically.
Target ICL/ICC/ICPC Directive
Source loop #pragma simd or #pragma omp simd
Do not use global variables or indirect accesses as loop boundaries unless you also

use one of the following:
Directive to ignore vector dependencies
Target ICL/ICC/ICPC Directive
Source loop #pragma ivdep
restrict
keyword.



Specifically, which of the above recommendations are guaranteed to ensure safe vectorization?










share|improve this question

















  • 1




    I general memory operations cannot be vectorized on non-contiguous elements. However you may be able to vectorize execution of the function foo, then store its scattered results. It depends on what your function foo does. Recently CPUs have introduced vectorizing instructions which allow gather-scatter operations, but these are quite slow. Only worth it if foo can be vectorized and it makes a substantial number of operations. If you want good vectorization you have to do it manually using intrinsic. In my experience compilers' auto-vectorization capabilities are still rudimental.
    – Fabio
    Nov 11 at 6:32











  • @Fabio in my case, func is a function that uses square root operator. What would it mean for a function (func in this case) to be vectorized?
    – Tryer
    Nov 11 at 7:18










  • The function takes an integer. What does it do with it? Is it arr[i]=sqrt(arr[i])?
    – Fabio
    Nov 11 at 7:27







  • 1




    No, it is a somewhat more complicated operation. It takes an int i and then performs euclidean distance between point (x_i,y_i) and some other globally available point (X*,Y*)
    – Tryer
    Nov 11 at 7:30







  • 1




    Yes, that is exactly what I have. The Architecture is localhost: Intel x86-64. Pentium. Is that what you are referring to?
    – Tryer
    Nov 11 at 9:21














up vote
4
down vote

favorite












The standard template for vectorization seems to be thus:



#define N 100
double arr[N];
double func(int i);

for(int i = 0; i <N; i++)
arr[i] = func(i);


where all of the indices are consecutively accessed.



However, I have a situation where not all Nelements of arr need updation. The template I have is as follows:



#define N 100
double arr[N];
double func(int i);

int indexset[N];//this indexset has the indices of arr that get updated
int number_in_index_set;
//E.g., if I only need to update arr[4] and arr[10], number_in_index_set = 2
//indexset[0] = 4 and indexset[1] = 10

for(int i = 0; i <number_in_index_set; i++)
arr[indexset[i]] = func(indexset[i]);


In this case, Intel Advisor reports that this loop was not vectorized because loop iteration count cannot be computed before executing the loop. In my application, this loop is executed for different subsets of indices, and for each such subset, number_in_index_set and indexset would change correspondingly.



I have two questions:



(1)What does it mean for the problematic loop to even be vectorized? The array indices are not consecutive, so how would the compiler even go about vectorizing the loop?



(2)Assuming vectorization is possible, as Intel Advisor seems to suggest, how can the loop in question be safely vectorized? The recommendation from Intel Advisor are thus:




For Example 1, where the loop iteration count is not available before the loop 

executes: If the loop iteration count and iterations lower bound can be calculated

for the whole loop:
Move the calculation outside the loop using an additional variable.
Rewrite the loop to avoid goto statements or other early exits from the loop.
Identify the loop iterations lower bound using a constant.
For example, introduce the new limit variable:
void foo(float *A)
int i;
int OuterCount = 90;
int limit = bar(int(A[0]));
while (OuterCount > 0)
for (i = 1; i < limit; i++)
A[i] = i + 4;

OuterCount--;


For Example 2, where the compiler cannot determine if there is aliasing between all

the pointers used inside the loop and loop boundaries: Assign the loop boundary value

to a local variable. In most cases, this is enough for the compiler to determine

aliasing may not occur.
You can use a directive to accomplish the same thing automatically.
Target ICL/ICC/ICPC Directive
Source loop #pragma simd or #pragma omp simd
Do not use global variables or indirect accesses as loop boundaries unless you also

use one of the following:
Directive to ignore vector dependencies
Target ICL/ICC/ICPC Directive
Source loop #pragma ivdep
restrict
keyword.



Specifically, which of the above recommendations are guaranteed to ensure safe vectorization?










share|improve this question

















  • 1




    I general memory operations cannot be vectorized on non-contiguous elements. However you may be able to vectorize execution of the function foo, then store its scattered results. It depends on what your function foo does. Recently CPUs have introduced vectorizing instructions which allow gather-scatter operations, but these are quite slow. Only worth it if foo can be vectorized and it makes a substantial number of operations. If you want good vectorization you have to do it manually using intrinsic. In my experience compilers' auto-vectorization capabilities are still rudimental.
    – Fabio
    Nov 11 at 6:32











  • @Fabio in my case, func is a function that uses square root operator. What would it mean for a function (func in this case) to be vectorized?
    – Tryer
    Nov 11 at 7:18










  • The function takes an integer. What does it do with it? Is it arr[i]=sqrt(arr[i])?
    – Fabio
    Nov 11 at 7:27







  • 1




    No, it is a somewhat more complicated operation. It takes an int i and then performs euclidean distance between point (x_i,y_i) and some other globally available point (X*,Y*)
    – Tryer
    Nov 11 at 7:30







  • 1




    Yes, that is exactly what I have. The Architecture is localhost: Intel x86-64. Pentium. Is that what you are referring to?
    – Tryer
    Nov 11 at 9:21












up vote
4
down vote

favorite









up vote
4
down vote

favorite











The standard template for vectorization seems to be thus:



#define N 100
double arr[N];
double func(int i);

for(int i = 0; i <N; i++)
arr[i] = func(i);


where all of the indices are consecutively accessed.



However, I have a situation where not all Nelements of arr need updation. The template I have is as follows:



#define N 100
double arr[N];
double func(int i);

int indexset[N];//this indexset has the indices of arr that get updated
int number_in_index_set;
//E.g., if I only need to update arr[4] and arr[10], number_in_index_set = 2
//indexset[0] = 4 and indexset[1] = 10

for(int i = 0; i <number_in_index_set; i++)
arr[indexset[i]] = func(indexset[i]);


In this case, Intel Advisor reports that this loop was not vectorized because loop iteration count cannot be computed before executing the loop. In my application, this loop is executed for different subsets of indices, and for each such subset, number_in_index_set and indexset would change correspondingly.



I have two questions:



(1)What does it mean for the problematic loop to even be vectorized? The array indices are not consecutive, so how would the compiler even go about vectorizing the loop?



(2)Assuming vectorization is possible, as Intel Advisor seems to suggest, how can the loop in question be safely vectorized? The recommendation from Intel Advisor are thus:




For Example 1, where the loop iteration count is not available before the loop 

executes: If the loop iteration count and iterations lower bound can be calculated

for the whole loop:
Move the calculation outside the loop using an additional variable.
Rewrite the loop to avoid goto statements or other early exits from the loop.
Identify the loop iterations lower bound using a constant.
For example, introduce the new limit variable:
void foo(float *A)
int i;
int OuterCount = 90;
int limit = bar(int(A[0]));
while (OuterCount > 0)
for (i = 1; i < limit; i++)
A[i] = i + 4;

OuterCount--;


For Example 2, where the compiler cannot determine if there is aliasing between all

the pointers used inside the loop and loop boundaries: Assign the loop boundary value

to a local variable. In most cases, this is enough for the compiler to determine

aliasing may not occur.
You can use a directive to accomplish the same thing automatically.
Target ICL/ICC/ICPC Directive
Source loop #pragma simd or #pragma omp simd
Do not use global variables or indirect accesses as loop boundaries unless you also

use one of the following:
Directive to ignore vector dependencies
Target ICL/ICC/ICPC Directive
Source loop #pragma ivdep
restrict
keyword.



Specifically, which of the above recommendations are guaranteed to ensure safe vectorization?










share|improve this question













The standard template for vectorization seems to be thus:



#define N 100
double arr[N];
double func(int i);

for(int i = 0; i <N; i++)
arr[i] = func(i);


where all of the indices are consecutively accessed.



However, I have a situation where not all Nelements of arr need updation. The template I have is as follows:



#define N 100
double arr[N];
double func(int i);

int indexset[N];//this indexset has the indices of arr that get updated
int number_in_index_set;
//E.g., if I only need to update arr[4] and arr[10], number_in_index_set = 2
//indexset[0] = 4 and indexset[1] = 10

for(int i = 0; i <number_in_index_set; i++)
arr[indexset[i]] = func(indexset[i]);


In this case, Intel Advisor reports that this loop was not vectorized because loop iteration count cannot be computed before executing the loop. In my application, this loop is executed for different subsets of indices, and for each such subset, number_in_index_set and indexset would change correspondingly.



I have two questions:



(1)What does it mean for the problematic loop to even be vectorized? The array indices are not consecutive, so how would the compiler even go about vectorizing the loop?



(2)Assuming vectorization is possible, as Intel Advisor seems to suggest, how can the loop in question be safely vectorized? The recommendation from Intel Advisor are thus:




For Example 1, where the loop iteration count is not available before the loop 

executes: If the loop iteration count and iterations lower bound can be calculated

for the whole loop:
Move the calculation outside the loop using an additional variable.
Rewrite the loop to avoid goto statements or other early exits from the loop.
Identify the loop iterations lower bound using a constant.
For example, introduce the new limit variable:
void foo(float *A)
int i;
int OuterCount = 90;
int limit = bar(int(A[0]));
while (OuterCount > 0)
for (i = 1; i < limit; i++)
A[i] = i + 4;

OuterCount--;


For Example 2, where the compiler cannot determine if there is aliasing between all

the pointers used inside the loop and loop boundaries: Assign the loop boundary value

to a local variable. In most cases, this is enough for the compiler to determine

aliasing may not occur.
You can use a directive to accomplish the same thing automatically.
Target ICL/ICC/ICPC Directive
Source loop #pragma simd or #pragma omp simd
Do not use global variables or indirect accesses as loop boundaries unless you also

use one of the following:
Directive to ignore vector dependencies
Target ICL/ICC/ICPC Directive
Source loop #pragma ivdep
restrict
keyword.



Specifically, which of the above recommendations are guaranteed to ensure safe vectorization?







c++ compiler-optimization






share|improve this question













share|improve this question











share|improve this question




share|improve this question










asked Nov 11 at 5:31









Tryer

1,07611120




1,07611120







  • 1




    I general memory operations cannot be vectorized on non-contiguous elements. However you may be able to vectorize execution of the function foo, then store its scattered results. It depends on what your function foo does. Recently CPUs have introduced vectorizing instructions which allow gather-scatter operations, but these are quite slow. Only worth it if foo can be vectorized and it makes a substantial number of operations. If you want good vectorization you have to do it manually using intrinsic. In my experience compilers' auto-vectorization capabilities are still rudimental.
    – Fabio
    Nov 11 at 6:32











  • @Fabio in my case, func is a function that uses square root operator. What would it mean for a function (func in this case) to be vectorized?
    – Tryer
    Nov 11 at 7:18










  • The function takes an integer. What does it do with it? Is it arr[i]=sqrt(arr[i])?
    – Fabio
    Nov 11 at 7:27







  • 1




    No, it is a somewhat more complicated operation. It takes an int i and then performs euclidean distance between point (x_i,y_i) and some other globally available point (X*,Y*)
    – Tryer
    Nov 11 at 7:30







  • 1




    Yes, that is exactly what I have. The Architecture is localhost: Intel x86-64. Pentium. Is that what you are referring to?
    – Tryer
    Nov 11 at 9:21












  • 1




    I general memory operations cannot be vectorized on non-contiguous elements. However you may be able to vectorize execution of the function foo, then store its scattered results. It depends on what your function foo does. Recently CPUs have introduced vectorizing instructions which allow gather-scatter operations, but these are quite slow. Only worth it if foo can be vectorized and it makes a substantial number of operations. If you want good vectorization you have to do it manually using intrinsic. In my experience compilers' auto-vectorization capabilities are still rudimental.
    – Fabio
    Nov 11 at 6:32











  • @Fabio in my case, func is a function that uses square root operator. What would it mean for a function (func in this case) to be vectorized?
    – Tryer
    Nov 11 at 7:18










  • The function takes an integer. What does it do with it? Is it arr[i]=sqrt(arr[i])?
    – Fabio
    Nov 11 at 7:27







  • 1




    No, it is a somewhat more complicated operation. It takes an int i and then performs euclidean distance between point (x_i,y_i) and some other globally available point (X*,Y*)
    – Tryer
    Nov 11 at 7:30







  • 1




    Yes, that is exactly what I have. The Architecture is localhost: Intel x86-64. Pentium. Is that what you are referring to?
    – Tryer
    Nov 11 at 9:21







1




1




I general memory operations cannot be vectorized on non-contiguous elements. However you may be able to vectorize execution of the function foo, then store its scattered results. It depends on what your function foo does. Recently CPUs have introduced vectorizing instructions which allow gather-scatter operations, but these are quite slow. Only worth it if foo can be vectorized and it makes a substantial number of operations. If you want good vectorization you have to do it manually using intrinsic. In my experience compilers' auto-vectorization capabilities are still rudimental.
– Fabio
Nov 11 at 6:32





I general memory operations cannot be vectorized on non-contiguous elements. However you may be able to vectorize execution of the function foo, then store its scattered results. It depends on what your function foo does. Recently CPUs have introduced vectorizing instructions which allow gather-scatter operations, but these are quite slow. Only worth it if foo can be vectorized and it makes a substantial number of operations. If you want good vectorization you have to do it manually using intrinsic. In my experience compilers' auto-vectorization capabilities are still rudimental.
– Fabio
Nov 11 at 6:32













@Fabio in my case, func is a function that uses square root operator. What would it mean for a function (func in this case) to be vectorized?
– Tryer
Nov 11 at 7:18




@Fabio in my case, func is a function that uses square root operator. What would it mean for a function (func in this case) to be vectorized?
– Tryer
Nov 11 at 7:18












The function takes an integer. What does it do with it? Is it arr[i]=sqrt(arr[i])?
– Fabio
Nov 11 at 7:27





The function takes an integer. What does it do with it? Is it arr[i]=sqrt(arr[i])?
– Fabio
Nov 11 at 7:27





1




1




No, it is a somewhat more complicated operation. It takes an int i and then performs euclidean distance between point (x_i,y_i) and some other globally available point (X*,Y*)
– Tryer
Nov 11 at 7:30





No, it is a somewhat more complicated operation. It takes an int i and then performs euclidean distance between point (x_i,y_i) and some other globally available point (X*,Y*)
– Tryer
Nov 11 at 7:30





1




1




Yes, that is exactly what I have. The Architecture is localhost: Intel x86-64. Pentium. Is that what you are referring to?
– Tryer
Nov 11 at 9:21




Yes, that is exactly what I have. The Architecture is localhost: Intel x86-64. Pentium. Is that what you are referring to?
– Tryer
Nov 11 at 9:21












1 Answer
1






active

oldest

votes

















up vote
1
down vote



accepted










Based on the additional comments, assuming the function you want to optimize is the following:



void euclidean(double x0, double y0, const double *x, const double* y,
const size_t *index, size_t index_size, double *result)

for (size_t i = 0; i < index_size; ++i)
double dx = x0 - x[index[i]];
double dy = y0 - y[index[i]];
result[index[i]] = sqrt(dx*dx + dy * dy);




Since your target is Pentium, only SSE2 SIMD instructions are available. You can try the following optimized function (also available here):



void euclidean_sse2(double x0, double y0, const double *x, const double* y,
const size_t *index, size_t index_size, double *result)

__m128d vx0 = _mm_set1_pd(x0);
__m128d vy0 = _mm_set1_pd(y0);
for (size_t i = 0; i + 1 < index_size; i += 2) // process 2 at a time
size_t i0 = index[i];
size_t i1 = index[i+1];
__m128d vx = _mm_set_pd(x[i1], x[i0]); // load 2 scattered elements
__m128d vy = _mm_set_pd(y[i1], y[i0]); // load 2 scattered elements
__m128d vdx = _mm_sub_pd(vx, vx0);
__m128d vdy = _mm_sub_pd(vy, vy0);
__m128d vr = _mm_sqrt_pd(_mm_add_pd(_mm_mul_pd(vdx, vdx), _mm_mul_pd(vdy, vdy)));
_mm_store_sd(result + i0, vr);
_mm_storeh_pd(result + i1, vr);

if (index_size & 1) // if index_size is odd, there is one more element to process
size_t i0 = index[index_size-1];
double dx = x0 - x[i0];
double dy = y0 - y[i0];
result[i0] = sqrt(dx*dx + dy * dy);




Here the load and store operations are not vectorized, i.e. we are loading x and y one by one and we are storing into result one by one. All other operations are vectorized.



With gcc the assembler output of the body of the main loop is below.



 // scalar load operations
mov r10, QWORD PTR [rax]
mov r9, QWORD PTR [rax+8]
add rax, 16
movsd xmm3, QWORD PTR [rdi+r10*8]
movsd xmm2, QWORD PTR [rsi+r10*8]
movhpd xmm3, QWORD PTR [rdi+r9*8]
movhpd xmm2, QWORD PTR [rsi+r9*8]
// vectorial operations
subpd xmm3, xmm4
subpd xmm2, xmm5
mulpd xmm3, xmm3
mulpd xmm2, xmm2
addpd xmm2, xmm3
sqrtpd xmm2, xmm2
// scalar store operations
movlpd QWORD PTR [r8+r10*8], xmm2
movhpd QWORD PTR [r8+r9*8], xmm2


You could also do some loop unrolling (this is probably what a compiler vectorizer would do). But in this case the loop body is quite substantial, and probably memory I/O bound, so I do not think it would help much.



If the meaning of these functions is not clear, you can read this.



A comprehensive discussion on vectorization is here. In brief, modern cpu offer vectorial registers I addition to the regular ones. Depending on the machine architecture, we have 128 bit registers (SSE instructions), 256 bit registers (AVX instructions), 512 bit registers (AVX512 instructions). Vectorizing means using these register at their full capability. For instance if we have a vector of double x0, x1, x2, …. xn and we want to add it to a constant k, obtaining x0+k, x1+k, x2+k, …. xn+k, in scalar mode the operation is decomposed in read x(i), add k, store result. In vectorial mode, with SSE, it would look like read x[i] and x[i+1] simultaneously, add k simultaneously, store 2 results simultaneously.
If the elements are not contiguous in memory, we cannot load x[i] and x[i+1] simultaneously, nor we can store the results simultaneously, but at least we can perform the two additions simultaneously.






share|improve this answer






















  • Thank you for the detailed code answer. I will go through the links. Could you briefly explain (or add it into the answer) what is means for vectorization in the context of nonconsecutive indices?
    – Tryer
    Nov 11 at 11:16










  • Also, x[i+1] and x[i] are not what the function should access. It should be x[index[i+1]] and x[index[i]] and so on.
    – Tryer
    Nov 11 at 11:37










  • Tryer, if you do not post the exact function you want to optimize, I am stabbing in the dark here. I have added some explanation on vectorization.
    – Fabio
    Nov 11 at 12:27










  • I am new to intrinsics, but slowly, I am able to figure out what is going on above. Do you have any suggestion to learn it faster/better? For e.g., to generate this intrinsic-based code, did you use or take help from the assembly instructions on the godbolt site? I notice that it uses the xmm0 register, for instance. Or is it practice, writing your own code, debugging, etc.?
    – Tryer
    Nov 11 at 15:37






  • 1




    I very rarely write assembler directly. Most of the time there is no need, the intrinsic are sufficient. AFAIK there are no tools which help you write this type of code. God bolt is useful, because it shows immediately how your code will be translated in assembler. For simple loops, writing manually it is not needed, the compiler will vectorize automatically. It is an expensive type of optimization, takes time to write. The Intel manual is the best way to learn. Or search the web for "simd intrinsic tutorials". For instance software.intel.com/en-us/articles/how-to-use-intrinsics
    – Fabio
    Nov 11 at 15:43











Your Answer






StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);



);













draft saved

draft discarded


















StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53246116%2fvectorization-on-nonconsecutive-elements-via-index-set%23new-answer', 'question_page');

);

Post as a guest















Required, but never shown

























1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes








up vote
1
down vote



accepted










Based on the additional comments, assuming the function you want to optimize is the following:



void euclidean(double x0, double y0, const double *x, const double* y,
const size_t *index, size_t index_size, double *result)

for (size_t i = 0; i < index_size; ++i)
double dx = x0 - x[index[i]];
double dy = y0 - y[index[i]];
result[index[i]] = sqrt(dx*dx + dy * dy);




Since your target is Pentium, only SSE2 SIMD instructions are available. You can try the following optimized function (also available here):



void euclidean_sse2(double x0, double y0, const double *x, const double* y,
const size_t *index, size_t index_size, double *result)

__m128d vx0 = _mm_set1_pd(x0);
__m128d vy0 = _mm_set1_pd(y0);
for (size_t i = 0; i + 1 < index_size; i += 2) // process 2 at a time
size_t i0 = index[i];
size_t i1 = index[i+1];
__m128d vx = _mm_set_pd(x[i1], x[i0]); // load 2 scattered elements
__m128d vy = _mm_set_pd(y[i1], y[i0]); // load 2 scattered elements
__m128d vdx = _mm_sub_pd(vx, vx0);
__m128d vdy = _mm_sub_pd(vy, vy0);
__m128d vr = _mm_sqrt_pd(_mm_add_pd(_mm_mul_pd(vdx, vdx), _mm_mul_pd(vdy, vdy)));
_mm_store_sd(result + i0, vr);
_mm_storeh_pd(result + i1, vr);

if (index_size & 1) // if index_size is odd, there is one more element to process
size_t i0 = index[index_size-1];
double dx = x0 - x[i0];
double dy = y0 - y[i0];
result[i0] = sqrt(dx*dx + dy * dy);




Here the load and store operations are not vectorized, i.e. we are loading x and y one by one and we are storing into result one by one. All other operations are vectorized.



With gcc the assembler output of the body of the main loop is below.



 // scalar load operations
mov r10, QWORD PTR [rax]
mov r9, QWORD PTR [rax+8]
add rax, 16
movsd xmm3, QWORD PTR [rdi+r10*8]
movsd xmm2, QWORD PTR [rsi+r10*8]
movhpd xmm3, QWORD PTR [rdi+r9*8]
movhpd xmm2, QWORD PTR [rsi+r9*8]
// vectorial operations
subpd xmm3, xmm4
subpd xmm2, xmm5
mulpd xmm3, xmm3
mulpd xmm2, xmm2
addpd xmm2, xmm3
sqrtpd xmm2, xmm2
// scalar store operations
movlpd QWORD PTR [r8+r10*8], xmm2
movhpd QWORD PTR [r8+r9*8], xmm2


You could also do some loop unrolling (this is probably what a compiler vectorizer would do). But in this case the loop body is quite substantial, and probably memory I/O bound, so I do not think it would help much.



If the meaning of these functions is not clear, you can read this.



A comprehensive discussion on vectorization is here. In brief, modern cpu offer vectorial registers I addition to the regular ones. Depending on the machine architecture, we have 128 bit registers (SSE instructions), 256 bit registers (AVX instructions), 512 bit registers (AVX512 instructions). Vectorizing means using these register at their full capability. For instance if we have a vector of double x0, x1, x2, …. xn and we want to add it to a constant k, obtaining x0+k, x1+k, x2+k, …. xn+k, in scalar mode the operation is decomposed in read x(i), add k, store result. In vectorial mode, with SSE, it would look like read x[i] and x[i+1] simultaneously, add k simultaneously, store 2 results simultaneously.
If the elements are not contiguous in memory, we cannot load x[i] and x[i+1] simultaneously, nor we can store the results simultaneously, but at least we can perform the two additions simultaneously.






share|improve this answer






















  • Thank you for the detailed code answer. I will go through the links. Could you briefly explain (or add it into the answer) what is means for vectorization in the context of nonconsecutive indices?
    – Tryer
    Nov 11 at 11:16










  • Also, x[i+1] and x[i] are not what the function should access. It should be x[index[i+1]] and x[index[i]] and so on.
    – Tryer
    Nov 11 at 11:37










  • Tryer, if you do not post the exact function you want to optimize, I am stabbing in the dark here. I have added some explanation on vectorization.
    – Fabio
    Nov 11 at 12:27










  • I am new to intrinsics, but slowly, I am able to figure out what is going on above. Do you have any suggestion to learn it faster/better? For e.g., to generate this intrinsic-based code, did you use or take help from the assembly instructions on the godbolt site? I notice that it uses the xmm0 register, for instance. Or is it practice, writing your own code, debugging, etc.?
    – Tryer
    Nov 11 at 15:37






  • 1




    I very rarely write assembler directly. Most of the time there is no need, the intrinsic are sufficient. AFAIK there are no tools which help you write this type of code. God bolt is useful, because it shows immediately how your code will be translated in assembler. For simple loops, writing manually it is not needed, the compiler will vectorize automatically. It is an expensive type of optimization, takes time to write. The Intel manual is the best way to learn. Or search the web for "simd intrinsic tutorials". For instance software.intel.com/en-us/articles/how-to-use-intrinsics
    – Fabio
    Nov 11 at 15:43















up vote
1
down vote



accepted










Based on the additional comments, assuming the function you want to optimize is the following:



void euclidean(double x0, double y0, const double *x, const double* y,
const size_t *index, size_t index_size, double *result)

for (size_t i = 0; i < index_size; ++i)
double dx = x0 - x[index[i]];
double dy = y0 - y[index[i]];
result[index[i]] = sqrt(dx*dx + dy * dy);




Since your target is Pentium, only SSE2 SIMD instructions are available. You can try the following optimized function (also available here):



void euclidean_sse2(double x0, double y0, const double *x, const double* y,
const size_t *index, size_t index_size, double *result)

__m128d vx0 = _mm_set1_pd(x0);
__m128d vy0 = _mm_set1_pd(y0);
for (size_t i = 0; i + 1 < index_size; i += 2) // process 2 at a time
size_t i0 = index[i];
size_t i1 = index[i+1];
__m128d vx = _mm_set_pd(x[i1], x[i0]); // load 2 scattered elements
__m128d vy = _mm_set_pd(y[i1], y[i0]); // load 2 scattered elements
__m128d vdx = _mm_sub_pd(vx, vx0);
__m128d vdy = _mm_sub_pd(vy, vy0);
__m128d vr = _mm_sqrt_pd(_mm_add_pd(_mm_mul_pd(vdx, vdx), _mm_mul_pd(vdy, vdy)));
_mm_store_sd(result + i0, vr);
_mm_storeh_pd(result + i1, vr);

if (index_size & 1) // if index_size is odd, there is one more element to process
size_t i0 = index[index_size-1];
double dx = x0 - x[i0];
double dy = y0 - y[i0];
result[i0] = sqrt(dx*dx + dy * dy);




Here the load and store operations are not vectorized, i.e. we are loading x and y one by one and we are storing into result one by one. All other operations are vectorized.



With gcc the assembler output of the body of the main loop is below.



 // scalar load operations
mov r10, QWORD PTR [rax]
mov r9, QWORD PTR [rax+8]
add rax, 16
movsd xmm3, QWORD PTR [rdi+r10*8]
movsd xmm2, QWORD PTR [rsi+r10*8]
movhpd xmm3, QWORD PTR [rdi+r9*8]
movhpd xmm2, QWORD PTR [rsi+r9*8]
// vectorial operations
subpd xmm3, xmm4
subpd xmm2, xmm5
mulpd xmm3, xmm3
mulpd xmm2, xmm2
addpd xmm2, xmm3
sqrtpd xmm2, xmm2
// scalar store operations
movlpd QWORD PTR [r8+r10*8], xmm2
movhpd QWORD PTR [r8+r9*8], xmm2


You could also do some loop unrolling (this is probably what a compiler vectorizer would do). But in this case the loop body is quite substantial, and probably memory I/O bound, so I do not think it would help much.



If the meaning of these functions is not clear, you can read this.



A comprehensive discussion on vectorization is here. In brief, modern cpu offer vectorial registers I addition to the regular ones. Depending on the machine architecture, we have 128 bit registers (SSE instructions), 256 bit registers (AVX instructions), 512 bit registers (AVX512 instructions). Vectorizing means using these register at their full capability. For instance if we have a vector of double x0, x1, x2, …. xn and we want to add it to a constant k, obtaining x0+k, x1+k, x2+k, …. xn+k, in scalar mode the operation is decomposed in read x(i), add k, store result. In vectorial mode, with SSE, it would look like read x[i] and x[i+1] simultaneously, add k simultaneously, store 2 results simultaneously.
If the elements are not contiguous in memory, we cannot load x[i] and x[i+1] simultaneously, nor we can store the results simultaneously, but at least we can perform the two additions simultaneously.






share|improve this answer






















  • Thank you for the detailed code answer. I will go through the links. Could you briefly explain (or add it into the answer) what is means for vectorization in the context of nonconsecutive indices?
    – Tryer
    Nov 11 at 11:16










  • Also, x[i+1] and x[i] are not what the function should access. It should be x[index[i+1]] and x[index[i]] and so on.
    – Tryer
    Nov 11 at 11:37










  • Tryer, if you do not post the exact function you want to optimize, I am stabbing in the dark here. I have added some explanation on vectorization.
    – Fabio
    Nov 11 at 12:27










  • I am new to intrinsics, but slowly, I am able to figure out what is going on above. Do you have any suggestion to learn it faster/better? For e.g., to generate this intrinsic-based code, did you use or take help from the assembly instructions on the godbolt site? I notice that it uses the xmm0 register, for instance. Or is it practice, writing your own code, debugging, etc.?
    – Tryer
    Nov 11 at 15:37






  • 1




    I very rarely write assembler directly. Most of the time there is no need, the intrinsic are sufficient. AFAIK there are no tools which help you write this type of code. God bolt is useful, because it shows immediately how your code will be translated in assembler. For simple loops, writing manually it is not needed, the compiler will vectorize automatically. It is an expensive type of optimization, takes time to write. The Intel manual is the best way to learn. Or search the web for "simd intrinsic tutorials". For instance software.intel.com/en-us/articles/how-to-use-intrinsics
    – Fabio
    Nov 11 at 15:43













up vote
1
down vote



accepted







up vote
1
down vote



accepted






Based on the additional comments, assuming the function you want to optimize is the following:



void euclidean(double x0, double y0, const double *x, const double* y,
const size_t *index, size_t index_size, double *result)

for (size_t i = 0; i < index_size; ++i)
double dx = x0 - x[index[i]];
double dy = y0 - y[index[i]];
result[index[i]] = sqrt(dx*dx + dy * dy);




Since your target is Pentium, only SSE2 SIMD instructions are available. You can try the following optimized function (also available here):



void euclidean_sse2(double x0, double y0, const double *x, const double* y,
const size_t *index, size_t index_size, double *result)

__m128d vx0 = _mm_set1_pd(x0);
__m128d vy0 = _mm_set1_pd(y0);
for (size_t i = 0; i + 1 < index_size; i += 2) // process 2 at a time
size_t i0 = index[i];
size_t i1 = index[i+1];
__m128d vx = _mm_set_pd(x[i1], x[i0]); // load 2 scattered elements
__m128d vy = _mm_set_pd(y[i1], y[i0]); // load 2 scattered elements
__m128d vdx = _mm_sub_pd(vx, vx0);
__m128d vdy = _mm_sub_pd(vy, vy0);
__m128d vr = _mm_sqrt_pd(_mm_add_pd(_mm_mul_pd(vdx, vdx), _mm_mul_pd(vdy, vdy)));
_mm_store_sd(result + i0, vr);
_mm_storeh_pd(result + i1, vr);

if (index_size & 1) // if index_size is odd, there is one more element to process
size_t i0 = index[index_size-1];
double dx = x0 - x[i0];
double dy = y0 - y[i0];
result[i0] = sqrt(dx*dx + dy * dy);




Here the load and store operations are not vectorized, i.e. we are loading x and y one by one and we are storing into result one by one. All other operations are vectorized.



With gcc the assembler output of the body of the main loop is below.



 // scalar load operations
mov r10, QWORD PTR [rax]
mov r9, QWORD PTR [rax+8]
add rax, 16
movsd xmm3, QWORD PTR [rdi+r10*8]
movsd xmm2, QWORD PTR [rsi+r10*8]
movhpd xmm3, QWORD PTR [rdi+r9*8]
movhpd xmm2, QWORD PTR [rsi+r9*8]
// vectorial operations
subpd xmm3, xmm4
subpd xmm2, xmm5
mulpd xmm3, xmm3
mulpd xmm2, xmm2
addpd xmm2, xmm3
sqrtpd xmm2, xmm2
// scalar store operations
movlpd QWORD PTR [r8+r10*8], xmm2
movhpd QWORD PTR [r8+r9*8], xmm2


You could also do some loop unrolling (this is probably what a compiler vectorizer would do). But in this case the loop body is quite substantial, and probably memory I/O bound, so I do not think it would help much.



If the meaning of these functions is not clear, you can read this.



A comprehensive discussion on vectorization is here. In brief, modern cpu offer vectorial registers I addition to the regular ones. Depending on the machine architecture, we have 128 bit registers (SSE instructions), 256 bit registers (AVX instructions), 512 bit registers (AVX512 instructions). Vectorizing means using these register at their full capability. For instance if we have a vector of double x0, x1, x2, …. xn and we want to add it to a constant k, obtaining x0+k, x1+k, x2+k, …. xn+k, in scalar mode the operation is decomposed in read x(i), add k, store result. In vectorial mode, with SSE, it would look like read x[i] and x[i+1] simultaneously, add k simultaneously, store 2 results simultaneously.
If the elements are not contiguous in memory, we cannot load x[i] and x[i+1] simultaneously, nor we can store the results simultaneously, but at least we can perform the two additions simultaneously.






share|improve this answer














Based on the additional comments, assuming the function you want to optimize is the following:



void euclidean(double x0, double y0, const double *x, const double* y,
const size_t *index, size_t index_size, double *result)

for (size_t i = 0; i < index_size; ++i)
double dx = x0 - x[index[i]];
double dy = y0 - y[index[i]];
result[index[i]] = sqrt(dx*dx + dy * dy);




Since your target is Pentium, only SSE2 SIMD instructions are available. You can try the following optimized function (also available here):



void euclidean_sse2(double x0, double y0, const double *x, const double* y,
const size_t *index, size_t index_size, double *result)

__m128d vx0 = _mm_set1_pd(x0);
__m128d vy0 = _mm_set1_pd(y0);
for (size_t i = 0; i + 1 < index_size; i += 2) // process 2 at a time
size_t i0 = index[i];
size_t i1 = index[i+1];
__m128d vx = _mm_set_pd(x[i1], x[i0]); // load 2 scattered elements
__m128d vy = _mm_set_pd(y[i1], y[i0]); // load 2 scattered elements
__m128d vdx = _mm_sub_pd(vx, vx0);
__m128d vdy = _mm_sub_pd(vy, vy0);
__m128d vr = _mm_sqrt_pd(_mm_add_pd(_mm_mul_pd(vdx, vdx), _mm_mul_pd(vdy, vdy)));
_mm_store_sd(result + i0, vr);
_mm_storeh_pd(result + i1, vr);

if (index_size & 1) // if index_size is odd, there is one more element to process
size_t i0 = index[index_size-1];
double dx = x0 - x[i0];
double dy = y0 - y[i0];
result[i0] = sqrt(dx*dx + dy * dy);




Here the load and store operations are not vectorized, i.e. we are loading x and y one by one and we are storing into result one by one. All other operations are vectorized.



With gcc the assembler output of the body of the main loop is below.



 // scalar load operations
mov r10, QWORD PTR [rax]
mov r9, QWORD PTR [rax+8]
add rax, 16
movsd xmm3, QWORD PTR [rdi+r10*8]
movsd xmm2, QWORD PTR [rsi+r10*8]
movhpd xmm3, QWORD PTR [rdi+r9*8]
movhpd xmm2, QWORD PTR [rsi+r9*8]
// vectorial operations
subpd xmm3, xmm4
subpd xmm2, xmm5
mulpd xmm3, xmm3
mulpd xmm2, xmm2
addpd xmm2, xmm3
sqrtpd xmm2, xmm2
// scalar store operations
movlpd QWORD PTR [r8+r10*8], xmm2
movhpd QWORD PTR [r8+r9*8], xmm2


You could also do some loop unrolling (this is probably what a compiler vectorizer would do). But in this case the loop body is quite substantial, and probably memory I/O bound, so I do not think it would help much.



If the meaning of these functions is not clear, you can read this.



A comprehensive discussion on vectorization is here. In brief, modern cpu offer vectorial registers I addition to the regular ones. Depending on the machine architecture, we have 128 bit registers (SSE instructions), 256 bit registers (AVX instructions), 512 bit registers (AVX512 instructions). Vectorizing means using these register at their full capability. For instance if we have a vector of double x0, x1, x2, …. xn and we want to add it to a constant k, obtaining x0+k, x1+k, x2+k, …. xn+k, in scalar mode the operation is decomposed in read x(i), add k, store result. In vectorial mode, with SSE, it would look like read x[i] and x[i+1] simultaneously, add k simultaneously, store 2 results simultaneously.
If the elements are not contiguous in memory, we cannot load x[i] and x[i+1] simultaneously, nor we can store the results simultaneously, but at least we can perform the two additions simultaneously.







share|improve this answer














share|improve this answer



share|improve this answer








edited Nov 12 at 10:34

























answered Nov 11 at 9:34









Fabio

823318




823318











  • Thank you for the detailed code answer. I will go through the links. Could you briefly explain (or add it into the answer) what is means for vectorization in the context of nonconsecutive indices?
    – Tryer
    Nov 11 at 11:16










  • Also, x[i+1] and x[i] are not what the function should access. It should be x[index[i+1]] and x[index[i]] and so on.
    – Tryer
    Nov 11 at 11:37










  • Tryer, if you do not post the exact function you want to optimize, I am stabbing in the dark here. I have added some explanation on vectorization.
    – Fabio
    Nov 11 at 12:27










  • I am new to intrinsics, but slowly, I am able to figure out what is going on above. Do you have any suggestion to learn it faster/better? For e.g., to generate this intrinsic-based code, did you use or take help from the assembly instructions on the godbolt site? I notice that it uses the xmm0 register, for instance. Or is it practice, writing your own code, debugging, etc.?
    – Tryer
    Nov 11 at 15:37






  • 1




    I very rarely write assembler directly. Most of the time there is no need, the intrinsic are sufficient. AFAIK there are no tools which help you write this type of code. God bolt is useful, because it shows immediately how your code will be translated in assembler. For simple loops, writing manually it is not needed, the compiler will vectorize automatically. It is an expensive type of optimization, takes time to write. The Intel manual is the best way to learn. Or search the web for "simd intrinsic tutorials". For instance software.intel.com/en-us/articles/how-to-use-intrinsics
    – Fabio
    Nov 11 at 15:43

















  • Thank you for the detailed code answer. I will go through the links. Could you briefly explain (or add it into the answer) what is means for vectorization in the context of nonconsecutive indices?
    – Tryer
    Nov 11 at 11:16










  • Also, x[i+1] and x[i] are not what the function should access. It should be x[index[i+1]] and x[index[i]] and so on.
    – Tryer
    Nov 11 at 11:37










  • Tryer, if you do not post the exact function you want to optimize, I am stabbing in the dark here. I have added some explanation on vectorization.
    – Fabio
    Nov 11 at 12:27










  • I am new to intrinsics, but slowly, I am able to figure out what is going on above. Do you have any suggestion to learn it faster/better? For e.g., to generate this intrinsic-based code, did you use or take help from the assembly instructions on the godbolt site? I notice that it uses the xmm0 register, for instance. Or is it practice, writing your own code, debugging, etc.?
    – Tryer
    Nov 11 at 15:37






  • 1




    I very rarely write assembler directly. Most of the time there is no need, the intrinsic are sufficient. AFAIK there are no tools which help you write this type of code. God bolt is useful, because it shows immediately how your code will be translated in assembler. For simple loops, writing manually it is not needed, the compiler will vectorize automatically. It is an expensive type of optimization, takes time to write. The Intel manual is the best way to learn. Or search the web for "simd intrinsic tutorials". For instance software.intel.com/en-us/articles/how-to-use-intrinsics
    – Fabio
    Nov 11 at 15:43
















Thank you for the detailed code answer. I will go through the links. Could you briefly explain (or add it into the answer) what is means for vectorization in the context of nonconsecutive indices?
– Tryer
Nov 11 at 11:16




Thank you for the detailed code answer. I will go through the links. Could you briefly explain (or add it into the answer) what is means for vectorization in the context of nonconsecutive indices?
– Tryer
Nov 11 at 11:16












Also, x[i+1] and x[i] are not what the function should access. It should be x[index[i+1]] and x[index[i]] and so on.
– Tryer
Nov 11 at 11:37




Also, x[i+1] and x[i] are not what the function should access. It should be x[index[i+1]] and x[index[i]] and so on.
– Tryer
Nov 11 at 11:37












Tryer, if you do not post the exact function you want to optimize, I am stabbing in the dark here. I have added some explanation on vectorization.
– Fabio
Nov 11 at 12:27




Tryer, if you do not post the exact function you want to optimize, I am stabbing in the dark here. I have added some explanation on vectorization.
– Fabio
Nov 11 at 12:27












I am new to intrinsics, but slowly, I am able to figure out what is going on above. Do you have any suggestion to learn it faster/better? For e.g., to generate this intrinsic-based code, did you use or take help from the assembly instructions on the godbolt site? I notice that it uses the xmm0 register, for instance. Or is it practice, writing your own code, debugging, etc.?
– Tryer
Nov 11 at 15:37




I am new to intrinsics, but slowly, I am able to figure out what is going on above. Do you have any suggestion to learn it faster/better? For e.g., to generate this intrinsic-based code, did you use or take help from the assembly instructions on the godbolt site? I notice that it uses the xmm0 register, for instance. Or is it practice, writing your own code, debugging, etc.?
– Tryer
Nov 11 at 15:37




1




1




I very rarely write assembler directly. Most of the time there is no need, the intrinsic are sufficient. AFAIK there are no tools which help you write this type of code. God bolt is useful, because it shows immediately how your code will be translated in assembler. For simple loops, writing manually it is not needed, the compiler will vectorize automatically. It is an expensive type of optimization, takes time to write. The Intel manual is the best way to learn. Or search the web for "simd intrinsic tutorials". For instance software.intel.com/en-us/articles/how-to-use-intrinsics
– Fabio
Nov 11 at 15:43





I very rarely write assembler directly. Most of the time there is no need, the intrinsic are sufficient. AFAIK there are no tools which help you write this type of code. God bolt is useful, because it shows immediately how your code will be translated in assembler. For simple loops, writing manually it is not needed, the compiler will vectorize automatically. It is an expensive type of optimization, takes time to write. The Intel manual is the best way to learn. Or search the web for "simd intrinsic tutorials". For instance software.intel.com/en-us/articles/how-to-use-intrinsics
– Fabio
Nov 11 at 15:43


















draft saved

draft discarded
















































Thanks for contributing an answer to Stack Overflow!


  • Please be sure to answer the question. Provide details and share your research!

But avoid


  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.





Some of your past answers have not been well-received, and you're in danger of being blocked from answering.


Please pay close attention to the following guidance:


  • Please be sure to answer the question. Provide details and share your research!

But avoid


  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53246116%2fvectorization-on-nonconsecutive-elements-via-index-set%23new-answer', 'question_page');

);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







Popular posts from this blog

Top Tejano songwriter Luis Silva dead of heart attack at 64

ReactJS Fetched API data displays live - need Data displayed static

政党