Vectorization on nonconsecutive elements via index set

up vote
4
down vote

favorite

The standard template for vectorization seems to be thus:

#define N 100
double arr[N];
double func(int i);

for(int i = 0; i <N; i++)
 arr[i] = func(i);

where all of the indices are consecutively accessed.

However, I have a situation where not all Nelements of arr need updation. The template I have is as follows:

#define N 100
double arr[N];
double func(int i);

int indexset[N];//this indexset has the indices of arr that get updated
int number_in_index_set;
//E.g., if I only need to update arr[4] and arr[10], number_in_index_set = 2
//indexset[0] = 4 and indexset[1] = 10

for(int i = 0; i <number_in_index_set; i++)
 arr[indexset[i]] = func(indexset[i]);

In this case, Intel Advisor reports that this loop was not vectorized because loop iteration count cannot be computed before executing the loop. In my application, this loop is executed for different subsets of indices, and for each such subset, number_in_index_set and indexset would change correspondingly.

I have two questions:

(1)What does it mean for the problematic loop to even be vectorized? The array indices are not consecutive, so how would the compiler even go about vectorizing the loop?

(2)Assuming vectorization is possible, as Intel Advisor seems to suggest, how can the loop in question be safely vectorized? The recommendation from Intel Advisor are thus:

For Example 1, where the loop iteration count is not available before the loop 

executes: If the loop iteration count and iterations lower bound can be calculated 

for the whole loop:
 Move the calculation outside the loop using an additional variable.
 Rewrite the loop to avoid goto statements or other early exits from the loop.
 Identify the loop iterations lower bound using a constant.
For example, introduce the new limit variable:
void foo(float *A) 
 int i;
 int OuterCount = 90;
 int limit = bar(int(A[0]));
 while (OuterCount > 0) 
 for (i = 1; i < limit; i++) 
 A[i] = i + 4;
 
 OuterCount--;
 

For Example 2, where the compiler cannot determine if there is aliasing between all 

the pointers used inside the loop and loop boundaries: Assign the loop boundary value 

to a local variable. In most cases, this is enough for the compiler to determine 

aliasing may not occur.
You can use a directive to accomplish the same thing automatically.
Target ICL/ICC/ICPC Directive
Source loop #pragma simd or #pragma omp simd
Do not use global variables or indirect accesses as loop boundaries unless you also 

use one of the following:
 Directive to ignore vector dependencies
 Target ICL/ICC/ICPC Directive
 Source loop #pragma ivdep
 restrict
 keyword.

Specifically, which of the above recommendations are guaranteed to ensure safe vectorization?

asked Nov 11 at 5:31

Tryer

1,07611120

1

I general memory operations cannot be vectorized on non-contiguous elements. However you may be able to vectorize execution of the function foo, then store its scattered results. It depends on what your function foo does. Recently CPUs have introduced vectorizing instructions which allow gather-scatter operations, but these are quite slow. Only worth it if foo can be vectorized and it makes a substantial number of operations. If you want good vectorization you have to do it manually using intrinsic. In my experience compilers' auto-vectorization capabilities are still rudimental.
– Fabio
Nov 11 at 6:32

@Fabio in my case, func is a function that uses square root operator. What would it mean for a function (func in this case) to be vectorized?
– Tryer
Nov 11 at 7:18

The function takes an integer. What does it do with it? Is it arr[i]=sqrt(arr[i])?
– Fabio
Nov 11 at 7:27

1

No, it is a somewhat more complicated operation. It takes an int i and then performs euclidean distance between point (x_i,y_i) and some other globally available point (X*,Y*)
– Tryer
Nov 11 at 7:30

1

Yes, that is exactly what I have. The Architecture is localhost: Intel x86-64. Pentium. Is that what you are referring to?
– Tryer
Nov 11 at 9:21

|
show 3 more comments

up vote
4
down vote

favorite

The standard template for vectorization seems to be thus:

#define N 100
double arr[N];
double func(int i);

for(int i = 0; i <N; i++)
 arr[i] = func(i);

where all of the indices are consecutively accessed.

However, I have a situation where not all Nelements of arr need updation. The template I have is as follows:

#define N 100
double arr[N];
double func(int i);

int indexset[N];//this indexset has the indices of arr that get updated
int number_in_index_set;
//E.g., if I only need to update arr[4] and arr[10], number_in_index_set = 2
//indexset[0] = 4 and indexset[1] = 10

for(int i = 0; i <number_in_index_set; i++)
 arr[indexset[i]] = func(indexset[i]);

I have two questions:

(1)What does it mean for the problematic loop to even be vectorized? The array indices are not consecutive, so how would the compiler even go about vectorizing the loop?

(2)Assuming vectorization is possible, as Intel Advisor seems to suggest, how can the loop in question be safely vectorized? The recommendation from Intel Advisor are thus:

For Example 1, where the loop iteration count is not available before the loop 

executes: If the loop iteration count and iterations lower bound can be calculated 

for the whole loop:
 Move the calculation outside the loop using an additional variable.
 Rewrite the loop to avoid goto statements or other early exits from the loop.
 Identify the loop iterations lower bound using a constant.
For example, introduce the new limit variable:
void foo(float *A) 
 int i;
 int OuterCount = 90;
 int limit = bar(int(A[0]));
 while (OuterCount > 0) 
 for (i = 1; i < limit; i++) 
 A[i] = i + 4;
 
 OuterCount--;
 

For Example 2, where the compiler cannot determine if there is aliasing between all 

the pointers used inside the loop and loop boundaries: Assign the loop boundary value 

to a local variable. In most cases, this is enough for the compiler to determine 

aliasing may not occur.
You can use a directive to accomplish the same thing automatically.
Target ICL/ICC/ICPC Directive
Source loop #pragma simd or #pragma omp simd
Do not use global variables or indirect accesses as loop boundaries unless you also 

use one of the following:
 Directive to ignore vector dependencies
 Target ICL/ICC/ICPC Directive
 Source loop #pragma ivdep
 restrict
 keyword.

Specifically, which of the above recommendations are guaranteed to ensure safe vectorization?

asked Nov 11 at 5:31

Tryer

1,07611120

1

I general memory operations cannot be vectorized on non-contiguous elements. However you may be able to vectorize execution of the function foo, then store its scattered results. It depends on what your function foo does. Recently CPUs have introduced vectorizing instructions which allow gather-scatter operations, but these are quite slow. Only worth it if foo can be vectorized and it makes a substantial number of operations. If you want good vectorization you have to do it manually using intrinsic. In my experience compilers' auto-vectorization capabilities are still rudimental.
– Fabio
Nov 11 at 6:32

@Fabio in my case, func is a function that uses square root operator. What would it mean for a function (func in this case) to be vectorized?
– Tryer
Nov 11 at 7:18

The function takes an integer. What does it do with it? Is it arr[i]=sqrt(arr[i])?
– Fabio
Nov 11 at 7:27

1

No, it is a somewhat more complicated operation. It takes an int i and then performs euclidean distance between point (x_i,y_i) and some other globally available point (X*,Y*)
– Tryer
Nov 11 at 7:30

1

Yes, that is exactly what I have. The Architecture is localhost: Intel x86-64. Pentium. Is that what you are referring to?
– Tryer
Nov 11 at 9:21

|
show 3 more comments

up vote
4
down vote

favorite

The standard template for vectorization seems to be thus:

#define N 100
double arr[N];
double func(int i);

for(int i = 0; i <N; i++)
 arr[i] = func(i);

where all of the indices are consecutively accessed.

However, I have a situation where not all Nelements of arr need updation. The template I have is as follows:

#define N 100
double arr[N];
double func(int i);

int indexset[N];//this indexset has the indices of arr that get updated
int number_in_index_set;
//E.g., if I only need to update arr[4] and arr[10], number_in_index_set = 2
//indexset[0] = 4 and indexset[1] = 10

for(int i = 0; i <number_in_index_set; i++)
 arr[indexset[i]] = func(indexset[i]);

I have two questions:

(1)What does it mean for the problematic loop to even be vectorized? The array indices are not consecutive, so how would the compiler even go about vectorizing the loop?

(2)Assuming vectorization is possible, as Intel Advisor seems to suggest, how can the loop in question be safely vectorized? The recommendation from Intel Advisor are thus:

For Example 1, where the loop iteration count is not available before the loop 

executes: If the loop iteration count and iterations lower bound can be calculated 

for the whole loop:
 Move the calculation outside the loop using an additional variable.
 Rewrite the loop to avoid goto statements or other early exits from the loop.
 Identify the loop iterations lower bound using a constant.
For example, introduce the new limit variable:
void foo(float *A) 
 int i;
 int OuterCount = 90;
 int limit = bar(int(A[0]));
 while (OuterCount > 0) 
 for (i = 1; i < limit; i++) 
 A[i] = i + 4;
 
 OuterCount--;
 

For Example 2, where the compiler cannot determine if there is aliasing between all 

the pointers used inside the loop and loop boundaries: Assign the loop boundary value 

to a local variable. In most cases, this is enough for the compiler to determine 

aliasing may not occur.
You can use a directive to accomplish the same thing automatically.
Target ICL/ICC/ICPC Directive
Source loop #pragma simd or #pragma omp simd
Do not use global variables or indirect accesses as loop boundaries unless you also 

use one of the following:
 Directive to ignore vector dependencies
 Target ICL/ICC/ICPC Directive
 Source loop #pragma ivdep
 restrict
 keyword.

Specifically, which of the above recommendations are guaranteed to ensure safe vectorization?

asked Nov 11 at 5:31

Tryer

1,07611120

The standard template for vectorization seems to be thus:

#define N 100
double arr[N];
double func(int i);

for(int i = 0; i <N; i++)
 arr[i] = func(i);

where all of the indices are consecutively accessed.

However, I have a situation where not all Nelements of arr need updation. The template I have is as follows:

#define N 100
double arr[N];
double func(int i);

int indexset[N];//this indexset has the indices of arr that get updated
int number_in_index_set;
//E.g., if I only need to update arr[4] and arr[10], number_in_index_set = 2
//indexset[0] = 4 and indexset[1] = 10

for(int i = 0; i <number_in_index_set; i++)
 arr[indexset[i]] = func(indexset[i]);

I have two questions:

(1)What does it mean for the problematic loop to even be vectorized? The array indices are not consecutive, so how would the compiler even go about vectorizing the loop?

(2)Assuming vectorization is possible, as Intel Advisor seems to suggest, how can the loop in question be safely vectorized? The recommendation from Intel Advisor are thus:

For Example 1, where the loop iteration count is not available before the loop 

executes: If the loop iteration count and iterations lower bound can be calculated 

for the whole loop:
 Move the calculation outside the loop using an additional variable.
 Rewrite the loop to avoid goto statements or other early exits from the loop.
 Identify the loop iterations lower bound using a constant.
For example, introduce the new limit variable:
void foo(float *A) 
 int i;
 int OuterCount = 90;
 int limit = bar(int(A[0]));
 while (OuterCount > 0) 
 for (i = 1; i < limit; i++) 
 A[i] = i + 4;
 
 OuterCount--;
 

For Example 2, where the compiler cannot determine if there is aliasing between all 

the pointers used inside the loop and loop boundaries: Assign the loop boundary value 

to a local variable. In most cases, this is enough for the compiler to determine 

aliasing may not occur.
You can use a directive to accomplish the same thing automatically.
Target ICL/ICC/ICPC Directive
Source loop #pragma simd or #pragma omp simd
Do not use global variables or indirect accesses as loop boundaries unless you also 

use one of the following:
 Directive to ignore vector dependencies
 Target ICL/ICC/ICPC Directive
 Source loop #pragma ivdep
 restrict
 keyword.

Specifically, which of the above recommendations are guaranteed to ensure safe vectorization?

c++ compiler-optimization

asked Nov 11 at 5:31

Tryer

1,07611120

asked Nov 11 at 5:31

Tryer

1,07611120

asked Nov 11 at 5:31

Tryer

1,07611120

asked Nov 11 at 5:31

Tryer

1,07611120

asked Nov 11 at 5:31

Tryer

1,07611120

1

I general memory operations cannot be vectorized on non-contiguous elements. However you may be able to vectorize execution of the function foo, then store its scattered results. It depends on what your function foo does. Recently CPUs have introduced vectorizing instructions which allow gather-scatter operations, but these are quite slow. Only worth it if foo can be vectorized and it makes a substantial number of operations. If you want good vectorization you have to do it manually using intrinsic. In my experience compilers' auto-vectorization capabilities are still rudimental.
– Fabio
Nov 11 at 6:32

@Fabio in my case, func is a function that uses square root operator. What would it mean for a function (func in this case) to be vectorized?
– Tryer
Nov 11 at 7:18

The function takes an integer. What does it do with it? Is it arr[i]=sqrt(arr[i])?
– Fabio
Nov 11 at 7:27

1

No, it is a somewhat more complicated operation. It takes an int i and then performs euclidean distance between point (x_i,y_i) and some other globally available point (X*,Y*)
– Tryer
Nov 11 at 7:30

1

Yes, that is exactly what I have. The Architecture is localhost: Intel x86-64. Pentium. Is that what you are referring to?
– Tryer
Nov 11 at 9:21

|
show 3 more comments

1

I general memory operations cannot be vectorized on non-contiguous elements. However you may be able to vectorize execution of the function foo, then store its scattered results. It depends on what your function foo does. Recently CPUs have introduced vectorizing instructions which allow gather-scatter operations, but these are quite slow. Only worth it if foo can be vectorized and it makes a substantial number of operations. If you want good vectorization you have to do it manually using intrinsic. In my experience compilers' auto-vectorization capabilities are still rudimental.
– Fabio
Nov 11 at 6:32

@Fabio in my case, func is a function that uses square root operator. What would it mean for a function (func in this case) to be vectorized?
– Tryer
Nov 11 at 7:18

The function takes an integer. What does it do with it? Is it arr[i]=sqrt(arr[i])?
– Fabio
Nov 11 at 7:27

1

No, it is a somewhat more complicated operation. It takes an int i and then performs euclidean distance between point (x_i,y_i) and some other globally available point (X*,Y*)
– Tryer
Nov 11 at 7:30

1

Yes, that is exactly what I have. The Architecture is localhost: Intel x86-64. Pentium. Is that what you are referring to?
– Tryer
Nov 11 at 9:21

I general memory operations cannot be vectorized on non-contiguous elements. However you may be able to vectorize execution of the function foo, then store its scattered results. It depends on what your function foo does. Recently CPUs have introduced vectorizing instructions which allow gather-scatter operations, but these are quite slow. Only worth it if foo can be vectorized and it makes a substantial number of operations. If you want good vectorization you have to do it manually using intrinsic. In my experience compilers' auto-vectorization capabilities are still rudimental.
– Fabio
Nov 11 at 6:32

@Fabio in my case, func is a function that uses square root operator. What would it mean for a function (func in this case) to be vectorized?
– Tryer
Nov 11 at 7:18

The function takes an integer. What does it do with it? Is it arr[i]=sqrt(arr[i])?
– Fabio
Nov 11 at 7:27

No, it is a somewhat more complicated operation. It takes an int i and then performs euclidean distance between point (x_i,y_i) and some other globally available point (X*,Y*)
– Tryer
Nov 11 at 7:30

Yes, that is exactly what I have. The Architecture is localhost: Intel x86-64. Pentium. Is that what you are referring to?
– Tryer
Nov 11 at 9:21

|
show 3 more comments

1 Answer
1

active

oldest

votes

up vote
1
down vote

accepted

Based on the additional comments, assuming the function you want to optimize is the following:

void euclidean(double x0, double y0, const double *x, const double* y,
 const size_t *index, size_t index_size, double *result)

 for (size_t i = 0; i < index_size; ++i) 
 double dx = x0 - x[index[i]];
 double dy = y0 - y[index[i]];
 result[index[i]] = sqrt(dx*dx + dy * dy);

Since your target is Pentium, only SSE2 SIMD instructions are available. You can try the following optimized function (also available here):

void euclidean_sse2(double x0, double y0, const double *x, const double* y,
 const size_t *index, size_t index_size, double *result)

 __m128d vx0 = _mm_set1_pd(x0);
 __m128d vy0 = _mm_set1_pd(y0);
 for (size_t i = 0; i + 1 < index_size; i += 2) // process 2 at a time
 size_t i0 = index[i];
 size_t i1 = index[i+1];
 __m128d vx = _mm_set_pd(x[i1], x[i0]); // load 2 scattered elements
 __m128d vy = _mm_set_pd(y[i1], y[i0]); // load 2 scattered elements
 __m128d vdx = _mm_sub_pd(vx, vx0);
 __m128d vdy = _mm_sub_pd(vy, vy0);
 __m128d vr = _mm_sqrt_pd(_mm_add_pd(_mm_mul_pd(vdx, vdx), _mm_mul_pd(vdy, vdy)));
 _mm_store_sd(result + i0, vr);
 _mm_storeh_pd(result + i1, vr);
 
 if (index_size & 1) // if index_size is odd, there is one more element to process
 size_t i0 = index[index_size-1];
 double dx = x0 - x[i0];
 double dy = y0 - y[i0];
 result[i0] = sqrt(dx*dx + dy * dy);

Here the load and store operations are not vectorized, i.e. we are loading x and y one by one and we are storing into result one by one. All other operations are vectorized.

With gcc the assembler output of the body of the main loop is below.

 // scalar load operations
 mov r10, QWORD PTR [rax]
 mov r9, QWORD PTR [rax+8]
 add rax, 16
 movsd xmm3, QWORD PTR [rdi+r10*8]
 movsd xmm2, QWORD PTR [rsi+r10*8]
 movhpd xmm3, QWORD PTR [rdi+r9*8]
 movhpd xmm2, QWORD PTR [rsi+r9*8]
 // vectorial operations
 subpd xmm3, xmm4
 subpd xmm2, xmm5
 mulpd xmm3, xmm3
 mulpd xmm2, xmm2
 addpd xmm2, xmm3
 sqrtpd xmm2, xmm2
 // scalar store operations
 movlpd QWORD PTR [r8+r10*8], xmm2
 movhpd QWORD PTR [r8+r9*8], xmm2

You could also do some loop unrolling (this is probably what a compiler vectorizer would do). But in this case the loop body is quite substantial, and probably memory I/O bound, so I do not think it would help much.

If the meaning of these functions is not clear, you can read this.

A comprehensive discussion on vectorization is here. In brief, modern cpu offer vectorial registers I addition to the regular ones. Depending on the machine architecture, we have 128 bit registers (SSE instructions), 256 bit registers (AVX instructions), 512 bit registers (AVX512 instructions). Vectorizing means using these register at their full capability. For instance if we have a vector of double x0, x1, x2, …. xn and we want to add it to a constant k, obtaining x0+k, x1+k, x2+k, …. xn+k, in scalar mode the operation is decomposed in read x(i), add k, store result. In vectorial mode, with SSE, it would look like read x[i] and x[i+1] simultaneously, add k simultaneously, store 2 results simultaneously.
If the elements are not contiguous in memory, we cannot load x[i] and x[i+1] simultaneously, nor we can store the results simultaneously, but at least we can perform the two additions simultaneously.

edited Nov 12 at 10:34

answered Nov 11 at 9:34

Fabio

823318

Thank you for the detailed code answer. I will go through the links. Could you briefly explain (or add it into the answer) what is means for vectorization in the context of nonconsecutive indices?
– Tryer
Nov 11 at 11:16

Also, x[i+1] and x[i] are not what the function should access. It should be x[index[i+1]] and x[index[i]] and so on.
– Tryer
Nov 11 at 11:37

Tryer, if you do not post the exact function you want to optimize, I am stabbing in the dark here. I have added some explanation on vectorization.
– Fabio
Nov 11 at 12:27

I am new to intrinsics, but slowly, I am able to figure out what is going on above. Do you have any suggestion to learn it faster/better? For e.g., to generate this intrinsic-based code, did you use or take help from the assembly instructions on the godbolt site? I notice that it uses the xmm0 register, for instance. Or is it practice, writing your own code, debugging, etc.?
– Tryer
Nov 11 at 15:37

1

I very rarely write assembler directly. Most of the time there is no need, the intrinsic are sufficient. AFAIK there are no tools which help you write this type of code. God bolt is useful, because it shows immediately how your code will be translated in assembler. For simple loops, writing manually it is not needed, the compiler will vectorize automatically. It is an expensive type of optimization, takes time to write. The Intel manual is the best way to learn. Or search the web for "simd intrinsic tutorials". For instance software.intel.com/en-us/articles/how-to-use-intrinsics
– Fabio
Nov 11 at 15:43

|
show 1 more comment

Your Answer

StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);

);

draft saved

draft discarded

StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53246116%2fvectorization-on-nonconsecutive-elements-via-index-set%23new-answer', 'question_page');

);

Post as a guest

Name

Required, but never shown

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

up vote
1
down vote

accepted

Based on the additional comments, assuming the function you want to optimize is the following:

void euclidean(double x0, double y0, const double *x, const double* y,
 const size_t *index, size_t index_size, double *result)

 for (size_t i = 0; i < index_size; ++i) 
 double dx = x0 - x[index[i]];
 double dy = y0 - y[index[i]];
 result[index[i]] = sqrt(dx*dx + dy * dy);

Since your target is Pentium, only SSE2 SIMD instructions are available. You can try the following optimized function (also available here):

void euclidean_sse2(double x0, double y0, const double *x, const double* y,
 const size_t *index, size_t index_size, double *result)

 __m128d vx0 = _mm_set1_pd(x0);
 __m128d vy0 = _mm_set1_pd(y0);
 for (size_t i = 0; i + 1 < index_size; i += 2) // process 2 at a time
 size_t i0 = index[i];
 size_t i1 = index[i+1];
 __m128d vx = _mm_set_pd(x[i1], x[i0]); // load 2 scattered elements
 __m128d vy = _mm_set_pd(y[i1], y[i0]); // load 2 scattered elements
 __m128d vdx = _mm_sub_pd(vx, vx0);
 __m128d vdy = _mm_sub_pd(vy, vy0);
 __m128d vr = _mm_sqrt_pd(_mm_add_pd(_mm_mul_pd(vdx, vdx), _mm_mul_pd(vdy, vdy)));
 _mm_store_sd(result + i0, vr);
 _mm_storeh_pd(result + i1, vr);
 
 if (index_size & 1) // if index_size is odd, there is one more element to process
 size_t i0 = index[index_size-1];
 double dx = x0 - x[i0];
 double dy = y0 - y[i0];
 result[i0] = sqrt(dx*dx + dy * dy);

Here the load and store operations are not vectorized, i.e. we are loading x and y one by one and we are storing into result one by one. All other operations are vectorized.

With gcc the assembler output of the body of the main loop is below.

 // scalar load operations
 mov r10, QWORD PTR [rax]
 mov r9, QWORD PTR [rax+8]
 add rax, 16
 movsd xmm3, QWORD PTR [rdi+r10*8]
 movsd xmm2, QWORD PTR [rsi+r10*8]
 movhpd xmm3, QWORD PTR [rdi+r9*8]
 movhpd xmm2, QWORD PTR [rsi+r9*8]
 // vectorial operations
 subpd xmm3, xmm4
 subpd xmm2, xmm5
 mulpd xmm3, xmm3
 mulpd xmm2, xmm2
 addpd xmm2, xmm3
 sqrtpd xmm2, xmm2
 // scalar store operations
 movlpd QWORD PTR [r8+r10*8], xmm2
 movhpd QWORD PTR [r8+r9*8], xmm2

If the meaning of these functions is not clear, you can read this.

edited Nov 12 at 10:34

answered Nov 11 at 9:34

Fabio

823318

Thank you for the detailed code answer. I will go through the links. Could you briefly explain (or add it into the answer) what is means for vectorization in the context of nonconsecutive indices?
– Tryer
Nov 11 at 11:16

Also, x[i+1] and x[i] are not what the function should access. It should be x[index[i+1]] and x[index[i]] and so on.
– Tryer
Nov 11 at 11:37

Tryer, if you do not post the exact function you want to optimize, I am stabbing in the dark here. I have added some explanation on vectorization.
– Fabio
Nov 11 at 12:27

I am new to intrinsics, but slowly, I am able to figure out what is going on above. Do you have any suggestion to learn it faster/better? For e.g., to generate this intrinsic-based code, did you use or take help from the assembly instructions on the godbolt site? I notice that it uses the xmm0 register, for instance. Or is it practice, writing your own code, debugging, etc.?
– Tryer
Nov 11 at 15:37

1

I very rarely write assembler directly. Most of the time there is no need, the intrinsic are sufficient. AFAIK there are no tools which help you write this type of code. God bolt is useful, because it shows immediately how your code will be translated in assembler. For simple loops, writing manually it is not needed, the compiler will vectorize automatically. It is an expensive type of optimization, takes time to write. The Intel manual is the best way to learn. Or search the web for "simd intrinsic tutorials". For instance software.intel.com/en-us/articles/how-to-use-intrinsics
– Fabio
Nov 11 at 15:43

|
show 1 more comment

up vote
1
down vote

accepted

Based on the additional comments, assuming the function you want to optimize is the following:

void euclidean(double x0, double y0, const double *x, const double* y,
 const size_t *index, size_t index_size, double *result)

 for (size_t i = 0; i < index_size; ++i) 
 double dx = x0 - x[index[i]];
 double dy = y0 - y[index[i]];
 result[index[i]] = sqrt(dx*dx + dy * dy);

Since your target is Pentium, only SSE2 SIMD instructions are available. You can try the following optimized function (also available here):

void euclidean_sse2(double x0, double y0, const double *x, const double* y,
 const size_t *index, size_t index_size, double *result)

 __m128d vx0 = _mm_set1_pd(x0);
 __m128d vy0 = _mm_set1_pd(y0);
 for (size_t i = 0; i + 1 < index_size; i += 2) // process 2 at a time
 size_t i0 = index[i];
 size_t i1 = index[i+1];
 __m128d vx = _mm_set_pd(x[i1], x[i0]); // load 2 scattered elements
 __m128d vy = _mm_set_pd(y[i1], y[i0]); // load 2 scattered elements
 __m128d vdx = _mm_sub_pd(vx, vx0);
 __m128d vdy = _mm_sub_pd(vy, vy0);
 __m128d vr = _mm_sqrt_pd(_mm_add_pd(_mm_mul_pd(vdx, vdx), _mm_mul_pd(vdy, vdy)));
 _mm_store_sd(result + i0, vr);
 _mm_storeh_pd(result + i1, vr);
 
 if (index_size & 1) // if index_size is odd, there is one more element to process
 size_t i0 = index[index_size-1];
 double dx = x0 - x[i0];
 double dy = y0 - y[i0];
 result[i0] = sqrt(dx*dx + dy * dy);

Here the load and store operations are not vectorized, i.e. we are loading x and y one by one and we are storing into result one by one. All other operations are vectorized.

With gcc the assembler output of the body of the main loop is below.

 // scalar load operations
 mov r10, QWORD PTR [rax]
 mov r9, QWORD PTR [rax+8]
 add rax, 16
 movsd xmm3, QWORD PTR [rdi+r10*8]
 movsd xmm2, QWORD PTR [rsi+r10*8]
 movhpd xmm3, QWORD PTR [rdi+r9*8]
 movhpd xmm2, QWORD PTR [rsi+r9*8]
 // vectorial operations
 subpd xmm3, xmm4
 subpd xmm2, xmm5
 mulpd xmm3, xmm3
 mulpd xmm2, xmm2
 addpd xmm2, xmm3
 sqrtpd xmm2, xmm2
 // scalar store operations
 movlpd QWORD PTR [r8+r10*8], xmm2
 movhpd QWORD PTR [r8+r9*8], xmm2

If the meaning of these functions is not clear, you can read this.

edited Nov 12 at 10:34

answered Nov 11 at 9:34

Fabio

823318

Thank you for the detailed code answer. I will go through the links. Could you briefly explain (or add it into the answer) what is means for vectorization in the context of nonconsecutive indices?
– Tryer
Nov 11 at 11:16

Also, x[i+1] and x[i] are not what the function should access. It should be x[index[i+1]] and x[index[i]] and so on.
– Tryer
Nov 11 at 11:37

Tryer, if you do not post the exact function you want to optimize, I am stabbing in the dark here. I have added some explanation on vectorization.
– Fabio
Nov 11 at 12:27

I am new to intrinsics, but slowly, I am able to figure out what is going on above. Do you have any suggestion to learn it faster/better? For e.g., to generate this intrinsic-based code, did you use or take help from the assembly instructions on the godbolt site? I notice that it uses the xmm0 register, for instance. Or is it practice, writing your own code, debugging, etc.?
– Tryer
Nov 11 at 15:37

1

I very rarely write assembler directly. Most of the time there is no need, the intrinsic are sufficient. AFAIK there are no tools which help you write this type of code. God bolt is useful, because it shows immediately how your code will be translated in assembler. For simple loops, writing manually it is not needed, the compiler will vectorize automatically. It is an expensive type of optimization, takes time to write. The Intel manual is the best way to learn. Or search the web for "simd intrinsic tutorials". For instance software.intel.com/en-us/articles/how-to-use-intrinsics
– Fabio
Nov 11 at 15:43

|
show 1 more comment

up vote
1
down vote

accepted

Based on the additional comments, assuming the function you want to optimize is the following:

void euclidean(double x0, double y0, const double *x, const double* y,
 const size_t *index, size_t index_size, double *result)

 for (size_t i = 0; i < index_size; ++i) 
 double dx = x0 - x[index[i]];
 double dy = y0 - y[index[i]];
 result[index[i]] = sqrt(dx*dx + dy * dy);

Since your target is Pentium, only SSE2 SIMD instructions are available. You can try the following optimized function (also available here):

void euclidean_sse2(double x0, double y0, const double *x, const double* y,
 const size_t *index, size_t index_size, double *result)

 __m128d vx0 = _mm_set1_pd(x0);
 __m128d vy0 = _mm_set1_pd(y0);
 for (size_t i = 0; i + 1 < index_size; i += 2) // process 2 at a time
 size_t i0 = index[i];
 size_t i1 = index[i+1];
 __m128d vx = _mm_set_pd(x[i1], x[i0]); // load 2 scattered elements
 __m128d vy = _mm_set_pd(y[i1], y[i0]); // load 2 scattered elements
 __m128d vdx = _mm_sub_pd(vx, vx0);
 __m128d vdy = _mm_sub_pd(vy, vy0);
 __m128d vr = _mm_sqrt_pd(_mm_add_pd(_mm_mul_pd(vdx, vdx), _mm_mul_pd(vdy, vdy)));
 _mm_store_sd(result + i0, vr);
 _mm_storeh_pd(result + i1, vr);
 
 if (index_size & 1) // if index_size is odd, there is one more element to process
 size_t i0 = index[index_size-1];
 double dx = x0 - x[i0];
 double dy = y0 - y[i0];
 result[i0] = sqrt(dx*dx + dy * dy);

Here the load and store operations are not vectorized, i.e. we are loading x and y one by one and we are storing into result one by one. All other operations are vectorized.

With gcc the assembler output of the body of the main loop is below.

 // scalar load operations
 mov r10, QWORD PTR [rax]
 mov r9, QWORD PTR [rax+8]
 add rax, 16
 movsd xmm3, QWORD PTR [rdi+r10*8]
 movsd xmm2, QWORD PTR [rsi+r10*8]
 movhpd xmm3, QWORD PTR [rdi+r9*8]
 movhpd xmm2, QWORD PTR [rsi+r9*8]
 // vectorial operations
 subpd xmm3, xmm4
 subpd xmm2, xmm5
 mulpd xmm3, xmm3
 mulpd xmm2, xmm2
 addpd xmm2, xmm3
 sqrtpd xmm2, xmm2
 // scalar store operations
 movlpd QWORD PTR [r8+r10*8], xmm2
 movhpd QWORD PTR [r8+r9*8], xmm2

If the meaning of these functions is not clear, you can read this.

edited Nov 12 at 10:34

answered Nov 11 at 9:34

Fabio

823318

Based on the additional comments, assuming the function you want to optimize is the following:

void euclidean(double x0, double y0, const double *x, const double* y,
 const size_t *index, size_t index_size, double *result)

 for (size_t i = 0; i < index_size; ++i) 
 double dx = x0 - x[index[i]];
 double dy = y0 - y[index[i]];
 result[index[i]] = sqrt(dx*dx + dy * dy);

Since your target is Pentium, only SSE2 SIMD instructions are available. You can try the following optimized function (also available here):

void euclidean_sse2(double x0, double y0, const double *x, const double* y,
 const size_t *index, size_t index_size, double *result)

 __m128d vx0 = _mm_set1_pd(x0);
 __m128d vy0 = _mm_set1_pd(y0);
 for (size_t i = 0; i + 1 < index_size; i += 2) // process 2 at a time
 size_t i0 = index[i];
 size_t i1 = index[i+1];
 __m128d vx = _mm_set_pd(x[i1], x[i0]); // load 2 scattered elements
 __m128d vy = _mm_set_pd(y[i1], y[i0]); // load 2 scattered elements
 __m128d vdx = _mm_sub_pd(vx, vx0);
 __m128d vdy = _mm_sub_pd(vy, vy0);
 __m128d vr = _mm_sqrt_pd(_mm_add_pd(_mm_mul_pd(vdx, vdx), _mm_mul_pd(vdy, vdy)));
 _mm_store_sd(result + i0, vr);
 _mm_storeh_pd(result + i1, vr);
 
 if (index_size & 1) // if index_size is odd, there is one more element to process
 size_t i0 = index[index_size-1];
 double dx = x0 - x[i0];
 double dy = y0 - y[i0];
 result[i0] = sqrt(dx*dx + dy * dy);

Here the load and store operations are not vectorized, i.e. we are loading x and y one by one and we are storing into result one by one. All other operations are vectorized.

With gcc the assembler output of the body of the main loop is below.

 // scalar load operations
 mov r10, QWORD PTR [rax]
 mov r9, QWORD PTR [rax+8]
 add rax, 16
 movsd xmm3, QWORD PTR [rdi+r10*8]
 movsd xmm2, QWORD PTR [rsi+r10*8]
 movhpd xmm3, QWORD PTR [rdi+r9*8]
 movhpd xmm2, QWORD PTR [rsi+r9*8]
 // vectorial operations
 subpd xmm3, xmm4
 subpd xmm2, xmm5
 mulpd xmm3, xmm3
 mulpd xmm2, xmm2
 addpd xmm2, xmm3
 sqrtpd xmm2, xmm2
 // scalar store operations
 movlpd QWORD PTR [r8+r10*8], xmm2
 movhpd QWORD PTR [r8+r9*8], xmm2

If the meaning of these functions is not clear, you can read this.

edited Nov 12 at 10:34

answered Nov 11 at 9:34

Fabio

823318

edited Nov 12 at 10:34

answered Nov 11 at 9:34

Fabio

823318

answered Nov 11 at 9:34

Fabio

823318

answered Nov 11 at 9:34

Fabio

823318

Thank you for the detailed code answer. I will go through the links. Could you briefly explain (or add it into the answer) what is means for vectorization in the context of nonconsecutive indices?
– Tryer
Nov 11 at 11:16

Also, x[i+1] and x[i] are not what the function should access. It should be x[index[i+1]] and x[index[i]] and so on.
– Tryer
Nov 11 at 11:37

Tryer, if you do not post the exact function you want to optimize, I am stabbing in the dark here. I have added some explanation on vectorization.
– Fabio
Nov 11 at 12:27

I am new to intrinsics, but slowly, I am able to figure out what is going on above. Do you have any suggestion to learn it faster/better? For e.g., to generate this intrinsic-based code, did you use or take help from the assembly instructions on the godbolt site? I notice that it uses the xmm0 register, for instance. Or is it practice, writing your own code, debugging, etc.?
– Tryer
Nov 11 at 15:37

1

I very rarely write assembler directly. Most of the time there is no need, the intrinsic are sufficient. AFAIK there are no tools which help you write this type of code. God bolt is useful, because it shows immediately how your code will be translated in assembler. For simple loops, writing manually it is not needed, the compiler will vectorize automatically. It is an expensive type of optimization, takes time to write. The Intel manual is the best way to learn. Or search the web for "simd intrinsic tutorials". For instance software.intel.com/en-us/articles/how-to-use-intrinsics
– Fabio
Nov 11 at 15:43

|
show 1 more comment

Thank you for the detailed code answer. I will go through the links. Could you briefly explain (or add it into the answer) what is means for vectorization in the context of nonconsecutive indices?
– Tryer
Nov 11 at 11:16

Also, x[i+1] and x[i] are not what the function should access. It should be x[index[i+1]] and x[index[i]] and so on.
– Tryer
Nov 11 at 11:37

Tryer, if you do not post the exact function you want to optimize, I am stabbing in the dark here. I have added some explanation on vectorization.
– Fabio
Nov 11 at 12:27

I am new to intrinsics, but slowly, I am able to figure out what is going on above. Do you have any suggestion to learn it faster/better? For e.g., to generate this intrinsic-based code, did you use or take help from the assembly instructions on the godbolt site? I notice that it uses the xmm0 register, for instance. Or is it practice, writing your own code, debugging, etc.?
– Tryer
Nov 11 at 15:37

1

I very rarely write assembler directly. Most of the time there is no need, the intrinsic are sufficient. AFAIK there are no tools which help you write this type of code. God bolt is useful, because it shows immediately how your code will be translated in assembler. For simple loops, writing manually it is not needed, the compiler will vectorize automatically. It is an expensive type of optimization, takes time to write. The Intel manual is the best way to learn. Or search the web for "simd intrinsic tutorials". For instance software.intel.com/en-us/articles/how-to-use-intrinsics
– Fabio
Nov 11 at 15:43

Thank you for the detailed code answer. I will go through the links. Could you briefly explain (or add it into the answer) what is means for vectorization in the context of nonconsecutive indices?
– Tryer
Nov 11 at 11:16

Also, x[i+1] and x[i] are not what the function should access. It should be x[index[i+1]] and x[index[i]] and so on.
– Tryer
Nov 11 at 11:37

Tryer, if you do not post the exact function you want to optimize, I am stabbing in the dark here. I have added some explanation on vectorization.
– Fabio
Nov 11 at 12:27

I am new to intrinsics, but slowly, I am able to figure out what is going on above. Do you have any suggestion to learn it faster/better? For e.g., to generate this intrinsic-based code, did you use or take help from the assembly instructions on the godbolt site? I notice that it uses the xmm0 register, for instance. Or is it practice, writing your own code, debugging, etc.?
– Tryer
Nov 11 at 15:37

I very rarely write assembler directly. Most of the time there is no need, the intrinsic are sufficient. AFAIK there are no tools which help you write this type of code. God bolt is useful, because it shows immediately how your code will be translated in assembler. For simple loops, writing manually it is not needed, the compiler will vectorize automatically. It is an expensive type of optimization, takes time to write. The Intel manual is the best way to learn. Or search the web for "simd intrinsic tutorials". For instance software.intel.com/en-us/articles/how-to-use-intrinsics
– Fabio
Nov 11 at 15:43

|
show 1 more comment

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

Some of your past answers have not been well-received, and you're in danger of being blocked from answering.

Please pay close attention to the following guidance:

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Myujth