Accelerating Multimedia Applications using the Intel - PowerPoint Presentation

min-jolicoeur
Uploaded On 2019-12-09

Presentation Transcript

Accelerating Multimedia Applications using the Intel SSE and AVX ISA
Min Li
05/08/2013

Intel SSE and AVX ISA
Intel ISA
  SSE1, SSE2, SSE3, SSE4 (SSE4.1, SSE4.2)
    SSE4.2: specialized for string and text applications (suitable for applications like template matching and genome sequence comparison)
  AVX (mainly for floating-point operations)
    AVX1: 256 bits
    AVX2: 256 bits (with some instruction extensions)
XMM and YMM registers
  XMM: 128 bits
  YMM: 256 bits

Intel OpenCV Library
OpenCV library
  Covers a wide range of multimedia applications: object detection, face recognition, image processing, ...
  Good candidates for speedup with the Intel SSE or AVX ISA, because the computations are intensive
I made videos on YouTube showing some tricks for using the OpenCV library:
  https://www.youtube.com/watch?v=ISap9zEGE2I
  https://www.youtube.com/watch?v=pqSgT0quMBc

Guidelines for enabling the Intel SSE and AVX ISA
Run cat /proc/cpuinfo and make sure the SSE and AVX flags are listed. Otherwise, enable them.
On my machine, all SSE extensions are activated.
However, only AVX1 is activated, which means the 256-bit YMM registers are available for floating-point operations only; integer SIMD operations are limited to the 128-bit XMM registers.
Note: AVX2 was released in mid-2012.
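
The same check can also be made from inside a program. The following is a minimal sketch, assuming GCC or Clang on an x86 target: __builtin_cpu_init and __builtin_cpu_supports are compiler builtins, while detect_simd, print_simd and the HAS_* constants are hypothetical names introduced here for illustration.

```c
#include <stdio.h>

/* Hypothetical flag bits, one per ISA extension of interest. */
enum { HAS_SSE2 = 1, HAS_SSE4_2 = 2, HAS_AVX = 4, HAS_AVX2 = 8 };

/* Returns a bitmask of the extensions the running CPU reports.
   Uses GCC/Clang builtins, so this compiles only for x86 targets. */
int detect_simd(void)
{
    int flags = 0;
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse2"))   flags |= HAS_SSE2;
    if (__builtin_cpu_supports("sse4.2")) flags |= HAS_SSE4_2;
    if (__builtin_cpu_supports("avx"))    flags |= HAS_AVX;
    if (__builtin_cpu_supports("avx2"))   flags |= HAS_AVX2;
    return flags;
}

/* Prints one line per extension, similar to scanning /proc/cpuinfo. */
void print_simd(void)
{
    int f = detect_simd();
    printf("sse2:   %s\n", (f & HAS_SSE2)   ? "yes" : "no");
    printf("sse4.2: %s\n", (f & HAS_SSE4_2) ? "yes" : "no");
    printf("avx:    %s\n", (f & HAS_AVX)    ? "yes" : "no");
    printf("avx2:   %s\n", (f & HAS_AVX2)   ? "yes" : "no");
}
```

Note that this only reports what the CPU supports at run time. The compiler still needs the matching options (for example -msse3 or -mavx with GCC) before it will accept the corresponding intrinsics.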

Acceleration Case I
Original:
    for (int i = 0; i < length; i += 4) {
        double t0 = d1[i] - d2[i];
        double t1 = d1[i+1] - d2[i+1];
        double t2 = d1[i+2] - d2[i+2];
        double t3 = d1[i+3] - d2[i+3];
        total_cost += t0*t0 + t1*t1 + t2*t2 + t3*t3;
    }
After modification (_mm_hadd_ps is SSE3, from <pmmintrin.h>; _mm_load_ps needs d1 and d2 to be 16-byte-aligned float arrays):
    int chunk = length / 4;
    for (i = 0; i < chunk; i++) {
        __m128 m0, m1, m2;
        m0 = _mm_load_ps(&d1[4 * i]);
        m1 = _mm_load_ps(&d2[4 * i]);
        m1 = _mm_sub_ps(m0, m1);                            /* four differences */
        m1 = _mm_mul_ps(m1, m1);                            /* squares: a b c d */
        m1 = _mm_hadd_ps(m1, m1);                           /* a+b c+d a+b c+d */
        m2 = _mm_shuffle_ps(m1, m1, _MM_SHUFFLE(2,3,0,1));  /* c+d a+b c+d a+b */
        m1 = _mm_add_ps(m1, m2);                            /* a+b+c+d in every lane */
        total_cost += ((float*)&m1)[0];
        if (total_cost > best) break;
    }
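
For comparison, here is a self-contained sketch of the same sum of squared differences that builds without extra compiler flags on x86-64. It is a variation, not the code from the slide: it assumes plain float arrays, uses unaligned loads, keeps the partial sums in a vector register and reduces them once after the loop with SSE1 shuffles instead of _mm_hadd_ps, and drops the early exit against best. The name ssd_sse is hypothetical.

```c
#include <xmmintrin.h>  /* SSE1 intrinsics */

/* Sum of squared differences of a[0..n) and b[0..n).
   Hypothetical helper; no alignment requirement on a or b. */
float ssd_sse(const float *a, const float *b, int n)
{
    __m128 acc = _mm_setzero_ps();   /* four running partial sums */
    int i = 0;

    for (; i + 4 <= n; i += 4) {
        __m128 d = _mm_sub_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i));
        acc = _mm_add_ps(acc, _mm_mul_ps(d, d));
    }

    /* Horizontal sum of the four lanes of acc. */
    __m128 hi   = _mm_movehl_ps(acc, acc);          /* acc2 acc3 acc2 acc3 */
    __m128 sum2 = _mm_add_ps(acc, hi);              /* acc0+acc2, acc1+acc3, ... */
    __m128 lane1 = _mm_shuffle_ps(sum2, sum2, _MM_SHUFFLE(1, 1, 1, 1));
    float total = _mm_cvtss_f32(_mm_add_ss(sum2, lane1));

    /* Scalar tail for the last n % 4 elements. */
    for (; i < n; i++) {
        float d = a[i] - b[i];
        total += d * d;
    }
    return total;
}
```

Reducing once at the end removes the horizontal add from the inner loop. The trade-off is that a per-iteration test such as total_cost > best is no longer available without an extra reduction.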

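Acceleration Case II
Original:
    float minval = FLT_MAX, maxval = -FLT_MAX;
    for (i = 0; i < N; i++, ++it) {
        float v = *(const float*)it.ptr;
        if (v < minval) { minval = v; minidx = it.node()->idx; }
        if (v > maxval) { maxval = v; maxidx = it.node()->idx; }
    }
    if (_minval) *_minval = minval;
    if (_maxval) *_maxval = maxval;
After modification (_mm_cmp_ps is an AVX intrinsic from <immintrin.h>):
    __m128 m0, m1, m2, m3, m4, minArray, maxArray;
    int chunk = N / 4;
    /* Assumption: the running min/max are seeded with the first four values
       (positions 0..3 in minPos/maxPos), which is why the loop starts at chunk 1. */
    minArray = maxArray = _mm_load_ps((const float*)it.ptr);
    it += 4;
    for (i = 1; i < chunk; i++) {
        m0 = _mm_load_ps((const float*)it.ptr);
        it += 4;
        m1 = _mm_min_ps(m0, minArray);
        m2 = _mm_max_ps(m0, maxArray);
        m3 = _mm_cmp_ps(m0, minArray, _CMP_LT_OS);  /* all-ones in lanes with a new minimum */
        m4 = _mm_cmp_ps(m0, maxArray, _CMP_GT_OS);  /* all-ones in lanes with a new maximum */
        int* mask1 = (int*)&m3;
        int* mask2 = (int*)&m4;
        for (int j = 0; j < 4; j++) {
            if (mask1[j] == -1) minPos[j] = 4 * i + j;
            if (mask2[j] == -1) maxPos[j] = 4 * i + j;
        }
        minArray = m1;  /* carry the min/max values forward, not the comparison masks */
        maxArray = m2;
    }
    /* The four lanes of minArray/maxArray and minPos/maxPos still have to be
       reduced to a single value and position after the loop. */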
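
To show the whole procedure end to end, including the final reduction over the four lanes, here is a runnable sketch under simplifying assumptions. It works on a plain float array rather than an OpenCV iterator, and it uses the SSE1 comparisons _mm_cmplt_ps and _mm_cmpgt_ps together with _mm_movemask_ps in place of the AVX _mm_cmp_ps. The names minmax_result and minmax_sse are hypothetical.

```c
#include <float.h>
#include <xmmintrin.h>  /* SSE1 intrinsics */

typedef struct {
    float minval, maxval;
    int   minidx, maxidx;   /* -1 if no element was strictly below/above the seed */
} minmax_result;

/* Finds the minimum and maximum of v[0..n) and the index of their first occurrence. */
minmax_result minmax_sse(const float *v, int n)
{
    minmax_result r = { FLT_MAX, -FLT_MAX, -1, -1 };
    __m128 vmin = _mm_set1_ps(FLT_MAX);
    __m128 vmax = _mm_set1_ps(-FLT_MAX);
    int min_pos[4] = { -1, -1, -1, -1 };
    int max_pos[4] = { -1, -1, -1, -1 };
    float lane_min[4], lane_max[4];
    int i = 0;

    for (; i + 4 <= n; i += 4) {
        __m128 x = _mm_loadu_ps(v + i);
        /* One bit per lane: set where x is a strictly better candidate. */
        int lt = _mm_movemask_ps(_mm_cmplt_ps(x, vmin));
        int gt = _mm_movemask_ps(_mm_cmpgt_ps(x, vmax));
        for (int j = 0; j < 4; j++) {
            if (lt & (1 << j)) min_pos[j] = i + j;
            if (gt & (1 << j)) max_pos[j] = i + j;
        }
        vmin = _mm_min_ps(x, vmin);
        vmax = _mm_max_ps(x, vmax);
    }

    /* Reduce the four lanes; ties go to the smaller index. */
    _mm_storeu_ps(lane_min, vmin);
    _mm_storeu_ps(lane_max, vmax);
    for (int j = 0; j < 4; j++) {
        if (min_pos[j] >= 0 && (lane_min[j] < r.minval ||
            (lane_min[j] == r.minval && min_pos[j] < r.minidx))) {
            r.minval = lane_min[j];
            r.minidx = min_pos[j];
        }
        if (max_pos[j] >= 0 && (lane_max[j] > r.maxval ||
            (lane_max[j] == r.maxval && max_pos[j] < r.maxidx))) {
            r.maxval = lane_max[j];
            r.maxidx = max_pos[j];
        }
    }

    /* Scalar tail for the last n % 4 elements. */
    for (; i < n; i++) {
        if (v[i] < r.minval) { r.minval = v[i]; r.minidx = i; }
        if (v[i] > r.maxval) { r.maxval = v[i]; r.maxidx = i; }
    }
    return r;
}
```

The inner loop over j is the price of tracking positions: the min and max values need two instructions per chunk, while the index bookkeeping stays scalar.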

Load of Structures
Structures like this:
    typedef struct point_ { int x; int y; } point;
_mm_load_ only reads consecutive memory!
What does the data look like inside the XMM register? How can the following be achieved using the SSE and AVX ISA?
    point* points;
    Memory:       points[0].x  points[0].y  points[1].x  points[1].y  ...
    After a load: X0 Y0 X1 Y1 X2 Y2 X3 Y3
    Wanted:       X0 X1 X2 X3   and   Y0 Y1 Y2 Y3
Not easy!!!

Permute and blend
    /* _mm256_load_si256 needs &points[4 * i] to be 32-byte aligned. */
    __m256i temp = _mm256_load_si256((__m256i*)&points[4 * i]);
    __m256 temp2 = _mm256_cvtepi32_ps(temp);                    /* X0 Y0 X1 Y1 | X2 Y2 X3 Y3 */
    /* Per-lane element indices, written out to produce the layouts in the comments. */
    __m256i mask1 = _mm256_setr_epi32(0, 2, 1, 3, 1, 3, 0, 2);
    __m256 temp3 = _mm256_permutevar_ps(temp2, mask1);          /* X0 X1 Y0 Y1 | Y2 Y3 X2 X3 */
    __m256 temp4 = _mm256_permute2f128_ps(temp3, temp3, 0x01);  /* Y2 Y3 X2 X3 | X0 X1 Y0 Y1 */
    temp3 = _mm256_blend_ps(temp3, temp4, 0xCC);                /* X0 X1 X2 X3 | Y2 Y3 Y0 Y1 */
    __m256i mask2 = _mm256_setr_epi32(0, 1, 2, 3, 2, 3, 0, 1);
    temp3 = _mm256_permutevar_ps(temp3, mask2);                 /* X0 X1 X2 X3 | Y0 Y1 Y2 Y3 */
    __m128 m1 = _mm256_extractf128_ps(temp3, 1);                /* Y0 Y1 Y2 Y3 */
    __m128 m2 = _mm256_extractf128_ps(temp3, 0);                /* X0 X1 X2 X3 */
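
When only 128-bit registers are needed, the same rearrangement can be sketched with two loads and two shuffles. This is an alternative to the permute-and-blend sequence, not the method from the slide: it relies on _mm_shuffle_ps taking its two low result elements from the first operand and its two high result elements from the second, and it needs only SSE2. The helper name load_points_soa is hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

typedef struct point_ { int x; int y; } point;

/* Loads four consecutive points and returns their coordinates as floats:
   *xs = X0 X1 X2 X3 and *ys = Y0 Y1 Y2 Y3. No alignment requirement on p. */
void load_points_soa(const point *p, __m128 *xs, __m128 *ys)
{
    /* lo = X0 Y0 X1 Y1, hi = X2 Y2 X3 Y3 (converted from int to float) */
    __m128 lo = _mm_cvtepi32_ps(_mm_loadu_si128((const __m128i *)&p[0]));
    __m128 hi = _mm_cvtepi32_ps(_mm_loadu_si128((const __m128i *)&p[2]));

    /* Even elements of lo and hi are the x values, odd elements the y values. */
    *xs = _mm_shuffle_ps(lo, hi, _MM_SHUFFLE(2, 0, 2, 0));
    *ys = _mm_shuffle_ps(lo, hi, _MM_SHUFFLE(3, 1, 3, 1));
}
```

A caller would loop over the array in steps of four points and pass &points[4 * i]. The cost is still two shuffles per four points on top of the loads, so the overhead of loading structures does not disappear.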

Simulation Results
Case II finds not only the min/max values but also their positions.
Loading structures adds too much overhead.

Conclusion and future work
OpenCV is suitable for SSE or AVX acceleration.
A single task has a better chance of getting a speedup.
Loading and arranging a structure is a really cumbersome task.
Future work: hints for smart automated compilation (such as loading structures).
Future work: suggestions for expanding the ISA (introducing new instructions).