

















| Instr.               | Operands          | Operation                   | Comment           |
|----------------------|-------------------|-----------------------------|-------------------|
| ADD <u>V</u>         | V1,V2,V3          | V1=V2+V3                    | vector + vector   |
| ADD <mark>S</mark> V | V1, <u>F0</u> ,V2 | V1= <mark>F0</mark> +V2     | scalar + vector   |
| MULTV                | V1,V2,V3          | V1=V2xV3                    | vector x vector   |
| MULSV                | V1,F0,V2          | V1=F0xV2                    | scalar x vector   |
| LV                   | V1,R1             | V1=M[R1R1+63]               | load, stride=1    |
| LV <u>WS</u>         | V1,R1,R2          | V1=M[R1R1+ <u>63*R2</u> ]   | load, stride=R2   |
| LV                   | V1,R1,V2          | V1=M[R1 <u>+V2i</u> ,i=063] | load, indexed     |
| CeqV                 | VM,V1,V2          | VMASKi = (V1i=V2i)?         | comp. setmask     |
| MOV                  | <u>VLR</u> ,R1    | Vec. Len. Reg. = R1         | set vector length |
| MOV                  | <u>VM</u> ,R1     | Vec. Mask = R1              | set vector mask   |



|                      |                  | 1               |
|----------------------|------------------|-----------------|
| ‡ C code             | # Scalar Code    | # Vector Code   |
| Eor (i=0; i<64; i++) | LI R4, 64        | LI VLR, 64      |
| C[i] = A[i] + B[i]   | ; loop:          | LV V1, R1       |
|                      | L.D F0, 0(R1)    | LV V2, R2       |
|                      | L.D F2, 0(R2)    | ADDV V3, V1, V2 |
|                      | ADD.D F4, F2, F0 | SV V3, R3       |
|                      | S.D F4, 0(R3)    |                 |
|                      | ADDIU R1, 8      |                 |
|                      | ADDIU R2, 8      |                 |
|                      | ADDIU R3, 8      |                 |
|                      | SUBIU R4, 1      |                 |
|                      | BNEZ R4, loop    |                 |





| Machine    | Year          | Clock   | Regs  | <b>Elements</b> | FUs | LSUs   |
|------------|---------------|---------|-------|-----------------|-----|--------|
| Cray 1     | 1976          | 80 MHz  | 8     | 64              | 6   | 1      |
| Cray XMP   | <b>1983</b> 1 | 20 MHz  | 8     | 64              | 8   | 2L, 1S |
| Cray YMP   | <b>1988</b> 1 | I66 MHz | 8     | 64              | 8   | 2L, 1S |
| Cray C-90  | 1991 2        | 240 MHz | 8     | 128             | 8   | 4      |
| Cray T-90  | 1996 4        | 455 MHz | 8     | 128             | 8   | 4      |
| Conv. C-1  | 1984          | 10 MHz  | 8     | 128             | 4   | 1      |
| Conv. C-4  | <b>1994</b> 1 | 133 MHz | 16    | 128             | 3   | 1      |
| Fuj. VP200 | <b>1982</b> 1 | 133 MHz | 8-256 | 32-1024         | 3   | 2      |
| Fuj. VP300 | <b>1996</b> 1 | 100 MHz | 8-256 | 32-1024         | 3   | 2      |
| NEC SX/2   | <b>1984</b> 1 | 60 MHz  | 8+8K  | 256+var         | 16  | 8      |
| NEC SX/3   | 1995 4        | 400 MHz | 8+8K  | 256+var         | 16  | 8      |





























| , F0 ; vector-Scalar multiply<br>vector Y<br>, V3 ; Add vectors<br>result in vector Y      |
|--------------------------------------------------------------------------------------------|
| vector Y<br>, V3 ; Add vectors<br>result in vector Y                                       |
| , V3 ;Add vectors<br>result in vector Y                                                    |
| result in vector Y                                                                         |
|                                                                                            |
| Chimes<br>Suppose VL=64<br>For 1 Lane: Chime = 64 cycles<br>For 2 Lanes: Chime = 32 cycles |
| For 4 Lanes: Chime = 16 cycles                                                             |
|                                                                                            |



| * Consider s            | ame examp         | le with 4 convo             | ys          |
|-------------------------|-------------------|-----------------------------|-------------|
| Vector length           | gth = n           |                             |             |
| ✤ Assume Ce             | onvoys don'       | t overlays                  |             |
| ✤ Show the t            | ime of each       | convoy assum                | ing 1 lane  |
| Convoy                  | Start time        | First result                | Last result |
| 1. LV                   | 0                 | 12                          | 11 + n      |
|                         |                   | 10 . n . 10                 | 23 + 2n     |
| 2. MULVS, LV            | 12 + n            | 12 + 11 + 12                |             |
| 2. MULVS, LV<br>3. ADDV | 12 + n<br>24 + 2n | 12 + 11 + 12<br>24 + 2n + 6 | 29 + 3n     |



| Adjacent el               | ements are not sequential in memory                       |
|---------------------------|-----------------------------------------------------------|
| do 10 i = 1, <sup>•</sup> | 100                                                       |
| do 10 j                   | = 1,100                                                   |
| Ā                         | (i,j) = 0.0                                               |
|                           | do 10 k = 1,100                                           |
| 10                        | A(i,j) = A(i,j) + B(i, <u>k</u> ) * C( <u>k</u> ,j)       |
| Either B or               | C accesses are not adjacent                               |
| ≻800 byte                 | s between adjacent vector elements                        |
| Stride: dist<br>be merged | ance separating elements that are to into a single vector |
| ≻Caches                   | do unit stride                                            |
| ≻LVWS (lo                 | ad vector with stride) instruction                        |
| Think of ad               | dresses per vector element                                |



| Want to vectorize                            | loops with indirect accesses         |
|----------------------------------------------|--------------------------------------|
| for (i=0; i <n;< th=""><th>; i++)</th></n;<> | ; i++)                               |
| A[i] = B[i                                   | ] + C[D[i]]                          |
| Indexed load inst                            | ruction (Gather)                     |
| LV vD, rD                                    | <pre># Load D vector (indices)</pre> |
| LVI vC, rC, vI                               | ) # Load C vector indexed            |
| LV vB, rB                                    | # Load B vector                      |
| ADDV vA, vB, v                               | 7C # Add Vectors                     |
| SV vA, rA                                    | # Store A vector                     |
| CSE 661 Parallal and Vactor Architectures    | Verter Commuters alida 26            |







| Problem: W  | lant to                          | vectorize loops with conditional code:      |
|-------------|----------------------------------|---------------------------------------------|
| for (i=0;   | i <n;< th=""><th>i++)</th></n;<> | i++)                                        |
| if (A[i     | ]>0)                             | then A[i] = B[i]                            |
| Solution: A | dd vec                           | tor mask registers                          |
| > Vector    | version                          | of predicate registers, 1 bit per element   |
| > Vector    | operatio                         | n becomes NOP at elements where mask bit is |
| Code exam   | ple:                             |                                             |
| CVM         |                                  | # Turn on all bits in Vector Mask           |
| LV vA,      | rA                               | # Load entire A vector                      |
| SGTV        | vA, O                            | # Set bits in mask register where A>0       |
| LV vA,      | rB                               | # Load B vector into A under mask           |
| C17 177     | rA                               | # Store A back to memory under mask         |















| Example | DAXPY:        |                                          |
|---------|---------------|------------------------------------------|
| L.D     | F0,a          | ;load scalar a                           |
| MOV     | F1, F0        | ;copy a into F1 for SIMD MUL             |
| MOV     | F2, F0        | ;copy a into F2 for SIMD MUL             |
| MOV     | F3, F0        | ;copy a into F3 for SIMD MUL             |
| DADDIU  | R4,Rx,512     | ;last address to load                    |
| Loop:   | L.4D F4,0[Rx] | ;load X[i], X[i+1], X[i+2], X[i+3]       |
| MUL.4D  | F4,F4,F0      | ;a×X[i],a×X[i+1],a×X[i+2],a×X[i+3]       |
| L.4D    | F8,0[Ry]      | ;load Y[i], Y[i+1], Y[i+2], Y[i+3]       |
| ADD.4D  | F8,F8,F4      | ;a×X[i]+Y[i],, a×X[i+3]+Y[i+3]           |
| S.4D    | 0[Ry],F8      | ;store into Y[i], Y[i+1], Y[i+2], Y[i+3] |
| DADDIU  | Rx,Rx,32      | ;increment index to X                    |
| DADDIU  | Ry,Ry,32      | ;increment index to Y                    |
| DSUBU   | R20,R4,Rx     | ;compute bound                           |
| BNEZ    | R20,Loop      | ;check if done                           |





















| if (X[i] != 0)<br>X[i] = X[<br>else X[i] = Z[i]                                                  | i] — Y[i];<br>;           |                                    |
|--------------------------------------------------------------------------------------------------|---------------------------|------------------------------------|
| ld.global.f64                                                                                    | RD0, [X+R8]               | ; RD0 = X[i]                       |
| setp.neq.s32                                                                                     | P1, RD0, #0               | ; P1 is predicate register 1       |
| @!P1, bra                                                                                        | ELSE1, *Push              | ; Push old mask, set new mask bits |
|                                                                                                  |                           | ; if P1 false, go to ELSE1         |
| ld.global.f64                                                                                    | RD2, [Y+R8]               | ; RD2 = Y[i]                       |
| sub.f64                                                                                          | RD0, RD0, RD2             | ; Difference in RD0                |
| st.global.f64                                                                                    | [X+R8], RD0               | ; X[i] = RD0                       |
| @P1, bra                                                                                         | ENDIF1, *Comp             | ; complement mask bits             |
|                                                                                                  |                           | ; if P1 true, go to ENDIF1         |
| ELSE1:                                                                                           | ld.global.f64 RD0, [Z+R8] | ; RD0 = Z[i]                       |
|                                                                                                  | st.global.f64 [X+R8], RD0 | ; X[i] = RD0                       |
| ENDIF1: <next in<="" td=""><td>struction&gt;, *Pop ; pop to</td><td>restore old mask</td></next> | struction>, *Pop ; pop to | restore old mask                   |





- **\*** Vector is a model for exploiting Data Parallelism
- If code is vectorizable, then simpler hardware, more energy efficient, and better real-time model than Out-of-order machines
- Design issues include number of lanes, number of functional units, number of vector registers, length of vector registers, exception handling, and conditional operations
- Fundamental design issue is memory bandwidth
  With virtual address translation and caching

CSE 661 - Parallel and Vector Architectures

Vector Computers - slide 61