

## **A RISC-V vector CPU for HPC:** architecture, platforms and tools to make it happen

Filippo Mantovani, Barcelona Supercomputing Center (BSC)



# **Introduction to RISC-V**

## The value of having "standards"








DISCLAIMER: Apple users may not fully understand this slide



NHR PerfLab Seminar, Erlangen, 10 Dec 2024

## What is **RISC-V**?

- ⇒ Simple and modular Instruction Set Architecture (ISA)
- ⇒ Research project started at Berkeley in 2010
- ⇒ ISA ratified in 2014
- ⇒ RISC-V is pronounced "risc five", as it is the fifth generation of RISC ISAs at Berkeley

Waterman, Andrew, and Patterson, David A. **The RISC-V Reader: An Open Architecture Atlas**. United States: Strawberry Canyon LLC, 2017.

Patterson, David A., and Hennessy, John L. **Computer Organization and Design RISC-V Edition: The Hardware Software Interface**. Netherlands: Elsevier Science, 2017.

#### **Core Instruction Formats**

Column headers are bit positions. A field that spans several columns is written once, in the leftmost column it occupies, with its bit range in parentheses.

| Format | 31:25 | 24:20 | 19:15 | 14:12 | 11:7 | 6:0 |
|--------|-------|-------|-------|-------|------|-----|
| R-type | funct7 | rs2 | rs1 | funct3 | rd | opcode |
| I-type | imm[11:0] (bits 31:20) | | rs1 | funct3 | rd | opcode |
| S-type | imm[11:5] | rs2 | rs1 | funct3 | imm[4:0] | opcode |
| B-type | imm[12\|10:5] | rs2 | rs1 | funct3 | imm[4:1\|11] | opcode |
| U-type | imm[31:12] (bits 31:12) | | | | rd | opcode |
| J-type | imm[20\|10:1\|11\|19:12] (bits 31:12) | | | | rd | opcode |

#### **RV32I Base Integer Instructions**

| Inst  | Name                    | FMT | Opcode  | funct3 | funct7         | Description (C)              | Note        |
|-------|-------------------------|-----|---------|--------|----------------|------------------------------|-------------|
| add   | ADD                     | R   | 0110011 | 0x0    | 0x00           | rd = rs1 + rs2               |             |
| sub   | SUB                     | R   | 0110011 | 0x0    | 0x20           | rd = rs1 - rs2               |             |
| xor   | XOR                     | R   | 0110011 | 0x4    | 0x00           | rd = rs1 ^ rs2               |             |
| or    | OR                      | R   | 0110011 | 0x6    | 0x00           | rd = rs1 \| rs2              |             |
| and   | AND                     | R   | 0110011 | 0x7    | 0x00           | rd = rs1 & rs2               |             |
| sll   | Shift Left Logical      | R   | 0110011 | 0x1    | 0x00           | rd = rs1 << rs2              |             |
| srl   | Shift Right Logical     | R   | 0110011 | 0x5    | 0x00           | rd = rs1 >> rs2              |             |
| sra   | Shift Right Arith*      | R   | 0110011 | 0x5    | 0x20           | rd = rs1 >> rs2              | msb-extends |
| slt   | Set Less Than           | R   | 0110011 | 0x2    | 0x00           | rd = (rs1 < rs2)?1:0         |             |
| sltu  | Set Less Than (U)       | R   | 0110011 | 0x3    | 0x00           | rd = (rs1 < rs2)?1:0         | zero-extend |
| addi  | ADD Immediate           | I   | 0010011 | 0x0    |                | rd = rs1 + imm               |             |
| xori  | XOR Immediate           | I   | 0010011 | 0x4    |                | rd = rs1 ^ imm               |             |
| ori   | OR Immediate            | I   | 0010011 | 0x6    |                | rd = rs1 \| imm              |             |
| andi  | AND Immediate           | I   | 0010011 | 0x7    |                | rd = rs1 & imm               |             |
| slli  | Shift Left Logical Imm  | I   | 0010011 | 0x1    | imm[5:11]=0x00 | rd = rs1 << imm[0:4]         |             |
| srli  | Shift Right Logical Imm | I   | 0010011 | 0x5    | imm[5:11]=0x00 | rd = rs1 >> imm[0:4]         |             |
| srai  | Shift Right Arith Imm   | I   | 0010011 | 0x5    | imm[5:11]=0x20 | rd = rs1 >> imm[0:4]         | msb-extends |
| slti  | Set Less Than Imm       | I   | 0010011 | 0x2    |                | rd = (rs1 < imm)?1:0         |             |
| sltiu | Set Less Than Imm (U)   | I   | 0010011 | 0x3    |                | rd = (rs1 < imm)?1:0         | zero-extend |
| lb    | Load Byte               | I   | 0000011 | 0x0    |                | rd = M[rs1+imm][0:7]         |             |
| lh    | Load Half               | I   | 0000011 | 0x1    |                | rd = M[rs1+imm][0:15]        |             |
| lw    | Load Word               | I   | 0000011 | 0x2    |                | rd = M[rs1+imm][0:31]        |             |
| lbu   | Load Byte (U)           | I   | 0000011 | 0x4    |                | rd = M[rs1+imm][0:7]         | zero-extend |
| lhu   | Load Half (U)           | I   | 0000011 | 0x5    |                | rd = M[rs1+imm][0:15]        | zero-extend |
| sb    | Store Byte              | S   | 0100011 | 0x0    |                | M[rs1+imm][0:7] = rs2[0:7]   |             |
| sh    | Store Half              | S   | 0100011 | 0x1    |                | M[rs1+imm][0:15] = rs2[0:15] |             |
| sw    | Store Word              | S   | 0100011 | 0x2    |                | M[rs1+imm][0:31] = rs2[0:31] |             |
| beq   | Branch ==               | B   | 1100011 | 0x0    |                | if(rs1 == rs2) PC += imm     |             |
| bne   | Branch !=               | B   | 1100011 | 0x1    |                | if(rs1 != rs2) PC += imm     |             |
| blt   | Branch <                | B   | 1100011 | 0x4    |                | if(rs1 < rs2) PC += imm      |             |
| bge   | Branch ≥                | B   | 1100011 | 0x5    |                | if(rs1 >= rs2) PC += imm     |             |
| bltu  | Branch < (U)            | B   | 1100011 | 0x6    |                | if(rs1 < rs2) PC += imm      | zero-extend |
| bgeu  | Branch ≥ (U)            | B   | 1100011 | 0x7    |                | if(rs1 >= rs2) PC += imm     | zero-extend |
| jal   | Jump And Link           | J   | 1101111 |        |                | rd = PC+4; PC += imm         |             |
| jalr  | Jump And Link Reg       | I   | 1100111 | 0x0    |                | rd = PC+4; PC = rs1 + imm    |             |
| lui   | Load Upper Imm          | U   | 0110111 |        |                | rd = imm << 12               |             |
| auipc | Add Upper Imm to PC     | U   | 0010111 |        |                | rd = PC + (imm << 12)        |             |
| ecall | Environment Call        | I   | 1110011 | 0x0    | imm=0x0        | Transfer control to OS       |             |

## **Technical difference: incremental vs modular ISA**

⇒ Intel x86 is an incremental ISA. Each new release:

- Maintains backward compatibility
- Carries new instructions (also for marketing reasons)



⇒ RISC-V is a modular ISA: a base integer instruction set plus optional standard extensions

| Name      | Description                                                    | Version | Status | Instruction Count |
|-----------|----------------------------------------------------------------|---------|--------|-------------------|
| **Base**  |                                                                |         |        |                   |
| RV32I     | Base Integer Instruction Set - 32-bit                          | 2.1     | Frozen | 49                |
| RV32E     | Base Integer Instruction Set (embedded) - 32-bit, 16 registers | 1.9     | Open   | Same as RV32I     |
| RV64I     | Base Integer Instruction Set - 64-bit                          | 2.0     | Frozen | 14                |
| RV128I    | Base Integer Instruction Set - 128-bit                         | 1.7     | Open   | 14                |
| **Extension** |                                                            |         |        |                   |
| M         | Standard Extension for Integer Multiplication and Division     | 2.0     | Frozen | 8                 |
| A         | Standard Extension for Atomic Instructions                     | 2.0     | Frozen | 11                |
| F         | Standard Extension for Single-Precision Floating-Point         | 2.0     | Frozen | 25                |
| D         | Standard Extension for Double-Precision Floating-Point         | 2.0     | Frozen | 25                |
| G         | Shorthand for the base and above extensions                    | n/a     | n/a    | n/a               |
| Q         | Standard Extension for Quad-Precision Floating-Point           | 2.0     | Frozen | 27                |
| L         | Standard Extension for Decimal Floating-Point                  | 0.0     | Open   | Undefined Yet     |
| C         | Standard Extension for Compressed Instructions                 | 2.0     | Frozen | 36                |
| B         | Standard Extension for Bit Manipulation                        | 0.90    | Open   | 42                |
| J         | Standard Extension for Dynamically Translated Languages        | 0.0     | Open   | Undefined Yet     |
| T         | Standard Extension for Transactional Memory                    | 0.0     | Open   | Undefined Yet     |
| P         | Standard Extension for Packed-SIMD Instructions                | 0.1     | Open   | Undefined Yet     |
| V         | Standard Extension for Vector Operations                       | 0.7     | Open   | 186               |
| N         | Standard Extension for User-Level Interrupts                   | 1.1     | Open   | 3                 |
| H         | Standard Extension for Hypervisor                              | 1.0     | Frozen | 2                 |
| S         | Standard Extension for Supervisor-level Instructions           | 1.12    | Open   | 7                 |

# Non-technical differences: a different business model

|                                          | Intel | Arm | RISC-V |
|------------------------------------------|-------|-----|--------|
| Open ISA                                 |       |     |        |
| Adopting the ISA is free                 |       | [2] | ✓      |
| It allows development of commercial IPs  |       |     | ✓      |
| Everybody can develop commercial IPs     |       | [3] | ✓      |
| It allows access to extensions           |       |     |        |
| It allows development of open-source IPs |       |     | ✓      |

<sup>[1]</sup> Adoption of the ISA "de facto" not possible (unless you are AMD)
<sup>[2]</sup> Adoption of the ISA is possible (under a fee)
<sup>[3]</sup> Only partners can develop commercial IPs (under a fee)



# False myth: it is not like Linux for software

⇒ A shallow analysis often uses the analogy

- "It is the same idea as Linux, but in hardware"
- "RISC-V will do to hardware what Linux did to software"












## Take-home message

Krste Asanović, SiFive, "Advancing HPC with RISC-V" (Supercomputing 2022, invited talk), https://www.youtube.com/watch?v=iFlcJFcOJKk

⇒ RISC-V defines an open, free and standard ISA

- Simple and modular (as opposed to incremental)

⇒ It defines a new business model

- ISA + extensions remain free
- Implementations can be closed (and sold)
- Implementations can be open

⇒ Likely to work as a standard/universal ISA

- Independent of market fluctuations (war, bans, ...)



## **European Processor Initiative (EPI)**

### **EPI** Main Objective



- To develop European microprocessor and accelerator technology
- Strengthen competitiveness of EU industry and science





BSC, SemiDynamics, EXTOLL, FORTH, ETHZ, UniBo, UniZG, Chalmers, CEA, E4


## **EPAC: EPI Accelerator v1.5**

GF22FDX, 27 mm², 0.3 Btr. Tape-out Mar 2023, bring-up Oct 2023.

#### **VEC tile**

General-purpose RISC-V CPU: Avispado core (16 kB I\$, 32 kB D\$) with a dedicated VPU. Up to 256 DP elements of vector length.

Semidynamics, Barcelona Supercomputing Center, Faculty of Electrical Engineering and …

#### **VRP tile**

General-purpose RISC-V CPU supporting variable-precision arithmetic on elements of up to 256 bits.

CEA

#### **L2-HN tile**

Distributed L2 cache (256 kB/slice) and Coherence Home Node.

#### **STX tile**

RISC-V many-core machine learning accelerator targeting stencil and tensor arithmetic.

ETH Zürich, Fraunhofer

#### **CHI NoC and SerDes**

On-chip high-speed network based on multiple CHI cross points (XP). Off-chip link based on SerDes.

EXTOLL

Physical design by Fraunhofer. Prototype board integration by …




## What's special in EPAC – VEC?

The VEC tile pairs the "Avispado" RISC-V core with the Vector Processing Unit (VPU).

What's special?

- 16 kB instruction cache
- 32 kB data cache
- Cache coherent (CHI)
- Decodes v0.7 and v1.0 of the vector extension
- It boots Linux
- The scalar in-order RISC-V core can issue several cache-line requests to main memory
- The core is connected to a Vector Processing Unit (VPU) with very wide vector registers (16 kbit, i.e. 256 double-precision elements)





## **VPU with Long Vector Length (VL) support**



#### **Short VL**

- As many Functional Units as VL
- Vector instructions are executed in 1 cycle

#### **Long VL**

- Cannot afford (area, power, cost) hundreds of Functional Units
- Vector instructions are executed over multiple cycles




## An example: AXPY with x86 intrinsics

```c
#include <stdio.h>
#include <stdlib.h>

// Scalar reference AXPY: y = alpha * x + y
void axpy(int n, float alpha, float *x, float *y)
{
    for (int i = 0; i < n; i++) {
        y[i] = alpha * x[i] + y[i];
    }
}

int main()
{
    int n = 32;
    float alpha = 2.0;

    // Allocate memory for vectors x and y
    float* x = (float*)malloc(n * sizeof(float));
    float* y = (float*)malloc(n * sizeof(float));

    // Initialize vectors x and y with sample values
    for (int i = 0; i < n; i++) {
        x[i] = i + 1;
        y[i] = i + 6;
    }

    printf("Original y:\n");
    for (int i = 0; i < n; i++) {
        printf("%f ", y[i]);
    }
    printf("\n");

    axpy(n, alpha, x, y);
    //axpy_avx512(n, alpha, x, y);
    //axpy_avx512_tail(n, alpha, x, y);

    printf("Resulting y after AXPY operation:\n");
    for (int i = 0; i < n; i++) {
        printf("%f ", y[i]);
    }
    printf("\n");

    // Free the allocated memory
    free(x);
    free(y);

    return 0;
}
```

## An example: AXPY with x86 intrinsics

```c
#include <immintrin.h>

// AVX-512 AXPY: 16 floats per iteration.
// Only correct when n is a multiple of 16.
void axpy_avx512(int n, float alpha, float *x, float *y)
{
    int i;
    __m512 alpha_vec = _mm512_set1_ps(alpha);
    for (i = 0; i < n; i += 16) {
        __m512 x_vec = _mm512_loadu_ps(&x[i]);
        __m512 y_vec = _mm512_loadu_ps(&y[i]);
        __m512 result = _mm512_fmadd_ps(alpha_vec, x_vec, y_vec);
        _mm512_storeu_ps(&y[i], result);
    }
}
```

## An example: AXPY with x86 intrinsics

For a generic size of X and Y, we must handle "loop tails".

```c
#include <immintrin.h>

void axpy_avx512_tail(int n, float alpha, float *x, float *y)
{
    int i;
    __m512 alpha_vec = _mm512_set1_ps(alpha);
    int avx512_loop_size = n - (n % 16);

    // Main loop: full vectors of 16 floats
    for (i = 0; i < avx512_loop_size; i += 16) {
        __m512 x_vec = _mm512_loadu_ps(&x[i]);
        __m512 y_vec = _mm512_loadu_ps(&y[i]);
        __m512 result = _mm512_fmadd_ps(alpha_vec, x_vec, y_vec);
        _mm512_storeu_ps(&y[i], result);
    }

    // Tail loop: the remaining n % 16 elements, one at a time
    for (; i < n; i++) {
        y[i] = alpha * x[i] + y[i];
    }
}
```
| BSC Supercomputing<br>Center                                                                                        | 77                                                                                                 |
| Centro Nacional de Supercomputación                                                                                 | <b>78 return 0</b> ; NHR PerfLab Seminar, Erlangen, 10 Dec 2024                                    |
|                                                                                                                     | 79 }                                                                                               |


# A bit more elegant: Variable Vector Length

⇒ Vector length (VL) register limits the max number of elements to be processed by a vector instruction

- VL is loaded prior to executing the vector instruction with a special instruction
- No need to handle "loop tails"
- Makes the code "vector length agnostic"

```
void axpy(double a, double *dx, double *dy, int n) {
    int i;

    // Request a vector length for n 64-bit elements; gvl is the granted length
    long gvl = __builtin_epi_vsetvl(n, __epi_e64, __epi_m1);
    __epi_1xf64 v_a = _MM_SET_f64(a, gvl);

    for (i = 0; i < n; i += gvl) {
        // The last iteration is simply granted a shorter gvl: no loop tail
        gvl = __builtin_epi_vsetvl(n - i, __epi_e64, __epi_m1);
        __epi_1xf64 v_dx = _MM_LOAD_f64(&dx[i], gvl);
        __epi_1xf64 v_dy = _MM_LOAD_f64(&dy[i], gvl);
        __epi_1xf64 v_res = _MM_MACC_f64(v_dy, v_a, v_dx, gvl);
        _MM_STORE_f64(&dy[i], v_res, gvl);
    }
}
```







VL can take any value up to VL\_max. Variable vector length is not limited to intrinsics.
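The strip-mining idea behind vector length agnostic code can be mimicked in plain C. The following is only a scalar sketch under stated assumptions: `set_vl` is a hypothetical stand-in for the hardware instruction that sets VL, and `VL_MAX` is an illustrative constant (256 is borrowed from the "256 DP elements" figure quoted in this talk). The inner loop models what a single vector instruction would do.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed maximum vector length in elements (illustrative value). */
#define VL_MAX 256

/* Hypothetical stand-in for the "set vector length" instruction:
 * grants min(requested, VL_MAX) elements for the next operations. */
static size_t set_vl(size_t requested)
{
    return requested < VL_MAX ? requested : VL_MAX;
}

/* y = a*x + y, processed in chunks of vl elements. The final chunk is
 * simply shorter, so no separate scalar tail loop is needed. */
void axpy_vla(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; ) {
        size_t vl = set_vl(n - i);
        assert(vl > 0);                    /* guarantees forward progress */
        for (size_t j = 0; j < vl; j++)    /* models one vector instruction */
            y[i + j] += a * x[i + j];
        i += vl;
    }
}
```

With `n = 600` the loop runs three chunks of 256, 256 and 88 elements; the same code works unchanged for any `VL_MAX`, which is what "vector length agnostic" means in practice.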

# Try it yourself

[Screenshot: Compiler Explorer with the scalar loop `void axpy(int n, float alpha, float *x, float *y)`, body `y[i] = alpha * x[i] + y[i]`, compiled by EPI-0.7 (Development) with `-O2 -ffast-math -mepi -mcpu=avispado`. The assembly pane shows a scalar loop (`flw`, `fmadd.s`, `fsw`) and a vector loop (`vle.v`, `vfmacc.vf`, `vse.v`).]

- ⇒ "Compiler Explorer" allows developers to write and compile code in various programming languages, including C++, C, Rust, and others.
- ⇒ Web-based interface for quickly testing and experimenting with code snippets, especially in the context of compiler optimizations.
- ⇒ <u>https://repo.hca.bsc.es/epic/</u>







## How do I program EPAC - VEC?





- Autovectorization
  - Leave it to the compiler
- `#pragma omp simd` (aka "Guided vectorization")
  - Relies on vectorization capabilities of the compiler
    - Usually works but gets complicated if the code calls functions
  - Also usable in Fortran
- C/C++ builtins (aka "Intrinsics")
  - Low-level mapping to the instructions
  - Allows embedding it into an existing C/C++ codebase
  - Allows relatively quick experimentation
- Assembler
  - Always a valid option but not the most pleasant
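As an illustration of the "guided vectorization" option, here is a minimal C sketch of the AXPY kernel annotated with `#pragma omp simd`. The function name `axpy_simd` and the use of `restrict` are choices made for this example, not taken from the talk.

```c
#include <assert.h>
#include <stddef.h>

/* y = alpha*x + y. The pragma tells the compiler that the iterations are
 * independent; `restrict` additionally promises that x and y do not alias. */
void axpy_simd(size_t n, float alpha, const float *restrict x, float *restrict y)
{
    assert(x != NULL && y != NULL);
    #pragma omp simd
    for (size_t i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];
}
```

Compilers honor the pragma only when OpenMP SIMD support is enabled (for example `-fopenmp-simd` in GCC and Clang); otherwise it is ignored and the loop is left to the regular autovectorizer. Either way the source stays portable, unlike the intrinsics version.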

#### How do I use EPAC - VEC?



- Like a standard HPC system!
- Compile your code
  - We give you a compiler
- Link libraries
- Write/Submit a job script
  - SLURM
- Wait for the results
- Analyse execution traces and study how well your code is vectorized

**Applications** 

Programming Model (OpenMP, MPI)

Libraries (FFTW, SpMV, ...)

Scheduler (Slurm)

Compiler (LLVM)

**OS** (Linux)

Hardware (RISC-V self-hosted)

# Take home message

#### ⇒EPI is developing:

- Arm-based CPU (not part of this talk/workshop)
- RISC-V-based Accelerator

#### ⇒We focus on the RISC-V vector accelerator (VEC) that:

- Can be self-hosted
- Supports variable vector length
- Is vector length agnostic
- Uses long vectors (256 DP elements, 32x longer than x86)
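The "32x" figure can be checked with one line of arithmetic, assuming that "x86" here means a 512-bit AVX-512 register (an assumption; the slide does not name the extension) and that a DP element is a 64-bit double:

$$
\frac{256 \times 64\ \text{bit}}{512\ \text{bit}} = \frac{16384\ \text{bit}}{512\ \text{bit}} = 32
$$

Equivalently, one AVX-512 register holds 8 DP elements, and 256 / 8 = 32.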



## **Software Development Vehicles (SDV)**

## What to do until the hardware is ready?





#### Software development





Wake up Neo... Follow the Software Development Vehicles

#### **Software Development Vehicles (SDV)**







ILA manual instrumentation

Signal-level trace

#### **Co-design with SDV**





- Feedback to compiler developers
- Feedback to hardware architects



# Navigate, visualize and quantify







#### **Software Development Vehicles (SDV)**

- 3 steps:
  - 1<sup>st</sup> step: Run on a commercial RISC-V platform (scalar CPU)
  - 2<sup>nd</sup> step: RISC-V software emulation supporting RVV (RAVE)
  - 3<sup>rd</sup> step: Run on VEC mapped into FPGA



## Take home message

While RTL is becoming actual hardware, EPI develops tools for boosting the co-design cycle

• Software and Hardware prototypes (aka Software Development Vehicles)

⇒We can leverage SDVs to:

- Influence hardware design
- Improve compiler autovectorization and system-software support
- Study and improve vectorization of real scientific HPC codes



## **Vectorization of a CFD code**

#### **Vectorization of a real CFD code (Alya)**

1 time step



- Alya is a modular code → We study the module called "Nastin"
- "VECTOR\_SIZE"
  - Allocates data structures in a vector-friendly way
  - Values under study → [16, 64, 128, 240, 256, 512]



## Alya mini-app



- We worked on a mini-app that mimics the behaviour of the Assembly of Alya
- We divided the mini-app into "phases"
  - Mini-app phases are regions of code with one or more loops
  - We are interested in loops because that is where the potential for vectorization lies
  - 8 phases identified: P1+P2+P3+P4+P5+P6+P7+P8 = mini-app
- We based our study and optimization on the autovectorization capabilities
  - No intrinsics → portability is preserved



### 1<sup>st</sup> step: Run on commercial RISC-V platforms (scalar CPU)

| Phase             | 1     | 2     | 3      | 4      | 5     | 6      | 7      | 8     |
|-------------------|-------|-------|--------|--------|-------|--------|--------|-------|
| % of total cycles | 1,29% | 3,33% | 19,80% | 14,45% | 3,49% | 40,99% | 14,68% | 1,96% |

- Phases taking longer (6,3,7,4) correspond to compute intensive regions
- Phases lasting less (5,2,8,1) are memory bound regions
- VECTOR\_SIZE parameter has almost no influence on the execution (5% coefficient of variation)





Commercial RISC-V platform (scalar CPU)



### **1<sup>st</sup> step: Enabling auto-vectorization**

- Auto-vectorization results obtained without touching a single line of code
- The VECTOR\_SIZE parameter strongly influences execution when vector instructions are used





#### **2<sup>nd</sup> step: Emulation supporting RVV (RAVE)**

| VECTOR_SIZE | Phase 1 | Phase 2 | Phase 3 | Phase 4 | Phase 5 | Phase 6 | Phase 7 | Phase 8 |
|-------------|---------|---------|---------|---------|---------|---------|---------|---------|
| 16          | 0,00%   | 0,00%   | 1,84%   | 0,00%   | 0,00%   | 0,95%   | 24,64%  | 0,00%   |
| 64          | 0,00%   | 0,00%   | 12,73%  | 17,37%  | 17,86%  | 21,58%  | 25,87%  | 0,00%   |
| 128         | 0,00%   | 0,00%   | 16,05%  | 16,80%  | 17,94%  | 20,39%  | 25,23%  | 0,00%   |
| 240         | 0,00%   | 0,00%   | 15,31%  | 16,45%  | 16,82%  | 19,90%  | 23,90%  | 0,00%   |
| 256         | 0,00%   | 0,00%   | 15,36%  | 16,21%  | 15,88%  | 19,78%  | 24,23%  | 0,00%   |
| 512         | 0,00%   | 0,00%   | 16,65%  | 18,19%  | 18,47%  | 21,82%  | 26,20%  | 0,00%   |

Color scale in the original figure: 0% to 30%.



#### Analysis of % of vector instructions:

- Higher VECTOR\_SIZE helps the compiler to insert more vector instructions
- Higher VECTOR\_SIZE reduces the total number of vector instructions
- 70% of vector instructions are memory type




## **3<sup>rd</sup> step: Run on VEC mapped into FPGA**

| VECTOR_SIZE | Phase 1 | Phase 2 | Phase 3 | Phase 4 | Phase 5 | Phase 6 | Phase 7 | Phase 8 |
|-------------|---------|---------|---------|---------|---------|---------|---------|---------|
| 16          | 0,00%   | 0,00%   | 15,72%  | 0,00%   | 0,00%   | 7,66%   | 73,30%  | 0,00%   |
| 64          | 0,00%   | 0,00%   | 72,59%  | 76,62%  | 57,73%  | 86,85%  | 77,70%  | 0,00%   |
| 128         | 0,00%   | 0,00%   | 81,94%  | 79,36%  | 64,01%  | 88,96%  | 79,59%  | 0,00%   |
| 240         | 0,00%   | 0,00%   | 83,69%  | 83,08%  | 70,75%  | 90,61%  | 81,94%  | 0,00%   |
| 256         | 0,00%   | 0,00%   | 83,76%  | 83,03%  | 71,29%  | 90,26%  | 82,83%  | 0,00%   |
| 512         | 0,00%   | 0,00%   | 85,74%  | 87,59%  | 80,61%  | 91,14%  | 88,50%  | 0,00%   |

#### Analysis of % of vector cycles:

- High vCPI → we are computing several elements per instruction (GOOD)
- AVL == VECTOR\_SIZE → the more elements we process per vector instruction, the fewer vector instructions we execute (GOOD)

| VECTOR_SIZE | vCPI  | AVL | Number vector instructions |
|-------------|-------|-----|----------------------------|
| 16          | 9.71  | 16  | $14.3 \times 10^{5}$       |
| 64          | 23.39 | 64  | $19.1 \times 10^{5}$       |
| 128         | 28.56 | 128 | $9.6 \times 10^{5}$        |
| 240         | 41.19 | 240 | $5.1 \times 10^{5}$        |
| 256         | 43.10 | 256 | $4.7 \times 10^{5}$        |
| 512         | 45.30 | 256 | $4.7 \times 10^{5}$        |
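The relation between AVL and instruction count in the table above can be made concrete with a small C sketch. It multiplies the number of vector instructions by the AVL to estimate how many element operations were executed; this assumes every vector instruction runs at the reported AVL, which is a simplification. The `Row` type and `element_ops` helper are names invented for this example; the numbers are copied from the table.

```c
#include <assert.h>
#include <stddef.h>

/* One row of the vCPI/AVL table. */
typedef struct {
    int    vector_size;          /* VECTOR_SIZE setting            */
    int    avl;                  /* average vector length          */
    double vector_instructions;  /* number of vector instructions  */
} Row;

/* Values copied from the table above. */
static const Row rows[] = {
    { 16,  16, 14.3e5},
    { 64,  64, 19.1e5},
    {128, 128,  9.6e5},
    {240, 240,  5.1e5},
    {256, 256,  4.7e5},
    {512, 256,  4.7e5},
};
enum { NUM_ROWS = sizeof rows / sizeof rows[0] };

/* Estimated element operations: instructions x elements per instruction. */
double element_ops(const Row *r)
{
    assert(r->avl > 0);
    return r->vector_instructions * (double)r->avl;
}
```

For VECTOR\_SIZE from 64 to 512 the product stays at roughly 1.2 × 10^8 element operations, so the instruction count falls as AVL grows. For VECTOR\_SIZE 16 the product is only about 2.3 × 10^7, which is consistent with several phases not being vectorized at that setting.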



# **3<sup>rd</sup> step: Run on VEC mapped into FPGA**


- Phases 1, 2 and 8 are not vectorized (pattern colored in plot)
- Next step: focus on vectorizing phase 2
  - It costs 30% of the time





#### **Example of optimization: phase 2 aka VEC2**

#### Problem

The compiler is unable to vectorize the loop because it does not know the value of VECTOR\_DIM, which is passed as a subroutine argument:

`subroutine nsi_miniapp(VECTOR_DIM, pnode, pgaus, list_elements)`

Compiler remark: `loop not vectorized: unsafe dependent memory operations in loop.`

#### **Solution**

We know the value of VECTOR\_DIM at compile time, so we declare it as a constant:

`integer(ip), parameter :: VECTOR_DIM = VECTOR_SIZE`



### **Optimization - VEC2**

- Enabled vectorization in phase 2
  - Performance gets worse instead of improving
  - AVL of vector instructions is low!
    We are not taking advantage of the full VL. Why?



Phase 2 cycles



### **Optimization - VEC2+VL**

#### **Problem**

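- pnode comes from input, so we do not know its value at compile time
- Experimentally, we found pnode << VECTOR\_DIM

#### **Solution**

Swap the induction variables (loop interchange) so that the innermost loop iterates over VECTOR\_DIM instead of pnode. The original loop nest is shown first, the interchanged one second: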

```
do ivect = 1,VECTOR_DIM
   do inode = 1,pnode
      !WORK
   end do
end do
```

```
do inode = 1,pnode
   do ivect = 1,VECTOR_DIM
      !WORK
   end do
end do
```
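The same interchange can be sketched in C to show that both loop orders compute identical results while exposing very different inner trip counts. Everything here is illustrative: the accumulation stands in for `!WORK`, the flat `in[inode * VECTOR_DIM + ivect]` layout and the value 256 are assumptions, and the function names are invented for this example.

```c
#include <assert.h>
#include <stddef.h>

#define VECTOR_DIM 256   /* assumed to be known at compile time */

/* Original order: the innermost loop runs over pnode, so a vectorizer
 * working on it sees only pnode iterations, with stride VECTOR_DIM. */
void work_node_inner(size_t pnode, const double *in, double *out)
{
    assert(in != NULL && out != NULL);
    for (size_t ivect = 0; ivect < VECTOR_DIM; ivect++)
        for (size_t inode = 0; inode < pnode; inode++)
            out[ivect] += in[inode * VECTOR_DIM + ivect];
}

/* Interchanged order: the innermost loop runs over VECTOR_DIM with unit
 * stride, so each vector instruction can cover many more elements. */
void work_vect_inner(size_t pnode, const double *in, double *out)
{
    assert(in != NULL && out != NULL);
    for (size_t inode = 0; inode < pnode; inode++)
        for (size_t ivect = 0; ivect < VECTOR_DIM; ivect++)
            out[ivect] += in[inode * VECTOR_DIM + ivect];
}
```

The interchange is legal here because each `out[ivect]` is updated independently and, for a fixed `ivect`, the additions over `inode` happen in the same order in both versions.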



#### **Optimization VEC2+VL: results**

- Improved AVL in phase 2
  - Vector instructions running with AVL == VECTOR\_SIZE





#### **Alya preliminary results - VEC2+VL**





#### **Evaluation: RISC-V vector prototype**

- After a detailed study and manual optimizations, we achieve a peak speedup of 7.6x (VEC1)
- Code remains portable: no intrinsics!



[\*] Speed-up defined as: scalar VECTOR\_SIZE<sub>16</sub> / optimized vector



### **Portability across other HPC platforms**

- Optimizations portable to other architectures
  - "Traditional" cluster (Intel x86)
  - Long-vector architecture (NEC SX-Aurora)

Legend: RISC-V vector prototype, NEC SX-Aurora, Intel HPC cluster



[\*] Speed-up defined as: vanilla vector / optimized vector

# Take home message

⇒ We leveraged the EPI Software Development Vehicles (SDVs) to study and improve vectorization of a complex CFD code (Alya) written in Fortran

⇒ Vectorization techniques improve performance on EPAC - VEC and are portable

⇒ Similar studies are ongoing for several scientific codes that are part of EU CoEs







#### References



- Mantovani, Filippo, et al. "Software Development Vehicles to enable extended and early co-design: a RISC-V and HPC case of study." International Conference on High Performance Computing. Cham: Springer Nature Switzerland, 2023. <u>https://arxiv.org/abs/2306.01797</u>
- Vizcaino, Pablo, et al. "Short reasons for long vectors in HPC CPUs: a study based on RISC-V." Proceedings of the SC'23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis. 2023. <u>https://arxiv.org/abs/2309.06865</u>
- Vizcaino, Pablo, et al. "RAVE: RISC-V Analyzer of Vector Executions, a QEMU tracing plugin." arXiv preprint arXiv:2409.13639 (2024). <u>https://arxiv.org/abs/2409.13639</u>
- Blancafort, Marc, et al. "Exploiting long vectors with a CFD code: a co-design show case." 2024 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 2024. <u>https://arxiv.org/abs/2411.00815</u>

- https://www.eetimes.com/examining-the-top-five-fallacies-about-risc-v/
- <u>https://www.youtube.com/watch?v=iFlcJFcOJKk</u>




#### **EPI FUNDING**





Sponsored by the Federal Ministry of Education and Research

Funded by the French Government

This research has received funding from the European High Performance Computing Joint Undertaking (JU) under Framework Partnership Agreement No 800928 (European Processor Initiative) and Specific Grant Agreement No 101036168 (EPI SGA2). The JU receives support from the European Union's Horizon 2020 research and innovation programme and from Croatia, France, Germany, Greece, Italy, Netherlands, Portugal, Spain, Sweden, and Switzerland. The EPI-SGA2 project, PCI2022-132935 is also co-funded by MCIN/AEI /10.13039/501100011033 and by the UE NextGenerationEU/PRTR.