## COE 308 – Computer Architecture

## Final Exam - Fall 2008

Saturday, February 7, 2009 7:30 – 10:00 AM

Computer Engineering Department
College of Computer Sciences & Engineering
King Fahd University of Petroleum & Minerals

| Student Name: |  |  |
|---------------|--|--|
|               |  |  |
|               |  |  |
| Student ID:   |  |  |

| Q1    | / 20 | Q2  | / 15 |
|-------|------|-----|------|
| Q3    | / 20 | Q4  | / 20 |
| Q5    | / 25 |     |      |
| Total |      | / 1 | 00   |

## Important Reminder on Academic Honesty

Using unauthorized information on an exam, peeking at others work, or altering graded exams to claim more credit are severe violations of academic honesty. Detected cases will receive a failing grade in the course.

| <b>Q1.</b> (2 | 0 points) True or False? Explain or give the right answer for a full mark.                                                                                                                                                                                                 |
|---------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| a)            | On a read, the value returned by the cache depends on which blocks are in the cache.                                                                                                                                                                                       |
| <b>b</b> )    | Most of the cost of the memory hierarchy is at the L1 cache.                                                                                                                                                                                                               |
| c)            | The higher the memory bandwidth, the larger the cache block size should be.                                                                                                                                                                                                |
| d)            | In reducing cache misses, capacity is more important than associativity.                                                                                                                                                                                                   |
| e)            | Compulsory cache misses can be reduced.                                                                                                                                                                                                                                    |
| f)            | Allowing ALU and branch instructions to take fewer stages and complete earlier than other instructions does not improve the performance of a pipeline.                                                                                                                     |
| g)            | Increasing the depth of pipelining by splitting stages always improves performance.                                                                                                                                                                                        |
| h)            | The single-cycle datapath must have separate instruction and data memories because the format of instructions and data is different.                                                                                                                                       |
| i)            | A given application runs in 15 seconds. A new compiler is released that requires only 0.6 as many instructions as the old compiler. Unfortunately, it increases the CPI by 1.1. We expect the application to run using this new compiler in $15 \times 0.6/1.1 = 8.18$ sec |
| j)            | If Computer A has a higher MIPS rating than computer B, then A is faster than B.                                                                                                                                                                                           |

| Q2.        | (15 pts) Consider a <b>direct-mapped</b> cache with <b>128 blocks</b> . The block size is <b>32 bytes</b> .                       |
|------------|-----------------------------------------------------------------------------------------------------------------------------------|
| a)         | (3 pts) Find the number of tag bits, index bits, and offset bits in a 32-bit address.                                             |
|            |                                                                                                                                   |
|            |                                                                                                                                   |
|            |                                                                                                                                   |
| <b>b</b> ) | (4 pts) Find the number of bits required to store all the valid and tag bits in the cache.                                        |
|            |                                                                                                                                   |
|            |                                                                                                                                   |
|            |                                                                                                                                   |
|            |                                                                                                                                   |
| c)         | (8 pts) Given the following sequence of address references in decimal:                                                            |
|            | 20000, 20004, 20008, 20016, 24108, 24112, 24116, 24120                                                                            |
|            | Starting with an <b>empty cache</b> , show the <b>index</b> and <b>tag</b> for each address and indicate whether a hit or a miss. |
|            |                                                                                                                                   |
|            |                                                                                                                                   |
|            |                                                                                                                                   |

|            | cycles due to cache misses. Load and store instructions count 30% of all instructions.                                                                                                                                                           |
|------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|            | The processor has an I-cache and a D-cache. The hit time is 1 clock cycle. The I-cache has a 2% miss rate. The D-cache has a 5% miss rate on load and store instructions.                                                                        |
|            | The miss penalty is 50 ns, which is the time to access and transfer a cache block between main memory and the processor.                                                                                                                         |
| a)         | (3 pts) What is the average memory access time for instruction access in clock cycles?                                                                                                                                                           |
|            |                                                                                                                                                                                                                                                  |
|            |                                                                                                                                                                                                                                                  |
| b)         | (3 pts) What is the average memory access time for data access in clock cycles?                                                                                                                                                                  |
|            |                                                                                                                                                                                                                                                  |
| a)         | (A nts) What is the manh or of stell evalue non-instruction and the event CDIO                                                                                                                                                                   |
| <b>c</b> ) | (4 pts) What is the number of stall cycles per instruction and the overall CPI?                                                                                                                                                                  |
|            |                                                                                                                                                                                                                                                  |
|            |                                                                                                                                                                                                                                                  |
|            | Suppose we add now an L2 cache that has a hit time of 5 ns, which is the time to access and transfer a block between the L2 and the L1 cache. Of all the memory references sent to the L2 cache, 80% are satisfied without going to main memory. |
| d)         | (4 pts) What is the average memory access time for instruction access in clock cycles?                                                                                                                                                           |
|            |                                                                                                                                                                                                                                                  |
|            |                                                                                                                                                                                                                                                  |
| e)         | (4 pts) What is the number of stall cycles per instruction and the overall CPI?                                                                                                                                                                  |
|            |                                                                                                                                                                                                                                                  |
|            |                                                                                                                                                                                                                                                  |
| f)         | (2 pts) How much faster will the machine be after adding the L2 cache?                                                                                                                                                                           |
|            |                                                                                                                                                                                                                                                  |

Q3. (20 pts) A processor runs at 2 GHz and has a CPI of 1.2 without including the stall

Q4. (20 pts) A program consists of two nested loops: a smaller inner loop and a larger outer loop. The general structure of the program is shown below. All memory addresses are shown in decimal. Each decimal address points to an instruction in memory. All memory locations in the various sections contain instructions to be executed in straight-line sequencing. Instructions 128→255 form the inner loop and are repeated 20 times for each pass of the outer loop. Instructions 64→127 and 256→575 form the outer loop and are repeated 10 times. The program is to be run on a computer that has a direct-mapped instruction-cache with 64 blocks, where each block can store 4 instructions. The hit time is 1 clock cycle and the miss penalty is 20 cycles.



**a)** (6 pts) What is the total instruction count and how many I-cache misses are caused by the program?

| • \ | Page 6 of 9                                                                                                                                                                  |
|-----|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| b)  | (2 pts) What is the I-cache miss rate?                                                                                                                                       |
|     |                                                                                                                                                                              |
|     |                                                                                                                                                                              |
|     |                                                                                                                                                                              |
|     |                                                                                                                                                                              |
|     |                                                                                                                                                                              |
|     |                                                                                                                                                                              |
| c)  | (4 pts) If only cache misses stall the processor, what is the execution time (in nanoseconds) of the above program on a 2 GHz pipelined processor?                           |
|     |                                                                                                                                                                              |
|     |                                                                                                                                                                              |
|     |                                                                                                                                                                              |
|     |                                                                                                                                                                              |
|     |                                                                                                                                                                              |
|     |                                                                                                                                                                              |
|     |                                                                                                                                                                              |
| d)  | (8 pts) Repeat (a) thru (c) if a bigger block size that can store 8 instructions is used. The total number of blocks in the I-cache is still 64. What is the speedup factor? |
|     |                                                                                                                                                                              |
|     |                                                                                                                                                                              |
|     |                                                                                                                                                                              |
|     |                                                                                                                                                                              |
|     |                                                                                                                                                                              |
|     |                                                                                                                                                                              |
|     |                                                                                                                                                                              |
|     |                                                                                                                                                                              |
|     |                                                                                                                                                                              |
|     |                                                                                                                                                                              |

**Q5** (25 pts) Use the following MIPS code fragment:

```
$3, $0, 100
I1:
     ADDI
                                  # $3 = 100
             $4, $0, $0
I2:
     ADD
                                  # $4 = 0
Loop:
             $5, 0($1)
I3:
     LW
                                  # $5 = MEM[$1]
             $4, $4, $5
                                  # $4 = $4 + $5
I4:
     ADD
             $6, 0($2)
                                  # $6 = MEM[$2]
I5:
     LW
                             # $4 = $4 - $6
             $4, $4, $6
I6:
     SUB
             $1, $1, 4
                                  # $1 = $1 + 4
I7:
     ADDI
I8:
     ADDI
             $2, $2, 4
                                  # $2 = $2 + 4
             $3, $3, -1
                                  # $3 = $3 - 1
I9:
     ADDI
I10: BNE
             $3, $0, Loop
                                  # if ($3 != 0) goto Loop
```

a) (10 pts) Show the timing of one loop iteration on the 5-stage MIPS pipeline without forwarding hardware. Complete the timing table, showing all the stall cycles. Assume that the branch will stall the pipeline for 1 clock cycle only.

|          | 1  | 2  | 3  | 4 | 5  | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 |
|----------|----|----|----|---|----|---|---|---|---|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|
| I1: ADDI | IF | ID | EX | M | WB |   |   |   |   |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |
| I2: ADD  |    |    |    |   |    |   |   |   |   |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |
| I3: LW   |    |    |    |   |    |   |   |   |   |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |
| I4: ADD  |    |    |    |   |    |   |   |   |   |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |
| I5: LW   |    |    |    |   |    |   |   |   |   |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |
| I6: SUB  |    |    |    |   |    |   |   |   |   |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |
| I7: ADDI |    |    |    |   |    |   |   |   |   |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |
| I8: ADDI |    |    |    |   |    |   |   |   |   |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |
| I9: ADDI |    |    |    |   |    |   |   |   |   |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |
| I10: BNE |    |    |    |   |    |   |   |   |   |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |
| I3: LW   |    |    |    |   |    |   |   |   |   |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |
| I4: ADD  |    | ·  |    |   |    |   |   |   |   |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |

| b) | (5 pts) According to the timing diagram of part (a), compute the number of clock cycles and the average CPI to execute ALL the iterations of the above loop.                 |
|----|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|    |                                                                                                                                                                              |
|    |                                                                                                                                                                              |
|    |                                                                                                                                                                              |
| ۵  | (5 mts) Doordon the instructions of the chave loop to fill the look delay and the branch                                                                                     |
| c) | (5 pts) Reorder the instructions of the above loop to fill the load-delay and the branch-delay slots, without changing the computation. Write the code of the modified loop. |
|    |                                                                                                                                                                              |
|    |                                                                                                                                                                              |
|    |                                                                                                                                                                              |
|    |                                                                                                                                                                              |
|    |                                                                                                                                                                              |
| d) | (5 pts) Compute the number of cycles and the average CPI to execute ALL the iteration of the modified loop. What is the speedup factor?                                      |
|    |                                                                                                                                                                              |
|    |                                                                                                                                                                              |