King Fahd University of Petroleum & Minerals
College of Computer Sciences and Engineering
Computer Engineering Department
 COE 501 Computer Architecture
Course Project
 
 
Course Projects Working Framework
 
  1. First 1-4 weeks of semester: Students are expected to independently carry out literature Survey on their selected topics, compile information of relevant papers in a progress report PR-1, have an action plan of what they are going to contribution and more specifically on their project implementation, and prepare for a presentation to expose all the above aspects to the class and instructor. During this period, the student is expected to interact with the instructor regarding the major issues of his task while being prepared for his topic and its details. A 15-minute presentation is to be given in week 5 in front of the class and instructor. This part will be rated 25% of project for both presentation and timely submission of PR-1.

  2. From week 5 to week 10, Students are expected to carry out the implementation part of their project by taking into account: (1) his presented or revised action plan, and (2) the instructor observations formulated during or after his presentation. During this period, the student is expected to interact with the instructor regarding the major design issues, difficulties in addressing some parts, resource problems, or any kind of problem that require specific attention to avoid negative impact on the project.  Prepare report PR-2 after adding  the implementation aspects presented above to P-1. A 15-minute presentation involving success and failure design and implementation aspects is to be given in week 11 in front of the class and instructor. This part will be rated 35% of project for both presentation and timely submission of PR-2.

  3. From week 11 to week 14, Students are expected to carry out: (1) debugging and testing of their implementation and refer to instructor in case of problems, (2) carry out performance evaluation of the project with data collection, (3)  revise PR-1 by including the performance evaluation and detailed analysis of collected results, (4) prepare the final project presentation. During this period, the student is expected to interact with the instructor regarding the debugging issues and performance interpretation. A presentation involving overall project will be delivered by the team (or student) before the last day of classes (week 15). The emphasis should be here on the method and results. This part will be rated 40% of project for both presentation and timely submission of PR-3 which is the final report. Here the student is to submit a zipped folder including all of original word reports (not pdf), well documented source code, evaluation data, and reference papers. A demo to the instructor will also be required.

List of projects for T082

  1. Code Restructuring Using the DLX Simulator.
    Investigation of the DLX simulator with respect to (1) take a code like ADI benchmark, write it in C, generate its code in DLX, and collect its run time, (2) restructure the DLX ADI assembly to the best you can, run it, and compare its performance to the compiled version, (3) Try loop unrolling and compare performance. Now you may advice a method for compiler loop restructuring such as the use of loop unrolling. Explain the compiler approach with respect to a source code in DLX and give example. The students will prepare for a presentation within four weeks.

    Suggested implementation:

    1. Consider a set (say 5) of small programs (computation loops) like the (1) Alternating Direction Integration (ADI), (2) the Red-Black Relaxation loop, and (3) 2D Matrix Multiply, etc.

    2. Carry out recurrence analysis of each loop and classify it as LID, B-LCD, and F-LCD.

    3. Restructure the loop to minimize its execution time using the loop unrolling and software pipelining.

    4. Run each of the above programs without restructuring, with loop unrolling (possibly different configurations), and software pipelining (possibly different configurations) and collect performance data form the simulator.

    5. Report performance at all levels after collecting the execution time as the number of clocks. Compare performance for each restructuring approach, and present your discussion and justification.

    6. Suggest formal restructuring rules for loop unrolling and software pipelining.

    7. Write a the program for a tool that takes a loop with a few statement in the body as input and (1) carry out data dependence analysis, (2) check the profitability of loop unrolling and software pipelining under specific dependence conditions, (3) restructure the loop using unrolling and / or software pipelining, (4) run the resulting loops on DLX and show the results.

              Working Students:

    1. Group-1:

     

  2. Study of Benchmarks programs used for Desktop computers.

For the first four weeks the students will carry out following tasks: (1) determine representative types of benchmarks used for Desktop computers which are well defined by their function and generated parameters, (2) classify the benchmarks depending on their objectives, (3) acquire some available benchmarks, run them on a desktop PC, and provide demonstration to the class, and (4) classify some known computers based on their benchmarks performance. The suggested plan is: (1) determine representative benchmarks most commonly used for Desktop computers, (2) classify the benchmarks depending on their objectives such as benchmarking the CPU, the display system, the hard disk, and the NIC and Network protocol, (3) acquire benchmarks as stated in (2) provide some benchmark run during your presentation, comments on the results, and provide a copy of those benchmarks to the students. Define clearly what is benchmarked and how the benchmark is working and provide the meaning of the output, its interpretation and justification. Make sure each considered benchmark is well defined by its function and its input and output parameters are clearly defined. The students will prepare for a presentation within four weeks.

Suggested implementation:

  1. Find available or public domain benchmarks that allows assessing the local performance of a specific issue of Computer Architecture such as the following items which are arranged in descending order of priority: (1) the CPU capability and its instruction pipeline, (2) the SIMD multimedia extension (MMX), (3) the hierarchical memory system, (4) the graphics capability and its AGP card, (5) the network stack protocol, NIC, and network delays, etc. For each considered benchmark we must have information on its load profile and clear interpretation of its output. Avoid benchmarks for which we only have a generic meaning of the outcome.

  2. Run the above benchmarks on specific PCs with the idea of accessing the performance of its architectural implementation not just displaying the output performance of the benchmark. Assess the data in (1) comparative way (output from different benchmarks and for different PCs), and (2) critical thinking (matching obtained performance with known concepts about the specific PC architecture).

  3. Provide copy of the studied benchmarks and the report. Write a report about the above experiments, comparative tables, discussions, and attach the benchmarks or indicate where to find them.

 Working Students:

  1. Group-1:

3.  Study of Techniques used for Improving Performance of Instruction Pipelining.

For the first four weeks the students carry out (1) Surveying of I-pipelining techniques in at two recent processors, (2) review of dynamic execution models, (3) methods of resolving hazards, and (4) compiler techniques to I-pipelining. The suggested plan is: (1) Surveying of I-pipelining techniques in recently proposed micro-architectures (last ten years), (2) how structural, data, and control hazards are resolved and at what level, (3) main proposed features like branch-prediction, speculation, etc., (4) provide a comparison of these micro-architectures with respect to major aspects especially expected performance and limitations. Implement a branch-prediction table with 2-bit history and evaluate performance by using typical loop structures. The students will prepare for a presentation within four weeks.

Select one of the following implementations:

  1. Select one branch prediction algorithm from a recent research paper and implement it. Carry out performance evaluation by using a statistically generated branch behavior. For example, the probability of some instruction to be a CB is 0.2 and that the branch has a Set of statistical T/NT patterns (for example pattern 1110010101 prob. taken is 0.6). Evaluate performance of the used branch prediction and suggest some improvement.

  2. Select one correlating branch prediction algorithm from a recent research paper and implement it. Carry out performance evaluation by using a statistically generated branch behavior. For example, set up a a set of correlation rules between 3 or more CBs that are hidden to the computer. Each correlation element indicate the T/NT of each of the CBs. You may randomly a given element at a start of first CB. The problem is to evaluate the performance of the algorithm when the set correlated branches behaves as above. Study performance (correct prediction as compared incorrect ones) and suggest some algorithm tuning.

Working Students:

  1. Group-1: Mr. Dia Eddin AbuZeina

The student is recommended to:

  • Carry out literature survey including the papers in this folder to be included in PR2.

  • The relevant papers Folder  including the student report with his presentation and the two reference papers references

  • The student is recommended to implement a correlating branch predictor to improve the performance as compared to the previously implemented by the student (see item above).

  • My recommendation is to carry out following tasks:

    ·     Task-1: carefully read (1) the class presentation, (2) the book, and (3) the references used by the previous student (see Folder above).

    ·     Task-2: Analyze how the student did his implementation (code, presentation, and report): generation of branching cases, test program, collecting results. The student work need not to be perfect but we need to see what he did and to improve the student approach in the implementation. Also please implement the various schemes that were presented in the lecture. Please improve his implementation based on your views and the views I suggested to you.

4. Design of a Simulator for a Pipelined Dynamic Execution Processor.

Dynamic execution consists of a multiple functional units (FUs) with an issue unit (IU) that adopts an approach to resolve resource conflicts and data hazards (RAW and WAR) as well as providing a scheme for sending produced data out of arithmetic units and memory to potential consumers (instructions). The state of the machine should be tracked by a clock to determine availability of data and resources. The project is on writing a simulator for the above processor while allowing the user to set up the processor parameters (all kind of processor latencies) depending on need. For this we may adopt a simple RISC assembler Instruction Set like the MIPS or others. The simulator will be tested using small programs. The suggested implementation: (1) searching the Internet for similar simulators, (2) Review of dynamic execution techniques for resolving hazard, (3) review of the simulator software aspects like the event-driven architecture, (4) Review of  the pipeline architecture and data simulator structure for the pipeline within the IU and FUs, (5) implement the issue logic and reservation stations within the simulator including the mechanism for resolving hazards, (6) define performance metrics, and (7) evaluate performance using small programs written in Assembly language.

Working Students:

  1. Group-1: Mr. Fayez Al Saadi, ID: g200804900

  2. Group-2: Mr. Mohammad Sallout, (ID# 217059), Mobile: 0500189750, mhd2111@gmail.com

The student is recommended to:

  • Carry out literature survey including the papers in this folder to be included in PR2.

  • http://en.wikipedia.org/wiki/UltraSPARC_T1

  • Designing and implementing an event-driven simulator for a Tomasolu micro-architecture running MIPS. The following issues need to be addressed: (1) the simulator data structure (to facilitate register renaming) for the various ISA components like instruction memory (IM), register file (RF), functional units such as MIPS 4000 (FUs), the data memory, the reservation stations which are attached to the FUs, statuses (busy/free) of all needed resource, etc., (2) the major program units like the Issue unit (IS), Execution (E), Write Back (WB), and common bus (CDB). It is recommended to consider each issued instruction as an event that undergoes a number of states separated by a number of clocks (letency). The latencies for all operations must be described in one external file that is read once when the simulator is run. IS generates events (issued instructions) which are stored in a server queue (time) with a time stamp (availability of producer data). An event will transit from one state (example waiting for data, executing, WB, etc) until it completes.  The simulator should be tested for Correctness with small MIPS programs such as those found tin the book. The above work is expected for presentation (week 11) and be documented in PR2. 

  • The student may choose to improve an existing simulator (see instructor) starting from the simulator code, getting familiar with the code, experiencing the simulation and its correctness, and upgrading the simulator in one aspect to be agreed upon.  For this option download this available Folder

5.  Investigate and Parallel Processing using the E1350 IBM eServer cluster.

The KFUPM-ITC has a High-Performance Computing System (HPC) which consists of a 128-node Dual-boot Cluster with following resources (see URL: http://www.kfupm.edu.sa/hpc/): 

          The technical staff of the HPC can be contacted at hpc@kfupm.edu.sa

          Background:

IBM Cluster Compiling Systems: A HPC cluster is provided with FORTRAN 77 (Portland Group compiler pgf77), Fortran 90 (pgf90), C (pgcc), and C++ (pgCC) compilers in addition to either of the Intel or / and Portland Group suites of optimizing compilers (faster code than that GNU compilers).Generally these different produce Linux executable for each of the Portland Group and Intel compilers. There are also compiler optimization options to produce more optimized executable codes.

Parallelization over one Multi-Core Node Acting as a Shared-Memory Multiprocessor: Users can automatically optimize single-node sequential programs for shared-memory parallel execution using the Portland Group -Mconcur or Intel -parallel compiler option (pgcc -O2 -Mconcur sample.c or icc -O2 -parallel sample.c). Both the Fortran and C/C++ compilers understand the OpenMP set of directives, which give the programmer a finer control over the parallelization. The -mp (Portland Group) and -openmp (Intel) compiler options activate translation of source-level OpenMP directives and pragmas. This allows compiling for OpenMP threaded execution, runs the executable using a number of threads on 1 node (multi-core), each thread on a separate core.

Parallelization over Distributed-Memory System using the Message Passing Interface (MPI): The system uses the MPICH implementation of the Message Passing Interface (MPI), generally optimized for the high-speed Infiniband interconnect. MPI is a standard library for performing parallel processing using a distributed-memory model. Each program file using MPI must. Near the beginning of each C or Fortran source file we must include the MPI header file. Generally an MPI wrapper scripts (Portland Group or Intel compilers) is loaded prior to executing the compilation command. The MPI compilers take the same options as the compiler they wrap. A command “mpiexec” (the -pernode for command line options) spawns one MPI process per CPU in a batch job. The -pernode option requests that one MPI process be spawned per node. These options are intended to be used for codes which mix MPI message passing with some form of shared memory programming model, such as OpenMP or POSIX threads. Finally a GNU debugger gdb is recommended for interactive or post analysis of sequential programs.

The project is to:

  1. Explore the e1350 IBM eServer cluster system architecture, processor node architecture and resource, inter-node interconnection, inter-node communication.

  2. Write a document for other users on how to provide KFUPM desktop access to ITC-HPC, needed configuration and show some demos. You may conmtact technical staff from the ITC-KFUPM:  Mr. Tariq Maghribi (Project head with phone 3979),  Mr. Frahan (7325), and Mr. Nabeel (3910).

  3. Check the existence of FORTRAN 77 (Portland Group compiler pgf77), Fortran 90 (pgf90), C (pgcc), and C++ (pgCC) compilers in addition to either of the Intel or / and Portland Group (PG) suites of optimizing compilers (faster code than that GNU compilers) and report to the course.
  4. If the above are available, make sure that the automatic optimization of single-node sequential programs for shared-memory parallel execution using the OpenMP set of directives under the PG or Intel compiler option is available and report to the course. In this case, one student is to study in details the OpenMP set of directives within in C programs.
  5. Check the existence of the MPICH implementation of MPI (for Infiniband interconnect or what!) so that it can be integrated within each C or Fortran source. Make sure documentation is available. In this case, one student is to study in details MPICH library and its application in C programs.
  6. Using parallel compiler, generate parallel code for a specific set of processors (sub-cluster or while cluster if possible) based on sequential program written in C-like such as the Alternating Direction Integration (ADI), the Red-Black Relaxation loop, and Matrix Multiply, etc., Collect results (like execution time of parallelized programs) and evaluate performance of  e1350 IBM eServer cluster and used parallel compiler. 

Resource:

Working Students:

  1. Group-1: Mr. Ameer Ahmed Abbasi (ID: 200703670)

  2. Group-2:  Mr. Abdulhadi A. Al Ghanem

  3. Group-3: Mr. Irfan Ali Khan, (ID# G200802800), g200802800@kfupm.edu.sa, 0530058907

  4. Group-4: Mr. ADENIYE, Suli Charles, (Id# g20080394)

The students are recommended to:

  • Carry out literature survey including the papers in this folder-1  and folder-2 to be included in PR2.

 

 

 

6.  Developing a Software Tool to support OpenMP in Promoting Parallelism in Loops.

OpenMp is a set of compiler directives that are inserted by the user in a sequential program (Fortran 77, 90, C, C++) to produce a multithreaded execution on Shared -Memory multiprocessors (SMP). The target of OpenMP multithreaded parallelization can be a node formed by a number of core processors sharing memory in a cluster supercomputer. Loops are the main source of parallelism. The major constraints to loop parallelism is the presence of recurrences which can be LID, backward-LCD, and forward-LCD. OpenMP is concerned with converting the sequential loop into a multithreaded execution form. However, dependence analysis (DA), recurrence classification (RC), and loop restructuring (LR) like loop distribution and fusion are still under the responsibility of the user. Non Computer Science users cannot carry out the above tasks. Therefore, the need for a software tool that takes as inputs a source sequential program and promotes parallelism by applying DA, RC, and LR.  The tool is to generate a restructured code that will be used by the user to include the OpenMP directives before final compilation and execution.  

          Some of the project steps are to:

  1. Investigate the tasks performed by OpenMP. To what extent OpenMP is concerned with dependence analysis, recurrence classification, and loop restructuring. Do the above fully fall under the user responsibility. To what extent OpenMP is concerned with the correctness of the program  results due to the potential loop recurrences especially in loop embedding recurrences and blindly subjected to OpenMp parallelization!  Focus on OpenMP operations over loops.

  2. In the case OpenPM is not concerned with the above analysis (DA, RC, and LR), the student develop an algorithm to carry out dependence analysis, recurrence classification, and loop restructuring for loops and generation of a parallelization report for loop written in C.  You may limit the scope to linear dependence analysis and consider complex dependences as LCD.

  3. Implement the tool and experience its application by using Loops written in simple C programs. Carry out performance evaluation of programs that were restructured using the tool before exposing to OpenMP and without using the tool and compare their execution time and speedup.

Resource:

Working Students:

  1. Group-1: Mr. Abdul-Hakim Qahtan, kahtani@kfupm.edu.sa, phone 0505624032 and Mr. Fahd Al-Haidari [fahdhyd@yahoo.com] (ID # g200203820) and Phone: 0501371232

In the second period of the project, the students are recommended to: