First 1-4 weeks of semester: Students are expected to independently carry out literature Survey on their selected topics, compile information of relevant papers in a progress report PR-1, have an action plan of what they are going to contribution and more specifically on their project implementation, and prepare for a presentation to expose all the above aspects to the class and instructor. During this period, the student is expected to interact with the instructor regarding the major issues of his task while being prepared for his topic and its details. A 15-minute presentation is to be given in week 5 in front of the class and instructor. This part will be rated 25% of project for both presentation and timely submission of PR-1.
From week 5 to week 10, Students are expected to carry out the implementation part of their project by taking into account: (1) his presented or revised action plan, and (2) the instructor observations formulated during or after his presentation. During this period, the student is expected to interact with the instructor regarding the major design issues, difficulties in addressing some parts, resource problems, or any kind of problem that require specific attention to avoid negative impact on the project. Prepare report PR-2 after adding the implementation aspects presented above to P-1. A 15-minute presentation involving success and failure design and implementation aspects is to be given in week 11 in front of the class and instructor. This part will be rated 35% of project for both presentation and timely submission of PR-2.
From week 11 to week 14, Students are expected to carry out: (1) debugging and testing of their implementation and refer to instructor in case of problems, (2) carry out performance evaluation of the project with data collection, (3) revise PR-1 by including the performance evaluation and detailed analysis of collected results, (4) prepare the final project presentation. During this period, the student is expected to interact with the instructor regarding the debugging issues and performance interpretation. A presentation involving overall project will be delivered by the team (or student) before the last day of classes (week 15). The emphasis should be here on the method and results. This part will be rated 40% of project for both presentation and timely submission of PR-3 which is the final report. Here the student is to submit a zipped folder including all of original word reports (not pdf), well documented source code, evaluation data, and reference papers. A demo to the instructor will also be required.
List of projects for T082
Suggested implementation:
Consider a set (say 5) of small programs (computation loops) like the (1) Alternating Direction Integration (ADI), (2) the Red-Black Relaxation loop, and (3) 2D Matrix Multiply, etc.
Carry out recurrence analysis of each loop and classify it as LID, B-LCD, and F-LCD.
Restructure the loop to minimize its execution time using the loop unrolling and software pipelining.
Run each of the above programs without restructuring, with loop unrolling (possibly different configurations), and software pipelining (possibly different configurations) and collect performance data form the simulator.
Report performance at all levels after collecting the execution time as the number of clocks. Compare performance for each restructuring approach, and present your discussion and justification.
Suggest formal restructuring rules for loop unrolling and software pipelining.
Write a the program for a tool that takes a loop with a few statement in the body as input and (1) carry out data dependence analysis, (2) check the profitability of loop unrolling and software pipelining under specific dependence conditions, (3) restructure the loop using unrolling and / or software pipelining, (4) run the resulting loops on DLX and show the results.
Working Students:
- Group-1:
For the first four weeks the students will carry out following tasks: (1) determine representative types of benchmarks used for Desktop computers which are well defined by their function and generated parameters, (2) classify the benchmarks depending on their objectives, (3) acquire some available benchmarks, run them on a desktop PC, and provide demonstration to the class, and (4) classify some known computers based on their benchmarks performance. The suggested plan is: (1) determine representative benchmarks most commonly used for Desktop computers, (2) classify the benchmarks depending on their objectives such as benchmarking the CPU, the display system, the hard disk, and the NIC and Network protocol, (3) acquire benchmarks as stated in (2) provide some benchmark run during your presentation, comments on the results, and provide a copy of those benchmarks to the students. Define clearly what is benchmarked and how the benchmark is working and provide the meaning of the output, its interpretation and justification. Make sure each considered benchmark is well defined by its function and its input and output parameters are clearly defined. The students will prepare for a presentation within four weeks.
Suggested implementation:
Find available or public domain benchmarks that allows assessing the local performance of a specific issue of Computer Architecture such as the following items which are arranged in descending order of priority: (1) the CPU capability and its instruction pipeline, (2) the SIMD multimedia extension (MMX), (3) the hierarchical memory system, (4) the graphics capability and its AGP card, (5) the network stack protocol, NIC, and network delays, etc. For each considered benchmark we must have information on its load profile and clear interpretation of its output. Avoid benchmarks for which we only have a generic meaning of the outcome.
Run the above benchmarks on specific PCs with the idea of accessing the performance of its architectural implementation not just displaying the output performance of the benchmark. Assess the data in (1) comparative way (output from different benchmarks and for different PCs), and (2) critical thinking (matching obtained performance with known concepts about the specific PC architecture).
Provide copy of the studied benchmarks and the report. Write a report about the above experiments, comparative tables, discussions, and attach the benchmarks or indicate where to find them.
Working Students:
- Group-1:
3. Study of Techniques used for Improving Performance of Instruction Pipelining.
For the first four weeks the students carry out (1) Surveying of I-pipelining techniques in at two recent processors, (2) review of dynamic execution models, (3) methods of resolving hazards, and (4) compiler techniques to I-pipelining. The suggested plan is: (1) Surveying of I-pipelining techniques in recently proposed micro-architectures (last ten years), (2) how structural, data, and control hazards are resolved and at what level, (3) main proposed features like branch-prediction, speculation, etc., (4) provide a comparison of these micro-architectures with respect to major aspects especially expected performance and limitations. Implement a branch-prediction table with 2-bit history and evaluate performance by using typical loop structures. The students will prepare for a presentation within four weeks.
Select one of the following implementations:
Select one branch prediction algorithm from a recent research paper and implement it. Carry out performance evaluation by using a statistically generated branch behavior. For example, the probability of some instruction to be a CB is 0.2 and that the branch has a Set of statistical T/NT patterns (for example pattern 1110010101 prob. taken is 0.6). Evaluate performance of the used branch prediction and suggest some improvement.
Select one correlating branch prediction algorithm from a recent research paper and implement it. Carry out performance evaluation by using a statistically generated branch behavior. For example, set up a a set of correlation rules between 3 or more CBs that are hidden to the computer. Each correlation element indicate the T/NT of each of the CBs. You may randomly a given element at a start of first CB. The problem is to evaluate the performance of the algorithm when the set correlated branches behaves as above. Study performance (correct prediction as compared incorrect ones) and suggest some algorithm tuning.
Working Students:
Group-1: Mr. Dia Eddin AbuZeina
The student is recommended to:
Carry out literature survey including the papers in this folder to be included in PR2.
The relevant papers Folder including the student report with his presentation and the two reference papers references
The student is recommended to implement a correlating branch predictor to improve the performance as compared to the previously implemented by the student (see item above).
My recommendation is to carry out following tasks:
· Task-1: carefully read (1) the class presentation, (2) the book, and (3) the references used by the previous student (see Folder above).
· Task-2: Analyze how the student did his implementation (code, presentation, and report): generation of branching cases, test program, collecting results. The student work need not to be perfect but we need to see what he did and to improve the student approach in the implementation. Also please implement the various schemes that were presented in the lecture. Please improve his implementation based on your views and the views I suggested to you.
4. Design of a Simulator for a Pipelined Dynamic Execution Processor.
Dynamic execution consists of a multiple functional units (FUs) with an issue unit (IU) that adopts an approach to resolve resource conflicts and data hazards (RAW and WAR) as well as providing a scheme for sending produced data out of arithmetic units and memory to potential consumers (instructions). The state of the machine should be tracked by a clock to determine availability of data and resources. The project is on writing a simulator for the above processor while allowing the user to set up the processor parameters (all kind of processor latencies) depending on need. For this we may adopt a simple RISC assembler Instruction Set like the MIPS or others. The simulator will be tested using small programs. The suggested implementation: (1) searching the Internet for similar simulators, (2) Review of dynamic execution techniques for resolving hazard, (3) review of the simulator software aspects like the event-driven architecture, (4) Review of the pipeline architecture and data simulator structure for the pipeline within the IU and FUs, (5) implement the issue logic and reservation stations within the simulator including the mechanism for resolving hazards, (6) define performance metrics, and (7) evaluate performance using small programs written in Assembly language.
Working Students:
Group-1: Mr. Fayez Al Saadi, ID: g200804900
Group-2: Mr. Mohammad Sallout, (ID# 217059), Mobile: 0500189750, mhd2111@gmail.com
The student is recommended to:
Carry out literature survey including the papers in this folder to be included in PR2.
Designing and implementing an event-driven simulator for a Tomasolu micro-architecture running MIPS. The following issues need to be addressed: (1) the simulator data structure (to facilitate register renaming) for the various ISA components like instruction memory (IM), register file (RF), functional units such as MIPS 4000 (FUs), the data memory, the reservation stations which are attached to the FUs, statuses (busy/free) of all needed resource, etc., (2) the major program units like the Issue unit (IS), Execution (E), Write Back (WB), and common bus (CDB). It is recommended to consider each issued instruction as an event that undergoes a number of states separated by a number of clocks (letency). The latencies for all operations must be described in one external file that is read once when the simulator is run. IS generates events (issued instructions) which are stored in a server queue (time) with a time stamp (availability of producer data). An event will transit from one state (example waiting for data, executing, WB, etc) until it completes. The simulator should be tested for Correctness with small MIPS programs such as those found tin the book. The above work is expected for presentation (week 11) and be documented in PR2.
The student may choose to improve an existing simulator (see instructor) starting from the simulator code, getting familiar with the code, experiencing the simulation and its correctness, and upgrading the simulator in one aspect to be agreed upon. For this option download this available Folder.
5. Investigate and Parallel Processing using the E1350 IBM eServer cluster.
The KFUPM-ITC has a High-Performance Computing System (HPC) which consists of a 128-node Dual-boot Cluster with following resources (see URL: http://www.kfupm.edu.sa/hpc/):
- 128 compute-node e1350 IBM eServer cluster.
- The cluster is unique in its dual-boot capability with Microsoft Windows HPC Server 2008 and Red Hat Enterprise Linux 5 operating systems.
- The cluster has 3 master nodes, one for Red Hat Linux, one for Windows HPC Server 2008 and one for cluster management.
- The cluster has 128 compute nodes.
- Each compute node of the cluster is dual-processor having two 2.0 GHz x3550 Xeon Quad-core E5405 processors.
- The total number of cores in the cluster is 1024.
- Each master node has 1 TB of hard disk space.
- Each compute node has 500 GB of hard disk space.
- Each master node has 8 GB of RAM.
- Each compute node has 4 GB of RAM.
- The interconnect is 10 GBASE-SR
The technical staff of the HPC can be contacted at hpc@kfupm.edu.sa
Background:
IBM Cluster Compiling Systems: A HPC cluster is provided with FORTRAN 77 (Portland Group compiler pgf77), Fortran 90 (pgf90), C (pgcc), and C++ (pgCC) compilers in addition to either of the Intel or / and Portland Group suites of optimizing compilers (faster code than that GNU compilers).Generally these different produce Linux executable for each of the Portland Group and Intel compilers. There are also compiler optimization options to produce more optimized executable codes.
Parallelization over one Multi-Core Node Acting as a Shared-Memory Multiprocessor: Users can automatically optimize single-node sequential programs for shared-memory parallel execution using the Portland Group -Mconcur or Intel -parallel compiler option (pgcc -O2 -Mconcur sample.c or icc -O2 -parallel sample.c). Both the Fortran and C/C++ compilers understand the OpenMP set of directives, which give the programmer a finer control over the parallelization. The -mp (Portland Group) and -openmp (Intel) compiler options activate translation of source-level OpenMP directives and pragmas. This allows compiling for OpenMP threaded execution, runs the executable using a number of threads on 1 node (multi-core), each thread on a separate core.
Parallelization over Distributed-Memory System using the Message Passing Interface (MPI): The system uses the MPICH implementation of the Message Passing Interface (MPI), generally optimized for the high-speed Infiniband interconnect. MPI is a standard library for performing parallel processing using a distributed-memory model. Each program file using MPI must. Near the beginning of each C or Fortran source file we must include the MPI header file. Generally an MPI wrapper scripts (Portland Group or Intel compilers) is loaded prior to executing the compilation command. The MPI compilers take the same options as the compiler they wrap. A command “mpiexec” (the -pernode for command line options) spawns one MPI process per CPU in a batch job. The -pernode option requests that one MPI process be spawned per node. These options are intended to be used for codes which mix MPI message passing with some form of shared memory programming model, such as OpenMP or POSIX threads. Finally a GNU debugger gdb is recommended for interactive or post analysis of sequential programs.
The project is to:
Explore the e1350 IBM eServer cluster system architecture, processor node architecture and resource, inter-node interconnection, inter-node communication.
Write a document for other users on how to provide KFUPM desktop access to ITC-HPC, needed configuration and show some demos. You may conmtact technical staff from the ITC-KFUPM: Mr. Tariq Maghribi (Project head with phone 3979), Mr. Frahan (7325), and Mr. Nabeel (3910).
- Check the existence of FORTRAN 77 (Portland Group compiler pgf77), Fortran 90 (pgf90), C (pgcc), and C++ (pgCC) compilers in addition to either of the Intel or / and Portland Group (PG) suites of optimizing compilers (faster code than that GNU compilers) and report to the course.
- If the above are available, make sure that the automatic optimization of single-node sequential programs for shared-memory parallel execution using the OpenMP set of directives under the PG or Intel compiler option is available and report to the course. In this case, one student is to study in details the OpenMP set of directives within in C programs.
- Check the existence of the MPICH implementation of MPI (for Infiniband interconnect or what!) so that it can be integrated within each C or Fortran source. Make sure documentation is available. In this case, one student is to study in details MPICH library and its application in C programs.
Using parallel compiler, generate parallel code for a specific set of processors (sub-cluster or while cluster if possible) based on sequential program written in C-like such as the Alternating Direction Integration (ADI), the Red-Black Relaxation loop, and Matrix Multiply, etc., Collect results (like execution time of parallelized programs) and evaluate performance of e1350 IBM eServer cluster and used parallel compiler.
Resource:
The Ohio Supercomputing Center: http://www.osc.edu/supercomputing/training/ see documentation and courses.
Papers selected by the instructor: Download zipped folder
Ameer A Abbasi: An online article on the internet explaining how to begin Parallel Programming With OpenMP in MS Visual Studio Professional C++ 2005/2008 or better. MS Visual Studio C++ supports the OpenMP 2.0 standard and provides various functions to set the OMP_NUM_THREADS Environment Variable for openMP such as to get the number of processors in the system [omp_get_num_procs()], to set the number of threads [omp_set_num_threads(int number)] and so on. The OpenMP C and C++ application program interface lets you write applications that effectively use multiple processors. By using OpenMP, you can gain performance on multi-core systems for free, without much coding other than a line or too. There is no excuse not to use OpenMP. The benefits are there, and the coding is simple. In the following please see the link to the article explaining how to begin Parallel Programming With OpenMP in MS Visual C++. http://www.codeproject.com/KB/cpp/BeginOpenMP.aspx
To see the directives, clauses and constructs used in the OpenMP API please visit the following link. The link will lead you to the openMP reference page in MSDN library from Microsoft. http://msdn.microsoft.com/en-us/library/tt15eb9t.aspx
Ameer A Abbasi: I just came to know that CCSE has a cluster with 20 nodes running Linux red hat operating system. This cluster has MPICH 2.0 facility available to compile C++ (gcc4.0) and FORTRAN programs and perform parallel processing using a distributed-memory model. Today, I met the Unix administrator in CCSE and sit with him for a while to learn about it. He has created an account for me on this cluster and set the path and directories to run the programs. In addition, he added some example programs and help files on how to compile and run the programs with MPICH 2.0 smoothly. I suggest we should first try this cluster in CCSE and then go to ITC for further investigation. Any student can meet Unix administrator during office timings to create Linux account for him and to set the initial path and directory.
Reply from instructor: Prog. MPICH for Distr. Mem is not a very creative activity as it just requires repeatedly inserting a Send or Receive at the producer and consumer processes after the prog is parallelized by hand. This Distr- Mem paradigm exposes only Coarse-Grain Parallelism. This is not very interesting from research. Shared-memory exposes Fine-Grain Parallelism which is much more promising as it take place at the loop iteration level. Recurrence analysis is then useful to extract more parallelism. Please ask me in the class to provide more details on this issue. Overall I do not recommend only working with MPICH on a Disr-Mem system, even if it called cluster. My approach is to address fine-grain first using OpenMP (threading) and at program task level address the Coarse-grain using MPICH, e.g. hierarchical parallelization.
Working Students:
Group-1: Mr. Ameer Ahmed Abbasi (ID: 200703670)
Group-2: Mr. Abdulhadi A. Al Ghanem
Group-3: Mr. Irfan Ali Khan, (ID# G200802800), g200802800@kfupm.edu.sa, 0530058907
Group-4: Mr. ADENIYE, Suli Charles, (Id# g20080394)
The students are recommended to:
Carry out literature survey including the papers in this folder-1 and folder-2 to be included in PR2.
6. Developing a Software Tool to support OpenMP in Promoting Parallelism in Loops.
OpenMp is a set of compiler directives that are inserted by the user in a sequential program (Fortran 77, 90, C, C++) to produce a multithreaded execution on Shared -Memory multiprocessors (SMP). The target of OpenMP multithreaded parallelization can be a node formed by a number of core processors sharing memory in a cluster supercomputer. Loops are the main source of parallelism. The major constraints to loop parallelism is the presence of recurrences which can be LID, backward-LCD, and forward-LCD. OpenMP is concerned with converting the sequential loop into a multithreaded execution form. However, dependence analysis (DA), recurrence classification (RC), and loop restructuring (LR) like loop distribution and fusion are still under the responsibility of the user. Non Computer Science users cannot carry out the above tasks. Therefore, the need for a software tool that takes as inputs a source sequential program and promotes parallelism by applying DA, RC, and LR. The tool is to generate a restructured code that will be used by the user to include the OpenMP directives before final compilation and execution.
Some of the project steps are to:
Investigate the tasks performed by OpenMP. To what extent OpenMP is concerned with dependence analysis, recurrence classification, and loop restructuring. Do the above fully fall under the user responsibility. To what extent OpenMP is concerned with the correctness of the program results due to the potential loop recurrences especially in loop embedding recurrences and blindly subjected to OpenMp parallelization! Focus on OpenMP operations over loops.
In the case OpenPM is not concerned with the above analysis (DA, RC, and LR), the student develop an algorithm to carry out dependence analysis, recurrence classification, and loop restructuring for loops and generation of a parallelization report for loop written in C. You may limit the scope to linear dependence analysis and consider complex dependences as LCD.
Implement the tool and experience its application by using Loops written in simple C programs. Carry out performance evaluation of programs that were restructured using the tool before exposing to OpenMP and without using the tool and compare their execution time and speedup.
Resource:
The Ohio Supercomputing Center: http://www.osc.edu/supercomputing/training/ see documentation and courses.
Papers selected by the instructor: Download zipped folder
Working Students:
Group-1: Mr. Abdul-Hakim Qahtan, kahtani@kfupm.edu.sa, phone 0505624032 and Mr. Fahd Al-Haidari [fahdhyd@yahoo.com] (ID # g200203820) and Phone: 0501371232
In the second period of the project, the students are recommended to:
Carry out literature survey including the papers in this folder-1 and folder-2 to be included in PR2.
Determine a data structure (DS) to represent the array data dependencies in the body of a given loop consisting of a set of assignment statements (ASs). DS is to capture the various array invoked. Each array name has a unique representation. A multi-dimensional array indexing is associated to each array name. A linked list is used to represent each assignment statement. Assignment statements are numbered. The DS should allow determining the existence of a recurrence (index space and array name) in one AS or a more general case like a recurrence extending along a few ASs. The DS must allow building the Data-Flow of the loop body as well as the detection of recurrences and their path. Finding out only whether the body (e.g. the ASs) has a recurrence is far insufficient because our objective is to re-organize the ASs into parallelizable and non-parallelizable parts, e.g. partition the ASs into a set of "new" ASs which are LID and other new ASs which have some recurrences. The last operation is the pre-requisite for restructuring the loop, e.g. distribute the loop whenever possible into different loops to promote parallelism. Therefore, the key issues to this section are: (1) the design of a DS, (2) augment the DS with links to represent data-flow (deriving from the ASs), (3) characterize recurrences by some properties, (4) be able to Split and Merge sub-structures according to some rules and record some ordering, and (5) be able to rebuild new assignment statements (labeled as LID, or LCD) after the Split/Merge process. The above work is expected for presentation (week 11) and be documented in PR2.