Different speedup definitions may yield contrary results. When the fastest known algorithm is executed on a single processor, the speedup is called real speedup (it is also called classical or traditional speedup). However, most researchers use what is known as relative speedup, which takes the serial time from the same parallel algorithm run on one processor instead of from the best known serial algorithm [3]. When the best known serial algorithm is run on the fastest machine, the speedup is called absolute speedup [4].
Misleading results can arise when the serial time is larger than the parallel time multiplied by the number of processors, that is, T(N) > P*TP(N). This can happen when the sequential processing requires high memory access time, as in the shared virtual memory machine KSR-1. The results of the performance evaluation of the KSR-1 can be found in [5] and [3]. The impact of the high memory latency in shared virtual memory machines can be reduced using data prefetching and data forwarding, as described in [5] and [6].
The speedup definitions mentioned above are grouped into one type, called fixed-size speedup, since the problem size does not scale with the number of processors. Another type is scaled speedup, which allows the problem size to increase with the number of processors. When measuring scaled speedup on a distributed memory machine, we cannot measure the serial time because of memory limitations. However, by measuring the interprocessor communication time and the idle time on each processor, the efficiency can be computed, and then the scaled speedup is obtained. Scaled speedup was introduced in [7].
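This measurement scheme can be sketched in code. The model below (efficiency as the fraction of total processor time spent neither communicating nor idle, and scaled speedup as P times that efficiency) is our assumption for illustration, not a formula given in the paper; all timings are hypothetical.

```python
# Hedged sketch: recovering efficiency, and hence scaled speedup, without a
# serial run, using per-processor communication and idle times. Assumed
# model: efficiency = useful (busy) time / total processor time.

def scaled_speedup(total_time, comm_times, idle_times):
    p = len(comm_times)
    # time each processor spent doing useful work
    busy = sum(total_time - c - i for c, i in zip(comm_times, idle_times))
    efficiency = busy / (p * total_time)
    return p * efficiency, efficiency

# hypothetical run: 4 processors, 100 s wall clock each
sp, eff = scaled_speedup(100.0,
                         comm_times=[5.0, 6.0, 5.5, 6.5],
                         idle_times=[2.0, 1.5, 2.5, 2.0])
# eff ~ 0.92, sp ~ 3.69 on 4 processors
```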
The definition of scaled speedup allows the problem size to scale to fill the available memory, without limitations on the execution time. In some applications, however, there is an upper bound on execution time. Worley, in 1990, used time-constrained speedup to estimate the limit of the problem size that can be solved using P processors. Examples of deriving time-constrained models for some algorithms are in [8]. However, the analysis was not very realistic, since many costs were ignored in deriving such models.
3 Efficiency and other Measures
While speedup measures how much faster an algorithm can run on more than one processor, processor efficiency measures how well each processor is utilized. The efficiency is defined as the ratio of the speedup to the number of processors, so we may write
EP = SP/P    (2)
The efficiency can be used as a measure of wasted processor cycles: in parallel computers, not all processors are busy all the time. There is usually a tradeoff between speedup and efficiency. Bounds on such tradeoffs were presented in [9], where the average parallelism was used to characterize them. The average parallelism can be defined simply as the average number of processors that are kept usefully busy during the execution of a given algorithm using an unlimited number of processors.
Another measure that can be used to characterize performance is redundancy. As has been observed, the development of parallel algorithms introduces extra scalar computations to achieve higher speedup; when the communication cost is higher than the computation cost, extra code replaces communication with local computations. The redundancy factor RP is defined as the ratio of the total scalar operations performed by a P-processor parallel algorithm to the total scalar operations of the serial algorithm [1].
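The three metrics discussed so far can be computed together. The sketch below uses hypothetical timing and operation-count numbers (none taken from this paper):

```python
# Illustrative computation of relative speedup, efficiency, and redundancy.
# All numbers below are hypothetical.

def relative_speedup(t_serial, t_parallel):
    """S_P = T(1) / T_P, same parallel algorithm run on one processor."""
    return t_serial / t_parallel

def efficiency(speedup, p):
    """E_P = S_P / P, as in equation (2)."""
    return speedup / p

def redundancy(parallel_ops, serial_ops):
    """R_P: scalar operations of the P-processor algorithm over the serial ones."""
    return parallel_ops / serial_ops

t1, t8 = 100.0, 16.0          # hypothetical run times on 1 and 8 processors
s = relative_speedup(t1, t8)  # 6.25
e = efficiency(s, 8)          # ~0.78: processors are ~78% usefully busy
r = redundancy(1.3e9, 1.0e9)  # 30% extra scalar work from parallelization
```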
4 Scalability
Scalability is used as a measure of how a parallel system will perform when the number of processors or the problem size is scaled. In scalability studies, three types of scalability are often defined: machine scalability, algorithm scalability, and the scalability of a parallel system consisting of an algorithm-machine combination. Machine scalability was defined based on the asymptotic speedup for a given algorithm and problem size on the architecture under study; the asymptotic speedup is the best achievable speedup using an unlimited number of processors [10]. However, the scalability of a machine may include many factors other than speedup: cost, addressing, communication, and physical attributes should be considered to meet the machine's scalability requirements [11].
When studying algorithm scalability, the overhead related to the target machine is described by a general function. This function depends either on a general model or on specific assumptions about the hardware. Many parallel hardware models are available in the literature, such as the PRAM (Parallel Random Access Machine), BSP (Bulk-Synchronous Parallel), and LogP (Latency, overhead, gap, and Processor count), which can be used as bases for such analysis. In some cases, however, there is no assumed model of the underlying hardware; in this case, certain assumptions about the target hardware should be stated before analyzing the scalability of a given algorithm. The work done in [12] is an example of such analysis.
There are different approaches to quantifying parallel system scalability. Grama et al., 1993, defined scalability as the ability of a parallel system to increase speedup as the number of processors increases [13]. Following an analytical approach, they introduced the isoefficiency function, which relates the problem size to the number of processors required to keep the efficiency fixed. The isoefficiency function gives the rate at which the problem size must increase to keep the efficiency fixed as the number of processors increases. As a measure of scalability, a system with a small isoefficiency function is more scalable.
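The idea can be made concrete with a toy overhead model. Assuming (our assumption, for illustration only) a total parallel overhead of T_o(P) = c * P * log2(P) and efficiency E = W / (W + T_o), holding E fixed requires the problem size W to grow like P log P:

```python
import math

# Hedged sketch of the isoefficiency idea under an assumed overhead model
# T_o(P) = c * P * log2(P). From E = W / (W + T_o), keeping E fixed gives
# W = (E / (1 - E)) * T_o, so W must grow ~ P log P here.

def isoefficiency_w(p, e=0.8, c=1.0):
    t_o = c * p * math.log2(p)
    return (e / (1.0 - e)) * t_o

for p in (4, 16, 64):
    print(p, isoefficiency_w(p))   # required problem size grows with P
```

A system whose required W grows slowly with P (a small isoefficiency function) is the more scalable one under this metric.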
The isospeed metric was proposed in [14], and the scalability of an algorithm-machine combination was defined based on this metric: an algorithm-machine combination is scalable if the achieved average speed remains constant when the number of processors is increased, assuming the problem size can be increased.
In another approach, network latency was used to measure and evaluate the scalability of parallel programs and architectures [15]. The evaluation of scalability can be used to predict the performance of large problems on large machines. The scalability of a parallel system at a fixed efficiency level for two machine sizes is the ratio of the smaller machine's latency to the latency of the larger machine; the efficiency is kept at a fixed level by scaling the problem size.
A promising technique was suggested by Lyon et al., 1994, for tuning parallel code by identifying the program bottlenecks [16]. This technique treats the code as a black box with a number of input parameters and a measured output. An approximation of the multivariate Taylor expansion of the code's execution response function is estimated using factorial designs. While this technique is in use for industrial processes, it had not been used for computer programs because of the lack of natural parameters. The new approach is to incorporate artificial parameters into the program text [16]; these parameters can be delay routines inserted in the code to simulate changes in the code's performance. The number of processors used to execute the code and the size of the problem are other parameters to be considered. We used this approach to analyze the ARPS (Advanced Regional Prediction System) code.
ARPS is a nonhydrostatic atmospheric prediction model appropriate for use on scales ranging from a few meters to hundreds of kilometers. The governing equations of the atmospheric model components of ARPS include momentum, heat, mass, water substances, and the equation of state. ARPS solves prognostic equations for the x, y, and z components of the Cartesian velocity, the perturbation potential temperature, the perturbation pressure, and six categories of water substance. More details about ARPS can be found in [17]. The ARPS source code consists of about 300 subroutines and functions, developed over the past six years by about 30 scientific and support personnel. An analytical approach would be expensive and cumbersome to apply to such a huge code. For this application the reduction of overall wall-clock run time is the primary goal; for practical operation the system should run faster than the speed of the weather. The remainder of this paper presents the primary results of this work.
5 Scalability Analysis of ARPS
Experimental design techniques were employed to measure the sensitivity of the ARPS code to changes in performance. The analysis of the effects of these changes points out the primary locations of the code to optimize. The experiments in this work were conducted on the shared memory supercomputer CRAY J90 and on the scalable distributed memory machine IBM SP2. We used factorial experiment designs in our analysis.
5.1 Factorial Designs
Unlike one-factor-at-a-time experiments, factorial designs are a powerful technique for studying the effect of more than one factor at a time. To perform a general factorial design, an experimenter selects a fixed number of levels for each factor and then runs the experiments with all possible combinations. Since the number of treatments grows exponentially with the number of factors, fractional factorial designs may be used to reduce the number of treatments. An example of a factorial design with two factors at two levels each is in Table 1. The two levels of each factor are denoted with "-" for the low setting and "+" for the high setting. When a factor represents a segment of the code, it is at the high level when a delay is applied and at the low level when no delay is applied. The delay can be implemented as a call to a delay routine, which can be inserted and removed without affecting the function of the original code.
Table 1: Example of a factorial design with two factors at two levels each.
Treatment   Factor F0   Factor F1   Response
1           -           -           R1
2           -           +           R2
3           +           -           R3
4           +           +           R4
The effect of a factor is a measure of the contribution of that factor to the observed response. The effect of factor F0, denoted b0, can be estimated as the difference between the averages of the responses at the high and the low settings for that factor. The effect of F0 is estimated as
b0 = (R3 + R4)/2 - (R1 + R2)/2    (3)
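As a concrete sketch of this estimation, the design of Table 1 can be enumerated and the main effects computed directly; the response values R1..R4 below are hypothetical, not measurements from the paper:

```python
from itertools import product

# Enumerate the 2-factor full factorial design (rows in Table 1 order) and
# estimate each main effect as the average response at '+' minus the
# average response at '-', as in equation (3).

design = list(product(('-', '+'), repeat=2))   # rows: (F0, F1)

def main_effect(column, responses):
    plus  = [r for row, r in zip(design, responses) if row[column] == '+']
    minus = [r for row, r in zip(design, responses) if row[column] == '-']
    return sum(plus) / len(plus) - sum(minus) / len(minus)

R = [100.0, 105.0, 130.0, 138.0]        # R1, R2, R3, R4 (hypothetical)
effect_f0 = main_effect(0, R)           # (R3 + R4)/2 - (R1 + R2)/2 = 31.5
effect_f1 = main_effect(1, R)           # (R2 + R4)/2 - (R1 + R3)/2 = 6.5
```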
The estimated effects are compared against the estimated standard error (denoted SE), the estimated standard deviation resulting from experimental noise. Only the factors with effects outside the standard error interval are considered significant. We use 2SE, corresponding to 95.45% of the area under the normal distribution curve, or 3SE, corresponding to 99.73%. According to the central limit theorem, the error noise is very close to normally distributed when the runs are randomized and the number of runs is reasonably large.
5.2 Experimental Design
We performed a series of experiments, each with 13 factors, where F0 is the scale factor and the remaining factors represent selected segments of the ARPS code [18]. The measured response was the wall-clock running time. We used a fractional factorial design of resolution 4 with 32 treatments (see Appendix A); each treatment was replicated three times. The effects were estimated for various scales of problem size and number of processors. The scale factor can represent the scaling of either the number of processors or the problem size. The effects of the scale and the other factors were compared against the standard error to ensure their significance.
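Replication is what makes the significance test possible: the replicate runs provide a pooled estimate of the noise variance, from which the standard error of an effect follows. The sketch below assumes equal-size replicate groups and uses the standard result that an effect, being a difference of two means each over half the runs, has variance 4s^2/n; the run times are hypothetical:

```python
import math

# Hedged sketch: pool replicate variance to estimate the standard error of
# an effect, then flag effects outside the 2*SE interval (~95.45%).

def pooled_variance(replicates):
    """replicates: list of per-treatment response lists (equal sizes)."""
    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    return sum(var(t) for t in replicates) / len(replicates)

def effect_se(replicates):
    n_total = sum(len(t) for t in replicates)
    # an effect is a difference of two means, each over n_total/2 runs,
    # so Var(effect) = 4 * s^2 / n_total
    return 2.0 * math.sqrt(pooled_variance(replicates) / n_total)

# hypothetical wall-clock times: 4 treatments, 3 replicates each
runs = [[101.0, 99.5, 100.3], [131.2, 129.8, 130.4],
        [104.9, 105.6, 105.1], [137.7, 138.5, 138.0]]
se = effect_se(runs)
significant = lambda effect: abs(effect) > 2 * se
```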
The effects of each factor were estimated after each experiment, and a table of the estimated effects and interactions was constructed for each experiment. Table 2 is an example of one of these results; it shows the effects and interactions with scale for an experiment on the IBM SP2 when scaling the problem size from 67x67x35 to 131x131x35 using one processor. All the tables show that two subroutines in the ARPS code have extremely high effects on the measured response on both machines. The two factors F8 and F9, representing the subroutines BCSU and BCSV, which set the boundary conditions for the u-velocity and v-velocity components respectively, caught our attention.
Table 2: Main effects and interactions when scaling the problem size from 67x67x35 to 131x131x35. The mean is 1493.47; the estimated standard error is SE = 2.33.
Factor   Subroutine    Main Effect   Interaction with Scale
F0       (Grid Size)   1464.86       N/A
F1       ADVCTS        27.56         1.28
F2       ADVU          5.23          3.13
F3       ADVV          8.11          2.66
F4       ADVW          4.98          0.05
F5       BCKMKH        9.74          1.47
F6       BCS2D         0.63          0.85
F7       BCSCLR        16.73         0.44
F8       BCSU          217.04        3.23
F9       BCSV          213.00        1.18
F10      BOUNDU        7.16          0.79
F11      BOUNDV        8.24          0.47
F12      JACOB         0.74          1.35
Table 3: The average responses of a full factorial experiment on CRAY J90.
F0   F8   F9   Average Response
-    -    -    605.35
-    -    +    742.55
-    +    -    785.73
-    +    +    953.76
+    -    -    224.87
+    -    +    400.26
+    +    -    400.02
+    +    +    575.95
We isolated these two factors from the other factors using a 3-factor full factorial design, as in Tables 3 and 4. In both experiments we scaled the number of processors from 1 to 4, while the problem size was kept the same. For the scale factor we used "-" to indicate the use of 1 processor and "+" to indicate the use of 4 processors, while for the other factors "+" was used when the synthetic perturbation is introduced and "-" when it is not.
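The relative speedups behind the interaction plot of Figure 1 can be recovered from the Table 3 averages by pairing each one-processor row with the four-processor row that has the same (F8, F9) setting. This is a sketch; the assignment of the reconstructed rows to (F8, F9) settings follows our reading of the table's row order:

```python
# Relative speedup on the CRAY J90 per (F8, F9) setting, from Table 3:
# speedup = average response with 1 processor / average response with 4.
one_proc  = {('-', '-'): 605.35, ('-', '+'): 742.55,
             ('+', '-'): 785.73, ('+', '+'): 953.76}
four_proc = {('-', '-'): 224.87, ('-', '+'): 400.26,
             ('+', '-'): 400.02, ('+', '+'): 575.95}

speedup = {k: one_proc[k] / four_proc[k] for k in one_proc}
# the unperturbed code ('-','-') speeds up most (~2.7 on 4 processors);
# perturbing F8 or F9 lowers the achieved speedup
```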
Table 4: The average responses for the full factorial experiment on IBM SP2.
F0   F8   F9   Average Response
-    -    -    1966.45
-    -    +    2173.58
-    +    -    2174.57
-    +    +    2396.68
+    -    -    530
+    -    +    744.87
+    +    -    742.81
+    +    +    960.21
Figure 1: Interaction plot using relative speedup on CRAY J90.
5.3 Results
The primary results indicate that factors F8 and F9 are prime targets for future enhancement of the code. The execution times of the corresponding treatments indicate longer running times on the IBM SP2.
The interaction plots of the two highly significant factors are in Figures 1 and 2. The plots were obtained using relative speedup instead of wall-clock time to normalize the results; this allows us to compare the scalability of the two machines. It is clear that the ARPS code is more scalable on the IBM SP2 than on the CRAY J90. In other words, using more processors on the IBM SP2 has a higher effect on the response than using more processors on the CRAY J90. Nevertheless, from Tables 3 and 4 the global estimated mean of the run time for the CRAY J90 indicates a shorter wall-clock execution time than for the IBM SP2.
Figure 2: Interaction plot using relative speedup on IBM SP2.
6 Conclusions and Future Work
In this paper we reviewed different approaches and measures for evaluating parallel systems. While speedup has been in use for a long time, the scalability of parallel systems is still under study. When studying the scalability of parallel systems, many factors are introduced, and the complex nature of parallel systems requires considering more than one factor at a time when evaluating performance. Factorial designs were a good tool for studying the effects of more than one factor simultaneously. Analytical analysis of codes requires a deep knowledge of the code, which is not practical for huge codes such as ARPS. The approach we followed here does not require much knowledge of the code, and it points out the segments of code that affect its scalability.
We used the interaction plots of speedup as a normalized measure of scalability. These plots enabled us to compare the scalability of the two machines. However, saying that a machine is more scalable does not imply that the machine is faster. The average run time was used in conjunction with the speedup interaction plots to obtain a global judgment.
The factorial designs used here are called two-level factorial designs. Our future work will include the use of more complex factorial designs that allow more than two levels. We also plan to include more than one scale factor in each experiment.
Acknowledgments
The ARPS model code used in this analysis is under development at the Center for Analysis and Prediction of Storms (CAPS), an NSF Science and Technology Center at the University of Oklahoma. Thanks to Environmental Computing and Applications Systems (ECAS) at the University of Oklahoma for providing us with access to the CRAY J90. We thank the Department of Physics and Astronomy for providing access to the IBM SP2 used in this study. Our thanks are also due to Dr. Gordon Lyon, NIST, for suggesting the approach described in this paper and for his continued guidance.
References
[1] Lakshmivarahan, S. and S. K. Dhall. 1990. Analysis and Design of Parallel Algorithms. McGraw-Hill, New York.
[2] Wilson, G. 1993. "A Glossary of Parallel Computing Terminology." IEEE Parallel & Distributed Technology 1, no. 1 (Feb.): 52-67.
[3] Sun, X. and J. Zhu. 1996. "Performance Prediction: A Case Study Using a Scalable Shared Virtual Memory Machine." IEEE Parallel & Distributed Technology 4, no. 4 (Winter): 36-49.
[4] Sahni, S. and V. Thanvantri. 1996. "Performance Metrics: Keeping the Focus on Runtime." IEEE Parallel & Distributed Technology 4, no. 1 (Spring): 43-56.
[5] Ramachandran, U.; G. Shah; and S. Ravikumar. 1993. "Scalability Study of the KSR-1." Proceedings of the 1993 International Conference on Parallel Processing, Vol. I (Syracuse, NY, Aug. 16-20).
[6] Koufaty, D.; X. Chen; D. Poulsen; and J. Torrellas. 1996. "Data Forwarding in Scalable Shared-Memory Multiprocessors." IEEE Transactions on Parallel and Distributed Systems 7, no. 12 (Dec.): 1250-1264.
[7] Gustafson, J.; G. Montry; and R. Benner. 1988. "Development of Parallel Methods for a 1024-Processor Hypercube." SIAM Journal on Scientific and Statistical Computing 9, no. 4 (July): 609-638.
[8] Worley, P. 1990. "The Effect of Time Constraints on Scaled Speedup." SIAM Journal on Scientific and Statistical Computing 11, no. 5 (Sep.): 838-858.
[9] Eager, D.; J. Zahorjan; and E. Lazowska. 1989. "Speedup Versus Efficiency in Parallel Systems." IEEE Transactions on Computers 38, no. 3 (Mar.): 408-423.
[10] Nussbaum, D. and A. Agarwal. 1991. "Scalability of Parallel Machines." Communications of the ACM 34, no. 3 (Mar.): 56-61.
[11] Gustavson, D. 1994. "The Many Dimensions of Scalability." Digest of Papers, COMPCON 94 (San Francisco, CA, Feb. 28-Mar. 4). IEEE Computer Society Press.
[12] Muller-Wichards, D. and W. Ronsch. 1995. "Scalability of Algorithms: An Analytical Approach." Parallel Computing 21, no. 6 (June): 937-952.
[13] Grama, A. Y.; A. Gupta; and V. Kumar. 1993. "Isoefficiency: Measuring the Scalability of Parallel Algorithms and Architectures." IEEE Parallel & Distributed Technology 1, no. 3 (Aug.): 12-21.
[14] Sun, X. and D. Rover. 1994. "Scalability of Parallel Algorithm-Machine Combinations." IEEE Transactions on Parallel and Distributed Systems 5, no. 6 (June): 599-613.
[15] Zhang, X.; Y. Yan; and Q. Mia. 1994. "Measuring and Analyzing Parallel Computing Scalability." Technical Report TR940302, High Performance Computing and Software Laboratory, The University of Texas at San Antonio, San Antonio, TX.
[16] Lyon, G.; R. Kacker; and A. Linz. 1995. "A Scalability Test for Parallel Code." Software: Practice and Experience 25, no. 12 (Dec.): 1299-1314.
[17] Xue, M.; K. Droegemeier; V. Wong; A. Shapiro; and K. Brewster. 1995. ARPS Version 4.0 User's Guide. Center for Analysis and Prediction of Storms, The University of Oklahoma, Norman, OK.
[18] Alabdulkareem, M.; S. Lakshmivarahan; and S. K. Dhall. 1997. "A Scalability Analysis of a Large Code of Interest in Meteorology Using Experimental Design Techniques." Proceedings of High Performance Computing, Atlanta, GA, April 6-10, 1997.
APPENDIX A
The following table is the fractional factorial design we used for the 13-factor experiments. Since the runs were performed in random order, the treatment number is used for identification only and does not represent the order in which the experiments were actually performed.
Treatment
NumberF0F1F2F3F4F5F6F7F8F9F10F11F12111111111111112111111111111131111111111111411111111111115111111111111161111111111111711111111111118111111111111191111111111111101111111111111111111111111111121111111111111131111111111111141111111111111151111111111111161111111111111171111111111111181111111111111191111111111111201111111111111211111111111111221111111111111231111111111111241111111111111251111111111111261111111111111271111111111111281111111111111291111111111111301111111111111311111111111111321111111111111
The above design was obtained using the FACTEX procedure available with the SAS package. This procedure generates orthogonally confounded designs, including factorial and fractional factorial designs. After constructing a design, it can be printed, saved to a file, or used as a SAS data set.
Performance Evaluation of Parallel Systems: An Overview
M. AlAbdulkareem* S. Lakshmivarahan** S. K. Dhall**
* Department of Computer Science, CCIS, King Saud University, P.O.BOX 2454,
Riyadh 11451, Saudi Arabia.
**Parallel Processing Institute, School of Computer Science, University of Oklahoma, Norman, OK 73019, USA.
ABSTRACT: Assessing parallel systems is crucial, and advances in parallel software and hardware technologies make this task more challenging. This paper reviews some of the existing performance measures and introduces a new measure of scalability. We follow a statistical approach that employs factorial designs to measure the sensitivity of a code to scaling and to changes in the code's performance. This approach empirically estimates an approximation of a multivariate Taylor expansion of the code's execution response function. The coefficients of the first-order terms are estimates of the code's sensitivity to scaling and to changes in the code's efficiency. The higher-order terms are utilized as relative indicators of the code's scalability.
1 Introduction
One of the main goals of parallel processing is to solve large problems in much less time than serial processing requires. To achieve this goal, the effect of the serial part of the parallel program needs to be minimized. The notion of speedup, defined as the ratio of the best known serial time to the given parallel time, is used to assess parallel codes. Amdahl's law (1967), see [1], was the first step toward analyzing parallel code. The notion of scalability measures how well a parallel system performs as the system size is increased, usually by using more processors to solve a problem. An ideal parallel algorithm has a speedup proportional to the number of processors; however, this is not the case in real-life applications. A parallel program reaches a limit, called the parallel balance point, after which adding more processors increases the time taken to solve a fixed-size problem [2]. The following sections briefly survey different measures for evaluating parallel systems.
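Amdahl's law, mentioned above, bounds the achievable speedup by the serial fraction of the program. The following sketch (the function name and the example fraction are illustrative, not from the paper) shows how quickly the bound is approached:

```python
def amdahl_speedup(serial_fraction, p):
    """Amdahl's law: speedup on p processors when a fraction
    serial_fraction of the work cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / p)

# With a 10% serial fraction, even many processors cannot exceed
# a speedup of 1 / 0.1 = 10.
s = amdahl_speedup(0.1, 10)   # about 5.26 on 10 processors
```

Note that the speedup saturates well below the processor count, which is one motivation for the scalability measures surveyed later.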
2 Speedup
The most widely used measure to characterize a parallel algorithm is the speedup ratio. However, different versions of speedup are used to evaluate parallel codes. The classical definition relates the time required to solve a given problem using the best known serial algorithm to the time required to solve the same problem by a parallel algorithm using P processors [1]. The speedup is then
SP = T(N) / TP(N)        (1)
where N is the problem size, T(N) is the serial time, and TP(N) is the parallel time on P processors. Depending on the measured serial and parallel times, the definition of speedup changes and may yield contrary results. When the fastest known algorithm is executed on a single processor, the speedup is called real speedup (also called classical or traditional speedup). However, most researchers use what is known as relative speedup, which uses the same parallel algorithm on one processor for the serial time instead of the best known serial algorithm [3]. When the best known serial algorithm is run on the fastest machine, the speedup is then called absolute speedup [4].
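The distinction between real and relative speedup comes down to which serial time is placed in the numerator of (1). A minimal sketch, with hypothetical timing values chosen for illustration:

```python
def speedup(serial_time, parallel_time):
    """Speedup per equation (1): SP = T(N) / TP(N)."""
    return serial_time / parallel_time

# Hypothetical measurements for one problem size N
best_serial_time = 100.0   # best known serial algorithm, T(N)
parallel_on_one = 120.0    # same parallel algorithm run on 1 processor
parallel_on_p = 8.0        # parallel algorithm on P processors, TP(N)

real_speedup = speedup(best_serial_time, parallel_on_p)      # 12.5
relative_speedup = speedup(parallel_on_one, parallel_on_p)   # 15.0
```

Because the parallel algorithm on one processor is typically slower than the best serial algorithm, relative speedup is the larger of the two, which is one way such figures can mislead.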
An example of misleading results is apparent when the serial time is larger than the parallel time multiplied by the number of processors, that is, T(N) > P*TP(N). This can happen when the sequential processing requires high memory access time, as in the shared virtual memory machine KSR1. The results of the performance evaluation of the KSR1 can be found in [5] and [3]. The impact of the high memory latency in shared virtual memory machines can be reduced using data prefetching and data forwarding, as described in [5] and [6].
The above speedup definitions are grouped into one type, called fixed-size speedup, since the problem size does not scale with the number of processors. Another type is the scaled speedup, which allows the problem size to increase with the number of processors. When measuring the scaled speedup on a distributed memory machine, we cannot measure the serial time because of memory limitations. However, by measuring the interprocessor communication time and the idle time on each processor, the efficiency can be computed, and then the scaled speedup is obtained. The scaled speedup was introduced in [7].
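The efficiency-based route to scaled speedup described above can be sketched as follows. The function name and the measured values are illustrative assumptions, not taken from the paper; the idea is that time not spent on communication or idling is useful work, efficiency is the useful fraction of total processor time, and scaled speedup is P times the efficiency:

```python
def scaled_speedup(parallel_time, comm_times, idle_times):
    """Estimate scaled speedup from per-processor overheads.

    parallel_time: wall-clock time of the run on P processors
    comm_times, idle_times: measured communication and idle time
    on each processor (lists of length P, hypothetical values).
    """
    p = len(comm_times)
    total = p * parallel_time                     # total processor-time
    overhead = sum(comm_times) + sum(idle_times)  # non-useful time
    efficiency = (total - overhead) / total       # useful fraction
    return p * efficiency

# 4 processors, a 10-second run, with measured per-processor overheads
s = scaled_speedup(10.0, [1.0] * 4, [0.5] * 4)   # 3.4
```

This avoids ever running the (possibly infeasible) serial problem on one node of the distributed memory machine.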
The above design was obtained using the FACTEX procedure available in the SAS package. This procedure generates orthogonally confounded designs, including factorial and fractional factorial designs. After constructing a design, it can be printed, saved to a file, or used as a SAS data set.
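FACTEX is a SAS procedure; purely as an illustration of what a two-level full factorial design looks like (not a replacement for FACTEX's confounding machinery), the run matrix can be enumerated in a few lines:

```python
from itertools import product

def full_factorial(n_factors):
    """All 2^n runs of a two-level full factorial design,
    with factor levels coded as -1 (low) and +1 (high)."""
    return list(product([-1, 1], repeat=n_factors))

design = full_factorial(3)   # 2^3 = 8 runs for three factors
```

Fractional designs would take a subset of these rows chosen so that the effects of interest remain unconfounded, which is what FACTEX automates.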