# **AUTHOR QUERY FORM**

|          | Journal: MICPRO      | Please e-mail or fax your responses and any corrections to:        |
|----------|----------------------|--------------------------------------------------------------------|
|          | Article Number: 1849 | E-mail: corrections.esch@elsevier.sps.co.in<br>Fax: +31 2048 52799 |
| ELSEVIER |                      |                                                                    |

Dear Author,

Please check your proof carefully and mark all corrections at the appropriate place in the proof (e.g., by using on-screen annotation in the PDF file) or compile them in a separate list.

For correction or revision of any artwork, please consult <u>http://www.elsevier.com/artworkinstructions.</u>

No queries have arisen during the processing of your article.

Thank you for your assistance.

Microprocessors and Microsystems xxx (2010) xxx-xxx

Contents lists available at ScienceDirect

# Microprocessors and Microsystems

journal homepage: www.elsevier.com/locate/micpro

# <sup>2</sup> A hardwired NoC infrastructure for <u>embedded systems</u> on FPGAs

## Muhammad E.S. Elrabaa \*, Abdelhafidh Bouhraoua

4 Computer Engineering department, King Fahd University for Petroleum and Minerals, Dhahran 31261, Saudi Arabia

#### ARTICLE INFO

Multi-core embedded systems

Field Programmable Gate Arrays (FPGAs)

Available online xxxx

Networks-on-chip

Systems-on-chips

Article history

Keywords:

ABSTRACT

A hardwired network-on-chip based on a modified Fat Tree (MFT) topology is proposed as a communication infrastructure for future FPGAs. With extremely simple routing, such an infra structure would greatly enhance the ongoing trend of embedded systems implementation using multi-cores on FPGAs. An efficient H-tree based floor plan that naturally follows the MFT construction methodology was developed. Several instances of the proposed NoC were implemented with various inter-routers links progression schemes combined with very simple router architecture and efficient <u>client network</u> interface (CNI). The performance of all these implementations was evaluated using a cycle-accurate simulator for various combinations of NoC sizes and traffic models. Also a new data transfer circuit for transferring data between clients and NoC operating at different (unrelated) clock frequencies has been developed. Allowing data transfer at one data per cycle, the operation of this circuit has been verified using gate-level simulations for several ratios of NoC/client clock frequencies.

© 2010 Elsevier B.V. All rights reserved.

27 28 29

54

55

56

57

17

18

19

20

21

22

23

24 25

26

#### 30

5

16

8 9

10

11

12

13

14 15

## 31 1. Introduction

Field Programmable Gate Arrays (FPGAs) have gained consider-32 able acceptance among VLSI designers not only as prototyping 33 platforms but also as system implementation platforms for appli-34 cations with short time to market constraints. State of the art 35 FPGAs have also become very attractive for system-on-chip (SOC) 36 and embedded systems designs due to their ease of use, flexibility, 37 large gate count, abundance of I/Os and efficient hardware macros, 38 embedded processors (both hard and soft), re-configurability, 39 40 extensive tool support and improved speed [1.2]. FPGAs can be configured (and re-configured) to implement any digital processor 41 or group of processors (called cores or Intellectual Property blocks 42 or IPs). Several cores could be implemented simultaneously that 43 can communicate among themselves and the outer world. This 44 45 matches the current trend of multi-core implementation of 46 embedded systems [1]. This lead to a new trend of implementing 47 embedded systems on FPGAs. The inter-core communications in 48 these implementations, however, are still implemented as 49 dedicated point-to-point or using shared buses. The first method 50 though provides good speed, consumes relatively large portion of the FPGA's logic/interconnect resources reducing the total number 51 of cores that can be implemented. The second method is more 52 53 resource efficient but suffers from performance penalties and does

not scale well. Having a scalable interconnection network (i.e. a network-on-chip) on future FPGAs has the highest potential for satisfying the communication needs of future FPGA-based embedded systems [1].

Conventional FPGAs suffered from relatively high configuration 58 time (time needed to load all the configuration bits into the FPGA) 59 due to the limited number of external configuration ports and the 60 serial nature of the configuration process itself. Also, configuration 61 interconnects were completely separated from data/functional 62 interconnects. Since adding more external configuration ports 63 would be at the expense of regular data I/Os, FPGA vendors added 64 internal reconfiguration ports [3] that can be accessed by various 65 cores implemented on the FPGA itself. Vendors also have divided 66 their FPGAs into areas that can be configured separately with each 67 area having its own local interconnect with functional and config-68 uration interconnects remaining separate. This alleviated some of 69 the problems associated with the serial nature of reconfiguration 70 making dynamic reconfiguration (re-configuring the FPGA during 71 use) more practical. Still, having separate configuration and func-72 73 tional interconnects is not only wasteful of space but means that special circuitry are required to connect the regular data port of 74 an IP to the configuration interconnects and then to the internal 75 configuration port of the region being re-configured. This requires 76 having a dedicated configuration management circuit on the FPGA, 77 78 yet more wasting of valuable logic resources. Hence it makes perfect sense to share the physical interconnects by functional 79 (regular) data and configuration data. This would not only free 80 valuable FPGA space but would also make available more intercon-81



<sup>\*</sup> Corresponding author. Tel.: +966 3 860 1496; fax: +966 3 860 3059.

*E-mail addresses*: elrabaa@kfupm.edu.sa (M.E.S. Elrabaa), abouh@kfupm.edu.sa (A. Bouhraoua).

<sup>0141-9331/\$ -</sup> see front matter © 2010 Elsevier B.V. All rights reserved. doi:10.1016/j.micpro.2010.09.008

176

177

178

179

180

181

182

183

184

185

186

187

188

189

nect resources available for reconfiguration, enabling fast dynamic
 run-time reconfiguration. This in turn would open the door for new
 and innovative computing paradigms.

85 Another major problem with FPGA interconnects is their rela-86 tively large delays. This is due to the fact that in order to make 87 them configurable switch boxes (made of MUXs or Pass gates) 88 are inserted at regular lengths along the interconnect wires. These 89 wire delays constitute a challenge for circuit designers to meet 90 their timing constraints (i.e. timing closure of the design). The 91 same problem (timing closure) has faced designers of conventional 92 SoCs where a chip is assembled by integrating different IPs, possi-93 bly from different vendors, with various operating frequencies and 94 communication patterns. Networks-on-chip were proposed to solve these problems by providing a scalable shared interconnect 95 96 medium that can meet the different communication requirements 97 of IPs [4–6]. As a result, there has been a significant amount of ef-98 fort made in the area of NoCs, and the focus has mostly been on 99 proposing new topologies, and routing strategies. However, re-100 cently the trend has shifted towards engineering solutions and 101 providing design tools that are more adapted to reality. For exam-102 ple, power analysis of NoC circuitry has intensively been studied 103 [7,8], more realistic traffic models have been proposed [9], and more adapted hardware synthesis methodologies have been 104 105 developed.

High throughput architectures, however, haven't been addressed enough in the literature. Besides the efforts related to the Nostrum [10] and the Æthreal [11], most of the other efforts were based on a regular mesh topology with throughputs (expressed as a fraction of the wire speed) not exceeding 30% [11].

111 In [12,13] a NoC topology based on a modified Fat Tree (MFT) 112 was proposed to address the throughput issue. MFT is a subclass 113 of Fat Trees (FT) multi-stage-interconnection networks [14]. The 114 conventional Fat Tree topology was modified by adding enough 115 links such that contention was completely eliminated thus achiev-116 ing a throughput of nearly 100% [12] while eliminating any buffer-117 ing requirement in the routers. Also, simplicity of the routing 118 function, typical of Trees, meant that the router architecture is 119 greatly simplified. The approach was extended to satisfy heteroge-120 neous bandwidth requirements of different clients [15]. The new 121 approach allowed NoC designers to satisfy the bandwidth requirement of each client without the need to "overdesign" the NoC, 122 123 reducing power consumption without impacting the performance 124 of applications running on the NoC.

125 Many researchers have proposed soft NoCs for FPGAs that utilize the FPGA's configurable resources to implement the NoC's var-126 127 ious components (e.g. [16-19]). Though many of these soft NoCs 128 have demonstrated various degrees of performance improvement, 129 they consumed significant portion of the FPGA's resources (logic/ 130 interconnects/memories). Most critical of these are memory blocks 131 (for implementing I/O buffers in the NoC interface) which leave 132 designers with insufficient RAM blocks to implement their 133 applications.

As a result of the limitations of soft NoCs, several researchers 134 have recently proposed introducing NoCs as FPGA macros (i.e. as 135 136 hardwired, full-custom designed circuitry) [20-26]. In [20] a reconfigurable router that can be configured to have various num-137 bers of ports was first proposed. Users would need to implement 138 the routing algorithm they desire and configure the NoC accord-139 ingly. In [21] a configurable Mesh-based NoC is proposed. This 140 141 NoC is only modeled in SystemC using channels to implement 142 the routing protocols with no actual implementation provided. 143 Also the NoC itself require significant reconfiguration (routing ta-144 bles, logical to physical addresses mapping, virtual point-to-point 145 connections, etc.). Although Mesh topology would fit conveniently 146 with the tile-based FPGA re-configuration regions, it has sever per-147 formance limitations in terms of achievable throughput (<30% of the wire speed) [10,11]. Another Mesh-based NoC is proposed in 148 [22]. It combines hardwired resources (links and routers) with soft 149 resources (configurable network interface). This provided extra 150 flexibility at the cost of using up the FPGA's resources and reducing 151 the speed since soft parts will inevitably have lower speed than the 152 hardwired parts. Elmiligi et al. [23] proposes a centralized pro-153 grammable switch fabric to connect routing nodes and coarse grain 154 configurable processing elements. This strategy won't work with 155 regular fine-grained FPGAs because the size of the central switch 156 will grow exponentially with the number of I/O ports. A similar ap-157 proach is proposed in [24]. Based on the Æthereal NoC [11], routers 158 and links are implemented as hardwired blocks while the network 159 interfaces are split between hard and soft parts. The NoC requires 160 significant configuration to implement the Æthereal's different QoS 161 services. This in turn requires the implementation of a dedicated 162 NoC configuration manager as part of a system manager [25], using 163 up vet more FPGA resources or area. Configurable router that can 164 support several NoC topologies is proposed in [26]. Routers contain 165 the required logic circuitry to implement the supported topologies' 166 routing algorithms, something that is wasteful of FPGA chip area. 167 Furthermore, Routers and clients are connected using the FPGA's 168 configurable interconnects and using asynchronous handshaking 169 (Req-Acknowledge). This is not only wasteful of further FPGA re-170 sources, but also limiting to the over all NoC performance. As 171 was demonstrated in [33], full-custom hardware components can 172 achieve much higher performance at a very small fraction of the 173 area of configurable logic. 174

In this work a new FPGA hardwired NoC is proposed. The proposed NoC would provide an efficient infrastructure for functional inter-module communications as well as reconfiguration of the FPGA's regions. A brief overview of the proposed approach is first introduced in the next section followed by a brief description of how the NoC is constructed (in terms of links and routers) and the proposed physical floor plan. In Section 3 the detailed NoC design is presented including router's architecture, links progression, and client interface design. This is followed by performance evaluation (in terms of throughput and packet delays) of the proposed NoC implementations options under various conditions of size and traffic types. This includes a detailed gate count for these options and the area impact of adding a hardwired NoC to an FPGA. Finally conclusions are provided in Section 5.

#### 2. Approach

To circumvent the limitations of the previously proposed FPGA 190 NoCs, the target of this work was to develop an FPGA NoC infra-191 structure with minimal reconfiguration requirements. This would 192 minimize the use of soft parts in the NoC implementation, increas-193 ing the speed and reducing the overall area footprint of the NoC. 194 Also to achieve maximum data rates in the NoC a full synchronous 195 design of the NoC is adopted. Since the NoC is designed as a full-196 custom circuit with a relatively small area, it is not difficult to de-197 sign it as a synchronous circuit with its own clock tree. In fact an 198 efficient H-tree floor plan was developed for the proposed NoC that 199 would match the clock tree's own topology (usually laid out as an 200 H-tree) simplifying the timing closure of the NoC. Since IPs will 201 have their own clocks, specialized circuitry was developed to 202 transfer data between the NoC and IPs at the maximum possible 203 data rate. Also the developed NoC would have minimal intrusion 204 on existing FPGA architectures for seamless integration into exist-205 ing FPGA design flows. It is meant to be an infra structure for pro-206 viding a fast medium for delivering functional/configuration data 207 to all parts of the FPGA. In other word it provides the lower layers 208 (up to routing) of the networking protocol while the higher layers 209 are left for the system designer to choose and implement (or 210

Please cite this article in press as: M.E.S. Elrabaa, A. Bouhraoua, A hardwired NoC infrastructure for embedded systems on FPGAs, Microprocess. Microsyst. (2010), doi:10.1016/j.micpro.2010.09.008

2

106

107

108

109

M.E.S. Elrabaa, A. Bouhraoua/Microprocessors and Microsystems xxx (2010) xxx-xxx

211 choose from a library of IPs). The fixed NoC infrastructure is based 212 on the MFT architecture [12,13]. The router architecture is further 213 simplified by eliminating adaptive routing altogether. This would 214 not only reduce the router area, but it will also insure in-order reception of packets, thus simplifying the client network interface 215 design significantly. Also to reduce the NoC area further and to 216 217 simplify the clients' network interface, the original MFT architecture was modified by reducing the number of links significantly. 218 Different strategies for link reduction were evaluated using simu-219 lations to determine the best strategy in terms of throughput and 220 gate count. 221

#### 3. NoC construction and floor plan

## 223 3.1. Modified Fat Tree (MFT) NoC construction

Fig. 1 shows the recursive construction of a FT network. The 224 starting basic unit, called a group of order 1, is composed of a single 225 router with two clients attached to it. This unit is replicated and a 226 new row of routers is added. Higher order groups are then formed 227 in the same way from lower order groups, Fig. 1. Hence an *n*-client 228 network will have  $(n/2)^*$  LOG<sub>2</sub>(*n*) routers organized as n/2 columns 229 230 by  $LOG_2(n)$  rows. In this work, rows are numbered from 0 to  $LOG_2(n) - 1$ , with row 0 being the first level of routers which are 231 232 connected to the clients. The MFT is constructed in a similar man-233 ner with the addition of additional links, Fig. 2. In the original MFT the downward output ports (links) are double the number of upper 234 links for each router. This link doubling (referred to in this work 235 236 as geometrical link progression) would proceed from the top row to row 0. Hence at the client/NoC interface routers will have 237  $2^{n+1}_{\mu}$  - 1 downward links (i.e.  $2^{n}_{\mu}$  1 input links per client). Each 238 of these I/P links featured a FIFO buffer as called for by the original 239 MFT architecture [12]. In this work two new modifications were 240 adopted for the MFT; (1) adaptive upward routing was eliminated 241 to further simplify the router's design and guaranteeing in-order 242 packet arrival, thus simplifying the client interface greatly, Fig. 2) 243 Different schemes for increasing the number of downward output 244 links (i.e. link progression) were adopted. The later modification 245 was adopted to reduce the number of links and FIFOs at the edge 246 of the NoC at the client interface. This was a result of observations 247 from numerous simulations with different traffic models that most 248 of the extra links were inactive for most of the time. The router de-249 sign was modified to support the different link progression 250 schemes evaluated. 251

#### 3.2. Routing in MFT

Packets are routed up till they reach a summit router where 253 there is a direct path to the destination. So when an upward packet 254 is received by a router at row r,  $LOG_2(n) - r_1 = 1$  bits of the destina-255 tion address are simply compared to a stored pattern of equal 256 length that indicates all the addresses reachable from that router. 257 If they match the packet is routed down the opposite side, if not, 258 the packet is routed up on the same side till it reach the summit. 259 Downward packets are routed right or left based on the value of 260 one of the destination address bits. The exact bit locations within 261 the destination address that are used for upward and downward 262 routing depends on the router location and are hardwired into 263



Fig. 2. (a) The original MFT topology and (b) router architecture with downward link doubling (geometrical link progression).

Please cite this article in press as: M.E.S. Elrabaa, A. Bouhraoua, A hardwired NoC infrastructure for embedded systems on FPGAs, Microprocess. Microsyst. (2010), doi:10.1016/j.micpro.2010.09.008

3

295

315

4

M.E.S. Elrabaa, A. Bouhraoua/Microprocessors and Microsystems xxx (2010) xxx-xxx

the router during design. This greatly simplifies the routers design.
Also due to the deterministic routing and the buffer-less nature of
the network packet order is maintained.

## 267 3.3. NoC floor plan

268 An efficient floor plan is proposed for the NoC based on the H-269 tree topology, Fig. 3. Each tile in the floor plan represents a configuration region (CR) and it will be considered as a single client from 270 the NoC design point of view. This is just for the purpose of config-271 272 uration (i.e. each of these tiles represents the minimum re-config-273 uration region for the purpose of partial configuration). An actual 274 client may occupy any number of CRs and can utilize any (or all) 275 of the NoC ports in these CRs. Also two (or more) clients can occupy the same CR tile, but the designer needs to implement his/her own 276 interface switch to time multiplex the NoC port in that CR. It 277 should be noted that this floor plan is just illustrative and not an 278 279 actual layout (i.e. it is not to scale). In reality, the routers and links area would be very small compared to CR tiles. Fig. 3 also lists the 280 281 maximum length of links in terms of CR length.

As Fig. 3 shows, the routers and links layout follow the borders of the CRs to minimize the impact on conventional FPGA structure. With this floor plan, the longest links are the most upper links that span one-half the chip length. They are, however, fewer in numbers, making the routing of these links' wires easier. Also, proper buffer insertion in these links must be applied to maintain the required clock rate.

## 4. NoC design

289

## 290 4.1. Link progression schemes in the MFT NoC

Three types of link progression in the MFT NoC have been evaluated in this work; (1) Arithmetic progression, (2) Mixed Arithmetic/Geometrical progression and (3) Limited Geometrical progression. The three types are briefly explained below.

#### 4.1.1. Arithmetic link progression

The Arithmetic progression MFT, abbreviated as AMFT, is based 296 on partial reduction of contention through the addition of a fixed 297 number of links on the downward path of each router. Hence, in-298 stead of the doubling called for by the original MFT architecture, 299 the number of output ports of each router is equal to the number 300 of input ports plus a fixed number called the *increment*  $(I_A)$ , An 301 increment of 2, means adding an extra link to both the left and 302 right groups of output links. Another parameter, stop level  $(S_{I})$  is 303 introduced to increase the variations of the network structure 304 exploration. Starting from the top row,  $S_L$  represents the row at 305 which incrementing the output links will stop leaving the routers 306 in lower levels with an identical number of inputs and outputs. Ta-307 ble 1 below illustrates the arithmetic progression for a NoC with 64 308 CRs for various combinations of  $I_A$  and  $S_L$ . It shows the number of 309 output links per side (left or right) per router for each row in the 310 NoC. The last column shows the corresponding MFT number of 311 output links when doubling of the links is used. It is very clear that 312 this scheme results in a significant decrease in the number of links 313 at most levels especially the last one (row 0). 314

## 4.1.2. Mixed arithmetic/geometrical link progression

This scheme as its name indicates is a mixture of arithmetic and geometric progression. Starting from the top row, output ports are 317

#### Table 1

Number of routers' output links per side for a network of 64 CRs, various ( $I_A$ ,  $S_L$ ) combinations and using the arithmetic link progression scheme.

| Row | I <sub>A</sub><br>S <sub>L</sub> | 2<br>0 | 2<br>1 |   | 2<br>3 |    | 4<br>1 | 4<br>2 |   |    | 6<br>1 | 6<br>2 | 6<br>3 | MFT |
|-----|----------------------------------|--------|--------|---|--------|----|--------|--------|---|----|--------|--------|--------|-----|
| 5   |                                  | 1      | 1      | 1 | 1      | 1  | 1      | 1      | 1 | 1  | 1      | 1      | 1      | 1   |
| 4   |                                  | 2      | 2      | 2 | 2      | 3  | 3      | 3      | 3 | 4  | 4      | 4      | 4      | 3   |
| 3   |                                  | 3      | 3      | 3 | 3      | 5  | 5      | 5      | 5 | 7  | 7      | 7      | 7      | 7   |
| 2   |                                  | 4      | 4      | 4 | 3      | 7  | 7      | 7      | 5 | 10 | 10     | 10     | 7      | 15  |
| 1   |                                  | 5      | 5      | 4 | 3      | 9  | 9      | 7      | 5 | 13 | 13     | 10     | 7      | 31  |
| 0   |                                  | 6      | 5      | 4 | 3      | 11 | 9      | 7      | 5 | 16 | 13     | 10     | 7      | 63  |



Fig. 3. The H-tree construction of the NoC floor plan (not to scale). The client/NoC interface (CNI) is at the edge of the NoC at the client's interface to the 1st level of routers (row 0).

346

347

348

349

350

351

352

353

354

355

356

357

358

359

360

361

362

363

364

365

366

367

368

369

370

371

372

373

M.E.S. Elrabaa, A. Bouhraoua/Microprocessors and Microsystems xxx (2010) xxx-xxx

Table 2

Number of routers' output links per side for a network of 64 CRs, various ( $I_A$ ,  $S_L$ ) combinations and using the mixed arithmetic/geometrical link progression scheme.

| Row | $I_A$<br>$S_L$ | 2<br>1 | 2<br>2 | 2<br>3 | 4<br>1 | 4<br>2 | 4<br>3 | 6<br>1 | 6<br>2 | 6<br>3 | MFT |
|-----|----------------|--------|--------|--------|--------|--------|--------|--------|--------|--------|-----|
| 5   |                | 1      | 1      | 1      | 1      | 1      | 1      | 1      | 1      | 1      | 1   |
| 4   |                | 2      | 2      | 2      | 3      | 3      | 3      | 4      | 4      | 4      | 3   |
| 3   |                | 3      | 3      | 3      | 5      | 5      | 5      | 7      | 7      | 7      | 7   |
| 2   |                | 4      | 4      | 7      | 7      | 7      | 11     | 10     | 10     | 15     | 15  |
| 1   |                | 5      | 9      | 15     | 9      | 15     | 23     | 13     | 21     | 31     | 31  |
| 0   |                | 11     | 19     | 31     | 19     | 31     | 47     | 27     | 43     | 63     | 63  |

incremented by  $I_A$  until the row  $S_L$  is reached. After that, doubling 318 319 of the links is implemented resulting in number of links in between that of the arithmetic progression and the original MFT. This 320 scheme is meant to increase the number of links at the lower rows 321 because in a typical placement, clients are placed adjacent to other 322 clients that they communicate with the most. Hence most of the 323 324 traffic will be routed through lower levels. Table 2 shows this link progression scheme for a NoC with 64 CRs for various combina-325 326 tions of  $I_A$  and  $S_I$ .

#### 327 4.1.3. Controlled geometrical progression

Based on the assumption that more links are needed at lower rows, this scheme keeps the number of links constant (with no progression at all) for upper rows and start doubling them at level  $S_L$ . This results in a much lower number of links and the overall gate count.

## 333 4.2. Modified router architecture

334 As was explained before, the routers in the original (see Fig. 2) 335 were designed with no possibility of contention due to the geomet-336 rical link progression. With the above progression schemes contention may arise and hence requires modification of the routers' 337 architecture. Since several input ports are going to be eventually 338 competing for fewer output ports crossbar structures have to be in-339 340 serted on both the left and the right paths to manage the allocation 341 of the output ports. Fig. 4 shows the router architecture modified to handle contention. Two crossbars have been added to arbiter 342 contentious allocation of the output ports. The two crossbars are 343 independent from one another, simplifying the design of the arbi-344 345 tration control. Routing, computed in the input port logic, determines to which crossbar the request for output port allocation is sent. Unlike regular crossbars, all output ports are considered as one single path and only a status bit (allocated/unallocated) is associated with output ports. Requests are queued in a first come first serve basis using simple logic that maintains the order of arrival of the requests. A request queue maintains this priority. It is a simple FIFO with a size equal to the difference between the number of input ports and output ports of the crossbar. Each entry is designed to hold the index of the corresponding input port that issued the request. Because all output ports are identical resources, there is no need to specify which output port is requested.

## 4.3. The client network interface (CNI)

The MFT NoC is characterized by a relatively high number of links at the edge of network (where it interface to the clients). Buffering has also been pushed to these interfaces. In order to overcome these limitations a new client interface architecture is proposed. This interface aims at reducing the number of parallel FIFOs (from the incoming links) into a single centralized FIFO before interfacing this FIFO to the client. Fig. 5 below shows the block diagram of the CNI. It is made of three parts; an upper part consisting of several bus-widener structures that will be named *parallelizers* from this point forward, a single centralized FIFO memory and a data transfer interface circuit. The outputs of the parallelizers are connected to the central FIFO via a single many-to-one multiplexer. The data transfer interface circuit is required to transfer the data between the two clock domains; the NoC's and the client's. The design of each of these parts is described below.

## 4.4. The parallelizers

Each one of the parallelizers is made of two layers. The first 374 layer is a collection of registers connected in parallel to the incom-375 376 ing data bus from one of the incoming links (ports). Packet data is received into one of these registers one word at a time. When this 377 layer is full, an entire line made by concatenating all the registers 378 379 of the first layer is transferred to a second set of registers (the second layer in the parallelizer) in a single clock cycle. The ratio be-380 381 tween the width of the parallel bus and the width of a single word is called the parallelization factor. Portions of packets from 382 the same source are always received by the same parallelizer in or-383 der. Packets from different sources are received on different paral-384



M.E.S. Elrabaa, A. Bouhraoua/Microprocessors and Microsystems xxx (2010) xxx-xxx



Fig. 5. Block diagram of the client network interface (CNI).

lelizers simultaneously and independently. When the first portion 385 386 of a packet is received and transferred to the second layer of the 387 parallelizer, a flag is set to request transfer of a new packet to 388 the FIFO. The control logic responsible for these transfers will first 389 attempt to reserve space in the FIFO corresponding to one packet. 390 The condition here for this architecture to produce efficient results 391 is the adoption of a fixed packet size. This condition simplifies the space allocation in the FIFO and alleviates the control logic from 392 any space or fragmentation management due to variable size allo-393 cation and disposal. In the case the FIFO is full and no space could 394 395 be reserved, the request is rejected and the backpressure mecha-396 nism is triggered on that requesting port.

397 The control logic continuously transfers the received packet 398 words to the FIFO. Every time a packet portion enters the second 399 layer of registers in one of the parallelizers a flag is set to indicate the presence of data. Those parallelizers which are currently 400 401 receiving packets are said to be active. Only active parallelizers 402 are continuously polled to check the presence of data. The polling follows a round-robin policy. A single clock cycle is used to process 403 404 the currently selected parallelizer. Polling the active parallelizers 405 only supposes some mechanism to "skip" all the non-active parallelizers between two active ones. In order to avoid wasting clock 406 cycles crossing those non-active parallelizers, a special request 407 408 propagation circuit has been designed. Fig. 6 shows the schematic 409 of this circuit. The upper set of flip-flops correspond to the status 410 flag indicating whether a parallelizer is active or not while the low-411 er one is used to indicate which parallelizer is selected to transfer 412 its data during a given clock cycle. The multiplexers are used to in-413 stantly skip the non-active parallelizers.

As a result of this fast polling scheme packets arriving simulta-414 415 neously on different parallelizers may be received in different or-416 der. Packet order from the same source is still guaranteed though 417 because of the buffer-less nature of the network. The absence of 418 buffers means that two consecutive packets cannot be sourced 419 from the same client simultaneously. So if these two packets are 420 destined to the same sink, the second cannot reach the destination 421 before the first one is entirely received. So, even if the first packet is 422 delayed, the second one cannot be sourced. Moreover, the routing

scheme is deterministic so two packets originating from the same 423 source that are destined to the same client take the same path. In 424 the extreme case where the packet size is so small that it can fit in 425 the FFs within few routers along the path freeing the source output 426 link (to the first router) and allowing the second packet to enter the 427 network, contention will occur between the two packets for the 428 same output port of a router along the path and the second packet 429 is held back. 430

431

## 4.5. The <mark>central</mark> FIFO buffer

The central FIFO is organized into a pool of packet slots that can 432 be allocated/freed individually. Each slot has the size of one packet. 433 Different parallelizers, simultaneously active, polled in a round ro-434 bin fashion using the fast polling circuit result in an out-of-order 435 writing of the different packet portions into the central FIFO. The 436 definition of FIFO is violated here as some packets complete their 437 writing before packets that started being written earlier in time. 438 This requires special handling. Two queues are used to keep track 439 of the slot indexes; (1) the Allocated Queue (AQ) and (2) the Free 440 Queue (FQ). The AQ contains the indices of the allocated slots con-441 taining complete packets while the FQ keeps track of the empty 442 slots. A temporary address table, called Writing Table (WT), is used 443 to store the addresses of the packets currently being written into 444 the FIFO. The size of this table is equal to the number of paralleliz-445 ers and each parallelizer has a corresponding entry in the table. 446 Fig. 7 shows the structure of the central FIFO. 447

When a new packet portion is received at one of the paralleliz-448 ers, a request is sent to the central FIFO control logic. If the FQ is 449 empty, backpressure is initiated to stop the network from sending 450 more data into the parallelizer. If the FQ is not empty, the top ele-451 ment in the FQ is popped and sent to the entry corresponding to 452 the requesting parallelizer in the WT. The entry, stored in a regis-453 ter, is divided into two fields. The upper field, called an Index (IDX), 454 is used to store the index slot allocated to the current packet cor-455 responding to the parallelizer it is coming from. The lower field 456 points to the current location in the slot to which the next packet 457 word is to be written. This field, called the Counter (CNT) is initial-458

470

471

472

473

474

475

476

477

478

479

480

481

482

483

484

485

486

487

488

489

490

491

492

493

494

495

496

497

498

M.E.S. Elrabaa, A. Bouhraoua/Microprocessors and Microsystems xxx (2010) xxx-xxx



Fig. 6. Schematic of the request propagation circuit at the heart of the intelligent polling scheme.



Fig. 7. The structure of the central FIFO.

ized to zero when the slot is allocated. When a new portion of a 459 460 packet for which a slot has already been allocated reaches the central FIFO, it is written in the location addressed by the correspond-461 ing entry in the WT. The CNT field is then incremented by one. 462 463 When the last portion of the packet is written into the central FIFO, the corresponding entry in the WT is invalidated and the IDX field 464 is pushed at the tail of the AQ. 465

466

467

468

469

The client side of the central FIFO logic continuously monitors the contents of the AQ. If at least one entry is present, which means one packet is ready to be sent out, the entry is popped out from the AQ. The IDX field is used to read out the different packet words from the central FIFO. After the packet is completely read out of the FIFO, the IDX field is pushed into the FQ to be reallocated to another packet.

## 4.6. The data transfer interface (DTI) circuit

There are several ways to transfer data between two clock domains in GALS systems (Globally Asynchronous Locally Synchronous) [27]. With out any assumptions about the two clocks, the asynchronous style represents the least impact on the communicating blocks operation and performance. Due to handshaking and synchronization between the two domains, a datum transfer would take at least three clock cycles of the slower of the two clocks [27]. In [28] throughput was increased to one datum per clock cycle (of the slower of the two clocks) by pipelining the synchronization itself alongside the data. This simple approach of implementing the FIFOs as pipelines greatly reduced the probability of failure due to Metastability and eliminated the need for detecting full/empty conditions. However it increased the latency of the interface since the pipeline has to be filled first before data can come out of it. It also imposed the constraint that the sender and receiver had to operate at the same data rate. A better approach for data transfer between different clock domains based on a general FIFO was proposed in [29]. It allows the sender and receivers to put (or send) and get (or receive) data at their own clock rates simultaneously. In addition for elaborate circuitry for detecting empty/full FIFO conditions, more circuits were added to detect when the FIFO is nearly full or empty. These signals are necessary to maintain the data transfer rates while synchronizing the conventional empty/full signals.

In this work a novel FIFO design for transferring data between two different clock domains was developed. The new FIFO design is simpler than the one in [29] yet it allows independent data send-

499 500

535

536

537

538

8

M.E.S. Elrabaa, A. Bouhraoua/Microprocessors and Microsystems xxx (2010) xxx-xxx

ing and receiving at different rates (equal to the sender and receiver clock rates, respectively) with no need for empty/full detection.
At the heart of this FIFO is a new circuit for transferring data between two clock domains. This basic component is first described then the construction of the FIFO will be shown.

## 506 4.7. The basic DTI FIFO stage

507 Fig. 8 below illustrates the basic circuit and the signaling proto-508 col required to transfer data between a client and the NoC. Data 509 transfer is illustrated for equal client and NoC clock frequencies 510 (CLK<sub>C</sub> and CLK<sub>NoC</sub>, respectively), the worst case condition in terms of the number of cycles required to complete a datum transfer. 511 512 Two data latches are utilized, one on the client side and another 513 on the NoC side. EN<sub>C</sub> and EN<sub>NoC</sub> are the latch enable signals for 514 putting the data and getting the data. A simple four-phase signal-515 ing protocol is used to simplify the circuit design. A conservative 516 two Flip Flop synchronizer design is employed [30]. The client ini-517 tiates the transfer by setting up the data and raising the PUT signal. 518 The client-side controller would strobe the latch (EN<sub>C</sub>  $\uparrow$ ) and initiates a request signal (Req<sub>Out</sub>). This signal would reach the NoC-side 519 controller as PUT<sub>Reg</sub> after two clock cycle (the synchronization de-520 lay). This controller would then strobe the NoC data latch and ini-521 tiates an acknowledgement signal (TAKE<sub>ACK</sub>). This signal would 522 reach the client-side controller after two more clock cycles which 523 in turn respond by deactivating the request signal. The NoC con-524 troller would then deactivate the acknowledge signal, completing 525 the transfer in 8 cycles. If the client or the NoC have higher clock 526 frequency than the other, the transfer would take less number of 527 cycles (the minimum is four). The use of two latches (instead of 528 a single latch or FF as in most FIFOs) per cell achieves two objec-529 tives; (1) it greatly simplifies the design by decoupling the PUT 530 (writing to the client's side latch) and GET (reading the NoC's side 531 latch output) operations, (2) it effectively provides a two-stage 532 pipeline per FIFO stage, reducing the impact of clock frequency dif-533 ference on the PUT/GET rates. 534

The number of required transfer cycles could have been reduced by overlapping data transfers but this would have resulted in more complex circuitry that would be slower to operate. As will be shown later, a maximum throughput of one datum transfer per cy-



Fig. 8. The proposed circuitry for data transfer between two clock domains; (a) the basic block diagram, (b) the signaling protocol for equal clock frequencies.



(a) The design of the client-side controller.



(b) The design of the NoC-side controller.



#### M.E.S. Elrabaa, A. Bouhraoua/Microprocessors and Microsystems xxx (2010) xxx-xxx

cle with a latency of less than one cycle is still achieved by the pro-posed FIFO.

541 Fig. 9 below shows the design of the client/NoC controllers of 542 one of the FIFO's stages (stage *i*). Each controller has a simple two-state FSM implemented with a single FF and simple logic. 543 The client-side controller basically would latch the data into the 544 545 client-side latch and assert the Req<sub>Out</sub> signal if it receives a PUT request while the PUT<sub>ACK</sub> signal is low. The Req<sub>Out</sub> signal is kept high 546 till the PUT<sub>ACK</sub> signal goes high. The OK\_to\_PUT signal is then set 547 when the PUT<sub>ACK</sub> signal goes back low. This signal is reset when 548 the controller latch new data. It would be clear why an SR-latch 549 is used to set and reset this signal when the FIFO construction is 550 described. The NoC-side controller would latch-in the data when 551 it receives a PUT request while the previous data has been con-552 553 sumed (i.e. when OK\_to\_TAKE is low). It also asserts the TAKE<sub>ACK</sub> 554 signal and keeps it high till the PUT request signal goes low. After latching the data the OK to TAKE signal is set high. The NoC con-555 sumes the data by asserting the TAKE signal which in turn resets 556 the OK\_to\_TAKE signal. 557

#### 558 4.8. DTI FIFO construction

Fig. 10 shows the block diagram of an *n* stage FIFO constructed from the basic data transfer circuit described above. Two counters are used as pointers to the tail of the PUT queue and the head of the GET queue. A ring-connected shift register could be used as a pointer for FIFOs with large number of stages. Input data lines (D<sub>IN</sub>) are connected to the inputs of all stages on the client side. When a PUT request is received from the client an internal PUT signal (PUT<sub>i</sub>) is asserted and routed to the stage selected by the PUT pointer if its OK\_to\_PUT signal is high. The pointer is also incremented. The SR-latch in the FIFO stage used for the OK\_to\_PUT signal of the selected stage would reset after one clock cycle, hence the internal PUT signal would not *evaporate* before the end of the cycle. This allows the client to put a data item every cycle (of its own clock) as long as the FIFO is not full (indicated by the OK\_to\_PUT signal of the tail cell).

On the NoC side a datum is removed every clock cycle from the head of the queue selected by the TAKE pointer if the corresponding OK\_to\_TAKE signal is high. The NoC can assert the back pressure signal (BP) to stop taking the data. When a data item is taken, the TAKE pointer is incremented to point to the next cell. As explained earlier, it would take 8 cycles to complete a datum transfer within a cell. Data transfer within all cells proceed in parallel, hence using 8-stage FIFO ensures achieving the maximum PUT/TAKE data rates of one datum per clock cycle for any client/ NoC clock frequencies.

## 5. Performance evaluation

To evaluate the performance of the proposed NoC a custom cycle-accurate simulator has been developed. Numerous simulations have been carried out to determine the optimum parallelization factor, central FIFO organization (width and size), and link progres-



Please cite this article in press as: M.E.S. Elrabaa, A. Bouhraoua, A hardwired NoC infrastructure for embedded systems on FPGAs, Microprocess. Microsyst.

(2010), doi:10.1016/j.micpro.2010.09.008

9

564

565

566

567

568

569

570

571

572

573

574

575

576

577

578

579

580

581

582

583

584

585 586 587

30 September 2010

#### M.E.S. Elrabaa, A. Bouhraoua/Microprocessors and Microsystems xxx (2010) xxx-xxx

589 sion scheme for several NoC sizes. Two simple synthetic traffic 590 models have been used; traffic with uniform address distribution 591 and traffic with non-uniform distribution. The latter model as-592 sumes that clients that communicate with one another the most are placed adjacent to one another. Hence the probability of a cer-593 tain destination address is assumed to decrease logarithmically 594 with the 'distance' between the destination address and the source 595 address. The distance between two clients, d, is the order of the 596 minimum group that contains both clients. So the probability of 597 a certain destination address  $P_A$  is simply calculated as  $P_A = (0.5)^d$ 598 with the imposed constraint that the sum for all destination prob-599 abilities equal to 1. A packet's delay is defined as the number of 600 clock cycles from the time a packet is injected into the network un-601 til it starts being read out by the receiving client. Throughput is the 602 603 amount of useful data versus idle time actually delivered by the 604 network per unit of time. It is expressed as a fraction of the maxi-605 mum bandwidth or wire speed (which is 1 word/cycle/client). In all 606 simulations, a very high injection rate (by all clients simultaneously) was used to stress out the NoC. 607

# 5.1. Evaluation of parallelizers' width and central FIFO organization on performance

610 Several simulations have been carried out to determine the ef-611 fects of the parallelization factor and central FIFO organization (size and width) on the NoC performance for various NoC sizes 612 for the two traffic models (uniform and non-uniform). Also two 613 packet sizes were considered; 32 bytes and 64 bytes. Figs. 11 and 614 12 show throughput and maximum delay for various combinations 615 of central FIFO width (or emptying rate, R), size and parallelizer's 616 width (called P on the figure). So an  $\mathbb{R}4P4$  value on the x-axis 617 means a FIFO emptying rate of 4 bytes/cycle and a parallelizer's 618 width of 4. For each NoC three FIFO sizes were considered with 619 the two traffic models and two packet sizes (32 and 64 bytes). So 620 a NonU F16 legend means non-uniform traffic and FIFO size of 621 16 packets. Similarly a U F16 legend means uniform traffic and a 622 FIFO size of 16 packets. These results were obtained using geomet-623 rical link progression to maximally stress out the client interface 624 circuitry. It is very clear that a FIFO size of 16 packets, emptying 625 rate of 8 and parallelizer width of 8 is sufficient to obtain the max-626 imum performance under all conditions. Also the larger packet size 627 vields a slightly better throughput. Hence these values (R = 8, P = 8, 628 FIFO size of 16 packets and packet size of 64 bytes) were used in all 629 subsequent simulations. It should also be noted that for the same 630 FIFO size and emptying rate a larger parallelizer width may actu-631 ally lead to a slightly reduced throughput (e.g. going from R4P8 632 to R4P16). This is due to the fact that it would take more cycles 633 to assemble packet fragments in the parallelizers. So when the FIFO 634 is full and backpressure is triggered, it will take longer before the 635 parallelizers get filled and transfer the data to the FIFO. 636



**Fig. 11.** Measured throughput versus FIFO emptying rate (*R*) and parallelizer's width (*P*) for 3 NoC sizes, several FIFO sizes, uniform and non-uniform traffic models. Results on the left are for a packet size of 32 bytes while results on the right are for a packet size of 64 bytes.

M.E.S. Elrabaa, A. Bouhraoua/Microprocessors and Microsystems xxx (2010) xxx-xxx



**Fig. 12.** Measured maximum packet delay versus FIFO emptying rate (*R*) and parallelizer's width (*P*) for 3 NoC sizes, several FIFO sizes, uniform and non-uniform traffic models. Results on the left are for a packet size of 32 bytes while results on the right are for a packet size of 64 bytes.

## 637 5.2. Evaluation of the different link progression schemes

Using the optimum parallelizer and central FIFO parameters obtained above, numerous simulations were carried out to evaluate
the performance of the different link progression schemes described in Section 3. Fig. 13 below shows the throughput for arithmetic and mixed arithmetic/geometrical link progression for a NoC
size of 64 CRs for uniform and non-uniform traffic. This Figure
clearly shows that for non-uniform traffic where most of the traffic

is local, for both schemes, the minimum number of extra links (i.e. 645 I = 2 and S = 4) would boost the throughput to the maximum pos-646 sible value. Any other values for I and S would be an over design. As 647 for uniform traffic (where most of the traffic traverse upper rows of 648 the NoC), a minimum increment of 2 combined with slightly lower 649 stopping level of 3 would yield a throughput very close to the max-650 imum while providing huge savings on hardware resources. The 651 average packet delays for these two progression schemes were 652 found to be identical and independent of the link progression 653





**Fig. 13.** Throughput as a function of link progression parameters (increment *I* and stop level *S*) for arithmetic progression (left) and mixed arithmetic/geometrical progression (right) for uniform (U) and non-uniform (NU) traffic.

657

658

659

660

661

662

663

664

665

666

667

668

669

670

671

12

#### M.E.S. Elrabaa, A. Bouhraoua/Microprocessors and Microsystems xxx (2010) xxx-xxx

#### Table 3

Average packet delay for both arithmetic and mixed arithmetic/geometrical link progression schemes and the original geometrical link progression scheme for the two traffic models.

| NoC type | Arithmetic<br>link progre |             | Original MFT with pure geometrical link progression |             |  |
|----------|---------------------------|-------------|-----------------------------------------------------|-------------|--|
| NoC size | Uniform                   | Non-uniform | Uniform                                             | Non-uniform |  |
| 16-CRs   | 15                        | 11.5        | 14.5                                                | 11.5        |  |
| 32-CRS   | 16.5                      | 12          | 16                                                  | 12          |  |
| 64-CRs   | 18.5                      | 12          | 18                                                  | 12          |  |



Fig. 14. Throughput versus link doubling start level for the controlled geometrical link progression scheme for three NoC sizes and for uniform (bottom group of curves) and non-uniform traffic.

654 parameters (increment and stop level). Table 3 shows the average packet delays for these link progression schemes compared to the 655 original geometrical link progression for the two traffic models for 656

several NoC sizes. Also the delays of these schemes are very close to the geometrical progression scheme.

Fig. 14 shows throughput versus link doubling start level for the controlled geometrical link progression scheme for three NoC sizes for the two traffic models. As expected this scheme yielded the lowest throughput for uniform traffic since it has the fewest links in upper rows where most of the traffic is. For the more realistic non-uniform traffic, this progression scheme achieves high throughput (>90%) even for relatively low level of link doubling. As will be shown next this allows the designer to reduce the number of links at the edge of the NoC (and consequently the NoC gate count) significantly while retaining an excellent throughput. Again the average packet delays for this scheme were identical to the other link progression schemes.

#### 5.3. DTI simulation results

The ability of the DTI circuit to transfer data between a client 672 and the NoC at the maximum rate of one datum/cycle is very cru-673 cial for the operation of the NoC. To verify the operation of the DTI 674 circuit a gate-level implementation of an 8-stage FIFO was simu-675 lated (with unit gate delays) with three Client to NoC clock fre-676 quency ratios; 1:1, 1:2.5 and 2.5:1. Fig. 15 shows the simulation 677 results for the three clock ratios alongside one another. The results 678 show that for equal frequencies, both client and NoC are able to 679 put/get a datum per clock cycle. When the NoC's clock frequency 680 is  $2.5 \times$  that of the client, the client is still able to put data every cy-681 cle but the data removal rate by the NoC is automatically reduced 682 by a factor of 2.5 of the NoC clock frequency. When the client's 683 clock frequency is  $2.5 \times$  that of the NoC, initially when the FIFO is 684 empty, the client is able to put data at the maximum rate. The rate 685 gradually goes down till it reaches 1/2.5 of the client's clock rate. 686 The gradual reduction of the rate is due to two factors; (1) for this 687



(a) Equal clock frequencies.

the client's. the NoC's.

Fig. 15. Simulation results of the 8-stage FIFO with three client/NoC clock frequency ratios.

731

732 733

734

735

736

737 738

739

740

741

742

743

744

745

746

747

748

749

750

751

752

753

754

755

756

757

758

759

760

761

762

763

764

765

766

767

768

M.E.S. Elrabaa, A. Bouhraoua/Microprocessors and Microsystems xxx (2010) xxx-xxx

clock ratio, it takes 4 NoC clock cycles to transfer a datum between
the client-side latch to the NoC-side latch. Since the FIFO size is 8
there will be enough time for several cells to complete their data
transfers. (2) The inherent pipelining with the cell due to the use
of two latches.

## 693 5.4. Comparison with a *mesh-based* NoC

In order to compare the MFT's performance to a classical Mesh NoC, another simulator was developed for simulating Mesh NoCs using simple non-bursty traffic with uniform address generation. Fig. 16 below shows the average latency results for several Mesh sizes. It is very clear that unlike the MFT, these networks start to saturate at very low injection rate ( $\sim$ 30%). This is consistent with reported results in the literature for this popular type of NoCs.

## 701 5.5. NoC gate count

The total gate count of various sized NoCs for the three link pro-702 gression schemes have been evaluated using a word size of 8-bits. 703 704 A special C-code is used to generate the RTL-level Verilog descrip-705 tion of routers, parallelizers and central FIFOs under different link 706 progression schemes. These Verilog codes were then automatically 707 synthesized. The special polling circuitry and the DTI were manu-708 ally designed using basic logic components (gates, MUXs and FFs) 709 to optimize their performance. A break down of gate count is shown below. 710

711 1. The router is made of:

712 713

723

724

730

- Upward and downward input ports which have been synthesized using 250 and 260 gates, respectively (mostly FFs)
- 714 Crossbars that are only present in routers with a number of output ports that are not enough to eliminate contention. 715 These crossbars are considered as series of wide many-to-716 one multiplexers implemented using pass-gates. They are 717 718 modeled as 3 2-input NAND-equivalent gates per bit. One multiplexer per output port is needed. The number of input 719 720 ports depends on the link progression scheme. For example, a crossbar of 6 inputs and 8 outputs with an 8-bits datapath 721 722 is equivalent to approximately 6 \* 8 \* 8 \* 3 = 1152 gates
  - Output ports contain minimum amount of logic (mainly for the backpressure signal)
- Parallelizers are implemented using simple FFs at a gate-equivalency of 4/FF/bit. For example, a 16-word-wide, 2-stage parallelizer of 8 links will amount to 16 × 2 × 8 × 8 × 4 = 8 K gates.
  The central FIFO is implemented using embedded memories
  - The central FIFO is implemented using embedded memories where the area taken by a 16 packets memory with a packet



Fig. 16. Average latency versus injection rate for mesh-based network of various sizes.

size of 64 bytes (8 K-bits) is  $0.008 \text{ mm}^2$  in a  $0.13 \mu \text{m}$  technology. This is roughly equivalent to 1.4 K gate in the same technology. The control logic of the central FIFO amounts to about 1 K gates.

4. The total gate count for DTI circuit depends on the width of the central FIFO's data width. For an 8-byte data width the DTI contains 7 K gates.

Table 4 below shows the gate count for arithmetic link progression for various combinations of arithmetic increment  $I_A$  and stop level  $S_L$ . The gate count for the original MFT is also shown in the table (last column) for comparison). The gate count for the mixed arithmetic/geometrical and the controlled geometrical link progression schemes are shown in Tables 5 and 6, respectively. For all progression schemes, the total RAM size used for central FIFOs is 128 Kb, 256 Kb and 512 Kb for NoC sizes of 16, 32 and 64 CRs, respectively.

Fig. 17 shows a bar chart of the throughput of all link progression schemes versus stop (start) levels for the arithmetic and mixed arithmetic/geometrical (controlled geometrical) link progression schemes for the two traffic models for a 64 CRs NoC. Also shown on the same Figure are the total gate counts for the three schemes (as lines) and the throughput and gate count of the original MFT NoC for reference. As this figure shows, the controlled geometrical progression with a low starting level yields the lowest gate count at an adequate throughput for non-uniform traffic. On the other hand the arithmetic link progression with a stop level of 3 achieved maximum throughput at a  $\sim$ 50% gate count for both types of traffic. It is very clear that the original geometrical and mixed arithmetic/geometrical link progression schemes represent an over design; large gate count at a throughput not more than the arithmetic progression. Based on these results FPGA designers can have different products with different link progression schemes for different applications; controlled geometrical progression (starting at level 2) for applications with localized non-uniform traffic and arithmetic progression targeting applications with more uniform traffic.

## 5.6. Impact on FPGA resources

To put the numbers of gate count into perspective the impact on 769 FPGA resources has been estimated for two cases of a 64 CRs NoC 770 with central FIFO size of 8 Kb; controlled geometrical progression 771 with starting level of 2 and arithmetic link progression with an 772 arithmetic progression of 2 and stop level of 4. The 1st design pro-773 vide 75% throughput for non-uniform traffic with minimum gate 774 775 count of 477 K-gates of logic and 512 Kb of RAM while the second design provide throughputs of 93% and 87% for non-uniform and 776 777 uniform traffic, respectively at a gate count of 558 K-gates of logic and 512 Kb of RAM. To evaluate the impact of the extra logic and 778 RAM resources required by the NoC on the FPGA resources (slices) 779 the following has been done: 780

1. To evaluate the area penalty of adding custom logic two Xilinx 781 Virtex-6 FPGAs with identical resources except for the number 782 of custom DSP slices have been considered. The XC6VLX240T 783 and XC6VLX365T [31] FPGAs are identical in every aspect (I/ 784 Os, MACs, Transceivers, Block RAMs, etc.) except for the number 785 of custom DSP slices (called DSP48E1 slices) and the number of 786 logic slices. The XC6VLX240T has 192 more DSP48E1 slices than 787 the XC6VLX365T. Accordingly the XC6VLX365T has 19,200 788 more logic slices than the XC6VLX240T. Hence one can fairly 789 assume that each additional DSP48E1 slice costs 100 logic 790 slices. So if the number of logic gates per DSP slice is estimated, 791 one can estimate the general cost of custom logic. This is a 792 crude method but can still give a good estimate of the cost of 793

M.E.S. Elrabaa, A. Bouhraoua/Microprocessors and Microsystems xxx (2010) xxx-xxx

adding custom logic. According to the data sheet of these FPGAs 794 [31], each DSP48E1 slice contains the following;  $25 \times 18$ -bit 795 two's complement multiplier, a 48-bit accumulator, multiplier 796 bypass logic, a 48-bit add/subtract arithmetic unit and a logic 797 unit that can generate any one of 10 different logic functions 798 of the two 48-bit operands. The DSP48E1 includes an additional 799 pre-adder, and the multiplier can perform logic functions (AND, 800 OR) and barrel shifting. Assuming that the multiplier is using a 801 modified Booth algorithm that saves area and enhance speed 802 [32] it would require a minimum of 2614 gates (30 gates for 803 encoding and sign extension, 2200 gates for partial products 804 generation and adder array, and 384 gates for the final 48-bit 805 adder). The pre-adder and bypass logic would require at least 806 624 gates, the logic unit would require a minimum of 528 gates 807 and the accumulator needs about 480 gates. The total (ignoring 808 any logic required to do the barrel shifting by the multiplier) 809 amounts to 4246 gates. This means that adding 42.46 gates of 810 custom logic will require the elimination of a single FPGA logic 811 slice. Hence, to add one of the above NoC implementations to a 812 Virtex-6 FPGA would require the elimination of about 11,200 813 and 13,100 logic slices for the controlled geometrical progres-814 sion and arithmetic progression NoCs, respectively. For an 815 XC6VLX760 Virtex-6 FPGA with 118,500 slices, this amounts 816 to 9.45% and 11% reduction of logic resources. 817 2. Considering the same LX760 Virtex-6 FPGA, the central FIFOs 818

would consume 512 Kb of the FPGA's 25,920 Kb block RAMs, which is less than 2%.

Hence, according to these area approximations, to add a fairly large sized NoC (64-clients) as a hardware macro to a state-ofthe-art FPGA around 10% of the logic resources and 2% for the block RAMs will have to be sacrificed.

## 5.7. CR size selection

Selecting the appropriate size of a CR in terms of logic slices rep-827 resents a major design trade-off. As the number of CRs increases 828 (hence each CR will have smaller number of logic slices), the CR 829 fragmentation is reduced but the NoC area will grow and more lo-830 gic slices have to be eliminated. Hence the number of CRs should 831 be determined by the class of applications targeted by the FPGA. 832 For example if the average number of logic slices required by cli-833 ents is 2000 with up to 20 clients per implementation, then a 834 32-CRs NoC with 2000 logic slice per CR could be used resulting 835 in an 64 K slice FPGA. Assuming arithmetic link progression with increment of 2 and stop level of 2, the NoC would occupy an area equivalent to 5600 logic slices and consume 256 Kb of the block RAMs. So an existing FPGA design with about 120,000 slices (e.g. 839 Xilinx XC6VLX760 Virtex-6 FPGA) could be re-designed to achieve 840 the above. This is the trend followed by FPGA vendors for some 841 time now; providing FPGAs with various flavors for different clas-842 ses of applications (general logic, DSP applications, embedded pro-843 cessors or the so called platform FPGAs, etc.). The NoC will be just 844 another variation on top of these basic options with FPGAs ranging 845 from having no NoC at all to having NoCs with various sizes. 846

## 5.8. FPGA implementation of the MFT NoC

Two NoCs with full geometrical progression (i.e. link doubling 848 from one level to the next) have been mapped to two FPGAs; a 849 32-client NoC on a Xilinx Virtex-6 XC6LX760 and a 16-client NoC 850 on a Virtex-4 XC4LX200 to show the area/speed of the NoC if 851 implemented as soft macros. Highest speed grades were used for 852 both implementations. The first NoC has a total of 80 routers, 853 2240 links, parallelizers (31 per client) and 32 central FIFOs. The 854 second NoC has a total of 32 routers, 608 links, 240 parallelizers 855

| ממרך בסתוור לווו | N Barroy IOI JUN         | dair couin (in in Baire) ior several ivoc miprimentations with va |                    | ווסמי למומווורורוים זטו מוזוחווורחר זוווא לוספורשוטוו |                    | v programmi        |                    |                    |                    |                    |                    |                  |
|------------------|--------------------------|-------------------------------------------------------------------|--------------------|-------------------------------------------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|------------------|
| NoC type         | MFT with arithmetic link | Г                                                                 | orogression        |                                                       |                    |                    |                    |                    |                    |                    |                    |                  |
| NoC size         | $I_A = 2, S_L = 1$       | $I_A = 2, S_L = 2$                                                | $I_A = 2, S_L = 3$ | $I_A = 2, S_L = 4$                                    | $I_A = 4, S_L = 1$ | $I_A = 4, S_L = 2$ | $I_A = 4, S_L = 3$ | $I_A = 4, S_L = 4$ | $I_A = 6, S_L = 1$ | $I_A = 6, S_L = 2$ | $I_A = 6, S_L = 3$ | $I_A = 6, S_L =$ |
| 16-CRs           | 106                      | 93                                                                | 73                 | I                                                     | 165                | 139                | 102                | I                  | 200                | 245                | 131                | I                |
| 32-CRS           | 299                      | 274                                                               | 230                | I                                                     | 547                | 488                | 381                | I                  | 856                | 756                | 561                | I                |
| 64-CRs           | 833                      | 782                                                               | 686                | 558                                                   | 1657               | 1529               | 1275               | 946                | 2726               | 2497               | 2025               | 1420             |

94 781 8147

MFT

4

several NoC implementations with various parameters for arithmetic link progression Gate count (in K gates) for

(2010), doi:10.1016/j.micpro.2010.09.008

32-CRS 64-CRs Table 4 Please cite this article in press as: M.E.S. Elrabaa, A. Bouhraoua, A hardwired NoC infrastructure for embedded systems on FPGAs, Microprocess. Microsyst. 847

819

820 821

822

823

824

825

## **ARTICLE IN PRESS**

#### M.E.S. Elrabaa, A. Bouhraoua/Microprocessors and Microsystems xxx (2010) xxx-xxx

## Table 5

| Gate count (in K gates) | for the mixed | arithmetic/geometrical | link | progression | scheme. |
|-------------------------|---------------|------------------------|------|-------------|---------|
|                         |               |                        |      |             |         |

| NoC type | MFT with mix       | ed arithmetic/ge | eometrical link p | progression     |                    |               |                 |                    |                 | MFT  |
|----------|--------------------|------------------|-------------------|-----------------|--------------------|---------------|-----------------|--------------------|-----------------|------|
| NoC size | $I_A = 2, S_L = 2$ | $I_A=2,S_L=3$    | $I_A=2,\ S_L=4$   | $I_A=4,\ S_L=2$ | $I_A = 4, S_L = 3$ | $I_A=4,S_L=4$ | $I_A=6,\ S_L=2$ | $I_A = 6, S_L = 3$ | $I_A=6,\ S_L=4$ |      |
| 16-CRs   | 117                | 152              | _                 | 173             | 199                | -             | 231             | 246                | -               | 194  |
| 32-CRS   | 327                | 427              | -                 | 537             | 638                | -             | 770             | 854                | -               | 781  |
| 64-CRs   | 890                | 1146             | 1636              | 1585            | 1862               | 2464          | 2403            | 2628               | 3296            | 3147 |

## Table 6

Gate count (in K gates) for the limited geometrical link progression scheme.

| NoC type | MFT with  | controlled geo | metrical link p | progression | MFT  |
|----------|-----------|----------------|-----------------|-------------|------|
| NoC size | $S_L = 1$ | $S_L = 2$      | $S_L = 3$       | $S_L = 4$   |      |
| 16-CRs   | 64        | 104            | _               | -           | 194  |
| 32-CRS   | 142       | 222            | 401             | -           | 781  |
| 64-CRs   | 317       | 477            | 836             | 1593        | 3147 |

(15 per client) and 16 central FIFOs. These are fairly large NoCs 856 (due to geometrical progression) that would be implemented using 857 781 K and 194 K Gates of custom logic and 256 Kb and 128 Kb of 858 RAM. The central FIFOs were implemented using the FPGAs' 859 BRAMs which are 18 Kb each. Each NoC was implemented twice; 860 once with the parallelizers implemented using slice FFs and an-861 other with BRAMs. Table 7 shows the area and speed results for 862 these two NoCs alongside several NoC implementations on FPGAs. 863





#### Table 7

Area and speed results for the soft implementation of the MFT NoC. Area is reported as number of LUTs used and their percentage of available. The same is done with BRAM. Resources have been normalized for 8-bit data width.

| NoC                                                                           | Resources                                           | Total LUTs                              |                                  | FFs only                              |                                 | BRAM                                   |                                      | Freq.              |
|-------------------------------------------------------------------------------|-----------------------------------------------------|-----------------------------------------|----------------------------------|---------------------------------------|---------------------------------|----------------------------------------|--------------------------------------|--------------------|
|                                                                               |                                                     | Complete<br>NoC                         | Per<br>client                    | Complete<br>NoC                       | Per<br>client                   | Complete<br>NoC                        | Per<br>client                        |                    |
| 32-client MFT on Virtex-6 (LX760)<br>with 5-i/p LUTs                          | With FF parallelizers<br>With BRAM<br>parallelizers | 139,409<br>(29.4%)<br>82,497<br>(17.4%) | 4356<br>(0.9%)<br>2578<br>(0.5%) | 84,928<br>(17.9%)<br>26,032<br>(5.5%) | 2654<br>(0.6%)<br>814<br>(0.2%) | 576 Kb<br>(2.2%)<br>18,432 Kb<br>(71%) | 18 Kb<br>(0.07%)<br>576 Kb<br>(2.2%) | 349 MHz<br>341 MHz |
| 16-client MFT on Virtex-4 (LX200)<br>with 4-i/p LUTs                          | With FF parallelizers<br>With BRAM<br>parallelizers | 21,133<br>(12%)<br>20,623<br>(11.5%)    | 1321<br>(0.8%)<br>1289<br>(0.7%) | 25,743<br>(14.4%)<br>9007<br>(5%)     | 1609<br>(0.9%)<br>563<br>(0.3%) | 288 Kb<br>(4.8%)<br>4608 Kb<br>(76%)   | 18 Kb<br>(0.3%)<br>288 Kb<br>(4.8%)  | 211 MHz<br>210 MHz |
| 5-clients on Virtex-4 (LX200ff)<br>with 4-i/p LUTs [24]                       | With FF FIFOs<br>With BRAM FIFOs                    | 7716<br>(4.33%)<br>4278<br>(2.4%)       | 1544<br>(0.9%)<br>856<br>(0.5%)  | -                                     |                                 | -                                      |                                      | 118 MHz<br>122 MHz |
| 5-ports router on Altera Stratix<br>EP1S80 [26] <sup>a</sup>                  | Only the router                                     | 550 4-i/p LUTs                          |                                  | -                                     |                                 | -                                      |                                      | 123 MHz            |
| 4-clients on Virtex-2 XCV800<br>with 4-i/p LUTs [16] <sup>a</sup>             | With BRAMs in routers                               | 3227<br>(34.8%)                         | 807<br>(8.7%)                    | -                                     |                                 | -                                      |                                      | 40 MHz             |
| 5-ports router on Virtex-2 Pro<br>(XC2VP30) with 4-i/p LUTs [17] <sup>a</sup> | With BRAMs in router                                | 772 (2.57%)                             |                                  | -                                     |                                 | 180 Kb                                 |                                      | 32.25 MH2          |
| 9-clients on Virtex-2 Pro 40<br>with 4-i/p LUTs [20] <sup>a</sup>             | With BRAMs in routers                               | 6110<br>(14%)                           | 679<br>(1.6%)                    | -                                     |                                 | 594 Kb<br>(17%)                        | 66 Kb<br>(1.9%)                      | 50 MHz             |

<sup>a</sup> Clients network interfaces are not included.

16

#### M.E.S. Elrabaa, A. Bouhraoua/Microprocessors and Microsystems xxx (2010) xxx-xxx

#### Table 8

Area (gate count) and speed results for the HW implementation of a 16-client MFT NoC compared to other HW NoCs. Resources have been normalized for 8-bit data width.

|                                                          | Total Gate count           | _                            | RAM                                      |                 | Freq.   |
|----------------------------------------------------------|----------------------------|------------------------------|------------------------------------------|-----------------|---------|
| NoC                                                      | Complete NoC               | Per client                   | Complete NoC                             | Per client      |         |
| 16-client MFT using 90-nm technology <sup>b</sup>        | 194 K gates<br>216 K gates | 12.1 K gates<br>13.5 K gates | 128 Kb<br>Included in equivalen          | 8 Kb<br>t gates | 800 MHz |
| 5-clients using 90-nm technology [24] <sup>c</sup>       | 77 K Gates                 | 15.4 K gates                 | Included in equivalent gates             |                 | 500 MHz |
| 5-ports router using 180-nm technology [26] <sup>a</sup> | 1500                       |                              | Not including buffer RAM and its control |                 | 340 MHz |
| 9-clients using 130-nm technology [20] <sup>a,b</sup>    | 1.1 M gates                | 122 K gates                  | Included in equivalent                   | t gates         | 200 MHz |

<sup>a</sup> Clients network interfaces are not included.

 $^{\rm b}$  Speed estimated as  $4\times$  that of the FPGA implementation as in [33].

<sup>c</sup> Gate count estimated by dividing total area in mm<sup>2</sup> by the area of a minimum size 2-ip NAND gate.

Resources are reported as total used and per client since different 864 865 implementations implemented different NoC sizes. Also, for the published soft NoCs the number of LUTs were normalized for a 866 867 data width of 8-bits by simple division. Many of the reported implementations do not report the detailed usage of BRAMs (like 868 [16] where BRAMs are used in the routers). Many researchers just 869 870 implement the routers without the clients' network interfaces. Re-871 sources are reported as total numbers and percentage of available 872 resources. Although these percentages do not mean much because 873 they depend on the FPGA used, they are reported since they were 874 reported in the original papers.

Although the MFT NoC was not intended for soft implementation it is still comparable to other soft NoCs in resource utilization and superior in speed as the results in Table 7 show.

#### 5.9. Speed and area comparison with other HW NoCs

879 Table 8 shows the area (gate count) and speed estimation for the 16-client MFT NoC with geometrical link progression alongside 880 several HW NoCs from the literature. The performance of the MFT 881 882 NoC is estimated based on the Virtex-4 FPGA implementation to be 883 at least between 800 and 1000 MHz [33] for a 90-nm technology (the Virtex-4 technology). The gate count is given for two in-884 stances; one with central FIFOs expressed in RAM Kbs and another 885 with the RAM converted to their gate equivalent for comparison 886 with other HW NoCs. This table shows that the MFT NoC is very 887 888 efficient in area and superior in performance even if the other 889 HW NoC speed is assumed to increase by 50% from one technology 890 generation to the next.

## 891 6. Conclusions

A new hardwired NoC based on a modified Fat Tree topology for 892 893 future FPGAs has been developed along with an efficient H-tree 894 floor plan that naturally follows the construction methodology. The performance of the proposed NoC has been evaluated for three 895 link progression schemes for various NoC sizes and traffic models. 896 Simulations result showed that for uniform traffic, arithmetic link 897 898 progression would yield best performance with moderate gate count. For more localized traffic (non-uniform), the controlled geo-899 900 metrical progression can provide adequate performance at the lowest gate count. As for the client interface, a new efficient design 901 902 have been developed that requires a single central FIFO buffer 903 combined with link-termination circuitry (parallelizers) that pro-904 vide temporary storage for incoming data words. An intelligent 905 round robin circuit was also developed to transfer packet data from 906 parallelizers to the central FIFO. The design of these interface cir-907 cuitry was optimized through extensive simulations. These simula-908 tions showed that a parallelization factor of 8 combined with a 909 FIFO size of 16 packets and emptying rate (i.e. width) of 8 words

are enough to achieve maximum throughput and minimum packet 910 delay. Simulations also showed that performance is fairly indepen-911 dent of packet size. Finally a new interface circuit that transfer data 912 between the NoC and clients operating at different (and unrelated) 913 clock frequencies has been developed. Unlike the asynchronous 914 interfaces used by other NoCs, this circuit allows the NoC and cli-915 ents to communicate synchronously at the maximum rate of 1 da-916 tum/cycle no matter what is the clock frequency ratio between 917 them. Due to its asynchronous nature, the operation of this circuit 918 was verified with gate-level simulations. Both soft and hard wired 919 implementations of the MFT NoC show superior performance over 920 reported NoCs at an area cost that is comparable or better than 921 other NoCs. 922

#### Acknowledgement

This work was supported by King Fahd University of Petroleum and Minerals (KFUPM) Grant # IN070367.

923

924

925

926

927

928

929

930

931

932

933

934

935

936

937

938

939

940

941

942

943

944

945

946

947

948

949

950

951

952

953

954

955

956 957

958

959

960

961

962

963

964

965

#### References

- T. Dorta, J. Jiménez, J.L. Martín, U. Bidarte, A. Astarloa, Overview of FPGA-based multiprocessor systems, in: Proc. 2009 International Conference on Reconfigurable Computing and FPGAs, 2009, pp. 273–278.
- [2] W. Webb, FPGAs Reshape Embedded Design, EDN Magazine, March 2009. <www.edn.com/article/CA6643363.html>.
- 3] Xilinx Virtex-5 FPGA Configuration Guide, UG191 (v3.8), 2008.
- [4] A. Hemani, A. Jantsch, S. Kumar, A. Postula, J. Oberg, M. Millberg, D. Lindqvist, Network on chip: an architecture for billion transistor era, in: Proc. IEEE NorChip Conf., 2000.
- [5] W.J. Dally, B. Towles, Route packets, not wires: on chip interconnection networks, in: Proc. 38th Design Automation Con., 2001, pp. 684–689.
- [6] J. Henkel, W. Wolf, S. Chakradhar, On-chip networks: a scalable, communication-centric embedded system design paradigm, in: Proc. 17th Int. Conf. VLSI Design, 2004, pp. 845–851.
- [7] E. Nilsson and J. Öberg, Reducing power and latency in 2-D mesh NoCs using globally pseudochronous locally synchronous clocking, in: Proc. Int. Conf. on Hardware/Software Codesign and System Synthesis (CODES+ISSS'04), 2004, pp. 176–181.
- [8] E. Nilsson and J. Öberg, Trading off power versus latency using GPLS clocking in 2D-mesh NoCs, in: Proc. Int. Sym. On Signals, Circuits and Systems (ISSCS), 2005, pp. 51–54.
- [9] V. Soteriou, H. Wang, L. Peh, A statistical traffic model for on-chip interconnection networks, in: Proc. of the 14th IEEE Intl. Symp. on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS '06), 2006, pp. 104–116.
- [10] E. Nilsson, Design and Implementation of a Hot-potato Switch in a Network on Chip, Master's Thesis, Royal Institute of Technology, IMIT/LECS 2002–2011, Sweden, June 2002.
- [11] K. Goossens, J. Dielissen, A. Radulescu, Æthereal network on chip: concepts, architectures, and implementations, IEEE Design and Test of Computers 22 (5) (2005) 414–421.
- [12] A. Bouhraoua and M. Elrabaa, A high-throughput network-on-chip architecture for systems-on-chip interconnect, in: Proc. of the Int. Symp. on System-on-Chip (SOC06), 2006, pp. 1–4.
- [13] A. Bouhraoua, M. Elrabaa, An efficient network-on-chip architecture based on the fat tree (FT) topology, Arabian Journal of Science and Engineering (AJSE) 32 (2C) (2007) 13–26 (special issue on microelectronics).
- [14] C. Leiserson, Fat-trees: universal networks for hardware-efficient supercomputing, IEEE Transactions on Computers C-34 (10) (1985) 892–901.

972

977

978

983

984

985

986

987

991

992

993

999

# **ARTICLE IN PRESS**

M.E.S. Elrabaa, A. Bouhraoua/Microprocessors and Microsystems xxx (2010) xxx-xxx

- 966 [15] A. Bohraoua and M. Elrabaa, Addressing heterogeneous bandwidth 967 requirements in modified fat-tree networks-on-chips, in: Proc. of the 4th 968 IEEE Int. Sym. on Electronic Design, Test & Applications (DELTA'2008), 2008, 969 pp. 486-490. 970
  - [16] T. Marescaux, et al., Networks on chip as hardware components of an OS for reconfigurable systems, in: Proc. 13th Int. Con. on Field Programmable Logic and Applications (FPL), 2003, pp. 595-605.
- 973 B. Sethuraman, P. Bhattacharya, J. Khan, R. Vemuri, LiPaR: A light weight [17] 974 parallel router for FPGA based networks on chip, in: Proc. 15th ACM Great 975 Lakes Symp. on VLSI, 2005, pp. 452-457.
- 976 [18] N. Kapre, et al., Packet switched vs. time multiplexed FPGA overlay networks, in: Proc. 14th Annual Int. IEEE Sym. on Field-Programmable Custom Computing Machines (FCCM), 2006, pp. 205-216.
- 979 [19] T. Pionteck, et al., Applying partial reconfiguration to networks-on-chips, in: 980 Proc. 16th Int. Con. on Field Programmable Logic and Applications (FPL), 2006, 981 pp. 1-6. 982
  - [20] T. Bartic, J.-Y. Mignolet, T. Marescaux, D. Verkest, S. Vernalde, R. Lauwereins, Highly scalable network-on-chip for reconfigurable systems, in: Proc. 2003 IEEE Int. Symp. On System-on-Chip (SOC03), 2003, pp.79-82.
  - R. Hecht, S. Kubisch, A. Herrholtz, D. Timmermann, Dynamic reconfiguration [21] with hardwired networks-on-chip on future FPGAs, in: Proc.14th Int. Conf. on Field Programmable Logic and Applications (FPL), 2005, pp. 527-530.
- 988 [22] R. Gindin, I. Cidon, I. Keidar, NoC-based FPGA: architecture and routing, in: 989 Proc. 1st ACM/IEEE Int. Sym. on Networks-on-Chip (NOCs'07), 2007, pp. 253-990
  - [23] H. Elmiligi, M. El-Kharashi, F. Gebali, Introducing OperaNP: a reconfigurable NoC-based platform, in: Proc. 2007 IEEE Canadian Conf. on Electrical and Computer Engineering, 2007, pp. 940–943.
- 994 [24] K. Goossens, M. Bennebroek, J.-Y. Hur, M. Wahlah, Hardwired networks on 995 chip in FPGAs to unify functional and configuration interconnects, in: Proc. 996 2nd ACM/IEEE Int. Sym. on Networks-on-Chip (NOCs'08), 2008, pp. 45-54. 997
- M. Wahlah, K. Goossens, 3-Tier reconfiguration model for FPGAs using 998 hardwired network on chip, in: Proc. Int. Con. on Field Programmable Technology (FPT), 2009, pp. 504-509.
- 1000 [26] R. Pau, N. Manjikianm, Implementation of a configurable router for embedded 1001 network-on-chip support in FPGAs, in: Proc.6th NEWCAS-TAISA Con., 2008, 1002 pp. 25-28.
- 1003 P. Teehan, M. Greenstreet, G. Lemieux, A survey and taxonomy of GALS design 1004 styles, IEEE Design & Test of Computers 24 (5) (2007) 418-428.
- 1005 [28] J. Seizovic, Pipeline synchronization, in: Proc. Int. Symp. On Advanced Research 1006 in Asynchronous Circuits and Systems (ASYNC 94), 1994, pp. 87-96.
- 1007 [29] T. Chelcea, S. Nowick, Robust interfaces for mixed timing systems, IEEE 1008 Transactions on Very Large Scale Integration (VLSI) Systems 12 (8) (2004) 1009 857-873.
- 1010 [30] R. Ginosar, Fourteen ways to fool your synchronizer, in: Proc. of 9th Int. Symp. 1011 On Asynchronous Circuits and Systems (ASYNC'03), 2003, pp. 1-8.

- [31] http://www.xilinx.com/support/documentation/data\_sheets/ds150.pdf.
- M. Elrabaa, I. Abu-Khater, M. Elmasry, Advanced Low-Power Digital Circuit [32] Techniques, Kluwer Academic Publishers, 1997.
- I. Kuon and J. Rose, Measuring the gap between FPGAs and ASICs, in: Proc. of [33] 15th Int. Symp. IEEE Field Programmable Gate Arrays (FPL), 2006, pp. 21-30.



Muhammad E. S. Elrabaa received M.A.Sc. and PhD degrees in Electrical & Computer Engineering from the University of Waterloo, Waterloo, Canada, in 1991 and 1995, respectively. From 1995 till 1998, he worked as a senior component designer with Intel Corp., Portland, Oregon, USA. He designed and developed low power digital circuits for Microprocessors. He is currently an associate professor with the computer Engineering department, KFUPM University. His current research interests include Networks-on-chip, defect tolerant circuit techniques and reconfigurable computing. He authored and co-authored numerous papers, a book and holds two US patents.



Abdelhafidh Bouhraoua is an Assistant Professor at the Computer Engineering Dept., KFUPM, Saudi Arabia. Dr. Bouhraoua completed his Bs. in Computer Science and Engineering from the National Institute of Informatics in Algiers, Algeria in 1989, After completing his Bs: he joined the Ministry of Scientific Research, Algeria, where he contributed in the design and VLSI integration of a RISC microprocessor. He, then, joined the VLSI/System Architecture group (ASIM) at the University Of Paris. France, where he completed his Ms. and PhD degrees, respectively in 1993 and 1998. During his PhD, Dr. Bouhraoua contributed to the design and performance evaluation of the interconnection network that was the

core of a PC-based network of workstations. Prior to joining KFUPM, Dr. Bouhraoua was working as a Senior ASIC Designer for numerous North American Fabless ASIC design companies like Applied Micro Circuits Corporation and Zarlink Semiconductor where he was designing ASICs for various telecom systems. Currently, Dr. Bouhraoua research interests are in SoC design methodologies, NoCs design; reconfigurable architectures; FPGA systems; embedded systems, robotics and mechatronics.

17

1012

1013

1014

1015

1016

1017

1020