# Low-latency Multi-Level Mesh Topology for NoCs

Mohsen Saneei Nanoelectronics Center of Excellence, School of Electrical and Computer Engineering, University of Tehran, Tehran, Iran Email: mohsen.saneei@ece.ut.ac.ir Ali Afzali-Kusha Nanoelectronics Center of Excellence, School of Electrical and Computer Engineering, University of Tehran, Tehran, Iran Email: afzali@ut.ac.ir Zainalabedin Navabi Computer Aided Design Lab., Elec. and Comp. Eng. Dept. University of Tehran, Tehran, Iran Email: navabi@cad.ece.ut.ac.ir

Abstract- In this paper, we introduce a new topology for network on chips which is named Multi-Level Mesh topology. The Multi-Level Mesh topology is basically similar to the 2D-mesh with this difference that we have several meshes that have some common routers. This architecture reduces the latency and the dynamic power consumption in NoCs and can improve the communication throughput in high traffic applications. This architecture reduces the latency of  $3\times3$ ,  $5\times5$ , and  $7\times7$  2-level mesh architecture, respectively. The results are expected to improve further if a better adaptive routing algorithm is utilized.

*Index Terms*— Network-on-Chip, latency, topology, Multi-Level Mesh, routers.

## I. INTRODUCTION

The integration of a system onto a chip (SoC) in near future will become very difficult due to the fact that the number of transistors on a chip will exceed one billion with several hundred cores very soon [1]. These large systems may have a lot of problems. Delay of wires, synchronization, power consumption, scalability, reusability, and bus-bandwidth are some of these problems which most of them originates from long wire delays [1][2][3][4][5].

Usually, the interconnect architecture in the traditional SoCs is based on dedicated wires or shared buses. In dedicated wires approach, every bus is common between only two cores. In this approach the number of wires in the system increase exponentially as the number of cores grows. Shared bus is a set of wires common to multiple cores. This approach is more scalable and reusable than dedicated wires. However, we only allow have one communication transition at a time in the bus limiting the bandwidth and scalability of this interconnect architecture. So, this architecture is not suitable for large SoCs [3].

Network-on-Chip (NoC), as a new SoC paradigm, is a good solution to reduce these problems [2][4][5]. A NoC is a regular tile-based architecture that provides the communication infrastructure for resources. Each NoC has two main parts which are computation part and communication part. The

computation part contains several operational blocks or IPcores which connect to a communication part (routers) through resource network interface (RNI). The communication part is a set of links and routers that transport packets from source IP to destination IP. Several topologies have been proposed for this part (see, [4][5][6][7][8]). Some of these topologies are shown in Figure 1. In this figure, the IP-cores are shown by white squares while routers are shown by black squares.

The first topology is SPIN which was proposed in [8]. SPIN is a tree based topology that every node has 4 children and parent is replicated 4 times at any level. A mesh based interconnect architecture is proposed in [5] that consist of an  $M \times N$  mesh of routers. Each router has been connected to one IP core and four neighboring routers, except for the routers in the edges. Therefore, the number of routers is equal to the number of IP blocks.

2D-torus is another commonly used topology presented in [4]. 2D-torus is basically the same as a regular mesh except in the edge routers. The routers at the edges are connected to the routers at the opposite edge through a wrap-around channel. These long wrap-around channels can yield excessive delays. However, this delay can be avoided by folding the torus [7].

The performance is one of very important parameter in NoCs. In this paper, we introduced a topology named Multi-Level-Mesh (MLM) that reduces latency and power and increase throughput. In section 2, we introduce the proposed MLM topology. In Section 3, our router will be introduced. The power and latency issues for the proposed scheme are described in section 4. The experimental results are given in section 5 and in section 6 we extend 2-level mesh to multi-level mesh. Finally the summary and conclusion are presented in Section 7.



Figure 1. some already proposed topology

# II. MULTI-LEVEL MESH TOPOLOGY

Here, we propose a new topology, named Multi-Level Mesh or MLM which reduces the latency and dynamic power consumption and increases throughput of the communication part of a NoC. The MLM topology is basically similar to the 2D-mesh with this difference that we have several meshes that have some common routers. For example, consider a 2-level  $(2N-1) \times (2N-1)$  mesh architecture that consists of two meshes. The main mesh (the first level) is an  $(2N-1) \times (2N-1)$  regular 2D-mesh topology. In the second level, there is an N  $\times$  N 2Dmesh. All routers in the second level mesh have nine ports (except for the edge routers) and they are common between the first level and second level meshes. Other routers which are in the first level mesh are 5-port routers. Figure 2 shows a  $5 \times 5$  2level mesh topology that consist a  $5 \times 5$  mesh in the first level and a  $3 \times 3$  mesh in the second level. This NoC has 25 routers consisting nine 9-port routers and sixteen 5-port routers. Our experimental results show that 5-port and 9-port routers have 1837 and 4549 gates. So, each 9-port router is about 2.5 times larger than a 5-port router. Thus, a  $5 \times 5$  2-level mesh NoC has about 54% more area overhead in the routers compared to a 5  $\times$ 5 regular 2D-mesh NoC. If the size of each IP-core is  $1 \times 1$ , the total link length in a 5  $\times$  5 regular 2D-mesh and in a 5  $\times$  5 2level mesh will be 40 and 64, respectively, leading to about 60% area overhead in the link length. Table 1 shows some specification of an  $(2N-1) \times (2N-1)$  2-level mesh. So, the area overhead of 9-port routers in a N×N 2-Level Mesh topology is  $37.5(1 + 1/N)^2$  percent and the area overhead of links is 50(N+1)/N percent for N  $\geq$  3. This suggests that the area overhead of the routers is in the range of 37.5% to 66.7% and the area overhead of the links is in the range of 50% to 66.7%.

Our experimental results show that the average number of routers between two random IP-cores in  $3 \times 3$ ,  $5 \times 5$ , and  $7 \times 7$  2-level mesh architecture are about 13%, 17%, and 21% less than those in the corresponding mesh architecture. On the other hand, because the number of paths between two functional IP-cores in 2-level mesh is more than the regular mesh, the traffic in 2-level mesh is lower than that of mesh and the bandwidth in the 2-level mesh is more than the bandwidth of the mesh topology. Therefore, it is expected that 2-level mesh to have a lower latency as well.

# III. ROUTERS

In this section, we explain the routers for the proposed topology. Several structures for routers were already proposed. Some researchers have proposed fully asynchronous routers [9][10] and some of them have proposed Globally-Asynchronous-Locally-synchronous or GALS structure [11][12]. We consider parallel 32-bit links and GALS routers. So, every router has a local clock generator and all internal part of a router is synchronized with this local clock, but two neighboring routers communicate together asynchronously. Every link in our design has 32-bit for data (flits) and 2-bit to determine start and end of packets (SOP and EOP) and 2-bit for request (Req) and acknowledge (Ack) signals.



Figure 2.  $5 \times 5$  3-Level mesh topology.

#### Table 1

Some specification of (2N-1)×(2N-1) 2-Level Mesh topology.

| Total number of routers        | (2N - 1)2        |
|--------------------------------|------------------|
| Total number of 5-port routers | 3N2 - 4N + 1     |
| Total number of 9-port routers | N2               |
| Total number of short links    | 4(N - 1)(2N - 1) |
| Total number of long links     | 2N(N - 1)        |

Each  $5 \times 5$  router has 5 34-bit input latches, 5 input port, 5 output ports, and 5 34-bit output latches. Each packet is divided to several 32-bit flits where the first flit has the routing information such as the destination address. When an input latch receives the first flit of the incoming packet (SOP), the corresponding input port reads the destination address filed and selects an output port to deliver the input flits. Then the output port sends the incoming flit to the next router through its output latch and the corresponding link. The rest of the flits of the corresponding packet are transferred toward the destination through the same path.

If in the same time, more than one input port send packets to same output port, an arbiter select one of them and send selected packet through corresponding output latch on the output link. The arbiter works with a semi-round-robin algorithm. First, each input port has an arbitrary priority. When an output port accepts one request, the priority of corresponding input port is lowered. This way we make sure that all input packets would obtain a chance to be transported to output links. The routing algorithm in 9-port routers is similar to the XY routing algorithm. However, wherever possible, the incoming packets would go through the second level mesh.

## IV. POWER AND LATENCY EVALUATION

The model proposed in [13], named bit energy model, may be used for the energy consumption of the network. The bit energy ( $E_{bit}$ ) is defined as the energy consumed when one bit of data is transported through the router.

$$E_{bit} = E_{routers} + E_{links} \tag{1}$$

where  $E_{router}$  represents the energy consumed by the router (switch, buffers, and internal wires) and  $E_{link}$  represents the energy consumed by charging and discharging of the link capacitance. In this model, the average energy consumption for sending one bit data from tile  $t_i$  to tile  $t_j$  in a 2D-mesh is given by

$$E_{bit}^{l_i,l_j} = n_R \cdot E_{R5} + (n_R - 1) \cdot E_{link} \tag{2}$$

where  $n_R$  is the number of routers between tiles  $t_i$  to tile  $t_j$ .  $E_{R5}$  and  $E_{link}$  are the energy consumptions in a 5-port router and a mesh link, respectively [14]. The 2-level mesh architecture has two kinds of router and link. Therefore, the average energy consumption for sending one bit data from tile  $t_i$  to tile  $t_j$  in a 2level-mesh is:

$$E_{bit}^{l_i, l_j} = n_{R5} \cdot E_{R5} + n_{R9} \cdot E_{R9} + (n_{L1} + 2.n_{L2}) \cdot E_{link}$$
(3)

where  $n_{R5}$  and  $n_{R9}$  are the numbers of 5 and 9 port routers, respectively, and  $n_{L1}$  and  $n_{L2}$  are the numbers of short and long links, respectively, in the path between tile  $t_i$  to tile  $t_j$ ,  $E_{R5}$  and  $E_{R9}$  are the energy consumptions in a 5-port and 9-port routers, respectively, and  $E_{link}$  is the energy consumption in a normal link. We have

$$n_R - 1 = n_{L1} + 2 \cdot n_{L2} \tag{4}$$

$$n_R > n_{R5} + n_{R9} \tag{5}$$

which suggests that the power consumption in the links of the 2-level mesh architecture is the same as that of the mesh architecture. If  $E_{R9}$  is close to  $E_{R5}$ , the dynamic power consumption of 2-level mesh is lower than that of mesh.

The latency between two tiles is a function of the number of the routers in the path, the total length of the links in the path, and the traffic of the path. In the 2-level mesh architecture, the total length of the links in the path is the same as that of 2Dmesh but the number of the routers between the two far IPcores and the traffic of the path is less than those of mesh. Therefore, we expect that the average latency in a 2-level mesh to be less than the average latency in the mesh architecture. But for neighboring functional IP-cores, the latencies in the two topologies are the same. This is due to the fact that the neighboring IP-cores communicate together through the same router in the two architectures.

## V. MULTI-LEVEL MESH TOPOLOGY

To extend the 2-level mesh topology to multi-level mesh topology, we can add additional layers on top of the 2-level mesh architecture. Figure 3 shows a  $9 \times 9$  4-level mesh topology which is a combination of four  $9 \times 9$ ,  $5 \times 5$ ,  $3 \times 3$ , and

 $2 \times 2$  2D-mesh topologies. In this figure we have not shown the functional IP-cores for a better clarity. Every link in the fourth-level mesh is twice longer than the third-level links, 4 times longer than the second level links, and 8 times longer than the first level links.

## VI. EXPERIMENTAL RESULTS

To evaluate the proposed architecture, we designed and simulated NoCs with  $3 \times 3$ ,  $5 \times 5$ , and  $7 \times 7$  2-level mesh topologies and their corresponding 2D-mesh. In our simulations, each packet has eight 32-bit flits. Each functional IP-core generates 1000 packets and delivers them to the network with a random destination address. We changed the packet rate injection from 10% to 100% and determined the latency and the power consumption in every NoC. The results of our simulations are shown in the Figure 4. As is evident from this figure, the average latency reduction in  $3 \times 3$ ,  $5 \times 5$ , and  $7 \times 7$  2-level mesh architectures are about 12.5%, 21.4%, and 18.5% compared to those of the corresponding mesh topologies.

As described before, the  $9 \times 9$  routers transports input packets through long links if it is possible. Therefore, in large NoCs with the 2-level mesh topology, long links have more traffic than short links. If we use an adaptive routing algorithm to achieve the uniform traffic in the network, the average latency reduction in large topologies such as  $7 \times 7$  2-level mesh is considerably more than those of small topologies. Table 2 shows that the average number of routers in every packet path that is about 5.61 and 4.44 in  $7 \times 7$  2D-mesh and 2-level mesh topologies, respectively. Recall that in the mesh topology, all routers are the same while in the 2-level mesh topology some routers have 9 ports and consume power about 30% more than 5-port routers. But total power consumption in 2-level mesh is less than 2D-mesh.

#### VII. SUMMARY AND CONCLUSION

In this paper, we proposed the Multi-Level Mesh architecture to reduce the latency and the power of NoCs. In the topology, we add some new path between far IP-cores. Therefore, this architecture reduces the latency and the power consumption for the transactions between these cores while not affecting the power consumption for local transactions. The experimental results showed that the latencies of  $3 \times 3$ ,  $5 \times 5$ , and  $7 \times 7$  2-level mesh architectures were reduced about 12.5%, 21.4%, and 18.5% compared to the corresponding mesh architectures. If a good adaptive routing algorithm is used, one can achieve even better results. The area overhead of the proposed architecture was between 37.5% to 66.7% compared to the corresponding 2D-mesh topologies.



Figure 3. A  $9 \times 9$  4-level mesh topology



Figure 4. Latency of  $3 \times 3$ ,  $5 \times 5$ , and  $7 \times 7$  mesh and 2-level mesh.

Table 2 Power in 2-level mesh architecture

| Size<br>of | # of<br>Routers in | # of routers in 2-level<br>mesh |               | Power red. |
|------------|--------------------|---------------------------------|---------------|------------|
| NoC        | Mesh               | 5-port router                   | 9-port router | in routers |
| 3×3        | 2.93               | 1.38                            | 1.18          | 0.7%       |
| 5×5        | 4.28               | 1.95                            | 1.59          | 6.5%       |
| 7×7        | 5.61               | 2.39                            | 2.05          | 10%        |

#### References

- D. Bertozzi and L. Benini, "Xpipes: A Network-on-Chip Architecture for Gigascale Systems-on-Chip," IEEE Cir. and Sys. Magazine, Second Quarter 2004, pp.18-31
- [2] L. Benini and G. De Micheli, "Network on Chips: A New SoC Paradigm," IEEE Computer, Jan. 2002.
- [3] A.V. De Mello, L. C. Ost, F. G. Moraes, N. L. V. Calazans, "Evaluation of Routing Algorithms on Mesh Based NoCs," Faculdade Informatica Pucrs, Brazil, 2004
- [4] W. J. Dally and B.Towles, "Route Packet, Not Wires: On-Chip Interconnection Networks," DAC 2001, Las Vegas, USA, june 18-22,
- [5] S. Kumar and et. al., "A Network on Chip Architecture and Design Methodology," Proc. of the IEEE Computer Society Annual Symposium on VLSI (ISVLSI'02), 2002
- [6] P. P. Pande, C. Grecu, M. Jones, A. Ivanov, and R. Saleh, "Performance Evaluation and Design Trade-Offs for Network-on-chip Interconnect Architectures," IEEE Tran. on Computers, Vol. 54, No. 8, August 2005, pp.1025-1040
- [7] W.J. Dally and C.L. Seitz, "The Torus Routing Chip," Tech. Report 5208: Comp. Science Dept., California Inst. of Technology, pp. 1-19, 1986.
- [8] P. Guerrier and A. Greiner, "A Generic Architecture for On-Chip Packet-Switched Interconnections," Proc. Design and Test in Europe (DATE), pp. 250-256, Mar. 2000.
- [9] D. Rostislav, V. Vishnyakov, E. Friedman, and R. Ginosar, "An Asynchronous Router for Multiple Service Level Networks on Chip," 11th IEEE Inter. Symp. on Asynchronous Cir. and Sys. (ASYNC'05) pp. 44-53
- [10] R. Dobkin, I. Cidon, R. Ginosar, A. Kolodny, and A. Morgenshtein, "Fast Asynchronous Bit-Serial Interconnects for Network-on-Chip," CCIT Report #529, April 2005
- [11] C. A. Zeferino, A. A. Susin, "SoCIN: A Parametric and Scalable Network-on-Chip," Proc. of the 16th Symp. on Integrated Cir. and Sys. Design (SBCCI'03), 2003
- [12] F. Moraes, A. Mello, L. Moller, L. Ost, and N. Calazans, "A Low Area Overhead Packet-Switched Network on Chip: Architecture and Prototyping," IFIP VLSI-SoC 2003, December 1-3 2003, Darmstadt, Germany
- [13] T.Ye, L. Benini, and G. D. Micheli, "Analysis of Power Consumption on Switch Fabrics in Network Routers," DAC 2002, June 10-14, USA
- [14] J. Hu, R. Marculescu, "Energy and Performance-Aware Mapping for Regular NoC Architectures," IEEE Tran. on Computer-Aided Design of Integrated Circ. and Sys., Vol. 24, No. 4, April 2005.