High Performance Computing Laboratory

Communication-Centric CMP Design

Peak Power Control

With the increasing demand for interconnect bandwidth, the on-chip network has become the major power consumer in a chip. The communication traffic of running applications makes routers greedy to acquire more power, so the total power consumed by the network may exceed the supplied power and cause reliability problems. To ensure high performance while satisfying power constraints, the on-chip network must have a peak power control mechanism. Keeping the peak power consumption of a single chip within bounds is essential to maintaining supply voltage levels, supporting reliability, limiting the required capacity of heat sinks, and meeting affordable packaging costs. Since the total power supplied to a chip is distributed among all of its units, each unit should keep its power consumption below a preset upper limit.

Multimedia applications on a System-on-Chip (SoC) have been studied extensively for their bandwidth requirements over heterogeneous network components. We focus instead on the QoS environment in homogeneous networks such as chip multiprocessors. An on-chip network must guarantee the delivery of multimedia data (real-time traffic) as well as support normal message-oriented communication (best-effort traffic).

We propose a credit-based peak power control mechanism that meets a pre-specified power constraint while maintaining service quality by regulating packet injection. We take different approaches for the two traffic types. For real-time traffic, instead of throttling packets of already established connections, our scheme decides whether to accept a new connection based on its expected power consumption and the available power budget, as in admission control. We also show how to estimate the expected power consumption of a connection from its bandwidth requirement. For best-effort traffic, we estimate the required power of a packet from the distance between its source and destination. If the expected power consumption exceeds the power budget, we throttle the injection of the packet, as in congestion control.
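The sketch below (in Python, for illustration only) captures the control logic this implies: admission control for real-time connections and distance-based injection throttling for best-effort packets. The linear power model, the per-hop energy constant, and all names are assumptions made for this sketch, not the actual router power model.

    # Credit-based peak power control (illustrative sketch).
    # E_PER_HOP and the linear power model are assumptions for this sketch.

    E_PER_HOP = 1.0  # assumed energy per flit per hop, in abstract units

    class PeakPowerController:
        def __init__(self, power_budget):
            self.power_budget = power_budget  # preset upper limit for the network
            self.allocated = 0.0              # power credited to admitted real-time flows

        def connection_power(self, bandwidth, hops):
            # Expected power of a connection, assumed proportional to its
            # bandwidth requirement and its path length.
            return bandwidth * hops * E_PER_HOP

        def admit_connection(self, bandwidth, hops):
            # Real-time traffic: admission control. Accept a new connection
            # only if its expected power fits within the remaining budget.
            need = self.connection_power(bandwidth, hops)
            if self.allocated + need <= self.power_budget:
                self.allocated += need
                return True
            return False

        def may_inject(self, src, dst, flits):
            # Best-effort traffic: estimate packet power from the Manhattan
            # distance between source and destination on a 2D mesh, and
            # throttle injection when the budget would be exceeded.
            hops = abs(src[0] - dst[0]) + abs(src[1] - dst[1])
            need = flits * hops * E_PER_HOP
            return self.allocated + need <= self.power_budget

For example, a controller built with a budget of 100 units would admit a connection of bandwidth 2 over 10 hops (expected cost 20) and then throttle any best-effort packet whose distance-based estimate no longer fits in the remaining 80 units.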

Figure: Peak power control by regulating input load.

Domain-Specific On-Chip Network Design in Large Scale Cache Systems

As circuit integration technology advances, the design of efficient interconnects has become critical. On-chip networks have been adopted to overcome the scalability and poor resource-sharing problems of shared buses and dedicated wires. However, using a general on-chip network for a specific domain may leave network resources underutilized and incur large network delays, because the interconnect is not optimized for that domain. Addressing these two issues is challenging because it requires in-depth knowledge of both interconnects and the target domain. Recently proposed Non-Uniform Cache Architectures (NUCAs) use wormhole-routed 2D mesh networks to improve the performance of on-chip L2 caches. We observe that network resources in NUCAs are underutilized yet occupy considerable chip area (52% of the cache area), and that the network delay is significant (63% of the cache access time). Motivated by these observations, we investigate how to optimize cache operations and design the network in large scale cache systems. We propose a single-cycle router architecture that can efficiently support multicasting in on-chip caches. Next, we present Fast-LRU replacement, where cache replacement overlaps with data request delivery. Finally, we propose a deadlock-free XYX routing algorithm and a new halo network topology to minimize the number of links in the network.
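For reference, the sketch below shows plain dimension-order XY routing on a 2D mesh, which is deadlock-free because it forbids Y-to-X turns; our XYX algorithm extends this idea with a second X phase for the halo topology, and those details are not reproduced here. The coordinate convention and port names are assumptions for the sketch.

    # Dimension-order XY routing on a 2D mesh (illustrative baseline).
    # Forbidding Y-to-X turns makes the routing deadlock-free; the XYX
    # algorithm adds a second X phase for the halo topology (not shown).

    def xy_route(cur, dst):
        # Return the output port at router `cur` for a packet headed to
        # `dst`; coordinates are (x, y) tuples, port names are assumed.
        cx, cy = cur
        dx, dy = dst
        if cx < dx:
            return "EAST"   # resolve the X dimension first
        if cx > dx:
            return "WEST"
        if cy < dy:
            return "NORTH"  # then move in Y
        if cy > dy:
            return "SOUTH"
        return "LOCAL"      # arrived: eject to the local cache bank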

Figure: Domain-specific network development and performance of different designs.
Design A: 16x16 Mesh (64KB)
Design B: 16x16 Simplified Mesh (64KB)
Design C: 16x4 Simplified Mesh (256KB)
Design D: Spike-16 Halo (64KB)
Design E: Spike-5 Halo (non-uniform size)

Adaptive Data Compression with Table-based Hardware

The design of a low-latency on-chip network is critical to high system performance, because the network is tightly integrated with the processors and the on-chip memory hierarchy, operating with a high-frequency clock. To provide low latency, there have been significant efforts in the design of routers and network topologies. However, given the stringent power and area budgets of a chip, simple routers and network topologies are more desirable. In fact, conserving metal resources for link implementation leaves more space for logic such as cores or caches. Therefore, we focus on maximizing bandwidth utilization in the existing network. Data compression has been adopted in hardware designs to improve performance and power. Cache compression increases effective cache capacity by compressing block data so that more blocks fit in a fixed space. Bus compression expands the effective bus width by encoding wide data as small codes. Recently, data compression has been explored in the on-chip network domain for performance and power.

We investigate adaptive data compression for on-chip network performance optimization and propose a cost-effective implementation. Our design uses a table-based compression approach that dynamically tracks value patterns in traffic. Using a table, the compression hardware handles diverse value patterns adaptively, rather than relying on static patterns based on zero bits in a word. The table-based approach can easily achieve a better compression rate by increasing the table size. However, keeping data values on a per-flow basis requires a huge table area; the number of tables grows with the network size, because communication cannot be globally managed in a switched network. To address this problem, we present a shared-table scheme that stores identical values as a single entry across different flows. In addition, a management protocol keeps the encoding and decoding tables consistent in a distributed way, so that out-of-order delivery in the network is still allowed.
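A minimal sketch of the table-based encode/decode idea follows (Python, for illustration). The table size, the flit tags, and the update rule are simplifying assumptions; in particular, this sketch assumes in-order delivery, whereas the actual management protocol also tolerates out-of-order delivery.

    # Table-based value compression (illustrative sketch). One table is
    # shared across flows: identical values are stored as a single entry.

    class TableCodec:
        def __init__(self, size=16):
            self.size = size   # assumed number of shared entries
            self.table = []    # value table, mirrored at encoder and decoder

        def encode(self, value):
            if value in self.table:
                # Hit: transmit a small index instead of the full data word.
                return ("INDEX", self.table.index(value))
            if len(self.table) < self.size:
                self.table.append(value)  # decoder mirrors this on a RAW flit
            return ("RAW", value)

        def decode(self, kind, payload):
            if kind == "INDEX":
                return self.table[payload]
            if len(self.table) < self.size:
                self.table.append(payload)  # keep both tables consistent
            return payload

    # Usage: encoder and decoder sit at opposite ends of a path.
    enc, dec = TableCodec(), TableCodec()
    flit = enc.encode(0xDEADBEEF)   # ("RAW", value) on first sight
    assert dec.decode(*flit) == 0xDEADBEEF
    flit = enc.encode(0xDEADBEEF)   # ("INDEX", i) once the value is cached
    assert dec.decode(*flit) == 0xDEADBEEF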

We also demonstrate techniques to reduce the negative impact of compression on performance. Streamlined encoding combines the encoding and flit injection processes into a pipeline to hide the long encoding latency. Furthermore, dynamic compression management optimizes our scheme by selectively applying compression to congested paths.
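As an illustration of dynamic compression management, the sketch below compresses only traffic bound for congested paths, so packets on uncongested paths skip the extra encoding latency. The occupancy threshold and the feedback mechanism are assumptions for the sketch.

    # Dynamic compression management (illustrative sketch).

    CONGESTION_THRESHOLD = 0.7  # assumed buffer-occupancy trigger

    def should_compress(path_buffer_occupancy):
        # path_buffer_occupancy: fraction of downstream buffers in use,
        # e.g. fed back through credits or a congestion bit in packets.
        return path_buffer_occupancy >= CONGESTION_THRESHOLD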

Figure: Packet compression.


    © 2004 High Performance Computing Laboratory, Department of Computer Science, Texas A&M University
    427C Harvey R. Bright Bldg, College Station, TX 77843-3112