Xiao B, Zhang YF, Gao YP *et al.* A robust and power-efficient SoC implementation in 65 nm. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 28(4): 682–688 July 2013. DOI 10.1007/s11390-013-1368-7

# A Robust and Power-Efficient SoC Implementation in 65 nm

Bin Xiao<sup>1,2,3</sup> (肖 斌), Yi-Fu Zhang<sup>1,2,3</sup> (张译夫), Yan-Ping Gao<sup>1,2,3</sup> (高燕萍), Liang Yang<sup>3</sup> (杨 梁), Dong-Mei Wu<sup>1,3</sup> (吴冬梅), and Bao-Xia Fan<sup>3</sup> (范宝峡)

<sup>1</sup>State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China

<sup>2</sup> University of Chinese Academy of Sciences, Beijing 100049, China

<sup>3</sup>Loongson Technology Corporation Limited, Beijing 100190, China

E-mail: {xiaobin01, zhangyifu, athene}@ict.ac.cn; yangliang@loongson.cn; {wudongmei, fanbaoxia}@ict.ac.cn

Received September 28, 2012; revised March 7, 2013.

**Abstract** Godson2H is a complex SoC (System-on-Chip) of the Godson series: a 117 mm<sup>2</sup>, 152-million-transistor chip fabricated in a 65 nm CMOS LP/GP process technology. It integrates a 1 GHz processor core and abundant high- and low-speed peripheral IO interfaces. To overcome on-chip-variation problems in deep submicron design, many methods are adopted in the clock tree, and PVT detectors are integrated for debug. To meet the low power constraints of different applications, most state-of-the-art low power methods are carefully applied, such as dynamic voltage and frequency scaling, power gating and aggressive multi-voltage design.

**Keywords** System-on-Chip, on-chip-variation, PVT detector, low power, hierarchical design flow

## 1 Introduction

As VLSI technology continues to scale, on-chip-variation (OCV) and power density have become key problems in the physical design phase. OCV is changing the design problem from deterministic to probabilistic<sup>[1-2]</sup>, which negatively affects both performance and yield. The power dissipation of a chip, limited by packaging, cooling and other infrastructure, must be controlled carefully, so support for advanced power management technology in the chip implementation phase is mandatory in physical design. Both problems increase design complexity, so an efficient design flow that enables fast design closure is required.

In this paper, a complex SoC (System-on-Chip) design named Godson2H is introduced, which shows some novel ideas for dealing with the challenges mentioned above. The rest of this paper is organized as follows. Section 2 introduces the architecture of Godson2H, Section 3 lists methods for OCV tolerance in high performance circuits, Section 4 illustrates the low power methods used in Godson2H, Section 5 presents an improved hierarchical design method, and Section 6 gives the conclusions of our work.

## 2 Architecture of Godson2H

Godson2H is an SoC designed for multiple application scenarios. It can be used either as a microprocessor or as a southbridge chip for Godson3 series microprocessors<sup>[3-4]</sup>. It is implemented in a 65 nm LP/GP mixed CMOS process with 7 layers of Cu metallization, and it contains 152 million transistors in an area of 117 mm<sup>2</sup>.

As a processor, it integrates a 1 GHz high performance processor core (a MIPS64-compatible 4-issue superscalar RISC processor core with 512 KB L2 cache<sup>[5]</sup>, named GS464), some high-speed IO interfaces (such as HyperTransport 2.0, DDRII/III PHY, and PCIE 2.0 PHY) and other computer function units (such as a 3D-capable GPU, a media decoding module supporting 1080p video decoding for the H.264, VC-1 and MPEG2 formats, and other low-speed peripheral interfaces). Without any other cooperating chips on the board, it can be used as a complete computer system.

As a southbridge chip for Godson3 series microprocessors, it is connected to them through the HyperTransport interface, and it integrates many low-speed peripheral interfaces such as SATA, USB, GMAC, HDA, LPC, SPI, I2C and NAND.

Short Paper

This work is supported by the National Natural Science Foundation of China under Grant Nos. 61003064, 61050002, 61070025, 61100163, the National High Technology Research and Development 863 Program of China under Grant Nos. 2012AA010901, 2012AA011002, 2012AA01202, 2013AA014301.

©2013 Springer Science + Business Media, LLC & Science Press, China

The architecture of Godson2H is shown in Fig.1. It has two levels of crossbars: the level 1 crossbar (named X1) is a 2×2 crossbar that connects the GS464 core, SCACHE, DMA module and HT controller; the level 2 crossbar (named XBAR) connects the other high-speed IOs (PCIE and DDR), the GPU module and the media decoding module (named XPU). The low-speed IOs in the southbridge module are connected to XBAR by an AXI mux module.



Fig.1. Architecture of Godson2H.

## 3 OCV Tolerant Design

Scaling of CMOS technology increases intra-die variation. According to [6], the share of intra-die channel length variation grows from 35% of the total variation in a 0.13 $\mu$m technology to about 60% in a 0.07 $\mu$m technology. Variability must be handled carefully in the design phase because it plays an important role in ASIC timing<sup>[7]</sup> and yield<sup>[8]</sup>. The most critical signal in Godson2H, the clock signal, is designed in an OCV tolerant way, covering both clock generation and clock distribution. Furthermore, to verify the different OCV effects, three kinds of detectors are integrated for post-silicon test and debug.

### 3.1 Clock Generation

A high performance PLL (phase-locked loop) with low power consumption is integrated in Godson2H. Its frequency range is from 600 MHz to 3.2 GHz, and its CC-RMS jitter is 3.0 ps at 1.44 GHz. The peak power consumption of this PLL is only about 2.891 mW (2.124 mW in the analog part and 0.767 mW in the digital part). The architecture of the PLL is shown in Fig.2.

A digital calibrator runs before the analog loop during PLL start-up to remove the impact of process variation on the oscillator; the analog loop then starts to work and the PLL locks quickly. The greatly reduced VCO gain brings excellent jitter performance, and the 5-stage ring VCO and AC-coupled duty cycle output stage keep an accurate duty cycle and suppress random noise. Loop-in and loop-out series number dividers make the output frequency freely adjustable.

Fig.2. Architecture of PLL in Godson2H.

### 3.2 Clock Distribution

In the clock tree design of the CPU core in Godson2H, many OCV tolerant design methods are used, such as register clustering, clock skew scheduling, fine-grained clock gating, useful clock skew and H-tree distribution. Similar registers are identified by both logical and physical information and grouped into the same register cluster. Registers at the startpoints and endpoints of critical paths are grouped together manually to keep clock skew small under variations. Register clusters serve not only as the basic units for clock skew scheduling, but also as relative groups during placement and routing, so that they form a regular local H-tree structure. Clock skew scheduling is performed iteratively during placement, and even during synthesis, to improve performance. Clock distribution is divided into two stages. At the register block level, registers are placed regularly and clock gating cells are implemented for them (about 83% of the flip-flops in the GS464 core are clock gated); clock delay cells are then inserted according to their target skew values. At the top level, the global clock latencies of different blocks are balanced by an H-tree clock structure. The clock distribution scheme achieves no more than 20 ps of clock skew in the CPU core, and the regular H-tree structure, both global and local, makes the entire clock tree more OCV-tolerant.

### 3.3 OCV Detector

Post-silicon test and debug require better observation of OCV. Three monitors have been integrated into Godson2H: a thermal sensor, a voltage measure circuit and a process monitor.

The process monitor checks whether the chip remains within pre-defined process limits at the chip test phase, and it is helpful for debug and failure analysis at the application level. It can monitor the performance of transistors with different threshold voltages, memory cells and thick oxide transistors. For each monitored object it has a dedicated sensor consisting of two oscillators: one for the speed test and the other for the leakage test. While the monitor is running, the rest of the chip is not working, so both voltage and temperature are close to the ideal values used in simulation.

A thermal sensor integrated in the chip provides a digital measurement of the junction temperature collected by bipolar transistors. According to the runtime temperature measured by the thermal sensor, the computer system can adjust the workload or the fan speed. The temperature measurement range is from 20°C to 125°C and the sensitivity is 1°C. To remove process effects, digital calibration is supported.

Supply voltage is one of the most important factors for both timing and power. To detect voltage variation, a voltage measure circuit is integrated in Godson2H, as shown in Fig.3. The idea is that the delay of each cell depends on its supply voltage. The core of the circuit is a ring oscillator. The period of this oscillator is counted by digital logic and encoded into a set of flip-flops. The values of these flip-flops can be read out by software and converted to voltage values according to SPICE simulation results (with process and temperature set to the tested conditions). The circuit has a delay variation ranging from 5 ns to 500 ns, which results in a 5 mV voltage resolution.
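The software conversion from an oscillator reading to a voltage can be sketched as a table lookup with linear interpolation. This is only an illustrative sketch: the calibration pairs below are hypothetical placeholders standing in for SPICE results, and the name `period_to_voltage` is ours, not part of Godson2H's software.

```python
from bisect import bisect_left

# Hypothetical calibration table: (oscillator period in ns, supply voltage in V).
# In practice these pairs would come from SPICE simulation at the tested
# process corner and temperature; the numbers here are placeholders.
CALIBRATION = [
    (5.0, 1.30),
    (10.0, 1.20),
    (20.0, 1.10),
    (60.0, 1.00),
    (500.0, 0.90),
]

def period_to_voltage(period_ns, table=CALIBRATION):
    """Linearly interpolate the supply voltage for a measured period."""
    periods = [p for p, _ in table]
    # Clamp readings that fall outside the calibrated range.
    if period_ns <= periods[0]:
        return table[0][1]
    if period_ns >= periods[-1]:
        return table[-1][1]
    i = bisect_left(periods, period_ns)
    (p0, v0), (p1, v1) = table[i - 1], table[i]
    return v0 + (v1 - v0) * (period_ns - p0) / (p1 - p0)
```

A longer period maps to a lower voltage in this table, reflecting that cells slow down as the supply drops.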

## 4 Low Power Design

To suit various application scenarios, function units should provide the best performance when required and consume the least power when idle. To meet strict low power requirements, nearly all state-of-the-art low power design methods<sup>[9]</sup> are adopted in Godson2H. The whole chip is implemented mainly in static CMOS circuits, and a mixed process is used across modules: high-speed modules use general purpose (GP) process cells for performance, and low-speed modules use low power (LP) process cells for leakage power control. Even within one module using the same process, cells with different threshold voltages are chosen for different paths. For example, about 34% of the total gates use the GP process, of which about 66% use high threshold voltage cells. For most function units the clock signal can be gated globally by software, and fine-grained clock gating of flip-flop banks within modules is also adopted. These two methods save a large amount of power on the clock network.

In addition, some aggressive low power techniques are used: dynamic voltage and frequency scaling (DVFS), power gating and aggressive multi-voltage design. The following subsections focus on these three techniques as implemented in Godson2H.

### 4.1 Dynamic Voltage and Frequency Scaling

From the dynamic power equation

$$P \sim \alpha \times F \times V^2, \tag{1}$$

where $\alpha$ is the switching activity, we know that dynamic power can be saved by lowering the frequency $F$ or the power supply voltage $V$. In most industrial processors, DVFS is used to reduce dynamic power when the system does not need the highest performance<sup>[10]</sup>. Godson3 series microprocessors realize frequency scaling in a smart way by selective clock pulse gating, but the power supply voltage cannot be scaled down<sup>[11]</sup>. A completely different DVFS approach is used in Godson2H. A customized PLL, which supports 6-bit successive clock division ranging from 1/2 to 1/64, is used for the first time. In this way the actual highest working frequency can be reduced, the power supply voltage can then be scaled, and dynamic power dissipation is reduced more significantly.



Fig.3. Voltage measure circuit.

When the chip works as a southbridge chip, or when the computer system is idle, the GS464 processor core and the SCACHE module do not need the highest performance, so frequency and power supply voltage can be scaled down. In addition, when the power management unit (PMU) signals a smaller workload, control signals are sent to the PLL and the on-board voltage regulator to trigger DVFS.

Although the power supply voltage could be lowered for more power savings, the devices work only in a voltage range from 0.9 V to 1.3 V, so voltage scaling must stay within this range. The benefit is that no level shifter is needed for signals passing from one voltage area to another.

A key issue in DVFS is the choice of work points. Although frequency and power supply voltage could be scaled in fine steps, signoff timing analysis for many combinations of frequency and supply voltage requires enormous work. Without loss of generality, 50 mV and 100 MHz are chosen as the scaling steps for voltage and frequency respectively. Additionally, the timing of boundary paths crossing the signal interface between the DVFS domain and the normal domain must be checked carefully to retain some timing margin.

The results of power dissipation under typical working conditions are plotted in Fig.4. Dynamic power is reduced by about 80%, from about 2.9 W at (1.1 GHz, 1.15 V) to 0.6 W at (400 MHz, 0.90 V); leakage power benefits somewhat less.
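To see why the step size matters for signoff effort, the candidate work points can be enumerated as a grid. The sketch below uses the 50 mV and 100 MHz steps and the 0.9 V to 1.3 V device range from the text; the 400 MHz to 1.1 GHz frequency bounds are an illustrative assumption, and `work_points` is a hypothetical helper name.

```python
def work_points(v_min_mv=900, v_max_mv=1300, v_step_mv=50,
                f_min_mhz=400, f_max_mhz=1100, f_step_mhz=100):
    """Return every (frequency in MHz, voltage in mV) pair on the grid."""
    voltages = range(v_min_mv, v_max_mv + 1, v_step_mv)
    freqs = range(f_min_mhz, f_max_mhz + 1, f_step_mhz)
    return [(f, v) for v in voltages for f in freqs]

points = work_points()
print(len(points))  # size of the candidate grid
```

The grid is an upper bound on the number of signoff runs, since combinations such as the highest frequency at the lowest voltage would not meet timing and can be discarded. Shrinking the steps to, say, 10 mV and 25 MHz enlarges the same grid from 72 to 1189 candidates.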



Fig.4. Power consumption of DVFS part.
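As a rough cross-check of the reported saving, the two work points quoted above can be plugged into (1). The sketch assumes the switching activity is the same at both points, which the paper does not state; the function name is ours.

```python
def scaled_dynamic_power(p_ref_w, f_ref_mhz, v_ref, f_new_mhz, v_new):
    """Scale a reference dynamic power by (F_new/F_ref) * (V_new/V_ref)^2.

    Assumes the switching activity is unchanged between the two points.
    """
    return p_ref_w * (f_new_mhz / f_ref_mhz) * (v_new / v_ref) ** 2

# Reference point: about 2.9 W at (1.1 GHz, 1.15 V); target: (400 MHz, 0.90 V).
p_est = scaled_dynamic_power(2.9, 1100, 1.15, 400, 0.90)
saving = 1 - p_est / 2.9
print(f"{p_est:.2f} W, {saving:.0%} saving")
```

The estimate comes out at roughly 0.65 W, a saving of about 78%, which is close to the reported 0.6 W and 80%.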

### 4.2 Power Gating

Leakage power has become a major component of the total power consumption in CMOS design nowadays<sup>[12]</sup>. To work at a higher frequency some function units are implemented with numerous low-threshold devices. However, it leads to rapid growth of leakage power, which is especially inefficient under some idle circumstances. One way to overcome this is to gate their power supply when idle<sup>[13]</sup>.

In Godson2H, the media decoder is designed for 1080p high-definition video decoding, and the target frequency of this unit is at least 400 MHz. It is used only in some circumstances, so power gating is applied to it to save leakage power while idle.

A ring-based header switch power gating style is adopted in Godson2H. The gated area is about $2.9 \times 3.9 \,\mathrm{mm^2}$ and contains about 1 million general purpose low threshold standard cells. When running at 400 MHz and 1.10 V, its power consumption is about 1.60 W, of which 0.42 W is leakage power, in the typical corner. There are about 17K header switch cells with one switch control cell. The switch control cell follows commands generated by the PMU and has 2-bit control signals to manage current consumption during the switch-on transient phase. The switch cells and the switch control cell are designed in the low power process with the highest threshold voltage, so that they add little extra leakage power. Isolation cells are needed to keep the logic outside the power gating area correct when this block is gated.

When the media decoder does not need to work, the PMU sends a shut-down signal to the switch control cell, and the control cell then signals all the switch cells to shut down the power supply. When the media decoder needs to work, the PMU sends an "open" signal to the switch control cell, and all the switch cells are opened to restore the power supply. After all the switch cells are open, the switch control cell sends a "power ok" signal to the PMU, and the system can then load work into this unit.
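The handshake between the PMU and the switch control cell can be sketched as a small state machine. This is a toy behavioral model, not the actual circuit: the class and method names are hypothetical, the batch size passed to `step` merely stands in for the 2-bit current-limiting control, and the ordering of isolation relative to power-up is our assumption, since the paper only states that isolation cells are needed.

```python
from enum import Enum

class State(Enum):
    OFF = "off"
    POWERING_UP = "powering_up"
    ON = "on"

class SwitchControl:
    """Toy model of the switch control cell driving the header switches."""

    def __init__(self, n_switches):
        self.n_switches = n_switches
        self.open_switches = 0
        self.state = State.OFF
        self.isolation_enabled = True  # outputs clamped while the block is off

    def shut_down(self):
        # PMU "shut-down" command: isolate the block, then cut its supply.
        self.isolation_enabled = True
        self.open_switches = 0
        self.state = State.OFF

    def open(self):
        # PMU "open" command: begin turning the header switches on.
        if self.state is State.OFF:
            self.state = State.POWERING_UP

    def step(self, switches_per_step):
        """Turn on one batch of switches; return True once "power ok" is raised."""
        if self.state is State.POWERING_UP:
            self.open_switches = min(self.n_switches,
                                     self.open_switches + switches_per_step)
            if self.open_switches == self.n_switches:
                self.state = State.ON
                self.isolation_enabled = False  # release isolation after power ok
        return self.state is State.ON
```

Turning the switches on in batches rather than all at once is one way to bound the in-rush current: a smaller batch lowers the peak current at the cost of a longer switch-on time.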

The most important stage of the power gating mechanism is the switch-on phase. In-rush current should be carefully analyzed to make sure that the block can be powered on and the circuit will not be destroyed, as shown in Fig.5. It is observed that the peak current is about 30 mA and 1.1 ms is needed to switch on the whole unit. When the whole power gating frame is working at maximum workload, the IR drop for this module is only about 35.75 mV.

Fig.5. In-rush current analysis for power gating. The first line is the in-rush current from the power network to the power gating area; the second line is the switch signal sent to the switch cells by the switch control cell; the third line is the signal asserted by the switch control cell to show that the output voltage of the switch cells is high.

### 4.3 Aggressive Multi-Voltage Design

In modern SoC design, different blocks may have different performance objectives and timing constraints. High performance units need to run as fast as possible, which requires a relatively high supply voltage. For low-speed units a lower supply rail is sufficient, which means less dynamic and static power. Partitioning a design into, and combining, many different voltage domains is called multi-voltage design.

Godson2H uses a more aggressive multi-voltage design method. Within this chip there are four independent power supply regions, shown in Fig.6: the SoC domain, the node domain (DVFS area), the resume domain and the RTC domain. Most units work in the SoC domain, which normally runs at a 1.1 V supply. In the center of the SoC domain is the node power domain, whose supply can be scaled from 0.9 V to 1.1 V or higher. The SoC domain also contains the power gating region mentioned before. The SoC domain interfaces with the resume and RTC domains. The RTC domain is an always-on area working at 1.8 V and powered by an on-board battery. When the system is in stand-by mode, the whole SoC power domain can be shut down to save more power, and only the power supply of the resume domain is retained. The resume domain contains the PMU, GMAC and USB modules, which are used to wake up the system from Ethernet or USB devices. The PMU module is in charge of sending control signals to the PLL and the power gating switch control cells, as well as the isolation control signals.



Fig.6. Logic power domains of Godson2H.

The key issue in multi-voltage design is the multi-voltage rule check in both the design phase and the LVS check phase. Although EDA tools support multi-voltage rule checks based on unified power format (UPF) files, most checks need to be done manually to ensure both functional and implementation correctness.


Power dissipation in the different power modes of Godson2H is shown in Fig.7. There are six operating modes: full speed, DVFS of the CPU clock domain, power gating of the media decoder, the mix of DVFS and power gating, the resume mode (only the USB and GMAC modules are running and all other units are powered off), and the RTC mode (only the RTC domain is running and all other units are powered off). Modes are selected according to the application. Fig.7 shows that the total power consumption is about 6.7 W with a 1.1 V power supply in full speed mode, and that the lowest power consumption is no more than 12 mW in RTC mode. In the resume mode the total power dissipation is about 103 mW, which promises ultra-low stand-by power.



Fig.7. Power dissipation at different modes. TDP means that the whole system is running at full speed. RSM is an abbreviation for the resume mode.
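The span between the operating modes is easier to appreciate as a fraction of full-speed power. The short sketch below uses only the three figures quoted in the text; the intermediate modes (DVFS, power gating, and their mix) have no quoted totals and are deliberately left out rather than guessed. The dictionary keys and function name are ours.

```python
# Power figures quoted in the text, in watts.
MODE_POWER_W = {
    "full_speed": 6.7,
    "resume": 0.103,
    "rtc": 0.012,  # upper bound: "no more than 12 mW"
}

def fraction_of_full_speed(mode):
    """Return a mode's power as a fraction of full-speed power."""
    return MODE_POWER_W[mode] / MODE_POWER_W["full_speed"]

for mode in ("resume", "rtc"):
    print(f"{mode}: {fraction_of_full_speed(mode):.2%} of full-speed power")
```

The resume mode draws roughly 1.5% of the full-speed power, and the RTC mode less than 0.2%.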

## 5 Enhanced Hierarchical Design Flow

The transistor count of today's largest IC designs is over 2 billion, and hierarchical physical design methods are adopted to overcome design complexity and deep submicron effects<sup>[14-15]</sup>. Such a design flow was used in the design of previous Godson series chips<sup>[5,16]</sup>, but in Godson2H the OCV tolerant techniques and low power constraints lead to more design challenges. A hierarchical design structure like that of [16] is chosen, but great enhancements are made in different phases.

### 5.1 Physical RTL Design

Logical RTL is organized mainly by the functional definitions of different blocks, with little consideration of physical design information such as floorplan, feedthroughs and power domains. To fit physical implementation, a suite of physical RTL is designed first, which includes clock distribution schemes, chip packaging requirements, low power structures and DFT issues<sup>[16]</sup>. For example, combinational logic in critical timing paths is constructed manually for performance and for the regularity needed to tolerate OCV. Netlists of some clock generation blocks are written and verified first, and then integrated into the physical RTL.

Another part of the physical RTL in Godson2H is the integration of low power techniques. Most current multi-voltage design tools can only handle RTL whose structure exactly matches the power domains, so the hierarchy of the physical RTL must be arranged to suit power domain design rule checks.

To balance the high performance and low power requirements, LP and GP mixed design styles are used in the core area, but these two processes have different variation characteristics. It is therefore better to keep clean timing boundaries between modules using different processes. By design, all modules using the LP and GP processes communicate through asynchronous FIFOs (first-in-first-out), but the pointer and data flip-flops in each FIFO still need to be divided according to their fan-in logic.

### 5.2 Floorplan and Partition

In this phase designers need to arrange the locations of sub-modules according to their logical and physical relationships. Both timing and power constraints must be taken into consideration. Criteria for hierarchical timing closure include clear logical connections between different blocks, so that critical paths stay inside the blocks.

For Godson2H one of the most important requirements is that cells in the same power domain be placed together to ease the power network design. If modules belonging to a common power domain are not adjacent, the feedthroughs between these modules must be placed in the same voltage area or routed through an always-on area. To keep the clock signals in good quality, clock feedthroughs should be placed outside the DVFS area, especially for high frequency clocks.

After this design phase, the module design information, including shape, area, rows, tracks and power network, is partitioned to the sub-modules.

### 5.3 Module and TOP Design

To fit different design styles, each sub-module has its own customized cell-based automatic placement and routing flow. For some timing critical parts of the chip, such as the Regfiles in the CPU core, full-custom design methods are chosen; for some regular logic, such as feedthroughs or differential signals, a cell-based semi-custom design method is adopted, in which standard cells are placed and routed by in-house tools; for other less complex and timing noncritical modules, automatic synthesis, placement and routing is sufficient for performance. Each module has an RTL-to-GDS flow and must be clean in both its inner timing paths and the DRC&LVS check before it is committed to the TOP design.

In the TOP design flow all modules are combined for full-chip analysis. To reduce the runtime of whole-chip timing analysis, a hierarchical signoff flow is used. First, wire parasitics are extracted in parallel for each submodule and then fed into a flat timing analysis flow. After careful partitioning and a few iterations of characterization of the submodules' timing boundaries, only a few critical paths exist between modules in the TOP design. Another key point is the multi-voltage design rule check in the TOP design, because the TOP design's voltage area information is not available during the design phase of some submodules. A multi-voltage check on the flat netlist is done at the TOP design with tools and customized scripts. Finally, the TOP design must pass the DRC&LVS check, which is less troublesome when all submodules are DRC&LVS clean and only their interfaces need to be fixed.
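The two-step shape of the hierarchical signoff flow, per-submodule work in parallel followed by a single flat step, can be sketched as follows. Everything here is a placeholder: `extract_parasitics` and `flat_timing_analysis` are hypothetical stand-ins for real EDA tool invocations, and the module records are toy data.

```python
from concurrent.futures import ThreadPoolExecutor

def extract_parasitics(module):
    """Placeholder for per-module extraction; returns (module name, net count)."""
    return module["name"], len(module["nets"])

def flat_timing_analysis(extracted):
    """Placeholder for the flat analysis step: merges the per-module results."""
    return {
        "modules": sorted(name for name, _ in extracted),
        "total_nets": sum(count for _, count in extracted),
    }

def hierarchical_signoff(modules, workers=4):
    # Step 1: extract each submodule independently, in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        extracted = list(pool.map(extract_parasitics, modules))
    # Step 2: feed all per-module results into one flat analysis.
    return flat_timing_analysis(extracted)

# Toy usage with hypothetical module records.
report = hierarchical_signoff([
    {"name": "gs464", "nets": ["n1", "n2", "n3"]},
    {"name": "xbar", "nets": ["n4"]},
])
print(report)
```

Because the submodules do not depend on each other during extraction, the first step parallelizes cleanly, and only the final merge has to see the whole chip.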

## 6 Conclusions

Godson2H is a complex SoC of the Godson series. To deal with on-chip-variation in deep submicron design, different OCV tolerant design styles and circuits were used in critical parts of this chip, such as the PLL, the clock tree and the PVT detectors. To keep power flexible for various applications, a combination of most low power methods was introduced, so that Godson2H can work at a highest power dissipation of 6.7 W for maximum performance and a lowest power of 11 mW during idle. Both requirements add design complexity, so an enhanced hierarchical design flow was chosen to achieve design closure within two months. Fig.8 shows Godson2H's die photo.

Now Godson2H has been packaged. All function units run well and the low power management methods are being tested. Fig.9 shows the chip's picture.



Fig.8. Die photo of Godson2H.

Fig.9. Photo of Godson2H.

## References

- [1] Borkar S, Karnik T, Narendra S, Tschanz J, Keshavarzi A, De V. Parameter variations and impact on circuits and microarchitecture. In Proc. the 40th Annual Design Automation Conference, Jun. 2003, pp.338-342.

- [2] Karnik T, Borkar S, De V. Sub-90nm technologies: Challenges and opportunities for CAD. In Proc. the 2002 IEEE/ACM Int. Conf. Computer-Aided Design, Nov. 2002, pp.203-206.
- [3] Hu W, Wang R, Chen Y, Fan B, Zhong S, Gao X, Qi Z, Yang X. Godson-3B: A 1 GHz 40 W 8-core 128GFLOPS processor in 65 nm CMOS. In Digest of Technical Papers of 2011 Int. Solid-State Circuits Conference (ISSCC), Feb. 2011, pp.76-78.
- [4] Hu W, Wang J, Gao X, Chen Y, Liu Q, Li G. Godson-3: A scalable multicore RISC processor with X86 emulation. IEEE Micro, 2009, 29(2): 17-29.
- [5] Fan B, Yang L, Gao Z, Zhang F, Wang R. The implementation and design methodology of a quad-core version Godson-3 microprocessor. In Proc. the 52nd IEEE Int. Midwest Symp. Circuits and Systems, Aug. 2009, pp.1167-1170.
- [6] Nassif S R. Within-chip variability analysis. In Technical Digest of Int. Electron Devices Meeting, Dec. 1998, pp.283-286.
- [7] Zuchowski P S, Habitz P A, Hayes J D, Oppold J H. Process and environmental variation impacts on ASIC timing. In Proc. the 2004 IEEE/ACM International Conference on Computer-Aided Design, Nov. 2004, pp.336-342.
- [8] Luo J, Sinha S, Su Q et al. An IC manufacturing yield model considering intra-die variations. In Proc. the 43rd Annual Design Automation Conf., Jul. 2006, pp.749-754.
- [9] Keating M, Flynn D, Aitken R, Gibbons A, Shi K. Low Power Methodology Manual: For System-on-Chip Design. Springer Publishing Company, Incorporated, 2007.
- [10] Floyd M S, Ghiasi S, Keller T W et al. System power management support in the IBM POWER6 microprocessor. IBM Journal of Research and Development, 2007, 51(6): 733-746.
- [11] Fan Q, Zhang G, Hu W. A synchronized variable frequency clock scheme in chip multiprocessors. In Proc. IEEE Int. Symp. Circuits and Systems, May 2008, pp.3410-3413.
- [12] De V, Borkar S. Technology and design challenges for low power and high performance. In Proc. the 1999 Int. Symp. Low Power Electronics and Design, Aug. 1999, pp.163-168.
- [13] Kosonocky S V, Bhavnagarwala A J, Chin K et al. Low-power circuits and technology for wireless digital systems. IBM Journal of Research and Development, 2003, 47(2/3): 283-298.
- [14] Dai W J. Hierarchical physical design methodology for multimillion gate chips. In Proc. International Symposium on Physical Design, Apr. 2001, pp.179-181.
- [15] Cong J. Timing closure based on physical hierarchy. In Proc. the 2002 Int. Symp. Physical Design, Apr. 2002, pp.170-174.
- [16] Wang R, Fan B X, Yang L et al. Physical implementation of the eight-core Godson-3B microprocessor. Journal of Computer Science and Technology, 2011, 26(3): 520-527.













**Bin Xiao** received the B.S. degree in electronic and electric from Peking University in 2006. He is currently a Ph.D. candidate at the Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS), Beijing. His research interests include deep submicron physical design.

**Yi-Fu Zhang** received his B.S. degree in microelectronics from Peking University in 2007. He is currently a Ph.D. candidate at ICT, CAS. His research interests include chip power analysis and low power techniques for high performance processors.

**Yan-Ping Gao** joined the Loongson team in 2002. She is now a Ph.D. candidate at ICT, CAS. Her research interests include low power design methodology and key low power techniques, and asynchronous circuits and systems.

**Liang Yang** received the Ph.D. degree in computer architecture in 2010 from ICT, CAS. He now works at Loongson Technology Corporation Limited, Beijing. His current research interests include clock distribution networks, interconnect modeling, and high performance and low power design.

**Dong-Mei Wu** received her B.S. and M.S. degrees both in microelectronics from Peking University in 2007 and 2010 respectively. She is currently a physical design engineer in ICT, CAS. Her research interest is deep submicron physical design.

**Bao-Xia Fan** received the Ph.D. degree in computer architecture in 2010 from ICT, CAS. Now he works in Loongson Technology Corporation Limited, Beijing. His research interests include on-chip-variation analysis and optimization, design methodology for high performance processor.