Hu WW, Gao YP, Chen TS *et al.* The Godson processors: Its research, development, and contributions. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 26(3): 363–372 May 2011. DOI 10.1007/s11390-011-1139-2

# The Godson Processors: Its Research, Development, and Contributions

Wei-Wu Hu<sup>1,2</sup> (胡伟武), Senior Member, CCF, Yan-Ping Gao<sup>1,2,3</sup> (高燕萍), Member, CCF, Tian-Shi Chen<sup>1,2</sup> (陈天石), and Jun-Hua Xiao<sup>1,2</sup> (肖俊华), Member, CCF

<sup>1</sup>Key Laboratory of Computer System and Architecture, Chinese Academy of Sciences, Beijing 100190, China

<sup>2</sup>Loongson Technologies Corporation Limited, Beijing 100190, China

<sup>3</sup>Graduate University of Chinese Academy of Sciences, Beijing 100190, China

E-mail: {hww, athene, chentianshi, xiaojunhua}@ict.ac.cn

Received December 20, 2010; revised March 9, 2011.

**Abstract** The Godson project with an R&D history of 10 years is an independent national program of China that aims at developing advanced microprocessor technologies based on fundamental research and commercialization of the chip technology. We will give a comprehensive presentation of the Godson project, including its history, technical roadmaps, and several unique technical merits.

Keywords IT industry, CPU research and development, Godson microprocessor, XPU, system on chip

## 1 Introduction

Microprocessor technology is one of the key technologies in the IT industry and plays an important role in economic development and national security. Over the past decade, China has made great progress in developing advanced microprocessors that meet the requirements of both civilian and security applications. Benefiting from intensive investment by the Chinese government during the 10th and 11th Five-Year Plans (2001~2010), Chinese researchers have developed several series of microprocessors, including the Godson processors designed by the Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS)<sup>[1-6]</sup>, the UniCore/PKUnity processors designed by Peking University<sup>[7]</sup>, the YHFT processors designed by the National University of Defense Technology<sup>[8]</sup>, and the Shenwei processors developed by the National High Performance IC (Shanghai) Design Center<sup>[9]</sup>. China has thus become one of the few countries capable of designing general-purpose microprocessors. In the 12th and 13th Five-Year Plans (2010~2020), the Chinese government will continue to offer strong support for CPU development, with the long-term goal of building a new ecosystem for the Chinese IT industry on the solid foundation of mature CPU technologies.

Boosted by recent advances in microprocessor technology, various areas of the Chinese IT industry, such as high-performance computing, cloud computing, computer networks, operating systems, and desktop applications, have been developing rapidly. In the meantime, developments in these areas also contribute to the further enhancement of microprocessor technology and to the market expansion of new products. Under these circumstances, the research and development of advanced microprocessor technology, traditionally considered a purely technical task, should also take applications and marketing into account.

The Godson project is a CPU research and development project working at the cutting edge of CPU technology, applications, and market expansion. Since the project was launched by ICT, it has aimed to establish the foundation of an independent IT industry based on domestically designed CPUs. After ten years of development, the Godson project has become one of the best-known CPU projects in China. To share Godson's experience and shed some light on the future development of Chinese CPU projects, this article briefly reviews the history of the Godson project and introduces several technical merits of Godson CPUs.

The rest of this article is organized as follows. Section 2 briefly reviews the ten-year history of the Godson project. Section 3 elaborates on the technical roadmap of the Godson project against the background of microprocessor industrialization and market expansion. To offer an in-depth understanding of the status of the Godson project, Section 4 presents some key technical features of the state-of-the-art Godson processors. Section 5 discusses future work.

Regular Paper

Supported by the National Natural Science Foundation of China under Grant Nos. 60736012 and 60673146, the National High Technology Research and Development 863 Program of China under Grant Nos. 2008AA110901 and 2007AA01Z114, and the National Basic Research 973 Program of China under Grant No. 2005CB321600.

©2011 Springer Science + Business Media, LLC & Science Press, China

## 2 Review of Godson History

The Godson project was initiated by ICT in 2001. In 2002, the project released the Godson-1 CPU<sup>[1]</sup>, the first 32-bit general-purpose microprocessor in China. In 2003, the Godson team developed the Godson-2B CPU<sup>[2]</sup>, the first 64-bit general-purpose microprocessor in China. On the basis of the Godson-2B design, Godson-2C and Godson-2E were developed in 2004 and 2005, respectively. Each of Godson-2B, Godson-2C, and Godson-2E triples the performance of its predecessor. According to performance evaluations, Godson-2E scores higher than 500 on both SPECint2000 and SPECfp2000.

Building on the technology accumulated during the first five years with Godson-1, Godson-2B, Godson-2C, and Godson-2E, the Godson project moved into production in 2006. Godson-2F, released in 2008, is a product version of Godson-2E. It is a MIPS III-compatible CPU fabricated in a 90 nm CMOS process that dissipates only 3~5 W at 1.0 GHz. In 2010, Godson-3A, the first multi-core CPU product in China, was successfully fabricated in a 65 nm CMOS process<sup>[3,5]</sup>. Godson-3A is a 64-bit MIPS64-compatible CPU consisting of four GS464 cores. It adopts a scalable multi-core architecture with hardware support for accelerating applications, including x86 emulation.

Since 2006, the Godson team has taken a number of steps toward the industrialization and market trials of Godson technologies. In 2006, the team set up the Lemote Technology Limited Corporation in Changshu, Jiangsu, China, a company that focuses on providing customers with low-cost computers and solutions based on Godson CPUs. In 2008, Loongson Technology Corporation Limited was set up in Beijing, China. Its mission is to turn Godson technology into products and to provide CPU products for the market. It will also provide solutions as well as design and upgrade services for partners based on Godson technology.

The development of Godson processors has also inspired a number of significant contributions in the research community. Various novel ideas drawn from industrial practice, such as eXtra Processing Units (XPU), x86 emulation, and design for debug, have been published in premier academic journals and conferences such as IEEE Micro, ISSCC, Hot Chips, ISCA, and HPCA<sup>[3-4,6,10-11]</sup>. As evidenced by these high-impact publications, the Godson project has gained a worldwide reputation in both industry and academia.

To sum up, after ten years of effort, the Godson project has established a mechanism of interactive innovation across academic research, development, and industrialization.

## 3 Technical Merits

## 3.1 Overview

In the past decade, Godson processors have evolved from single-issue to superscalar architectures, from single-core to multi-core designs, and from experimental prototypes to mass-produced industrial products. Currently, the Godson project has three product series that cover a wide range of application fields. The first is Godson-3<sup>[3]</sup>, which has a scalable multi-core architecture and mainly targets high-throughput data center applications and high-performance scientific computing with reduced power consumption. The second is Godson-2<sup>[2]</sup>, which targets personal computers and high-end embedded applications. The third, Godson-1, focuses on consumer electronics such as digital TV.

Of these three series, Godson-2 and Godson-3 integrate the four-issue, 64-bit GS464 CPU core, while Godson-1 integrates the single- or dual-issue 32-bit GS232 CPU core. Godson-1 and Godson-2 adopt an SoC (System on Chip) architecture that integrates a CPU, GPU, media processor, memory controller, and a rich set of peripherals such as PCIE, USB, GMAC, and SATA. Godson-3 adopts a multi-core architecture that connects multiple GS464 cores through a scalable NoC (Network on Chip), together with the required interfaces such as memory controllers and HyperTransport controllers.

## 3.2 GS464 and GS232 Cores

GS464 is a 64-bit processor core used in both the Godson-2 and Godson-3 series. As illustrated in Fig.1, GS464 contains two fixed-point functional units and two floating-point units. In addition to traditional arithmetic operations such as addition and subtraction, the first fixed-point arithmetic logic unit (ALU1) can execute trap, conditional move, and branch instructions; the second one (ALU2) can execute multiplication and division instructions. The first floating-point unit (FALU1) executes all floating-point instructions; the second floating-point unit (FALU2) executes floating-point addition, subtraction, and multiplication instructions.

Fig.1. GS464 architecture.

Fig.2. GS232 architecture.

The four-issue superscalar execution mechanism of GS464 places very high demands on resolving inter-instruction dependencies and on supplying instructions and data. Thus, out-of-order execution and an aggressive cache design are employed to improve pipeline efficiency. The out-of-order execution scheme of GS464 combines register renaming, dynamic instruction scheduling, and branch prediction. Concretely, a 64-entry physical register file is used for fixed-point and floating-point register mapping to remove WAR (write-after-read) and WAW (write-after-write) dependencies. GS464 contains a 16-entry fixed-point reservation station and a 16-entry floating-point reservation station to issue instructions out of order. To guarantee that instructions are committed in program order, a 64-entry reorder buffer is employed. In addition, GS464 uses a 16-entry branch target buffer, an 80 Kbyte entry branch history table, a 9-bit global history register, and a 4-entry return address stack to implement branch prediction.
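The way register renaming removes WAR and WAW dependencies can be illustrated with a minimal sketch. It is not the GS464 implementation: the class and function names are hypothetical, it assumes 32 architectural registers mapped onto a single pool of 64 physical registers, and it omits the step that returns physical registers to the free list at commit.

```python
from collections import deque

NUM_ARCH_REGS = 32   # assumed number of architectural registers
NUM_PHYS_REGS = 64   # size of the physical register file quoted in the text


class RenameTable:
    """Toy renamer: every destination register gets a fresh physical register."""

    def __init__(self):
        # Initially architectural register i lives in physical register i.
        self.mapping = list(range(NUM_ARCH_REGS))
        self.free_list = deque(range(NUM_ARCH_REGS, NUM_PHYS_REGS))

    def rename(self, dest, srcs):
        """Return (physical_dest, physical_srcs) for one instruction."""
        # Sources are looked up before the destination mapping is updated.
        phys_srcs = [self.mapping[s] for s in srcs]
        if not self.free_list:
            raise RuntimeError("no free physical register: rename must stall")
        phys_dest = self.free_list.popleft()
        self.mapping[dest] = phys_dest
        return phys_dest, phys_srcs


def demo():
    table = RenameTable()
    program = [
        (1, [2, 3]),  # I0: r1 = r2 op r3
        (4, [1, 5]),  # I1: r4 = r1 op r5  (true RAW dependency on I0, kept)
        (1, [6, 7]),  # I2: r1 = r6 op r7  (WAW with I0, WAR with I1)
    ]
    return [table.rename(dest, srcs) for dest, srcs in program]


if __name__ == "__main__":
    for phys_dest, phys_srcs in demo():
        print(f"p{phys_dest} <- {['p%d' % s for s in phys_srcs]}")
```

In this sketch I0 and I2 both write architectural register r1 but receive different physical destinations, so I2 no longer has to wait for I0 to write or for I1 to read; only the true dependency of I1 on I0 remains.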

The memory subsystem of GS464 supports 64-bit virtual addresses and 48-bit physical addresses, and can access a 128-bit quad word in one cycle. GS464 has a 64-Kbyte L1 instruction cache and a 64-Kbyte L1 data cache, both of which are four-way set associative. The fully associative translation look-aside buffer (TLB) has 64 entries, each of which can map an odd page and an even page. GS464 also has a 24-entry memory-access queue.

GS232 is mainly used in the Godson-1 series of products. It is a two-way superscalar core that implements the MIPS32-compatible instruction set architecture and its DSP extension; its architecture is illustrated in Fig.2. GS232 adopts many superscalar techniques, such as register renaming, dynamic scheduling, and branch prediction. It has two fixed-point functional units, one floating-point unit, and one memory-access unit. The floating-point unit can be extended to 256 bits to support SIMD media processing instructions. GS232 also implements the EJTAG standard for debugging. Built upon GS232, Godson-1A and 1B primarily target consumer electronics applications.

## 3.3 Godson-2 Processors: From CPU to SoC

Godson-2 processors are a family of 64-bit processors for applications such as desktop PCs and various embedded systems. The latest members of the Godson-2 series are Godson-2F, 2G, and 2H. Godson-2F is a 64-bit MIPS III-compatible general-purpose CPU manufactured in a 90 nm process, which consumes only 3~5 W at 1 GHz. Godson-2F scores higher than 500 on both SPECint2000 and SPECfp2000. It has been produced for low-cost PCs and embedded applications.

Over the years, Godson-2 processors have evolved from CPUs to Systems-on-Chip (SoCs). Godson-2G is a two-chip solution that primarily consists of a GS464 core, a HyperTransport (HT) controller, a PCI/PCIX controller, a DDR2/DDR3 controller, and low-speed interfaces such as LPC, SPI, and UART. A south bridge, with either an HT or a PCI interface, is assumed when building a complete system. Godson-2H integrates many more functions than Godson-2G, such as PCIE/SATA/USB interfaces, 3D graphics, dual display, audio, and an HD media decoder, and provides a single-chip solution for low-cost PCs. The Godson-2G and 2H SoCs are built around a two-level 128-bit AXI interconnect, as indicated in Fig.3. The level-one switch is extended with cache coherence support and connects the processor core, the level 2 caches, and other coherence-capable I/O devices. The level-two switch connects to the memory controller and other I/O devices.

Godson-2 processors can be adopted in low-cost desktop PCs and various embedded applications. For low-cost desktop PCs, Godson-2F has been mass-produced. Recently, more than 150 000 PCs with Godson-2F have been purchased by elementary schools in Jiangsu Province, China. For embedded applications, the Godson-2 series is much cheaper than competing products. To sum up, the above technical roadmap aims at minimizing the cost of desktop PCs and various embedded applications, and thus improving the market competitiveness of low-end Godson products.

## 3.4 Godson-3 Processors

Godson-3 is a series of multi-core processors designed specifically for various high-end applications. Godson-3 uses a scalable mesh-of-crossbars on-chip network topology to support up to 64 cores. Fig.4 shows the overall architecture of Godson-3. The 2D mesh network connects 16 nodes; each node includes an $8 \times 8$ crossbar (X1) that connects 4 processor cores as masters and 4 shared L2 cache banks as slaves. A second-level crossbar (X2) in each node connects the DDR2/DDR3 memory controllers to the L2 cache banks. In addition, the HT I/O controllers are connected at the boundary of the 2D mesh.

Fig.3. Illustrations of Godson-2G and 2H SoCs. (a) Godson-2G. (b) Godson-2H.

Fig.4. General network topology of Godson-3 series.

As the first member of the Godson-3 series, Godson-3A consists of four GS464 cores and is manufactured in a seven-metal 65 nm CMOS process. The chip contains 425 million transistors, and the die measures 14 240 micrometers by 12 205 micrometers. It can achieve a peak performance of 16 GFlops (double-precision). Its power dissipation is less than 15 W at a typical working frequency of 1 GHz. Godson-3A features a scalable architecture, high reliability, low power consumption, and hardware-supported x86 binary translation.

To accommodate high-throughput computing and high-density computing in digital signal processing, the second member of the Godson-3 series, Godson-3B, integrates 8 GS464V cores (a GS464 core with a $2 \times 256$-bit vector extension). It can therefore achieve 128 GFlops (double-precision) with about 40 W power dissipation. Like Godson-3A, Godson-3B integrates 2 DDR3 and 2 HyperTransport controllers. The chip consists of 583 million transistors with a die size of less than $300 \text{ mm}^2$. The main highlight of Godson-3B is that it has the highest ratio of peak performance to power consumption among state-of-the-art commercial microprocessors<sup>[4,6]</sup>.

The design of Godson-3C, which will adopt a 28 nm CMOS technology and more custom-designed modules, is also in progress. Its first tape-out is planned for 2012. Godson-3C integrates 16 GS464V cores, 4 DDR3 controllers, and 4 HT controllers. It aims at a peak performance of 384 and 512 GFlops (double-precision) at 1.5 GHz and 2.0 GHz, respectively.
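The peak figures quoted for Godson-3A, 3B, and 3C can be reproduced with a short calculation. The sketch below rests on assumptions that the text does not spell out: each core has two floating-point (or vector) units, each GS464 unit retires one double-precision multiply-add per cycle, each 256-bit GS464V unit retires four, a multiply-add counts as two floating-point operations, and Godson-3B runs at the 1 GHz given in the title of reference [4]. The function name is hypothetical.

```python
def peak_gflops(cores, fp_units_per_core, madds_per_unit_per_cycle, freq_ghz):
    """Peak double-precision GFlops, counting one multiply-add as 2 flops."""
    flops_per_cycle = cores * fp_units_per_core * madds_per_unit_per_cycle * 2
    return flops_per_cycle * freq_ghz


# (cores, FP units per core, DP multiply-adds per unit per cycle, GHz)
CHIPS = {
    "Godson-3A @ 1.0 GHz": (4, 2, 1, 1.0),
    "Godson-3B @ 1.0 GHz": (8, 2, 4, 1.0),
    "Godson-3C @ 1.5 GHz": (16, 2, 4, 1.5),
    "Godson-3C @ 2.0 GHz": (16, 2, 4, 2.0),
}

if __name__ == "__main__":
    for name, params in CHIPS.items():
        print(f"{name}: {peak_gflops(*params):.0f} GFlops")
```

Under these assumptions the sketch yields 16, 128, 384, and 512 GFlops, which agrees with the numbers stated above. Dividing the Godson-3B figure by its roughly 40 W power dissipation gives about 3.2 GFlops per watt.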

A notable advantage of the Godson-3 series is that its peak performance is comparable with that of mainstream commercial server processors. Moreover, because memory controllers and HT controllers are integrated on the chip, Godson-3 offers the high throughput required by high-throughput computing. In addition, the Godson-3 series is inexpensive in comparison with mainstream commercial multi-core processors. Last but not least, Godson-3 processors achieve a promising balance between performance and power dissipation, which makes them especially suitable for building a "green" supercomputer with lower power consumption. Because of these attractive features, the Godson-3B processor, the second member of the Godson-3 series, will be used in the design of the next-generation Dawning supercomputer<sup>[12]</sup>. It is expected that Godson-3B can further improve the peak performance of the Dawning supercomputer with reduced power consumption.

## 4 Technical Features

Godson CPUs have adopted a number of key techniques for enhancing performance and reducing energy consumption. Some of them are significant innovations in both industry and academia. This section lists some key technical features of Godson CPUs.

## 4.1 Scalable Architecture

The scalable architecture of Godson-3 is shown in Fig.4. Godson-3 adopts a scalable interconnection network which includes two layers. In the first layer, a mesh connects up to 16 nodes together. In the second layer, a crossbar connects four cores and four L2 cache banks inside the node with adjacent nodes in the four directions. I/O modules (HyperTransport) are connected to the interconnection network through the boundary ports of the mesh.

All ports on the Godson-3 interconnection network comply with the 128-bit AXI (AMBA 3.0) standard interface. The AXI protocol has five channels: the write address channel (AW), write data channel (W), write response channel (B), read address channel (AR), and read data channel (R). Because AXI is a standard, third-party IP can be easily plugged into the Godson-3 network. Furthermore, Godson extends the AXI protocol with several additional fields to support cache coherence between cores and I/O modules.

In the Godson-3 interconnection network, both routing and retransmission are performed by the crossbar that resides in each node. A crossbar mainly includes 8 AXI Master Link (AML) modules, 8 AXI Slave Link (ASL) modules, and an 8 × 8 multiplexing matrix that connects the AMLs and ASLs. Each link has all five channels (AW, W, B, AR, and R) of the AXI protocol. The AML routes AXI requests (AW, W, and AR) according to their address. The ASL routes AXI replies (R and B) according to their ID. Since both the AML and the ASL have two pipeline stages, the crossbar in the Godson-3 network has a latency of four cycles per hop.
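The four-cycle-per-hop figure can be turned into a rough one-way latency estimate across the mesh. The sketch below is only an illustration under several assumptions that are not stated in the text: the 16 nodes are arranged as a 4 × 4 grid, requests follow dimension-order (X then Y) routing, a request that stays inside its own node still crosses that node's crossbar once, and link delay, contention, and cache access time are ignored. All names are hypothetical.

```python
MESH_WIDTH = 4                                   # assumed 4 x 4 layout of the 16 nodes
AML_STAGES = 2                                   # pipeline stages in the master link
ASL_STAGES = 2                                   # pipeline stages in the slave link
CYCLES_PER_CROSSBAR = AML_STAGES + ASL_STAGES    # four cycles per hop


def node_xy(node_id):
    """Map a node id in 0..15 to (x, y) mesh coordinates."""
    return node_id % MESH_WIDTH, node_id // MESH_WIDTH


def xy_route(src, dst):
    """List the nodes whose crossbar a request traverses, X first, then Y."""
    x, y = node_xy(src)
    dst_x, dst_y = node_xy(dst)
    path = [src]
    while x != dst_x:
        x += 1 if dst_x > x else -1
        path.append(y * MESH_WIDTH + x)
    while y != dst_y:
        y += 1 if dst_y > y else -1
        path.append(y * MESH_WIDTH + x)
    return path


def crossbar_latency(src, dst):
    """One-way crossbar latency in cycles for a request from src to dst."""
    return len(xy_route(src, dst)) * CYCLES_PER_CROSSBAR


if __name__ == "__main__":
    print(xy_route(0, 15), crossbar_latency(0, 15))
    print(xy_route(5, 5), crossbar_latency(5, 5))
```

With these assumptions, a request to an L2 bank in the local node costs 4 cycles in the crossbar, while a corner-to-corner request crosses 7 crossbars and costs 28 cycles one way.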

The cache coherence protocol of Godson-3 supports both intra-chip and inter-chip cache coherence. Godson-3 adopts a directory-based cache coherence protocol. The home L2 cache bank of each memory address is fixed, and a bit vector is maintained in the L2 cache bank to record which L1 caches (both data and instruction caches) hold a copy of the block. Godson extends the standard HT protocol to transfer cache coherence information across chips. A cache-coherent AXI packet is packed into HT transactions by the HT controller on the sending chip; the HT transactions are transferred to the other chip through the HT link; and finally they are converted back into a cache-coherent AXI packet by the HT controller on the receiving chip. Large Cache-Coherent Non-Uniform Memory Access (CC-NUMA) systems can be built with Godson-3 through the hierarchical directory extension technology. The bit vector directory in each L2 cache block can treat an HT controller as a *virtual* core, and the HT controller can act as the agent of other chips for remote memory access requests.
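A bit-vector directory of the kind described above can be sketched in a few lines. The bit layout here is an assumption made for illustration (two bits per core for its data and instruction caches, plus one extra bit for an HT controller acting as a virtual core), and the invalidate-on-write policy is the textbook behavior of a directory protocol rather than a confirmed detail of Godson-3. All names are hypothetical.

```python
NUM_CORES = 4
HT_AGENT = 2 * NUM_CORES          # assumed bit index for the HT "virtual core"


def dcache(core):
    """Assumed bit index of a core's L1 data cache."""
    return 2 * core


def icache(core):
    """Assumed bit index of a core's L1 instruction cache."""
    return 2 * core + 1


class DirectoryEntry:
    """Sharer bit vector kept for one cache block at its home L2 bank."""

    def __init__(self):
        self.sharers = 0          # bit i set => cache i holds a copy

    def add_sharer(self, cache_id):
        self.sharers |= 1 << cache_id

    def sharer_ids(self):
        return [i for i in range(self.sharers.bit_length())
                if (self.sharers >> i) & 1]

    def write_by(self, cache_id):
        """Return the caches to invalidate and leave the writer as sole holder."""
        to_invalidate = [i for i in self.sharer_ids() if i != cache_id]
        self.sharers = 1 << cache_id
        return to_invalidate


if __name__ == "__main__":
    entry = DirectoryEntry()
    entry.add_sharer(dcache(0))
    entry.add_sharer(dcache(2))
    entry.add_sharer(HT_AGENT)            # a remote chip holds a copy
    print(entry.write_by(dcache(0)))      # invalidations go to cache 4 and the HT agent
```

Treating the HT controller as one more bit in the vector means the home bank does not need to know which remote chips share the block; in this sketch it simply sends the invalidation to the HT agent, which would be responsible for forwarding it.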

## 4.2 eXtra Processing Unit (XPU)

The latest member of the Godson-3 series, Godson-3B, consists of 8 GS464 XPUs. The GS464 XPU is also named GS464V in the Godson project; its architecture is illustrated in Fig.5. GS464V is MIPS64-compatible while providing 300 additional instructions for the 256-bit vector extension. It extends each of the two floating-point units in GS464 to a 256-bit vector unit, and the 32-entry 64-bit floating-point register file to a 128-entry 256-bit vector register file. The 28-entry vector queue can dispatch two vector instructions out of order to the two 256-bit vector processing units (VPUs). Each VPU can perform 4 double-precision floating-point multiply-and-add (MADD) operations or 8 single-precision floating-point MADD operations simultaneously, or at most 32 fixed-point operations. Furthermore, to maintain compatibility with MIPS64, each VPU can also perform one MIPS64 floating-point instruction per clock cycle.
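The per-VPU operation counts follow directly from dividing the 256-bit vector width by the element width. The sketch below checks this arithmetic; reading the figure of 32 fixed-point operations as 8-bit elements is an inference, not something the text states, and the function names are hypothetical. It also computes the capacity of a 128-entry, 256-bit vector register file.

```python
VECTOR_BITS = 256     # width of one vector processing unit
VECTOR_REGS = 128     # entries in the vector register file


def lanes(element_bits):
    """Number of elements of the given width that fit in one vector."""
    return VECTOR_BITS // element_bits


def register_file_bytes():
    """Total capacity of the vector register file in bytes."""
    return VECTOR_REGS * VECTOR_BITS // 8


if __name__ == "__main__":
    print("double-precision lanes:", lanes(64))
    print("single-precision lanes:", lanes(32))
    print("8-bit fixed-point lanes:", lanes(8))
    print("register file bytes:", register_file_bytes())
```

The results are 4, 8, and 32 lanes, matching the operation counts given above, and 4096 bytes, which matches the 4 KB capacity the paper quotes for the vector register file.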

To take full advantage of the vector computing capability, GS464V defines a vector instruction set that considers the requirements of important applications (e.g., scientific computing, signal processing, and multimedia) as well as those of the vector compiler. It is worth noting that the vector instruction set of GS464V includes a series of shuffle-computation mixed instructions to reduce the number of shuffle instructions. Traditionally, vector programs may include many shuffle instructions to reorganize the data residing in vector registers. Many previous processors with vector units therefore adopt a dedicated shuffle unit, which consumes superscalar issue width and incurs additional power consumption. The shuffle-computation mixed instructions of GS464V eliminate the need for a dedicated shuffle unit (although the VPU of GS464V can still perform shuffle operations). Hence both the number of instructions and the power consumption are reduced.
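The instruction-count saving of a shuffle-computation mixed instruction can be illustrated with a toy model. The actual GS464V instruction semantics are not given in the text, so the three operations below (`shuffle`, `vadd`, and the fused `vadd_shuffled`) are hypothetical stand-ins that operate on plain Python lists.

```python
def shuffle(vec, pattern):
    """Standalone shuffle: element i of the result is vec[pattern[i]]."""
    return [vec[i] for i in pattern]


def vadd(a, b):
    """Element-wise vector addition."""
    return [x + y for x, y in zip(a, b)]


def vadd_shuffled(a, b, pattern):
    """Hypothetical fused operation: add a to a permuted view of b."""
    return [x + b[i] for x, i in zip(a, pattern)]


def separate(a, b, pattern):
    """Two instructions: one shuffle followed by one add."""
    return vadd(a, shuffle(b, pattern)), 2


def fused(a, b, pattern):
    """One instruction: the permutation is folded into the add."""
    return vadd_shuffled(a, b, pattern), 1


if __name__ == "__main__":
    a = [1, 2, 3, 4]
    b = [10, 20, 30, 40]
    swap_pairs = [1, 0, 3, 2]
    print(separate(a, b, swap_pairs))
    print(fused(a, b, swap_pairs))
```

Both paths produce the same vector, but the fused form issues one instruction instead of two and needs no temporary register for the shuffled operand, which is the kind of saving the paragraph above describes.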

The computing capability of the vector extension cannot be fully exploited without a corresponding increase in memory bandwidth. On the one hand, to relieve the pressure on memory bandwidth by exploiting spatial locality, GS464V employs a 128-entry 256-bit vector register file with 4 write ports and 8 read ports, which can hold 4 KB of data. On the other hand, GS464V employs an additional programmable memory access coprocessor that is independent of the normal memory unit. As shown in Fig.6, non-reusable data can be transferred automatically and directly between the vector register file and the L2 cache/memory controller by the memory access coprocessor, while reusable data can reside in the L1 data cache and be accessed by normal load/store instructions. At the same time, data reorganization is performed automatically on the path between the vector registers and the L2 cache/memory so as to provide data in the format required by the VPU.

Fig.5. GS464V architecture.

Fig.6. Data reorganization on data path.

## 4.3 x86 Emulation

Due to the significant differences between x86 and RISC, traditional commercial RISC processors do not provide dedicated support for x86 emulation. Since some x86 features are not present in MIPS (e.g., the floating-point register stack, EFlags, and segment addressing modes), translating x86 binaries to MIPS binaries purely in software is inefficient. GS464 provides hardware support for the x86 binary translator to enable smooth translation from x86 binaries to MIPS binaries<sup>[3]</sup>. Specifically, GS464 defines new instructions and runtime environments via the MIPS64 user-defined interface (UDI) to bridge the semantic gap between the x86 ISA and the MIPS64 ISA. More than 200 instructions are defined at limited hardware cost. As a consequence, the number of translated instructions can be significantly reduced.
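To see why EFlags alone creates a semantic gap, consider what a pure software translator has to compute for a single 32-bit x86 `ADD`. MIPS has no flags register, so each status bit that a later instruction may read must be derived with extra operations. The sketch below models the standard x86 flag definitions in Python; it says nothing about how the GS464 extension instructions are encoded or implemented, and the function name is hypothetical.

```python
MASK32 = 0xFFFFFFFF


def x86_add32_flags(a, b):
    """Return (result, flags) for a 32-bit x86 ADD of a and b."""
    a &= MASK32
    b &= MASK32
    full = a + b
    res = full & MASK32
    flags = {
        "CF": int(full > MASK32),                         # unsigned carry out
        "PF": int(bin(res & 0xFF).count("1") % 2 == 0),   # even parity of low byte
        "AF": ((a ^ b ^ res) >> 4) & 1,                   # carry out of bit 3
        "ZF": int(res == 0),                              # result is zero
        "SF": res >> 31,                                  # sign bit of the result
        "OF": (((a ^ res) & (b ^ res)) >> 31) & 1,        # signed overflow
    }
    return res, flags


if __name__ == "__main__":
    print(x86_add32_flags(0x7FFFFFFF, 1))
    print(x86_add32_flags(0xFFFFFFFF, 1))
```

Each of the six flags costs at least a few shift, mask, or compare operations when emulated with ordinary RISC instructions, so an addition whose flags are consumed can expand into many translated instructions. Hardware that produces the flags as a side effect removes most of that expansion.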

With x86 emulation, the Godson-3 virtual machine is compatible with x86 at both the ISA level and the Linux application binary interface (ABI) level, where the ISA-level compatibility is for low-cost PC applications and the Linux ABI-level compatibility is for server applications. Fig.7 shows the virtual machine of GS464. In this framework, the process-level and system-level virtual machine (VM) monitors have been improved to provide x86-compatible system calls. Preliminary performance evaluation shows that the above hardware support can speed up x86 binary translation significantly.

| MS Windows                          | Linux Apps. on x86      | Linux Apps. on MIPS |
|-------------------------------------|-------------------------|---------------------|
| System Level<br>x86 VM              | Process Level<br>x86 VM |                     |
| Linux on MIPS                       |                         |                     |
| Enhanced MIPS Decode                |                         |                     |
| Enhanced Godson Internal Operations |                         |                     |

Fig.7. Architecture of GS464 virtual machine.

## 4.4 Low Power Consumption

Godson processors reduce power consumption at the architecture design, physical design, and transistor levels.

Many power reducing techniques are used in the architecture and logical design of Godson processors. The coupled logic among different processor modules is minimized to maintain good locality and reduce the switching activities of logic. With the help of an accurate architecture-level power simulator, many micro-architecture parameters including the number of queues, the depth of pipelines, and even interconnect and routers of multi-core can be carefully selected considering the tradeoff between performance and power. For some power hungry modules like floating-point multiply-add units, specialized algorithms are also proposed to ensure low power consumption while achieving high performance.

Several measures are used in Godson physical design to reduce power consumption. Godson adopts a hybrid static-dynamic physical design methodology: only the timing-critical components (e.g., register file, RAM, and CAM) of Godson processors employ dynamic circuits. Clock gating is widely used in Godson designs. For each main component of a Godson processor such as Godson-3B, coarse-grain gating cells are visible to the operating system. Meanwhile, fine-grain gating cells cover most flip-flops (FFs) (85% of the FFs in Godson-3B). There are also mode gating cells that disable circuits not needed in the current mode (such as test circuits). State-of-the-art Godson designs, such as Godson-2H, employ power gating, DFS (Dynamic Frequency Scaling), and DVFS (Dynamic Voltage and Frequency Scaling) to reduce power consumption.

Godson products use state-of-the-art semiconductor processes. Most Godson products are based on 65 nm CMOS technology, and products based on 28 nm are being designed. There are three types of cells with distinct threshold voltages (HVT, SVT, and LVT): HVT cells are high-threshold-voltage cells with the lowest leakage power and the lowest speed, SVT cells are standard-threshold-voltage cells with higher speed and power consumption, and LVT cells are low-threshold-voltage cells with the highest speed. In 65 nm designs, only HVT cells are used at the compiling stage and SVT cells are used for optimization; no LVT cells are used.

## 4.5 Design for Debug

Godson-3B provides a design-for-debug (DFD) feature for post-silicon hardware debugging and software debugging. Because of the high complexity of the architecture and interconnection network in Godson-3B, the number of signals that need to be probed is huge. A DFD subsystem should be designed with minimum increase in area, minimum extra power cost, and minimum change to the architecture, because even a small architectural change may take a lot of time to redesign, re-layout, and re-verify.



Fig.8. Godson-3B with DFD subsystem.

To resolve the above problems, a general-purpose architecture for the DFD subsystem is proposed in the design of Godson-3B. The architecture is shown in Fig.8. The DFD subsystem can monitor, probe, and even control the interconnection network at low extra cost. It consists of several low-cost monitors, a dedicated DFD network, a control center that supports single-step debugging, a high-performance trace compressor (in the control center), and a standard EJTAG port to output the compressed trace. The debugging process for Godson-3B works as follows. A number of low-cost monitors can probe the state of the interconnection network completely. The control center can cut off the connection between the routers and the monitors that are not of interest; only the target signals will be transferred to the routers (rather than to the control center). In the control center, the probed trace is compressed by the trace compressor to reduce its size. The compressed trace can then be scanned out via EJTAG, and the software connected to the EJTAG interface analyzes the trace and detects (or locates) the errors.

## 5 Future Work

Under the Godson project, we will continue to design processors with high integration density for personal computer applications and with high throughput and scalability for server applications. At the architectural level, the main effort will go into finer optimization of the dynamic pipeline mechanism. At the circuit level, Godson will further combine the advantages of custom design and ASIC design, trading off frequency against power consumption. At the integration level, Godson will integrate 2 to 16 cores in a single chip, combined with a graphics processing core, a high-definition audio/video processing core, a memory controller, and abundant peripheral interfaces such as SATA, USB, network card, and PCI-E.

From a long-term perspective, for personal computer applications, Godson may implement a high-density computer-on-a-chip. In such a chip, in addition to the IP cores and peripheral interfaces stated above, application-specific processing engines, such as security engines and 3G/4G wireless modules, may be integrated. For server applications, Godson will further improve computing density and throughput by integrating more cores and high-density computing coprocessors. High-speed I/O modules will be used to balance computing and data serving. Optical interconnection may also be used to tackle the bandwidth bottleneck.

## References

- [1] Hu W W, Hou R, Xiao J H, Zhang L B. High performance general-purpose microprocessors: Past and future. *Journal of Computer Science and Technology*, 2006, 21(5): 631-640.
- [2] Hu W W, Zhang F X, Li Z S. Microarchitecture of the Godson-2 processor. *Journal of Computer Science and Technology*, 2005, 20(2): 243-249.
- [3] Hu W, Wang J, Gao X, Chen Y, Liu Q, Li G. Godson-3: A scalable multicore RISC processor with x86 emulation. *IEEE Micro*, 2009, 29(2): 17-29.
- [4] Hu W, Wang R, Chen Y, Fan B, Zhong S, Gao X, Qi Z, Yang X. Godson-3B: A 1 GHz 40 W 8-Core 128GFlops Processor in 65nm CMOS. In *Proc. the 58th IEEE International Solid-State Circuits Conference (ISSCC 2011)*, San Francisco, USA, Feb. 20-24, 2011, pp.75-76.
- [5] Gao X, Chen Y J, Wang H D, Tang D, Hu W W. System architecture of Godson-3 multi-core processors. *Journal of Computer Science and Technology*, 2010, 25(2): 181-191.
- [6] Hu W, Chen Y. GS464V: A high-performance low-power XPU with 512-bit vector extension. In *Proc. 22nd IEEE Symposium on High Performance Chips (HOT CHIPS 2010)*, Stanford University, USA, Aug. 22-24, 2010.
- [7] Cheng X, Wang X, Lu J, Yi J, Tong D, Guan X, Liu F, Liu X, Yang C, Feng Y. Research progress of UniCore CPUs and PKUnity SoCs. *Journal of Computer Science and Technology*, 2010, 25(2): 200-213.
- [8] Chen S, Wan J, Lu J, Liu Z, Sun H, Sun Y, Liu H, Liu X, Li Z, Xu Y, Chen X. YHFT-QDSP: High-performance heterogeneous multi-core DSP. *Journal of Computer Science and Technology*, 2010, 25(2): 214-224.
- [9] Huang Y, Zhu Y, Ju P, Wu Z, Chen C. Functional verification of "ShenWei-1" high performance microprocessor. *Journal of Software*, 2009, 20(4): 1077-1086. (In Chinese)
- [10] Chen Y, Lv Y, Hu W, Chen T, Shen H, Wang P, Pan H. Fast complete memory consistency verification. In *Proc. the 15th International Symposium on High-Performance Computer Architecture (HPCA 2009)*, Raleigh, USA, Feb. 14-18, 2009, pp.381-392.
- [11] Chen Y, Hu W, Chen T, Wu R. LReplay: A pending period based deterministic replay scheme. In *Proc. the 37th ACM IEEE International Symposium on Computer Architecture (ISCA 2010)*, Saint-Malo, France, Jun. 19-23, 2010, pp.187-197.
- [12] http://www.top500.org, 2011.





**Wei-Wu Hu** is a professor of computer science at the Institute of Computing Technology, Chinese Academy of Sciences. His research interests include high-performance computer architecture, parallel processing, and VLSI design. He received the Ph.D. degree in computer science from the Institute of Computing Technology.

**Yan-Ping Gao** joined the Loongson team in 2002. She is now a Ph.D. candidate at the Institute of Computing Technology, Chinese Academy of Sciences. Her research interests include IC physical design methodology, low-power design methodology and key low-power techniques, and asynchronous circuits and systems.

**Tian-Shi Chen** received the B.S. degree in mathematics from the Special Class for the Gifted Young, University of Science and Technology of China (USTC), Hefei, China, in 2005, and the Ph.D. degree in computer science from USTC in 2010. He is currently an assistant professor at the Institute of Computing Technology, Chinese Academy of Sciences. His research interests include parallel computing, hardware verification, and computational intelligence. He has authored or coauthored more than 20 papers in these areas.

**Jun-Hua Xiao** received his Ph.D. degree from the Institute of Computing Technology, Chinese Academy of Sciences, in 2008. He is currently an assistant professor at the Institute of Computing Technology. His research interests include high-performance computer architecture, microprocessor design, and performance analysis.