We use cookies to improve your experience with our site.

Indexed in:

SCIE, EI, Scopus, INSPEC, DBLP, CSCD, etc.

Submission System
(Author / Reviewer / Editor)
Yong Dou, Jie Zhou, Gui-Ming Wu, Jing-Fei Jiang, Yuan-Wu Lei, Shi-Ce Ni. A Unified Co-Processor Architecture for Matrix Decomposition[J]. Journal of Computer Science and Technology, 2010, 25(4): 874-885. DOI: 10.1007/s11390-010-1068-5
Citation: Yong Dou, Jie Zhou, Gui-Ming Wu, Jing-Fei Jiang, Yuan-Wu Lei, Shi-Ce Ni. A Unified Co-Processor Architecture for Matrix Decomposition[J]. Journal of Computer Science and Technology, 2010, 25(4): 874-885. DOI: 10.1007/s11390-010-1068-5

A Unified Co-Processor Architecture for Matrix Decomposition

Funds: Supported by the National Natural Science Foundation of China under Grant Nos. 60633050 and 60833004, 60903057, and the National High-Technology Research and Development 863 Program of China under Grant No. 2009AA01Z101.
More Information
  • Author Bio:

    Yong Dou is currently a Ph.D. supervisor and professor in National Laboratory for Parallel and Distributed Processing, National University of Defense Technology. His main research area focuses on high performance computer architecture, reconfigurable computing. As one of the main researchers, he took part in the project, "Research on Architecture of High-end Performance Computers", supported by NSFC in 2000. Dr. Dou is the co-chair of the 2009 conference on Advanced Parallel Processing Techniques (APPT2009). He is a member of IEEE, ACM and CCF.

    Jie Zhou is currently a Ph.D. candidate in National Laboratory for Parallel and Distributed Processing, National University of Defense Technology. His main research area focuses on high performance embedded architecture and reconfigurable computing.

    Gui-Ming Wu is currently a Ph.D. candidate in National Laboratory for Parallel and Distributed Processing, National University of Defense Technology. His main research area focuses on reconfigurable computing and high-level synthesis.

    Jing-Fei Jiang is currently a Ph.D. in National Laboratory for Parallel and Distributed Processing, National University of Defense Technology. Her main research area focuses on reconfigurable computing and high performance microprocessor design.

    Yuan-Wu Lei is currently a Ph.D. candidate in National Laboratory for Parallel and Distributed Processing, National University of Defense Technology. His main research area focuses on high performance embedded architecture and reconfigurable computing.

    Shi-Ce Ni is currently a graduated student in National Laboratory for Parallel and Distributed Processing, National University of Defense Technology. His main research area focuses on high performance embedded architecture and reconfigurable computing.

  • Received Date: February 09, 2009
  • Revised Date: November 02, 2009
  • Published Date: July 08, 2010
  • QR and LU decompositions are the most important matrix decomposition algorithms. Many studies work on accelerating these algorithms by FPGA or ASIC in a case by case style. In this paper, we propose a unified framework for the matrix decomposition algorithms, combining three QR decomposition algorithms and LU algorithm with pivoting into a unified linear array structure. The QR and LU decomposition algorithms exhibit the same two-level loop structure and the same data dependency. Utilizing the similarities in loop structure and data dependency of matrix decomposition, we unify a fine-grained algorithm for all four matrix decomposition algorithms. Furthermore, we present a unified co-processor structure with a scalable linear array of processing elements (PEs), in which four types of PEs are same in the structure of memory channels and PE connections, but the only difference exists in the internal structure of data path. Our unified co-processor, which is IEEE 32-bit floating-point precision, is implemented and mapped onto a Xilinx Virtex5 FPGA chip. Experimental results show that our co-processors can achieve speedup of 2.3 to 14.9 factors compared to a Pentium Dual CPU with double SSE threads.
  • [1]
    Farina A, Timmoneri L. Parallel algorithms and processing architectures for space-time adaptive processing. In Proc. Radar CIE International Conference, Beijing, China, October 8-10, 1996, pp.770-774.
    [2]
    Rabideau D J, Kogon S M. A signal processing architecture for space-based GMTI radar. In Proc. the Record of the IEEE Radar Conference, Waltham, Massachusetts, April 20-22, 1999, pp.96-101.
    [3]
    Fischer B, Modersitzki J. Fast inversion of matrices arising in image processing. Computer Science, 1999, 22(1): 1-11.
    [4]
    Batchelor G H. Introduction to Fluid Dynamics. 2nd Edition, Cambridge University Press, 2000.
    [5]
    Ojalvo I U. Proper use of Lanczos vectors for large eigenvalue problems. Computers & Structures, 1985, 20(1-3): 115-120.
    [6]
    Buttari A, Langou J, Kurzak J, Dongarra J. Parallel tiled QR factorization for multicore architecture. Concurrency and Computation: Practice and Experience, 2008, 20(13): 1573-1590.
    [7]
    The LINPACK Benchmark. http://www.netlib. org/linpack/, December, 2008.
    [8]
    Xu H, Alexander W E. Parallel QR factorization on a block data flow architecture. In Proc. the 24th Southeastern Symposium and the 3rd Annual Symposium on Communications, Signal Processing Expert Systems, and ASIC VLSI Design, March 1-3, 1992, pp.332-336.
    [9]
    Fernandez L, Garcia J M. The performance of fast Givens Rotation problem implemented with MPI extensions in multicomputer. In Proc. International Conference on Applications of High-Performance Computers in Engineering, Santiago de Compostela, Espagne, July 1997, pp.83-92.
    [10]
    Ian N Dunn, Gerard G L Meyer. Parallel QR factorization for hybrid message passing/shared memory operation. Journal of the Franklin Institute, 338(5): 601-613.
    [11]
    Hernandez V, Roman J E, Tomas A. A parallel variant of the gram-Schmidt process with reorthogonalization. In Proc. International Conference on Parallel Computing: Current \& Future Issues of High-End Computing, Malaga, Spain, Sept. 13-16, 2005, pp.221-228.
    [12]
    Bischof C H. A parallel QR factorization algorithm using local pivoting. SIAM Journal on Scientific and Statistical Computing, 1991, 12(1): 36-57.
    [13]
    Peng S, Sedukhin S, Sedukhin I. Householder bidiagonalization on parallel computers with dynamic ring architecture. In Proc. the 2nd Aizu International Symposium on Parallel Algorithms/Architecture Synthesis, Aizu-Wakamatsu, Japan, March 17-21, 1997, pp.182-191.
    [14]
    Rüger G, Schwind M. Comparison of different parallel modified Gram-Schmidt algorithms. In Proc. Euro-Par Parallel Processing, Lisbon, Portugal, Aug. 30-September 2, 2005, pp.826-836.
    [15]
    Buttari A, Langou J, Kurzak J, Dongarra J. Parallel tiled QR factorization for multicore architectures. Concurrency and Computation: Practice and Experience, 2008, 20(13): 1573-1590.
    [16]
    Oliveria S, Soma T. New partitioning schemes for parallel modified Gram-Schmidt orthogonalization. In Proc. the 3rd International Symposium on Parallel Architectures, Algorithms and Networks, Las Vegas, Nevada, USA, Dec. 7-9, 1997, pp.233-239.
    [17]
    Wilburn C Wilburn, Hak-Lim Ko, Winser E Alexander. An algorithm and architecture for the parallel solution of systems of linear equations. In Proc. IEEE Fifteenth Annual International Phoenix Conference on Computers and Communications, Arizona, USA, 1996, pp.392-398.
    [18]
    Singh C K, Prasad S H, Balsara P T. VLSI architecture for matrix inversion using modified Gram-Schmidt based QR decomposition. In Proc. the 20th International Conference on VLSI Design, Bangalore, India, Jan. 6-10, 2007, pp.836-841.
    [19]
    Lorenzelli F, Yao K. A linear systolic array for recursive least squares. IEEE Transactions on Signal Processing, 1995, 43(12): 3014-3021.
    [20]
    Liu K J R, Heieh S F , Yao K. Recursive LS filtering using block Householder transformation. In Proc. IEEE ICASSP, Albuquerque, USA, April 3-6, 1990, pp.1631-1634.
    [21]
    Tang C F T, Liu K J R, Tretter S A. On systolic arrays for recursive complex Householder transformations with applications to array processing. In Proc. International Conference on Acoustics, Speech, and Signal Processing, Toronto, Canada, May 14-17, 1991, pp.1033-1036.
    [22]
    Sergyienko A, Maslennikov O. Implementation of givens QR-decomposition in FPGA. In Proc. Int. Conf. Parallel Processing and Applied Mathematics, Na Leczów, Porland, Sept. 9-12, 2000, pp.458-465.
    [23]
    Yokoyama Y, Kim M, Arai H. Implementation of Systolic RLS adaptive array using FPGA and its performance evaluation. In Proc. the 64th Vehicular Technology Conference, Montreal, Canada, Sept. 25-28, 2006, pp.1-5.
    [24]
    Karkooti M, Cavallaro J R, Dick C. FPGA implementation of matrix inversion using QRD-RLS algorithm. In Conference Record of the 39th Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, Oct. 30-Nov. 2, 2005, pp.1625-1629.
    [25]
    Echman F, Owall V. A scalable pipelined complex valued matrix inversion architecture. In Proc. IEEE International Symposium on Circuits and Systems, Kobe, Japan, May 23-26, 2005, pp.4489-4492.
    [26]
    Kim D, Rajopadhye S. An improved systolic architecture for LU decomposition. In Proc. ASAP 2006, Steamboat Springs, USA, Sept. 11-13, 2006, pp.231-238.
    [27]
    Choi S, Prasanna V. Time and energy efficient matrix factorization using FPGAs. In Proc. FPL 2003, Lisbon, Portugal, Sept. 1-3, 2003, pp.507-519.
    [28]
    Yao K, Lorenzelli F. Systolic algorithm and architecture for high-throughput processing applications. Journal of VLSI Signal Processing, 2007, 53(1/2): 15-34.
  • Related Articles

    [1]Yi-Fan Zhang, Lei Sun, Qiang Cao. TLP-LDPC: Three-Level Parallel FPGA Architecture for Fast Prototyping of LDPC Decoder Using High-Level Synthesis[J]. Journal of Computer Science and Technology, 2022, 37(6): 1290-1306. DOI: 10.1007/s11390-022-1499-9
    [2]Shu-Quan Wang, Lei Wang, Yu Deng, Zhi-Jie Yang, Sha-Sha Guo, Zi-Yang Kang, Yu-Feng Guo, Wei-Xia Xu. SIES: A Novel Implementation of Spiking Convolutional Neural Network Inference Engine on Field-Programmable Gate Array[J]. Journal of Computer Science and Technology, 2020, 35(2): 475-489. DOI: 10.1007/s11390-020-9686-z
    [3]Ji-Liang Zhang, Wei-Zheng Wang, Xing-Wei Wang, Zhi-Hua Xia. Enhancing Security of FPGA-Based Embedded Systems with Combinational Logic Binding[J]. Journal of Computer Science and Technology, 2017, 32(2): 329-339. DOI: 10.1007/s11390-017-1700-8
    [4]Ji-Liang Zhang, Qiang Wu, Yi-Peng Ding, Yong-Qiang Lv, Qiang Zhou, Zhi-Hua Xia, Xing-Ming Sun, Xing-Wei Wang. Techniques for Design and Implementation of an FPGA-Specific Physical Unclonable Function[J]. Journal of Computer Science and Technology, 2016, 31(1): 124-136. DOI: 10.1007/s11390-016-1616-8
    [5]Cinzia Bernardeschi, Luca Cassano, Andrea Domenici. SRAM-Based FPGA Systems for Safety-Critical Applications: A Survey on Design Standards and Proposed Methodologies[J]. Journal of Computer Science and Technology, 2015, 30(2): 373-390. DOI: 10.1007/s11390-015-1530-5
    [6]Ji-Liang Zhang, Gang Qu, Yong-Qiang Lv, Qiang Zhou. A Survey on Silicon PUFs and Recent Advances in Ring Oscillator PUFs[J]. Journal of Computer Science and Technology, 2014, 29(4): 664-678. DOI: 10.1007/s11390-014-1458-1
    [7]Xiaofang (Maggie) Wang, Swetha Thota. A Resource-Efficient Communication Architecture for Chip Multiprocessors on FPGAs[J]. Journal of Computer Science and Technology, 2011, 26(3): 434-447. DOI: 10.1007/s11390-011-1145-4
    [8]Hui Dai, Qiang Zhou, Ji-Nian Bian. Multilevel Optimization for Large-Scale Hierarchical FPGA Placement[J]. Journal of Computer Science and Technology, 2010, 25(5): 1083-1091. DOI: 10.1007/s11390-010-1085-4
    [9]Ehsan Atoofian, Zainalabedin Navabi. A Test Approach for Look-Up Table Based FPGAs[J]. Journal of Computer Science and Technology, 2006, 21(1): 141-146.
    [10]Zheng Fangqing. The Decomposition of Belief Function[J]. Journal of Computer Science and Technology, 1992, 7(2): 189-192.

Catalog

    Article views (29) PDF downloads (1709) Cited by()
    Related

    /

    DownLoad:  Full-Size Img  PowerPoint
    Return
    Return