
Figures of the Article
-
Overall flow of EKF/PF SLAM.
-
Overall flow of RGB-D SLAM[18].
-
Overall flow of ORB SLAM[32].
-
Analysis of performance and power bottlenecks. (a) Breakdown of execution time of different stages. (b) Breakdown of power of different stages.
-
CPI stack of BenchSLAM.
-
Comparison of branch misprediction among SPEC CPU 2006 benchmarks (gcc and libquantum), neural network algorithms (CNN Train and CNN Test), and BenchSLAM algorithms (EKF, PF, RGB-D SIFT, RGB-D SURF, and ORB).
-
Overall architecture of the proposed accelerator. SPE: scalar processing element.
-
Architecture of the matrix processing unit (MPU), which consists of an array of Px×Py matrix processing elements (MPEs) and a buffer controller. (a) MPU architecture. (b) MPE architecture.
-
Architecture of a vector processing unit (VPU), which consists of Pz vector processing elements (VPEs). (a) VPU architecture. (b) VPE architecture.
-
Architecture of the control unit (CU). Inst.: instruction.
-
CONV on MPU (using 3×3 MPU in the example). (a) Scheduling of CONV instruction. (b) Data reuse between MPEs (four MPEs in the example).
-
Convolution code using non-macro instructions.
-
MMmV: the matrix-vector multiplication on MPU. ∗ means the multiplication operation.
-
MMmM: the matrix-matrix multiplication on MPU.
-
MVmV: the vector-vector multiplication on MPU. (a) Multiplication of Px×Py pairs of inputs. (b) Accumulated partial products in each MPE (Px×Py MPEs in total). (c) Propagating the accumulation-summation (Acc-Sum) from right to left in each MPE row. (d) Propagating the Acc-Sum from bottom to top in the left-most MPE column and outputting the final result from the top-left MPE.
-
VVmV: the vector-vector multiplication on VPU. (a) Multiplication of Pz pairs of inputs. (b) Accumulated partial products in each VPE (Pz VPEs in total). (c) Propagating the accumulation-summation from right to left and outputting the final result from the left-most VPE.
-
Mapping process of the SIFT algorithm.
: subtraction.
-
Mapping process of the g2o algorithm. b: coefficient vector. Δx: increments of pose. b = H·Δx.
-
Layout of the implemented accelerator (45 nm).
-
Accelerator (Acc) and x86 CPU (x86) speedups over ARM CPU on BenchSLAM.
-
Energy costs of CPUs (x86 and ARM) and accelerator (Acc) on BenchSLAM.
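
The MVmV caption above describes a four-step reduction, which can be sketched in software as follows. This is only an illustrative simulation under stated assumptions, not the accelerator's actual implementation: the function name `mvmv_dot`, the row-major assignment of input pairs to MPEs, and the reading of Px as the row count and Py as the column count are all hypothetical choices made for the example.

```python
def mvmv_dot(a, b, px, py):
    """Simulate a vector-vector multiplication on a px-by-py grid of MPEs.

    Assumptions (not taken from the article): px is the number of rows,
    py the number of columns, and input pairs are assigned to MPEs in
    row-major order, wrapping around every px * py pairs.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have equal length")

    # Steps (a) and (b): each pass multiplies px * py input pairs, and
    # every MPE accumulates its own partial product across passes.
    acc = [[0] * py for _ in range(px)]
    for i, (x, y) in enumerate(zip(a, b)):
        slot = i % (px * py)
        acc[slot // py][slot % py] += x * y

    # Step (c): propagate the sums from right to left in each row.
    for r in range(px):
        for c in range(py - 1, 0, -1):
            acc[r][c - 1] += acc[r][c]

    # Step (d): propagate the sums from bottom to top in the left-most
    # column; the top-left MPE then holds the final result.
    for r in range(px - 1, 0, -1):
        acc[r - 1][0] += acc[r][0]

    return acc[0][0]
```

For instance, `mvmv_dot([1, 2, 3, 4, 5], [6, 7, 8, 9, 10], 2, 2)` wraps the fifth input pair back onto the first MPE before the two propagation phases run. The VVmV flow on the VPU would correspond to the same sketch with a single row of Pz elements.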