Evaluating RISC-V Vector Instruction Set Architecture Extension with Computer Vision Workloads
Abstract: Research Background
Today, computer vision (CV) applications such as image classification, object detection, and face recognition are widely used in our daily lives. With the rapid development of IoT (Internet of Things) technologies, multimedia applications in embedded systems require efficient processing of image and video data to meet real-time requirements. Because CV algorithms can easily be decomposed into computations on smaller units of data, CV workloads are particularly well suited to Single Instruction Multiple Data (SIMD) instructions, which process multiple data items in a single execution. Compared with traditional instruction set architectures (ISAs) such as x86 and ARM, RISC-V, a recently proposed open ISA, adopts an incremental and modular design in its vector extension, and supports configurable data lengths as well as register grouping. We evaluated the RISC-V vector extension (RV-V) with CV algorithms. Our study shows that, compared with typical CV algorithms implemented with scalar instructions (such as Gray Scale, Mean Filter, and Edge Detection), the corresponding vector versions reduce the instruction count by about 24x. However, the reduction in instruction count does not translate directly into performance improvement; the actual speedup depends on the specific implementation of the RISC-V processor. On the Xuantie C906, our evaluation shows that by using vector instructions, typical CV algorithms achieve about 2.98x performance speedup over their scalar implementations.
Objective: To comprehensively evaluate the RISC-V vector extension with typical CV algorithms, comparing the same algorithms with and without the vector extension, and comparing their performance on the RISC-V and ARM platforms.
Methods: We evaluate CV algorithms running in the RISC-V vector environment along two dimensions. Vertically, we compare algorithms implemented with RISC-V vector instructions against those implemented with RISC-V scalar instructions, in terms of instruction count and the number of cycles consumed during execution. Horizontally, we compare the performance of the same CV algorithms, on the same data sets, on a RISC-V platform (with a typical RV-V accelerator) and on an ARM platform (with a NEON accelerator).
Results: 1) Using the RISC-V vector extension (RV-V) effectively improves the performance of CV algorithms. Compared with CV algorithms implemented with RISC-V scalar instructions, the RV-V implementations with the default setting LMUL=1 (register grouping disabled) reduce the instruction count by about 3x; on the Xuantie C906 processor used in our evaluation, the actual performance improvement (reduction in cycle count) is about 1.83x. 2) Given the sizes of real-world images, the variable vector length feature of RV-V helps little in further improving the performance of CV algorithms. This feature relieves programmers from handling cases where the data length is smaller than the vector length; however, since such cases form only a small portion of real-world images, it does not effectively improve the overall performance of CV algorithms. 3) The register grouping feature of RV-V, i.e., concatenating multiple vector registers to form long vectors, is effective in further improving the performance of CV algorithms. Our experiments show that grouping eight vector registers (LMUL=8) further reduces the instruction count of CV algorithms by about 8x compared with the default setting without register grouping, i.e., a 24x reduction over the scalar implementations. 4) However, the reduction in cycle count (actual performance speedup) from register grouping is determined by the underlying architecture of the RV-V co-processor. For example, with the RV-V co-processor of the Xuantie C906 (which has eight execution lanes), register grouping at the maximum level (LMUL=8) yields a 1.63x speedup over the default setting without register grouping (LMUL=1). In summary, for typical CV algorithms running on the Xuantie C906 processor, using RISC-V vector instructions achieves an overall performance speedup of about 2.98x.
Conclusions: We studied the potential of the RISC-V vector extension (RV-V) for executing CV algorithms. The experimental results show that, for typical CV algorithms, implementations using vector instructions reduce the instruction count by up to 24x compared with the same algorithms implemented with scalar instructions. Nevertheless, our experiments also show that translating the reduction in instruction count into actual performance must take the architecture of the underlying SIMD co-processor into account. Our evaluation on the Xuantie C906 processor, which implements the RISC-V ISA, shows that using RISC-V vector instructions achieves at most about a 2.98x performance speedup over scalar instructions.
Keywords:
- RISC-V vector extension
- SIMD
- computer vision
- OpenCV
Abstract: Computer vision (CV) algorithms have been extensively used for a myriad of applications nowadays. As multimedia data are generally well-formatted and regular, it is beneficial to leverage the massive parallel processing power of the underlying platform to improve the performance of CV algorithms. Single Instruction Multiple Data (SIMD) instructions, capable of conducting the same operation on multiple data items in a single instruction, are extensively employed to improve the efficiency of CV algorithms. In this paper, we evaluate the power and effectiveness of the RISC-V vector extension (RV-V) on typical CV algorithms, such as Gray Scale, Mean Filter, and Edge Detection. Our examinations show that, compared with the baseline OpenCV implementation using scalar instructions, equivalent implementations using RV-V (version 0.8) can reduce the instruction count of the same CV algorithm by up to 24x when processing the same input images. However, the actual performance improvement, measured by cycle counts, is highly dependent on the specific implementation of the underlying RV-V co-processor. In our evaluation, by using the vector co-processor (with eight execution lanes) of the Xuantie C906, vector-version CV algorithms exhibit up to 2.98x performance speedups on average compared with their scalar counterparts.
1. Introduction
Nowadays, computer vision (CV) applications such as image classification[1], target detection[2], face recognition[3], and many others, have been widely used in our daily lives. With the rapid development of IoT (Internet of Things) technologies, multimedia applications of embedded systems require efficient processing on the image and video data to catch up with their real-time requirements[4].
Behind the CV applications lie the CV algorithms that operate on the input multimedia resources such as audio, images and videos, to perform conversions, extractions and transformations. Typically, the inputs of these algorithms are treated as streams of data, each of which is uniformly and equivalently computed, and the computing on each datum (e.g., an image or a frame of a video stream) can be further broken down to that on smaller units of data (e.g., pixels in an image). These characteristics of computing make the CV workloads especially suitable to be processed by using the Single Instruction Multiple Data (SIMD) instructions that are capable of processing multiple data items with a single instruction[5].
In history, there are many architectures proposed to support the SIMD style of processing, including dedicated multimedia processors (typically on System-On-Chip[6]), programmable multimedia processors (such as Mali Video Processors[7]), and general-purpose processors (GPPs) with multimedia extensions[8]. Among these architectures, GPPs with SIMD extensions ultimately became the mainstream method in the world of embedded computing due to their power on processing general-purpose workloads as well as executing SIMD instructions. In the GPP approaches, various instruction extensions (such as MMX, SSE, AVX of Intel x86 processors, and NEON for ARM-based processors[9]) have been proposed and developed for both desktop/server computers and mobile/embedded platforms. In contrast to the traditional ISAs (such as x86 and ARM), RISC-V, a recently-proposed open instruction set architecture (ISA)[10], employs an incremental and modular design[11]. Its vector extension (RV-V) supports configurable data lengths as well as register grouping, but the effectiveness of RV-V on CV algorithms is still unclear.
In this paper, we evaluate the effectiveness of RV-V by using the CV algorithms as input workloads. Our findings are listed as follows.
1) It is effective to improve the performance of CV algorithms by using RV-V. For example, compared with the CV algorithms implemented by using RISC-V scalar instructions, the RV-V implementations of CV algorithms with the default setting of LMUL=1 reduce the instruction count by about 3x, and reduce the cycle counts (actual performance speedups) by about 1.83x on the Xuantie C906 processor used in our evaluations.
2) Given the large sizes of real-world images, the variable vector length feature of RV-V helps little in further improving the performance of CV algorithms. The variable vector length feature of RV-V relieves programmers from dealing with the "corner" cases, where the data lengths are smaller than the vector lengths. However, as the portion of corner cases is small in today's real-world images, accelerating the computation on corner cases does not effectively improve the overall performance (measured by instruction count or cycle count) of CV algorithms.
3) The register grouping feature of RV-V, i.e., concatenating multiple vector registers to form long vectors, is effective in further improving the performance of CV algorithms. Our evaluation shows that grouping eight vector registers (LMUL=8) further reduces the instruction count of CV algorithms by about 8x, compared with that of the default setting of LMUL=1, which leads to a 24x (i.e., 3 × 8) overall reduction in the instruction count over the scalar implementations.
4) The reduction in the cycle counts (actual performance speedups) by register grouping is, however, determined by the underlying architecture of an RV-V co-processor. For example, with the RV-V co-processor (which has eight execution lanes) of Xuantie C906, register grouping at the maximum level (i.e., LMUL=8) reduces the cycle counts by 1.63x compared with the default setting of LMUL=1 (i.e., no grouping) for our chosen CV algorithms. In summary, the overall reduction in the cycle count (i.e., performance improvement) by using RV-V instructions is up to 2.98x (i.e., 1.83 × 1.63) over their scalar counterparts for typical CV algorithms running on the Xuantie C906 processor.
With the evaluations, our paper makes the following contributions.
∙ It gives a discussion on the characteristics of RV-V, and the principles of using RV-V SIMD instructions on solving real-world computing problems.
∙ It implements typical CV algorithms on the RISC-V platform using both scalar and vector instructions.
∙ It extensively evaluates the performance of CV algorithms on the RV-V platform, to reveal the effectiveness of the prominent features (i.e., variable vector length and register grouping) of RV-V on CV algorithms.
The rest of this paper is organized as follows. Section 2 discusses the background and related work of this paper. Section 3 introduces RV-V. Section 4 uses RV-V to solve real-world problems including CV algorithms. Section 5 discusses the general architecture of RV-V co-processor. Section 6 evaluates the performances of CV algorithms running on top of RV-V. Section 7 concludes the paper and discusses the future work.
2. Background and Related Work
In this section, we present the research history of SIMD in Subsection 2.1, then introduce CV algorithms and a popular software package, i.e., OpenCV, in Subsection 2.2, and discuss related work in Subsection 2.3.
2.1 History of SIMD
The computer architectures supporting SIMD are widely studied in high-performance computing (HPC) and multimedia research fields with a long history. Such computer architectures[12] provide vector registers with longer data-width than general purpose scalar registers, a set of extended vector instructions that operate by using the vector registers, and (possibly multiple) processing units that support the execution of the vector instructions. The idea of SIMD instructions as well as processing units was first invented and applied to vector supercomputers around 1970 in the ILLIAC IV computer from the University of Illinois[13], the TI ASC supercomputer architecture from Texas Instruments[14], and the CRAY-1 computer system[15]. During the early years, these vector processors were mainly used for scientific applications.
In 1997, Intel introduced a new type of co-processor with the Multimedia Extension (MMX) support that is capable of processing multimedia data with SIMD instructions, alongside Intel's desktop processors[16]. The introduction of the MMX co-processor ignited the trend of designing specialized computer architectures that couple SIMD processing units to general processors, such as desktop and embedded processors. With years of development, the size of the vector register of the SIMD co-processor has increased from 64 bits (MMX) to 128 bits (Streaming SIMD Extension, SSE) and then to 256 bits (Advanced Vector Extension, AVX) in Intel's (x86) processors. Theoretically, the larger the vector register gets, the more data can be processed by a single instruction, and the better the data-level parallelism that can be achieved. However, the SIMD co-processor architectures of Intel x86 and ARM NEON[9] adopt a fixed data-width design (i.e., SIMD instructions operating on vector registers of fixed data length), and have the following limitations.
1) Limited Parallelism. The amount of data that can be processed with an SIMD instruction is fixed by the size of the vector register, which limits the data-level parallelism that can be exploited at the instruction level.
2) Inflexibility. If the input data amount is insufficient to fulfill a vector register, these architectures have to use scalar instructions to process the data, and cannot use the SIMD instructions at all.
3) Compatibility Issues. The fixed data-width limitation makes it difficult to port an application developed in one platform to another platform with different data-width of SIMD instructions.
Note that although ARM proposes a scalable vector extension (SVE)[17] that supports flexible vector length in its v8-A AArch64 instruction set, it is mainly used for HPC workloads. For the embedded application domain, the NEON architecture is mainly used in ARM processors to provide the SIMD processing capability.
2.2 CV Algorithms and OpenCV
CV algorithms, including typical conversion, extraction and transformation operations, are mainly used in image processing. We refer readers to a solid survey[18] on CV algorithms. OpenCV[19] is an open-source CV software library that provides a common infrastructure for multimedia applications and numerous optimized algorithms for CV. It supports various architectures such as x86 and ARM, and can take advantage of SIMD instructions when available. Imgproc is the core module of OpenCV, and it offers various image processing algorithms generic to the upper-level CV algorithms. Our evaluation will focus on typical algorithms in the Imgproc module, such as Gray Scale[20], Mean Filter[21], and Edge Detection[22].
2.3 Related Work
The vector extension of RISC-V has received more and more attention in recent years. A group at ETH Zürich[23] implemented an energy-efficient and scalable RISC-V vector processor based on RV-V (draft version 0.5). Tagliavini et al. implemented and evaluated small-float SIMD extensions in RISC-V[24]. Their work reported 2.18x speedups in performance and about 50% power savings in the SVM (support vector machine) workload for 8-bit small-float types, compared with 16-bit and 32-bit floating-point types. Louis et al.[25] implemented a set of deep learning models on TensorFlow Lite with the draft version 0.5 RV-V extension, and the evaluations showed that the vector implementation achieved about an 8x reduction in the number of instructions compared with scalar implementations. Today, the newest draft of RV-V is version 0.9, and in this paper we take its latest stable version, 0.8, to evaluate its effectiveness on CV algorithms, which is unaddressed in previous work[24, 25].
3. RISC-V Vector Extension
As a newly emerging ISA, RISC-V is featured with its modular design[11]. The ISA consists of a base component that supports integer instructions, and many extensions such as multiplication, atomic operations, and single- and double-precision floating-point computations. Among these extensions, RV-V enables the computer to execute SIMD instructions.
RV-V provides two types of registers: the data registers and the control and status registers (CSRs), and three types of instructions: configuration, memory access, and arithmetic operation instructions[26]. Table 1 lists selected instructions supported by the draft version 0.8 RV-V. Note that the set of instructions in Table 1 is by no means the complete instruction set of the version 0.8 RV-V, but is enough to reveal the principles and characteristics of processing in RV-V. We refer interested readers to the RV-V specification[26] (draft version 0.8) for complete and comprehensive discussions on RV-V.

Table 1. Selected Instructions of RV-V[26] (Draft Version 0.8)

Instruction Type | Instruction | Function
Configuration | vsetvli rd, rs1, vtypei | Set the vtype and vl CSRs and write the new value of the vl CSR, from an immediate value vtypei and integer scalar registers rs1 and rd
Memory access | vl{b, h, w, e}.v vd, (rs1) | Load a vector into vd from the memory address in rs1
Memory access | vls{b, h, w, e}.v vd, (rs1), rs2 | Load a vector into vd, with the stride value in rs2, from the memory address in rs1
Memory access | vs{b, h, w, e}.v vs3, (rs1) | Store the vector in vs3 to the memory address in rs1
Memory access | vss{b, h, w, e}.v vs3, (rs1), rs2 | Store the vector in vs3, with the stride in rs2, to the memory address in rs1
Arithmetic operation | vadd.vx vd, vs2, rs1 | vd[i] = vs2[i] + x[rs1]
Arithmetic operation | vmul.vx vd, vs2, rs1 | vd[i] = vs2[i] × x[rs1]
Arithmetic operation | vfadd.vf vfd, vfs2, rfs1 | vfd[i] = vfs2[i] + f[rfs1]
Arithmetic operation | vfmul.vf vfd, vfs2, rfs1 | vfd[i] = vfs2[i] × f[rfs1]
Arithmetic operation | vadd.vv vd, vs2, vs1 | vd[i] = vs2[i] + vs1[i]
Arithmetic operation | vmul.vv vd, vs2, vs1 | vd[i] = vs2[i] × vs1[i]
Arithmetic operation | vfadd.vv vfd, vfs2, vfs1 | vfd[i] = vfs2[i] + vfs1[i]
Arithmetic operation | vfmul.vv vfd, vfs2, vfs1 | vfd[i] = vfs2[i] × vfs1[i]
Arithmetic operation | vmacc.vx vd, rs1, vs2 | vd[i] = (vs2[i] × x[rs1]) + vd[i]
Arithmetic operation | vfmacc.vf vfd, rfs1, vfs2 | vfd[i] = (vfs2[i] × f[rfs1]) + vfd[i]

Note: vd and vs are integer vector registers, while rd and rs are integer scalar registers; vfd and vfs are float vector registers, while rfd and rfs are float scalar registers. "s" in vs, rs, vfs and rfs stands for source, and "d" in vd, rd, vfd and rfd stands for destination.

RV-V has 32 vector (data) registers (named v0–v31), each of which has a fixed data length (denoted as VLEN in the RV-V specification) defined according to the architect's choices. To simplify the following discussion, we take VLEN=128 (bits) in this paper. RV-V defines two important CSR registers, namely vtype and vl. The value of the vl register decides the effective length of the vectors, and can be set by using the vsetvli instruction according to the value of the vtype register.
As the vector length can be defined and adjusted dynamically (by setting the value of the vl register), arithmetic instructions listed in Table 1 can be conducted on vectors of different lengths, which is a key difference from the fixed data-width vector approach used in the SIMD designs of Intel AVX[27] or ARM NEON[9].
Suppose we have the following assembly code that sets the value of the vl register:
vsetvli a4, a0, e32. (1)

In (1), both a0 and a4 are general (scalar) registers of the RISC-V platform, used as the input and output parameters of the vsetvli instruction respectively. e32 is an assembler name that (implicitly, through the vtype register) sets the width of an element of the vector (denoted as SEW in the RV-V specification) to 32 bits. Taking (1) as input, the value of the vl register is computed by the following equation:
[vl] = [a4] = min([a0], VLEN/SEW). (2)

As we assume VLEN=128 and SEW=32 in this paper, we have VLEN/SEW = 4 (integer division) in (2). Therefore, if the input [a0] is smaller than 4, the value of the vl register will be [a0]; otherwise it will be 4.
Further, in order to efficiently process long vectors, RV-V provides a register grouping mechanism that combines multiple vector registers to provide long vector widths for the arithmetic instructions. For example, we can group eight vector registers to form a long vector storage space by using the following assembly code:
vsetvli a4, a0, e32, m8. (3)

In (3), e32 and m8 are assembler names: e32 sets the width of an element of the vector to 32 bits (SEW=32), and m8 sets the number of vector registers used in a register group (denoted as LMUL in the RV-V specification) to 8. The available choices (i.e., the digit after "m") are 1, 2, 4, and 8 in the version 0.8 RV-V specification, and m1 is the default (i.e., no register grouping) when the "m" part is omitted, as in assembly code (1). That is, RV-V can group at most eight vector registers (up to 8 × VLEN = 8 × 128 = 1024 bits). With the introduction of register grouping, the value of vl is computed using the following (4):
[vl] = [a4] = min([a0], LMUL × VLEN/SEW). (4)

For the assembly code (3), the value of vl will be 32 if [a0] is large enough (≥ 32), as LMUL=8, VLEN=128 and SEW=32, which means a vector can contain 32 data elements (32 bits each) after register grouping.
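The vl computation in (2) and (4) can be captured in a few lines of Python (a model of the specification's formula only, not of any hardware; the function name compute_vl is ours):

```python
def compute_vl(requested, sew, lmul=1, vlen=128):
    """Model of the vl value set by vsetvli, per (4);
    with lmul=1 this degenerates to (2)."""
    return min(requested, lmul * vlen // sew)

# vsetvli a4, a0, e32:      [a0]=7, SEW=32, LMUL=1  -> vl = 4
assert compute_vl(7, sew=32) == 4
# vsetvli a4, a0, e32, m8:  [a0]=1024, SEW=32, LMUL=8 -> vl = 32
assert compute_vl(1024, sew=32, lmul=8) == 32
```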
Besides the configuration instructions, RV-V defines a set of memory access instructions and a set of arithmetic instructions as listed in Table 1. The memory access instructions are capable of loading/storing data from/to the memory, while the arithmetic instructions conduct operations (such as addition, multiplication, and accumulation) based on the values of the vector and scalar registers. Some of the memory access instructions (such as "vls{b, h, w, e}.v vd, (rs1), rs2") can read data whose width is defined by the instruction itself (e.g., "b" means a byte, "h" a half-word, "w" a word, and "e" an element) from the memory address stored in a scalar register rs1, with a stride defined by the value stored in another scalar register rs2. The suffix of an arithmetic instruction (e.g., "vf" of the "vfmul.vf" instruction) in RV-V regulates the participating operands, where "v" means a vector register, "x" means a scalar register, and "f" denotes a float register.
4. Using RV-V
In this section, we first discuss how to use RV-V by a simple example in Subsection 4.1, and then implement a typical CV algorithm (Gray Scale) in Subsection 4.2.
4.1 Simple Example
To demonstrate the method of using RV-V to solve real-world problems, Fig.1 gives an example code snippet that adds two integer input arrays w and y, and stores the result in array z. The addition is conducted in the element-wise fashion.

As the operations conducted on the elements of the input arrays are homogeneous (integer addition), the problem listed in Fig.1 is suitable to be solved by using the SIMD instructions of RV-V. However, the array length (i.e., the input parameter "len") directly affects the execution of the SIMD instructions. In the following, we focus on how to use RV-V to solve this problem in two cases: len=7 and len=1024. The first case (len=7) represents problems with non-aligned boundaries, and the second case (len=1024) represents problems with excessively long vectors.
4.1.1 Case of len=7
The code listed in Fig.2(a) shows the solution of the above array addition example using RV-V. A general register (i.e., a0) is initialized to len (i.e., 7) in the first line, and used as a parameter in the following code. Line 6 of Fig.2(a) computes the value of the vector length register, i.e., vl. As we now know VLEN=128 (system parameter), SEW=32 (also set at line 6), and [a0]=7 (array length), according to (2), we have [vl]=4. That is, we use vectors whose length equals 4 in the first round of the loop.

After loading and adding the first four elements of arrays w and y, the length parameter stored in a0 is decreased by the value of vl (at line 11), and then we have [a0]=7−4=3. Therefore, when the execution of the code in Fig.2(a) comes back to line 6 in the second loop, the vector length ([vl]) needs to be recomputed, and will be 3. That is, we use vectors whose length equals 3 in the second round of the loop.
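The strip-mined loop of Fig.2(a) can be sketched in Python as follows (a behavioral model of the assembly with our own names; each loop iteration stands for one vsetvli/load/add/store round):

```python
def vector_add(w, y, vlen=128, sew=32):
    """Behavioral model of the strip-mined loop in Fig.2(a):
    each round sets vl = min(remaining, VLEN/SEW), then loads,
    adds, and stores vl elements."""
    z, rounds = [], []
    base, remaining = 0, len(w)
    while remaining > 0:
        vl = min(remaining, vlen // sew)                     # vsetvli
        z += [w[base + i] + y[base + i] for i in range(vl)]  # vlw/vadd/vsw
        rounds.append(vl)
        base += vl
        remaining -= vl                                      # consume vl elements
    return z, rounds

z, rounds = vector_add(list(range(7)), [10] * 7)
assert rounds == [4, 3]                      # first round: 4 elements; second: 3
assert z == [10, 11, 12, 13, 14, 15, 16]
```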
The len=7 array addition example demonstrates the capacity of RV-V to conduct vector operations even when the length of the vectors varies on the fly, comfortably coping with boundary cases using vector instructions. Such a capacity is, however, absent in SIMD ISA designs with fixed-length vectors. On a platform (e.g., Intel AVX or ARM NEON) that uses a fixed-length vector SIMD ISA design, supposing the length of a vector is four integers (128 bits, similar to our VLEN setting for RV-V), the program has to resort to scalar instructions to handle the second loop, since there are only three data elements and they cannot fill a vector.
4.1.2 Case of len=1024
Fig.2(b) lists the code of our above array addition example when len=1024. As the input arrays are relatively long, we use the register grouping feature of RV-V to accelerate the computation.

We can observe that, different from the case of len=7 listed in Fig.2(a), the code of Fig.2(b) first sets [a0] to 1024 as the input parameter in the first line, and then switches on register grouping by passing the m8 assembler name to the vsetvli instruction in line 6. The effect of m8 is to set LMUL to 8, i.e., grouping eight vector registers to form long vector registers to improve the efficiency of processing. According to (4), we have [vl] = 32 (elements), as [a0] = 1024, VLEN=128, SEW=32 and LMUL=8. Moreover, [vl] remains 32 till the end of the execution, as 1024 is divisible by 32.
As discussed in Section 3, the candidate values of LMUL are 1, 2, 4 and 8 in the version 0.8 RV-V specification, and vectors of different lengths can be defined by giving different values after "m" in the sixth line of the code in Fig.2(b). Table 2 lists the number of instructions needed to complete the addition of two input arrays, whose sizes are both 1024.
Table 2. Instruction Counts for Adding Two 1024-Element Arrays Under Different LMUL Settings

LMUL | Instruction Count
1 | 11 × 1024/4 = 2816
2 | 11 × 1024/8 = 1408
4 | 11 × 1024/16 = 704
8 | 11 × 1024/32 = 352

From Table 2, we can observe that the instruction count is inversely proportional to LMUL. That is, a larger LMUL generally implies a smaller instruction count in RV-V. We will observe similar phenomena in the experimental results of our evaluations in Section 6.
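The counts in Table 2 follow directly from the loop structure: 11 instructions per iteration, and LMUL × VLEN/SEW elements consumed per iteration. A quick Python check (our own helper, assuming VLEN=128 and SEW=32 as in the text):

```python
import math

def instruction_count(n, lmul, vlen=128, sew=32, per_iter=11):
    """Instructions for the array addition loop of Fig.2(b):
    11 instructions per iteration, LMUL*VLEN/SEW elements per iteration."""
    return per_iter * math.ceil(n / (lmul * vlen // sew))

# Reproduces the four rows of Table 2 for n = 1024
assert [instruction_count(1024, m) for m in (1, 2, 4, 8)] == [2816, 1408, 704, 352]
```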
4.2 Implementing CV Algorithms
Although there are a variety of CV algorithms[18], the basic functionality of these algorithms is similar: taking an image as input, conducting some computations on the input image data, and producing an output image. To demonstrate the method of implementing and optimizing CV algorithms by using RV-V, we take the Gray Scale algorithm[20] as an example, and discuss how to use RV-V on the computation of this algorithm in the following.
The input image of the Gray Scale algorithm is a color image consisting of multiple pixels, and the number of pixels is decided by the resolution of the image. For example, an image with a resolution of 800 × 600 has 480 000 pixels in total. To display the color of a pixel, the computer uses three channels (i.e., red, green and blue, generally abbreviated as RGB), and gives an integer value to each channel to express the depth of color, so that each pixel occupies 24 bits with 8 bits per channel. Generally, a higher color depth (occupying larger storage space) means more vivid images. Therefore, the 800 × 600 input (color) image in the above example is actually stored (on disk or in memory) as 480 000 tuples of (R, G, B), and each component of the tuple occupies a storage space of one color depth.
The function of the Gray Scale algorithm is to convert the input color image into a black and white image that has only one color channel; for each pixel, its color value expresses the degree of blackness. During conversion, the algorithm first assigns a weight value to each original color channel (thus there are three weight values), and then computes the degree of blackness for each pixel using the following equation:
GScale = ⌊R × wr + G × wg + B × wb⌋, (5)

where wr, wg and wb denote the (float) weight values assigned to the red, green and blue channels respectively, and GScale denotes the result. Assuming an input image src has been loaded into memory, the logic of the Gray Scale algorithm is listed in Algorithm 1.
From Algorithm 1, we can observe that the algorithm first prepares three arrays, i.e., vr, vg and vb, for the three channels R, G and B respectively, and then uses dot multiplications and accumulation operations to produce the final result. As the arrays can be easily broken down into many (shorter) vectors during computation, the algorithm can be comfortably implemented using the facilities provided by RV-V. For example, we can use the vls{b, h, w, e}.v instruction (loading data from memory to vector registers with a stride) to assemble the vectors, the vfmul.vf instruction (multiplying a vector and a float scalar) to conduct the dot multiplication, and the vfmacc.vf instruction (accumulating the result of a multiplication into another vector) from Table 1 to accumulate the results.
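The flow above can be modeled with a short Python sketch (a behavioral model only, not RV-V or OpenCV code; the stride-3 slicing stands in for the vls*.v loads, and int() implements the floor in (5) for non-negative values):

```python
def gray_scale(src, wr, wg, wb):
    """Behavioral model of Algorithm 1: deinterleave the (R, G, B)
    stream with stride-3 accesses (the role of vls*.v), then compute
    the weighted sum (the role of vfmul.vf / vfmacc.vf) per (5)."""
    vr, vg, vb = src[0::3], src[1::3], src[2::3]   # strided "loads"
    return [int(r * wr + g * wg + b * wb)          # floor, values >= 0
            for r, g, b in zip(vr, vg, vb)]

# Two pixels, pure red and pure green, with the weights of Subsection 6.1
dst = gray_scale([255, 0, 0, 0, 255, 0], wr=0.114, wg=0.587, wb=0.299)
assert dst == [int(255 * 0.114), int(255 * 0.587)]
```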
Algorithm 1. Gray Scale Algorithm
Input: int resolution; //say 800 × 600
src[resolution × 3]; //input image
Output: dst[resolution]; //output image
/* Load RGB data into three arrays */
1 load R channel values of src to vr;
2 load G channel values of src to vg;
3 load B channel values of src to vb;
/* Calculate result with vector registers */
4 dst ← vr × wr + vg × wg + vb × wb;

5. Architecture of RV-V Co-Processor
RV-V is merely an ISA specification regulating the SIMD instructions that an RV-V co-processor should support; the implementations of RV-V co-processors may vary from one design to another. For example, the Hwacha vector processor[28] and the RISC-V2 vector processor[29] are two open-source co-processor designs, while the RV-V co-processors of Xuantie C906 (chosen in our evaluations in Section 6) and Xuantie C910[30] are two proprietary designs. Fig.3 illustrates the general architecture of these RV-V co-processors.
As shown in Fig.3, the instruction stream of a program is first sent to the RISC-V processor (scalar core), where the vector instruction flow (i.e., VIF) is diverted to the RV-V co-processor and stored in the vector instruction queue (i.e., VIQ). The instructions in VIQ are then issued to the vector instruction storage (i.e., VIS) for decoding, and subsequently continue to execute in a pipeline fashion. At the execution stage of a vector instruction, multiple elements (stored in the vector register file (i.e., VRF)) of an instruction are computed in parallel on multiple vector execution lanes (i.e., VXUs). Then the computing results are first stored in the vector send data queue (i.e., VSDQ) and then written back to memory at the write-back stage. During the execution of a vector instruction, the vector load data queue (i.e., VLDQ) loads data required by the next vector instruction into the VRF, to overlap computation and memory access.
Within this architecture, assuming the execution stage consumes only one cycle and the overheads of memory access are perfectly masked, the number of execution lanes has a decisive impact on the efficiency of an RV-V co-processor, since it decides the maximum number of data elements of a vector instruction that can be processed in parallel. Processing all the elements in a vector takes one cycle when the number of data elements is smaller than or equal to the number of execution lanes, and more than one cycle otherwise.
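Under these idealized assumptions, the cycle cost of one vector instruction can be modeled as follows (a sketch of the reasoning above, not of any concrete co-processor):

```python
import math

def vector_instruction_cycles(vl, lanes=8):
    """Idealized cycle count of one vector instruction: one pass of
    the execution lanes per `lanes` elements, assuming single-cycle
    execution and fully hidden memory latency."""
    return math.ceil(vl / lanes)

assert vector_instruction_cycles(4) == 1    # vl <= lanes: one cycle
assert vector_instruction_cycles(8) == 1
assert vector_instruction_cycles(32) == 4   # LMUL=8, SEW=32: four passes
```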
6. Evaluations
In this section, we evaluate the effectiveness of using RV-V on CV algorithms. We introduce the three typical CV algorithms used in our evaluations in Subsection 6.1, introduce the test-beds in Subsection 6.2, and present our experimental results in Subsection 6.3.
6.1 Selected CV Algorithms
Besides the Gray Scale algorithm discussed in Subsection 4.2, we choose two more CV algorithms, i.e., Mean Filter[21] and Edge Detection[22], to conduct our evaluations.
Fig.4 takes as an example a 640 × 480 image (Fig.4(a)), each pixel of which has three channels (RGB) with 256 levels per channel (i.e., each color value occupies 8 bits, color-depth=8). The Gray Scale step uses Algorithm 1 with parameters wr=0.114, wg=0.587 and wb=0.299 (weight values for the three channels respectively) to convert the original image into a black and white (single-channel) image shown in Fig.4(b), each pixel of which occupies 8 bits.
The Mean Filter algorithm[21] takes a black and white image as its input, and for each pixel of the input image, the algorithm re-computes its color value by averaging it with those of its eight surrounding pixels. Fig.4(c) presents the output image produced by the Mean Filter algorithm with Fig.4(b) as input. The Edge Detection algorithm also takes a black and white image as its input, computes the gradient value of each pixel in both the horizontal and vertical directions, and gives a bipolar value (i.e., 0 or 255; 0 means white, while 255 means black) to each pixel according to the computed result. Fig.4(d) gives the resulting image produced by the Edge Detection algorithm with Fig.4(b) as input. Both Fig.4(c) and Fig.4(d) are black and white images with color-depth=8 (i.e., each pixel occupies 8 bits).
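The Mean Filter step described above can be sketched with a minimal Python model (our own sketch; it averages each interior pixel with its eight neighbours using integer division, and simply keeps border pixels unchanged, which is one of several possible border policies):

```python
def mean_filter(img):
    """3x3 mean: each interior pixel becomes the integer average of
    itself and its eight neighbours; border pixels are copied
    unchanged in this sketch (a simplified border policy)."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            out[i][j] = sum(img[i + di][j + dj]
                            for di in (-1, 0, 1)
                            for dj in (-1, 0, 1)) // 9
    return out

img = [[9, 9, 9], [9, 0, 9], [9, 9, 9]]   # one dark pixel in a bright patch
assert mean_filter(img)[1][1] == 8        # (8 * 9 + 0) // 9
assert mean_filter(img)[0][0] == 9        # border pixel kept as-is
```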
During the above computations, we define SEW=32 bits for all (scalar and vector) cases. Accordingly, we extend the color values to 32 bits (by padding the higher bits of the color depth values of Fig.4(a) with 0s) during the computation of Gray Scale, and store each pixel of the black and white image in Fig.4(b) as 32 bits to align with SEW. In the evaluations, we also use images of other realistic resolutions (such as 256 × 256, 512 × 512, 800 × 600 and 1024 × 768) to evaluate the CV algorithms. The organization and processing of these images are similar to those of the example 640 × 480 image.
6.2 Test-Beds
We use a RISC-V platform and an ARM platform as listed below in our evaluations.
6.2.1 RISC-V Platform
We use the Allwinner D1 developing board, which contains a Xuantie C906 processor (hard core) running at 1 GHz and 512 MB DDR3 RAM, to measure the performance of our chosen CV algorithms. The C906 processor is configured with a 32 KB instruction cache and a 32 KB data cache, features a five-stage in-order pipeline, and supports the "RV64GCV" ISA. The SIMD co-processor of C906 supports the version 0.7.1 RV-V specification, which is similar to the version 0.8 RV-V specification discussed in this paper, and its vector register width is 128 bits. The D1 platform runs 64-bit Debian Linux version 11.
We compile the scalar-version CV algorithms using the RISC-V GCC cross-compiler. To produce the vector-version CV algorithms, we implement a set of intrinsic functions that conduct the basic vector operations using the instructions listed in Table 1, and link the source code of the CV algorithms against these intrinsic functions. Moreover, within these intrinsic functions, we vary the number of vector registers in a group ({\tt LMUL}) to be 1 (minimal degree), 2, 4 and 8 (maximal degree), respectively, to examine the power of the register grouping feature of RV-V. During the evaluations, we execute both the scalar version and the vector version of our selected CV algorithms with various input images on the D1 RISC-V platform, and collect the values of the instret and cycle registers (CSRs of the RISC-V processor) to track the number of committed (scalar and SIMD) instructions and the number of cycles during the executions[26].
As Xuantie C906 is not an open-source design, details (e.g., the number of execution lanes) of its RV-V co-processor are unknown. We design a micro-benchmark to reveal the number of execution lanes. The benchmark defines three equal-length integer/float arrays (A, B and C) and conducts a loop of vector instructions, each of which adds A and B into C in an element-wise fashion. The arrays are small enough to fit in the 32 KB data cache, so no main-memory accesses occur during the executions. We vary the number of elements (from 1 to 32) added by each vector instruction by leveraging the variable vector length and register grouping features of RV-V, and record the CPI (cycles per instruction) in Fig.5.
We can observe from Fig.5 that the vector instructions whose vector lengths are in the range [1, 8] consume the same number of cycles, which means that the RV-V co-processor of C906 has eight execution lanes. Besides, we can also observe from Fig.5 that the vector instructions whose vector lengths are in the range [9, 16] (meaning {\tt LMUL}=4) consume twice as many cycles as those whose vector lengths are in the range [1, 8] (meaning {\tt LMUL}=1 or 2), and that the vector instructions whose vector lengths are in the range [17, 32] (meaning {\tt LMUL}=8) consume twice as many cycles as those in the range [9, 16]. These phenomena are caused by the limited number of execution lanes, and we will use these experimental data to explain the performance characteristics in Subsection 6.3.3.
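The staircase pattern in Fig.5 can be reproduced by a simple cost model. This model is our own assumption fitted to the measurements, not documented C906 behavior: the co-processor appears to sweep the entire register group, so the cycle count of a vector instruction depends on the selected {\tt LMUL} rather than on the active vector length (with {\tt SEW}=32 and 128-bit registers, one register holds 4 elements).

```c
/* Smallest LMUL (1, 2, 4 or 8) whose register group can hold vlen elements,
 * assuming 4 elements per 128-bit register (SEW=32). */
static int lmul_for(int vlen) {
    for (int lmul = 1; lmul <= 8; lmul *= 2)
        if (4 * lmul >= vlen)
            return lmul;
    return -1;  /* vlen > 32 cannot be held with SEW=32 */
}

/* Modelled cycles of one vector instruction on a co-processor with the given
 * number of execution lanes: the whole register group is processed. */
static int cycles_per_vector_insn(int vlen, int lanes) {
    int elems = 4 * lmul_for(vlen);
    return elems < lanes ? 1 : elems / lanes;
}
```

With lanes=8 the model yields 1 cycle for vector lengths in [1, 8], 2 cycles for [9, 16] and 4 cycles for [17, 32], matching the doubling pattern observed in Fig.5.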
6.2.2 ARM Platform
We use gem5[31] in the SE (System-call Emulation) mode to simulate an in-order Cortex-A ARM processor with a NEON SIMD co-processor running at 1 GHz. The ARM processor is also configured with 32 KB instruction cache and 32 KB data cache, and the simulated platform is configured with 512 MB DDR3 RAM. Similar to the RV-V co-processor, the vector register width of NEON is also 128 bits.
We also cross-compile the CV algorithms using the ARM compiler to produce the scalar-version executables for the ARM platform, and rely on the built-in intrinsic functions of the ARM compiler to program the vector-version CV algorithms. We obtain the numbers of instructions and cycles, which reflect the performance of the CV algorithms on the given input images, from gem5's outputs.
6.3 Experimental Results
We evaluate RV-V when conducting CV algorithms from three aspects. First, we compare the performances of the same CV algorithms that use scalar or vector instructions respectively. Second, we evaluate the power of variable vector length of RV-V. Finally, we examine the effectiveness of register grouping in RV-V on the performances of our selected CV algorithms.
6.3.1 SIMD vs Scalar
Fig.6 compares the performance of our selected CV algorithms implemented using scalar instructions and using SIMD vector instructions.
From Fig.6(a), we can observe that for the same CV algorithm, the instruction count of its vector version ({\tt LMUL}=1) is about 30% of that of its scalar version, when conducted on the same 800 × 600 image. That is, by using SIMD vector instructions, the instruction count of a CV algorithm can be reduced to about one third. The reason behind such a reduction is that RV-V uses vectors to conduct the computations on pixels, where a single instruction can process four data items (a 128-bit vector register with {\tt SEW}=32). Moreover, the necessary scalar instructions, for example those used to advance memory-referencing pointers, slightly dilute the acceleration, finally leading to the roughly 3x smaller instruction count. Comparing the cycle counts of the two versions (scalar and vector) of the CV algorithms on the RISC-V platform, we can observe that the vector version achieves on average only about a 45.4% reduction, which means a 1.83x average speedup by using the RV-V co-processor to conduct the CV workloads.
Apparently, the reductions in cycle count are much smaller than those in instruction count. We speculate (as C906 has no open documentation) that vector instruction scheduling and memory accesses in the RV-V co-processor of C906 incur overheads, which introduce pipeline stalls and hence extra cycles.
From Fig.6(b), we can observe that as the sizes of the input images increase, the instruction count of the Mean Filter increases accordingly, while the performance gain from using the SIMD facilities of RV-V remains at about 3x. Similar patterns can also be observed for the other two CV algorithms. Likewise, for the Mean Filter algorithm, the reductions in cycle count are about 45%, as in Fig.6(a).
Comparing the performance (instruction and cycle counts) of the RISC-V and ARM platforms, we can observe that both the scalar-version and the vector-version CV algorithms achieve better performance on our RISC-V platform (i.e., Allwinner's D1) than on the simulated ARM system. Moreover, the average performance speedup from using the vector co-processor on ARM is about 1.41x, which is lower than that (1.83x) of the RISC-V platform.
6.3.2 Variable Vector Length
As discussed in Section 3 and Section 4, an interesting feature of RV-V is that it can use vectors of variable lengths to cope with computations whose data do not fully populate a vector register. This feature is useful when a CV algorithm processes the image border region, where the data width of each row is smaller than that of a full vector.
In order to better explain the problem, we first define the term "border region" using an example m\times n image (where m and n are integers larger than 1), as shown in Fig.7, where the border region is the gray area at the right side of the image. Supposing the vector length is 4 in this example, the border region has n \bmod 4 pixel columns (a value in the range [0, 3]), where \bmod is the modulo operator. Clearly, for each pixel row of the border region, the data items (pixel values) are insufficient to fill a vector register. On platforms that employ a fixed data width, such as ARM NEON, the computations on the pixels inside the border region have to resort to scalar instructions, and are conducted one pixel at a time. However, RV-V can still use vector instructions on such border regions, by leveraging its capability of varying the vector length as discussed in Section 4.
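The strip-mining pattern just described can be sketched in portable C. This is our illustration of the idea, not RV-V intrinsic code; the inner copy loop stands in for one vector instruction, and the min(remaining, VLEN) step models what vsetvli does.

```c
#include <stddef.h>

#define VLEN 4  /* assumed vector length, matching the Fig.7 example */

/* RV-V-style strip-mining over one pixel row: each iteration processes
 * min(remaining, VLEN) elements, so the (n mod VLEN) border columns are
 * handled by one shorter vector operation instead of a per-pixel scalar
 * loop. Returns the number of vector operations issued. */
static size_t process_row(const int *src, int *dst, size_t n) {
    size_t ops = 0;
    for (size_t i = 0; i < n; ops++) {
        size_t vl = (n - i < VLEN) ? (n - i) : VLEN;  /* vsetvli-like */
        for (size_t j = 0; j < vl; j++)  /* models one vector instruction */
            dst[i + j] = src[i + j];
        i += vl;
    }
    return ops;
}
```

For a row of 11 pixels this issues three vector operations (lengths 4, 4 and 3); a fixed-width design would instead need two vector operations plus three scalar ones for the tail.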
Will the variable vector length feature of RV-V lead to performance gains over a fixed data-width design? With this question, we take three images of realistic resolutions (i.e., 256 × 256, 640 × 480 and 1024 × 768) as inputs, measure the instruction counts during the execution of the Mean Filter algorithm on both RV-V (with {\tt LMUL}=1) and ARM NEON, and compare them in Fig.8.
As 256, 480 and 768 are divisible by 4, we slightly change the input images by increasing the number of pixel columns by 3, and also record the instruction and cycle counts of the Mean Filter on these altered images in Fig.8. The three artificially added pixel columns form the worst-case border regions, and force ARM NEON to use scalar instructions during the computation, while RV-V still uses vector ones.
From Fig.8, we can observe that with the increased pixel columns (and accordingly the enlarged border regions), the instruction counts of the Mean Filter algorithm on both ARM NEON and RV-V increase, where the added instructions on ARM NEON are scalar ones, while those on RV-V are still vector ones. However, the increases in instruction count are generally marginal (e.g., below 4% on ARM NEON and below 2% on RV-V). The reason is that real-world images processed by CV algorithms have a relatively large number of pixel columns (i.e., a large value of n), which greatly reduces the area percentage of the border region in the whole image. For example, when n=256, the area portion of the border region in the whole image is about 3/256 \approx 1.2\%, which does not incur a notable performance penalty even when scalar instructions are used. This observation is also verified by the slight increases in cycle counts. Moreover, a larger n means an even smaller area percentage of the border region, and thus smaller extra processing overheads.
With these observations, we conclude that the variable vector length feature of RV-V does not bring direct performance advantages to the CV algorithms running on top of the RISC-V platforms. Instead, we speculate that for CV workloads, the real value of the variable vector length feature lies in the ease of programming and in portability, since the computation on a whole image can be implemented entirely with SIMD code.
6.3.3 Register Grouping
As discussed in Section 3 and Section 4, RV-V can group multiple vector registers to form "large registers" that store long vectors, so as to further improve the data-level parallelism. To examine the effectiveness of this feature, we run the scalar version and the vector versions (with {\tt LMUL} set to 1, 2, 4 and 8, respectively) of our chosen CV algorithms on images at the resolution of 800 × 600, and record the committed instructions during the executions in Fig.9.
From Fig.9, we can observe that, for the vector-version CV algorithms, the instruction count decreases almost in inverse proportion to {\tt LMUL} as it increases from 1 to 8. The reason behind this reduction pattern is that most CV algorithms (such as our chosen ones) are typical SIMD workloads, i.e., an algorithm conducts the same operation on multiple data items (pixels of an input image). Moreover, the operations conducted on the data items are embarrassingly parallel, i.e., there is no data dependence between any two operations. These characteristics of CV algorithms make it easy for them to take advantage of the large (grouped) vector registers to improve their parallelism and efficiency.
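The inverse-proportional reduction follows directly from the numbers in Table 2, which can be sketched as a one-line model (assuming, as in the Fig.1 example, an 11-instruction loop body and 4 elements per 128-bit register with {\tt SEW}=32):

```c
/* Instruction-count model for a strip-mined loop over len elements:
 * each strip costs 11 instructions and covers 4 * LMUL elements
 * (128-bit registers, SEW=32, LMUL registers grouped). */
static long insn_count(long len, int lmul) {
    return 11L * len / (4L * lmul);
}
```

Doubling {\tt LMUL} doubles the number of elements covered per strip, halving the number of strips and hence the instruction count.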
Nevertheless, such large reductions in the instruction counts do not translate into similar reductions in the cycle counts. For our chosen algorithms, the cycle count drops (on average by about 38.7%, i.e., a 1.63x speedup) only from {\tt LMUL}=1 to {\tt LMUL}=2, and shows no further apparent reduction when {\tt LMUL} increases to 4 and 8. The reason is that the RV-V co-processor has only eight execution lanes, as revealed by the micro-benchmark in Fig.5, and the execution lanes are already fully used when {\tt LMUL}=2 (eight data elements in each vector). Vector instructions with more elements (e.g., 16 elements when {\tt LMUL}=4, and 32 elements when {\tt LMUL}=8) consume proportionally more cycles (double or quadruple the cycle counts of {\tt LMUL}=2), as shown in Fig.5, and thus bring no performance advantages.
Comparing the instruction counts of the scalar version and the {\tt LMUL}=8 vector version of the CV algorithms in Fig.9, we can observe that by maximizing the register grouping level (to 8), the instruction count of our chosen CV algorithms is reduced by 24x compared with the scalar implementations. However, on our chosen RISC-V platform (Allwinner D1 with the C906 processor), the average performance speedup (measured in cycle count) from using RV-V is only about 2.98x (1.83 × 1.63) over the scalar instructions when executing the same CV algorithm on the same input image.
7. Conclusions
In this paper, we examined the potential of RV-V for processing typical CV workloads. We showed that for a typical CV algorithm, the instruction count of its OpenCV implementation using vector instructions was about one third of that using scalar instructions. Moreover, by enabling vector register grouping, an additional eight-fold reduction in the instruction count could be achieved. Nevertheless, our evaluations also showed that the architecture of the actual SIMD co-processor must be taken into account when translating this large advantage of the RV-V ISA into real performance gains; the overall 24x (3 × 8) reduction in the instruction count can be regarded as the upper bound of the performance improvement obtainable by replacing scalar instructions with RV-V vector instructions when programming a CV algorithm.
The evaluations in this paper considered only the case of attaching an RV-V co-processor to a single-issue in-order scalar processor (e.g., Xuantie C906). It will be interesting to evaluate the performance of CV algorithms in scenarios where RV-V co-processors are paired with out-of-order cores like Xuantie C910[30], using the performance results in this paper as the baseline.
-
Figure 2. RV-V code for the example in Fig.1. (a) When len=7. (b) When len=1024.
Table 1. Selected Instructions of RV-V[26] (Draft Version 0.8)

Configuration set
  vsetvli rd, rs1, vtypei            Set the vtype and vl CSRs, and write the new value of the vl CSR from the immediate value vtypei and the integer scalar registers rs1 and rd
Memory access
  vl{b, h, w, e}.v vd, (rs1)         Load a vector into vd from the memory address in rs1
  vls{b, h, w, e}.v vd, (rs1), rs2   Load a vector into vd, with the stride value in rs2, from the memory address in rs1
  vs{b, h, w, e}.v vs3, (rs1)        Store the vector in vs3 to the memory address in rs1
  vss{b, h, w, e}.v vs3, (rs1), rs2  Store the vector in vs3, with the stride in rs2, to the memory address in rs1
Arithmetic operation
  vadd.vx vd, vs2, rs1               vd[i] = vs2[i] + x[rs1]
  vmul.vx vd, vs2, rs1               vd[i] = vs2[i] × x[rs1]
  vfadd.vf vfd, vfs2, rfs1           vfd[i] = vfs2[i] + f[rfs1]
  vfmul.vf vfd, vfs2, rfs1           vfd[i] = vfs2[i] × f[rfs1]
  vadd.vv vd, vs2, vs1               vd[i] = vs2[i] + vs1[i]
  vmul.vv vd, vs2, vs1               vd[i] = vs2[i] × vs1[i]
  vfadd.vv vfd, vfs2, vfs1           vfd[i] = vfs2[i] + vfs1[i]
  vfmul.vv vfd, vfs2, vfs1           vfd[i] = vfs2[i] × vfs1[i]
  vmacc.vx vd, rs1, vs2              vd[i] = (vs2[i] × x[rs1]) + vd[i]
  vfmacc.vf vfd, rfs1, vfs2          vfd[i] = (vfs2[i] × f[rfs1]) + vfd[i]
Note: vd and vs are integer vector registers, while rd and rs are integer scalar registers; vfd and vfs are float vector registers, while rfd and rfs are float scalar registers. The "s" in vs, rs, vfs and rfs stands for source, and the "d" in vd, rd, vfd and rfd stands for destination.

Table 2. Different Choices of {\tt LMUL} on Solving the Example in Fig.1 When len=1024

  LMUL    Instruction Count
  1       11 × 1024/4 = 2816
  2       11 × 1024/8 = 1408
  4       11 × 1024/16 = 704
  8       11 × 1024/32 = 352

Algorithm 1. Gray Scale Algorithm

  Input:  int resolution;       // say 800 × 600
          src[resolution × 3];  // input image
  Output: dst[resolution];      // output image
  /* Load RGB data into three arrays */
  1: load R channel values of src to vr;
  2: load G channel values of src to vg;
  3: load B channel values of src to vb;
  /* Calculate result with vector registers */
  4: dst ← vr × wr + vg × wg + vb × wb;
[1] Lu D, Weng Q. A survey of image classification methods and techniques for improving classification performance. International Journal of Remote Sensing, 2007, 28(5): 823–870. DOI: 10.1080/01431160600746456.
[2] Zhang Z, Hu Y T, Lipton A J, Venetianer P L, Yu L, Yin W H. Target detection and tracking from video streams. US Patent 7801330. September 21, 2010.
[3] Zhao W, Chellappa R, Phillips P J, Rosenfeld A. Face recognition: A literature survey. ACM Computing Surveys, 2003, 35(4): 399–458. DOI: 10.1145/954339.954342.
[4] Nauman A, Qadri Y A, Amjad M, Zikria Y B, Afzal M K, Kim S W. Multimedia internet of things: A comprehensive survey. IEEE Access, 2020, 8: 8202–8250. DOI: 10.1109/ACCESS.2020.2964280.
[5] Diefendorff K, Dubey P K. How multimedia workloads will change processor design. Computer, 1997, 30(9): 43–45. DOI: 10.1109/2.612247.
[6] Wolf W, Jerraya A A, Martin G. Multiprocessor system-on-chip (MPSoC) technology. IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems, 2008, 27(10): 1701–1713. DOI: 10.1109/TCAD.2008.923415.
[7] Mijat R. Take GPU processing power beyond graphics with Mali GPU computing. White Paper, ARM, 2012. https://developer.arm.com/-/media/Files/pdf/graphics-and-multimedia/WhitePaper_GPU_Computing_on_Mali.pdf, July 2023.
[8] Shahbahrami A, Juurlink B H H, Vassiliadis S. A comparison between processor architectures for multimedia applications. In Proc. the 15th Annual Workshop on Circuits, Systems and Signal Processing, Apr. 2004, pp.138–152.
[9] Reddy V G. Neon technology introduction. ARM Corporation, 2008, 4(1): 1–33.
[10] Asanović K, Patterson D A. Instruction sets should be free: The case for RISC-V. Technical Report, EECS Department, University of California, Berkeley. https://www2.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-146.html, July 2023.
[11] Patterson D, Waterman A. The RISC-V Reader: An Open Architecture Atlas. Strawberry Canyon, 2017.
[12] Duncan R. A survey of parallel computer architectures. Computer, 1990, 23(2): 5–16. DOI: 10.1109/2.44900.
[13] Barnes G H, Brown R M, Kato M, Kuck D J, Slotnick D L, Stokes R A. The ILLIAC IV computer. IEEE Trans. Computers, 1968, C-17(8): 746–757. DOI: 10.1109/TC.1968.229158.
[14] Watson W J. The TI ASC: A highly modular and flexible super computer architecture. In Proc. the Fall Joint Computer Conference, Dec. 1972, pp.221–228.
[15] Russell R M. The CRAY-1 computer system. Communications of the ACM, 1978, 21(1): 63–72. DOI: 10.1145/359327.359336.
[16] Peleg A, Wilkie S, Weiser U. Intel MMX for multimedia PCs. Communications of the ACM, 1997, 40(1): 24–38. DOI: 10.1145/242857.242865.
[17] Stephens N, Biles S, Boettcher M, Eapen J, Eyole M, Gabrielli G, Horsnell M, Magklis G, Martinez A, Premillieu N, Reid A, Rico A, Walker P. The ARM scalable vector extension. IEEE Micro, 2017, 37(2): 26–39. DOI: 10.1109/MM.2017.35.
[18] Parker J R. Algorithms for Image Processing and Computer Vision (2nd edition). John Wiley & Sons, 2010.
[19] Bradski G, Kaehler A. Learning OpenCV: Computer Vision with the OpenCV Library. O’Reilly Media, Inc., 2008.
[20] Saravanan C. Color image to grayscale image conversion. In Proc. the 2nd International Conference on Computer Engineering and Applications, Mar. 2010, pp.196–199. DOI: 10.1109/ICCEA.2010.192.
[21] Chandel R, Gupta G. Image filtering algorithms and techniques: A review. International Journal of Advanced Research in Computer Science and Software Engineering, 2013, 3(10): 198–202.
[22] Maini R, Aggarwal H. Study and comparison of various image edge detection techniques. International Journal of Image Processing, 2009, 3(1): 1–11. DOI: 10.1049/iet-ipr:20080080.
[23] Cavalcante M, Schuiki F, Zaruba F, Schaffner M, Benini L. Ara: A 1-GHz+ scalable and energy-efficient RISC-V vector processor with multiprecision floating-point support in 22-nm FD-SOI. IEEE Trans. Very Large Scale Integration (VLSI) Systems, 2020, 28(2): 530–543. DOI: 10.1109/TVLSI.2019.2950087.
[24] Tagliavini G, Mach S, Rossi D, Marongiu A, Benini L. Design and evaluation of SmallFloat SIMD extensions to the RISC-V ISA. In Proc. the 2019 Design, Automation & Test in Europe Conference & Exhibition, Mar. 2019, pp.654–657. DOI: 10.23919/DATE.2019.8714897.
[25] Louis M S, Azad Z, Delshadtehrani L, Gupta S, Warden P, Reddi V J, Joshi A. Towards deep learning using tensorFlow lite on RISC-V. In Proc. the 3rd Workshop on Computer Architecture Research with RISC-V, Jun. 2019. DOI: 10.13140/RG.2.2.30400.89606.
[26] Waterman A, Asanović K. The RISC-V instruction set manual volume II: Privileged architecture, version 20190608-Priv-MSU-Ratified. RISC-V Foundation, 2019.
[27] Lomont C. Introduction to Intel® advanced vector extensions. White Paper, Intel®, 2011. https://hpc.llnl.gov/sites/default/files/intelAVXintro.pdf, July 2023.
[28] Lee Y. Decoupled vector-fetch architecture with a scalarizing compiler [Ph.D. Thesis]. University of California, Berkeley, 2016.
[29] Patsidis K, Nicopoulos C, Sirakoulis G C, Dimitrakopoulos G. RISC-V2: A scalable RISC-V vector processor. In Proc. the 2020 IEEE International Symposium on Circuits and Systems, Sept. 2020. DOI: 10.1109/ISCAS45731.2020.9181071.
[30] Chen C, Xiang X Y, Liu C, Shang Y H, Guo R, Liu D Q, Lu Y M, Hao Z Y, Luo J H, Chen Z J, Li C Q, Pu Y, Meng J Y, Yan X L, Xie Y, Qi X N. Xuantie-910: A commercial multi-core 12-stage pipeline out-of-order 64-bit high performance RISC-V processor with vector extension: Industrial product. In Proc. the 47th ACM/IEEE Annual International Symposium on Computer Architecture, Jun. 2020, pp.52–64. DOI: 10.1109/ISCA45697.2020.00016.
[31] Binkert N, Beckmann B, Black G, Reinhardt S K, Saidi A, Basu A, Hestness J, Hower D R, Krishna T, Sardashti S, Sen R, Sewell K, Shoaib M, Vaish N, Hill M D, Wood D A. The gem5 simulator. ACM SIGARCH Computer Architecture News, 2011, 39(2): 1–7. DOI: 10.1145/2024716.2024718.