Area Efficient Pattern Representation of Binary Neural Networks on RRAM
Context: In recent years, several studies have used memristors (RRAM) to perform parallel multiply-and-accumulate operations and thereby accelerate the fully-connected and convolutional layers of convolutional neural networks. Because convolutional neural networks require a large number of digital-to-analog converters, other work has begun to accelerate binary neural networks on RRAM instead; their weights are restricted to -1 and +1, which greatly reduces the conversion demand. However, the two mainstream representations of binary neural network weights both introduce many redundant 0s and 1s when representing negative weights.
Objective: In this work, we aim to reduce these redundant 0s and 1s and thus save crossbar area. To this end, we propose a new pattern-based weight representation and design a corresponding hardware architecture.
Method: First, we split the weight matrix into several small matrices using a nearest-neighbor algorithm. Then we extract patterns of 1s from each small matrix, so that every weight column can be composed from these patterns. Next, we map the patterns onto RRAM crossbars: pattern computation crossbars compute the values of the patterns, and pattern accumulation crossbars accumulate them to obtain the final outputs. Finally, we compare our pattern representation with the conventional representations and choose whichever is more area-efficient.
Results and Findings: We evaluated the convolutional and fully-connected layers of networks trained on MNIST and CIFAR-10. Compared with the two mainstream weight representations, our pattern representation is effective in more than 70% of the test cases and saves about 20% of the crossbar area on average.
Conclusions: Unlike conventional methods that map weights directly, our pattern representation first extracts patterns and then reconstructs the original outputs from them. The experimental results show that this approach is more effective for weight matrices that are much larger than the crossbar size. Since the peripheral circuits occupy most of the area, we will further explore how to reduce that part in future work.

Abstract: Resistive random access memory (RRAM) has been demonstrated to implement multiply-and-accumulate (MAC) operations in a highly parallel analog fashion, which dramatically accelerates the convolutional neural networks (CNNs). Since CNNs require considerable converters between analog crossbars and digital peripheral circuits, recent studies map the binary neural networks (BNNs) onto RRAM and binarize the weights to {+1, -1}. However, two mainstream representations for BNN weights introduce patterns of redundant 0s and 1s when dealing with negative weights. In this work, we reduce the area of redundant 0s and 1s by proposing a BNN weight representation framework based on the novel pattern representation and a corresponding architecture. First, we split the weight matrix into several small matrices by clustering adjacent columns together. Second, we extract 1s' patterns, i.e., the submatrices only containing 1s, from the small weight matrix, such that each final output can be represented by the sum of several patterns. Third, we map these patterns onto RRAM crossbars, including pattern computation crossbars (PCCs) and pattern accumulation crossbars (PACs). Finally, we compare the pattern representation with two mainstream representations and adopt the more area efficient one. The evaluation results demonstrate that our framework can effectively save over 20% of crossbar area compared with the two mainstream representations.
1. Introduction
Synthesizing novel views from given images has been a hot research topic in the fields of computer vision and computer graphics. This technology is also fundamental for achieving realistic augmented reality/virtual reality (AR/VR) experiences. Recently, Neural Radiance Fields (NeRF)[1] techniques have gained significant attention due to their impressive rendering quality. NeRF and its subsequent work can achieve photo-realistic rendering of novel views, but they require a large number of images of a single scene as input and involve lengthy optimization processes to obtain accurate radiance fields, which limits their practical applicability.
Recent advancements have addressed these limitations. The methods in [2-4] extract 2D features as additional inputs to the radiance field, reducing the requirement for dense input views. DS-NeRF (Depth-Supervised Neural Radiance Fields)[5] introduces sparse depth information as additional supervision, improving rendering quality and speeding up training with fewer training views. DietNeRF[6] introduces a semantic consistency loss as an auxiliary task, enabling training with fewer input views of a single scene. MVSNeRF[7] combines multi-view stereo (MVS) geometry with neural radiance fields, enhancing the generalization of the radiance field without the need for per-scene training; however, MVSNeRF cannot handle scene details and occlusions well. MVS typically uses convolutional neural networks to extract the information shared across multiple views and their correlations in order to estimate scene depth. Benefiting from the inductive bias of convolutional neural networks, MVS can be trained and inferred across scenes and can accurately understand the 3D structure of a scene. Feeding this understanding of the 3D structure into NeRF as a prior overcomes the drawback that NeRF must be trained scene by scene, and allows a fully trained pipeline to synthesize new views in a single forward pass. MVSNeRF[7] has demonstrated the effectiveness of this idea. GeoNeRF[8] improves upon MVSNeRF but relies on supervised training with processed ground-truth depth from the DTU dataset[9] to strengthen its geometric reasoning module. We use GeoNeRF as the baseline for comparison and make several improvements to it.
Specifically, we still adopt the idea of combining multi-view geometry with neural radiance fields, so that the radiance field can be trained and inferred across scenes. The difference is that we improve the cost volume construction module of traditional MVS and expand the perceptual interaction between multi-level cost volumes by fusing them, providing more valuable spatial feature information to the neural radiance field. In addition, we propose a depth-based self-supervised loss that uses the depth inferred by MVS to warp the source views, reducing the dependence of the generalizable model on ground-truth depth. Instead of the coarse-to-fine sampling strategy of the original NeRF, we use a Gaussian-uniform mixture sampling that directly exploits the MVS-inferred depth to place as many sample points as possible near object surfaces, simplifying the rendering process of the neural radiance field without requiring additional ground-truth depth.
Our main contributions are as follows.
● Multi-Level Cost Volume Fusion Module. This fusion module enhances the interaction between cost volume contexts and achieves high-quality geometry perception.
● Feature Information Decoding Module. Decoding features instead of mapping location and orientation, this module enhances the understanding of the scene and the generalization ability of the neural network.
● Structure of Scene Geometry Reasoning and Feature Decoding. It enables our model to learn to understand the scene from the source views, and enables the model to train and reason across scenes.
2. Related Work
2.1 Multi-View Stereo
Multi-view stereo is a classic problem in computer vision, aiming to recover a dense geometric representation of a scene given multiple views with overlapping regions. Traditional methods[10-13] have explored the multi-view stereo problem extensively. Recent approaches[12-14] have introduced deep learning techniques to address the MVS problem. MVSNet[14] builds a cost volume on the scanning planes of the source views and applies 3D convolutional neural networks for post-processing to obtain the depth information of the scene. This approach significantly improves the quality of 3D reconstruction compared with traditional methods, but its major limitation is the large amount of memory it requires. R-MVSNet[15] improves upon MVSNet by changing the cost volume regularization from simultaneous regularization over all depth hypotheses to sequential regularization over individual depths, leveraging the output of the previous depth to reduce memory consumption and enhance model scalability. Some methods[16-18] introduce a cascaded architecture that progressively refines the constructed cost volume, reducing memory consumption and obtaining depths at different scales without sacrificing accuracy. We also utilize such a cascaded architecture, where the initial depth interval of the cost volume is related to the predicted depths from the previous level, enhancing the interaction between different levels of cost volumes.
2.2 Self-Attention
Self-attention is a specific implementation of the attention mechanism, introduced by [19]. Fundamentally, it addresses the fact that, in sequence problems, the parts of the input that matter vary with the output position being predicted. Initially, the attention mechanism was employed to tackle polysemy in machine translation: within a sentence, the words at different positions are not entirely independent but carry contextual information, and incorporating an attention layer associates the information of elements at different positions, thereby facilitating information interaction.
In the field of computer vision, the self-attention mechanism is frequently applied in image segmentation to enhance image understanding and processing capabilities[20-22]. It pays simultaneous attention to both local and global information when processing images. For a set of images, just like words in a sentence, there is an abundance of contextual information between them. This contextual information provides a priori conditions for the accurate synthesis of new perspectives. However, the weights of the information provided by different perspectives for the same position are not entirely the same. The self-attention mechanism offers a theoretical basis for calculating these weights. We utilize the self-attention mechanism to calculate the weight information that neighboring perspectives contribute to a new perspective, achieving high-quality view synthesis. The effectiveness of this mechanism has been confirmed by work such as [7, 8].
2.3 New View Synthesis
In previous work, various methods have been explored for view synthesis, including light field-based approaches[23-25], image-based rendering techniques[26-28], and deep learning based methods[29-32]. Image-based methods typically learn a blending weight based on ray-space proximity or approximate geometry to perform weighted blending of pixel colors from the source views to generate the colors of the target view. Their synthesis quality relies on the image quality of the source views and is limited by occlusions. Methods that synthesize radiance fields on meshes[33] or point clouds[34, 35] have the advantage of synthesizing new views using a small set of reference views, but they are often limited by the quality of 3D reconstruction. In the case of non-Lambertian surfaces, the colors of the same point can vary across different views, and this multi-view inconsistency often leads to failure in 3D reconstruction on these surfaces.
Our approach combines traditional MVS techniques with neural rendering techniques by taking spatial features corresponding to sampled points as prior input and decoding colors and densities from scene features corresponding to arbitrary 3D positions. We simulate continuous radiance fields using ray projection techniques and obtain the final pixel colors using volume rendering techniques, enabling realistic view synthesis.
2.4 Neural Scene Representations
Recently, Mildenhall et al. proposed using neural networks to encode a scene as a 5D neural radiance field (NeRF)[1]. NeRF optimizes this radiance field to render realistic novel views of a fixed scene. Subsequent work[36-38] has improved upon NeRF but still requires hours or days of optimization per scene. GRF (General Radiance Field)[3] directly takes 2D feature representations of sampled points and ray directions as input, replacing the 3D coordinates in the 5D neural radiance field. PixelNeRF[2] introduces convolutional layers to process the input images and modifies the NeRF structure: it incorporates image features as additional inputs, similar to residual connections, allowing the network to be trained across scenes and to synthesize new views from a sparse set of images (one or a few). IBRNet (Image-Based Rendering Network)[39] proposes a generic interpolation function that aggregates density features of sampled points on the same ray using transformer modules; it requires the source view colors and directions as input, and its synthesis quality is limited by the quality of the source views. GNT (Generalizable NeRF Transformer)[40] relies heavily on attention mechanisms to fuse multi-view features and directly predicts the pixel colors of the reference view without volume rendering, achieving ray-based, learnable, scene-adaptive rendering that eliminates the need for per-scene optimization. We believe that the generalization capability of the radiance field mainly stems from the model's inference of the scene. Specifically, rendering new views without per-scene optimization depends on prior input obtained from the source views, including 3D spatial features and global 2D features. To enhance the model's inference capability, we supervise the geometric reasoning module using the inferred depth and the final rendered depth as pseudo ground-truth values, aiming to construct a more accurate geometric neural field.
3. Method
We train SG-NeRF (Sparse-Input Generalized Neural Radiance Fields) across scenes and divide the rendering of scenes into two phases. The first phase builds the geometric reasoning module, and the second phase performs scene rendering. Specifically, we first process the 2D features along the channel and spatial dimensions, then use these processed 2D features to construct the cost volumes, and finally fuse the cost volumes as the 3D prior information of the reference view. We describe the geometric reasoning module in detail in Subsection 3.1. In the second, rendering phase, we use the NeRF network to build a decoding module that uses the 3D features from the first phase as an additional prior guide to predict the density and color information of the spatial sampling points. At the same time, we use the coarse depth information predicted in the first phase for fine sampling, avoiding the additional time consumption caused by NeRF's hierarchical sampling. We describe this sampling method in Subsection 3.2.1 and our decoding module in Subsection 3.2.2. The overall flow chart is shown in Fig.1.
The entire inference pipeline can be summarized as follows. For the target reference view rendering, we first select N neighboring views based on camera parameters and input them into a geometric reasoning model as source views. A UNet with convolutional attention modules is employed to extract multi-level 2D features from these source views, which are used for constructing and fusing multi-level cost volumes. Next, at level l, the cost volumes of the source views are regularized to obtain predicted depth maps and 3D features F_i^l for each source view. These predicted depth maps are used to guide the construction of the next-level cost volumes and ray sampling. Finally, the multi-level 3D features corresponding to the spatial sampling points, along with the full-resolution 2D features, are fed into the decoding module. Through a multi-head attention mechanism, the feature information from different source views is aggregated and separately passed to the color decoding network and density decoding network for decoding.
3.1 Geometric Reasoning
3.1.1 Building Cost Volumes
Given N adjacent source images \{{\boldsymbol I_i}\}_{i = 1}^{{N}} \in {\mathbb{R}}^{3\times {H} \times {W}}, we first extract multi-level feature information from the images to construct a cost volume for the source view[8]. In MVS[14], this multi-scale structure is more helpful for inferring scene depth information, and previous work[3, 38] has demonstrated the effectiveness of multi-scale structures. For the extracted 2D scene features, we use a convolutional block attention module to enhance the scene-related features and suppress irrelevant features, reducing the loss of detail in small-scale feature information. Specifically, we improve the model's representational capacity by adaptively adjusting the weights of different channel and spatial position feature maps.
First, we process the channel dimension of the feature maps using a channel attention module, selectively enhancing the representation ability of each channel. The specific implementation can be divided into the following steps. We begin by performing global max pooling and global average pooling over the feature map channels to obtain two channel pooling weights {W}_\mathrm{C_1} and {W}_\mathrm{C_2}, respectively. Then, we feed these two weights into a shared neural network to obtain the weight coefficients {W}_\mathrm{C} for the channel dimension, which are multiplied (\otimes) with the original feature map to obtain the channel-weighted feature map f_{\mathrm{c}}^{l}:
\begin{split} &{W}_\mathrm{C} = \mathrm{MLP}({W}_\mathrm{C_1}, {W}_\mathrm{C_2}),\\& f_\mathrm{c}^l = {W}_\mathrm{C} \otimes f_i^l . \end{split}

After obtaining the channel-weighted feature f_{\mathrm{c}}^l, we obtain two spatial pooling weights {W}_\mathrm{S_1} and {W}_\mathrm{S_2} by performing global max pooling and global average pooling along the spatial dimension. We concatenate the two maps and pass them through a convolutional layer to obtain the weight coefficients {W}_\mathrm{S} for the spatial dimension, which are used to process the channel-weighted feature map and yield the multi-scale features \hat{f}_{i}^l based on convolutional attention:
\begin{split}& {W}_\mathrm{S} = \mathrm{CNN}({W}_\mathrm{S_1}, {W}_\mathrm{S_2}),\\& \hat{f}_{{i}}^l = {W}_\mathrm{S} \otimes f_\mathrm{c}^l . \end{split}

This attention-based multi-scale feature extraction helps to aggregate more valuable feature information into the cost volumes, thereby enabling the geometric neural field to provide more valuable local spatial features.
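For concreteness, the following is a minimal PyTorch-style sketch of such a convolutional block attention module, mirroring the two equations above; the module name, layer sizes, and reduction ratio are illustrative assumptions rather than the exact implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvBlockAttention(nn.Module):
    # Channel attention (W_C) followed by spatial attention (W_S), as in the equations above.
    def __init__(self, channels, reduction=8):
        super().__init__()
        # shared MLP applied to the two pooled channel descriptors W_C1 and W_C2
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels))
        # convolution over the concatenated spatial descriptors W_S1 and W_S2
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, f):                                # f: (B, C, H, W) feature map f_i^l
        b, c, _, _ = f.shape
        w_c1 = F.adaptive_max_pool2d(f, 1).view(b, c)    # global max pooling over space
        w_c2 = F.adaptive_avg_pool2d(f, 1).view(b, c)    # global average pooling over space
        w_c = torch.sigmoid(self.mlp(w_c1) + self.mlp(w_c2)).view(b, c, 1, 1)
        f_c = w_c * f                                    # channel-weighted feature f_c^l
        w_s1, _ = f_c.max(dim=1, keepdim=True)           # spatial max pooling over channels
        w_s2 = f_c.mean(dim=1, keepdim=True)             # spatial average pooling over channels
        w_s = torch.sigmoid(self.spatial_conv(torch.cat([w_s1, w_s2], dim=1)))
        return w_s * f_c                                 # attention-weighted feature \hat{f}_i^l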
We also adopt the cascaded cost volume construction proposed by CasMVSNet[17]. Using the camera parameters [K, R, t], we find the {K} views nearest to view {I}_i among the {N} source views and perform homography warping to obtain the multi-level cost volume {V}^{l}_{i} based on group correlation for the source view {I}_i:
{\cal{\boldsymbol H}}_{k}(z) = K_{k} \times\left(R_{k} \times R_{i}^{\rm T}+\frac{\left(t_{i}-t_{k}\right) \times n_{i}^{\rm T}}{z}\right) \times K_{i}^{-1},

{V}_{i}^{l} (u, v, z) = {{G}}\left(\left\{\hat{f}_k^l \left({\cal{\boldsymbol H}}_k(z) \times [u,v,1]^{\rm T}\right)\right\}_{k = 1}^{{K}}\right),

where {\cal{\boldsymbol H}}_{k}(z) represents the homography matrix that warps the k-th source view to the reference view \boldsymbol{I}_i, (u, v, z) represents the spatial coordinates of a point in 3D space, and G(\cdot) computes the group correlation among the {K} warped feature maps.
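As an illustration of how {\cal H}_k(z) is assembled from the camera parameters, a small sketch follows (single pixel, no batching; the group-correlation step and feature interpolation are omitted, and all names are ours):

import torch

def plane_homography(K_i, R_i, t_i, K_k, R_k, t_k, n_i, z):
    # Homography H_k(z) mapping a reference-view pixel onto source view k for the
    # fronto-parallel plane at depth z, following the first equation above.
    # K_*: 3x3 intrinsics; R_*: 3x3 rotations; t_*, n_i: 3x1 translations / plane normal.
    return K_k @ (R_k @ R_i.T + (t_i - t_k) @ n_i.T / z) @ torch.inverse(K_i)

def warp_pixel(H, u, v):
    # Map the homogeneous reference pixel [u, v, 1]^T into source view k and de-homogenize.
    p = H @ torch.tensor([u, v, 1.0]).reshape(3, 1)
    return (p[0] / p[2]).item(), (p[1] / p[2]).item()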
3.1.2 Geometric Neural Field
This group-based cost volume encodes the appearance of the scene from different input viewpoints, capturing the variations in appearance caused by geometry and viewpoint changes. During the construction of the scene cost volume, the connections between different-level cost volumes are established through the relationships between the feature maps extracted by the UNet at different levels. To strengthen the interaction between cost volumes at different levels, we propose a cost volume fusion module that integrates the current cost volume with the cost volume from the previous level. The small-scale cost volume is built from scene features with a large receptive field, so it often contains more abstract spatial information about the scene but easily misses scene details. For this reason, we adopt a UNet-like strategy to fuse the large-scale cost volume with the small-scale one, enhancing the ability of the small-scale cost volume to perceive scene details. Specifically, we first perform trilinear interpolation on the cost volume from the previous level to match the width, height, and depth dimensions of the current level's cost volume. Then, we use a convolutional layer to adjust the channel dimension to match the current level's cost volume, ensuring consistency in size. Finally, the cost volumes from the previous level and the current level are concatenated and fused. This cost volume fusion module is illustrated in Fig.2 and sketched in code below.
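The interpolate-align-concatenate steps can be summarized by the following minimal PyTorch-style sketch; the channel and kernel sizes are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CostVolumeFusion(nn.Module):
    # Fuse the previous-level cost volume with the current one before regularization.
    def __init__(self, prev_channels, cur_channels):
        super().__init__()
        self.align = nn.Conv3d(prev_channels, cur_channels, kernel_size=1)   # match channel dim
        self.merge = nn.Conv3d(2 * cur_channels, cur_channels, kernel_size=3, padding=1)

    def forward(self, cv_prev, cv_cur):                  # cost volumes of shape (B, C, D, H, W)
        # trilinear interpolation to the current depth/height/width resolution
        cv_prev = F.interpolate(cv_prev, size=cv_cur.shape[2:], mode='trilinear',
                                align_corners=False)
        cv_prev = self.align(cv_prev)                    # adjust the channel dimension
        return self.merge(torch.cat([cv_prev, cv_cur], dim=1))   # concatenate and fuse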
Figure 2. Before regularization, the cost volume C_l is first concatenated and fused with the upsampled cost volume C'_l from the previous level. Afterward, regularization is performed to obtain the predicted depth map and scene 3D features. Additionally, the cost volume C_l is upsampled to generate C'_{l+1}, which serves as the input for regularization in the next level.

In the cost volume regularization phase, the traditional MVS method directly predicts the depth information of the scene and only interprets the scene geometry. Our aim is to perceive both scene geometry and appearance across scenes; therefore, unlike traditional MVS, we generate a meaningful geometric neural field F_i^l while inferring the scene depth. The geometric neural field F_i^l is fed into the subsequent decoding network as the geometric understanding of the scene, and the inferred scene depth serves as a sampling prior for the subsequent ray-casting step of the neural radiance field. Our geometric reasoning module is not supervised with ground-truth depth. To constrain the depth, we use the predicted depth and the camera parameters to warp the {K} neighboring views of the source view \boldsymbol{I}_i and calculate the photometric consistency loss between the source view and the warped neighboring views.
3.2 Feature Decoding
3.2.1 Gaussian-Uniform Mixture Sampling
After constructing the geometric neural field, we use ray casting techniques to render new views. We simulate N_r rays based on the camera parameters of the reference view {I}_0 and sample discrete points along the rays for rendering the final ray colors. To enhance the correlation between the sampling point positions and the spatial depth, we warp the predicted depth maps of the source views into the reference view. During this inverse warp, smaller warped depth values overwrite larger ones, resulting in a fused depth map for the reference view. This fused depth map serves as the coarse predicted depth \hat{D}, providing prior guidance for the fine sampling of the point positions; a sketch of this fusion is given below.
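A rough sketch of this depth fusion, assuming pinhole intrinsics and world-to-camera extrinsics and using a simple z-buffer scatter (all names and conventions are ours, not the exact implementation):

import torch

def fuse_source_depths(depths, Ks, Es, ref_K, ref_E, H, W):
    # Warp each predicted source-view depth map into the reference view and keep, per pixel,
    # the smallest (nearest) warped depth, yielding the coarse depth map D_hat.
    # depths: list of (H, W) tensors; Ks/Es: per-view 3x3 intrinsics and 4x4 world-to-camera extrinsics.
    fused = torch.full((H, W), float('inf'))
    v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing='ij')
    pix = torch.stack([u, v, torch.ones_like(u)], dim=0).reshape(3, -1)      # homogeneous pixels
    for d, K, E in zip(depths, Ks, Es):
        cam = torch.inverse(K) @ pix * d.reshape(1, -1)                      # unproject to camera space
        world = torch.inverse(E) @ torch.cat([cam, torch.ones(1, cam.shape[1])], dim=0)
        ref_cam = (ref_E @ world)[:3]                                        # into the reference camera
        z = ref_cam[2]
        uv = (ref_K @ ref_cam)[:2] / z.clamp(min=1e-6)
        ui, vi = uv[0].round().long(), uv[1].round().long()
        valid = (z > 1e-6) & (ui >= 0) & (ui < W) & (vi >= 0) & (vi < H)
        idx = vi[valid] * W + ui[valid]
        # z-buffer: keep the nearest depth when several source pixels land on the same location
        fused.view(-1).scatter_reduce_(0, idx, z[valid], reduce='amin')
    fused[fused == float('inf')] = 0.0                                       # holes with no warped depth
    return fused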
First, we uniformly sample S_c points along each camera ray to cover the entire depth range.
t_{k} \sim {\cal{U}}\left[t_{n}+\frac{k-1}{S_c}\left(t_{f}-t_{n}\right),\ t_{n}+\frac{k}{S_c}\left(t_{f}-t_{n}\right)\right], where t_f and t_n represent the far and near boundaries of the scene, respectively, and t_k is the k-th sample position drawn uniformly from its bin along the ray.
Subsequently, guided by the coarse predicted depth, we sample candidate points following a Gaussian distribution. These candidate points are sampled in a way that takes the estimated depth information into account, allowing us to capture variations in scene geometry more effectively. Denoting the pixel coordinate as P = (u,v) and the predicted ray depth as z_p = \hat{D}(u,v), we draw S_f fine sample points using the following equation:
\begin{split} &t_k \sim {\cal{N}}(z_p, s^2_p),\\& s_p = \dfrac{\mathrm{min}(|z_p-t_f|, |z_p-t_n|)}{3}, \end{split} where z_p and s_p are the mean and standard deviation of the proposed normal distribution, respectively. By using these values, we can optimize geometric features more effectively by sampling more candidate points near the object surface. This differentiable sampling method also contributes to better convergence of the geometric neural field.
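Combining the two equations above, a minimal sketch of the Gaussian-uniform mixture sampling for a single ray might look as follows (S_c = 96 and S_f = 32 follow the implementation details in Section 4; clamping the Gaussian samples to the scene bounds is our own simplification):

import torch

def mixture_sample(t_n, t_f, depth_hat, S_c=96, S_f=32):
    # Stratified-uniform samples over [t_n, t_f] plus Gaussian samples around the coarse depth.
    k = torch.arange(S_c, dtype=torch.float32)
    lower = t_n + (k / S_c) * (t_f - t_n)                # bin lower bounds, as in the first equation
    upper = t_n + ((k + 1) / S_c) * (t_f - t_n)
    t_uniform = lower + torch.rand(S_c) * (upper - lower)
    z_p = float(depth_hat)                               # predicted ray depth \hat{D}(u, v)
    s_p = min(abs(z_p - t_f), abs(z_p - t_n)) / 3.0      # standard deviation from the equation above
    t_gauss = torch.normal(mean=z_p, std=s_p, size=(S_f,)).clamp(t_n, t_f)
    # merge and sort so volume rendering sees monotonically increasing depths
    return torch.sort(torch.cat([t_uniform, t_gauss])).values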
3.2.2 Aggregation Feature and Decoding
Using the camera parameters, each point \mathrm{X} on the ray is projected onto each source view and bilinear interpolation is performed to obtain corresponding multi-level 3D features. These multi-level features are then merged to form the final geometric field feature {F}_i . As for the scene's 2D features, only the full-size 2D feature with l = 0 is retained and subjected to bilinear interpolation. This full-size 2D feature encompasses a global understanding of the scene and also provides a mask to determine if the sampling point is projected outside the source view. Finally, these aggregated feature values \hat{F}_i serve as the input to the decoding module.
\hat{\boldsymbol F}_i = [\{F_i^l\}_{l = 0}^{2},\ f_i^{0}].

The vector \hat{\boldsymbol F}_i now contains all the necessary information about the scene, and it can be used to learn the scene appearance and predict the density of spatial sampling points. NeRF employs a fully connected neural network (MLP) to map the coordinate vector and direction vector of a spatial point to its corresponding color and density, which results in overfitting to the specific scene. This overfitting restricts NeRF's ability to train and render across different scenes. In contrast, SG-NeRF constructs the geometric neural field of the scene during the scene feature inference stage, using the inductive bias of convolutional neural networks to enable cross-scene training. After obtaining the scene feature vector \hat{\boldsymbol F}_i, we first compute the mean and variance of the full-sized 2D features as the view-independent token. We combine the 3D and 2D features of the source views as view-dependent tokens. Since different source views do not contribute equally to the new view, we use the multi-head attention mechanism proposed in Transformer to aggregate the tokens from different views and obtain attention-weighted features across views. The effectiveness of this mechanism has been demonstrated by [8]. Once the aggregated scene feature vector is obtained, two separate feature decoding networks (MLPs) are used to decode the color and density of the spatial sampling points. The color decoding network takes the view-dependent vector as input, while the density decoding network takes the global view-independent token as input.
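One plausible wiring of this aggregation-and-decoding step is sketched below in PyTorch style; the token layout, head count, and MLP sizes are our assumptions, not the exact architecture.

import torch
import torch.nn as nn

class AttentionDecoder(nn.Module):
    # Aggregate per-view tokens with multi-head attention, then decode density from the
    # view-independent token and color from the view-dependent tokens (plus \Delta d).
    def __init__(self, token_dim, heads=4, hidden=64):
        super().__init__()
        self.attn = nn.MultiheadAttention(token_dim, heads, batch_first=True)
        self.sigma_mlp = nn.Sequential(nn.Linear(token_dim, hidden), nn.ReLU(),
                                       nn.Linear(hidden, 1), nn.Softplus())
        self.color_mlp = nn.Sequential(nn.Linear(token_dim + 3, hidden), nn.ReLU(),
                                       nn.Linear(hidden, 3), nn.Sigmoid())

    def forward(self, view_tokens, global_token, delta_d):
        # view_tokens:  (P, N, D) view-dependent tokens of P sample points from N source views
        # global_token: (P, 1, D) mean/variance token of the full-resolution 2D features
        # delta_d:      (P, 3) relative viewing direction used as a residual input
        tokens = torch.cat([global_token, view_tokens], dim=1)
        agg, _ = self.attn(tokens, tokens, tokens)            # attention-weighted aggregation
        sigma = self.sigma_mlp(agg[:, 0])                     # density from the view-independent token
        color = self.color_mlp(torch.cat([agg[:, 1:].mean(dim=1), delta_d], dim=-1))
        return color, sigma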
{{\boldsymbol c}}_{{{n}}}, {{\boldsymbol \sigma}}_{{{n}}} = \mathrm{MLP}\left(\mathrm{MHA}\left(\left\{\left(\mathrm{mean}(f_i^{0}), \mathrm{var}(f_i^{0})\right)\right\}_{i = 1}^{N}, \hat{\boldsymbol F}_i\right), \Delta d\right).

Note that, to enhance the mapping ability of the decoding network, we include the relative direction vector \Delta d of the sampling point as a residual term; the direction vector is encoded in the same way as in NeRF[1]. After decoding the density and color of each spatial sampling point, we use traditional volume rendering techniques[41] to render the color and depth values of the ray.
{\hat{\boldsymbol{c}}} = \sum\limits_{n = 1}^{S} \exp \left(-\sum\limits_{k = 1}^{n-1} {\boldsymbol{\sigma}}_{k}\right)\left(1-\exp \left(-{\boldsymbol{\sigma}}_{{n}}\right)\right) {\boldsymbol{c}}_{{n}}. \quad (1)

Equation (1) for rendering the color is slightly modified to obtain the depth value of the ray:
{\hat{\boldsymbol{d}}} = \sum\limits_{n = 1}^{S} \exp \left(-\sum\limits_{k = 1}^{n-1} {\boldsymbol{\sigma}}_{k}\right)\left(1-\exp \left(-{\boldsymbol{\sigma}}_{{n}}\right)\right) z_{n}.
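A compact sketch of this volume rendering step, implementing the two sums above for a single ray (the sample spacing is folded into \sigma for brevity, as in the equations):

import torch

def volume_render(sigmas, colors, z_vals):
    # sigmas: (S,), colors: (S, 3), z_vals: (S,) sorted sample depths along one ray
    alpha = 1.0 - torch.exp(-sigmas)                                   # per-sample opacity
    trans = torch.exp(-torch.cumsum(torch.cat([torch.zeros(1), sigmas[:-1]]), dim=0))
    weights = trans * alpha                                            # contribution of each sample
    color = (weights.unsqueeze(-1) * colors).sum(dim=0)                # rendered ray color \hat{c}
    depth = (weights * z_vals).sum(dim=0)                              # rendered ray depth \hat{d}
    return color, depth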
3.3 Loss Function

For the color loss, we follow the same approach as the original NeRF[1] and calculate the mean squared error between the rendered color and the true pixel color:
{\cal{L}}_{c} = \frac{1}{|R|} \sum\limits_{r \in R}\left\|{\hat{\boldsymbol{c}}}(r)-c_{g t}(r)\right\|^{2}.

Using only the final rendered color loss for supervision is insufficient to constrain the entire geometric reasoning pipeline. Therefore, we propose a self-supervised photometric consistency loss and a depth consistency loss to supervise the geometric reasoning module. Specifically, during the geometric neural field inference stage, in addition to the scene's 3D features, we also obtain the inferred depth map of the scene. We use this depth map to inverse-warp the neighboring views of the source view. This process generates the warped source view and a binary mask {M} that masks out invalid pixels falling outside the source view. The photometric consistency loss is then calculated by comparing the differences between the warped view and the real source view:
\begin{split} {\cal{L}}_{\rm PC} =& \displaystyle\sum\limits_{j = 1}^{K} \dfrac{1}{\left\|M_{j}\right\|_{1}} \left(\left\|(\hat{\boldsymbol I}_{i}^{j}-\boldsymbol I_{i}) \odot M_{j}\right\|_{2}+\right. \\&\left.\left\|(\nabla \hat{\boldsymbol I}_{i}^{j}-\nabla \boldsymbol I_{i}) \odot M_{j}\right\|_{2} \right) . \end{split} \quad (2)

In (2), the symbol ``\nabla'' represents the pixel-wise gradient, and ``\odot'' represents pixel-wise multiplication. This loss measures the consistency of pixel intensities between the warped and real views. Additionally, we utilize the fused depth from the sampling stage of the reference view, which determines the efficiency of our fine sampling. To enhance the depth consistency between the warped source views and the reference view, we use the rendered depth as the pseudo ground-truth depth of the reference view and warp this depth to each source view. During optimization, we minimize the difference between the depth predicted in the source view and this pseudo depth:
{\cal{L}}_{\rm DC} = \mathrm{smooth}_{L_1} (\hat{D}(r)-z(r)).

Here, \hat{D}(r) represents the depth value obtained from volume rendering, and z(r) represents the depth produced by the convolutional regularization of the cost volume. The final loss can be written as follows:
L = {\cal{L}}_{\rm c} + \lambda_{\rm pc} \sum^{N}\limits_{i = 1} {\cal{L}}_{\rm PC} / N + \lambda_{\rm dc} \times {\cal{L}}_{\rm DC}, where \lambda_{\rm pc} and \lambda_{\rm dc} are weighting factors that balance the influence of each loss term in the overall optimization process.
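Putting the three terms together, a simplified sketch of the total loss might look as follows; the loss weights, the image-gradient operator, and the exact masking are illustrative assumptions.

import torch
import torch.nn.functional as F

def image_gradient(img):
    # Simple forward-difference gradient, a stand-in for the \nabla operator in (2).
    dx = img[..., :, 1:] - img[..., :, :-1]
    dy = img[..., 1:, :] - img[..., :-1, :]
    return F.pad(dx, (0, 1)) + F.pad(dy, (0, 0, 0, 1))

def total_loss(pred_color, gt_color, warped_views, src_view, masks,
               rendered_depth, mvs_depth, lambda_pc=0.1, lambda_dc=0.1):
    # L_c: mean squared error between rendered and ground-truth pixel colors
    l_c = F.mse_loss(pred_color, gt_color)
    # L_PC: masked photometric + gradient difference between warped and real source views
    l_pc = 0.0
    for warped, mask in zip(warped_views, masks):          # masks are {0, 1} float tensors
        diff = torch.norm((warped - src_view) * mask, p=2)
        grad_diff = torch.norm((image_gradient(warped) - image_gradient(src_view)) * mask, p=2)
        l_pc = l_pc + (diff + grad_diff) / mask.sum().clamp(min=1.0)
    # L_DC: smooth-L1 between the rendered (pseudo ground-truth) depth and the MVS-predicted depth
    l_dc = F.smooth_l1_loss(rendered_depth, mvs_depth)
    return l_c + lambda_pc * l_pc + lambda_dc * l_dc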
4. Experiments
Datasets. We trained our model on real forward-facing datasets from LLFF[31], IBRNet[39], and the DTU dataset[9]. The camera parameters for the real forward-facing scenes were obtained from COLMAP[42]. In total, there were
5689 images used for training, which came from 102 indoor and outdoor forward-facing scenes (35 scenes from LLFF and 67 scenes from IBRNet), as well as 88 real DTU scenes. Unlike GeoNeRF, we did not utilize real depth data from DTU for training. Instead, we relied solely on RGB images for self-supervision. We conducted testing on a subset of the LLFF dataset, which consists of eight real-world forward-facing scenes, as well as eight synthetic scenes. Additionally, we performed testing on 15 real scenes from the DTU dataset. We also conducted fine-tuning and testing on these datasets to further improve the performance of our model.

During training, we randomly selected one image as the reference view and simulated partial rays based on its camera parameters. To conserve memory, we resized each image to a resolution of 640 \times 480.
Implementation Details. We trained SG-NeRF for 40 epochs, where each epoch involved iterating through all training views. During training, we randomly selected one image from the training dataset as the reference view. From the reference view, we emitted 512 rays and sampled 128 points along each ray, including 96 coarse samples and 32 fine samples. We trained our model on a single RTX 3090 Ti GPU. The initial training for cross-scene initialization took approximately four days. Once this training was completed, there was no need to train for each scene separately: a single forward pass was sufficient to synthesize the reference view from the source views.
For each epoch, we used the Adam optimizer with an initial learning rate of 5.0\times10^{-4} . We employed the ReduceLROnPlateau learning rate strategy, dynamically adjusting the learning rate based on the average PSNR obtained from each epoch.
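The optimizer and learning-rate schedule described above can be set up as in the following sketch; the model is a stand-in, and the factor/patience values of the scheduler are our assumptions.

import torch

model = torch.nn.Linear(3, 3)                       # stand-in for the full SG-NeRF network
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='max',
                                                       factor=0.5, patience=3)
for epoch in range(40):
    # ... run one epoch of training here and compute the average PSNR over the epoch ...
    epoch_psnr = 20.0                               # placeholder value
    scheduler.step(epoch_psnr)                      # 'max' mode: reduce the LR when PSNR plateaus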
4.1 Experimental Results
To evaluate the generalization ability of our model, we compared it with the original NeRF and other well-known open-source generalized NeRF models: PixelNeRF[2], IBRNet[39], MVSNeRF[7], and GeoNeRF[8]. In the generalization capability tests, we primarily used the RGB images as the original input, without including the corresponding depth information of the scene. This choice is based on our observation that the purpose of introducing multi-view stereo (MVS) technology is to enable the network to infer the 3D information of the scene; using depth information as an input undermines the effectiveness of this module and limits the practical application of the model. The original experimental results of GeoNeRF, which exhibit performance jumps on both the synthetic datasets and the DTU dataset (where depth information was used as input), highlight this issue. Conversely, by integrating the depth information predicted from the source views as supplementary input for the reference view, we demonstrate that our module functions effectively even without depth inputs. As shown in Table 1, we tested these models on three unseen test datasets and quantitatively compared them based on the PSNR, SSIM[43], and LPIPS[44] metrics. The results indicate that our model outperforms the others. When tested on the DTU dataset without utilizing real depth information, our model performs the best, demonstrating the effectiveness of the geometric neural field. Fig.3 and Fig.4 showcase the rendering results in unseen scenes, where our model better preserves scene details and exhibits fewer artifacts compared with the others.
Table 1. Quantitative Comparison on Real (DTU/LLFF) and Synthetic Data

Settings                   Method         Real Data (DTU/LLFF)                      Synthetic Data
                                          PSNR↑         SSIM↑         LPIPS↓        PSNR↑   SSIM↑   LPIPS↓
No per-scene optimization  pixelNeRF[2]   19.31/11.16   0.789/0.486   0.382/0.671    7.39   0.658   0.411
                           IBRNet[39]     20.01/23.38   0.803/0.789   0.347/0.229   25.11   0.902   0.108
                           MVSNeRF[7]     20.10/20.30   0.812/0.726   0.338/0.317   23.62   0.897   0.176
                           GeoNeRF[8]     21.77/25.00   0.847/0.823   0.217/0.183   28.14   0.936   0.090
                           Ours           24.91/25.21   0.891/0.836   0.195/0.173   27.79   0.926   0.119
Per-scene optimization     NeRF[1]        27.01/25.97   0.901/0.870   0.263/0.236   30.63   0.962   0.093
                           MVSNeRF[7]     21.97/25.45   0.847/0.877   0.226/0.192   27.07   0.931   0.168
                           GeoNeRF[8]     23.78/25.81   0.897/0.841   0.176/0.173   28.94   0.941   0.077
                           Ours           25.80/25.85   0.898/0.853   0.188/0.156   28.91   0.943   0.070
Note: We used three metrics for quantitative comparison: PSNR (↑ indicates higher is better), SSIM (↑ indicates higher is better), and LPIPS (↓ indicates lower is better). Bold indicates the best results, and underlining represents the second best results.

Figure 4. We tested the generalization synthesis effect of our model on the LLFF (Room, Horns) and DTU (Scan21) datasets. When performing the generalization test on the DTU dataset, we did not utilize the processed depth information in the DTU dataset, in order to verify the generalization ability of our model under the low-information condition. Compared with other generalization models, our model performs better on details and alleviates the artifacts in weakly textured regions. (a) Ground truth. (b) MVSNeRF[7]. (c) GeoNeRF[8]. (d) Ours.

To test the ability of our generalized model to synthesize novel views when dense views are available, we fine-tuned it on specific scenes from the NeRF synthetic dataset and compared it with the original NeRF. Our results show that, with a short period of fine-tuning, our method achieves results comparable to the original NeRF. Compared with other generalization models, our model shows the best performance after fine-tuning, second only to the original NeRF with full per-scene training. Fig.5 shows the per-scene optimized rendering results of NeRF and the results of SG-NeRF's fine-tuning.
Figure 5. We showcase the synthesis results of our model on new views after a short period of fine-tuning, achieving performance comparable to the original NeRF model. (a) Ground truth. (b) NeRF[1]. (c) Ours.

4.2 Ablation Study
Fig.5 shows that the synthesis results of our generalized model retain more scene details, demonstrating the effectiveness of our improvements to the geometric neural field. To prove the validity of the other modules in the model, we conducted an ablation study of our generalized model on the LLFF dataset. Table 2 shows our ablation results, which cover three settings: 1) no self-supervised loss is used to constrain the geometric neural field, 2) only points sampled uniformly along the ray are used, and 3) the attention mechanism of the decoding module is removed.
Table 2. Ablation Study of Key Components of SG-NeRF

Setting                   LLFF Data
                          PSNR↑   SSIM↑   LPIPS↓
No self-supervised loss   24.33   0.821   0.186
No mixture sampling       24.84   0.826   0.184
No attention mechanism    17.70   0.620   0.375
Full SG-NeRF              25.21   0.836   0.173
Note: The evaluation is performed on the real forward-facing LLFF dataset.

It can be seen from Table 2 that the model performs best when all the proposed modules are used. 1) When the self-supervised loss is not used, the geometric reasoning module lacks constraints on the spatial geometry at the beginning of training, which slows down convergence and degrades the final synthesis quality. It can be observed from Fig.6 that the depth predicted without the self-supervised depth loss is prone to blurring, loss of detail, and other problems. 2) When we do not use the Gaussian-uniform mixture sampling, we effectively perform only the coarse sampling phase of NeRF without placing more samples on the object surface, which causes the synthesized new view to lose some detail. 3) Removing the attention mechanism has the biggest impact on the model. Without it, all source views provide equally weighted features, whereas for a given new viewpoint the source views with the closest viewpoints should contribute more heavily weighted features. In Fig.7, we show the results of new view synthesis after removing the attention mechanism; an overall quality degradation can be clearly observed, including large deviations in color and false artifacts.
4.3 Influence of Source View Count
We investigated the impact of the number and quality of source views on our model to analyze its robustness to source views. As shown in Table 3, we evaluated the influence of different numbers of source views on our model. The results demonstrate that even with a small number of source views, our model can still synthesize realistic new viewpoint images.
Table 3. Quantitative Analysis of Different Numbers of Source Views on the LLFF Dataset

Number of Source Views   PSNR    SSIM    LPIPS
4                        20.92   0.786   0.205
5                        23.52   0.818   0.176
6                        25.21   0.836   0.173

Table 4 showcases the robustness of our model when there is a significant viewpoint difference between the source views and the reference view. We discarded the K nearest neighboring views to the reference view and used the remaining neighboring views as source views for rendering. As the viewpoint difference between the source views and the reference view increases, the effective information provided by the source views decreases. In this scenario, our model does not exhibit a significant performance drop.
Table 4. Quantitative Analysis of Skipping the Nearest K Neighboring Views on the LLFF Dataset

K    PSNR    SSIM    LPIPS
0    25.21   0.836   0.173
2    23.48   0.792   0.218
3    23.27   0.782   0.229
4    22.85   0.770   0.240

5. Conclusions
We introduced SG-NeRF, a few-view novel view synthesis method that can render realistic novel views of complex scenes without per-scene optimization. Our approach enhances the performance of traditional multi-view geometry architectures using convolutional attention modules and a cost volume fusion mechanism. It constructs a geometric neural field for scene representation and assists the neural network in inferring the scene. Multi-head attention is used to aggregate information from the source views, enabling the synthesis of realistic images from new viewpoints. We believe that more advanced multi-view stereo geometry techniques may extend the application of our method to surround-shooting source views and reduce artifacts in weakly textured regions.
References
[1] Hinton G, Deng L, Yu D et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Process. Mag., 2012, 29(6): 82-97. DOI: 10.1109/MSP.2012.2205597.
[2] Akinaga H, Shima H. Resistive random access memory (ReRAM) based on metal oxides. Proc. IEEE, 2010, 98(12): 2237-2251. DOI: 10.1109/JPROC.2010.2070830.
[3] Chi P, Li S, Xu C et al. PRIME: A novel processing-in-memory architecture for neural network computation in ReRAM-based main memory. In Proc. the 43rd International Symposium on Computer Architecture, Jun. 2016, pp.27-39. DOI: 10.1109/ISCA.2016.13.
[4] Chen L, Li J, Chen Y et al. Accelerator-friendly neural-network training: Learning variations and defects in RRAM crossbar. In Proc. the Design, Automation & Test in Europe Conference & Exhibition, Mar. 2017, pp.19-24. DOI: 10.23919/DATE.2017.7926952.
[5] Liu C, Yan B, Yang C et al. A spiking neuromorphic design with resistive crossbar. In Proc. the 52nd Design Automation Conference, Jun. 2015. DOI: 10.1145/2744769.2744783.
[6] Rastegari M, Ordonez V, Redmon J, Farhadi A. XNOR-Net: ImageNet classification using binary convolutional neural networks. In Proc. the 14th European Conference on Computer Vision, Oct. 2016, pp.525-542. DOI: 10.1007/978-3-319-46493-0_32.
[7] Alemdar H, Leroy V, Prost-Boucle A, Pétrot F. Ternary neural networks for resource-efficient AI applications. In Proc. the International Joint Conference on Neural Networks, May 2017, pp.2547-2554. DOI: 10.1109/IJCNN.2017.7966166.
[8] Tang T, Xia L, Li B, Wang Y, Yang H. Binary convolutional neural network on RRAM. In Proc. the 22nd Asia and South Pacific Design Automation Conference, Jan. 2017, pp.782-787. DOI: 10.1109/ASPDAC.2017.7858419.
[9] Ni L, Liu Z, Song W et al. An energy-efficient and high-throughput bitwise CNN on sneak-path-free digital ReRAM crossbar. In Proc. the 2017 IEEE/ACM International Symposium on Low Power Electronics and Design, Jul. 2017. DOI: 10.1109/ISLPED.2017.8009177.
[10] Sun X, Yin S, Peng X, Liu R, Seo J, Yu S. XNOR-RRAM: A scalable and parallel resistive synaptic architecture for binary neural networks. In Proc. the Design, Automation & Test in Europe Conference & Exhibition, Mar. 2018, pp.1423-1428. DOI: 10.23919/DATE.2018.8342235.
[11] Sun X, Peng X, Chen P Y, Liu R, Seo J, Yu S. Fully parallel RRAM synaptic array for implementing binary neural network with (+1, -1) weights and (+1, 0) neurons. In Proc. the 23rd Asia and South Pacific Design Automation Conference, Jan. 2018, pp.574-579. DOI: 10.1109/ASPDAC.2018.8297384.
[12] Wang P, Ji Y, Hong C, Lyu Y, Wang D, Xie Y. SNrram: An efficient sparse neural network computation architecture based on resistive random-access memory. In Proc. the 55th ACM/ESDA/IEEE Design Automation Conference, Jun. 2018. DOI: 10.1109/DAC.2018.8465793.
[13] Chi C C, Jiang J H R. Logic synthesis of binarized neural networks for efficient circuit implementation. IEEE Trans. Comput. Des. Integr. Circuits Syst. DOI: 10.1109/TCAD.2021.3078606.
[14] Garey M R, Johnson D S, Stockmeyer L. Some simplified NP-complete problems. In Proc. the 6th ACM Symposium on Theory of Computing, Apr. 30-May 2, 1974, pp.47-63. DOI: 10.1145/800119.803884.
[15] Kazemi A, Alessandri C, Seabaugh A C, Sharon H X, Niemier M, Joshi S. A device non-ideality resilient approach for mapping neural networks to crossbar arrays. In Proc. the 57th ACM/IEEE Design Automation Conference, Jul. 2020. DOI: 10.1109/DAC18072.2020.9218544.
[16] Song L, Qian X, Li H, Chen Y. PipeLayer: A pipelined ReRAM-based accelerator for deep learning. In Proc. the International Symposium on High Performance Computer Architecture, Feb. 2017, pp.541-552. DOI: 10.1109/HPCA.2017.55.
[17] Shafiee A, Nag A, Muralimanohar N et al. ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars. ACM SIGARCH Comput. Archit. News, 2016, 44(3): 14-26. DOI: 10.1145/3007787.3001139.
[18] Zhu Z, Sun H, Lin Y et al. A configurable multi-precision CNN computing framework based on single bit RRAM. In Proc. the 56th ACM/IEEE Design Automation Conference, Jun. 2019, Article No. 56. DOI: 10.1145/3316781.3317739.
[19] Peng X, Liu R, Yu S. Optimizing weight mapping and dataflow for convolutional neural networks on processing-in-memory architectures. IEEE Trans. Circuits Syst. I Regul. Pap., 2020, 67(4): 1333-1343. DOI: 10.1109/TCSI.2019.2958568.
[20] Cheng M, Xia L, Zhu Z et al. TIME: A training-in-memory architecture for RRAM-based deep neural networks. IEEE Trans. Comput. Des. Integr. Circuits Syst., 2019, 38(5): 834-847. DOI: 10.1109/TCAD.2018.2824304.
[21] Zhu Z, Lin J, Cheng M et al. Mixed size crossbar based RRAM CNN accelerator with overlapped mapping method. In Proc. the International Conference on Computer-Aided Design, Nov. 2018, Article No. 69. DOI: 10.1145/3240765.3240825.