
- Chinese Optics Letters
- Vol. 23, Issue 5, 051102 (2025)
Abstract
1. Introduction
Point clouds are a common method for representing three-dimensional (3D) data. With the rapid advancement of technologies such as LiDAR and other 3D scanning techniques, capturing point clouds from real-world scenes has become feasible[1,2]. As a result, 3D point clouds have found extensive applications in various domains including military technology, autonomous driving[3,4], and cartography[5]. However, due to limitations imposed by capture equipment and acquisition conditions, raw point clouds produced by 3D scanning are often locally sparse and non-uniform, posing difficulties for tasks such as point cloud classification[6–8], segmentation[9,10], and detection[4,11–13]. Therefore, the task of point cloud upsampling, which converts sparse, incomplete, and noisy point clouds into denser, complete, and cleaner forms, has attracted much attention.
In addition to traditional 3D scanners like LiDAR, tactile sensing represents another approach to perceiving 3D shapes. Most tactile sensors can measure force distribution or geometric shapes on small contact areas, offering an alternative means of obtaining 3D models. By utilizing the sensor’s position and orientation during touch, the sensor can assist in reconstructing the shape of an object. However, tactile perception is constrained by the size and scale of the sensor. Since each touch only provides information from a localized area, multiple touches and a considerable amount of time may be required to fully reconstruct the complete shape of the object. With the development of tactile sensors, prevalent devices like Gelsight[14] and DIGIT[15] can capture localized geometric shapes of contact surfaces through touch. High-resolution local geometric data, valued for its accuracy, is often applied in tasks involving 3D reconstruction[16–18]. 3D reconstruction often leverages RGB images and tactile data sequentially, by first using RGB images to learn 3D object shape priors on simulated data, and subsequently using touch to refine the vision reconstructions[19,20]. However, the fusion of tactile information with other modalities remains largely unexplored.
With the successful application of deep learning in image super-resolution[21–23], many researchers have designed more advanced deep learning networks to achieve better image super-resolution[24–26]. 3D point cloud super-resolution shares similarities with 2D image super-resolution, and as such, various algorithms and concepts from 2D image super-resolution can be adapted and applied to the field of 3D point cloud super-resolution. However, different from 2D image data, the data structure of 3D point clouds is more complex, and the data are sparse and irregular. So, super-resolution for 3D point clouds is still a very challenging problem. With the pioneering introduction of PointNet[6], which employed deep learning networks for processing point cloud data, researchers have shifted their focus toward constructing deep learning networks to accomplish point cloud upsampling tasks. The general approach employed by prevailing learning-based methods involves the initial design of an upsampling module aimed at augmenting the number of points within the feature space. Subsequently, losses are formulated to impose constraints on the output points, ensuring both distribution uniformity and proximity to the surface[27–30]. In other words, these methods share a key to point cloud upsampling: learning the representative features of given points to estimate the distribution of new points. However, in high magnification point cloud super-resolution tasks, the dense points generated by existing works still tend to exhibit non-uniform noise or retain excessive noise. This issue may arise from the insufficient point density in low-resolution point clouds, resulting in an ineffective extraction of point cloud features.
In recent times, there has been a swift progression in the utilization of industrial robots. In many cases, achieving specific tasks requires the coordinated operation of visual sensors and robotic arms equipped with tactile sensors. Our integration of tactile sensing primarily aims to supplement visual data in challenging conditions. For example, in high-intensity light situations such as robot arm welding, visual sensors often struggle. Similarly, in low-light conditions, like underwater archaeological explorations in murky seas, relying solely on visual data can be inadequate. Moreover, in scenarios with adverse weather conditions, such as heavy fog or smog, visual sensors frequently face significant challenges. In these cases, the use of tactile information from robot arms or other equipped machinery becomes invaluable for effective point cloud upsampling. Additionally, in fast-paced industrial assembly lines where visual sensors may not capture 3D data quickly enough, tactile data from existing sensors can significantly enhance the upsampling process. By integrating tactile information with visual data and leveraging tactile assistance for point cloud upsampling, it becomes possible to accomplish this task even with a limited number of point clouds. Consequently, this approach can provide high-quality point clouds for high-level vision tasks such as point cloud classification and object detection, setting the stage for further advancements.
In this paper, we leverage both visual information and local tactile data to enhance point cloud upsampling. To do so, we introduce a feature fusion module that integrates tactile features with visual features. By progressively refining visual features using tactile information obtained during each touch, this module exploits the complementarity of both modalities, leading to a substantial performance boost. Inspired by the PU-Transformer[29], we input well-fused features along with the 3D coordinates into the cascaded transformer encoders to generate a more comprehensive feature map. Finally, we use the shuffle operation[31] to form a dense feature map and reconstruct the 3D coordinates with a multilayer perceptron (MLP). Our main contributions can be summarized as follows: (1) we introduce both visual and tactile information into the point cloud upsampling task and achieve improved qualitative and quantitative results compared to using a single modality; (2) we introduce a feature fusion module that effectively leverages information from both tactile and visual modalities; (3) we build a dataset containing tactile information to benchmark point cloud upsampling algorithms in this setting.
2. Related Work
2.1. Deep-learning-based point cloud upsampling
Compared to optimization-based approaches[32–34], deep learning methods exhibit a promising advancement due to their data-driven nature and the learning capacity of neural networks. Through deep neural networks, it has become possible to directly learn features from point clouds, such as PointNet[6], PointNet++[7], and PointCNN[35]. The earliest use of deep learning networks to process low-resolution point clouds was the PU-Net[36], which drew inspiration from PointNet. It initially divides the low-resolution point cloud into small blocks of different scales and then employs multiple MLPs for feature extraction at different scales. These multi-scale features are aggregated and fed into an upsampling module to expand the features. Finally, a coordinate reconstruction module remaps the features back into the 3D coordinate space to obtain high-resolution point clouds. As the pioneer in the field of deep-learning-based point cloud super-resolution tasks, there have been many outstanding works that have drawn inspiration from its foundations. Deep-learning-based point cloud super-resolution algorithms can generally be divided into two modules: the feature extraction module and the upsampling module. In the feature extraction module, the 3D coordinates are mapped to a feature space, resulting in high-dimensional features that are then input to the upsampling module. Through feature expansion, a denser set of features is generated. Finally, these features are remapped back into the 3D coordinate space to obtain high-resolution point clouds. Wang et al. proposed MPU[37], a patch-based upsampling pipeline that can flexibly upsample point cloud patches with rich local details. Furthermore, MPU employs skip connections between each upsampling unit to facilitate information sharing, enabling the utilization of fewer parameters while achieving enhanced generalization performance. However, MPU is computationally expensive due to its progressive nature. Li et al.[30] introduced DisPU, which disentangles the upsampling task into two sub-goals. First, it generates a coarse but dense point set and then refines these points over the underlying surface to improve distribution uniformity. More recently, Qiu et al.[29] proposed PU-Transformer, the first introduction of a transformer-based model for point cloud upsampling. The PU-Transformer gradually encodes a more comprehensive feature map through cascaded encoders using the preliminary feature map. Subsequently, a coordinate reconstruction module is employed to map the refined features back into the 3D coordinate space. Existing point cloud super-resolution networks exclusively rely on low-resolution point cloud information, thus leading to suboptimal performance when dealing with high magnification point cloud upsampling tasks where the quality of the low-resolution point clouds is poor. We propose a new approach to incorporate denser and more precise tactile point cloud information as a supplementary input, resulting in high-resolution point clouds that exhibit improved uniformity and superior local representation.
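To make this two-module structure concrete, the following minimal PyTorch sketch (our own illustration rather than code from any of the cited works; the layer sizes and the simple per-point linear expansion are assumptions) extracts per-point features, expands the feature map by the rate r, and regresses rN coordinates.

```python
import torch
import torch.nn as nn

class NaiveUpsampler(nn.Module):
    """Minimal two-module upsampling pipeline: feature extraction + feature
    expansion, followed by coordinate reconstruction (illustrative only)."""
    def __init__(self, rate: int = 16, feat_dim: int = 64):
        super().__init__()
        self.rate = rate
        # Feature extraction: shared MLP applied to every 3D point.
        self.extract = nn.Sequential(
            nn.Linear(3, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
        )
        # Feature expansion: widen channels so each point yields r feature vectors.
        self.expand = nn.Linear(feat_dim, feat_dim * rate)
        # Coordinate reconstruction: map each expanded feature back to xyz.
        self.reconstruct = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, 3),
        )

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (B, N, 3) sparse input -> (B, r*N, 3) dense output
        b, n, _ = points.shape
        feats = self.extract(points)                           # (B, N, C)
        dense = self.expand(feats).view(b, n * self.rate, -1)  # (B, rN, C)
        return self.reconstruct(dense)                         # (B, rN, 3)

sparse = torch.rand(2, 512, 3)
print(NaiveUpsampler()(sparse).shape)  # torch.Size([2, 8192, 3])
```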
2.2. Tactile-assisted point cloud processing
Traditionally, researchers have predominantly relied on visual information for shape reconstruction. However, with the advancement of tactile sensors, Li et al.[14] introduced a novel tactile sensor known as Gelsight. Using the Gelsight sensor, it becomes possible to utilize photometric stereo algorithms to reconstruct the 3D geometric shape of object surfaces. Wang et al.[38] introduced a convolutional neural network (CNN)-based approach that combines visual and tactile information for 3D reconstruction. They reconstructed the 3D model of the object from RGB images and then proceeded to touch the areas of the 3D model with the highest uncertainty. Tactile data provides accurate shape information of the object’s surface, which is then employed as a constraint to refine the shape of the 3D model. Smith et al.[28] proposed a visual-tactile fusion 3D reconstruction network based on graph convolutional network (GCN). They employed a U-Net network to transform tactile information into a mesh representation. Then, a predefined 3D spherical object was used to construct an adjacency matrix for the graph structure. RGB visual information and the tactile mesh were fed into the GCN, continuously updating the shape of the 3D object and ultimately obtaining an accurate point cloud model. Rustler et al.[39] introduced ActVH, a method that constructs a point cloud model from depth images acquired by a depth camera. They then actively interact with regions of higher uncertainty and iteratively refine the 3D shape through tactile feedback, further enhancing the precision of the reconstruction. Currently, tactile information is primarily employed in 3D reconstruction tasks, demonstrating its capacity to provide accurate local information. We introduce tactile information into the point cloud upsampling task and reconstruct a point cloud upsampling dataset named TSR-PD with tactile information based on the 3D reconstruction dataset proposed by Smith et al.[19]
3. Method
3.1. Overview
Essentially, 3D scanning is a sampling problem in 3D physical space, while upsampling is a predictive task aimed at inferring additional samples on the original surface, particularly when dealing with sparse samples obtained during scanning. Given a sparse point set P with N points, our goal is to generate a denser point set Q with rN points lying on the underlying surface of the object, where r is the upsampling rate.
The overall framework of our method is shown in Fig. 1. We formulate the point cloud upsampling model as Q = f(P, T1, …, TK), where Tk denotes the tactile point cloud obtained from the k-th touch and f is the proposed network with learnable parameters.
Figure 1. An illustration of the tactile-assisted framework. Given a sparse point cloud and the tactile point clouds collected from several touches, the network outputs a dense point cloud.
3.2. Feature extraction block
As shown in Fig. 2, given the sparse point cloud and the tactile point cloud of each touch, the feature extraction block maps the raw 3D coordinates into per-point feature representations, producing the visual features Fp and the tactile features Ft that are passed to the feature fusion block; the channel dimension of the visual features is set to 4 times that of the tactile features.
Figure 2.The architecture of the feature extraction block (FE block).
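Since only the block diagram of Fig. 2 is shown, the sketch below reflects our reading of the FE block: a shared per-point MLP applied to the sparse cloud and to each tactile cloud, with the visual feature width set to four times the tactile width (cf. Sec. 3.3); the layer count and channel sizes are assumptions.

```python
import torch
import torch.nn as nn

class PointMLP(nn.Module):
    """Shared per-point MLP used as a stand-in feature extractor (illustrative)."""
    def __init__(self, out_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, out_dim), nn.ReLU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, xyz: torch.Tensor) -> torch.Tensor:
        # xyz: (B, N, 3) -> per-point features (B, N, out_dim)
        return self.net(xyz)

# Visual features are four times wider than tactile features (Sec. 3.3).
visual_fe = PointMLP(out_dim=128)   # F_p: (B, 512, 128)
tactile_fe = PointMLP(out_dim=32)   # F_t: (B, 512, 32)

sparse = torch.rand(1, 512, 3)
touch = torch.rand(1, 512, 3)
print(visual_fe(sparse).shape, tactile_fe(touch).shape)
```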
3.3. Feature fusion block
Many approaches exploiting vision and touch for 3D tasks, especially in 3D reconstruction, tend to process visual and tactile information independently[20,39]. Specifically, researchers used touch and partial depth maps separately to predict independent voxel models, which were then combined to produce a final prediction. This strategy effectively leverages tactile information and has proven beneficial. However, its principal limitation is that the tactile information is exploited only within the small area that has actually been touched. Different from previous works that generate a preliminary version of the final target and gradually refine it using tactile feedback, we prefer to generate fused features that integrate both tactile and visual information. This approach enables us to learn both the local and global characteristics of the tactile information, allowing touch not only to enhance the upsampling at the touch site but also to extrapolate to its local neighborhood. Therefore, we introduce a feature fusion block as shown in Fig. 3. We feed the tactile features Ft together with the visual features Fp into this block and obtain the fused features Ff.
Algorithm 1. Feature Fusion Pipeline.
Figure 3.The architecture of the feature fusion block (FF block). In this module, we iteratively fuse tactile features Ft into visual features Fp, ultimately obtaining the fused features Ff. Specifically, during the first fusion of tactile features, the input features are the initial point cloud features Fp.
In Algorithm 1, we present the basic operations that are employed to build our feature fusion block. We first feed the low-resolution point cloud feature Fp, together with the tactile feature Ft of the first touch, into the block; the fused feature it outputs then serves as the visual input to the next fusion block, so that the information of each successive touch is injected step by step until the final fused feature Ff is obtained.
As discussed in Sec. 3.2, in order to provide better assistance to visual information using tactile cues, the dimension of the visual features is 4 times that of the tactile features. Specifically, when adding the first tactile iteration, the feature fed into the fusion block is the initial point cloud feature Fp; in each subsequent iteration, the fused feature output by the previous block takes its place.
The design of the fusion module combines low-resolution point cloud features with tactile features for more accurate feature correction. In each fusion process, the module utilizes low-resolution point cloud features as the basis and gradually introduces tactile features as auxiliary information. This gradual fusion not only helps improve the accuracy of the overall features, but also effectively corrects possible errors. At the same time, the number of input and output channels of the module is set to a fixed value, and this design makes the module highly flexible and scalable. As the number of tactile touches increases, we can easily connect multiple fusion modules by cascading to further enhance the fusion effect of the features.
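A minimal sketch of such a cascadable fusion block is given below (our interpretation, not the authors' exact layers): the tactile feature of the current touch is lifted to the visual width, pooled into a descriptor that can bias every visual point, and added back residually, so input and output widths stay fixed and one block can be stacked per touch.

```python
import torch
import torch.nn as nn

class FeatureFusionBlock(nn.Module):
    """One fusion step: injects tactile features into the visual features.
    Input and output widths are identical so blocks can be cascaded,
    one per touch (a sketch of the FF block, not the authors' exact layers)."""
    def __init__(self, visual_dim: int = 128, tactile_dim: int = 32):
        super().__init__()
        self.lift = nn.Linear(tactile_dim, visual_dim)   # align channel widths
        self.fuse = nn.Sequential(
            nn.Linear(2 * visual_dim, visual_dim), nn.ReLU(),
            nn.Linear(visual_dim, visual_dim),
        )

    def forward(self, f_visual: torch.Tensor, f_tactile: torch.Tensor) -> torch.Tensor:
        # f_visual: (B, N, C), f_tactile: (B, M, C/4); pool the touch into a
        # descriptor so it can bias every visual point, then refine residually.
        touch_global = self.lift(f_tactile).max(dim=1, keepdim=True).values  # (B, 1, C)
        touch_global = touch_global.expand_as(f_visual)
        return f_visual + self.fuse(torch.cat([f_visual, touch_global], dim=-1))

# Cascade one block per touch; the first block consumes the initial F_p.
blocks = nn.ModuleList([FeatureFusionBlock() for _ in range(4)])
f_p = torch.rand(1, 512, 128)
touches = [torch.rand(1, 512, 32) for _ in range(4)]
f_fused = f_p
for block, f_t in zip(blocks, touches):
    f_fused = block(f_fused, f_t)
print(f_fused.shape)  # torch.Size([1, 512, 128])
```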
3.4. Transformer encoder
The PU-Transformer[29] encodes the initial feature mapping into a comprehensive feature map via a cascaded transformer encoder containing a positional fusion block and a shifted-channel multi-head self-attention (SC-MSA) block. It fully utilizes the transformer’s power in feature representation. So, we chose to employ the same cascaded transformer encoder as the PU-Transformer to encode a more comprehensive feature map. After extracting the fused features, we used the fused feature map along with the inherent 3D coordinates as the input to the transformer encoder.
The transformer encoder utilizes a positional fusion block to encode and combine both the provided 3D coordinates and the fused point-wise features; the combined features are then refined by the SC-MSA block, and the cascaded encoders gradually produce a more comprehensive feature map.
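For orientation, the following sketch shows one simplified encoder stage. It replaces the SC-MSA block with ordinary multi-head self-attention and implements positional fusion as a coordinate MLP followed by concatenation; both choices are simplifications for illustration, not the actual PU-Transformer layers.

```python
import torch
import torch.nn as nn

class SimpleEncoder(nn.Module):
    """Simplified stand-in for one PU-Transformer encoder stage: a positional
    fusion step (coordinate MLP + concatenation) followed by ordinary multi-head
    self-attention instead of SC-MSA (illustrative only)."""
    def __init__(self, feat_dim: int = 128, heads: int = 4):
        super().__init__()
        self.pos_mlp = nn.Sequential(nn.Linear(3, feat_dim), nn.ReLU(),
                                     nn.Linear(feat_dim, feat_dim))
        self.merge = nn.Linear(2 * feat_dim, feat_dim)
        self.attn = nn.MultiheadAttention(feat_dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(feat_dim)

    def forward(self, xyz: torch.Tensor, feats: torch.Tensor) -> torch.Tensor:
        # xyz: (B, N, 3), feats: (B, N, C) -> refined features (B, N, C)
        fused = self.merge(torch.cat([feats, self.pos_mlp(xyz)], dim=-1))
        attended, _ = self.attn(fused, fused, fused)
        return self.norm(fused + attended)

xyz = torch.rand(1, 512, 3)
feats = torch.rand(1, 512, 128)
stages = nn.ModuleList([SimpleEncoder() for _ in range(3)])  # cascaded stages
for stage in stages:
    feats = stage(xyz, feats)
print(feats.shape)  # torch.Size([1, 512, 128])
```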
3.5. Coordinate reconstruction
We reconstruct points from the latent space to the coordinate space, resulting in the desired denser point cloud of size rN × 3, where r is the upsampling rate: the comprehensive feature map is first expanded into a dense feature map with the shuffle operation[31] and then mapped back to 3D coordinates by an MLP.
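A minimal sketch of this shuffle-plus-MLP reconstruction (channel sizes assumed) is given below.

```python
import torch
import torch.nn as nn

class CoordinateReconstruction(nn.Module):
    """Shuffle-based reconstruction: widen channels by the rate r, rearrange the
    (B, N, r*C) map into (B, r*N, C), then regress xyz with an MLP (a sketch)."""
    def __init__(self, rate: int = 16, feat_dim: int = 128):
        super().__init__()
        self.rate = rate
        self.widen = nn.Linear(feat_dim, feat_dim * rate)
        self.to_xyz = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(),
                                    nn.Linear(64, 3))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        b, n, c = feats.shape
        dense = self.widen(feats).view(b, n * self.rate, c)  # shuffle to r*N points
        return self.to_xyz(dense)                            # (B, r*N, 3)

print(CoordinateReconstruction()(torch.rand(1, 512, 128)).shape)  # (1, 8192, 3)
```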
4. Experiments
In order to validate the effectiveness of tactile information in point cloud super-resolution tasks, we reconstructed a tactile-enhanced point cloud super-resolution dataset, named the tactile super-res point cloud dataset (TSR-PD), based on the 3D reconstruction dataset introduced by Smith et al.[19] We also conduct ablation studies to show the benefits of the proposed modules. To be specific, there are two key components: the feature extraction block and the feature fusion block.
4.1. Datasets
Smith et al.[19] built a new dataset that aims to capture the interactions between a robotic hand and an object it is touching for 3D reconstruction. They loaded example objects into the 3D robotics simulator Pybullet[41], placed the hand randomly on its surface, and then closed its fingers to attempt to produce contact between the sensors and some point on the object using inverse kinematics. Touch signals were ultimately obtained through the simulation of the DIGIT principle[15]. In this paper, we chose 8192 points from the object model in ABC Dataset[42] as the ground truth, which were then downsampled to 512 points to produce the low-resolution point cloud. Figure 4 depicts an object from our dataset. Tactile point clouds are generated by tactile sensors, and the number of points in a tactile point cloud depends on the area of contact with the tactile sensor. A tactile point cloud containing at least 1000 points will be regarded as a successful touch. Initially, these tactile point clouds were aligned with the object point cloud using Euler angles and then downsampled to 512 points to serve as the tactile input for our network. The dataset comprises a total of 12,732 samples, covers a large semantic range of 3D objects, and includes simple, as well as complex, shapes to evaluate the generalization capability of the model.
Figure 4.An object from the TSR-PD, where (a) represents the high-resolution point cloud (GT), (b) corresponds to the low-resolution point cloud (blue) and tactile information (red) for 5 touches, and (c) depicts the point cloud from one tactile interaction.
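A sketch of how such a sample could be assembled is shown below; it is not the authors' preparation script, and the random downsampling, the rotation convention, and the helper names are assumptions.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def downsample(points: np.ndarray, k: int, rng: np.random.Generator) -> np.ndarray:
    """Randomly keep k points (the paper may use a different sampler)."""
    idx = rng.choice(len(points), size=k, replace=False)
    return points[idx]

def make_sample(gt_8192: np.ndarray, raw_touch: np.ndarray,
                euler_xyz_deg: np.ndarray, rng: np.random.Generator) -> dict:
    """Assemble one TSR-PD-style sample (illustrative, not the authors' script)."""
    assert len(raw_touch) >= 1000, "fewer than 1000 contact points: discard touch"
    # Align the tactile patch to the object frame with Euler angles.
    aligned = Rotation.from_euler("xyz", euler_xyz_deg, degrees=True).apply(raw_touch)
    return {
        "gt": gt_8192,                               # (8192, 3) ground truth
        "sparse": downsample(gt_8192, 512, rng),     # low-resolution input
        "touch": downsample(aligned, 512, rng),      # tactile input
    }

rng = np.random.default_rng(0)
sample = make_sample(rng.random((8192, 3)), rng.random((1200, 3)),
                     np.array([10.0, -5.0, 30.0]), rng)
print({k: v.shape for k, v in sample.items()})
```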
4.2. Loss function and implementation details
4.2.1. Loss function
We use the Chamfer distance (CD) loss to minimize the distance between the predicted point cloud and the reference ground truth in our experiments:

$$L_{\mathrm{CD}}(\mathcal{Q},\mathcal{G})=\frac{1}{|\mathcal{Q}|}\sum_{q\in\mathcal{Q}}\min_{g\in\mathcal{G}}\|q-g\|_{2}^{2}+\frac{1}{|\mathcal{G}|}\sum_{g\in\mathcal{G}}\min_{q\in\mathcal{Q}}\|g-q\|_{2}^{2},$$

where $\mathcal{Q}$ denotes the predicted dense point cloud and $\mathcal{G}$ the ground truth.
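For reference, a naive PyTorch implementation of this (squared) Chamfer distance can be written as follows; efficient CUDA kernels are typically used in practice, and this O(N·M) version is only illustrative.

```python
import torch

def chamfer_distance(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Symmetric Chamfer distance between two point sets (naive O(N*M) version).
    pred: (B, N, 3), gt: (B, M, 3); returns a scalar averaged over the batch."""
    d2 = torch.cdist(pred, gt).pow(2)               # pairwise squared distances (B, N, M)
    pred_to_gt = d2.min(dim=2).values.mean(dim=1)   # nearest GT point for each prediction
    gt_to_pred = d2.min(dim=1).values.mean(dim=1)   # nearest prediction for each GT point
    return (pred_to_gt + gt_to_pred).mean()

loss = chamfer_distance(torch.rand(2, 1024, 3), torch.rand(2, 1024, 3))
print(loss.item())
```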
4.2.2. Implementation details
We train our network with a single NVIDIA RTX A6000 running Linux in all the experiments. In terms of the hyperparameters for training, we use a batch size of 32 for 300 training epochs, optimized using Adam with an initial learning rate of 0.001 and a 0.7 decay rate. Similar to previous work, we perform point cloud normalization and augmentation[29] (rotation, scaling, and random perturbations). We report results using the Chamfer distance (CD), the Hausdorff distance (HD), and the Earth mover's distance (EMD) as evaluation metrics, with a 16× upsampling rate unless otherwise stated.
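The stated training configuration can be sketched as follows; the network is replaced by a placeholder module, and the learning-rate decay interval and augmentation ranges are assumptions since the paper does not specify them.

```python
import math
import random
import torch
import torch.nn as nn

def normalize(points: torch.Tensor) -> torch.Tensor:
    """Center each cloud and scale it into the unit sphere. points: (B, N, 3)."""
    points = points - points.mean(dim=1, keepdim=True)
    scale = points.norm(dim=-1).amax(dim=1, keepdim=True).unsqueeze(-1)
    return points / scale

def augment(points: torch.Tensor) -> torch.Tensor:
    """Rotation about z, random scaling, and small perturbations (ranges assumed)."""
    theta = random.uniform(0.0, 2.0 * math.pi)
    rot = torch.tensor([[math.cos(theta), -math.sin(theta), 0.0],
                        [math.sin(theta),  math.cos(theta), 0.0],
                        [0.0, 0.0, 1.0]], dtype=points.dtype)
    points = points @ rot.T * random.uniform(0.9, 1.1)
    return points + 0.005 * torch.randn_like(points)

# Stated hyperparameters: Adam, initial lr 1e-3, decay factor 0.7, batch 32, 300 epochs.
model = nn.Linear(3, 3)  # placeholder for the upsampling network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# The paper gives the 0.7 decay rate but not its interval; 50 epochs is an assumption.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.7)
batch_size, num_epochs = 32, 300
print(augment(normalize(torch.rand(2, 512, 3))).shape)
```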
4.3. Quantitative and qualitative results
Table 1 shows the quantitative comparisons on the dataset at a 16× upsampling rate with different numbers of touches.
However, we found that upsampling does not always improve as more tactile input is added: as the number of tactile iterations increases, the rate of decrease in CD and HD slows down, and when a fifth tactile iteration is added, CD and HD increase compared with four iterations. The EMD reaches its minimum with the second tactile iteration. This observation indicates an upper limit to the benefit that tactile feedback provides to vision. We consider that the reason may be that the tactile data are not perfect; the tactile point clouds obtained by simulating the DIGIT principle may contain errors. Within a certain range, the gain brought by adding tactile information outweighs the errors it introduces, so upsampling performance improves as the number of touches increases. Beyond this range, repeatedly adding tactile information accumulates these errors, which corrupts the overall point cloud features and degrades upsampling performance. In addition, multiple tactile inputs cause the network to focus excessively on local tactile information and reduce its sensitivity to the visual information, which should dominate the point cloud features during upsampling.
The qualitative results of different point cloud upsampling models are presented in Fig. 5. Three object models were selected for visualization, and local regions were magnified for closer examination. From left to right, the three columns represent the joint, arch, and lamp post, respectively. The first row shows the GT with 8192 points. The second row shows the input low-resolution point cloud to the network (512 points), with the red regions indicating four densely arranged tactile point clouds; each tactile point cloud also consists of 512 points. The third row shows the upsampled points (8192 points) obtained without incorporating tactile information. The fourth row shows the upsampled points (8192 points) produced by the proposed model, which incorporates four instances of tactile information.
Figure 5.Comparing point set upsampling (16×) results from sparse inputs with and without tactile information using 512 input points. Among them are (a) joint, (b) arch, and (c) lamp post. The first row is the input low-resolution point cloud, the second row is the reconstructed point cloud without tactile information, the third row is the reconstructed point cloud with tactile information, and the fourth row is GT.
Comparing the dense points produced in the cases of incorporating tactile information and not incorporating tactile information, we can see that the method without adding tactile information tends to introduce excessive noise [e.g., Fig. 5(a)], cluster points together with a non-uniform distribution [e.g., Fig. 5(b)], or destroy some tiny structures [e.g., Fig. 5(c)] in the results. In contrast, our method incorporating tactile information produces the most similar visual results to the target points, and our dense points can well preserve tiny local structures with a uniform point distribution. Consequently, it can be inferred that the addition of tactile information not only impacts the local effects during point cloud upsampling but also integrates tactile cues as part of the global information, influencing the overall results. This can be particularly observed in the magnified views depicted in Fig. 5.
To further validate the effectiveness of tactile fusion and the superiority of the proposed algorithm, we employed several point cloud super-resolution algorithms for training and testing on the TSR-PD dataset, including PU-GAN[27], PU-GCN[28], Grad-PU[43], and PU-Transformer[29]. To provide a more intuitive representation of the differences in high-resolution point clouds obtained by different algorithms, we conducted 16× point cloud super-resolution experiments using models trained in this section on the test set and performed qualitative analysis.
As shown in Fig. 6, we selected a buckle for visualization. Figure 6(a) represents the original high-resolution point cloud, containing 8192 points. Figure 6(b) depicts the low-resolution point cloud, containing 512 points, with shaded regions representing the dense tactile point clouds incorporated by our algorithm, where each tactile point cloud comprises 512 points. Figure 6(c) shows the high-resolution point cloud reconstructed using PU-GCN. Figure 6(d) shows the high-resolution point cloud reconstructed using Grad-PU. Figure 6(e) shows the high-resolution point cloud reconstructed by the PU-Transformer[29]. Figure 6(f) shows the high-resolution point cloud reconstructed by TAPSR, which incorporates four instances of tactile information.
Figure 6. Visualization results of different algorithms for upsampling the same object. (a) Ground truth; (b) input point cloud (512 points); 16× upsampled results of (c) PU-GCN[28], (d) Grad-PU[43], (e) PU-Transformer[29], and (f) our method with four touches.
We can observe that the comparative algorithms achieve a certain degree of super-resolution relative to the input low-resolution point cloud, but their results remain unsatisfactory. For example, as shown in Fig. 6(c), the PU-GCN algorithm reconstructs the buckle’s outline but generates numerous outlier points on its left side. Conversely, Grad-PU [Fig. 6(d)] generates fewer outlier points, but parts of the buckle’s outline are missing. In comparison, our algorithm reconstructs the buckle’s outline more faithfully with fewer outlier points, indicating that the additional local information provided by the tactile point cloud effectively assists the point cloud super-resolution task.
As shown in Table 2, we compared the quantitative performance of different algorithms in point cloud super-resolution on the TSR-PD test set. Compared to other point cloud super-resolution algorithms that do not incorporate tactile information, our model achieves optimal super-resolution performance, exhibiting the lowest CD, HD, and EMD. This indicates the effectiveness of utilizing tactile information to assist point cloud super-resolution through iterative fusion.
| Method | CD | HD | EMD |
|---|---|---|---|
| PU-GAN[27] | 4.634 | 9.219 | 13.672 |
| PU-GCN[28] | 3.009 | 8.751 | 10.576 |
| Grad-PU[43] | 2.464 | 6.308 | 9.582 |
| PU-Transformer[29] | 1.162 | 3.724 | 5.421 |
| Ours (number of touches = 4) | | | |
Table 2. Quantitative Comparisons to Other Methods on the TSR-PD
Table 3 shows the quantitative comparisons on the dataset under different upsampling rates. All upsampling rates use the same model, while the network we designed is mainly intended for point cloud upsampling at the 16× rate.
| Rate | PU-Transformer[29] | Ours |
|---|---|---|
| 8× | 1.096 | |
| 16× | 1.162 | |
| 32× | 1.184 | |
Table 3. Quantitative Comparisons Under Different Upsampling Rates Between the State-of-the-Art Work and Our Present Work
4.4. Ablation study
To evaluate the effectiveness of the major components in our framework, we conduct ablation studies on the feature extraction block and the feature fusion block. All the models are trained and evaluated on TSR-PD. The results in Table 4 show the effectiveness of our fusion module when incorporating tactile information. Specifically, we remove the feature extraction module and the feature fusion module under different numbers of tactile iterations. In the first row, we removed both modules and directly concatenated the tactile point clouds with the sparse input; this lowered the quality of upsampling, and because the dense local information disrupts the overall structure of the point cloud, the variant without tactile information even outperformed those with tactile information. In the second row, we removed only the feature fusion module and directly concatenated the sparse features with the tactile features. Clearly, our complete pipeline consistently achieves the best performance, with the lowest CD across different numbers of tactile iterations, and removing any component reduces overall performance, indicating that each component of our framework contributes.
| FE block | FF block | 0 touches | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|---|
| × | × | 1.162 | 1.242 | 1.271 | 1.220 | 1.253 | 1.256 |
| ✓ | × | — | 1.018 | 0.858 | 1.123 | 1.094 | 1.117 |
| ✓ | ✓ | — | | | | | |
Table 4. Comparing the Upsampling Performance of Our Full Pipeline with Various Cases in the Ablation Study (CD, 16× Upsampling)
4.5. Training and testing efficiency
Table 5 shows the time required to train the model for one epoch and the inference time required to reconstruct a high-resolution point cloud from a low-resolution one, for different numbers of touches. The training time per epoch increases slightly as the number of touches grows, and the increase remains within a reasonable range; the inference time rises once tactile inputs are introduced but then stays nearly constant at about 24 ms per sample.
| Number of touches | Training speed (per epoch) | Inference time (per sample) |
|---|---|---|
| 0 | 76.68 s | 15.8 ms |
| 1 | 78.35 s | 23.9 ms |
| 2 | 79.39 s | 24.0 ms |
| 3 | 80.35 s | 24.1 ms |
| 4 | 82.18 s | 24.1 ms |
| 5 | 83.45 s | 24.2 ms |
Table 5. Training Speed and Inference Time for the Model With Different Numbers of Touches
5. Conclusion
In this paper, we propose a method of point cloud upsampling assisted with tactile information. Specifically, we design a feature fusion module that effectively leverages information from both tactile and visual modalities. With the assistance of tactile information, our approach can significantly improve the quality of point cloud upsampling both quantitatively and qualitatively compared with using visual information only, since the tactile point cloud data contains both local and global features of objects.
The network in this paper is built on the PU-Transformer[29] network architecture, but this idea of utilizing multimodal information to assist in the task of point cloud upsampling can be applied to other networks for point cloud upsampling as well. We note that a number of feature extraction methods have been proposed; in the future, we plan to explore new feature extraction methods[44]. Additionally, we expect to further explore the application of visual and tactile fusion, expanding its adaptability in high-level 3D visual tasks.
References
[1] H. Liu, J. Luo, P. Wu et al. People perception from RGB-D cameras for mobile robots. 2015 IEEE International Conference on Robotics and Biomimetics (ROBIO), 2020(2015).
[6] C. R. Qi, H. Su, K. Mo et al. PointNet: deep learning on point sets for 3D classification and segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 652(2017).
[7] C. R. Qi, L. Yi, H. Su et al. PointNet++: Deep hierarchical feature learning on point sets in a metric space. Advances in Neural Information Processing Systems(2017).
[9] H. Su, V. Jampani, D. Sun et al. Splatnet: Sparse lattice networks for point cloud processing. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2530(2018).
[10] Q. Hu, B. Yang, L. Xie et al. RandLA-Net: Efficient semantic segmentation of large-scale point clouds. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 11108(2020).
[11] C. R. Qi, X. Chen, O. Litany et al. ImVoteNet: boosting 3D object detection in point clouds with image votes. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4404(2020).
[12] Y. Zhou, O. Tuzel. VoxelNet: end-to-end learning for point cloud based 3D object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4490(2018).
[13] N. Carion, F. Massa, G. Synnaeve et al. End-to-end object detection with transformers. European Conference on Computer Vision, 213(2020).
[16] M. Björkman, Y. Bekiroglu, V. Högman et al. Enhancing visual perception of shape through tactile glances. 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, 3180(2013).
[19] E. Smith, R. Calandra, A. Romero et al. 3D shape reconstruction from vision and touch. 34th Conference on Neural Information Processing Systems(2020).
[20] E. Smith, D. Meger, L. Pineda et al. Active 3D shape reconstruction from vision and touch. 35th Conference on Neural Information Processing Systems(2021).
[23] C. Dong, C. C. Loy, K. He et al. Learning a deep convolutional network for image super-resolution. Computer Vision–ECCV 2014: 13th European Conference, 184(2014).
[24] B. Lim, S. Son, H. Kim et al. Enhanced deep residual networks for single image super-resolution. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 136(2017).
[25] Y. Zhang, K. Li, K. Li et al. Image super-resolution using very deep residual channel attention networks. Proceedings of the European Conference on Computer Vision (ECCV), 286(2018).
[26] Y. Zhang, H. Wang, C. Qin et al. Aligned structured sparsity learning for efficient image super-resolution. Advances in Neural Information Processing Systems(2021).
[27] R. Li, X. Li, C.-W. Fu et al. PU-GAN: a point cloud upsampling adversarial network. Proceedings of the IEEE/CVF International Conference on Computer Vision, 7203(2019).
[28] G. Qian, A. Abualshour, G. Li et al. PU-GCN: point cloud upsampling using graph convolutional networks. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 11683(2021).
[29] S. Qiu, S. Anwar, N. Barnes. PU-Transformer: point cloud upsampling transformer. Proceedings of the Asian Conference on Computer Vision, 2475(2022).
[30] R. Li, X. Li, P.-A. Heng et al. Point cloud upsampling via disentangled refinement. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 344(2021).
[31] W. Shi, J. Caballero, F. Huszar et al. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)(2016).
[35] Y. Li, R. Bu, M. Sun et al. PointCNN: convolution on x-transformed points. Advances in Neural Information Processing Systems(2018).
[36] L. Yu, X. Li, C.-W. Fu et al. PU-Net: point cloud upsampling network. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2790(2018).
[37] W. Yifan, S. Wu, H. Huang et al. Patch-based progressive 3D point set upsampling. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5958(2019).
[38] S. Wang, J. Wu, X. Sun et al. 3D shape perception from monocular vision, touch, and shape priors. 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 1606(2018).
[40] O. Ronneberger, P. Fischer, T. Brox. U-Net: convolutional networks for biomedical image segmentation. Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, 234(2015).
[42] S. Koch, A. Matveev, Z. Jiang et al. ABC: a big CAD model dataset for geometric deep learning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9601(2019).
[43] Y. He, D. Tang, Y. Zhang et al. Grad-PU: arbitrary-scale point cloud upsampling via gradient descent with learned distance functions. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5354(2023).
[44] X. Ma, Y. Zhou, H. Wang et al. Image as set of points. The Eleventh International Conference on Learning Representations(2023).
