Chinese Optics Letters, Vol. 23, Issue 5, 051102 (2025)
Haoran Shen1,2, Puzheng Wang1,2, Ming Lu1,2, Chi Zhang1,2, …, Jian Li1,2,** and Qin Wang1,2,*
Author Affiliations
  • 1Institute of Quantum Information and Technology, Nanjing University of Posts and Telecommunications, Nanjing 210003, China
  • 2Broadband Wireless Communication and Sensor Network Technology, Key Lab of Ministry of Education, Nanjing University of Posts and Telecommunications, Nanjing 210003, China
    DOI: 10.3788/COL202523.051102
    Haoran Shen, Puzheng Wang, Ming Lu, Chi Zhang, Jian Li, Qin Wang, "Tactile-assisted point cloud super-resolution," Chin. Opt. Lett. 23, 051102 (2025)

    Abstract

    With the rapid advancement of three-dimensional (3D) scanners and 3D point cloud acquisition technology, 3D point clouds are finding increasingly broad application across many fields. However, due to the limitations of 3D sensors, the collected point clouds are often sparse and non-uniform. In this work, we introduce local tactile information into the point cloud super-resolution task to help enhance the resolution of the point cloud with fine-grained local details. Specifically, the local tactile point cloud is denser and more accurate than the low-resolution point cloud, so leveraging tactile information yields better local features. We therefore propose a feature extraction module that efficiently fuses visual information with dense local tactile information, leveraging the features of both modalities to achieve improved super-resolution results. In addition, we introduce a point cloud super-resolution dataset that includes tactile information. Qualitative and quantitative experiments show that our method performs much better than existing comparable methods that do not use tactile information, both in handling low-resolution inputs and in revealing high-fidelity details.

    1. Introduction

    Point clouds are a common method for representing three-dimensional (3D) data. With the rapid advancement of technologies such as LiDAR and other 3D scanning techniques, capturing point clouds from real-world scenes has become feasible[1,2]. As a result, 3D point clouds have found extensive applications in various domains, including military technology, autonomous driving[3,4], and cartography[5]. However, due to limitations imposed by capture equipment and acquisition conditions, raw point clouds produced by 3D scanning are often locally sparse and non-uniform, posing difficulties for tasks such as point cloud classification[6-8], segmentation[9,10], and detection[4,11-13]. Therefore, the task of point cloud upsampling, which converts sparse, incomplete, and noisy point clouds into denser, complete, and cleaner forms, has attracted much attention.

    In addition to traditional 3D scanners like LiDAR, tactile sensing offers another way to perceive 3D shapes. Most tactile sensors measure the force distribution or geometric shape over a small contact area, providing an alternative means of obtaining 3D models. By utilizing the sensor’s position and orientation during touch, the sensor can assist in reconstructing the shape of an object. However, tactile perception is constrained by the size and scale of the sensor: since each touch only provides information from a localized area, multiple touches and a considerable amount of time may be required to fully reconstruct the complete shape of an object. With the development of tactile sensors, prevalent devices like GelSight[14] and DIGIT[15] can capture the localized geometric shape of a contact surface through touch. Such high-resolution local geometric data, valued for its accuracy, is often applied in 3D reconstruction tasks[16-18]. These works typically leverage RGB images and tactile data sequentially, first using RGB images to learn 3D object shape priors on simulated data and subsequently using touch to refine the vision-based reconstructions[19,20]. However, the fusion of tactile information with other modalities remains largely unexplored.

    With the successful application of deep learning to image super-resolution[21-23], many researchers have designed more advanced networks to achieve better image super-resolution[24-26]. 3D point cloud super-resolution shares similarities with 2D image super-resolution, so various algorithms and concepts from the 2D setting can be adapted to 3D point clouds. However, unlike 2D image data, 3D point clouds have a more complex data structure and are sparse and irregular, which makes point cloud super-resolution a challenging problem. Since the pioneering introduction of PointNet[6], which employed deep neural networks to process point cloud data directly, researchers have focused on constructing deep learning networks for point cloud upsampling. The general approach of prevailing learning-based methods is to first design an upsampling module that augments the number of points within the feature space, and then formulate losses that constrain the output points to be uniformly distributed and close to the surface[27-30]. In other words, these methods share a common key to point cloud upsampling: learning representative features of the given points to estimate the distribution of new points. However, in high-magnification point cloud super-resolution tasks, the dense points generated by existing works still tend to be non-uniformly distributed or to retain excessive noise. This issue may arise from the insufficient point density of the low-resolution point clouds, which leads to ineffective extraction of point cloud features.

    In recent times, industrial robots have been adopted at a rapid pace. In many cases, specific tasks require the coordinated operation of visual sensors and robotic arms equipped with tactile sensors. Our integration of tactile sensing primarily aims to supplement visual data under challenging conditions. For example, in high-intensity light situations such as robot arm welding, visual sensors often struggle. Similarly, in low-light conditions, such as underwater archaeological exploration in murky seas, relying solely on visual data can be inadequate. Moreover, in adverse weather such as heavy fog or smog, visual sensors frequently face significant challenges. In these cases, tactile information from robot arms or other equipped machinery becomes invaluable for effective point cloud upsampling. Additionally, on fast-paced industrial assembly lines where visual sensors may not capture 3D data quickly enough, tactile data from existing sensors can significantly enhance the upsampling process. By integrating tactile information with visual data and leveraging tactile assistance for point cloud upsampling, the task can be accomplished even with limited point cloud data. Consequently, this approach can provide high-quality point clouds for high-level vision tasks such as point cloud classification and object detection, setting the stage for further advancements.

    In this paper, we leverage both visual information and local tactile data to enhance point cloud upsampling. To do so, we introduce a feature fusion module that integrates tactile features with visual features. By progressively refining the visual features using the tactile information obtained from each touch, this module exploits the complementarity of the two modalities, leading to a substantial performance boost. Inspired by the PU-Transformer[29], we input the well-fused features along with the 3D coordinates into cascaded transformer encoders to generate a more comprehensive feature map. Finally, we use the shuffle operation[31] to form a dense feature map and reconstruct the 3D coordinates with a multilayer perceptron (MLP). Our main contributions can be summarized as follows: (1) we introduce both visual and tactile information into the point cloud upsampling task and achieve improved qualitative and quantitative results compared to using a single modality; (2) we introduce a feature fusion module that effectively leverages information from both tactile and visual modalities; and (3) we build a dataset containing tactile information to benchmark point cloud upsampling algorithms in this setting.

    2. Related Work

    2.1. Deep-learning-based point cloud upsampling

    Compared to optimization-based approaches[32-34], deep learning methods show promising advancement owing to their data-driven nature and the learning capacity of neural networks. Deep neural networks make it possible to learn features directly from point clouds, as in PointNet[6], PointNet++[7], and PointCNN[35]. The earliest use of deep learning to process low-resolution point clouds was PU-Net[36], which drew inspiration from PointNet. It first divides the low-resolution point cloud into small patches of different scales and then employs multiple MLPs for feature extraction at each scale. These multi-scale features are aggregated and fed into an upsampling module to expand the features, and finally a coordinate reconstruction module remaps the features back into the 3D coordinate space to obtain a high-resolution point cloud. As the pioneering deep-learning-based point cloud super-resolution method, PU-Net has inspired many outstanding follow-up works. Deep-learning-based point cloud super-resolution algorithms can generally be divided into two modules: a feature extraction module and an upsampling module. In the feature extraction module, the 3D coordinates are mapped to a feature space, and the resulting high-dimensional features are input to the upsampling module, where feature expansion generates a denser set of features; finally, these features are remapped back into the 3D coordinate space to obtain a high-resolution point cloud. Wang et al. proposed MPU[37], a patch-based upsampling pipeline that can flexibly upsample point cloud patches with rich local details. MPU employs skip connections between upsampling units to facilitate information sharing, which allows fewer parameters while achieving better generalization; however, it is computationally expensive due to its progressive nature. Li et al.[30] introduced DisPU, which disentangles the upsampling task into two sub-goals: it first generates a coarse but dense point set and then refines these points over the underlying surface to improve distribution uniformity. More recently, Qiu et al.[29] proposed the PU-Transformer, the first transformer-based model for point cloud upsampling. The PU-Transformer gradually encodes a more comprehensive feature map from the preliminary feature map through cascaded encoders, and a coordinate reconstruction module then maps the refined features back into the 3D coordinate space. Existing point cloud super-resolution networks rely exclusively on the low-resolution point cloud, leading to suboptimal performance in high-magnification upsampling tasks where the quality of the low-resolution point cloud is poor. We propose a new approach that incorporates denser and more precise tactile point cloud information as a supplementary input, yielding high-resolution point clouds with improved uniformity and superior local representation.

    2.2. Tactile-assisted point cloud processing

    Traditionally, researchers have relied predominantly on visual information for shape reconstruction. With the advancement of tactile sensors, however, Yuan et al.[14] introduced the GelSight tactile sensor, with which photometric stereo algorithms can be used to reconstruct the 3D geometric shape of object surfaces. Wang et al.[38] introduced a convolutional neural network (CNN)-based approach that combines visual and tactile information for 3D reconstruction: they reconstructed a 3D model of the object from RGB images and then touched the areas of the 3D model with the highest uncertainty; the tactile data provides accurate shape information of the object’s surface, which is then employed as a constraint to refine the shape of the 3D model. Smith et al.[19] proposed a visual-tactile fusion 3D reconstruction network based on a graph convolutional network (GCN). They employed a U-Net to transform tactile information into a mesh representation, used a predefined 3D spherical object to construct the adjacency matrix of the graph structure, and fed the RGB visual information and the tactile mesh into the GCN, continuously updating the shape of the 3D object to ultimately obtain an accurate point cloud model. Rustler et al.[39] introduced ActVH, which constructs a point cloud model from depth images acquired by a depth camera, then actively touches regions of higher uncertainty and iteratively refines the 3D shape through tactile feedback, further enhancing the precision of the reconstruction. Currently, tactile information is employed primarily in 3D reconstruction tasks, where it has demonstrated its capacity to provide accurate local information. We introduce tactile information into the point cloud upsampling task and build a point cloud upsampling dataset with tactile information, named TSR-PD, based on the 3D reconstruction dataset proposed by Smith et al.[19].

    3. Method

    3.1. Overview

    Essentially, 3D scanning is a sampling problem in 3D physical space, while upsampling is a predictive task that infers additional samples on the original surface, particularly when the samples obtained during scanning are sparse. Given a point set $P \in \mathbb{R}^{N \times 3}$, point cloud upsampling with an upsampling ratio $r$ produces a dense point cloud $Q \in \mathbb{R}^{rN \times 3}$ based on the geometric information of the sparse point cloud $P$. The resulting high-resolution point cloud $Q$ should have uniformly distributed points and lie close to the underlying ground-truth surface. Considering the inherent complexity and ill-posed nature of this process, compounded by the incomplete representation of the original geometry, a high upsampling rate $r$ introduces even greater challenges. Different from existing methods, which focus primarily on the low-resolution point cloud itself, our approach integrates locally dense tactile information to generate high-resolution point clouds that accurately represent the underlying object surface. This is particularly relevant for industrial robots that rely on both 3D scanners and tactile sensors.

    The overall framework of our method is shown in Fig. 1. We formulate the point cloud upsampling model as $Q = f(P, T)$, where $f(\cdot)$ represents our tactile-assisted point cloud super-resolution network (TAPSR-Net). Given a sparse point cloud $P \in \mathbb{R}^{N \times 3}$ and a dense tactile point cloud $T \in \mathbb{R}^{N \times 3}$, the proposed TAPSR-Net generates a dense point cloud $Q \in \mathbb{R}^{rN \times 3}$. First, the feature extraction block extracts preliminary feature maps $F_p$ and $F_t$ from the input. Then we feed them into the feature fusion block to produce the fused feature map $F_f$. Next, we use the transformer encoder[29] to process $P$ and the fused feature $F_f$, thereby refining the feature map. Finally, we reconstruct points from the latent space back to the coordinate space, resulting in a denser point cloud of size $rN \times 3$.
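
    To make the data flow concrete, the following PyTorch sketch wires the four stages together with simple stand-in MLPs; the module choices, channel widths, and single-touch simplification are illustrative assumptions rather than the exact TAPSR-Net implementation.

```python
import torch
import torch.nn as nn

class TAPSRNetSketch(nn.Module):
    """Wiring of the four stages in Fig. 1; every stage is a stand-in MLP here."""
    def __init__(self, r=16, c=128):
        super().__init__()
        self.r = r
        # Feature extraction (Sec. 3.2): visual features use 4x the tactile channels.
        self.extract_p = nn.Sequential(nn.Linear(3, c), nn.ReLU(), nn.Linear(c, c))
        self.extract_t = nn.Sequential(nn.Linear(3, c // 4), nn.ReLU(), nn.Linear(c // 4, c // 4))
        # Feature fusion (Sec. 3.3), transformer encoding (Sec. 3.4), reconstruction (Sec. 3.5).
        self.fuse = nn.Sequential(nn.Linear(c + c // 4, c), nn.ReLU(), nn.Linear(c, c))
        self.encode = nn.Sequential(nn.Linear(c + 3, c), nn.ReLU(), nn.Linear(c, c))
        self.reconstruct = nn.Linear(c, r * 3)

    def forward(self, P, T):
        # P: (B, N, 3) sparse point cloud; T: (B, N, 3) tactile point cloud (one touch only)
        Fp = self.extract_p(P)                        # (B, N, C)    preliminary visual features
        Ft = self.extract_t(T)                        # (B, N, C/4)  preliminary tactile features
        Ff = self.fuse(torch.cat([Fp, Ft], dim=-1))   # (B, N, C)    fused feature map
        F  = self.encode(torch.cat([Ff, P], dim=-1))  # (B, N, C)    refined feature map
        Q  = self.reconstruct(F)                      # (B, N, r*3)
        return Q.reshape(P.shape[0], -1, 3)           # (B, r*N, 3)  dense point cloud

net = TAPSRNetSketch(r=16)
Q = net(torch.randn(2, 512, 3), torch.randn(2, 512, 3))
print(Q.shape)  # torch.Size([2, 8192, 3])
```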


    Figure 1. An illustration of the tactile-assisted framework. Given sparse point cloud P with N points and touch point cloud T with N points, the feature extraction block extracts feature maps Fp and Ft from the input, then feeds them into the feature fusion block, where Ft and Fp are merged to produce the fused feature map Ff. Next, the transformer encoder consumes both P and the fused feature Ff to refine the feature map, then a high-resolution point cloud Q is obtained through coordinate reconstruction.

    3.2. Feature extraction block

    As shown in Fig. 2, given $P \in \mathbb{R}^{N \times 3}$ and $T \in \mathbb{R}^{N \times 3}$, our feature extraction block produces preliminary feature maps $F_p \in \mathbb{R}^{N \times mC}$ and $F_t \in \mathbb{R}^{N \times C}$, where $C$ is the number of feature channels and $m$ is the ratio of visual (low-resolution) feature channels to tactile feature channels. Inspired by the application of U-Net[40] to point clouds and images, we employ cross-layer connections to retain visual and tactile information at different scales and use this structure for initial feature extraction. Specifically, the feature extraction block downsamples twice with max-pooling layers and upsamples twice with deconvolutions, with skip connections between corresponding stages; this ensures that the recovered feature map incorporates more low-level features and enables the fusion of point cloud features at different scales. The tactile and visual point clouds are processed in the same way to produce feature maps, and the difference between $F_p$ and $F_t$ lies only in the number of feature channels. In this work, we empirically determined that the optimal feature dimension for the visual point cloud is 4 times that of the tactile features ($m = 4$), which allows the tactile point cloud to better assist during feature fusion without excessively diverting attention to tactile information.
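
    A simplified reading of this block is sketched below: two max-pooling downsamplings over the point dimension, two deconvolution upsamplings, and skip connections at matching stages. The kernel sizes and channel widths are assumptions, and the same block is instantiated with mC channels for the visual cloud and C channels for the tactile cloud.

```python
import torch
import torch.nn as nn

class FEBlockSketch(nn.Module):
    """U-Net-style extractor over the point dimension: two max-pool downsamplings,
    two deconvolution upsamplings, and skip connections at matching stages."""
    def __init__(self, c_out=128):
        super().__init__()
        c = c_out
        self.enc1 = nn.Sequential(nn.Conv1d(3, c, 1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv1d(c, c, 1), nn.ReLU())
        self.pool = nn.MaxPool1d(2)
        self.bottom = nn.Sequential(nn.Conv1d(c, c, 1), nn.ReLU())
        self.up1 = nn.ConvTranspose1d(c, c, kernel_size=2, stride=2)
        self.dec1 = nn.Sequential(nn.Conv1d(2 * c, c, 1), nn.ReLU())
        self.up2 = nn.ConvTranspose1d(c, c, kernel_size=2, stride=2)
        self.dec2 = nn.Sequential(nn.Conv1d(2 * c, c, 1), nn.ReLU())

    def forward(self, pts):
        # pts: (B, N, 3) -> per-point features: (B, N, c_out)
        x = pts.transpose(1, 2)            # (B, 3, N)
        e1 = self.enc1(x)                  # (B, C, N)    skip connection 1
        e2 = self.enc2(self.pool(e1))      # (B, C, N/2)  skip connection 2
        b = self.bottom(self.pool(e2))     # (B, C, N/4)
        d1 = self.dec1(torch.cat([self.up1(b), e2], dim=1))   # (B, C, N/2)
        d2 = self.dec2(torch.cat([self.up2(d1), e1], dim=1))  # (B, C, N)
        return d2.transpose(1, 2)

# The same block serves both inputs, e.g., mC = 128 channels for the visual cloud
# and C = 32 channels for the tactile cloud (m = 4).
Fp = FEBlockSketch(c_out=128)(torch.randn(2, 512, 3))
Ft = FEBlockSketch(c_out=32)(torch.randn(2, 512, 3))
print(Fp.shape, Ft.shape)  # torch.Size([2, 512, 128]) torch.Size([2, 512, 32])
```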


    Figure 2. The architecture of the feature extraction block (FE block).

    3.3. Feature fusion block

    Many approaches exploiting vision and touch for 3D tasks, especially 3D reconstruction, process visual and tactile information independently[20,39]. Specifically, researchers have used touch and partial depth maps separately to predict independent voxel models, which were then combined to produce a final prediction. This strategy effectively leverages tactile information and has proven beneficial. However, a principal limitation is that the tactile information only influences the small area that has been touched. Different from previous works that generate a preliminary result of the final target and gradually refine it using tactile feedback, we prefer to generate fused features that integrate both tactile and visual information. This enables us to learn both the local and global characteristics of the tactile information, allowing touch not only to enhance the upsampling at the touch site but also to extrapolate into its local neighborhood. We therefore introduce a feature fusion block, as shown in Fig. 3. We feed the tactile features $F_t$ and visual features $F_p$ into the network sequentially and iteratively optimize the visual features using the tactile features, finally obtaining the fused features $F_f$.

    Require: a low-resolution point cloud feature f_p; touch point cloud features f_t^i, i = 1, 2, …, α
    Ensure: a fused feature f_f
    1: for i ∈ {1, 2, …, α} do
    2:     if i == 1 then
    3:         f_i = f_p
    4:     end if
    5:     f_{i+1} = FeaFus(f_p, f_t^i, f_i)
    6: end for
    7: f_f = f_{i+1}
    8: return f_f

    Table 1. Feature Fusion Pipeline


    Figure 3. The architecture of the feature fusion block (FF block). In this module, we iteratively fuse tactile features Ft into visual features Fp, ultimately obtaining the fused features Ff. Specifically, during the first fusion of tactile features, the input features are the initial point cloud features Fp.

    In Algorithm 1 (Table 1), we present the basic operations employed to build our feature fusion block. We first feed the low-resolution point cloud feature $f_p$ and $\alpha$ tactile features $f_t^i$ into the proposed module, where $\alpha$ represents the number of tactile iterations. Each iteration can be formulated as
    $$F_m = \mathrm{MLP}\left[\mathrm{Concat}(f_t^i, f_i)\right], \qquad f_f = \mathrm{Deconv}\left[\mathrm{Concat}(F_m, f_p)\right].$$

    As discussed in Sec. 3.2, so that the tactile cues can better assist the visual information, the dimension of the visual features is 4 times that of the tactile features. When the first tactile iteration is added, the feature $f_p$ is used directly as the input feature $f_1$. In each iteration, we concatenate the current feature $f_i$ with the tactile feature $f_t^i$ to obtain a refined point feature; following the common approach of applying MLPs (often followed by max-pooling), the concatenated feature is passed through MLPs to produce $F_m$. We then concatenate $F_m$ with the low-resolution point cloud feature $f_p$ and perform a deconvolution to obtain the fused feature $f_f$. When the next tactile feature is merged, this fused feature is fed back into the module as the input $f_i$. Repeating this step yields the final fused feature map.

    The design of the fusion module combines low-resolution point cloud features with tactile features for more accurate feature correction. In each fusion process, the module utilizes low-resolution point cloud features as the basis and gradually introduces tactile features as auxiliary information. This gradual fusion not only helps improve the accuracy of the overall features, but also effectively corrects possible errors. At the same time, the number of input and output channels of the module is set to a fixed value, and this design makes the module highly flexible and scalable. As the number of tactile touches increases, we can easily connect multiple fusion modules by cascading to further enhance the fusion effect of the features.
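
    The sketch below illustrates one fusion iteration and the cascade over α touches, following the two equations above and the pipeline in Table 1; the MLP depth and the kernel-size-1 deconvolution are assumptions rather than the exact layer configuration.

```python
import torch
import torch.nn as nn

class FeaFusSketch(nn.Module):
    """One fusion iteration: F_m = MLP[Concat(f_t^i, f_i)], f_f = Deconv[Concat(F_m, f_p)]."""
    def __init__(self, c_vis=128, c_tac=32):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv1d(c_vis + c_tac, c_vis, 1), nn.ReLU(),
            nn.Conv1d(c_vis, c_vis, 1), nn.ReLU())
        # kernel/stride 1 keeps the number of points fixed; the deconv only mixes channels
        self.deconv = nn.ConvTranspose1d(2 * c_vis, c_vis, kernel_size=1)

    def forward(self, f_p, f_t, f_i):
        # f_p, f_i: (B, c_vis, N); f_t: (B, c_tac, N)
        f_m = self.mlp(torch.cat([f_t, f_i], dim=1))
        return self.deconv(torch.cat([f_m, f_p], dim=1))   # fused feature (B, c_vis, N)

def fuse_all(f_p, f_t_list, blocks):
    """Cascade over alpha touches as in the pipeline of Table 1: f_1 = f_p."""
    f_i = f_p
    for f_t, block in zip(f_t_list, blocks):
        f_i = block(f_p, f_t, f_i)
    return f_i

blocks = nn.ModuleList([FeaFusSketch() for _ in range(4)])          # four touches
f_f = fuse_all(torch.randn(2, 128, 512),
               [torch.randn(2, 32, 512) for _ in range(4)], blocks)
print(f_f.shape)  # torch.Size([2, 128, 512])
```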

    3.4. Transformer encoder

    The PU-Transformer[29] encodes a preliminary feature map into a more comprehensive feature map via cascaded transformer encoders, each containing a positional fusion block and a shifted-channel multi-head self-attention (SC-MSA) block, fully exploiting the transformer’s power in feature representation. We therefore employ the same cascaded transformer encoder as the PU-Transformer to encode a more comprehensive feature map. After extracting the fused features, we use the fused feature map along with the inherent 3D coordinates as the input to the transformer encoder.

    The transformer encoder utilizes a positional fusion block to encode and combine the provided 3D coordinates $P \in \mathbb{R}^{N \times 3}$ and the fused feature map $F_f \in \mathbb{R}^{N \times C}$ of the point cloud, following the local geometric relationships between scattered points. The positional fusion block not only encodes positional information for a set of unordered points, facilitating the transformer’s processing, but also aggregates comprehensive local details for precise point cloud upsampling. The main purpose of the SC-MSA block is to enhance inter-point relations in a multi-head fashion and to strengthen channel-wise connections by introducing overlapping channels between consecutive heads.
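
    As a simplified stand-in for one encoder stage, the snippet below reduces positional fusion to an MLP over the concatenated coordinates and features, and replaces SC-MSA with standard multi-head self-attention (the shifted-channel overlap between heads is omitted); the exact positional fusion and SC-MSA blocks follow the PU-Transformer[29].

```python
import torch
import torch.nn as nn

class EncoderSketch(nn.Module):
    """Simplified encoder stage: MLP-based positional fusion + plain multi-head attention."""
    def __init__(self, c=128, heads=4):
        super().__init__()
        self.pos_fuse = nn.Sequential(nn.Linear(c + 3, c), nn.ReLU(), nn.Linear(c, c))
        self.attn = nn.MultiheadAttention(embed_dim=c, num_heads=heads, batch_first=True)
        self.norm = nn.LayerNorm(c)

    def forward(self, coords, feats):
        # coords: (B, N, 3); feats: (B, N, C)
        x = self.pos_fuse(torch.cat([feats, coords], dim=-1))  # fold positions into features
        a, _ = self.attn(x, x, x)                              # inter-point relations
        return self.norm(x + a)                                # residual connection + norm

out = EncoderSketch()(torch.randn(2, 512, 3), torch.randn(2, 512, 128))
print(out.shape)  # torch.Size([2, 512, 128])
```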

    3.5. Coordinate reconstruction

    We reconstruct points from the latent space to the coordinate space, resulting in the desired denser point cloud of size $rN \times 3$, where $r$ represents the upsampling scale. The output of the last transformer encoder serves as the input and undergoes a periodic shuffling operation[31] to reorganize the channels and create a dense feature map. Finally, an MLP is applied to estimate the 3D coordinates of the upsampled point cloud $Q \in \mathbb{R}^{rN \times 3}$.
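
    A minimal sketch of this reconstruction step is shown below: the channels are widened, periodically shuffled into r × N points, and mapped to 3D coordinates by an MLP; the widening layer and channel width are assumptions.

```python
import torch
import torch.nn as nn

class CoordReconSketch(nn.Module):
    """Widen channels, periodically shuffle them into r x N points, then regress 3D coordinates."""
    def __init__(self, c=128, r=16):
        super().__init__()
        self.r = r
        self.expand = nn.Linear(c, r * c)  # widen so channels can be shuffled into r new points
        self.mlp = nn.Sequential(nn.Linear(c, 64), nn.ReLU(), nn.Linear(64, 3))

    def forward(self, feats):
        # feats: (B, N, C) -> dense coordinates: (B, r*N, 3)
        B, N, C = feats.shape
        dense = self.expand(feats).reshape(B, N * self.r, C)  # periodic shuffle of channels
        return self.mlp(dense)

Q = CoordReconSketch(c=128, r=16)(torch.randn(2, 512, 128))
print(Q.shape)  # torch.Size([2, 8192, 3])
```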

    4. Experiments

    To validate the effectiveness of tactile information in point cloud super-resolution tasks, we build a tactile-assisted point cloud super-resolution dataset, named the tactile super-resolution point cloud dataset (TSR-PD), based on the 3D reconstruction dataset introduced by Smith et al.[19]. We also conduct ablation studies to show the benefits of the proposed modules, specifically the two key components: the feature extraction block and the feature fusion block.

    4.1. Datasets

    Smith et al.[19] built a dataset that captures the interactions between a robotic hand and the object it is touching for 3D reconstruction. They loaded example objects into the 3D robotics simulator PyBullet[41], placed the hand randomly on the object's surface, and then closed its fingers using inverse kinematics to produce contact between the sensors and points on the object. Touch signals were obtained by simulating the DIGIT sensing principle[15]. In this paper, we sample 8192 points from each object model in the ABC dataset[42] as the ground truth, which are then downsampled to 512 points to produce the low-resolution point cloud. Figure 4 depicts an object from our dataset. Tactile point clouds are generated by the tactile sensors, and the number of points in a tactile point cloud depends on the contact area with the sensor; a tactile point cloud containing at least 1000 points is regarded as a successful touch. These tactile point clouds are first aligned with the object point cloud using Euler angles and then downsampled to 512 points to serve as the tactile input to our network. The dataset comprises a total of 12,732 samples, covers a large semantic range of 3D objects, and includes both simple and complex shapes to evaluate the generalization capability of the model.
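
    As an illustration, the low-resolution inputs can be produced with a standard routine such as farthest point sampling; the sketch below is one such implementation and is not tied to the exact preprocessing used for TSR-PD.

```python
import torch

def farthest_point_sample(points, k):
    """Downsample an (N, 3) point cloud to k points via farthest point sampling."""
    N = points.shape[0]
    idx = torch.zeros(k, dtype=torch.long)
    dist = torch.full((N,), float("inf"))
    idx[0] = torch.randint(N, (1,)).item()           # random seed point
    for i in range(1, k):
        # distance of every point to the most recently selected point
        d = ((points - points[idx[i - 1]]) ** 2).sum(dim=1)
        dist = torch.minimum(dist, d)                # distance to the selected set
        idx[i] = torch.argmax(dist)                  # pick the farthest remaining point
    return points[idx]

gt = torch.randn(8192, 3)                            # stand-in for one object at 8192 points
low_res = farthest_point_sample(gt, 512)
print(low_res.shape)                                 # torch.Size([512, 3])
```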


    Figure 4. An object from the TSR-PD, where (a) represents the high-resolution point cloud (GT), (b) corresponds to the low-resolution point cloud (blue) and tactile information (red) for 5 touches, and (c) depicts the point cloud from one tactile interaction.

    4.2. Loss function and implementation details

    4.2.1. Loss function

    We use the Chamfer distance (CD) loss to minimize the distance between the predicted point cloud and the ground truth in our experiments:
    $$C(P, Q) = \frac{1}{|P|}\sum_{p \in P}\min_{q \in Q}\|p - q\|_2^2 + \frac{1}{|Q|}\sum_{q \in Q}\min_{p \in P}\|p - q\|_2^2,$$
    where $P$ is the predicted point cloud, $Q$ is the ground truth, $p$ is a 3D point from $P$, and $q$ is a 3D point from $Q$. The first term is the average of the minimum squared distances from each point $p$ in $P$ to $Q$; the second term is the average of the minimum squared distances from each point $q$ in $Q$ to $P$. A larger CD indicates a greater difference between the two point sets, while a smaller CD indicates that the predicted point cloud is closer to the ground truth, i.e., a more accurate prediction.
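
    The loss above can be implemented directly with a brute-force pairwise distance matrix, as in the sketch below; for 8192-point clouds, a chunked or nearest-neighbor-based variant is typically preferred to limit memory.

```python
import torch

def chamfer_distance(P, Q):
    """Symmetric Chamfer distance between P (B, N, 3) and Q (B, M, 3), as in the equation above."""
    diff = P.unsqueeze(2) - Q.unsqueeze(1)   # (B, N, M, 3) pairwise differences
    d2 = (diff ** 2).sum(dim=-1)             # (B, N, M) squared Euclidean distances
    cd = (d2.min(dim=2).values.mean(dim=1)   # average nearest-neighbor distance P -> Q
          + d2.min(dim=1).values.mean(dim=1))  # average nearest-neighbor distance Q -> P
    return cd.mean()                         # averaged over the batch

loss = chamfer_distance(torch.randn(4, 1024, 3), torch.randn(4, 1024, 3))
print(loss.item())
```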

    4.2.2. Implementation details

    We train our network on a single NVIDIA RTX A6000 GPU running Linux in all experiments. For the training hyperparameters, we use a batch size of 32 and train for 300 epochs, optimizing with Adam at an initial learning rate of 0.001 and a decay rate of 0.7. As in previous work, we apply point cloud normalization and augmentation[29] (rotation, scaling, and random perturbations). We report results using a 16× upsampling rate, i.e., r = 16.
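
    These settings correspond to a standard PyTorch training configuration, sketched below; the decay interval (every 50 epochs here) is an assumption, since only the decay rate is specified above.

```python
import torch

model = torch.nn.Linear(3, 3)  # placeholder for TAPSR-Net
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)                         # initial lr 0.001
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.7)   # 0.7 decay rate

for epoch in range(300):                      # 300 epochs, batch size 32 per iteration
    # ... iterate over TSR-PD batches, compute the Chamfer loss, and backpropagate ...
    scheduler.step()
```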

    4.3. Quantitative and qualitative results

    Table 1 shows the quantitative comparisons on the dataset under an upsampling rate of r = 16. The input low-resolution point cloud consists of 512 points, each tactile point cloud contains 512 points, and the output high-resolution point cloud comprises 8192 points. The evaluation metrics are the CD, the Hausdorff distance (HD), and the Earth mover's distance (EMD). We can see that, for this task, incorporating tactile information yields better results than not using tactile information.


    However, we found that more touches do not always yield better upsampling. As the number of tactile iterations increases, the rate of decrease in CD and HD slows down, and when a fifth tactile iteration is added, CD and HD actually increase compared with four iterations; the EMD reaches its minimum at the second tactile iteration. This observation indicates that there is an upper limit to the beneficial effect of tactile feedback on vision. We attribute this to imperfect tactile data: the tactile point clouds obtained by simulating the DIGIT principle may contain errors. Within a certain range, the gain from the added tactile information outweighs the errors it introduces, so upsampling performance improves as the number of touches increases. Beyond this range, repeatedly adding tactile information accumulates errors, which corrupts the overall point cloud features and degrades the upsampling performance. In addition, multiple tactile inputs cause the network to focus excessively on local tactile information and reduce its sensitivity to the visual information that should dominate the point cloud features during upsampling.

    The qualitative results of different point cloud upsampling models are presented in Fig. 5. Three object models were selected for visualization, and local regions were magnified for closer examination. From left to right, the three columns represent the joint, arch, and lamp post, respectively. The first row shows the GT with 8192 points. The second row shows the low-resolution point cloud input to the network (512 points), with the red regions indicating four densely arranged tactile point clouds; each tactile point cloud also consists of 512 points. The third row shows the upsampled points (8192 points) without incorporating tactile information. The fourth row shows the upsampled points (8192 points) produced by the proposed model with four instances of tactile information incorporated.


    Figure 5. Comparing point set upsampling (16×) results from sparse inputs with and without tactile information using 512 input points. Among them are (a) joint, (b) arch, and (c) lamp post. The first row is the input low-resolution point cloud, the second row is the reconstructed point cloud without tactile information, the third row is the reconstructed point cloud with tactile information, and the fourth row is GT.

    Comparing the dense points produced in the cases of incorporating tactile information and not incorporating tactile information, we can see that the method without adding tactile information tends to introduce excessive noise [e.g., Fig. 5(a)], cluster points together with a non-uniform distribution [e.g., Fig. 5(b)], or destroy some tiny structures [e.g., Fig. 5(c)] in the results. In contrast, our method incorporating tactile information produces the most similar visual results to the target points, and our dense points can well preserve tiny local structures with a uniform point distribution. Consequently, it can be inferred that the addition of tactile information not only impacts the local effects during point cloud upsampling but also integrates tactile cues as part of the global information, influencing the overall results. This can be particularly observed in the magnified views depicted in Fig. 5.

    To further validate the effectiveness of tactile fusion and the superiority of the proposed algorithm, we employed several point cloud super-resolution algorithms for training and testing on the TSR-PD dataset, including PU-GAN[27], PU-GCN[28], Grad-PU[43], and PU-Transformer[29]. To provide a more intuitive representation of the differences in high-resolution point clouds obtained by different algorithms, we conducted 16× point cloud super-resolution experiments using models trained in this section on the test set and performed qualitative analysis.

    As shown in Fig. 6, we selected a buckle for visualization. Figure 6(a) represents the original high-resolution point cloud, containing 8192 points. Figure 6(b) depicts the low-resolution point cloud, containing 512 points, with the shaded regions representing the dense tactile point clouds used by our algorithm, where each tactile point cloud comprises 512 points. Figures 6(c), 6(d), and 6(e) show the high-resolution point clouds reconstructed by PU-GCN, Grad-PU, and the PU-Transformer[29], respectively. Figure 6(f) shows the high-resolution point cloud reconstructed by our TAPSR-Net, incorporating four instances of tactile information.


    Figure 6. Visualization results of different algorithms for upsampling on the same objects (a). We show the 16× upsampled results of (b) input point clouds (512 points) when processed by different upsampling methods: (c) PU-GCN[28], (d) Grad-PU[43], (e) PU-Transformer[29], and (f) TAPSR.

    We can observe that the comparison algorithms achieve a certain degree of super-resolution relative to the input low-resolution point cloud, but the results remain unsatisfactory. For example, as shown in Fig. 6(c), the PU-GCN algorithm reconstructs the buckle's outline but generates numerous outlier points on the left side of the buckle. Conversely, the result in Fig. 6(d) contains fewer outlier points, but parts of the buckle's outline are missing. In comparison, our algorithm reconstructs the buckle's outline more faithfully with fewer outliers, indicating that the additional local information provided by the tactile point cloud effectively assists the point cloud super-resolution task.

    As shown in Table 2, we compared the quantitative performance of different algorithms in point cloud super-resolution on the TSR-PD test set. Compared to other point cloud super-resolution algorithms that do not incorporate tactile information, our model achieves optimal super-resolution performance, exhibiting the lowest CD, HD, and EMD. This indicates the effectiveness of utilizing tactile information to assist point cloud super-resolution through iterative fusion.

    Method                          CD       HD       EMD
    PU-GAN[27]                      4.634    9.219    13.672
    PU-GCN[28]                      3.009    8.751    10.576
    Grad-PU[43]                     2.464    6.308    9.582
    PU-Transformer[29]              1.162    3.724    5.421
    Ours (number of touches = 4)    0.671    3.291    5.099

    Table 2. Quantitative Comparisons to Other Methods on the TSR-PD

    Table 3 shows the quantitative comparisons on the dataset under different upsampling rates. All rates use the same model, which was designed primarily for 16× upsampling; accordingly, the best CD values are obtained at 16×. More importantly, at every upsampling rate, incorporating tactile information yields better results than not using tactile information.

    Rate    PU-Transformer[29]    Ours
    -       1.096                 0.832
    16×     1.162                 0.671
    32×     1.184                 0.895

    Table 3. Quantitative Comparisons Under Different Upsampling Rates Between the State-of-the-Art Work and Our Present Work

    4.4. Ablation study

    To evaluate the effectiveness of the major components in our framework, we conduct ablation studies on the feature extraction block and the feature fusion block. All models are trained and evaluated on TSR-PD. The results in Table 4 show the effectiveness of our fusion module when incorporating tactile information. Specifically, we remove the feature extraction module and the feature fusion module under different numbers of tactile iterations. In the first row, we removed both modules and directly concatenated the tactile point clouds, which decreased the quality of upsampling: because the dense local information disrupts the overall structure of the point cloud, the performance without tactile information is better than that with tactile information. In the second row, we removed only the feature fusion module and directly concatenated the sparse features with the tactile features. Clearly, our complete pipeline consistently achieves the best performance, with the lowest CD value across different numbers of tactile iterations. Furthermore, removing any component reduces the overall performance, meaning that each component of our framework contributes.

    FE block    FF block    Number of touches
                            0        1        2        3        4        5
    ×           ×           1.162    1.242    1.271    1.220    1.253    1.256
    ✓           ×           -        1.018    0.858    1.123    1.094    1.117
    ✓           ✓           -        0.953    0.791    0.716    0.671    0.778

    Table 4. Comparing the Upsampling Performance (CD) of Our Full Pipeline with Various Cases in the Ablation Study (r = 16)

    4.5. Training and testing efficiency

    Table 5 shows the time required to train the model for one epoch and the inference time required to reconstruct a high-resolution point cloud from a low-resolution one, for different numbers of touches. The training time increases somewhat with the number of touches, but the increase remains within a reasonable range.

    Number of touches    Training speed (per epoch)    Inference time (per sample)
    0                    76.68 s                       15.8 ms
    1                    78.35 s                       23.9 ms
    2                    79.39 s                       24.0 ms
    3                    80.35 s                       24.1 ms
    4                    82.18 s                       24.1 ms
    5                    83.45 s                       24.2 ms

    Table 5. Training Speed and Inference Time for the Model with Different Numbers of Touches (r = 16)

    5. Conclusion

    In this paper, we propose a tactile-assisted point cloud upsampling method. Specifically, we design a feature fusion module that effectively leverages information from both the tactile and visual modalities. With the assistance of tactile information, our approach significantly improves the quality of point cloud upsampling, both quantitatively and qualitatively, compared with using visual information only, since the tactile point cloud data contributes to both the local and global features of the objects.

    The network in this paper is built on the PU-Transformer[29] architecture, but the idea of utilizing multimodal information to assist point cloud upsampling can be applied to other upsampling networks as well. We note that a number of new feature extraction methods have been proposed, and we plan to explore them in future work[44]. Additionally, we expect to further explore visual-tactile fusion and expand its applicability to high-level 3D vision tasks.

    References

    [1] H. Liu, J. Luo, P. Wu et al. People perception from RGB-D cameras for mobile robots. 2015 IEEE International Conference on Robotics and Biomimetics (ROBIO), 2020(2015).

    [2] F. Endres, J. Hess, J. Sturm et al. 3-D mapping with an RGB-D camera. IEEE Trans. Rob., 30, 177(2013).

    [3] L. Caltagirone, M. Bellone, L. Svensson et al. Lidar–camera fusion for road detection using fully convolutional neural networks. Rob. Auton. Syst., 111, 125(2019).

    [4] R. Qian, X. Lai, X. Li. 3D object detection for autonomous driving: A survey. Pattern Recognit., 130, 108796(2022).

    [5] F. Lafarge, C. Mallet. Creating large-scale city models from 3D-point clouds: a robust approach with hybrid representation. Int. J. Comput. Vis., 99, 69(2012).

    [6] C. R. Qi, H. Su, K. Mo et al. PointNet: deep learning on point sets for 3D classification and segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 652(2017).

    [7] C. R. Qi, L. Yi, H. Su et al. PointNet++: Deep hierarchical feature learning on point sets in a metric space. Advances in Neural Information Processing Systems(2017).

    [8] Y. Wang, Y. Sun, Z. Liu et al. Dynamic graph CNN for learning on point clouds. ACM Trans. Graph., 38, 1(2019).

    [9] H. Su, V. Jampani, D. Sun et al. Splatnet: Sparse lattice networks for point cloud processing. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2530(2018).

    [10] Q. Hu, B. Yang, L. Xie et al. RandLA-Net: Efficient semantic segmentation of large-scale point clouds. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 11108(2020).

    [11] C. R. Qi, X. Chen, O. Litany et al. ImVoteNet: boosting 3D object detection in point clouds with image votes. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4404(2020).

    [12] Y. Zhou, O. Tuzel. VoxelNet: end-to-end learning for point cloud based 3D object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4490(2018).

    [13] N. Carion, F. Massa, G. Synnaeve et al. End-to-end object detection with transformers. European Conference on Computer Vision, 213(2020).

    [14] W. Yuan, S. Dong, E. H. Adelson. GelSight: high-resolution robot tactile sensors for estimating geometry and force. Sensors, 17, 2762(2017).

    [15] M. Lambeta, P.-W. Chou, S. Tian et al. DIGIT: a novel design for a low-cost compact high-resolution tactile sensor with application to in-hand manipulation. IEEE Rob. Autom. Lett., 5, 3838(2020).

    [16] M. Björkman, Y. Bekiroglu, V. Högman et al. Enhancing visual perception of shape through tactile glances. 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, 3180(2013).

    [17] G. Z. Gandler, C. H. Ek, M. Björkman et al. Object shape estimation and modeling, based on sparse Gaussian process implicit surfaces, combining visual data and tactile exploration. Rob. Auton. Syst., 126, 103433(2020).

    [18] J. Ilonen, J. Bohg, V. Kyrki. Three-dimensional object reconstruction of symmetric objects by fusing visual and tactile sensing. Int. J. Rob. Res., 33, 321(2014).

    [19] E. Smith, R. Calandra, A. Romero et al. 3D shape reconstruction from vision and touch. 34th Conference on Neural Information Processing Systems(2020).

    [20] E. Smith, D. Meger, L. Pineda et al. Active 3D shape reconstruction from vision and touch. 35th Conference on Neural Information Processing Systems(2021).

    [21] W. Yang, X. Zhang, Y. Tian et al. Deep learning for single image super-resolution: a brief review. IEEE Trans. Multimed., 21, 3106(2019).

    [22] Z. Wang, J. Chen, S. C. H. Hoi. Deep learning for image super-resolution: a survey. IEEE Trans. Pattern Anal. Mach. Intell., 43, 3365(2020).

    [23] C. Dong, C. C. Loy, K. He et al. Learning a deep convolutional network for image super-resolution. Computer Vision–ECCV 2014: 13th European Conference, 184(2014).

    [24] B. Lim, S. Son, H. Kim et al. Enhanced deep residual networks for single image super-resolution. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 136(2017).

    [25] Y. Zhang, K. Li, K. Li et al. Image super-resolution using very deep residual channel attention networks. Proceedings of the European Conference on Computer Vision (ECCV), 286(2018).

    [26] Y. Zhang, H. Wang, C. Qin et al. Aligned structured sparsity learning for efficient image super-resolution. Advances in Neural Information Processing Systems(2021).

    [27] R. Li, X. Li, C.-W. Fu et al. PU-GAN: a point cloud upsampling adversarial network. Proceedings of the IEEE/CVF International Conference on Computer Vision, 7203(2019).

    [28] G. Qian, A. Abualshour, G. Li et al. PU-GCN: point cloud upsampling using graph convolutional networks. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 11683(2021).

    [29] S. Qiu, S. Anwar, N. Barnes. PU-Transformer: point cloud upsampling transformer. Proceedings of the Asian Conference on Computer Vision, 2475(2022).

    [30] R. Li, X. Li, P.-A. Heng et al. Point cloud upsampling via disentangled refinement. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 344(2021).

    [31] W. Shi, J. Caballero, F. Huszar et al. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)(2016).

    [32] M. Alexa, J. Behr, D. Cohen-Or et al. Computing and rendering point set surfaces. IEEE Trans. Vis. Comput. Graph., 9, 3(2003).

    [33] H. Huang, D. Li, H. Zhang et al. Consolidation of unorganized point clouds for surface reconstruction. ACM Trans. Graph., 28, 1(2009).

    [34] Y. Lipman, D. Cohen-Or, D. Levin et al. Parameterization-free projection for geometry reconstruction. ACM Trans. Graph., 26, 22(2007).

    [35] Y. Li, R. Bu, M. Sun et al. PointCNN: convolution on x-transformed points. Advances in Neural Information Processing Systems(2018).

    [36] L. Yu, X. Li, C.-W. Fu et al. PU-Net: point cloud upsampling network. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2790(2018).

    [37] W. Yifan, S. Wu, H. Huang et al. Patch-based progressive 3D point set upsampling. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5958(2019).

    [38] S. Wang, J. Wu, X. Sun et al. 3D shape perception from monocular vision, touch, and shape priors. 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 1606(2018).

    [39] L. Rustler, J. Lundell, J. K. Behrens et al. Active visuo-haptic object shape completion. IEEE Rob. Autom. Lett., 7, 5254(2022).

    [40] O. Ronneberger, P. Fischer, T. Brox. U-Net: convolutional networks for biomedical image segmentation. Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, 234(2015).

    [41] E. Coumans, Y. Bai. Pybullet, a python module for physics simulation for games, robotics and machine learning(2016).

    [42] S. Koch, A. Matveev, Z. Jiang et al. ABC: a big CAD model dataset for geometric deep learning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9601(2019).

    [43] Y. He, D. Tang, Y. Zhang et al. Grad-PU: arbitrary-scale point cloud upsampling via gradient descent with learned distance functions. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5354(2023).

    [44] X. Ma, Y. Zhou, H. Wang et al. Image as set of points. The Eleventh International Conference on Learning Representations(2023).
