Compression and denoising of time-resolved light transport

Exploiting temporal information of light propagation captured at ultra-fast frame rates has enabled applications such as reconstruction of complex hidden geometry and vision through scattering media. However, these applications require high-dimensional and high-resolution transport data, which introduces significant performance and storage constraints. Additionally, due to different sources of noise in both captured and synthesized data, the signal becomes significantly degraded over time, compromising the quality of the results. In this work, we tackle these issues by proposing a method that extracts meaningful sets of features to accurately represent time-resolved light transport data. Our method reduces the size of time-resolved transport data up to a factor of 32, while significantly mitigating variance in both temporal and spatial dimensions. © 2020 Optical Society of America

Transient imaging methods [1] typically exploit time-resolved data on the order of nano- [2] to femtoseconds [3], involving spatiotemporal data structures to represent light propagation. Related applications such as reconstruction of hidden geometry [4,5] require exhaustive scans of the scene at multiple camera and light locations, resulting in five-dimensional data. Monte Carlo methods for transient rendering [6,7] allow accurate simulation of time-resolved light transport. As such, they have become a helpful instrument for analysis and benchmarking, and for use as a data source for machine learning approaches [8,9]. This increased dimensionality and high temporal resolution yield massive discretized representations of light transport that hamper efficiency in practical applications. While methods to increase computational performance exist [10], memory and bandwidth are still limiting constraints. Moreover, these sorts of time-resolved signals are degraded by either the attenuation of captured light or the variance in Monte Carlo simulations. Therefore, noise removal and reconstruction algorithms become key to developing robust imaging methods. Feature extraction and representation in alternative domains have been extensively used for reconstruction and compression of different types of signals. There exists a wide variety of encoding and fast decoding methods for low-dynamic-range image and video data, where exploiting frequency characteristics predominates in the most widespread compression algorithms [11]. Closer to our domain of application, representing time-resolved light transport by a combination of Gaussians and exponential functions has proved useful for applications such as illumination decomposition [12] and imaging in scattering media [13].
However, while compression and denoising methods have been extensively researched for steady-state images and video, time-resolved light transport has distinctive properties that we exploit in this Letter. First, light propagation is heavily structured in both time and space: the magnitude and frequency of the signal decrease over time due to multiple convolutions and attenuations of scattered light (see Fig. 1, right); moreover, temporal propagation is strongly correlated to spatial features of the scene, since light time-of-flight depends in part on the optical paths through the scene. Second, due to temporal delays in light propagation, similar temporal patterns can occur at different times. In Fig. 1 (blue, red, yellow), we can see how the temporal delay of the initial peak is directly proportional to the depth at different points of the scene. Finally, time-resolved transport is particularly prone to noise, due to either signal attenuation in captured data or slow convergence rates in simulation (see Fig. 1, right). These characteristics pose several challenges when finding alternative representations of time-resolved light transport. We take into account all these aspects to design a method for compressing and recovering transient light transport data based on encoder-decoder neural networks. We leverage existing databases [8] to learn sets of spatiotemporal features and build lightweight representations of time-resolved transport up to 32 times smaller than the original signal. This work is a formalization and continuation of our preliminary results [14].
Let L_ω(t), t ∈ [0, ∞), be a function that represents time-resolved radiance in a scene from a viewing direction ω. While L_ω(t) is continuous, this function does not have closed-form solutions for general scenes. As a consequence, in practice, L_ω(t) is represented by a discrete set of radiance values (Fig. 1, middle). For simplicity, we will use L_ij(t) to refer to these discretized radiance profiles at positions {i, j} of a transient image H. In order to obtain accurate but small representations of time-resolved pixels L_ij(t) ∈ R^T, we analyze and exploit the aforementioned properties of transient light transport to introduce a compression and denoising method. Recent works [8,15,16] explicitly described the strong spatiotemporal correlation and convolutional nature of light transport. Inspired by this, we propose to use convolutional encoder-decoders to learn two mappings. First, we learn an encoding function E(·) to extract a set of features f_L from some discretized input data X:
$$f_L = E(g(X)) \qquad (1)$$
The function g(·) represents a transformation function applied to the input X. Second, we learn a decoding function D(·), such that
$$\hat{L}_{ij}(t) = g^{-1}(D(f_L)) \qquad (2)$$
which estimates the target time-resolved radiance L̂_ij(t) ≈ L_ij(t) based on the feature vector f_L. The resulting f_L of the encoding function will be the compressed representation of the signal L_ij(t). The choice of X is key to ensure that the encoding function E has enough information to obtain a feature vector f_L representative enough for the decoder D to accurately estimate L_ij(t). Functions g, E, and D must account for the aforementioned challenges of time-resolved radiance: exponential decay and reduced frequency over time, arbitrary propagation delays, and signal noise.
Finally, since the data can have arbitrary temporal resolution, it is desirable to handle temporal profiles of arbitrary length with the same compression ratio. We thus introduce several design choices on the input data X, the transformation function g, and the encoder and decoder operations E and D.
Input data. To leverage the local spatiotemporal coherence of light transport, we propose to use a time-resolved spatial neighborhood X ≡ 𝐋_ij centered at L_ij as input for the feature extraction step [Eq. (1)]. The time-resolved signal has a high dynamic range with exponential decay over time due to recursive light bounces. To prevent the encoding step from ignoring low-valued radiance features, we define a logarithmic transformation g over the input data as
$$g(x) = \log_{10}(\max(x, \varepsilon)) - \log_{10}(\varepsilon) \qquad (3)$$
The threshold ε and offset log_10(ε) ensure all resulting values are non-negative and prevent input values close to zero from going to infinity. In our experiments, not applying a logarithmic transformation made our optimization fall into local minima resulting in zero-valued outputs. We set a threshold of ε = 1e−7 based on radiance value distributions of our training and validation datasets. In practice, a neighborhood of size 9 × 9 allowed us to find enough spatiotemporal features while significantly mitigating noise in the recovered signal.
Fig. 2. Our proposed architecture. The encoder extracts a total of T/32 features f_L from a 9 × 9 × T spatial neighborhood in logarithmic space X = g(𝐋_ij), centered at the time-resolved pixel L_ij to compress. The decoding step uses these features to recover the time-resolved pixel L̂_ij = g^{-1}(Y) with a set of deconvolutions and residual convolution blocks.
Encoding step. To extract a set of representative features from the spatial neighborhood 𝐋_ij, we design a fully convolutional learnable encoding function E [Eq. (1)]. The function is composed of 3D convolutional filters (see Fig. 2, left) that operate over both spatial and temporal dimensions. These filters exploit spatiotemporal structures of light transport while simultaneously discarding noise in the signal. The fully convolutional nature of this function allows us to keep a constant compression ratio over arbitrary temporal resolutions. To enable this, the filters simultaneously perform the following operations: a) progressively reduce the size of the spatial dimensions to 1 × 1 in the innermost layer (i.e., the compressed signal) by controlling the padding over the fixed-size spatial neighborhood 𝐋_ij; b) sequentially apply strides of size two in the temporal dimension. Each layer of this function works similarly to a downsampling operation. However, since the filters are optimized by minimizing a loss between the estimated and reference signals, the encoding learns to extract the most representative features. Each element of the resulting vector f_L encodes features from a bounded time interval of the input 𝐋_ij (see Fig. 3, left). Note that while our encoding function is computationally expensive due to 3D convolution operations, it needs to be run only once per time-resolved pixel when compressing our signal. We design this function with five convolutional layers that generate a feature vector f_L 32 times smaller than the original signal L_ij(t) to be compressed. This compression ratio can be varied by retraining with different numbers of convolutional layers, but in practice, we found that this number provides a good trade-off among size reduction, denoising, and preservation of features.
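The shape bookkeeping of the encoding step can be sketched with the standard convolution output-length formula. This is only an illustration under stated assumptions: the text specifies five layers, temporal strides of two, and a spatial reduction from 9 × 9 to 1 × 1 controlled through padding, whereas the kernel size of 3, the temporal padding of 1, and the spatial padding schedule used below are hypothetical choices that happen to reproduce those figures. The function names are likewise illustrative.

```python
def conv_out_len(n, kernel, stride=1, padding=0):
    """Output length of a convolution along one axis (standard formula)."""
    return (n + 2 * padding - kernel) // stride + 1


def encoder_output_shape(spatial, temporal, spatial_padding=(0, 0, 0, 0, 1)):
    """Track the spatial side length and temporal length through the layers.

    Assumptions (not taken from the paper): kernel size 3 on every axis,
    temporal padding 1, and the given per-layer spatial padding schedule.
    Taken from the text: five layers, temporal stride of two per layer.
    """
    for pad in spatial_padding:
        # Spatial axes: stride 1, shrinking whenever no padding is applied.
        spatial = conv_out_len(spatial, kernel=3, stride=1, padding=pad)
        # Temporal axis: stride 2 halves the length at every layer.
        temporal = conv_out_len(temporal, kernel=3, stride=2, padding=1)
    return spatial, temporal


if __name__ == "__main__":
    for t in (1024, 2048, 4096):
        side, features = encoder_output_shape(9, t)
        print(f"T={t}: spatial {side}x{side}, {features} features, ratio {t // features}")
```

Because every layer is convolutional, the same halving applies for any temporal length, which is why the ratio of 2^5 = 32 does not depend on T in this sketch.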
Decoding step. Given a set of features f_L, we aim to learn a decoding function D [Eq. (2)] that estimates the target uncompressed signal L_ij. Note that we do not want to estimate the whole input neighborhood 𝐋_ij, but just the central time-resolved pixel L_ij. We design the function D to perform a set of 1D temporal deconvolutions and convolutions that operate over the features f_L extracted by the encoding step [Eq. (1)]. This step works as an upsampling operation with learnable 1D filters. Following previous works on deep residual nets [17], we apply residual connections between deconvolution blocks (see Fig. 2). The key aspect of our decoding function is that, by construction, it learns a nonlinear mapping between every feature and a corresponding time interval over the recovered signal. This ensures that our method can handle arbitrary propagation delays that yield similar radiance patterns at different positions along the temporal dimension. In Fig. 3, left, we illustrate this by changing the value of a single feature at different positions of f_L, resulting in equivalent temporal profiles over the corresponding time intervals. More importantly, the convolutional blocks in our decoder (see Fig. 2, right) ensure each time instant t is covered by multiple features, and therefore its radiance value L(t) is the sum of multiple nonlinear mappings of the features that cover that time instant, allowing for increased complexity in the recovered signal. This is illustrated in Fig. 3, right, where adjacent features map to overlapping time intervals in the decoded radiance.
Training and loss function. As in classic encoding-decoding architectures, we perform simultaneous training of E and D parameters. We optimize these by minimizing an error function ℒ between the reference L_ij and the decompressed time-resolved radiance L̂_ij. Since our encoding function operates over a logarithmic transformation of radiance [Eq. (3)], the features f_L handled by the decoder D and, in consequence, the resulting output Y = D(f_L) [Eq. (2)] are also in logarithmic space of radiance. To keep a good trade-off between estimating peak direct illumination and indirect illumination, we apply an exponential transformation over both the decoding output D(f_L) and the log-space central pixel g(L_ij(t)), and minimize the mean squared error over these, having
$$\mathcal{L} = \mathrm{MSE}\left(b^{D(f_L)},\; b^{g(L_{ij}(t))}\right) \qquad (4)$$
where b is the base of the exponential function. In practice, we found that choosing b = 2 provides good results for successfully decompressing both direct illumination peaks and smooth indirect bounces (see Fig. 4).
Fig. 4. Results of the Altar scene (see Visualization 1), with reference frames (left). Training with our exponential transform MSE loss [center, Eq. (4)] is able to recover strong direct peaks, while an MSE loss applied over the logarithmic space of the output (right) fails to recover these features.
Dataset. For training and validation, we rely on the publicly available Zaragoza-DeepToF transient dataset [8], which contains a sufficiently large number of complex scenarios to prevent overfitting in our approach. It contains 1050 time-resolved simulations for a wide variety of architectural scenarios, with a spatial resolution of 300 × 300 and a temporal resolution of 4096 pixels at 16.6 ps/pixel. For training, we randomly select a total of 860,000 pixel neighborhoods of size 9 × 9 from 145 scenes. For validation, we select a total of 370,000 inputs from 37 completely different scenes. While global illumination introduces correlation between patches, our validation set is uncorrelated with the training set, since the patches come from different scenarios. Our training is unsupervised, where our target L_ij is the central pixel of the input neighborhood 𝐋_ij. Although the simulations in the dataset are not completely noise-free, our method based on 3D convolutions is capable of extracting spatiotemporal features while simultaneously removing high-frequency variance from noisy data.
Figure 6 shows reference frames of the Room scene from the validation set (top row), and the resulting frames after compressing each reference time-resolved pixel to 128 features and decompressing them back to 4096 pixels (second row). The bottom row shows the full time-resolved signal at the marked location, with the reference (blue) and our recovered radiance (green), and the timestamps of the frames. Our trained decoder successfully recovers most radiance features of the scene using a compressed representation of the radiance 32 times smaller than the original. Table 1 compares compression ratios for three standard high-dynamic-range (HDR) compression libraries (RGBE, OpenEXR using wavelet/Huffman compression, and HDF5 with gzip) for all the validation scenes shown in this article, showing that our method yields smaller representations (3.1% of the original signal) than other approaches (8.8% to 28.4%). Please refer to Visualization 1 for the entire frame sequences.
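The logarithmic transformation and the exponential-transform loss described in the training discussion above can be sketched in a few lines of plain Python. This is a hedged illustration rather than the authors' implementation: the form of g used here (clamp at ε, then subtract log10(ε)) is an assumption inferred from the description of the threshold and offset, and the function names `g`, `g_inv`, and `exp_mse` are hypothetical. The values ε = 1e−7 and b = 2 are the ones reported in the text.

```python
import math

EPS = 1e-7  # threshold reported in the text


def g(x, eps=EPS):
    """Assumed log transform: clamp at eps, then shift so outputs are >= 0."""
    return math.log10(max(x, eps)) - math.log10(eps)


def g_inv(y, eps=EPS):
    """Inverse of g; exact only for radiance values that were not clamped."""
    return 10.0 ** (y + math.log10(eps))


def exp_mse(decoded_log, reference_log, base=2.0):
    """Mean squared error between base**decoded and base**reference.

    Both inputs are sequences of log-space values: the decoder output
    and the log-transformed reference pixel, respectively.
    """
    if len(decoded_log) != len(reference_log):
        raise ValueError("profiles must have the same length")
    squared = [(base ** d - base ** r) ** 2
               for d, r in zip(decoded_log, reference_log)]
    return sum(squared) / len(squared)


if __name__ == "__main__":
    # Toy radiance profile: a direct peak followed by a decaying tail.
    reference = [0.0, 0.8, 0.05, 0.01, 0.002]
    reference_log = [g(v) for v in reference]
    # Stand-in for a decoder output: the reference with a small log-space error.
    decoded_log = [v + 0.1 for v in reference_log]
    print("loss:", exp_mse(decoded_log, reference_log))
    print("recovered:", [g_inv(v) for v in decoded_log])
```

Raising both terms to the power of b before taking the error weights large log-space values (direct peaks) more heavily than a plain log-space MSE would, which is consistent with the behavior reported for Fig. 4.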
One of the pathological problems in transient light transport data is the presence of different types of noise in the signal. In particular, Monte Carlo-based transient rendering methods suffer from high variance due to uneven distributions of samples over time [6]. Our fully convolutional encoder is capable of extracting the most significant features by performing 3D spatiotemporal convolutions. In Fig. 5, we can observe the results of the denoising in two extreme cases with higher-order indirect illumination in the Building and Balcony validation scenes. Our approach does not force the compressed features (shown in red) to retain light transport properties. However, while the samples at the target time-resolved pixel L_ij (blue) present high variance, the spatiotemporal neighboring samples (brown color scale) contain relevant information that our encoder uses to extract the most significant features to decode our reconstructed
In conclusion, we have presented a new method for compressing and denoising transient light transport data. By observing the characteristics of light transport in the temporal domain, we have demonstrated how spatiotemporal 3D convolutions are capable of extracting the most meaningful features even in extremely noisy conditions. This leads to a compressed signal, from which the original can be recovered with significantly less variance by means of a convolutional decoder. Transient imaging methods and hardware present critical trade-offs between capture time and signal noise. Our method can mitigate this, while reducing the computational time required to post-process the data. We believe that our pipeline could be applied to large captured datasets, once acquisition processes become faster.
Fig. 7. Results for real data (blue) captured on a non-line-of-sight setup (left) [5]. The plots show our results (green) at different points of the captured grid.