Describes the algorithm used to infer surface normals from the depth buffer.

Overview

We estimate the normal vectors of all pixels on the screen by finding the gradient $\nabla f$ of the depth buffer $f(x,y)$ (in screen coordinates), and then using this gradient to calculate the normal vectors, according to the following equations:

$\begin{eqnarray*} p_{x,y} & = & f(x,y) r_{x,y} \\ p_{x,y}^x & = & \left( f(x,y) + \frac{\partial}{\partial x}f(x,y) \right) r_{x+1,y} \\ p_{x,y}^y & = & \left( f(x,y) + \frac{\partial}{\partial y}f(x,y) \right) r_{x,y+1} \\ v & = & \left( p_{x,y}^x - p_{x,y} \right) \times \left( p_{x,y} - p_{x,y}^y \right) \end{eqnarray*}$

Here, $r_{x,y}$ is the 3D ray (in world coordinates) along which the pixel at the given coordinates was traced, so that $p_{x,y}$ is the coordinates of the intersection, and $p_{x,y}^x$ and $p_{x,y}^y$ are those for the adjacent pixels in the horizontal and vertical directions, respectively, based on the gradient. The resulting normal vector $v$ is calculated using the vector cross product $\times$ .

There are two main challenges, both having to do with how one finds the image gradient:

We're drawing voxels, so the depth buffer is very rough. Indeed, there are only six possible values for the "true" normal vectors (one for each side of a cube). Hence, we must smooth the gradient in order to get a visually-pleasing result.
In general, but particularly when using smoothing, it's important that we ignore occlusion boundaries when calculating gradients. If we have two distinct objects, one in front of the other, then we don't want the rear object to have its normal vectors "pulled up" near where it is occluded by the front object.

In order to address these issues, gradients are estimated using a MRF.

Basic model

For the moment, forget about "depths" (which actually require a little bit of special handling), and imagine that $f$ is just a function of which we want to find a smoothed boundary-respecting gradient. At every pixel, our MRF will attempt to estimate two quantities:

$\bar{w}_{x,y}$ , the smoothed value of .
$\bar{\Delta}_{x,y}$ , the smoothed gradient of .

Assume that there is a Gaussian probability distribution over these quantities. The mean of this distribution is what is computed and stored for every pixel (that is, we perform something like expectation propagation), while the covariance is a fixed parameter to the model. To be more specific, $\bar{w}_{x,y}$ and $\bar{\Delta}_{x,y}$ are derived from the neighbors of $x,y$ as:

$\begin{eqnarray*} \mathrm{Pr}\left\{ \bar{w}_{x,y}, \bar{\Delta}_{x,y} \right\} & \propto & \exp\left( -\frac{1}{2\sigma^2} \left( \bar{w}_{x,y} - f(x,y) \right)^2 \right) \\ & & \times \prod_{i\in\mathcal{N}} \exp\left( -\frac{\delta^{(i)}_{x,y}}{2\gamma^2} \left( \bar{w}_{x,y} + \bar{\Delta}_{x,y}^T d^{(i)} - \bar{w}_{x+d^{(i)}_x,y+d^{(i)}_y} \right)^2 \right) \\ & & \times \prod_{i\in\mathcal{N}} \exp\left( -\frac{\delta^{(i)}_{x,y}}{2\kappa^2} \left\Vert \bar{\Delta}_{x,y} - \bar{\Delta}_{x+d^{(i)}_x,y+d^{(i)}_y} \right\Vert^2 \right) \end{eqnarray*}$

Here, $\delta^{(i)}_{x,y}\in[0,1]$ is the "weight" assigned to the $i$ th neighbor edge in the MRF. If it is zero, then this neighbor makes no contribution, and if it is one, then it makes a full contribution. Along occlusion boundaries, $\delta^{(i)}_{x,y}$ should be zero. The $d^{(i)}$ vectors indicate the relative positions of the neighbors. For example, for neighbors in the four cardinal directions, we may take $\mathcal{N}=\{u,r,d,l\}$ and $d^{(u)},d^{(r)},d^{(d)},d^{(l)} = (0,-1),(1,0),(0,1),(-1,0)$ for the neighbors which are up, right, down and left of the current pixel, respectively. The three terms making up the distribution have the following interpretations:

The "data term", with variance $\sigma^2$ , causes the smoothed value $\bar{w}_{x,y}$ to be close to the observed value .
For each neighbor, the first "smoothness term", with variance $\gamma^2$ , causes $\bar{w}_{x,y} + \bar{\Delta}_{x,y}^T d^{(i)}$ , which is what the smoothed value and gradient at indicate the th neighbor's value should be, to be close to the smoothed neighbor's value $\bar{w}_{x+d^{(i)}_x,y+d^{(i)}_y}$ . This is, in a sense, the most important term, since it causes the smoothed values $\bar{w}_{x,y}$ and gradients $\bar{\Delta}_{x,y}$ to be consistent with each other.
For each neighbor, the second "smoothness term", with variance $\kappa^2$ , causes the smoothed gradient $\bar{\Delta}_{x,y}$ to be close to that of the neighbor, $\bar{\Delta}_{x+d^{(i)}_x,y+d^{(i)}_y}$ .

To actually find $\bar{w}_{x,y}$ and $\bar{\Delta}_{x,y}$ , we take logarithms, differentiate, and set the result equal to zero, yielding:

$\begin{eqnarray*} \left( \frac{1}{\sigma^2} + \frac{1}{\gamma^2} \sum_{i\in\mathcal{N}} \delta^{(i)}_{x,y} \right) \bar{w}_{x,y} + \frac{1}{\gamma^2} \sum_{i\in\mathcal{N}} \delta^{(i)}_{x,y} \bar{\Delta}_{x,y}^T d^{(i)} & = & \frac{1}{\sigma^2} f(x,y) + \frac{1}{\gamma^2} \sum_{i\in\mathcal{N}} \delta^{(i)}_{x,y} \bar{w}_{x+d^{(i)}_x,y+d^{(i)}_y} \\ \frac{1}{\gamma^2} \sum_{i\in\mathcal{N}} \delta^{(i)}_{x,y} \bar{w}_{x,y} d^{(i)} + \frac{1}{\gamma^2} \sum_{i\in\mathcal{N}} \delta^{(i)}_{x,y} \left( \bar{\Delta}_{x,y}^T d^{(i)} \right) d^{(i)} + \frac{1}{\kappa^2} \sum_{i\in\mathcal{N}} \delta^{(i)}_{x,y} \bar{\Delta}_{x,y} & = & \frac{1}{\gamma^2} \sum_{i\in\mathcal{N}} \delta^{(i)}_{x,y} \bar{w}_{x+d^{(i)}_x,y+d^{(i)}_y} d^{(i)} + \frac{1}{\kappa^2} \sum_{i\in\mathcal{N}} \delta^{(i)}_{x,y} \bar{\Delta}_{x+d^{(i)}_x,y+d^{(i)}_y} \\ \end{eqnarray*}$

Notice that the second equation, resulting from differentiating with respect to $\bar{\Delta}_{x,y}$ , is two-dimensional. Hence, this is a $3\times 3$ linear system, which can be solved (reasonably) efficiently using e.g. Gaussian elimination. In fact, simply computing the coefficients of the system (i.e. evaluating the above equations) is more expensive that solving it.

This gives us our "basic model": simply iterate over all pixels evaluating and solving the above linear system, and repeat until convergence (this will be refined in subsequent sections–see Optimizing and Coarse-to-fine). At this point, I should point out that we only store a state at each pixel, rather than a message for each edge in the MRF. One will often use a message-passing algorithm (e.g. belief propagation) to optimize a MRF, in which the message sent from pixel $x,y$ to the neighboring pixel $x',y'$ is based on all of the incoming messages to $x,y$ except that from $x',y'$ . The reason that message passing is used is that the state of pixel $x,y$ should not depend on its state at an earlier iteration, and the state at $x',y'$ depends on the the earlier state of $x,y$ , since these pixels are neighbors. Message passing removes these "first order dependencies", but other, longer dependencies still exist (e.g. $x,y$ to $x'',y''$ to $x',y'$ )–it only gives a "correct" algorithm on a DAG (i.e. the model is a Bayesian network), which our MRF is not. I did experiment a little bit with message passing, but it is more computationally expensive than simply storing a state for each pixel (roughly four times so in a four-neighbor model), and didn't improve performance enough to justify the cost.

While we're on the subject of trade-offs, one may have noticed that I didn't talk too much about the edge weights $\delta^{(i)}_{x,y}$ . The "traditional" way to deal with occlusion boundaries, which I've encountered in stereo vision and image denoising algorithms, is to use latent variables. More precisely, one assumes that each edge has some probability $p$ of being an occlusion boundary, and then modifies the distribution to be:

$\begin{eqnarray*} \mathrm{Pr}\left\{ \bar{w}_{x,y}, \bar{\Delta}_{x,y} \right\} & \propto & \exp\left( -\frac{1}{2\sigma^2} \left( \bar{w}_{x,y} - f(x,y) \right)^2 \right) \\ & & \times \prod_{i\in\mathcal{N}} \left( c + \exp\left( -\frac{1}{2\gamma^2} \left( \bar{w}_{x,y} + \bar{\Delta}_{x,y}^T d^{(i)} - \bar{w}_{x+d^{(i)}_x,y+d^{(i)}_y} \right)^2 \right) \exp\left( -\frac{1}{2\kappa^2} \left\Vert \bar{\Delta}_{x,y} - \bar{\Delta}_{x+d^{(i)}_x,y+d^{(i)}_y} \right\Vert^2 \right) \right) \end{eqnarray*}$

Where $c$ is derived from $p$ and the normalization constants of the Gaussian distributions. It is possible to use this model, although solving for $\bar{w}_{x,y}$ and $\bar{\Delta}_{x,y}$ in a four-neighbor model requires solving $15$ different $3\times 3$ linear systems (one for each term which results from multiplying out the above, not including the trivial $c^4$ term). Once more, this is simply too expensive. It is much better to choose the values of $\delta^{(i)}_{x,y}$ beforehand, using a heuristic (see Applying to a depth buffer).

Optimizing

I've already roughly described how we optimize the MRF: find the state for each pixel, and repeat until convergence. For version of the MRF which I most commonly use, the state of each pixel depends on the states of its neighbors in each of the four cardinal directions (see Coarse-to-fine for an example in which the neighbors are taken to be in the four diagonal directions). We could sequentially step through the screen in, say, row-major order, and update each pixel when we reach it. However, this model admits a large amount of parallelism, and we'd be foolish not to exploit it.

The simplest method for performing the updates in parallel is probably this: maintain two buffers, and update each pixel in parallel, reading the states of its neighbors from the first buffer, and writing the result to the second. After doing this for all pixels, simply go the other way, reading from the second, and writing to the first. The biggest advantage of this strategy is that will naturally tend to use coalesced memory accesses on GPUs which benefit from them. However, it has the disadvantage that information will propagate through the graph very slowly. When a pixel changes, only its neighbors will "be aware" of the change after one iteration, only their neighbors after two, and so on.

Addressing this problem by using a more sophisticated strategy is probably not advisable on a GPU. On a multi-core processor, however, we can benefit from using an in-place update, for which we keep only one copy of the states, and specify a mixed parallel/sequential updating strategy which both ensures that no thread will be writing to a memory location that another is reading, and causes imformation to propagate rapidly through the graph.

One such strategy is based on the idea of imagining that the graph is a checkerboard. We first update all of the black squares:

Here, the arrows point to the nodes which are being updated, from those on which they depend. Next, we update the white squares:

Alternation between the two continues until convergence. The computation of each black square does not depend on the result at any other black square, and likewise for the white squares. Hence, all squares of each color can be considered concurrently.

This is a reasonable "baseline" method, and will be the foundation of the coarse-to-fine updating strategy (see Coarse-to-fine). However, it is not the best. An alternative is to parallelize over the even-numbered rows, and, within each row, sequantially visit the pixels from left-to-right:

Here, the numbers within each row indicate the sequential order in which the updates are performed. We next update the odd-numbered rows:

We then do to the same from top-to-bottom, right-to-left and bottom-to-top, updating each pixel a total of four times, once for each direction. This approach results in information potentially propagating all the way across the graph in a single pass (that is, a change to a pixel on the left side of the screen will affect a pixel on the right side), whereas after visiting all pixels four times in the "checkboard" order, information will have propagated a distance of at most eight pixels.

Coarse-to-fine

Of course, we need to have a model which not only gives high-quality results, but also one which can be optimized very quickly. The best way that I know of to speed up convergence is to perform the optimization in a coarse-to-fine manner.

Let $s$ be a "coarseness level", and imagine that each node in the coarsened grid corresponds to a $2^s \times 2^s$ "block" of nodes in the finest grid (for which $s=0$ ). We will treat this block as a single node in the coarsened grid, with the corresponding data term $f(x,y)$ being the value at the center of the block, and the probability distribution defining the update being modified to account for the change in scale.

Every edge in the coarsened grid corresponds to $2^s$ edges in the finest grid, so we must raise the two smoothing terms of the distribution to the $2^s$ th power. Similarly, because there are $(2^s)^2$ nodes in each coarsened block, we must raise the data term to the th power. Because the centers of the blocks are $2^s$ times further apart, we must scale $d^{(i)}_{x,y}$ by a factor of $2^s$ . Finally, the variances of the two smoothness terms must be scaled, but it's unclear how this should be done–one could make a case for scaling by $2^s$ or $(2^s)^2$ , depending on whether we prefer to view the accumulated errors along a stretch of $2^s$ nodes as being independent (the former case) or identical (the latter). I use the former:

$\begin{eqnarray*} \mathrm{Pr}\left\{ \bar{w}_{x,y}, \bar{\Delta}_{x,y} \right\} & \propto & \exp\left( -\frac{(2^s)^2}{2\sigma^2} \left( \bar{w}_{x,y} - f(x,y) \right)^2 \right) \\ & & \times \prod_{i\in\mathcal{N}} \exp\left( -\frac{\delta^{(i)}_{x,y}}{2\gamma^2} \left( \bar{w}_{x,y} + \bar{\Delta}_{x,y}^T d^{(i)} - \bar{w}_{x+d^{(i)}_x,y+d^{(i)}_y} \right)^2 \right) \\ & & \times \prod_{i\in\mathcal{N}} \exp\left( -\frac{\delta^{(i)}_{x,y}}{2\kappa^2} \left\Vert \bar{\Delta}_{x,y} - \bar{\Delta}_{x+d^{(i)}_x,y+d^{(i)}_y} \right\Vert^2 \right) \end{eqnarray*}$

Finding the expectation of this distribution is essentially the same as before (see Basic model). The more difficult question is how one transitions from a coarser grid level to a finer one. The approach taken is as follows: starting from the following coarse grid (dark circles correspond to nodes which have been computed):

We begin by performing a diagonal update:

This is essentially the same as the "checkerboard" update, with the difference being that the neighbors, represented by $d^{(i)}$ , lie not in the four cardinal directions, but rather the four diagonal directions. We next perform a "checkboard" update in the cardinal directions:

At this point, the entire finer grid will be populated. The overall coarse-to-fine optimization procedure is to start at some predetermined coarseness level $s$ , perform some number of smoothing passes (see Optimizing), and then transition to level $s-1$ using the procedure described above. We then repeat this until $s=0$ .

Applying to a depth buffer

I mentioned earlier that applying this model to a depth buffer requires "special handling". The reason for this is that, in a depth buffer, the rays are diverging. If one considers a scene consisting of a single plane meeting the horizon, then although the plane itself is linear, the depth buffer will not be–close to the horizon, adjacent pixels will have very different depths, while far from the horizon, they will be much closer. Hence, we do not smooth depths, but rather reciprocal-depths, which do vary linearly when they correspond to the depths of a plane. One can derive the gradient $\nabla f(x,y)$ from $\nabla (1/f(x,y))$ using the chain rule: $\nabla f(x,y) = - (f(x,y))^2 \nabla (1/f(x,y))$ . So, we smooth the reciprocal-depths, apply the chain rule to find the gradients of $f$ , and calculate the normal vectors.

The heuristic which I use to calculate $\delta^{(i)}_{x,y}$ is simple soft-thresholding of the difference between adjacent depths:

$\delta^{(i)}_{x,y} = \max\left(0,\min\left(1,\frac{r_h - \left\vert f(x,y) - f(x+d^{(i)}_x,y+d^{(i)}_y) \right\vert}{r_h-r_l}\right)\right)$

If the depths differ by at most $r_l$ , then $\delta^{(i)}_{x,y}=1$ , while $\delta^{(i)}_{x,y}=0$ if they differ by at least $r_h$ , with $\delta^{(i)}_{x,y}$ varying linearly in-between. The thresholds $r_l$ and $r_h$ are not constant–rather, they are based on the size $r$ of the voxels at the depth $f(x,y)$ . For example, if we're drawing a $256 \times 256 \times 256$ scene (which lies in a $1 \times 1 \times 1$ cube), then $r \ge 1/256$ . Due to mip-mapping, however, $r$ might be larger: the ray-casting routine performs mip-mapping in such a way that the smallest voxels are always at least one pixel in size when rendered, so $r \ge c f(x,y)$ , where $c$ is a proportionality constant measuring the size of a pixel, and depends on the screen resolution and field of view. I take the smallest $r$ which satisfies these two equations, and then define $r_l = \alpha r$ and $r_h = \beta r$ , where $\alpha$ and $\beta$ are parameters satisfying $\alpha < \beta$ .

Example

I'll conclude by including some screenshots which illustrate the various parts of this process, on a simple test scene containing a $512 \times 512 \times 512$ voxel representation of a bunch of superquadrics. The first image shows the heuristically-found edges (see Applying to a depth buffer):

From the depths, optimizing the MRF (see Optimizing) and solving for the normal vectors (see Overview) gives the following result. Here, each color channel contains the magnitude of one component of the 3D normal vector:

One can whip up some simple "fake" diffuse lighting by taking the illuminance at each pixel to be the positive part of the inner product of the normal vector with a vector pointing towards the light source: