Real-life occlusion of rendered content [Kinect]

In my quest for a better augmented reality experience, I got to a point where I needed my rendered content to interact more with the environment, and especially with me… the USER.

This is where this article comes in: it describes a method you can use to blend your content with the environment captured by the RGB camera instead of just overlapping them.

After my previous article (Kinect: 3D view & projection matrix to match the RGB camera) I had a system able to render content in a space that appeared to match the real environment captured by the RGB camera. Still, these were two separate layers carefully overlapped. The key to merging them lies in the depth image provided by the Kinect.

 

The Kinect depth image

The SDK allows the developer to register a callback which gets called every time a new depth image is available (around 30 fps).

The depth image is basically a 320×240 (although it can be smaller) matrix where each pixel value represents the distance from the camera to that point in space (expressed in mm). This is how a depth image looks when rendered, where white = closest and black = farthest.

In the second image you will notice some red areas I marked. These regions appear when the Kinect “sees” a region but can’t determine its depth, so it simply writes zero in those pixels.
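To make the zero-value behaviour concrete, here is a minimal sketch (mine, not part of the original demo) that walks a raw depth frame and counts the pixels the sensor could not resolve. It assumes the plain depth format where each pixel is a two-byte little-endian value in millimeters; the player-index format packs its bits differently.

// Counts the "unknown depth" pixels in a raw depth frame (assumed layout:
// two bytes per pixel, little-endian, depth in millimeters, no player index).
int CountInvalidPixels(byte[] depthBits)
{
    int invalid = 0;
    for (int i = 0; i + 1 < depthBits.Length; i += 2)
    {
        // Rebuild the 16-bit depth value of this pixel.
        int depthMm = depthBits[i] | (depthBits[i + 1] << 8);

        // Zero means the Kinect could not determine the depth here.
        if (depthMm == 0)
            invalid++;
    }
    return invalid;
}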

 

So what do we do with this depth image?

The trick is to initialize the Z-buffer (OpenGL or Direct3D) with the depth values taken from the Kinect depth image before rendering any 3D content. By doing this, when your content is rendered it will be occluded by what appear to be real-life objects. It’s like simulating a render of the entire environment in 3D and using the resulting z-buffer values.
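In practice, the per-frame order ends up looking roughly like the sketch below. This is only an outline of the idea, with hypothetical DrawCameraBackground / DrawKinectDepthQuad / DrawAugmentedContent helpers; the actual quad drawing and the shader are shown later in the article.

// Rough per-frame order for the occlusion trick (the three helpers are hypothetical).
protected override void Draw(GameTime gameTime)
{
    // 1. Clear color and depth as usual.
    GraphicsDevice.Clear(ClearOptions.Target | ClearOptions.DepthBuffer, Color.Black, 1.0f, 0);

    // 2. Draw the RGB camera image as the background (color only).
    DrawCameraBackground();

    // 3. Draw the full-screen depth quad; its pixel shader writes the Kinect
    //    depth, converted to NDC, into the z-buffer.
    DrawKinectDepthQuad();

    // 4. Render the virtual content with depth testing enabled; fragments that
    //    fall behind a real-world surface fail the depth test and are discarded.
    DrawAugmentedContent();

    base.Draw(gameTime);
}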

 

Converting a depth value to the correct z-buffer value

Before we jump into writing to the z-buffer, we need to consider what kind of projection we use (orthographic or perspective) and the different coordinate spaces involved in the process (and how we convert between them). Basically, we’re duplicating the graphics library’s pipeline, limited to the depth values.

This book (Essential Mathematics for Games and Interactive Applications) was of REAL help to me and I recommend it to anyone looking for a good understanding of the math involved in graphics applications (vectors, points, matrices, projections, etc.). As the title says, it’s all essential information. I checked out other books before settling on this one, and for me it was the best: a very good combination of math facts and explanations. I ended up going through the entire book, not just the Projections chapter.

We have 3 coordinate systems we must consider and use:

- Kinect depth image space: values in this space are represented in millimeters, and this is what you find in each pixel of the depth image.

- 3D space: values in this space are represented in units; using the system determined in my previous article, 1 unit = 1 meter. Converting from depth image space to 3D space is very easy. Considering our camera is located at the 3D space origin (0,0,0) looking down the negative Z axis, we only have to divide by 1000.0f to convert from millimeters to 3D units (meters). For example, 1234 mm (depth image value) = 1.234 meters (3D space depth from the camera).

- Normalized Device Coordinates (NDC): this is an intermediary space used by rendering libraries to deal with multiple device resolutions and formats. Values in this space are usually between -1 and 1 for X, Y, Z. Converting from 3D space to NDC is a bit more complicated: the conversion formula depends on the type of projection used. In my case I am using a plain perspective projection, with the following notation:

  • z = input depth value
  • far = far plane distance from the camera
  • near = near plane distance from the camera
  • z’ = the output normalized value

With this notation, the perspective conversion (the same one applied in the shader further down) is:

z’ = (far + near) / (far - near) - (2 * near * far) / ((far - near) * z)

This formula produces output values in the [-1, 1] range. Note that Direct3D uses [0, 1] for its NDC depth values, so if you’re using it (XNA included) you must remap z’ into the [0, 1] interval. This is very easy:

z’’ = (z’ + 1) / 2
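Put together, the whole chain from a raw Kinect reading to a Direct3D z-buffer value looks like the small helper below. This is just a CPU-side sketch of the math for clarity (the article does the same thing on the GPU in the shader further down); DepthMmToZBuffer is a hypothetical name.

// CPU-side sketch of the conversion chain:
// Kinect depth (millimeters) -> 3D space (meters) -> NDC [-1, 1] -> Direct3D [0, 1].
// nearPlane / farPlane must match the ones used by the perspective projection.
static float DepthMmToZBuffer(int depthMm, float nearPlane, float farPlane)
{
    // Zero means the sensor could not measure this pixel; pushing it to the far
    // plane (a choice on my side) also avoids a division by zero below.
    if (depthMm == 0)
        return 1.0f;

    // Depth image space -> 3D space (1 unit = 1 meter).
    float z = depthMm / 1000.0f;

    // 3D space -> NDC, perspective projection.
    float zNdc = (farPlane + nearPlane) / (farPlane - nearPlane)
               - (2.0f * nearPlane * farPlane) / ((farPlane - nearPlane) * z);

    // Clamp and remap to Direct3D's [0, 1] z-buffer range.
    if (zNdc < -1.0f) zNdc = -1.0f;
    if (zNdc > 1.0f) zNdc = 1.0f;
    return (zNdc + 1.0f) / 2.0f;
}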

So, now after these conversions we finally have a value ready to be written into the z-buffer. Let’s see how we do this.

 

Rendering the depth image into the z-buffer

To get the depth image into the z-buffer we will be drawing a textured quad as large as the screen, using an orthographic projection. Thanks to the programmable pipeline we can do this fairly easily by writing a custom shader that reads the values from the texture, does the conversions mentioned above and, in the end, writes the depth value of that pixel.

There’s a trick in how we pass the depth data from the Kinect to the shader. We want to do as few per-pixel operations as possible on the CPU (the GPU is far better at this). Considering each depth pixel is a 16-bit value, I am creating the texture in the Bgra4444 format: this way I don’t have to change anything in the array received from the Kinect. This is what my depth image callback looks like:

void kinectSensor_DepthFrameReady(object sender, ImageFrameReadyEventArgs e)
{
    PlanarImage p = e.ImageFrame.Image;
    kinectDepthFrameBits = p.Bits;
 
    if (kinectDepthFrame == null)
        kinectDepthFrame = new Texture2D(graphics.GraphicsDevice, p.Width, p.Height, false, SurfaceFormat.Bgra4444);
 
    kinectDepthFrame.SetData(kinectDepthFrameBits);
}

So, now that we have our texture ready, let’s see how we can render that full-screen quad (the code below is not optimized):

kinectDepthEffect.Projection = orthoMatrix;
kinectDepthEffect.Texture = kinectDepthFrame;
kinectDepthEffect.World = Matrix.Identity;
kinectDepthEffect.World *= Matrix.CreateScale(-kinectDepthFrameSize.X, -kinectDepthFrameSize.Y, 1.0f);
kinectDepthEffect.World *= Matrix.CreateTranslation(kinectDepthFrameCenter.X - ScreenW / 2, kinectDepthFrameCenter.Y - ScreenH / 2, 0.0f);
 
VertexPositionTexture[] pointList = new VertexPositionTexture[4];
pointList[0] = new VertexPositionTexture(new Vector3(-0.5f, 0.5f, 0), new Vector2(0.0f, 1.0f));
pointList[1] = new VertexPositionTexture(new Vector3(0.5f, 0.5f, 0), new Vector2(1.0f, 1.0f));
pointList[2] = new VertexPositionTexture(new Vector3(-0.5f, -0.5f, 0), new Vector2(0.0f, 0.0f));
pointList[3] = new VertexPositionTexture(new Vector3(0.5f, -0.5f, 0), new Vector2(1.0f, 0.0f));
 
foreach (EffectPass pass in kinectDepthEffect.CurrentTechnique.Passes)
{
    pass.Apply();
    graphics.GraphicsDevice.DrawUserPrimitives(PrimitiveType.TriangleStrip, pointList, 0, 2);
}
 
graphics.GraphicsDevice.Textures[0] = null;
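One detail worth double-checking here (an assumption on my side, it is not spelled out in the code above): depth writing must be enabled on the device while the quad is drawn, otherwise the DEPTH output of the pixel shader never reaches the z-buffer. In XNA 4.0 that means something like:

// Make sure the z-buffer is both tested and written while the depth quad is drawn.
graphics.GraphicsDevice.DepthStencilState = DepthStencilState.Default;

If you don’t want the grey visualization color of the quad to show up in the final image, you could additionally use a blend state with ColorWriteChannels.None so that only the depth gets written.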

 

And finally, this is the shader I am using for the rendering. You will see that it’s a little more complicated to extract the depth from the R, G, B, A channels of the texture, but it’s not rocket science.

//=============================================================================
// 	[GLOBALS]
//=============================================================================
 
float4x4 World;
float4x4 Projection;
Texture2D KinectTexture;
float nearPlane;
float farPlane;
 
sampler2D KinectSampler = sampler_state
{
	Texture = <KinectTexture>;
};
 
//=============================================================================
//	[STRUCTS]
//=============================================================================
 
struct VertexPositionTexture
{
	float4 Position : POSITION0;
	float2 UV       : TEXCOORD0;
};
 
struct PixelShaderOutput
{
	float4 color : COLOR0;
	float depth: DEPTH;
}; 
 
//=============================================================================
// 	[FUNCTIONS]
//=============================================================================
 
VertexPositionTexture TexturedVertexShader(VertexPositionTexture input)
{
	VertexPositionTexture output;
	output.Position = mul(input.Position, World);
	output.Position = mul(output.Position, Projection);
	output.UV       = input.UV;
	return output;
}
 
PixelShaderOutput TexturedPixelShader(VertexPositionTexture input)
{
	float4 texcolor = tex2D(KinectSampler, input.UV);
 
	// Get the depth value and restore its z-value
	int realDepth = (int)(texcolor[2] * 16) + ((int)(texcolor[1] * 16) * 16) + ((int)(texcolor[0] * 16) * 256) + ((int)(texcolor[3] * 16) * 4096);
 
	// Transform from depth image to 3d space values
	float zz = (float)realDepth / 1000.0f;
 
	// Transform from 3D space to NDC space value
	float zdepth = (farPlane + nearPlane) / (farPlane-nearPlane) + (-2 * nearPlane * farPlane) / ((farPlane - nearPlane) * zz);
 
	// Clamp the z just to be sure
	if (zdepth < -1)
		zdepth = -1;
	if (zdepth > 1)
		zdepth = 1;
 
	// Direct3D uses [0,1] as NDC range so we convert to this interval
	zdepth = (zdepth + 1.0f) / 2.0f;
 
	// Calculate a color intensity just in case we want to visualize our z-buffer output
	texcolor[0] = zdepth / 2.0f;
	texcolor[1] = zdepth / 2.0f;
	texcolor[2] = zdepth / 2.0f;
	texcolor[3] = 1.0f;
 
	PixelShaderOutput output;
	output.color = texcolor;
	output.depth = zdepth;
 
	return output;
}
 
//=============================================================================
//	[TECHNIQUES]
//=============================================================================
 
technique DefaultEffect
{
    Pass
    {
        VertexShader = compile vs_2_0 TexturedVertexShader();
        PixelShader  = compile ps_2_0 TexturedPixelShader();
    }
}
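To see why the weights 1, 16, 256 and 4096 show up in the pixel shader, here is a small worked example (my own illustration, derived from the shader above, so treat the channel assignment as an assumption): a Bgra4444 texel is 16 bits split into four 4-bit channels, so a millimeter value simply gets spread across four nibbles and the shader reassembles them.

// Worked example: a depth of 1234 mm (0x04D2) split into four 4-bit nibbles and
// reassembled with the same weights the shader uses.
ushort depthMm = 1234;                      // 0x04D2

int b = depthMm         & 0xF;              // 0x2 -> weight 1
int g = (depthMm >> 4)  & 0xF;              // 0xD -> weight 16
int r = (depthMm >> 8)  & 0xF;              // 0x4 -> weight 256
int a = (depthMm >> 12) & 0xF;              // 0x0 -> weight 4096

int reassembled = b + g * 16 + r * 256 + a * 4096;   // 1234 again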

 

Conclusion & the shader in action

Here are some screenshots of a demo app I made to test this idea.

As you can see, the Kinect depth camera has a low resolution, which, combined with the noise and the invalid areas, makes this technique very hard to implement nicely in a project. It all depends on the accuracy needed for your project, but so far I haven’t had a chance to use this commercially.

If you have any suggestions or comments, you can leave them below.