Aerial Vision

Lewis Anderson

After a long delay, we are proud to announce that we succeeded in our project. We achieved 90% accuracy in character recognition (using Chen's method), a big accomplishment. Because we did not rectify our images (undistort them), the shapes were skewed. We got 74% accuracy on shapes.

As a comparison to the much simpler Hu moments, Chen's method improved letter recognition by a significant 25 percentage points (65% vs 90%). Shape recognition was unchanged.

Our final report is here: http://cseweb.ucsd.edu/classes/wi10/cse190-a/reports/wgrant_lkander.pdf

We may post the matlab code for Chen's method, which involved radon transforms, dual-tree complex wavelet transforms, and fourier transforms, as well as k-means clustering for image segmentation.

Any questions, just send a message.

Also, we got an A. :)

Shane Grant

So I took a slightly different approach to segmenting the letters and used Mean Shift to cluster colors together. The algorithm I am using is called EDISON from Rutgers. I did a little searching and found a really nice Matlab wrapper written for it on a page by Shawn Lankton. The files required an end of line conversion to run on Windows, but other than that, it was pretty easy to get working.

Mean shift accomplishes more or less the same thing as k-means clustering but in a different fashion. Instead of having to choose (or guess) k, mean shift iteratively reduces the number of colors in the image as they are "shifted" towards each other. I haven't read up too much on the actual math behind it yet though.

So with this new tool in hand, I set off to try and get automatic segmentation of letters. My approach is the following:

I first take an input image and rescale it to 64x64. This reduces computation time, reduces some noise, and helps standardize feature extraction. The next step is to convert the image into the LAB color space, which is used for all image processing.

Once we have the reduced image, it is fed through a few different parameterizations of mean shift (though we do not need to pick a K for mean shift, there are still several parameters to deal with). The EDISON code automatically labels the output images based upon their color. I take the area of the largest connected region found by EDISON, and use it to threshold the image. My prior attempts began with trying to find the smallest region and assuming it was the letter, but there tended to be a lot of other small regions of distractors - I found that a the shape takes up a majority of the image in most cases.

After we have this binary image, we can perform a few morphological operations on it to try and get just the letter. Often at this stage we can expect two connected components - the letter and the background. I then make the assumption that the letter will have a smaller area than the background, and use that information to remove all regions larger than the assumed letter.

I have also developed a fairly reliable way of extracting just the shape from the image. After the input image is fed through the first parameterization of mean shift, I take the number of regions it has found and use that to perform k-means clustering on the original image. This is then further reduced by setting K=2, resulting in an image with just two colors - more often than not, this has the shape as one color and the background as another. To take care of cases where smaller components are classified to the same color as the background, similar morphological operations are performed to reduce the binary image to just a solid shape.

With that being said, here are some results. As can be seen, it is far from perfect, but can work quite well at times. I think it is certainly a good start, but it may overfit the data I was using to create it. When it comes time to actually testing this, I'll have a much better idea of how well it works.

Shane Grant

Cropped target images:

Early results of automatic thresholding:

Shane Grant

I'm playing around with methods to perform automatic thresholding of our input images. I am manually cropping targets with bounding boxes that give a pretty decent amount of background information as well.

My initial thought was to look at color histograms to compute the three most prevalent colors and then look at their distributions in order to isolate the letter. I am using some K-means code to reduce every input image to just three colors - the hope is that we see the letter as one color, the target as another, and the background as the third color.

Restricting k to 3 is be too strict of a bound for the problem - in some cases I am seeing that the background gets assigned the same color as the alphanumeric. In addition, the code I have doing this is using euclidean distance in RGB space, which isn't a very good way to group colors - I expect much better results using the LAB color space. Using LAB I will likely just group based upon A* and B* and disregard the L channel, since the color of the object is the same even though the lighting fluctuates.

I changed the Hu code to use euclidean distance since I am still having trouble with the mahalanobis distance code - it gives very undesirable results.

Shane Grant

I wrote some code that computes Hu moments for labeled data and then gives a label to a new image based upon this data. Hu moments are rotation and scale invariant features of an image. Currently I am manually thresholding input images since I haven't written code to do it automatically.

Hu moments are calculated by first finding the Central moments of an image, which in turn are calculated by subtracting the centroid and then finding the raw image moments. Central image moments are made scale invariant by scaling them by a scaled 0th moment, which corresponds to the area of an object in a binary image. Images are made translation invariant by subtracting the centroid.

We use seven rotation invariant moments (so called Hu moments) which are based upon the scale invariant moments discussed earlier. Each one is some (usually nonlinear) combination of the scale invariant moments.

The following are some of the images I've been playing with. The original is from an actual target and then the rotated versions are synthetically created.

The moments for the two previous figures are:

R R_271
0.4463 0.4441
0.0553 0.0571
0.0041 0.0038
0.0140 0.0129
0.0001 0.0001
0.0031 0.0029
0.0001 -0.0001

As you can see, even though the images themselves look very different, the features extracted are very similar.

The values for the following rotations of E (0, 22, 90, 132 degrees) are:

1.3308 1.3185 1.3308 1.3336
0.6832 0.6780 0.6832 0.6911
0.1851 0.1730 0.1851 0.1905
0.1071 0.1059 0.1071 0.1228
0.0057 0.0057 0.0057 0.0090
-0.0003 0.0027 -0.0003 0.0153
-0.0017 -0.0109 0.0017 -0.0160

Once again you can see that the values are all very similar.

Once the algorithm has been given labeled data, we simply feed it a new image, calculate its Hu moments, and perform a nearest neighbor classification based upon Mahalanobis distance.

This distance metric differs from euclidean distance in that it takes into account correlations of the data sets and is scale invariant.

The code I have to do this is acting a little strangely right now so the results aren't very reliable. However, from manual inspection we can see that the Hu moments give a surprisingly good signature for a letter, even under rotation. I will have to include more letters and different versions of each letter to see whether it will be suitable enough for our purposes or whether we will require something more complex.

Lewis Anderson

We are starting to implement the methods found in a paper we posted a couple weeks ago, about invariant pattern recognition.

The paper outlines its methods as follows.

"1. Normalize the pattern of size n×n so that it is translation- and scale-invariant.

2. Discard al

l those pixels that are outside the surrounding circle with center (n/2,n/2) and radius n/2.

3. Project the pattern in 2n different orientations (θ) to get the radon transform coefficients.

4. Perform a 1D dual-tree complex wavelet transform on the radon coefficients along the radial direction.

5. Conduct a 1D Fourier transform on the resultant coefficients along the angle direction and get the Fourier spectrum magnitude.

6. Save these features into the feature database."

I found code online that does a radon transform and returns results in a fairly usable manner here. It only calculates using 8 angles, so the data isnt as precise as we would like, and we will likely modify it to do far more angles, then port it to CUDA (run on the GPU).

We have gotten to step 3, though we are not projecting in as many orientations as our source paper calls for, so this is a modification we will need to make to the code. Here are some preliminary results. Note that all pixels in the gray area are discarded, the angle is displayed on the vertical axis, and the translation is displayed on the horizontal axis. I discarded 40% of the far left and far right translations because they had no worthwhile information.

As you can see, with this information, the B and the A appear to be identical. It is possible that putting these results through the last two steps, or analyzing the specific numbers, would distinguish the two. I feel that the biggest problem here is the small number of orientations in the radon transform. This is a problem with the current implementation, and not the theory, and its something we can fix fairly easily.

In unrelated news, I hate the blogger posting system.

Shane Grant

Today we got on top of EBU I and took some video of targets on the ground. This was from a height of somewhere around 100 to 150ft, with the zoom varying from 1x to about 4x.

Here are a bunch of pictures (The ones with a lot of green grass are from our flight last week - these are just frames taken from the videos):

The following are some examples of running the images through our saliency program running on the GPU. It currently only takes color channels into account and isn't using more complex filtering like Gabor right now:

Shane Grant

I'm using Virtualdub to preview a bunch of different methods for deinterlacing the content. Here are a few samples (all of this is on our more zoomed out footage - still working on fixing the other file, though it might not be possible to recover the data):

This is using "discard," which cuts the vertical resolution in half and only uses one set of fields:

Next up is "double," which takes either the even or odd scan lines and doubles them, ignoring one set of scan lines (discard with 2x the vertical resolution):

Next is "blend," which takes the average of both field lines:

Increasing in complexity, next is the "bob" algorithm:

Finally there is the "yadif" algorithm, which is similar to "bob" but slighty more complex (the information on these is rather sparse, however):

From looking at the stills alone it seems that the bob and yadif methods produce the best results - it's difficult to judge them in actual video sequences since everything in the field of view is constantly moving. The yadif algorithm is apparently fairly good at dealing with static scenes which we currently have none of.

Shane Grant

We finally got around to flying and recording video this past weekend. We put out targets and recorded two videos - one about eight minutes long and the other slightly longer. Unfortunately the second file seems to have become corrupted so we can't view it for now. We've tried hex editing the header of the avi file to fix it but we haven't had any luck so far.

The flight tests were going very well up until one of our flight batteries that powers the motor failed on us. This caused the motor to shut off and made for a rather rough emergency landing so we won't fly for another one to two weeks while the plane is being repaired.

The video files are very large (several gb) and I haven't compressed them to upload yet.

We did realize one very important thing though - we need to deinterlace our video. We use a Sony FCB-HD11 camera which has the capability to capture HD or SD. Right now our transmitter is only for SD so we have to use the NTSC format, which is an interlaced 29.97 fps video.

This video looks great when viewed on a device that is designed to display interlaced video, like an older television. However, when viewed on a progressive monitor, like a computer display, it looks terrible.

Here's an example of what interlaced video looks like when displayed progressively:

We can use video editing software to deinterlace the recorded footage, but we need a real time solution to use until we can acquire an HD transmitter and use 720p.

There is a lot of literature on the subject and many approaches ranging in complexity and effectiveness. This site has a good overview on the subject. Deinterlacing—An Overview is a great paper on many approaches to solving the problem.

Our first efforts will be to implement a simple deinterlacing scheme that takes either the odd or even field lines and interpolates the values in between them (code can be found to do this in both OpenCV and GLSL). Our videos have a lot of motion so likely we will need something more robust than this.

More advanced techniques use both temporal and motion information to deinterlace, such as that found in A Method of De-lntcrlacing with Motion Compensated Interpolation, Motion and Edge Adaptive Interpolation De-interlacing Algorithm, or Deinterlacing with
Motion-Compensated Anisotropic Diffusion.

Deinterlacing concerns aside, I was not too happy with the video of targets we did acquire. Currently we are using a fixed downward looking camera and have no control over zoom once the plane has taken off. In about a week we will have the capability to control the camera via a wireless serial interface, but actually pointing the camera is a ways off. We should also have access to telemetry data at the same time, which will allow us to perform image rectification.

At this point it is premature for us to discuss object/character recognition algorithms in detail since what our final imagery will look like is still in flux. It is likely that since we won't be flying for about two weeks that we will acquire "still" shots of the targets on campus from a high vantage point so we can make some progress on the recognition front.

Shane Grant

We attempted to acquire a bunch of data last weekend but faced a new challenge in regards to acquiring aerial images. We control the plane over a 2.4GHz wireless link which is the same band as our video transmission. There were a few issues where the video was causing us to lose control of the aircraft so we decided it safest not to fly.

To address this we are moving the airplane to 72MHz which should get rid of the issue. On the upside, the video comes through very clearly in all of our tests so far.

We have written a program to capture the video data and save it to disk using OpenCV and are working on a more advanced interface that renders the video in OpenGL.

The weather should be fine to perform a flight this weekend. If the flight is delayed for any reason, we are going to lay the targets out and capture them from a high building on campus to simulate the airplane so we can get started working with the images.

The first thing we plan on doing, after labeling the images, is to compute L*A*B* histograms of the targets and see what types of observations we can make from that. For actually segmenting the data in a real image, we will compare something like a simple sliding window approach using color histograms versus a more complex saliency based approach which we have already implemented to run on the GPU.

Lewis Anderson

Three researchers at Concordia University, in Canada, published a paper that Shane found, describing a very accurate method of character recognition. When run on a set of chinese characters, it had a high accuracy rate even under various rotations and significant noise.

Chen, Bui, and Krzyzak (the authors) reported 98.8% accuracy or higher on the chinese characters present in Figure 1. Since our set of possible characters is approximately a third of theirs, their method would be extremely useful for us. Additionally, their 98.8% accuracy was obtained with high white noise (see Figure 2, SNR = 0.5). With a SNR of 1.0 or higher, their accuracy was 100% under all rotational angles.

Their approach involves the Radon transform, dual-tree Complex Wavelet, and Fourier transforms. Both the Radon transform and the Fourier transform benefit highly from the GPU programming we will be using, and likely the dual-tree Complex Wavelet will as well.

I also found another student who had experience implementing a Radon transform in C++, though he did not use the GPU.

Figure 1:

Figure 2:

Shane Grant

The planned flight for today had to be delayed because of rain, so no sample imagery will be available until this weekend. One of our goals is to determine the color of targets. Since this project is under the assumption that targets will be reasonably segmented (we can expect some noisy brush or grass in the background), one approach we are going to try as soon as we have imagery is to use color histograms to figure out the colors.

Targets are two colors, the majority of which is the background color. The alphanumeric makes up only a small portion of any given color, so we can expect the distribution of color to have its highest peak in the background color, followed by the alphanumeric. We will have to check this theory once we have actual images.

Shane Grant

This past Sunday we created a lot of sample targets to use in acquiring images. We have almost every letter/number currently and should have the remainder finished next weekend. The targets are made out of plywood and painted solid colors.

Currently we plan to fly on Wednesday and acquire video of many of the targets from various angles and extract frames to create individual images.

For the Wednesday update we will have video and images of the sample targets.

I have also started to review some papers on shape/character recognition, specifically: Invariant pattern recognition using radon, dual-tree complex wavelet and Fourier transforms. It is a fairly complex paper that presents rotation invariant recognition for even highly noisy images. I am currently in the process of becoming familiar with the Radon Transform, which in some cases is very similar to the Hough transform.

Shane Grant

The aim of this project is to figure out a way to recognize characters and shapes in targets used in the 2010 AUVSI Student UAS competition. The targets consist of a solid colored alphanumeric character painted on a differently colored, solid background that is a "simple geometric" shape. Based upon past experience we take this to mean shapes in the realm of rectangles, triangles, crosses, stars, circles, and semi-circles.

More information on the target specifics can be found on page 5 of the rules for the competition.

Our first goal is to create replica targets and acquire imagery of them from our plane. We do have some prior footage to use but it will likely not be very useful since the imagery system we are using this year is more sophisticated. We will try to create enough targets to capture each possible character we could encounter, as well as cover the basic shapes we expect to see.

Images will be acquired by capturing frames from a Sony FCBH11 camera, which we are currently using in standard definition mode.

We expect to have this footage within one to one and a half weeks.

Aerial Vision

Wednesday, May 19, 2010

Final Results

Tuesday, March 2, 2010

Thresholding is Difficult

Wednesday, February 24, 2010

Target Pictures

Friday, February 19, 2010

Thresholding

Wednesday, February 17, 2010

Some Progress

Thursday, February 11, 2010

Radon Transform

Wednesday, February 3, 2010

Some Target Data

Friday, January 29, 2010

Interlace Update

Monday, January 25, 2010

Flying and Interlace Problems

Friday, January 22, 2010

Data Delay

Wednesday, January 13, 2010

Pattern and Character Recognition

Rain on Our Parade

Monday, January 11, 2010

Target Making

Wednesday, January 6, 2010

Introduction

Followers

Blog Archive

Contributors