Automatic target recognition (ATR) is the ability of an algorithm or device to recognize targets or other objects based on data obtained from sensors.
Target recognition was initially done using an audible representation of the received signal: a trained operator would listen to that sound and classify the target illuminated by the radar. While these trained operators had success, automated methods have been developed, and continue to be developed, that allow for greater accuracy and speed in classification. ATR can be used to identify man-made objects such as ground and air vehicles, as well as biological targets such as animals, humans, and vegetative clutter. This can be useful for everything from recognizing an object on a battlefield to filtering out interference caused by large flocks of birds on Doppler weather radar.
Possible military applications include simple identification systems such as an IFF transponder, as well as use in unmanned aerial vehicles and cruise missiles. Interest in using ATR for domestic applications has also grown. Research has been done into using ATR for border security, safety systems that identify objects or people on a subway track, automated vehicles, and many other applications.
Target recognition has existed almost as long as radar. Radar operators would identify enemy bombers and fighters by listening to an audio representation of the reflected signal (see Radar in World War II).
Target recognition was done for years by playing the baseband signal to the operator. Listening to this signal, trained radar operators can identify various pieces of information about the illuminated target, such as the type of vehicle and its size, and can potentially even distinguish biological targets. However, there are many limitations to this approach: the operator must be trained to recognize what each target sounds like; a target traveling at high speed may no longer be audible; and the human decision component makes the probability of error high. Nevertheless, the idea of audibly representing the signal did provide a basis for automated classification of targets. Several classification schemes that have been developed use features of the baseband signal that have also been used in other audio applications such as speech recognition.
Radar determines the distance to an object by timing how long it takes the transmitted signal to return from the target it illuminates. When the object is not stationary, its motion causes a shift in frequency known as the Doppler effect. In addition to the shift from the translational motion of the entire object, a further shift in frequency can be caused by parts of the object vibrating or spinning. When this happens, the Doppler-shifted signal becomes modulated. This additional Doppler effect, which causes the modulation of the signal, is known as the micro-Doppler effect. The modulation can have a certain pattern, or signature, that allows algorithms to be developed for ATR. The micro-Doppler effect changes over time depending on the motion of the target, producing a signal that varies in both time and frequency.
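The relationships described above can be sketched numerically. The following is a minimal illustration, not a radar model: the function names, the sign convention (positive velocity means the target is closing), and the choice of a single sinusoidally vibrating scatterer as the source of micro-Doppler are all assumptions made for the example.

```python
import math

C = 299_792_458.0  # speed of light in m/s


def range_from_delay(round_trip_s):
    """Range to the target from the round-trip delay of the echo."""
    # The signal travels out and back, so the one-way distance is half.
    return C * round_trip_s / 2.0


def doppler_shift(radial_velocity_mps, carrier_hz):
    """Two-way Doppler shift for a target with the given radial velocity."""
    return 2.0 * radial_velocity_mps * carrier_hz / C


def micro_doppler_shift(t, bulk_velocity_mps, vib_amplitude_m, vib_rate_hz, carrier_hz):
    """Instantaneous shift: bulk motion plus one vibrating scatterer (assumed model)."""
    omega = 2.0 * math.pi * vib_rate_hz
    # Displacement A*sin(omega*t) has velocity A*omega*cos(omega*t).
    vib_velocity = vib_amplitude_m * omega * math.cos(omega * t)
    return doppler_shift(bulk_velocity_mps + vib_velocity, carrier_hz)


# Hypothetical numbers: 10 GHz carrier, 30 m/s target, 1 mm vibration at 50 Hz.
bulk = doppler_shift(30.0, 10e9)
shifts = [micro_doppler_shift(k / 1000.0, 30.0, 0.001, 50.0, 10e9) for k in range(20)]
print(f"bulk shift: {bulk:.1f} Hz, modulated between "
      f"{min(shifts):.1f} and {max(shifts):.1f} Hz")
```

In this sketch the vibration makes the shift oscillate around the bulk value at the vibration rate, which is the periodic signature an ATR algorithm would look for.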
Fourier transform analysis of this signal is not sufficient, since the Fourier transform cannot account for the time-varying component. The simplest method to obtain a function of frequency and time is to use the short-time Fourier transform (STFT). However, more robust methods such as the Gabor transform or the Wigner distribution function (also called the Wigner-Ville distribution, WVD) can be used to provide a simultaneous representation of the frequency and time domains. In all these methods, however, there is a trade-off between frequency resolution and time resolution.
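As a rough illustration of the STFT and its resolution trade-off, the sketch below builds a test tone whose frequency swings sinusoidally (a stand-in for a micro-Doppler signature) and tracks its peak frequency frame by frame. The sample rate, window length, hop size, and test signal are arbitrary assumptions, and the hand-rolled `stft_magnitude` helper is written for the example rather than taken from any library.

```python
import numpy as np


def stft_magnitude(x, window_len, hop):
    """Magnitude STFT: Hann-windowed frames, one FFT per frame."""
    window = np.hanning(window_len)
    starts = range(0, len(x) - window_len + 1, hop)
    frames = np.stack([x[s:s + window_len] * window for s in starts])
    return np.abs(np.fft.rfft(frames, axis=1))  # (num_frames, window_len // 2 + 1)


def resolutions(window_len, fs):
    """Approximate (time, frequency) resolution for a given window length."""
    return window_len / fs, fs / window_len


fs = 1000.0                      # assumed sample rate in Hz
t = np.arange(2000) / fs         # two seconds of samples
# Instantaneous frequency 100 + 40*cos(2*pi*t) Hz, obtained by integrating it
# into the phase of the cosine.
phase = 2.0 * np.pi * 100.0 * t + 40.0 * np.sin(2.0 * np.pi * t)
x = np.cos(phase)

window_len, hop = 128, 32
spec = stft_magnitude(x, window_len, hop)
freqs = np.fft.rfftfreq(window_len, d=1.0 / fs)
ridge = freqs[np.argmax(spec, axis=1)]   # strongest frequency in each frame

print("frames x bins:", spec.shape)
print("ridge range (Hz):", ridge.min(), "to", ridge.max())
print("time/frequency resolution:", resolutions(window_len, fs))
```

A single Fourier transform of `x` would show energy spread between roughly 60 and 140 Hz with no indication of when each frequency occurs; the per-frame ridge recovers the swing. Doubling `window_len` halves the frequency bin width but doubles the time span each frame averages over.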
Once this spectral information is extracted, it can be compared to an existing database containing information about the targets that the system will identify, and a decision can be made as to what the illuminated target is. This is done by modeling the received signal, then using a statistical decision method such as maximum likelihood (ML), majority voting (MV), or maximum a posteriori (MAP) to decide which target in the library best fits the model built from the received signal.
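The three decision rules named above differ only in what they maximize. The sketch below assumes the log-likelihood of the received signal under each library target has already been computed; the target names, likelihood values, and priors are hypothetical numbers chosen to show a case where ML and MAP disagree.

```python
import math
from collections import Counter


def ml_decision(log_likelihoods):
    """Maximum likelihood: pick the target with the largest log-likelihood."""
    return max(log_likelihoods, key=log_likelihoods.get)


def map_decision(log_likelihoods, priors):
    """Maximum a posteriori: weight each likelihood by the target's prior."""
    return max(
        log_likelihoods,
        key=lambda name: log_likelihoods[name] + math.log(priors[name]),
    )


def majority_vote(labels):
    """Majority voting: the label chosen most often across separate decisions."""
    return Counter(labels).most_common(1)[0][0]


# Hypothetical values for illustration only.
log_likelihoods = {"wheeled": -10.0, "tracked": -10.5, "person": -14.0}
priors = {"wheeled": 0.2, "tracked": 0.7, "person": 0.1}

print("ML :", ml_decision(log_likelihoods))
print("MAP:", map_decision(log_likelihoods, priors))
print("MV :", majority_vote(["tracked", "wheeled", "tracked"]))
```

With these numbers the ML rule selects the wheeled vehicle, while the MAP rule selects the tracked vehicle because its larger prior outweighs the small likelihood gap.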
Studies have taken audio features used in speech recognition to build automated target recognition systems that identify targets based on these audio-inspired coefficients. These coefficients include the
The baseband signal is processed to obtain these coefficients, then a statistical process is used to decide which target in the database is most similar to the coefficients obtained. The choice of which features and which decision scheme to use depends on the system and application.
The features used to classify a target are not limited to speech-inspired coefficients. A wide range of features and detection algorithms can be used to accomplish ATR.
In order for detection of targets to be automated, a training database needs to be created. This is usually done using experimental data collected when the target is known, and is then stored for use by the ATR algorithm.
An example of a detection algorithm is shown in the flowchart. This method uses M blocks of data, extracts the desired features from each (e.g. LPC coefficients, MFCC), then models them using a Gaussian mixture model (GMM). After a model is obtained using the data collected, a conditional probability is formed for each target contained in the training database. Because there are M blocks of data, this results in a collection of M probabilities for each target in the database. These probabilities are used to determine what the target is using a maximum likelihood decision. This method has been shown to be able to distinguish between vehicle types (wheeled vs. tracked vehicles, for example), and even to decide how many people are present, up to three people, with a high probability of success.
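One plausible reading of this scheme can be sketched as follows. The sketch assumes the training database stores a pre-trained diagonal-covariance GMM per target (weights, means, variances), scores each of the M blocks against every target, and combines the M per-block log-likelihoods by summing them before the maximum likelihood decision. The feature extraction step is skipped: random 2-D vectors stand in for LPC or MFCC features, and all model parameters are made up for the example.

```python
import numpy as np


def gmm_log_likelihood(features, weights, means, variances):
    """Total log-likelihood of feature vectors (N, D) under a diagonal GMM (K comps)."""
    x = features[:, None, :]                                          # (N, 1, D)
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)  # (K,)
    log_exp = -0.5 * np.sum((x - means) ** 2 / variances, axis=2)      # (N, K)
    log_comp = np.log(weights) + log_norm + log_exp                    # (N, K)
    # Log-sum-exp over components, shifted by the row maximum for stability.
    peak = log_comp.max(axis=1, keepdims=True)
    log_px = peak[:, 0] + np.log(np.sum(np.exp(log_comp - peak), axis=1))
    return float(np.sum(log_px))


def classify_blocks(blocks, library):
    """Sum the M per-block log-likelihoods per target and take the maximum."""
    scores = {
        name: sum(gmm_log_likelihood(block, *params) for block in blocks)
        for name, params in library.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical library: two targets, two mixture components each, 2-D features.
library = {
    "wheeled": (np.array([0.5, 0.5]),
                np.array([[0.0, 0.0], [1.0, 1.0]]),
                np.full((2, 2), 0.25)),
    "tracked": (np.array([0.5, 0.5]),
                np.array([[3.0, 3.0], [4.0, 4.0]]),
                np.full((2, 2), 0.25)),
}

# M = 4 blocks of stand-in feature vectors drawn near the "tracked" model.
rng = np.random.default_rng(0)
blocks = [rng.normal(loc=3.5, scale=0.5, size=(20, 2)) for _ in range(4)]

label, scores = classify_blocks(blocks, library)
print("decision:", label)
```

Summing log-likelihoods treats the blocks as independent; a system could instead make one decision per block and combine them by majority voting.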
CNN-Based Target Recognition
Convolutional neural network (CNN)-based target recognition can outperform conventional methods. It has proved useful in recognizing targets (e.g. battle tanks) in infrared images of real scenes after training with synthetic images, since real images of those targets are scarce. Because the training set is limited, how realistic the synthetic images are matters a great deal when it comes to recognizing targets in the real-scene test set.
The overall CNN structure contains 7 convolution layers, 3 max pooling layers, and a softmax layer as output. The max pooling layers are located after the second, fourth, and fifth convolution layers. Global average pooling is also applied before the output. All convolution layers use the Leaky ReLU nonlinear activation function.
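The layer ordering described above can be written out and checked with a short sketch. Only the layer counts, pooling positions, and activation choice come from the description; the 128x128 input size, the "same"-padded convolutions, the 2x2 pooling with stride 2, and the Leaky ReLU slope of 0.1 are assumptions made for illustration.

```python
import math


def build_layers(num_conv=7, pool_after=(2, 4, 5)):
    """List the layers in order: conv blocks, pooling, global pooling, softmax."""
    layers = []
    for index in range(1, num_conv + 1):
        layers.append("conv+leaky_relu")
        if index in pool_after:
            layers.append("max_pool")
    layers.append("global_avg_pool")
    layers.append("softmax")
    return layers


def trace_spatial_sizes(layers, height, width):
    """Feature-map size after each layer (assumes same-padded convs, 2x2 pooling)."""
    sizes = []
    for kind in layers:
        if kind == "max_pool":
            height, width = height // 2, width // 2
        elif kind == "global_avg_pool":
            height, width = 1, 1
        sizes.append((height, width))
    return sizes


def leaky_relu(value, slope=0.1):
    """Leaky ReLU: pass positives through, scale negatives by a small slope."""
    return value if value > 0 else slope * value


def softmax(logits):
    """Softmax over class scores, shifted by the maximum for stability."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


layers = build_layers()
sizes = trace_spatial_sizes(layers, 128, 128)
for kind, size in zip(layers, sizes):
    print(f"{kind:16s} -> {size[0]}x{size[1]}")
```

Under these assumptions the three pooling stages reduce a 128x128 input to 16x16 before global average pooling collapses each channel to a single value for the softmax output.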