Performance Evaluation of Video Interest Point Detectors

A Thesis submitted in partial fulfillment of the requirements for the award of the degree Master of Technology

by Tammina Siva Naga Srinivasu, Roll No: 124102011

under the supervision of Dr. Prabin Kumar Bora and Dr. Tony Jacob

Department of Electronics and Electrical Engineering, Indian Institute of Technology Guwahati, July 2014

DECLARATION

I hereby declare that the work being presented in the thesis entitled "Performance Evaluation of Video Interest Point Detectors", in partial fulfillment of the requirements for the award of the degree of Master of Technology at Indian Institute of Technology Guwahati, is an authentic record of my own work carried out under the supervision of Dr. Prabin Kumar Bora and Dr. Tony Jacob, and that the works of other researchers are duly listed in the reference section. The contents of this thesis, in full or in part, have not been submitted to any other Institute or University for the award of any degree or diploma.

Tammina Siva Naga Srinivasu, July 2014, Guwahati
Dept. of Electronics & Electrical Engineering, IIT Guwahati, Guwahati, Assam, India - 781039.

CERTIFICATE

This is to certify that the work contained in the thesis entitled Performance Evaluation of Video Interest Point Detectors is a bonafide work of Tammina Siva Naga Srinivasu, Roll No. 124102011, which has been carried out in the Department of Electronics and Electrical Engineering, Indian Institute of Technology Guwahati under my supervision, and that this work has not been submitted elsewhere for a degree.

Dr. Prabin Kumar Bora and Dr. Tony Jacob, July 2014, Guwahati
Dept. of Electronics & Electrical Engineering, IIT Guwahati, Guwahati, Assam, India - 781039.

"Difficulties in your life do not come to destroy you, but to help you to realize your potential and power; let difficulties know that you too are difficult." - Dr. A. P. J. Abdul Kalam

Abstract

This thesis presents an evaluation and comparison of various spatial and spatio-temporal interest point detectors. The popular spatial interest point detectors, the SUSAN detector, the Harris corner detector, the Harris-Laplace corner detector and the Scale Invariant Feature Transform (SIFT), are studied, and their performance is evaluated under affine transformations and JPEG compression. SIFT is found to perform better than the other spatial interest point detectors. Extensions of these spatial detectors into the spatio-temporal domain for videos, namely n-SIFT, MoSIFT, ESURF, STIP and LIFT, are then studied. LIFT detects sets of visually similar interest points called feature tracks. We propose an extension of LIFT, which we call ELIFT, that detects interest points on feature tracks; in this way we obtain interest points with motion. The performance of the video interest point detectors is evaluated under geometric and photometric transformations and MPEG compression, with repeatability and execution time as the evaluation criteria. n-SIFT performs better than the other detectors, but it also detects background points. MoSIFT and ELIFT detect a smaller number of interest points, which are more stable than those of the other algorithms.

Acknowledgements

I would like to sincerely thank my project supervisors Dr. Prabin Kumar Bora and Dr. Tony Jacob for giving me the opportunity to work on the project Performance Evaluation of Video Descriptors. They always came up with new, innovative ideas to help me whenever I was stuck, and I am very thankful to them for spending their valuable time clarifying my doubts.
I would also like to thank the Head of the Department and the other faculty members for their kind help in carrying out this work. I am always indebted to Indian Institute of Technology Guwahati for giving me the wonderful opportunity to study in a world-class institute. My heartfelt gratitude goes to my family for being at my side at all times. I will always remember my beloved friends, who made my stay at IIT Guwahati a pleasant and memorable period of my life.

Tammina Siva Naga Srinivasu, Indian Institute of Technology Guwahati, July 2014.

Contents

Abstract
Acknowledgements
Contents
List of Figures
List of Tables

1 Introduction
  1.1 Literature Survey
  1.2 Scope of Project
  1.3 Thesis Organization

2 Image Interest Point Detectors
  2.1 SUSAN Detector
  2.2 Harris Corner Detector
  2.3 Harris Laplace Corner Detector
  2.4 Scale Invariant Feature Transform
    2.4.1 Basic Algorithmic Principle
    2.4.2 Key-Point Detection
    2.4.3 Key-Point Localization
    2.4.4 Key-Point Orientation
    2.4.5 Key-Point Descriptor
    2.4.6 Image Matching
  2.5 Evaluation
  2.6 Conclusion

3 Video Interest Point Detectors
  3.1 n-SIFT
    3.1.1 Interest Point Detection
    3.1.2 n-SIFT Descriptor Computation
  3.2 MoSIFT
    3.2.1 Interest Point Detection
    3.2.2 Descriptor Computation
  3.3 STIP
    3.3.1 Interest Points in the Spatio-Temporal Domain
    3.3.2 Scale Selection in Space-Time
  3.4 ESURF
    3.4.1 Hessian based Interest Point Detection and Scale Selection
    3.4.2 Implementation Details
  3.5 ELIFT
    3.5.1 Feature Track Extraction
    3.5.2 Interest Points on Feature Track
  3.6 Evaluation
    3.6.1 Evaluation Criteria
    3.6.2 Experimental Results

4 Conclusions and Future Scope
  4.1 Conclusion
  4.2 Future Scope

List of Figures

2.1 Results of the Harris corner detector on (a) a synthetic image and (b) a texture image
2.2 SIFT DoG pyramid and extrema detection
2.3 SIFT descriptor
2.4 Image matching using SIFT features: (a) matching with 0.3 threshold and (b) matching with 0.6 threshold
2.5 Three image classes from the SIPI data set
2.6 Average repeatability with intensity
2.7 Average repeatability with rotation
2.8 Average repeatability with scale
2.9 Average repeatability with compression
3.1 n-SIFT image pyramid and DoG images [1]
3.2 The 4x4x4 regions used for n-SIFT descriptor computation [1]
3.3 n-SIFT example: frames with interest points marked
3.4 MoSIFT flow chart [2]
3.5 MoSIFT example: frames with interest points marked
3.6 STIP example: frames with interest points marked
3.7 The two types of box filter approximations for the 2+1D second order partial derivatives in one direction and in two directions [3]
3.8 ESURF example: frames with interest points marked
3.9 (a) Track with no motion; (b) track with motion and deviation in path
3.10 (a) Traced features (LIFT); (b) modified interest points (ELIFT)
3.11 Brightness change vs. repeatability
3.12 Scale change vs. repeatability
3.13 Compression change vs. repeatability

List of Tables

3.1 Execution time comparison

Chapter 1 Introduction

Digital videos are an important source of information and entertainment. Analyzing a video and extracting information from it is important for applications such as event detection in surveillance video, pirated-video detection and video classification for managing large databases. These tasks can be achieved using local features, or interest points. Many robust techniques proposed for interest point detection in images can be applied on a frame-by-frame basis, treating the video as a group of frames. The main difficulty is that a video carries megabytes or gigabytes of content whose temporally correlated information is largely redundant. Hence we need to find true spatio-temporal interest points. While capturing a scene, the video camera is free to rotate, change its zoom factor or move; the illumination can vary during capture, and the camera and the environment can add noise to the captured video. Videos are also compressed for storage and transmission. The challenging task is to obtain a stable, robust detector which addresses the above problems. The task can be made somewhat easier by fixing the camera during capture.
Many methods have been proposed for the detection of interest points in video for specific applications. There is a need for a comparative study of the stability of these interest point detectors under geometric and photometric variations, and their performance on other applications needs to be evaluated.

1.1 Literature Survey

Local features are image patterns which differ from their immediate neighborhood [4]. They are marked by changes in image properties such as intensity, colour and texture. Examples of local features are points, edge points and blobs. A detector is used to extract such features from an image; two examples are the Harris corner detector [5], which detects corners, and SIFT [6], which is a blob detector. Ideally, local features would represent meaningful object parts, but that requires high-level knowledge of the image content which is not available in the early stages of processing. Detectors therefore rely on the underlying image intensity variations. A feature detector should have good repeatability, informativeness and accuracy. It should be invariant or covariant to common image deformations such as rotation, translation, and intensity variations including brightness and contrast changes. Invariance means that the detected features do not change with image transformations. This property is very important for image processing applications such as image registration and object recognition, where images can be taken at different scales, from different viewpoints, or in the presence of occlusion. For example, the SIFT [6] descriptor of an interest point does not change under rotation or under brightness and contrast variations. Detectors can also be covariant under certain transforms, meaning that a feature can change its location while holding its structure; for example, an edge at a particular orientation changes its orientation direction when the image is rotated. Most interest point detectors are covariant, while the descriptors derived from them are made invariant through normalization.

Corners are interest points whose locations are invariant under rotations and translations of the image. Harris and Stephens [5] detected corner points in an image using a Gaussian-smoothed intensity gradient; this is popularly known as the Harris corner detector. Smith and Brady [7] considered points which are extreme in their circular neighborhood as SUSAN corners. Both of these corner detectors are rotation and translation invariant but vary with scale changes. The Harris corner detector was extended by Mikolajczyk and Schmid [8] to find interest points together with their scales, using the scale-space theory [9] proposed by Lindeberg. In 1999, David Lowe presented a new algorithm, the Scale Invariant Feature Transform (SIFT) [6], which is based on scale-space theory and uses the scale-space representation of an image to find interest points and to describe them with a robust descriptor based on the local gradient structure around the interest point. The algorithm was refined by 2004 to achieve sub-pixel accuracy in key-point localization, and it also describes ways to match the descriptors between images for object matching, registration and recognition under affine transformations. This thesis deals with the implementation of the Harris corner detector, the Harris-Laplace corner detector, the SUSAN corner detector and SIFT, and the evaluation of their performance under intensity and geometric transforms.

Similar to image interest point detectors, video interest point detectors detect the spatio-temporal interest points present in a video.
The existing image detectors can be extended to the spatio-temporal domain by changing the image saliency measure. Space-Time Interest Points (STIP) [10], proposed by Laptev, is an extension of the Harris-Laplace corner detector: the 2D second moment matrix is extended to a 3D second moment matrix using gradients in the spatio-temporal domain, and the spatial and temporal scales of the detected points are determined by maximizing the output of a normalized spatio-temporal Laplacian operator. In a similar fashion, SIFT, a very robust technique in image processing, was developed into an arbitrary n-dimensional extension by Cheung and Hamarneh as n-SIFT [1], aimed at applications in medical imaging including 3D MRI and dynamic 3D+time CT data. These techniques treat time as a third extension of space while detecting interest points. Another extension of SIFT was proposed by Chen and Hauptmann [2]: SIFT finds spatial interest points, which are then filtered in the temporal domain using their optical flow. The motion in the temporal domain is captured by the optical flow, and interest points with sufficient optical flow are taken as spatio-temporal interest points; hence the technique is called Motion SIFT (MoSIFT). SURF [11] was extended into the spatio-temporal domain by Willems and Tuytelaars [3] using the 3D Hessian as the saliency measure, with the computational performance improved by box filter approximations on an integral video structure. Mezaris and Dimou [12] proposed LIFT, which tracks the SIFT features present in each frame to form feature tracks: similarly moving, visually similar local regions. We extend LIFT to find interest points on feature tracks by retaining points with significant motion and rejecting most of the background points, which are redundant.

1.2 Scope of Project

The thesis mainly deals with the study of image and video interest point detectors and their evaluation. The image interest point detectors, the Harris corner detector, the SUSAN detector, the Harris-Laplace corner detector and SIFT, are studied in detail and their performance is evaluated. The video interest point detectors n-SIFT, MoSIFT, ESURF, STIP and LIFT are studied, and we propose a method to extract interest points on the tracks of LIFT. All the above video detectors are evaluated with respect to repeatability and execution time.

1.3 Thesis Organization

The thesis is organized as follows:

• Chapter 2 gives a detailed study of image interest point detectors and evaluates their performance.
• Chapter 3 presents an overview of video interest point detectors and evaluates their performance.
• Chapter 4 concludes the thesis with a summary of the work done and suggestions for future work.

Chapter 2 Image Interest Point Detectors

Image interest points which are invariant to the affine class of transformations, rotation, translation and scaling, as well as to brightness and contrast variations, are used in problems like image registration and object recognition. The literature on interest point detection in images is vast, and many different types of detectors exist. This chapter presents a study and evaluation of corner and blob detectors: SUSAN [7], the Harris corner detector [5] and the Harris-Laplace corner detector [8] are corner detectors, while SIFT [6] is a blob detector. SIFT is also used for image matching and the results are discussed.

2.1 SUSAN Detector

Corners are important local features in images. Generally, a corner can be defined as the intersection of two edges; corners are high-curvature points that lie at junctions of regions of different brightness. The SUSAN (Smallest Univalue Segment Assimilating Nucleus) detector [7] was proposed by Smith for detecting such corners. It places a circular mask over each pixel, the center of the mask being called the nucleus. The intensity of every pixel inside the mask is compared with that of the nucleus; the pixels having similar intensity form the USAN (Univalue Segment Assimilating Nucleus). If the area of the USAN is less than half of the total mask area, the nucleus is considered a corner point. The SUSAN detector's computational complexity is low, but it is sensitive to image deformations.
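The USAN-area test above translates almost directly into code. The following is a minimal sketch in Python with NumPy, written for clarity rather than speed; the mask radius and the intensity-similarity threshold t are illustrative values, not necessarily the settings used in this thesis.

```python
import numpy as np

def susan_corners(img, radius=3, t=27):
    """Minimal SUSAN corner test: a pixel (the nucleus) is a corner
    candidate when its USAN - the part of the circular mask with
    intensity similar to the nucleus - is smaller than half the mask."""
    img = img.astype(np.float32)
    h, w = img.shape
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    offsets = np.argwhere(xs ** 2 + ys ** 2 <= radius ** 2) - radius
    g = len(offsets) / 2.0                 # geometric threshold: half the mask area
    corners = []
    for y in range(radius, h - radius):
        for x in range(radius, w - radius):
            nucleus = img[y, x]
            usan = sum(abs(img[y + dy, x + dx] - nucleus) < t
                       for dy, dx in offsets)
            if usan < g:                   # small USAN => corner candidate
                corners.append((x, y, g - usan))   # g - usan as corner strength
    return corners
```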
2.2 Harris Corner Detector

The Harris corner detector [5] is based on the fact that the eigenvalues of the second moment matrix represent the signal changes in two orthogonal directions. A corner has a large variation in intensity in both directions, so both eigenvalues are large there. For an image I(x, y), consider a window function w(u, v) and the change E produced by a shift (x, y), given by Equation 2.1:

E(x, y) = \sum_{u,v} w(u, v) \, [I(u + x, v + y) - I(u, v)]^2    (2.1)

Expanding I(u + x, v + y) in a Taylor series and neglecting the higher-order terms gives

E(x, y) \approx \sum_{u,v} w(u, v) \, (x^2 I_x^2 + 2xy I_x I_y + y^2 I_y^2)

In matrix form this is

E(x, y) = [x \; y] \, M \, [x \; y]^T, \quad M = \sum_{u,v} w(u, v) \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix}

The derivatives are computed at a differentiation scale σ_D, and the entries of the matrix are averaged with a Gaussian window g(σ_I) of standard deviation σ_I to suppress noise, so that M becomes

M = g(σ_I) * \, σ_D^2 \begin{bmatrix} I_x^2(x, σ_D) & I_x(x, σ_D) I_y(x, σ_D) \\ I_x(x, σ_D) I_y(x, σ_D) & I_y^2(x, σ_D) \end{bmatrix}

The eigenvalues of M determine the existence of a corner. Instead of actually computing the eigenvalues, a quantity called cornerness is defined as

cornerness = \det(M) - k \, \mathrm{trace}(M)^2    (2.2)

where k = 0.04. The cornerness is the difference between the product of the eigenvalues and (a multiple of) the square of their sum. When the intensity changes in both the x and y directions, both eigenvalues are large and positive, so the cornerness is large and positive at a corner point.

Example 2.1. Consider the synthetically generated checkerboard image shown in Figure 2.1(a); almost all corners of the image are detected. The response of the Harris corner detector on a texture image is shown in Figure 2.1(b). It performs well on this image too, except that it misses some corners with low intensity variation; if the threshold is reduced to detect low-contrast corners, false detections appear.

Figure 2.1: Results of the Harris corner detector on (a) a synthetic image and (b) a texture image.

The main drawback of the Harris corner detector is that it is variant to scale changes. It is invariant to additive intensity changes, since it considers differences of intensity values rather than the values themselves.
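As a concrete illustration, here is a short NumPy/OpenCV sketch of the cornerness map of Equation 2.2 (OpenCV's cv2.cornerHarris computes the same measure directly); the differentiation and integration scales below are illustrative choices rather than the exact settings used in this chapter.

```python
import cv2
import numpy as np

def harris_cornerness(img, sigma_d=1.0, sigma_i=2.0, k=0.04):
    """Cornerness map det(M) - k*trace(M)^2 from a Gaussian-weighted
    second moment matrix M of the image gradients."""
    img = img.astype(np.float32)
    smooth = cv2.GaussianBlur(img, (0, 0), sigma_d)     # differentiation scale
    Ix = cv2.Sobel(smooth, cv2.CV_32F, 1, 0, ksize=3)
    Iy = cv2.Sobel(smooth, cv2.CV_32F, 0, 1, ksize=3)
    # Entries of M, averaged with a Gaussian window (integration scale).
    Ixx = cv2.GaussianBlur(Ix * Ix, (0, 0), sigma_i)
    Iyy = cv2.GaussianBlur(Iy * Iy, (0, 0), sigma_i)
    Ixy = cv2.GaussianBlur(Ix * Iy, (0, 0), sigma_i)
    det_m = Ixx * Iyy - Ixy ** 2
    tr_m = Ixx + Iyy
    return det_m - k * tr_m ** 2    # threshold > 0 and keep local maxima
```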
2.3 Harris Laplace Corner Detector

Corners can be detected by the Harris corner detector, but they are not scale invariant. The Harris-Laplace detector [8] provides scale-invariant features. It was developed from the Harris corner detector principle and gives scale-adapted local features, i.e., an interest point together with its scale of occurrence. The detector is built by constructing a scale space L(x, y; σ), a set of Gaussian-smoothed images. At every scale, corner points are found from the second moment matrix using the Harris corner principle (2.2), and a Laplacian-of-Gaussian scale space is constructed to find the characteristic scale of each interest point: the location and characteristic scale are given by the maxima of the Laplacian operator. A corner is accepted as an interest point if it attains an extremum with respect to its 8 neighbors in the current scale and its 18 neighbors in the scales above and below the current scale.

2.4 Scale Invariant Feature Transform

The Scale Invariant Feature Transform (SIFT) was developed by David Lowe [6]. SIFT interest points are scale invariant and robust under affine transformations. SIFT is based on the scale-space theory developed by Lindeberg [9].

2.4.1 Basic Algorithmic Principle

From a pyramid of images (images with increasing blur), SIFT detects a series of key points, which are mostly blob-like structures, and accurately locates their positions and scales. It then calculates a dominant orientation θ_ref over a region surrounding each key point. The knowledge of location, scale and orientation is used to compute a descriptor for the key point from the normalized patch around it. The descriptor encodes the spatial gradient distribution around the key point in a 128-dimensional vector, which makes SIFT robust to rotation, scale and translation variations. The main steps of SIFT as explained by Lowe are:

• Key-Point Detection: A pyramid of Gaussian-blurred images at different scales is formed and extrema are located in the Difference of Gaussian (DoG) images.
• Key-Point Localization: Each key point is fine-tuned with a Taylor series model to obtain sub-pixel accuracy, and key points are selected based on their stability.
• Key-Point Orientation Computation: One or more orientations are assigned to each key point based on the local gradients. This orientation is used in the subsequent steps and makes SIFT rotation invariant.
• Key-Point Descriptor Computation: From the orientation-normalized patch around a key point, a descriptor is formed from the local image gradients at the selected scale and orientation.

2.4.2 Key-Point Detection

Gaussian Scale Space. Lindeberg, in his detailed mathematical analysis [9] of the multi-scale representation of images, showed that the Gaussian kernel is the unique scale-space kernel. A Gaussian scale space is simply a stack of increasingly blurred images; the blurring simulates the loss produced when a photograph is taken with increasing zoom-out factors. For an image I(x, y), the Gaussian-blurred image L(x, y; σ) is defined as

L(x, y; σ) = g(x, y; σ) * I    (2.3)

where g(x, y; σ) is the Gaussian kernel of standard deviation σ:

g(x, y; σ) = \frac{1}{2\pi σ^2} e^{-(x^2 + y^2)/2σ^2}

The Gaussian smoothing operator satisfies the semi-group relation of Equation 2.4 (variances add under convolution), which allows the Gaussian scale space to be constructed by successive incremental blurs:

g(\cdot; σ_1) * g(\cdot; σ_2) = g(\cdot; \sqrt{σ_1^2 + σ_2^2})    (2.4)

Digital Gaussian Smoothing. The two-dimensional digital Gaussian kernel is obtained by sampling and truncating the Gaussian function:

g(x, y; σ) = K e^{-(x^2 + y^2)/2σ^2}, \quad \forall \, |x|, |y| \le \lfloor 4σ \rfloor    (2.5)

where ⌊·⌋ is the floor function, which truncates a fraction to an integer, and K is a normalization constant.
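The semi-group relation 2.4 is what makes scale-space construction cheap: a target blur can be reached incrementally from an already-blurred image instead of from the original. A quick numerical check in OpenCV (a sketch; the file name and scale values are placeholders):

```python
import cv2
import numpy as np

img = cv2.imread("image.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)

sigma_target = 2.0
direct = cv2.GaussianBlur(img, (0, 0), sigma_target)

# Reach the same blur in two steps: sigma1, then sqrt(target^2 - sigma1^2).
sigma1 = 1.2
sigma2 = np.sqrt(sigma_target ** 2 - sigma1 ** 2)
step1 = cv2.GaussianBlur(img, (0, 0), sigma1)
incremental = cv2.GaussianBlur(step1, (0, 0), sigma2)

# The two results agree up to the truncation error of the digital kernel.
print(np.abs(direct - incremental).max())
```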
The scale space of images is constructed by 2-D convolution of the image with the digital Gaussian kernel.

In his scale-space theory, Lindeberg [9] proposed the scale-normalized Laplacian for detecting scale. Lowe proposed the Difference of Gaussian (DoG) function as an approximation of the scale-normalized Laplacian and obtained key points that are stable under scaling. The approximation can be derived from the diffusion equation (2.6):

\frac{\partial g}{\partial σ} = σ \nabla^2 g    (2.6)

Approximating the derivative by a finite difference,

\frac{\partial g}{\partial σ} \approx \frac{g(x, y; kσ) - g(x, y; σ)}{kσ - σ}

so that, from Equation 2.6,

g(x, y; kσ) - g(x, y; σ) \approx (k - 1) \, σ^2 \nabla^2 g

The Difference of Gaussian D(x, y; σ), given in Equation 2.7, is the difference of Gaussian-smoothed images whose standard deviations are separated by the constant factor k:

D(x, y; σ) = L(x, y; kσ) - L(x, y; σ)    (2.7)

Lowe constructs the scale space with the step value k = 2^{1/s}, where s is the number of scales in an octave, within which the σ of the Gaussian kernel doubles. The scheme for constructing the scale space is shown in Figure 2.2. Key points are extrema in the DoG images and are extracted as follows:

1. Linearly interpolate the input image to twice its size.
2. Pre-smooth the image with a Gaussian filter so that it has blur σ = 1.6 before it is given as input to the first octave level. The blur of the input image is assumed to be 0.5; this must be taken into account when producing the initial blur of 1.6. Using the semi-group property of the Gaussian scale space, calculate the extra blur to be added.
3. Blur the image with Gaussian filters of standard deviation σ, kσ, k²σ and so on to create the s + 3 Gaussian-smoothed images of the octave, where Lowe recommends s = 3. The blur at the s-th level of octave o is given by σ(o, s) = 1.6 × 2^{o + s/3}.
4. Subtract consecutive Gaussian-convolved images as shown in Figure 2.2 to get s + 2 DoG images.
5. Detect all extrema in the DoG images as shown in Figure 2.2. An extremum is a maximum or minimum with respect to its 8 neighbors in the same scale and its 9 neighbors in each of the scales above and below the present scale.
6. Once the octave is processed, subsample the image to half its size in both dimensions by taking every second pixel in each row and column. The subsampled image is then processed in the next octave level with steps 3 to 6, and the process continues through all octaves of the pyramid.

The number of octaves is o = log₂(min(row, col)), where row and col are the dimensions of the image.

Figure 2.2: SIFT DoG pyramid and extrema detection [6].
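A compact sketch of steps 3-6 in Python/OpenCV is given below. The octave base is assumed to already carry the initial blur of 1.6 (steps 1-2), and each further level is produced with the incremental blur given by the semi-group relation; the parameter names are ours, not Lowe's.

```python
import cv2
import numpy as np

def dog_pyramid(base, n_octaves=4, s=3, sigma0=1.6):
    """Gaussian and DoG pyramids of Figure 2.2: s+3 Gaussian images per
    octave (step k = 2**(1/s)) give s+2 DoG images; the image with twice
    the octave's initial blur is subsampled to seed the next octave."""
    k = 2.0 ** (1.0 / s)
    pyramid = []
    base = base.astype(np.float32)          # assumed blur: sigma0
    for _ in range(n_octaves):
        gauss = [base]
        for i in range(1, s + 3):
            target, have = sigma0 * k ** i, sigma0 * k ** (i - 1)
            extra = np.sqrt(target ** 2 - have ** 2)    # semi-group relation
            gauss.append(cv2.GaussianBlur(gauss[-1], (0, 0), extra))
        pyramid.append([b - a for a, b in zip(gauss, gauss[1:])])
        base = gauss[s][::2, ::2]           # blur 2*sigma0, subsampled by 2
    return pyramid

# An extremum is a DoG sample that is larger (or smaller) than all of its
# 8 neighbours in its own scale and 9 in each of the adjacent scales.
```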
2.4.3 Key-Point Localization

SIFT aims for a descriptor invariant to affine transformations, which is necessary for image registration and object detection, and sub-pixel accuracy of the key-point location is desirable for such problems. Up to this point the location of a key point is confined to the sampling grid; such a coarse location is not sufficient to reach full scale and translation invariance. SIFT refines the position of a key point using a Taylor series expansion of the DoG function D:

D(X) = D + \frac{\partial D}{\partial X}^T X + \frac{1}{2} X^T \frac{\partial^2 D}{\partial X^2} X    (2.8)

where D and its derivatives are evaluated at the key-point location and X = (x, y, σ)^T is the offset from it. The local extremum is obtained by setting the derivative of D(X) with respect to X to zero, which yields the displacement vector from the key-point position:

\hat{X} = -\left( \frac{\partial^2 D}{\partial X^2} \right)^{-1} \frac{\partial D}{\partial X}    (2.9)

If \hat{X} is larger than 0.5 in any of its dimensions, the extremum lies closer to a neighboring sample point, so the key-point position is updated and the refinement is repeated; otherwise the offset is accepted and the process continues with the next point. Substituting \hat{X} into D(X) (2.8) gives the function value at the extremum. Lowe used this value to eliminate weak, unstable points: if the magnitude of the function value is below a threshold of 0.03 (with image intensities in [0, 1]), the key point is discarded; otherwise it is retained as an interest point.

Eliminating Points on Edges. The DoG extrema were found to contain strong responses along edges, which are not good interest points with respect to rotations or affine transformations. To detect edge responses, Lowe used the Hessian matrix (2.10). Since the eigenvalues of the Hessian correspond to the principal curvatures, their ratio deviates from unity by a large factor when the point lies on an edge. If the eigenvalues have opposite signs, the point is a saddle point and hence not an extremum.

H = \begin{bmatrix} D_{xx} & D_{xy} \\ D_{xy} & D_{yy} \end{bmatrix}    (2.10)

Lowe considered the ratio r = α/β of the largest eigenvalue α to the smaller one β, which is large (r > 10) for an edge point. Then

\frac{\mathrm{trace}(H)^2}{\det(H)} = \frac{(α + β)^2}{αβ} = \frac{(rβ + β)^2}{rβ^2} = \frac{(1 + r)^2}{r}

Taking c_edge = 10 as the threshold, a key point is discarded as an edge point if

\frac{\mathrm{trace}(H)^2}{\det(H)} > \frac{(c_{edge} + 1)^2}{c_{edge}}

2.4.4 Key-Point Orientation

Rotation-invariant descriptors can be classified into two categories: those based on image properties which are already rotation invariant, and those which are normalized with respect to a reference orientation. SIFT belongs to the second category. The key-point orientation corresponds to the dominant gradient orientation in the neighborhood of the key point and is found by the following procedure:

1. At each (x, y; σ), compute the gradient magnitude and orientation using pixel differences:

m(x, y) = \sqrt{(L(x+1, y; σ) - L(x-1, y; σ))^2 + (L(x, y+1; σ) - L(x, y-1; σ))^2}

θ(x, y) = \tan^{-1}\!\left( \frac{L(x, y+1; σ) - L(x, y-1; σ)}{L(x+1, y; σ) - L(x-1, y; σ)} \right)    (2.11)

Weight the gradient magnitudes with a Gaussian window whose σ is 1.6 times the scale of the key point; this gives more importance to the gradients closer to the key point.

2. Compute an orientation histogram of 36 bins covering the 360 degrees of gradient orientations, with each sample contributing its gradient magnitude to the bin corresponding to its gradient orientation.

3. Fit a parabola to the 3 bins around the highest value to interpolate the peak position, which determines the key-point orientation. If any other bin reaches 80% or more of the magnitude of the highest bin, it is also used to create a new key point at the same (x, y, σ) but with the orientation corresponding to that bin.

Thus the key point (x, y, σ, θ) is localized, and the descriptor for this key point is now computed from the gradients.
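The orientation assignment can be sketched as follows (Python/NumPy). The Gaussian weighting and the 80% rule follow the procedure above; the window radius is an illustrative choice, and the parabolic interpolation of the peak is omitted for brevity.

```python
import numpy as np

def keypoint_orientations(L, x, y, sigma, n_bins=36):
    """36-bin gradient-orientation histogram around (x, y) in the Gaussian
    image L (Equation 2.11); returns the dominant orientation(s)."""
    hist = np.zeros(n_bins)
    radius = int(round(3 * 1.6 * sigma))      # window covering the weighting Gaussian
    for v in range(-radius, radius + 1):
        for u in range(-radius, radius + 1):
            yy, xx = y + v, x + u
            if not (0 < xx < L.shape[1] - 1 and 0 < yy < L.shape[0] - 1):
                continue
            dx = L[yy, xx + 1] - L[yy, xx - 1]    # pixel-difference gradients
            dy = L[yy + 1, xx] - L[yy - 1, xx]
            w = np.exp(-(u * u + v * v) / (2 * (1.6 * sigma) ** 2))
            b = int((np.arctan2(dy, dx) + np.pi) / (2 * np.pi) * n_bins) % n_bins
            hist[b] += w * np.hypot(dx, dy)
    # Every bin within 80% of the peak spawns a key-point orientation.
    return [(b + 0.5) * 2 * np.pi / n_bins - np.pi
            for b in range(n_bins) if hist[b] >= 0.8 * hist.max()]
```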
2.4.5 Key-Point Descriptor

A local patch of 16x16 pixels is taken in L(x, y; σ) around the key point. The region is split into sixteen 4x4 sub-regions, and in each sub-region an 8-bin histogram is computed in the same way as the orientation histogram, with each sample contributing its gradient magnitude to the bin corresponding to its gradient orientation, as shown in Figure 2.3. The sixteen 8-bin histograms are then concatenated to form the 128-bin key-point descriptor.

Figure 2.3: SIFT descriptor [6].

In addition to these steps, Lowe introduced two further steps to minimize the effects of lighting on the descriptor:

1. Normalize the descriptor to unit length. This reduces the effect of contrast changes, which scale the image gradients in the neighborhood by a constant factor.
2. Threshold any bin value above 0.2 to 0.2 and normalize the descriptor again. This reduces the effect of directional illumination changes, which can cause large gradient-magnitude changes in particular directions; limiting the bin values to 0.2 reduces the influence of such large gradients on the descriptor.

Thus each SIFT key point (x, y, σ, θ) is described by a 128-bin histogram descriptor.

2.4.6 Image Matching

SIFT has many applications such as object recognition, image registration and image matching; of these, we consider image matching. It is done by the following steps:

1. Compute the SIFT descriptors for both images.
2. Pair key points of the two images whose descriptors have a Euclidean distance below a threshold; Lowe suggested 0.6 as the threshold.
3. If the ratio of the distance of the best match to the distance of the second-best match is more than 0.8, the key point is discarded.

Example 2.2. Two images at different scales were given to the matching algorithm. SIFT detected 1022 and 939 key points in the two images. With 0.6 as the threshold, 405 matched key points were obtained between the two images, as shown in Figure 2.4(b); with a threshold of 0.3, only 122 points were matched, as shown in Figure 2.4(a). Depending on the application and the nature of the images, the threshold can be varied to obtain better matching performance.

Figure 2.4: Image matching using SIFT features: (a) matching with 0.3 threshold and (b) matching with 0.6 threshold.
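A minimal NumPy sketch of the matching procedure, combining the distance threshold of step 2 with the ratio test of step 3 (thresholds as above; OpenCV's BFMatcher with knnMatch provides the same test out of the box):

```python
import numpy as np

def match_sift(desc1, desc2, dist_thresh=0.6, ratio=0.8):
    """Match two sets of 128-D SIFT descriptors (rows of desc1/desc2):
    accept a pair when the best Euclidean distance is below dist_thresh
    and clearly smaller than the second-best distance."""
    matches = []
    for i, d in enumerate(desc1):
        dists = np.linalg.norm(desc2 - d, axis=1)
        j, j2 = np.argsort(dists)[:2]
        if dists[j] < dist_thresh and dists[j] < ratio * dists[j2]:
            matches.append((i, j, dists[j]))
    return matches
```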
2.5 Evaluation

This section deals with the performance evaluation of the image interest point detectors. We consider repeatability as the performance measure, which tests a detector's behavior under image deformations such as compression, intensity variation, scale changes and rotation. The SUSAN detector, the Harris corner detector, the Harris-Laplace corner detector and SIFT are evaluated. Since only SIFT has a descriptor, only the interest point detection parts of the algorithms are compared. The images used for the evaluation were taken from the SIPI data set [13], which contains three classes of images, namely textures, aerials and miscellaneous images; one image from each class is shown in Figure 2.5. This offers a wide range of data for testing the algorithms.

Figure 2.5: Three image classes from the SIPI data set.

Repeatability Criterion. The repeatability criterion tests whether a detector is invariant to changes in imaging conditions and data storage. It is defined by Hendaoui and Abdellaoui in [14] as

r_i^d(I_i, I_i^d) = \frac{C(I_i, I_i^d)}{\mathrm{mean}(m_i, m_i^d)}    (2.12)

where I_i and I_i^d are the original and deformed images respectively, C(I_i, I_i^d) is the number of matched interest points, and m_i and m_i^d are the numbers of interest points in I_i and I_i^d respectively. For the performance evaluation, six images from each class of the data set were considered, and the repeatability was averaged over the six images of each class according to Equation 2.13:

r_{avg}^d = \frac{1}{6} \sum_{i=1}^{6} r_i^d    (2.13)
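For reproducibility, the repeatability measure of Equations 2.12-2.13 can be implemented as below (a sketch). Matching is done by proximity under the known ground-truth transform; the pixel tolerance is our assumption, since the matching rule is not spelled out here.

```python
import numpy as np

def repeatability(pts_ref, pts_def, H=None, tol=1.5):
    """Equation 2.12: matched interest points divided by the mean count.
    pts_* are (N, 2) arrays of (x, y); H optionally maps reference points
    into the deformed image (identity for intensity or compression changes)."""
    p = pts_ref.astype(np.float64)
    if H is not None:                                   # known homography
        ph = np.hstack([p, np.ones((len(p), 1))]) @ H.T
        p = ph[:, :2] / ph[:, 2:3]
    d = np.linalg.norm(p[:, None, :] - pts_def[None, :, :], axis=2)
    matched = int((d.min(axis=1) <= tol).sum())         # repeated points
    return matched / np.mean([len(pts_ref), len(pts_def)])

def average_repeatability(rs):
    """Equation 2.13: mean over the six images of a class."""
    return sum(rs) / len(rs)
```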
Intensity Variations. The overall brightness of each image is varied from -40 to 40 using the relation I_new = I_image ± d, with the image intensity limited to [0, 255]. The interest point positions should ideally be unchanged under intensity variations, but as the plots show, the repeatability of the detectors falls as the intensity changes. Gradients, being differences of pixel intensities, should ideally be invariant under additive intensity variations; however, pixel intensities close to white (255) or black (0) may saturate in the process, which affects the neighborhood gradients. As the additive value increases, more pixels become white or black, and the detectors' performance drops. All detectors are affected by brightness variations in all three image classes. SUSAN gives a better response, since it considers a circular neighborhood (7 pixels across) to detect a point, and it is less common for the majority of those pixels to turn black or white under the intensity variation. The responses of SIFT and SUSAN are better than those of the Harris and Harris-Laplace corner detectors. The results of intensity variation are shown in Figure 2.6(a) for aerial images, Figure 2.6(b) for miscellaneous images and Figure 2.6(c) for texture images.

Figure 2.6: Average repeatability with intensity variation: (a) aerial images, (b) miscellaneous images, (c) texture images.

Image Rotations. The images and their rotated versions were taken from the SIPI image database [13]; it contains texture images with rotations spanning 0 to 200 degrees. SIFT shows better repeatability when its descriptor is used, since the orientation of each interest point is normalized before the descriptor is calculated, which gives it rotation invariance. Without the descriptor, SIFT shows lower repeatability, almost constant over the angles. The Harris and Harris-Laplace detectors show almost no repeatability, since they use 1-D gradients for detection, which are sensitive to rotation. The SUSAN detector shows better repeatability than Harris, since it uses a circular neighborhood for detection, and rotation of the image mostly does not change a circular neighborhood. The results are shown in Figure 2.7.

Figure 2.7: Average repeatability with rotation.

Scale Variations. The images and their scaled versions, from 50% to 200%, are used to evaluate the detectors' response to scaling. Note that SIFT and the Harris-Laplace detector account for scale invariance, while the SUSAN and Harris corners are not scale invariant. All detectors except SIFT show poor repeatability with scale; Harris-Laplace performs better than Harris but worse than SIFT. The results of scale variation are shown in Figure 2.8(a) for aerial images, Figure 2.8(b) for miscellaneous images and Figure 2.8(c) for texture images.

Figure 2.8: Average repeatability with scale variation: (a) aerial images, (b) miscellaneous images, (c) texture images.

Compression Variations. The images were compressed in JPEG format on a scale of [1, 99], where the scale represents the inverse of the compression ratio. Since the compression is lossy, it changes the pixel values, which in turn changes the gradients; hence every detector is expected to show decreasing repeatability as the compression increases. The detectors performed better on the compressed aerial and texture images, which contain redundant information. SIFT and the Harris-Laplace detector performed better than the others: because they work in the scale space of the image, they also detect interest points at blurred (approximated) scales. The results of compression variation are shown in Figure 2.9(a) for aerial images, Figure 2.9(b) for miscellaneous images and Figure 2.9(c) for texture images.

Figure 2.9: Average repeatability with compression: (a) aerial images, (b) miscellaneous images, (c) texture images.

2.6 Conclusion

The investigations in this chapter show that SIFT exhibits better repeatability than the other detectors under most image deformations; without its descriptor, however, SIFT too has poor repeatability under rotation. The descriptor is responsible for the rotation invariance of SIFT, and with it SIFT became the most popular of these detectors. SIFT also detects the largest number of interest points. The SUSAN detector has the best response to rotation among the corner detectors, and the Harris-Laplace detector responds better than the Harris corner detector to scale variations.

Chapter 3 Video Interest Point Detectors

A number of video interest point detectors have been proposed for the detection of spatio-temporal interest points. Some approaches use robust image detectors on a frame-by-frame basis for applications like duplicate video detection [15] and video stitching [16]; these are not true video detectors, as they do not consider the time dimension, which is critical for videos. Image detectors have been extended to videos in several different forms. SIFT [6] was extended to n-SIFT [1] by including temporal gradients in the saliency measure. MoSIFT [2] uses SIFT in the spatial domain and optical flow in the temporal domain. STIP [10] is an extension of the Harris-Laplace detector [8] that uses 3D gradients. ESURF [3] is a direct extension of SURF to videos using the integral video concept and box filters. LIFT [12] is a trajectory-based technique which matches image descriptors between frames; Local Invariant Feature Tracks form another class of video features, consisting of similarly moving, visually similar regions. We propose an extension of LIFT, ELIFT, which picks interest points on the feature tracks. This chapter presents a study of all these video detectors and compares their performance under deformations such as scaling, change in brightness and compression; the results are discussed at the end of the chapter.
3.1 n-SIFT

n-SIFT [1] was proposed by Cheung and Ghassan Hamarneh for the detection and matching of interest points in medical images, including 3D MRI images and 3D+time CT images. It is a direct extension of SIFT from 2D to arbitrary nD images.

Figure 3.1: n-SIFT image pyramid and DoG images [1].

3.1.1 Interest Point Detection

The nD image is progressively convolved with an nD Gaussian and subsampled for the next octave. The Difference of Gaussian (DoG) is formed by the difference between successive levels in an octave, and a point is an interest point when it is an extremum compared with its neighbors in all n dimensions. The pyramid construction is similar to SIFT and is shown in Figure 3.1; the interest point detection procedure described by Cheung [1] is given in Algorithm 1.

Algorithm 1: n-SIFT interest point detection [1]
1: I_0 = I; F = ∅
2: for octave i = 0 to maxImageLevel do
3:   for j = 1 to s + 3 do
4:     I_{i,j} = GaussianFilter(I_i, k^j σ)        // filter I_i with nD Gaussian
5:     if j > 1 then dI_{i,j} = I_{i,j} - I_{i,j-1}  // DoG levels
6:   end for
7:   for j = 2 to s + 2 do
8:     P = FindExtrema(dI_{i,j-1}, dI_{i,j}, dI_{i,j+1})
9:     for p ∈ P do F = F ∪ GenerateFeatures(p, I_i) end for
10:  end for
11:  I_{i+1} = Scale(GaussianFilter(I_i, σ), 0.5)   // subsample for the next octave
12: end for
13: return F

The steps of Algorithm 1 are described as follows. We convolve the nD input image with nD Gaussians of standard deviation σ, kσ, k²σ and so on to create s + 3 Gaussian-smoothed images in an octave; their differences give s + 2 DoG images. Interest points are the extrema in the DoG images: a voxel p_j(x_1, x_2, ..., x_n) is a local extremum iff it satisfies

|p_j(x_1, ..., x_n)| ≥ |p_{j+i_0}(x_1 + i_1, ..., x_n + i_n)|  or  |p_j(x_1, ..., x_n)| ≤ |p_{j+i_0}(x_1 + i_1, ..., x_n + i_n)|    (3.1)

for all i_0, i_1, ..., i_n ∈ {-1, 0, 1}, i.e., compared with all of its neighbors in space, time and the adjacent scales. The image is then subsampled by half in all dimensions and becomes the input of the next octave. For a video, the number of octaves is o = log₂(min(m, n, t)), where m and n are the frame dimensions and t is the number of frames. However, it has been observed that no meaningful interest points are detected by n-SIFT after 3-4 octaves, so we take the number of octaves as min(4, o). The initial blur is taken as 1.6 and s = 3, as suggested by Lowe [6].
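The extremum test of Equation 3.1 is easy to express with min/max filters. Below is a sketch using SciPy for one octave of a video treated as a 3D volume over scales; it checks a 3x3x3x3 neighbourhood across space, time and the adjacent DoG scales, on the magnitudes as in Equation 3.1.

```python
import numpy as np
from scipy.ndimage import maximum_filter, minimum_filter

def nsift_extrema(dog_levels):
    """dog_levels: list of 3-D DoG volumes (rows, cols, frames) at
    consecutive scales of one octave. Returns (scale, x, y, t) indices of
    voxels that are extrema of |DoG| over all 3**4 - 1 neighbours."""
    d = np.abs(np.stack(dog_levels, axis=0))        # (scale, x, y, t)
    nb = (3, 3, 3, 3)                               # scale, space, space, time
    ext = (d == maximum_filter(d, size=nb)) | (d == minimum_filter(d, size=nb))
    ext[0], ext[-1] = False, False                  # need scales above and below
    return np.argwhere(ext)
```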
3.1.2 n-SIFT Descriptor Computation

n-SIFT uses the nD gradients to create a 2^{5n-3}-dimensional descriptor for every feature point, similar to the SIFT descriptor. The steps to calculate the descriptor are as follows:

1. Select a 16^n hypercube around the interest point, as shown in Figure 3.2. Calculate the nD gradients and express them in terms of magnitude and orientation direction.
2. Weight the gradient magnitudes by a Gaussian centered at the interest point location, giving more weight to pixels near the interest point.
3. Compute the (n-1)-dimensional orientation histogram, with each voxel's gradient magnitude added to the bin corresponding to its orientation.
4. Find the bin with the highest value. This gives the dominant orientation, which is subtracted from the neighboring voxel orientations to make the descriptor rotation invariant.
5. Divide the hypercube into 4^n sub-parts, as shown in Figure 3.2, each described by 8 bins for each of the (n-1) orientation directions. Each sub-region is thus represented by 8^{n-1} bins, and in total a 4^n · 8^{n-1} = 2^{5n-3}-dimensional feature vector is created.
6. Normalize the feature vector, as in SIFT, to account for gradient scaling caused by brightness variations.

Figure 3.2: The 4x4x4 regions used for n-SIFT descriptor computation [1].

Example 3.1. We applied the n-SIFT algorithm to a sample walking sequence. As shown in Figure 3.3, the algorithm detects points around the head, hands and legs, where there is motion. However, it also detects interest points with no motion; intensity variations between the frames are the reason for such redundant points. The most noticeable trait of n-SIFT is that it detects a large number of spatial interest points without motion. Another problem is its large memory usage: since it treats the video as a 3D image and builds octaves from it, it requires a large amount of memory.

Figure 3.3: n-SIFT example: frames with interest points marked.

3.2 MoSIFT

MoSIFT, also known as Motion SIFT [2], was proposed by Chen and Hauptmann to detect interest points used for describing motion and for action recognition in videos. Unlike the previous algorithms, it uses optical flow instead of the temporal gradient to capture motion. A detailed view of this method is given in the following subsections.

3.2.1 Interest Point Detection

MoSIFT uses SIFT for the detection of interest points in the spatial domain on a frame-by-frame basis, and optical flow for the temporal domain; the flow chart of the algorithm is given in Figure 3.4. Interest points are first found in the spatial domain by applying SIFT to one frame at a time: a Gaussian pyramid of octaves is constructed and differences are taken to get the DoG, whose extrema are the interest points, as explained in Chapter 2. These interest points must then be refined to get the actual MoSIFT interest points, with optical flow taken as the saliency measure for the time dimension.

Figure 3.4: MoSIFT flow chart [2].

Optical flow [17] is used to find the moving interest points in the video. Optical flow is based on brightness constancy and smoothness. Let I(x, y, t) denote the image brightness at point (x, y) of the image plane at time t. The brightness constraint assumes that the brightness of a particular point of the pattern is constant, so that dI/dt = 0, which gives

\frac{\partial I}{\partial x} \frac{dx}{dt} + \frac{\partial I}{\partial y} \frac{dy}{dt} + \frac{\partial I}{\partial t} = 0 \;\; \Rightarrow \;\; I_x v_x + I_y v_y + I_t = 0    (3.2)

where v_x and v_y are the optical flow velocities in the x and y directions between consecutive frames. There are many techniques for solving this equation; here, the least-squares method proposed by Lucas and Kanade [17] is used. Figure 3.4 shows the block diagram for computing the optical flow between successive frames. A pair of pyramids from two successive frames is used to compute the optical flow at the same image pyramid level at which the SIFT interest point was detected. If the magnitude of the optical flow at a SIFT interest point is above a threshold, the point is considered a spatio-temporal interest point of the video. SIFT interest points can also be matched between frames and the optical flow threshold applied to the matched points; this reduces the number of interest points and improves their stability across the frames.
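The essence of the detector, "SIFT point plus sufficient flow", fits in a few lines of OpenCV. In this sketch a dense Farnebäck flow stands in for the per-pyramid-level Lucas-Kanade flow of the original method, and the flow threshold is an illustrative value:

```python
import cv2
import numpy as np

def mosift_candidates(prev_gray, gray, flow_thresh=0.5):
    """Keep SIFT keypoints of the current frame only where the optical
    flow magnitude against the previous frame is large enough."""
    keypoints = cv2.SIFT_create().detect(gray, None)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        pyr_scale=0.5, levels=3, winsize=15,
                                        iterations=3, poly_n=5,
                                        poly_sigma=1.2, flags=0)
    mag = np.linalg.norm(flow, axis=2)              # per-pixel flow magnitude
    return [kp for kp in keypoints
            if mag[int(kp.pt[1]), int(kp.pt[0])] > flow_thresh]
```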
3.2.2 Descriptor Computation

MoSIFT already computes the DoG and the optical flow, so both the spatial gradients and the optical flow are readily available. MoSIFT describes an interest point with a histogram of gradients (HoG) together with a histogram of optical flow (HoF); the MoSIFT descriptor is obtained by concatenating the HoG and HoF into a 256-bin vector. The SIFT (HoG) part is computed as described in Chapter 2. To describe the motion around the interest point, a descriptor is computed from the local optical flow in the same way the SIFT descriptor is computed from the image gradients: optical flow, like a gradient, has both a direction and a magnitude. A local patch of 16x16 is taken around the interest point and divided into 4x4 patches, each described by an 8-bin orientation histogram; each optical flow value contributes to the bin of its orientation direction with a weight given by its magnitude. The sixteen 8-bin histograms are concatenated into a 128-bin optical flow histogram. The dominant-orientation normalization is omitted here, since the actual direction of motion is important.

Example 3.2. We applied the MoSIFT algorithm to a sample walking sequence. As shown in Figure 3.5, the algorithm detects points around the head, hands and legs, where there is motion between frames, along with some nearby points detected due to the Gaussian weighting of the neighborhood. The main noticeable traits of this technique are that uniformly moving points are captured repeatedly in multiple frames, which is redundant, and that the number of detected interest points is small.

Figure 3.5: MoSIFT example: frames with interest points marked.

3.3 STIP

3.3.1 Interest Points in the Spatio-Temporal Domain

Space-Time Interest Points (STIP) were developed by Laptev [10] using the 3D gradient structure together with a scale-space representation to find interest points in video. The video f(x, y, t) is convolved with a 3D Gaussian g(·; σ_l², τ_l²) to give the scale-space representation

L(\cdot; σ_l^2, τ_l^2) = g(\cdot; σ_l^2, τ_l^2) * f

where

g(x, y, t; σ_l^2, τ_l^2) = \frac{1}{\sqrt{(2\pi)^3 σ_l^4 τ_l^2}} \exp\!\left( -\frac{x^2 + y^2}{2σ_l^2} - \frac{t^2}{2τ_l^2} \right)

Similar to the Harris corner detector, we consider the second moment matrix of Equation 3.3, in which the first-order spatial and temporal gradients are smoothed by a Gaussian:

\mu = g(\cdot; σ_i^2, τ_i^2) * \begin{bmatrix} L_x^2 & L_x L_y & L_x L_t \\ L_x L_y & L_y^2 & L_y L_t \\ L_x L_t & L_y L_t & L_t^2 \end{bmatrix}    (3.3)

where L_x, L_y, L_t are the first-order derivatives of L(x, y; σ_l², τ_l²), and σ_i, τ_i are the integration scales related to the local scales by σ_i = s·σ_l and τ_i = s·τ_l, with s = 2 the preferred value. Interest points of the video are points where the matrix μ has large eigenvalues. Instead of computing the eigenvalues, we can apply the indirect method used by Harris for spatial interest points: the Harris corner function for videos is defined as

H = \det(\mu) - k \, \mathrm{trace}^3(\mu) = λ_1 λ_2 λ_3 - k (λ_1 + λ_2 + λ_3)^3    (3.4)

with k = 0.005. If H is greater than a positive threshold, the point is taken as a spatio-temporal interest point. The spatial and temporal scales of the detected points are important for capturing events with different spatial and temporal extents.
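A direct (unoptimized) sketch of Equations 3.3-3.4 with SciPy, for a single pair of scales (σ_l, τ_l); scale selection is layered on top of this as described next.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def stip_response(video, sigma=2.0, tau=1.5, s=2.0, k=0.005):
    """H = det(mu) - k*trace(mu)^3 for a video of shape (T, H, W), with mu
    the Gaussian-smoothed 3-D second moment matrix of Equation 3.3."""
    L = gaussian_filter(video.astype(np.float64), sigma=(tau, sigma, sigma))
    Lt, Ly, Lx = np.gradient(L)                     # axes: t, y, x
    g = lambda a: gaussian_filter(a, sigma=(s * tau, s * sigma, s * sigma))
    xx, yy, tt = g(Lx * Lx), g(Ly * Ly), g(Lt * Lt) # integration scales s*sigma, s*tau
    xy, xt, yt = g(Lx * Ly), g(Lx * Lt), g(Ly * Lt)
    det = (xx * (yy * tt - yt ** 2) - xy * (xy * tt - yt * xt)
           + xt * (xy * yt - yy * xt))
    return det - k * (xx + yy + tt) ** 3            # threshold > 0, keep local maxima
```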
3.3.2 Scale Selection in Space-Time

The scales of the interest points are determined at the maxima of the scale-normalized Laplacian defined by

\nabla^2_{norm} L = σ^2 τ^{1/2} (L_{xx} + L_{yy}) + σ τ^{3/2} L_{tt}    (3.5)

Hence the interest points are the points that attain a maximum of the normalized Laplacian of Equation 3.5 while maximizing the Harris corner function H of Equation 3.4. The process for finding such points is given in Algorithm 2: corner points are found in the spatio-temporal scale space using Equation 3.4, and these points are then tracked in scale space toward the maxima of ∇²_norm L.

Algorithm 2: Detection of scale-adapted spatio-temporal interest points
Require: video f, spatial scales σ_l² = σ_{l,1}², ..., σ_{l,n}² and temporal scales τ_l² = τ_{l,1}², ..., τ_{l,n}²
1: Construct the scale space over the initial spatial and temporal scales and the corresponding integration scales.
2: Find interest points p_j = (x_j, y_j, t_j, σ_{l,j}², τ_{l,j}²), j = 1, ..., N, over the scale space using the maxima of Equation 3.4.
3: for each interest point j = 1 to N do
4:   Compute ∇²_norm L using Equation 3.5 at the present scale (σ_{l,j}², τ_{l,j}²) and at the scales below and above it.
5:   Choose the scale which maximizes ∇²_norm L.
6:   if the chosen scale differs from the present scale then
7:     Move the interest point to p̃_j = (x̃_j, ỹ_j, t̃_j, σ̃_{l,j}², τ̃_{l,j}²), where (σ̃², τ̃²) are the adapted scales and (x̃_j, ỹ_j, t̃_j) is the new position closest to (x_j, y_j, t_j); set p_j = p̃_j and go to step 4.
8:   end if
9: end for

Example 3.3. We applied the STIP algorithm to a sample walking sequence. As shown in Figure 3.6, the algorithm detects points around the head, hands and legs, where there is motion between frames, along with some points near them. In addition, other background points with no motion are detected; this is due to intensity variation between frames.

Figure 3.6: STIP example: frames with interest points marked.

There are cases where spatial corners are detected as spatio-temporal interest points even though there is no motion around the point. This can be attributed to intensity variations between the frames: a spatial interest point with high variation in both spatial dimensions can acquire a high temporal gradient from brightness variation or noise, rather than from any motion or event of interest, and thus become a spatio-temporal interest point. Points surrounding a corner are also detected, due to the Gaussian blurring at different scales.

3.4 ESURF

Speeded Up Robust Features (SURF) [11] was extended to detect spatio-temporal interest points by Willems and Tuytelaars in [3]. It uses the 3D Hessian as the saliency measure, and the box filter and integral image concepts of SURF are extended to an integral video to improve the execution performance.

3.4.1 Hessian based Interest Point Detection and Scale Selection

Spatio-temporal interest points are detected using the 3D Hessian matrix

H(\cdot; σ^2, τ^2) = \begin{bmatrix} L_{xx} & L_{xy} & L_{xt} \\ L_{yx} & L_{yy} & L_{yt} \\ L_{tx} & L_{ty} & L_{tt} \end{bmatrix}

If det(H) > 0, the point is treated as an interest point. Unlike in the 2D case, this does not imply that all eigenvalues are positive, so saddle points are detected in addition to blobs.

Scale Selection. We use γ-normalization to find the correct scales σ₀ and τ₀ at the center of a Gaussian blob. At the center of the blob only the first term of the determinant is non-zero and the remaining terms vanish; thus

L^{γnorm}_{xx} \, L^{γnorm}_{yy} \, L^{γnorm}_{tt} = σ^5 τ^{5/2} \, L_{xx} L_{yy} L_{tt}    (3.6)

Scale selection using Equation 3.6 requires optimizing the two parameters σ and τ; an iterative method is used to find the scales at which det(H) attains its maximum over the scales.
For a video of MxNxT an integral video at location (x,y,t) is equal to sum of all pixel values over rectangular region spanned by (0,0) , (x,y) and summed over all frames [0,t].For a video V, integral video IV is defined as Equation 3.7. IV (x, y, t) = y X t X x X V (i,j, k) (3.7) i=1 j=1 k=1 ESURF uses determinant of 3D Hessian matrix as saliency measure, which have second order derivatives in spatio temporal domain Dxx ,Dyy ,Dtt ,Dxt ,Dxy ,Dyt . They are computed by rotated versions of box filters shown in Figure 3.7(a) and Figure 3.7(b) with 8 additions by using integral video. (a) (b) Figure 3.7: The two types of box filter approximations for the 2 + 1D second order partial derivatives in one direction and in two directions [3] We build a pyramid of 3 octaves each octave consisting of 5 scales and ratio of scaling between consecutive scales is 1.2 . This task can be parallelized due to box filter and integral video each scale is independent of other. Determinant Hessian matrix is computed over each scale in temporal and spatial domain. Combination of spatial and temporal octaves oσ andoτ results in pair of scales (σi , τi ) gives a cube structure with Hessian determinants. once cube is filled apply non maximum suppression to obtain extrema in five spatio temporal scale space(x, y, t, σ, τ ). The points at which extrema occurred is the location of interest point and scale of interest point is the scale at which it is detected . Example 3.4. We applied ESURF algorithm on a sample of walking sequence. As shown in Figure 3.8 algorithm detects points around head and jacket where there is motion between frames, along with some points in the background. ESURF detects blobs which are both in motion and are in background which gives redundant information for some of applications.This algorithm execution time is less Chapter 3. Video Interest Point Detectors 31 Figure 3.8: ESURF Example: Frames with interest points marked due to the integral video concept and approximation of derivatives with box filters . since each octave is independent of others it can be paralleled in execution which is not possible for all other detector which work in hierarchial way. 3.5 ELIFT Local Invariant Feature Tracks (LIFT) [12] was proposed by Mezaris,Ioannis and Dimou. Feature tracks are the set of visually similar interest points in video having temporal continuity.So, tracks are set point which are visually similar and similarly moving points. We extended this notion of tracks to get interest points in video of fixed camera by considering magnitude and direction of motion. 3.5.1 Feature Track Extraction Let S be video of T frames, S = {It }Tt=1 . We apply SIFT [6] on each frame It to get feature points and feature descriptors Φ = {φm }M m=1 where M is number of detected Chapter 3. Video Interest Point Detectors 32 features and the descriptor φm is defined as φm = [φxm , φym , φdm ]. φxm , φym represents the location of the interest point and φdm describes the interest point with 128 bin vector. The interest point φm from It can have temporal correspondence with the point in It−1 . It is found by a local search in the neighbors of interest point by taking a square patch of dimension 2δ + 1. A point φn ∈ Φt−1 is said to be tracked in It as φm if it satisfy Equation 3.8 |φxm − φxn | ≤ δ, |φym − φyn | ≤ δ,  d φdm , φdn ≤ dmin (3.8) δ = 7 and dmin = 0.6 are the values taken for evaluation.If there are more than one point being matched to current point, we consider the one with having less distance. 
When such an interest point $\phi_n$ exists, the interest point $\phi_m \in \Phi_t$ is appended to the track containing $\phi_n$; otherwise $\phi_m$ becomes the first element of a new feature track. A set of feature tracks $\Psi = \{\psi_k\}_{k=1}^{K}$ is formed by finding the temporal correspondences of all interest points in all frames of S. A feature track $\psi_k$ is defined as $\psi_k = [\psi_k^x, \psi_k^y, \psi_k^d]$, where $\psi_k^d$ is the average descriptor of the track, obtained by element-wise averaging of the descriptors of its elements, and $\psi_k^x = [\psi_k^{x,t_{k1}}, \psi_k^{x,t_{k1}+1}, \ldots, \psi_k^{x,t_{k2}}]$ with $t_{k1} < t_{k2}$; $\psi_k^y$ is defined analogously. The trajectory of the feature is $\xi_k = [\psi_k^x, \psi_k^y]$. A trajectory is considered valid only if its length, $t_{k2} - t_{k1}$, is more than 5; otherwise it is discarded.

3.5.2 Interest Points on Feature Tracks

Interest points are the points on a feature track with special properties, such as points where the track changes its direction or motion. Given a feature track $\psi_k$ with trajectory $\xi_k$, Algorithm 3 shows the steps followed to detect interest points on the track. Only tracks with average displacement greater than a threshold are considered for detection, because stationary background points have no motion and are not truly spatio-temporal interest points, as shown in Figure 3.9(a). Points with non-uniform displacement become local maxima or minima along the track and are taken as interest points. Points where a track changes direction, or deviates from its normal path by more than a threshold, are also taken as interest points, as shown in Figure 3.9(b).

Figure 3.9: (a) Track with no motion and (b) track with motion and deviation in path

Algorithm 3 Detection of Interest Points on Feature Tracks
Require: Feature tracks $\Psi$
Ensure: $\Psi \neq \emptyset$
1: Interest points $P = \emptyset$
2: for each feature track $\psi_k$, $k = 1$ to $K$ do
3:   Calculate the displacements $x_d$ and $y_d$ in the X and Y directions
4:   Calculate the average displacements $x_{davg}$ and $y_{davg}$
5:   if $x_{davg} > 1$ and $y_{davg} > 1$ then
6:     Calculate the direction of displacement $d = \tan^{-1}(y_d / x_d)$
7:     for every point $p_i \in \psi_k$ do
8:       if $p_i$ is a local maximum or minimum in x or y then
9:         $P = P \cup \{p_i\}$; break
10:      end if
11:      if direction $d_i$ is opposite to $d_{i-1}$ then
12:        $P = P \cup \{p_i\}$; break
13:      end if
14:      if $|d_i - d_{i-1}| > 40$ then
15:        $P = P \cup \{p_i\}$
16:      end if
17:    end for
18:  end if
19: end for
20: return $P$

Example 3.5. We applied the ELIFT algorithm to a sample walking sequence. As shown in Figure 3.10(b), the algorithm detects points around the head and jacket, where there is motion between frames, along with some points in the background due to round-off errors. The tracked points are shown in Figure 3.10(a). There are 1782 feature tracks with 49577 tracked points, which were reduced to 2046 interest points.

Figure 3.10: (a) The tracked features (LIFT) and (b) the modified interest points (ELIFT)

The main advantage of this algorithm is that it detects points with motion without computing optical flow, which is computationally expensive. Some background points still persist. This is caused by features jumping between tracks when the spatial distance between them is very small: they share almost the same 16×16 patch used for descriptor computation in SIFT, so the distance between their descriptors is very small.
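Returning to Algorithm 3, a rough Python rendering of its per-track test is given below; the wrap-around handling of the direction difference and the treatment of track endpoints are our assumptions, since the listing leaves them implicit:

```python
import numpy as np

def track_interest_points(xs, ys, min_avg_disp=1.0, angle_thresh=40.0):
    """Indices of interest points on one feature track, following Algorithm 3.

    xs, ys: per-frame coordinates of one valid track (length > 5).
    """
    xs, ys = np.asarray(xs, float), np.asarray(ys, float)
    dx, dy = np.diff(xs), np.diff(ys)
    # Step 5: discard near-static tracks (stationary background points)
    if np.mean(np.abs(dx)) <= min_avg_disp or np.mean(np.abs(dy)) <= min_avg_disp:
        return []
    d = np.degrees(np.arctan2(dy, dx))  # step 6: direction of each displacement
    points = []
    for i in range(1, len(xs) - 1):
        # Local maximum or minimum of the trajectory in x or y (slope sign change)
        if ((xs[i] - xs[i-1]) * (xs[i+1] - xs[i]) < 0 or
                (ys[i] - ys[i-1]) * (ys[i+1] - ys[i]) < 0):
            points.append(i)
            break  # Algorithm 3 stops at the first such point
        # Direction change; a reversal (~180 degrees) also exceeds the threshold
        turn = abs((d[i] - d[i-1] + 180.0) % 360.0 - 180.0)
        if turn > angle_thresh:
            points.append(i)
    return points
```

In the full detector this test would run once per track, with P taken as the union of the returned points over all tracks.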
This track-jumping problem can be addressed by tightening the descriptor-distance threshold $d_{min}$ in Equation 3.8, but genuine interest points are then filtered out as well, so a trade-off is required.

3.6 Evaluation

3.6.1 Evaluation Criteria

We consider repeatability and execution time, as in [14], for measuring performance.

Repeatability: This determines what percentage of spatio-temporal interest points is retained in spite of geometric transformations and photometric deformations. The higher this rate, the more robust the detector is to the transformations. It is defined as the ratio of the number of matched points to the mean number of interest points detected in the two videos, as given by Equation 3.9:

$$r_{1,2} = \frac{c(V_1, V_2)}{\mathrm{mean}(n_1, n_2)} \qquad (3.9)$$

where $c(V_1, V_2)$ is the number of matched pairs of points and $n_1$ and $n_2$ are the numbers of interest points detected in $V_1$ and $V_2$ respectively. A computational sketch of this measure is given at the end of this section.

Execution Time: This is the time taken by a detector to detect the interest points in the video; it measures the computational complexity of the detector. We also measure the number of points detected and determine whether they are sparse or dense.

3.6.2 Experimental Results

All programs were executed on an Intel Core 2 Duo computer with 16 GB RAM running the Windows 7 Professional operating system. Table 3.1 shows the execution times and the numbers of interest points detected. A 1000-frame video at 1280×720 (YouTube 720p) with a frame rate of 25 fps is used for the evaluation.

Brightness change: The video is subjected to a uniform change in intensity, $I_{new} = I_{frame} + d$, where $d$ is a constant varying from −40 to 40, with the image intensity limited to [0, 255]. The corresponding performance of the detectors is shown in Figure 3.11. n-SIFT, ESURF and STIP performed better than MoSIFT and ELIFT. ELIFT and MoSIFT detect fewer interest points, and saturation of the intensity at the lower or upper limit may remove some of those points, so they are sensitive to brightness change.

Figure 3.11: Brightness change vs. repeatability

Scale change: The videos and their spatially scaled versions, from 50% to 200%, are used to evaluate the robustness of the detectors to scaling. The result is shown in Figure 3.12. The performance of all the algorithms is quite poor under down-scaling, while up-scaling brings some improvement. This does not augur well for matching or aligning videos of different scales.

Figure 3.12: Scale change vs. repeatability

Compression: The video is compressed in MPEG format over a quality scale of [1, 50] and the results are presented in Figure 3.13. Since the compression is lossy, it changes the pixel values and hence the gradients, so every detector is expected to show a decrease in repeatability as the compression increases.

Figure 3.13: Compression change vs. repeatability
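For completeness, repeatability as defined in Equation 3.9 can be computed along the following lines; the nearest-neighbour matching with tolerance eps is our assumption, as the point-matching rule is not spelled out here:

```python
import numpy as np

def repeatability(points1, points2, eps=3.0):
    """Equation 3.9: matched pairs over the mean number of detections.

    points1, points2: arrays of shape (n, 3) holding (x, y, t) locations
    detected in V1 and V2, with V2's points already mapped back through
    the known transformation. A pair matches if it lies within eps.
    """
    p1 = np.asarray(points1, float)
    p2 = np.asarray(points2, float)
    used = np.zeros(len(p2), dtype=bool)
    matches = 0
    for p in p1:
        if not len(p2):
            break
        dists = np.linalg.norm(p2 - p, axis=1)
        dists[used] = np.inf   # enforce one-to-one matching
        j = int(np.argmin(dists))
        if dists[j] <= eps:
            used[j] = True
            matches += 1
    return matches / np.mean([len(p1), len(p2)])
```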
Under compression, n-SIFT and STIP show better repeatability, whereas ESURF, whose box filters and integral video add further losses of information, shows low repeatability.

Execution Time: A comparison of the execution times and the numbers of interest points detected is given in Table 3.1. ESURF has the best execution time, which derives from its use of the integral video. MoSIFT and n-SIFT require more execution time. MoSIFT and ELIFT detect fewer interest points, with fewer background points. n-SIFT and ESURF detect a large number of points, which are dense. n-SIFT, STIP and ESURF detect background points along with points in motion.

Detector   Time of execution (sec)   No. of points detected
n-SIFT     1404.954570               31288
STIP       1127.339653               130261
ESURF      113.326833                261960
MoSIFT     1829.528020               72723
ELIFT      737.707349                12046

Table 3.1: Execution time comparison

Chapter 4

Conclusions and Future Scope

4.1 Conclusion

This thesis mainly dealt with spatio-temporal interest point detectors for video and their stability under geometric and photometric deformations. The SUSAN detector, the Harris corner detector, the Harris-Laplace detector and SIFT have been studied as spatial interest point detectors and their performance has been evaluated. SIFT exhibits better repeatability than the other detectors under image deformations, and it is a more complete package, with robust interest point detection and description along with suitable matching techniques. The spatio-temporal extensions of these image interest point detectors have been studied and their performance evaluated in terms of repeatability and execution time. n-SIFT has better repeatability than the others under video deformations. The proposed ELIFT detects the fewest interest points and has few points without motion. MoSIFT detects the second fewest points, while ESURF detects the maximum number of interest points. ESURF has the lowest execution time, followed by ELIFT. While ESURF has problems detecting spatio-temporal interest points with motion, n-SIFT detects spatial interest points both with and without motion.

4.2 Future Scope

The probable extensions of the work done in this thesis are the following:

• Further study is required to analyze stability under temporal scaling, i.e. different frame rates.
• The performance of the algorithms under non-uniform brightness variation needs to be evaluated.
• Detector performance, together with the corresponding descriptors, needs to be analyzed for different applications such as video classification, video alignment and action recognition.
• ELIFT needs to be extended to moving-camera video applications.
• Camera motion estimation and evaluation of the video descriptors on sequences with a moving camera, along with alignment of such sequences, present another challenging research topic.

Bibliography

[1] W. Cheung and G. Hamarneh. N-SIFT: N-dimensional scale invariant feature transform for matching medical images. In Biomedical Imaging: From Nano to Macro, 2007 (ISBI 2007), 4th IEEE International Symposium on, pages 720–723, April 2007. doi: 10.1109/ISBI.2007.356953.

[2] Ming-yu Chen and Alexander Hauptmann. MoSIFT: Recognizing human actions in surveillance videos. 2009.

[3] Geert Willems, Tinne Tuytelaars, and Luc Van Gool. An efficient dense and scale-invariant spatio-temporal interest point detector. In Computer Vision – ECCV 2008, pages 650–663. Springer, 2008.

[4] Tinne Tuytelaars and Krystian Mikolajczyk. Local invariant feature detectors: a survey. Foundations and Trends in Computer Graphics and Vision, 3(3):177–280, 2008.
[5] Chris Harris and Mike Stephens. A combined corner and edge detector. In Alvey Vision Conference, volume 15, page 50, Manchester, UK, 1988.

[6] David G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004. ISSN 0920-5691. doi: 10.1023/B:VISI.0000029664.99615.94.

[7] Stephen M. Smith and J. Michael Brady. SUSAN: a new approach to low level image processing. International Journal of Computer Vision, 23(1):45–78, 1997. ISSN 0920-5691. doi: 10.1023/A:1007963824710.

[8] International Journal of Computer Vision, 60(1), 2004. ISSN 0920-5691.

[9] Tony Lindeberg. Linear spatio-temporal scale-space. In Scale-Space Theory in Computer Vision, pages 113–127. Springer, 1997.

[10] Ivan Laptev. On space-time interest points. International Journal of Computer Vision, 64(2-3):107–123, 2005.

[11] Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. SURF: Speeded up robust features. In Ales Leonardis, Horst Bischof, and Axel Pinz, editors, Computer Vision – ECCV 2006, volume 3951 of Lecture Notes in Computer Science, pages 404–417. Springer Berlin Heidelberg, 2006. ISBN 978-3-540-33832-1. doi: 10.1007/11744023_32.

[12] Vasileios Mezaris, Anastasios Dimou, and Ioannis Kompatsiaris. Local invariant feature tracks for high-level video feature extraction. In Analysis, Retrieval and Delivery of Multimedia Content, pages 165–180. Springer, 2013.

[13] University of Southern California. SIPI image database. URL http://sipi.usc.edu/database/database.php.

[14] R. Hendaoui, M. Abdellaoui, and A. Douik. Synthesis of spatio-temporal interest point detectors: Harris 3D, MoSIFT and SURF-MHI. In Advanced Technologies for Signal and Image Processing (ATSIP), 2014 1st International Conference on, pages 89–94. IEEE, 2014.

[15] Zhijie Zhang, Chongxiao Cao, Ruijie Zhang, and Jianhua Zou. Video copy detection based on speeded up robust features and locality sensitive hashing. In Automation and Logistics (ICAL), 2010 IEEE International Conference on, pages 13–18. IEEE, 2010.

[16] Lin-na Li and Nan Geng. Algorithm for sequence image automatic mosaic based on SIFT feature. In Information Engineering (ICIE), 2010 WASE International Conference on, volume 1, pages 203–206. IEEE, 2010.

[17] J. Barron and N. Thacker. Tutorial: Computing 2D and 3D optical flow. 2005.