REAL-TIME IMAGE REGISTRATION AND ITS APPLICATION IN MOTION-VISUAL HYBRID CONTROLLER















# Real-time Image Registration and Its Application in Motion-Visual Hybrid Controller

© Jun Zheng A thesis submitted to the School of Graduate Studies

in partial fulfillment of the

requirements for the degree of

Master of Science

Department of Computer Science

Memorial University of Newfoundland

September 2010

St. John's

Newfoundland

## Abstract

Mation based controllers, such as the Wi Remote, provide users with brand-new gaming experiences and are becoming more popular. However, the accuracy of the motion sensor limits their suspers in precision critical games. Instead of the motion based approach, image processing techniques could be used to provide higher accuracy due to their high pictor resolution.

The goal of this thesis is therefore to propose a highly accurate controller that utilityvisual lapper. User can control carries it 20 screen by swring the controller toward any place which has textures: The therein for proposes an image registration algorith much institution of the propheck hardware, then uses it to build a highly screenre visual hand controller through cancer. Real tracking, and finally forther improves the robotions of the controller ander fast motion by utilizing both motion and visual information.

Radiation images registrations in address of by implementing the invest-Compositions adjustmin in spatial least Compare United Decode Architecture (CUMA). A number of CUDA optimization suchasings have been supplied and evaluation. The first optimized implementation advances 150 intense queue queue on the sequential planetmation, more analytication (are advanced and queue the sequential planetmation, more analytication (are advanced and queue) and an advanced are planetmation the courses of the processing advance in a dwance planetmation (are advanced in the same qualitation advanced and new multi-resolution values of the instance qualitation advanced and and provided and new multi-resolution values of the instance qualitation advanced and sequence of the same qualitation advanced and sequence Experiments conducted demonstrate that, using the proposed real-time image registration algorithm, the visual based controller achieves much higher control accuracy than the motion-sensor based approach. The performance under fast motion can be further improved through using the liqued from the motion sensor as a priori knowledge to assist the image registration process.

## Acknowledgements

Foremost, I would like to thank my supervisor sincerely, Dr. Minglun Gong, for his continuous support, guidance and insightful advices.

I while the the School of Capitane Studies of Menorial Universe, who provides the finding for ny research. This then would not be possible without the help of the facely, students, would aff of the Spatement of Capange Facel Technology Technology (Section 1996). The Spatial School School Technology and Yangsana Wu who support me spirality throughout my life and my ann and her handard, Yanghana Wu and Yi Ding also provide finicity support for my foring sits to the clara competitive free on my research.

## Table of Contents

| Abstract                                                                         |      |
|----------------------------------------------------------------------------------|------|
| Acknowledgements                                                                 | iv   |
| List of Tables                                                                   | viii |
| List of Figures                                                                  | ix   |
| Chapter 1 Introduction                                                           | 1    |
| 1.1 The Evolution of Game Controller                                             | 1    |
| 1.2 Motivation                                                                   |      |
| 1.3 Organization of the Thesis                                                   | 4    |
| Chapter 2 Related Works                                                          |      |
| 2.1 Image Registration                                                           |      |
| 2.1.1 Feature-Based Approaches                                                   |      |
| 2.1.2 Intensity-based Approaches                                                 |      |
| 2.1.3 The Inverse Compositional Algorithm                                        | 8    |
| 2.1.4 Multi-Resolution Image Registration                                        | 9    |
| 2.1.5 Template Matching                                                          | 10   |
| 2.2 Motion Controllers and Motion Sensing                                        | 10   |
| 2.3 GPGPU Programming                                                            | 11   |
| 2.4 Optimization of CUDA programming                                             | 12   |
| Chapter 3 Image Registration using Inverse Compositional Algorithm and Its Paral | lel  |
| Implementations                                                                  |      |
| 3.1 The Inverse Compositional Algorithm                                          | 17   |
| 3.2 Sequential Inverse Compositional Augoritam                                   | 10   |
| 3.3 Implementing ICA for GPU Processing                                          | 22   |
| 3.4 Multi-resolution image Registration                                          | 25   |
| 3.5 Multi-resolution image Registration with Sciect Regions (Nucleo 3R)          | 26   |
| 7.5.1 Region Scienced to compare the complicate to the Conclusion                | 27   |
| 2.6.2.1 autimation for Comparison fine Reasing in the Input Invent               | 27   |
| 3.5.2 Locating the Conversionant Region in the input image                       | 28   |
| 2.4 Discussion                                                                   | 31   |
| 3.0 LPDC05909                                                                    |      |

| Charter 4 Optimization of the Im | age Registration Module on CUDA |  |
|----------------------------------|---------------------------------|--|
| 4.1 Introduction of CUDA         | 33                              |  |

| 4.2 implementation                                             |    |
|----------------------------------------------------------------|----|
| 4.3 Optimization Techniques                                    |    |
| 4.3.1 Balancing Workload Distribution                          |    |
| 4.3.2 Parallel Reduction                                       |    |
| 4.3.3 Sequential-Then-Parallel Processing                      |    |
| 4.3.4 Memory Access Pattern                                    |    |
| 4.4 Bottleneck Optimization                                    |    |
| 4.4.1 Local Array Buffered Approach                            |    |
| 4.4.2 Partial Reduction Approach                               |    |
| 4.4.3 Unrolled Register Buffered Approach                      |    |
| 4.5 Evaluation                                                 |    |
| 4.5.1 Evaluation of the General Optimization Techniques        |    |
| 4.5.2 Evaluation of the Optimization Approaches for Bottleneck |    |
| 4.6 Discussion                                                 |    |
| Chanter 5 Motion and Visual Controller                         | 51 |
| 5.1 Visual Based Controller                                    | 55 |
| 5.1.1 Threading image capturing and image registration         | 50 |
| 5.1.2 Mause control                                            | 61 |
| 5.1 Motion Based Controller                                    | 63 |
| 5.1.1 Introduction of Wilmote                                  | 63 |
| 5.1.2 Motion Estimation                                        | 61 |
| 5.1.3 Estimating Angle of Roll                                 |    |
| 5.1.4 Implementation                                           |    |
| 5.2 Hybrid Controller                                          | 65 |
| 5.2.1 Motion Based Pre-Knowledge                               |    |
| 5.2.2 Floating Window                                          | 63 |
| 5.2.3 Implementation                                           |    |
| 5.3 Accuracy Evaluation                                        |    |
| 5.4 Discussion                                                 |    |
| Charter 6 Other Applications and Conclusions                   | 75 |
| 6.1 Light-Gun for LCD                                          | 75 |
| 6.2 Real-Time Image Mosaic                                     |    |
| 6.3 Conclusions                                                |    |
| Chapter 7 References                                           |    |
|                                                                |    |
| Appendix A                                                     |    |
| A.1 Bank conflict                                              |    |
| A.2 Atomic Reduction Without Bank Conflicts on CUDA Device     |    |
| A.3 Optimized Parallel Reduction on CUDA Device                |    |

A.3 Optimized Parallel Reduction on CUDA Device A.4 Parallel Reduction for Vectors on CUDA Device

. 85



# List of Tables

## Table 4-1 Evaluation of bottleneck optimization approaches.



... 52

# List of Figures

| Figure 1-1 Prototype of the hybrid controller                                           | 4    |
|-----------------------------------------------------------------------------------------|------|
| Figure 3-1 GPU-based implementation of ICA algorithm                                    | .20  |
| Figure 3-2 Modules of the GPU-based ICA algorithm                                       | .21  |
| Figure 3-3 Multi-resolution image registration                                          | . 22 |
| Figure 3-4 Comparison between standard IR and MRIR using the same images                |      |
| Figure 3-5 Implementation of MRIR-SR                                                    | 30   |
| Figure 3-6 Comparison between different image registration approaches                   | . 31 |
| Figure 4-1 Hierarchy of CUDA threads                                                    | .34  |
| Figure 4-2 Structure of the image registration module                                   |      |
| Figure 4-3 Parallel reduction hierarchy built by the building block which is capable of | £    |
| reducing 512 elements                                                                   | 39   |
| Figure 4-4 Automatically scheduled strategy for aggregating kernels                     | . 40 |
| Figure 4-5 Sequential-then-parallel processing                                          | .41  |
| Figure 4-6 Local array buffered parallel reduction approach                             | .45  |
| Figure 4-7 Local array buffered approach                                                | . 46 |
| Figure 4-8 Partial reduction approach                                                   | . 47 |
| Figure 4-9 Unrolled register cached approach                                            | . 48 |
| Figure 4-10 Evaluation of the general optimization techniques                           | 50   |
| Figure 4-11 Performance of the bottleneck optimization approaches under different       |      |
| blocks per multiprocessor setting                                                       | . 54 |
| Figure 4-12 Performance comparison between the bottleneck optimization approache        | s 55 |
| Figure 5-1 Tracking focal movement by the registration result.                          | . 59 |
| Figure 5-2 Implementation of the visual based controller                                | . 60 |
| Figure 5-3 Wiimote motion sensing                                                       | . 62 |
| Figure 5-4 Motion estimation                                                            | . 63 |
| Figure 5-5 Block diagram of the motion based controller                                 | 64   |
| Figure 5-6 Motion based pre-knowledge                                                   | . 67 |
| Figure 5-7 Floating window                                                              | 69   |
| Figure 5-8 Implementation of the hybrid controller                                      | 70   |
| Figure 5-9 Procedure of the accuracy test                                               | . 71 |
| Figure 5-10 Testing result of the small motions                                         | . 72 |
| Figure 5-11 Testing result of the complex motions                                       | 73   |
| Figure 6-1 Real time image magain                                                       |      |

## Chapter 1 Introduction

Nonadays, his input methods for game console construition are net limited to game gash with puts hansaux. New types of game consults are consign to enhance caver' games console. Will, Winners has integrated accelerations and attached generospect to track motion. Uses non-ten do not strain the strain game console. Will, Winners has integrated accelerations and attached generospect to track motion. Uses non-ten do not strain the strain game console with the produced motion consolid wide agame. This control-forward motion generospec based approach lacks accesses, in this thenks, we explore an integrated system of visual hands approach and motion hand approach to motion larger for a strain that the vision discrime framework the track user limit, limit magnetized motion, and an emaintime of a combination of video consensestementary processes. Name date strand and profess with textures motions have not strain constraint of a strand.

#### 1.1 The Evolution of Game Controller

Since the investments of game pairs and analog stricks as the convertises for the first parameters consists, here, directions for each game results, were stress techniques are and neuro techniques are being simulated to enrich gaming emrine experiments. In 2000, Samy relaxed the Lys Top – a digital ensure down for the Flyshilton 2. For the first measurement of the transmission, such as games measurements are strength in constants, inside the performance of the annue and the proceeding orphility of Fly-om showed measurement of the context and the proceeding orphility of Fly-om showed measurements.

Three years later. Nintendo showed again her talent in creating brand-new gaming experience. As the main controller of her new console Wii, Wii remote controller (or Wilmote) was the most versatile input device in console history. Integrated with motion and infrared ray sensors. Wiimote is not only a conventional game pad but also a motion tracker and a light-gun. It brought out the prevalence of motion controlling games. Nevertheless. Wiimote is not perfect. While providing novelty experience for casual gamers, enthusiasts and professionals complain about the control accuracy. Until now, conventional eamerads are still the best choice for high competitive console video games The competitors of Nintendo are creating new tools to improve the accuracy. For more powerful consoles, new hybrid gaming control systems, Kinect for X-Box 360 and PS Move for Playstation 3 are on the way. Both introduce multiple input techniques, such as motion sensor and voice recognition. Though the hardware specifications of the two relative old consoles limit their performances, HD video games (abbreviate for High Definition video games, an unofficial term for video games with resolution larger than 1280 × 720) finally have the chance to get rid of the old conventional game pads and bring camers to the motion controlling world.

Both of the new products can take suntages of image processing techniques. A video camera is nalive integrated to Kinect for X-Box 560, whereas FX3 has an accessory camera device, called PS Eye, which can be added to PS Move system. Compared to motion sensor solution, image processing has higher accuracy up to pilot. Incurver, no both Kinect and PS Move systems, the video cameras are fixed near televisions and facing users. They track user gestures or the signals from the special devices, but won't be able to make use of the available motion information.

#### 1.2 Motivation

This thesis mades how to integrate sides cancers with motion controlleus to the bolt visual and motion inputs can be stilled for high precision ego-motion estimation. We are noticed by the following filters motion information granted by motion sensers in other write, which makes it annihish under smill motion due to the low signal to notice ratio. V(e), it provides rehout information under fram motion, Visual fundador, from camera, on the other hand, it as a directorate for estimating the questionic fittenges registrating images captured in adjacent fitance. Nevertheless, adjacent fitances may differ a be under family motion and the strength of the site of the direct sense and grantegiate by the fit motions, causing image registrations to full due to matching antipacities. Value directors and a strength estimation of the other sense matter from dirt.

Combining visual impart with motion impat, therefore, series to be a logical choice for providing robust and highly accurate motion information. To achieve this goal, we build an protopy controller using existing hardware: a Will motion controller, a high performance video camera, and a PC with programmable graphics card. We also implement image neglitation algorithm one graphics hardware to achieve reading the protopy of the structure of



#### 1.3 Organization of the Thesis

The remaining of this thesis is organized as the following: The next chapter discusses previous work related to this thesis. This includes image registration approaches, motion controller and motion sensing, GPU programming and optimization of CUDA code.

Chapter 3 curves the image registration algorithms. First, we introduce the lowers: Compositional Algorithm (FAA). It is an efficient intensity-based image registration algorithm, whose parallel implementation in used as the building block of the proposed registration system. We then propose our techniques to enhance EAA using extra-to-first processing antrage. We also propose the Mahi Resultion Image Registration with Shorts Results in strengts the anticipation of strength Protocel Results. Chapter 3 documes the databilit implementation and the optimization on CUDA. The segmination of the whole system is free prevented and the optimization toxingione are then discussed. The starbings involved historings suggestion and a practical workshot, parallel reduction, segmetrid-theoremulty proceeding, and memory access patterns optimized for CUDA docisies. Note importance, we modify the bottlenes of the LAS and our vision buffinge methods in moless high-harrowy memory access. Tents show that the handworking antifering approach yields the bren result. Second: cognetions traveling and the admitted on the process the theory and disclosized of the traveling optimized are buffered on the structure of the disclosized of the traveling optimized areastical as the structure of the disclosized method areastical structure of admitted areastical associations.

In Chapter 5, we compare three types of controllers that use different information for pointing task. Using a video camera only, we can build a high accuracy visual based pointing controller. If we only have a motion sensor, an efficient motion based air-movae can be built, by analyzing the pre-knowledge provided by motion sensors, the motionvisual-based controller works will even during the street of the sensor.

The proposed real-time image registration system not only has application in building controllers, it can also be applied to other real-vision applications. In Chapter  $\hat{e}$ , we will discuss how to use it to correct light-gun and gomente real-time image measic. At last, the thesis is concluded.

5

## **Chapter 2 Related Works**

#### 2.1 Image Registration

Image registration is the process that transforms different sets of images into one coordinate system. It is widely used in computer vision, medical imaging, and so on. Different image registration algorithms can be classified into two categories - the featurebased aerosch and the intentivi-based aerosch.

### 2.1.1 Feature-Based Approaches

The four-based approaches cardiant the transformation between the two images been on the extension of adams measuress or features in image [1]. The features should be adding, studies and and adamshafe in both images. Vision feature detection techniques are proposed to extend features, from the previous Meerary's commendances [2] is the recent Lowe's Adule baselines Features Transform (MITQL). The distribution features are the solid to be manuface Tenteers Transform (MITQL). The distribution features are the solid and the manuface address, Fixeding and MITQL. The distribution features are the solid and the manuface address, Fixeding and MITQL. The distribution features are the solid and the manuface address, the solid and the MitQL feature. Readem Sangele Commons (RAMSSAC) to taking in other work.

The extracted futures not only can be work to register image, but does to entite samera locations. Devision proposed the real-time single-camera-based localization through futures/based samples [15]. Division discussed matual information for the active search to extract useful features [15] he high frame-sate application, the active search is expected to be now accurate because of the relative small starch region. It also requires to proceeding times. The feature-based approaches do not compare intensity values and work well when illumination changes or images are sensed by the different devices. On the other hand, they cannot handle scenes that have limited number of distinct features.

#### 2.1.2 Intensity-based Approaches

The Internity-based approaches due net detect futures the compare the Timmity Enternase of the shorth samples are performed for impact to sensitive methods without the complex are the Normalized Cases Constraints (NCC) [7] and Sequential Nimilarly Detection Algorithm (SSGA) [8] that asseptimility search the optimum to estimate induction between impact how searched and box comparational card algorithms are also proposed, such as the Lease-Kanda algorithm [9], which can estimate strend projective transformations. The intensity-based approaches due not require strend atomaters in image, However, between of the principle cald adultistics, box hull due comparational complexities; and, beams of the immunity comparison they are also analysis to Illustration days.

Instituty-based approach approach guarantity strengt series or equires by a topic data series after to the following two resources: First, many sources by one cannels are constants in order. Secondly, adjacent france capitred by a cannen at full henre refer often data share adoptant constrained from the image transmits in a kill done reference in the second second

.7

Conventionally, intensity-based approaches work on image parches to reduce computational cost. However, the proposed GPU accelerated intensity based approaches are very efficient, even for large images. Therefore, we use the entire images to perform registration rather than using image parcheses.

#### 2.1.3 The Inverse Compositional Algorithm

For the proposed pointing system, we need an image regimention algorithm that can immune a 3D projector modification. The Lacas-Kando algorithm is a potential choice. This algorithm spalaes the warp parameters by instability initiating the intentity difference between one image and another warped image. In each loop, and algorithm mode to re-values a Storeget Decoset Image SOD1 to direct the lead parameters spalaring and the to accolation at 16 storage SOD in direct the storage sto

To address this problem, Balar and Matthews [11] proposes the investor Compositional Algorithm (ICA). They refer the Lasan-Kanade algorithm as the forward compositional algorithm issue the target many sequences of the second composition and back to the target image. Both the SDI and the lession matrix as pre-calculated at where all warp parameters are equal to zona and reased during the intention, methy that as constant of each interaction. Experiment shows the pre-calculated SDI and Hessian matrix taken about 7% execution fine when the algorithm interacts 100 times. If we done Lasan-Konned algorithm which performs the re-evaluating in tanch interains, there would be a share the dones. Both the Lucan-Kanade algorithm and the ICA require images to have adequate overlapping areas. for a successful registration, For adjacent frames cathered at a full frame rate by a slowly waving camera, this requirement is generally net. However, when the speed of movement is too fast, both adactions may fail.

In this thesis, we choose ICA because it can be efficiently implemented using graphics hardware. We also prepose techniques for enhancing the robustness of the registration under fast camera movement.

#### 2.1.4 Multi-Resolution Image Registration

A multi-rendering markets strategy, which is discussed by Burt [11], is others used to enhance the intensity-based image regularitation. This connects for transity markets images on a course level and the humes the rene that for the levels. Wang and that [12] apply this strategy as SIMA method to speed ap regularization speed. Constitu and Laplacian pyramide are used in Kenar et al.<sup>5</sup>(11)] works to register aerial video sequences. Baker and Multews [11] point on the finant convergence of ICA can be exhibited to research the intervideo transmit.

Beakes the advantage of reducing comparisonal time, the granular can potentially increase the robotantess. The course images contain large-scale features that into its yield fact convergence and they also remove the dotable leatures that may fact the algorithm to local minimums. In this thesis, we will combine the registration system with the granular line mebaness. Because of the high efficiency of the GPU exploration, we getter the system the conducting regulates.

#### 2.1.5 Template Matching

Templane making saveshar a sub-image from another template image. It can be abvisuly and at handle samalaring registration by central the template from the image to be nonfinemed. The sub-general matching algorithms samily house the browlead problem by finding the entitistion diluteriation or maximum contribution, between the template and the all possible anabising algorithms image. Examples for the memoring experime the fust our of Absolute dilitionness (AGA) and all sum of Squared Differences (SDR) softare cars to implement. For heter redunstons, Normalized Cross-Correlation is to from well 154-156.

Image registration using template matching can be implemented efficiently but it can only handle translation. In contrast, we perform Inverse Compositional algorithm, which can handle all kinds of projections.

#### 2.2 Motion Controllers and Motion Sensing

Wii remote controller, or Wiimote, is the first wide-used motion controller for mainstream gaming consoles. With integrated accelerometers, it can measure accelerations ranged from 3g to + 3g along three perpendicular directions.

Because it is easily programmable, its potential usages are being investigated by many researchers. Wong et al. [17] use it is hold an interactive marke performance system by analyzing acceleration patterns. Schlomer et al. [18] employ it to perform gesture resonation. Choose 19] uses it to build a interactive tenchies and learning platform. In 2009, Nimendo releases an add-on to Wiimste, called as MotionPius, which can measures angular velocities with bull-in geneoopes. Although currently there are very few decements about MotionPius, the gyrencopes combined with accelerometers have adready been researched for motion tracking, such as orientation estimation discussed by Lunga (20) and motion cupture discussed by Subagechi [21].

Unlike the above research that use motion information only, we here how to combine the motion based feedbacks from the Wilmote with the visual based feedback from a video camera.

#### 2.3 GPGPU Programming

As the advancing of streaming processors, the graphics processing unit (GPU) is being widely used to implement general-purpose parallel applications. As matrianed by Harris [22], GPUs provide much more and faster growing computational power than CPU's. Existing GPU applications involving image processing [23-25], video decoding 126, 27], objects cimulation [29], successing 100 and outing 131 due has the greeteristics.

Previous GPUs are not specifically designed for general-purpose computations. Programmers have to use a graphics programming model, such as shading languages, which is designed for graphics medically. This programming model is no experimized for other staggs and lacks efficient local communication mechanisms, which is a common case in enterol-structure sensitions.

Now, more and more general-purpose products are released to make GPU coding more enjoyable. They provide convenient high level programming languages to implement

11

pandial approximations, Oscalip namag dories are work to secretaria bacia communitations. On these docisies, implementations are now minimal and efficients. The examples of the GPU programming models are the Compart Unified Docise Architecture (dorfs of CUDA) decigned for the CPUs numericanaes by the imparted circuit suppler – AVAIA and the come-GPU OperCompose, which is the graphics APAI the bless graphic card will support in hardware, would become the standard for PC parsing industry, it will be integrated with Microsoft Directed VI. It is also highly possible to become the anadle operaming model for the actions of Microsoft. It is very likely data due consults providen subdirecture atomsfor of Microsoft. It is very likely that due consults providen subdirecture initial products to take advantage the brogeneous eff.

CDA is selected in this decis for implementing the image registration algorithm in parallel. The C-like programming interface is intuitive for programmers to decign parallel equipricosine, CDA complement allow care compared C+-objector of their that can be easily linked to the standard C++ program. This future rankets integrating GPU accelerating programs way constantic, CDA also provides futures to enable nutrities debugging with https://wast.to.its.theory.got maintenance.

#### 2.4 Optimization of CUDA programming

Previous research has shown that how well the implementation is optimized for graphics hardware can grantly affects the performance of the algorithm. Several papers have discussed how to optimize (CDA code. In [22], Harris discuss step by step about how to optimize the parallel reduction by both algorithmic optimizations and coe optimizations. The final result is an 2004 times fatts as the original straightforward version. We modify

12

Hanis' work to perform the interformal summation is not CDA kentoh. Hanis, Samppia and Oware discuss how to optimize particle prior for sums in [33]. Their order the optimized names altabine properties (milling priorides) the leight the optimized names altabine properties (a toppedia A. L and pramite handson for version (12). Si van AGC and propers the tachington is optimized in the depine transmission of the properties of the tachington is optimized in the depine (12). Si van AGC and propers the tachington is optimized their strenge bis proteed to target processors which leads to high immedia distribution to acquire high company of the target processors which leads to high immedia threadparts. In our experiments, see find that high coopensys due to altabase to and high performance, expectably for the memory baca-shall.

Based on our parallel implementation of the ICA algorithm, we experimented with various optimization techniques and evaluated their effectiveness. In the end, we not only have an optimized implementation that accelerates the processing speed by up to six times, but also obtained valuable insights on how to optimize other image-related CUDA applications.

## Chapter 3 Image Registration using Inverse Compositional Algorithm and Its Parallel Implementations

The Inverse Compositional Algorithm (ICA) is an image registration tool to estimate the homographic transformation between two images. The algorithm runs too slow on current PCs for real-time applications, it also has huge parallelism which could be accelerated by a GPU.

In this chapter, we discuss a GPU-based image registration algorithm designed based on ICA. Moreover, we discuss how to improve the registration robustness using multiresolution processing and to increase the versuality using selected region registration.

Compared to the suspendial ICA implementation, the proposed GPU-based image registration (R) algorithm achieves a specialog of up to 150 mers. After integrating the R1 equipation with currence for presensing subarts, the resulting multi-resolution image registration (RORR) algorithm in captude of registrating images with multi-revolveging registration and thus is improves robustness. To furthe improve the provesting special adaptability, see used and against aftering the multi-revolveging registration and adaptability, see used and a low resolutions, we can strate appropriate adaptability, see used and apperform local equivalence on the instances. The images from the import images and apperform local equivalence on the images. The images from the import images and apperform local equivalence on the images relation engines (RORS/RA), such as use small regions in the level registration with solecid registration (RORS/RA), such as used and regions in the level registration and here the registration and the images relation images and the level resolution in the images. The addition, although the underfining ICA algorithm can handle homographic transformation only, MRIR-SR can be used to align planar regions in arbitrary scene, making it more versatile for motion estimation applications.

#### 3.1 The Inverse Compositional Algorithm

The proposed GPU-based implementation is based on the ICA algorithm. It estimates the transformation matrix between two input images by iteratively minimizing the intensity difference between them.

Any two images of a planer surface or two images of arbitrary scene taken at the same vice point are linked by homographic transformation. Given two images to be registreed, are refer one of the images as the tamplate image (T), and the second image as the imput image (T), the coordinate lines the methate image can be warped to the coordinate system of the initiatement but methates image can be warped to the coordinate system of the initiatement but methates.

$$\epsilon = W(i) = W \cdot i$$

where  $t = [t_i, t_i]^T$  is the homogeneous coordinates of the pixel in T and  $x = [x_i, y_i]^T$  in the homogeneous coordinates of the corropording pixel in L W is a  $3 \times 3$  matrix with B unknown parameters which can be used for modeling all in-plane projective transformations. In-

$$W = \begin{bmatrix} 1 + p_0 & p_1 & p_2 \\ p_3 & 1 + p_6 & p_5 \\ p_6 & p_7 & 1 \end{bmatrix}$$
(3.2)

The result of multiplying W - I must be normalized in order to obtain the Cartesian, i.e.,

$$x = \frac{(1 + p_0)i + p_1j + p_2}{p_0i + p_2j + 1} \text{ and } y = \frac{p_0i + (1 + p_0)j + p_0}{p_0i + p_2j + 1}$$
(3.3)

Here, we use a vector to refer the eight unknown parameters,  $p = (p_1, p_1, p_2, p_3, p_4, p_3, p_4, p_7)$ . Let W(t; p) denote the warp with the parameters in vector p. The ICA algorithm estimates the vector p through iteratively performing:

$$\Delta p = H^{-1} \sum_{i} \left[ \nabla T_{i} \frac{\partial W}{\partial p} \right]^{T} \left[ I(W(i; p)) - T(i) \right] \qquad (1.4)$$

and,

$$V \leftarrow \Delta W^{-1}W$$

where  $\Delta W$  is 3 × 3 matrix. It is built by replacing all the  $p_{ij}$  in Equation 3.2 with  $\Delta p_{ij}$ calculated in Equation 3.4. The algorithm converges when all elements in  $\Delta p_{ij}$  are equal to zero or almost zero in a practical implementation. The *H* above is called Hessian matrix, which is not 8.0 matrix calculated by.

$$H = \sum_{t} \left[ \nabla T_{t} \frac{\partial W}{\partial p} \right]^{T} \left[ \nabla T_{t} \frac{\partial W}{\partial p} \right]$$

 $\nabla T_1 \frac{\partial W}{\partial p}$  is called the steepest descent image (SDI) and it is evaluated where all the parameters for the warp are equal to zero. According to Equation 3.3,

$$\frac{\partial W}{\partial p} = \begin{bmatrix} i & j & 1 & 0 & 0 & 0 & -xi & -xj \\ 0 & 0 & 0 & i & j & 1 & -yi & -yj \end{bmatrix}$$

If we refer the gradient of the template image as  $\nabla T_i = [\nabla T_i \quad \nabla T_j]^T$  then,

$$\nabla T_i \frac{\partial W}{\partial p}$$

 $= \begin{bmatrix} \nabla T_i i & \nabla T_j j & \nabla T_i & \nabla T_j i & \nabla T_j & (-\nabla T_i i^2 - \nabla T_j i j) & (-\nabla T_i i j - \nabla T_j j^2) \end{bmatrix}$ 

Since the Hessian matrix and SDI are independent on the warp parameters p, they are constant across iterations. We can pre-evaluate them and reuse them during the iterative process.

Notice that, different from the convertisonal image registration scenaris which warps input image to template image, the ICA estimates the warp matrix that warps template image to input image to avoid re-evaluating the Hessian. In the remainder of this thesis, we keep this convention that all the illustrations are based on warping template image to imat.

#### 3.2 Sequential Inverse Compositional Algorithm

The sequential implementation of ICA is by simply following Algorithm 3-1. We arrange the procedure of the algorithm into 3 steps.

#### Preprocessing

This step (line 1 to 4) traverses the whole template image to pre-calculate the image gradient **VT**, eight SDIs, and the Hessian matrix. For a given registration task, this step only needs to be performed once.

## Local parameters accumulation

Using the current warp matrix, this step (line 5 to line 7) computes the intensity difference between the template image and the warped input image. The difference, scaled by the SDI, is then used to compute the update vector  $\Delta p_s$  of the local warp narameters.

#### Warp updating

In this step (line 8 to 10), we update the current warp matrix using the update vector Ap,. We then check whether the stopping condition is met. If it is, the current warp matrix sent to output and the process terminates. Otherwise the process gives back to the local parameter accumulation step.

#### 3.3 Implementing ICA for GPU Processing

GPU programming uses stream processing model, where all the computation are put into kernel functions, which are applied to the data stream.

Figure 3-1 illustrates the whole procedure of the proposed system.


To simplify code maintenance and maximize code reusability, we arrange the kernels into four modules illustrated by Figure 3-2.



Input image handler

The input image handler needs to perform two busic tasks. The first one is transferring images from system memory to graphics memory, whereas the second task is coverrying the transferred images to the layout optimized for GPU addressing. The converting task is paralleled by the conversion kerned which runs the same studes of strengts on timesar size with other down orthog covering.

Template image handler

Beiden the two built tasks the same as the input image handler, the template image handler is in charge of the preprocessing described in lines 1 to 4, **Algorithm 3-1**. The preprocessing is parallel by both the SDI calculation kernel, which computer SDI for each pixel, and the Heasian accumulation kernel, which computers the Heasian marks from the SDI.

Local parameters accumulator

• This mode conducts the local parameter accoundation. The local parameter accoundating learned is introduced to calculate the warped imaged difference can defecter accoundate the body parameters of the warp matrix with local parameters. If the algorithm converges, it output the updates the warp matrix with local parameters. If the algorithm converges, it output the updated warp matrix so the final result. Otherwise, the updated one is passed back to local parameters accumulate to continue the incretion.





Inteinively, the ICA algorithm works by comparing the itempts difference between the surged input image and the template image and updating the wavey premotents to minimize the total intensity difference. The presentem updatin ignited by image preform, which provides information about how each parameter affects intensity difference through the cight SDs. When the two images contain detailed returns and are initially poorly aligned, the ICA algorithm may fail to register the two images topolers include the game gradient may galaxy deparameter work dotted informats.

To excurne this product and improve the substance of the trigitation, we have graph the coarse-to fine proceeding scheme. As illustrated by Figure 3.3, we first build be donate symmith y hardwords binning and donated product in the industry images to low resultation images. Then the image regularization is preformed from the lower translation bared to its higher exclusion lower. The estimated any most is data to be its interfamed to the ordinate symmetry of the industry of the data its instructions of the ordinate symmetry and the industry of the the initial solution. For example, assuming the warping matrix however the two images fixed at a construction of the initial solutions  $W_{m_{\rm eff}}^{-1}$  at the fixer level is a calculated with:

$$W'_{n-1} = \begin{bmatrix} 1/k & 0 & 0 \\ 0 & 1/k & 0 \\ 0 & 0 & 1 \end{bmatrix}^{-1} W_n \begin{bmatrix} 1/k & 0 & 0 \\ 0 & 1/k & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

where k is the down-sampling ration between the finer level and the coarse level.

23

The down-sampled process removes the detail and the nisis which can potentially indicated bergination process. The lowest resolution images contain only the largecaled forators: that also boolster regaritorics over show the two images are poorly algoed. Using the samping matrix calculated at lower resolution to set the initial sampling parameters, the higher resolution regulations only used to take the switzing parameters hand on the detail information resolution.

Figure 3-4 shows the results using registering the same two images using standard IR and MRIR. While the standard approach fails, the MRIR approach still yields good result.



Standard image registration



Multi-resolution image registration

Figure 3-4 Comparison between standard IR and MRIR using the same images

3.5 Multi-resolution large Registration with Select Registra (MRIR, SR) blue imige the currence from scheme Registrations the relationst effect the registration, it also introduces additional comparation incre the registration, multiple resolution levels. To address that processing evels, have we proposed to use calcular argoins, instead of the whole mays, to protein registration and fast levels. For example, after registrating the two images at the sourcest level, where the resolution is  $n \times n$ , we and a smoot to the faster level with resolution of the Ack, where it is required as the source back hand and at any and a level As place is the registration is large as contentat arous the currence-fore earle, the compation inter an the word downardlike of the faster level.

Andret brefit of using selected region for registration is the issues: of truburstee for solutions: In many meta-ordia situations, only part of the scene: can be registreed by homographic transformation. For example, a scene may contain both planer and nonplaner objects, where only the planer adject perform can be registreed. We are scene contains both for away background and nearly foreground, where the foreground profor in affield by camera movement and home camers the registreed. Usdare these cases, schetzing the proper ngions af fore beel registration. Usdare these manys.

In the rest of this section, we will explain how to select the proper region for registration and how to adjust the transformation matrix when moving from one level to the next level.

25

#### 3.5.1 Region Selection

Assume two images are already registered at a conset level, where image resolution is  $n \times n$ . Before we perform registrations at the finer level, we want to select a region of size  $n_{f_k} \times n_{f_k}^{f_k}$ , which corresponds to an  $n \times n$  area at the finer level. The two criteria for selecting the region are:

- The region needs to contain sufficient detail to facilitate registration. That is, the gradient magnitude should be high within the region.
- The existing registration for the region should be successful. That is, the registration error within the region should be low.

To find the best region based on the above two criteria, we first calculate the gradient-toerror ratio for each pixel using the following equation:

$$R_{4} = \frac{\||\nabla T(\ell)\|}{\epsilon + |\ell(W(\ell)) - T(\ell)|}$$
(3.9)

where [VT] is the gradient magnitude of the coarse template image, W is the warp estimated by the coarse images, f(W(l)) indicates the intensity value in the coordinate system of image I with coordinate H(l), and f(W(l)) - T(l) is the different image between the coarse input image transformed by W and the coarse template image. x is a small matter to used within the rest.

After the per pixel gradient-to-error ratio is calculated, we aggregate the values within local  $n/_{b} \times n/_{b}$  windows using a box filter. That is:

$$\mathbf{A}(x, y) = \sum_{j=y}^{y+\frac{N}{n-1}} \frac{e^{iy}}{m} \mathbf{R}(i, j), x \text{ and } y \text{ are integer numbers form 1 to } N - n \qquad (3.10)$$

Because every element in aggregated ratio A represents the summation of a  ${}^{H}/_{L} \times {}^{H}/_{L}$ sub-region in  $\mathbf{R}$ , we can simply find the maximum element in A to locate the region that gives the highest overall gradient-a-ner ratio. For example, the top-left (p,q) of the deviced region in the free nemptate image is calculated usig:

$$(p,q) = k \times \arg \max_{(x,y)} |A(x,y)|$$
  
(3.11)

The coefficient k is used to convert the coordinate to the fine level.

## 3.5.2 Locating the Corresponding Region in the Input Image

The region selection process describe above is conducted in the image space of the template image. To find the corresponding region in the input image, we can simply transform the coordinate (a, b) to the input image using the warping matrix obtain at the coverse level. That it:

$$(a', b') = W'(a, b)$$
 (3.17)

The two stor of constitutes (is, is) and (is, is) are then used to extract two regimes regiments as the rest level from the input image and the samplast image, respectively. The additional super match is thereas the extract and rest is incompated by the weight, match the previously calculated using the current level. Assuming the origins of the constitute system of the images are to hybrid current and so keep this consention for the rest of the rhois. W<sup>2</sup> on the calculated by:

$$W' = \begin{pmatrix} 1/k & 0 \\ 0 & 1/k & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{bmatrix} 1 & 0 & p' \\ 0 & 1/k & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{pmatrix} 1 & 0 & p' \\ 0 & 1/k & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1/k & 0 & 0 \\ 0 & 1/k & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & p \\ 0 & 1 & q \\ 0 & 0 & 1 \end{bmatrix}$$
 (3.13)

where (p',q') is the position of the top-left corner of the extracted input image is the fine input image. To calculate (p',q'), we first locate the center of the selected input region in the coarse input image by,

|             | for Bud    |
|-------------|------------|
| 191         | UP + 21/K  |
| $ b  = W_c$ | (a + 1)(b) |
| 111         | V4 + 21/h  |
|             |            |

then, we compute (p', q') in the fine input image by:

$$p' = ka - \frac{n}{2}$$
  
 $q' = kb - \frac{n}{2}$ 
(3.15)

## 3.5.3 Implementation

The implementation of the XBHE/SE is illustrated by Tgener 3-2.6 Insolute the image registration component explained in Section 3.3, the system also contains of other from section description of the system and the system and the system samples. There registration explained is backwared to description of a well matrix, the arguingtion applicable of the down-samples thready target and there the applications near studies. A then, the accounding the surget and the study applications over studies, the data down-samples thready target and there the application strengt study. The data down application the study of the and finally the maximum nareching kernel locates the maximum point in the saggregation result. Alter mapping the bound points to the original image, two regions we extracted the the internet and strengths strengths are the study of the strength strengthstrength strengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstrengthstren again for registration at finer level. The output warp matrix at finer level is transformed back to the coordinate system of the original image as the final result.



Figure 3-6 illustrates the results using different image registration approaches to register two same images which have relatively small overlapped area. While both the standard IR and the MRIR fail, the MRIR-SR approach still outputs a good result.



#### 3.6 Discussion

In this Chapter, we discussed a parallel image registration approach and two of its variants. The straightforward IR approach implements the ICA algorithm in parallel using stream processing model. This allows us to take advantage of the processing power of modern programmable GPUs, which are much faster than the CPUs.

To improve the registration performance, we propose two multi-resolution variants. The MRIR approach helps to improve the robustness of the system through incorporating the coarse-to-fine scheme, but at the expense of additional comparational cost. The MRIR-SR approach uses selected region for fine level registration, and hence dramatically reduces the comparation out when the inject image resolution is high.

The two sprunches have their sen advantages under different securities. When their whole score can be registrated using homographic transformation and the processing processic sufficience, we recommond using the MRR approach, which can might no imputinguas accurately using the surging matrix educated to neutrate, the MRR AS growthevaluates the fault sampling matrix using local registro only, which could local to maingingeneral along the boundary that is for from the activated registro. However, MRR-SR has the advantage of higher presenting used and hole growthere in the matrix matrix and the distance transfers, local registro could be proceeding on the distance transfers, local registro counter built to local the distance transfers, local registro counter built to local the distance transfers, local registro counter built to local the distance transfers, local registro counter built to local the distance transfers, local registro counter built to local the distance transfers, local registro counter built to local the distance transfers, local registro counter built to local the distance transfers, local registro counter built to local the distance transfers, local registro counter built to local the distance transfers, local registro counter built local registro counter distance transfers, local registro counter built local registro counter distance transfers, local registro counter distance to local registro counter distance distance

# Chapter 4 Optimization of the Image Registration Module on CUDA

The provisor dupper discussed the parallel implementation of image majoritation and its too mendio-underlow strains. Because the first orace in the building tables for the two variants, see would like to optimize the polinizarian model: an (CDA). This locksche, the second polinica the image regrateration model: an (CDA). This locksche, the competitions, negleding parallel reduction to the CDU in a two work work to the CDU in the second polinic distribution of the competition of the CDU in the second polinic distribution of the competition of the CDU in t

### 4.1 Introduction of CUDA

The CUDA programming model is based on the hardware features of the CUDA architecture. CUDA devices can run thousands of threads in parallel and the threads are connized in a hierarchy. illustrated by Figure 4-1.



#### Figure 4-1 Hierarchy of CUDA threads

Threads are first proped into blocks and blocks are further proped into a pith. A parallel function, or a kared in turn of GPU programming, has only one pith. The blocks of the pith or distribution the cluster of the proceeding units called methodenesses at run time. The multiprocessors schedules and executes the allocated blocks. The threads in a block can use the fast on-chip shared memory to accelerate inter-threads communications.

Besides the shared memory, CUDA devices also provide other storage devices to speed up memory access or ease programming, such as texture memory for caching special random accessing, constant memory for accelerating global variable fetching, and local memory for holding temporary induced data.

The proposed optimization techniques are tuned for two series of CUDA devices. One (including G80 and G90) is the wide-used devices with Compute Capability 1.x (C1.x) architecture. Another (G100) is the new released devices based on Compute Capability 2.9 (C2.0) architecture. The C2.0 devices are designed more suphisticated for presentpurpose compating. They provide larger instruction throughput, faster memory accessing speed and more registers and shared memory per block. The device memory accessing is also accelerated by Lina IO 2 caches which are not provided in C1. devices.

#### 4.2 Implementation

To make full use of CUDA devices and reduce unnecessary high-latency CPU-GPU communications, we implemented most tasks on GPUs. The parallel tasks are assigned to 6 kernels labeled by Kernel 1 to Kernel 6. The rest sequential tasks are encapsulated by the CPU functions.

The functionalities of those kernels and functions and the relationships between them are shown by Figure 4.2.



#### 4.3 Optimization Techniques

#### 4.3.1 Balancing Workload Distribution

When humples a single foread, the performance of the CPU is much higher than that of an individual CPU processor. It is therefore beneficial to earlies simplement in this to the CPU and parallel takes CPU for maximum instruction throughput. However, if a sequential task appears in the middle of CPU executions, intermediate results will need to be transformed between CPU and CPU through graphics 100 hss. These TD transportations here high targes and damated transports the performance in this case, we should measure whether the singles thread upwel advantage of CPU conseights the high 100 scs. If the sequential tasks are not completed, we may run throu with weed thread (CPU as world CPU source)

For example, the tasks of the warp updater module are sequential-Hirsdly. They are either total executions or small matrix operations which have little parallelism. The CPU implementation tends to yield heter performance. However, the GPU version is twice as fast as the CPU version, since its removes the needs of frequently transporting the local agamenter from GPU to CPU under local explande updated warp multitudes locals (GPU explicit).

#### 4.3.2 Parallel Reduction

The ICA algorithm requires inter-thread summations (line 4 and line 7, Algorithm 3-1, Section 3.3.). Most current GPUs support atomic function for this kind of tasks. However, it is not the most efficient approach for large summation. Currently, the speed of atomic function is secural times slower than conduced operations. It will dramatically reduce the instruction throughout whom maxive used in parallel. Moreover, only current advanced graphic cards provide atomic operation writing to on-chip storage to each the intermediate result. If atomic operation is not supported, the results have to be written to off-chip graphics memory which has handreads of times longer latency. Therefore, we would like to find another method to seem of the tummation.

The parallel moderies [32] is an efficient choice for interdend automation. It uses the faintst calculated numeric operations. After leading inputs from parallels memory, in a parallel state of the state of the state of the state of the state numery disrupped, Marovers, parallel reduction only uses the standard features supported by attoost all CUDA-compatible graphics cards. There is its compatibility incost.

To make this out of the parallel relations, we should allocate a maximum number of dataska per block, based on the available resources the on-chap memory to exclude the maximum relation of the many determines as possible can be related in a single block as possible. Appendix A.3 Barmanne a possible relation be related in a single block as possible. Appendix A.3 Barmanne as possible relation by the complete based on the fully septimized varies presented in [23]. It is capable of relation and the fully septimized varies presented in [23]. It is capable of relation backware compare, capability 13. If the number of determines much to related its more than the resource available, we can assign multiple block is into a two level between the profession of the level as the physical block into a two level between profession fulls location and compute the isomendate results the application muscle and one-level muscle as value the bit three bittermingstreements between a substree bittermingstreement the intermediate results the transmission. Our test shows that the discussed parallel reduction is generally 40% faster than the atomic function. Even the atomic function used is the optimized version presented in Appendix A.2.



#### 4.3.3 Sequential-Then-Parallel Processing

A common strategy to parallel execution is subscring as many threads as meetingaccording to the topol data and hering GPUs to anomatically subscring their executions. This is an efficient auxiliants when the number of builds instead is not so to much larger than available processing units. Otherwise, GPU will subscribe a log execution queue, causing cases evolved of the complex context within the audition. It is strategible with the strategible programs which need as gargerate the outputs of different franks. Research the strategible queue gargerate the outputs of different franks. Research the strategible queue gargerate the outputs of different franks. Research the strategible queue is chosed as warres for the queue gargerate that the frank common excite intermediate results in choreal memory for the queueign directs. They have to be written to the graphics memory. Thus, the performance of those programs is highly depended on the eraphics memory handwidth which will limit the instruction throughput.



Figure 4-1 dimension the automatically quering cars: If a CPU is capable of norming burparallel threads and the pregime negatives 16 threads, 12 of them will be quered. They can be able to be able downstand of the datases of the threads the total free dimension. Here, we want is compare, intermediate result 16 times. Those outputs are graphics memory writing subsets interrupt is depended on the based-shift of the graphics memory, which is smallly handwork of their software that regimes or them remove.

We would like to apply another strategy to avoid thread scheduling and to enable inthread caching to make the kernel less depended on the graphic memory bandwidth. The valutaria hiere efforted as sequentide how-parteled processing. To apply the expensiontion parallel processing, the programs must be maximum samber of exists: Instanda and each head sequentity handles multiple task by boping. The H, for the task messaring in Figure 4.4, the programs must fair that hand and such thread heap for timos to program who shole data. Thus, the GPU' cally works to have the control of dimension of the program needs arguing/marginal, intermediate result, each of tested in softward and single and program needs arguing/marginal, intermediate result, each of the control of dimension programs of the program.



In practice, such as processing large images, applications usually require massive threads. The hybrid processing strategy will dramatically reduce the complexity of thread scheduline and makes agreenting kernels less decend on graphic memory bandwidth. We apply the hybrid processing to no tasks – the Heasian adulation (Heard 1 and Heard 1) and the head parameter accountainer (Kenrd 2 and Kenrd 6). (Kenrd 1 and Kenrd 1 are in those of reducing international translas. Kenrd 4 and Kenrd 6 hear aggregate the intermediate results. Meand path ten anismum atter the backs, can be infraesed for started 1 and Kenrd 1 are reducing and the path ten anismum atter the backs, can be infraesed for kenrd 1 and Kenrd 1 are reducing and ten anismum atter the backs can be infraesed for kenrd 1 and Kenrd 1 are reducing and ten anismum atter the back can be infraesed the tened 1 and Kenrd 1 are reducing and tened to the startest and the startest and the startest and tened to the startest and tened to the startest and tened to the demant.

To sum the output of Kernel 3, which is a chunk of 8 × 8 matrices, we use 64 threads in Kernel 4. Each thread is in charge of sequentially summing together the intermediate result for one element of the matrix.

For kernel 5, whose output is a chunk of vectors with 8 elements, we choose a more sophisic-and parallel reduction method. It uses idle threads of the standard parallel reduction to reduce multiple dimensions at the same time. Compared to adding vectors by looping standard parallel reduction fir each element, it requires much less adding operations. Appendix A shows its CUDA implementations.

The test shows that there is up to 43% performance gain after applying the hybrid processing.

#### 4.3.3.1Tuning Distribution of Blocks of Sequential-Then-Parallel Kernel

Tuning the sequential-then-parallel processing introduces two steps. The first step is to decide the maximum active blocks per multiprocessor (MAB). Because we use the maximum threads per block for reduction, the MAB is only decided by the shared memory and register usage. These two factors can be monitored by the compiler output. With the maximum thread per block, shared memory and register usage, we could use CUDA profiler, a load shapped with CUDA SDK, or reference to the specification of CUDA hardware to calculate the MAB.

The second step is to fully occupy each multiprocessor with the blocks in the number of MAB. Assuming there are total M multiprocessors in the current device, we can simply assign  $M \times MAB$  blocks because CUDA evenly dimitstates the blocks to each multiprocessor.

The experiment shows the peak of performance appears under the discussed workload distribution setting. Compared to less-occupied and over-occupied case, it is generally 10% factor.

### 4.3.4 Memory Access Pattern

To maximize the memory throughput of the CUDA device, we optimize the data structure and memory accessing pattern for the images, SDI, and Hessian matrix.

#### 4.3.4.1 Coalesced Memory Layout

Because the 32-bit memory layout is the fastest data structure for CUDA to perform coalescod access, we oreanize the 2D data in this way.

The data includes input image, template image and SDI. Each pixel or element of those data is represented by a single 33-bit float variable and all the pixels are stored in pitched row major order, i.e., rows are stored one after the other in the linear space. The reason why we use the redundant 32-bit variable is that it is the native length of the Arithmetic Logic Units and Float Point Units of CUDA devices, which yields the fastest arithmetic operation speed.

## 4.3.4.2 Texture Memory Caching

There could be a massive us-calcular accurate generation in the particle inverse compositional algorithm. It accesses the image with warped contracts. Nearone the appropriate is upgated and the state of the state of the state of the state of the two a contexed way. Therefore, we use the instant memory to cache image increasing. The speed up is highly depends on the practical condition. In general, it yields up to 3% hetter with.

#### 4.3.4.3 Constant Memory Broadcasting

The constant memory is ideal for broadcasting global constant for every thread became its content is cached and therefore finite than the un-cached device memory. In the proposed system we use the constant memory for broadcasting the warp matrix to every thread to perform the image transformation. It is 10.5% faster compared with proving ware mutrix by device memory.

#### 4.4 Bottleneck Optimization

If we monitor the execution of the parallel ICA implementation, Kernel 5 takes more than 90% execution times. It is obvious the bottleneck we would like to further optimize.

```
__hbred__flast s_g(1)://msptph hafter

//msmc1stg log

for (list s = 0; s = 1; s = 1;
```

The complex implementation of Kernel 5 is literatural in COLMard produceds in Appendix A.5. Here we extend the intervention directly related to the optimization and future them in Fagure 4.6. Solver the bulk lines of the Agardinth. The number of particle inductions model at 8 w H.7 we can accommand the buck way parameter with induced in real in the paradis-capital or a buffer and the parameter models in the accommandiz upon parameters and parallel relation B times. This will save a singlicitual amount of instruction cyclus. Hence, the key to optimize Kernel 5 to to find a work bulk fits.

#### 4.4.1 Local Array Buffered Approach

From the programming perspective, the most straightforward approach to buffer dp is by local array. It is illustrated by Finner 4-7.

```
___hard_____fast_____(())/rand arms for:

That dput() = ()/rand arms for thereing an annulation

//rand hard set = ()/rand arms ()/rand ()/ran
```

In this approach, every thread adds dp to the local array dbBuff[] and add together every elements of the array after the accumulating loop.

## 4.4.2 Partial Reduction Approach

Using local array to hulfin of p in set efficient. Stated in Station 5.222, 10,14 m bits due to many space realistic a doctore memory, and sense memory accesses have sense high latency and low handwidth as global sensory<sup>2</sup>. For each thread, there will let 3 v M global memory writing in accounting tops and 5 pacifield relations when relating the Hulferer sense. The each of these global memory, access may concern the sensing of rubations interactions and reachs in generative cases, and access the sense of rubations interactions and memory in performance drugs. Access the particular transformance drugs alter area (in particular). We would like in find fame buffeling operations. One possible solution is using almost memory but there is a limitation. In most current CUDM devices (CLA devices), the analytic short memory assume hold the de first all threads. For a block of 215 threads (the maximum threads per block to fully make use of the parallel reduction) and endthread need B that wateries for the local parameters vestor, the batch almost memory are per block in \$12 × 81 × 61 × 10588 bytes. It is the maximum valuable thread memory per block heating and the start of the maximum valuable thread memory per block. Backness since shared memory is pro-coupled for parameter paneling the special using is slightly smaller than \$1628 and thus not enough for the complete buffer-

buffering

shared\_\_ float s\_dp[8]; //ouptput buffer \_shared\_ float s\_sum[8][256]; //partial reduction buffer \_\_shared\_\_float s\_psum[256]; //reductin buffer for(int c = R: c < M: ++c)( for(int s = 0: s < 8: ++s)( float do = diff == 0 ? 0 : diff \* SDI(1, 1, 5); s nsum[threadIdx.x] = do: syncthreads(); if(threadIdx.x < 256) (s sum[s][threadIdx.x] += dp + s psum[threadIdx.x + 256]; syncthreads();) ... } //end of the accumulating loop //reduce the buffered result for(int 1 = 0; 1 < 8; ++1) float sum = reduce256(s sum[i]); if(threadIdx.x == 0)s dp[i] = sum;} Figure 4-8 Partial reduction approach

To address the problem, we reduce all the dp from 512 elements to 256 elements to decrease the shared memory usage. This approach is illustrated by Figure 4.8. The ore step reduction shows in bold is referred as partial reduction. It reduces the 512 elements to 256 elements. The total buffer size is decreased to 256 × 8 × 4 = 6KB which is much low than the available shared memory.

The test shows that this approach can accelerate the system up to 41%.

#### 4.4.3 Unrolled Register Buffered Approach

| sharedfloat_s_dp[8]; //ouptput_buffer<br>sharedfloat_sum[512]; //reductin_buffer |
|----------------------------------------------------------------------------------|
| float s0 +0, s1 +0, s2 = 0, s3 = 0, s4 = 0, s5 = 0, s6 = 0, s7 = 0;              |
| for(int $c = 0; c < M; **c)$ {                                                   |
| //unroll the accumulating loop                                                   |
| ((sdi) += diff == 0.0f ? 0.0f : diff* SDI(i, j, s);)                             |
| DELTA_P(s0, 0);DELTA_P(s1, 1);;DELTA_P(s7, 7);                                   |
| //reduce the buffered result                                                     |
| #define REDUCE_TO(sdi, 1) {\                                                     |
| <pre>s_psum[threadIdx.x] = (sdi);\syncthreads();\</pre>                          |
| <pre>(sdi) = reduce512(sum, (sdi));\</pre>                                       |
| if(threadIdx.x == 0){ s_op[1] = (sol);}_syncthreads();}                          |
| REDUCE_TO(s0, 0); REDUCE_TO(s1, 1);;REDUCE_TO(s7, 7);                            |
| Figure 4-9 Unrolled register cached approach                                     |

Another possible caching method is using registers which provides the fastest local accessing speed. To enforce buffering variables residing in registers, we unroll the accumulating loop with hard coding. Figure 4-9 illustrates this approach.

Variables 40 to 57 are used for buffering the local parameters. Macro DELTA, P(odl. 4) unrolls the loop. The CUDA compiler will use the register to reside the variables for the unrolled hard-coding, Macro REDUCE\_TO(udl. 4) is used for simplifying the coding for robating the buffered result.

Compared with the partial reduction approach, the unwilled register buffered approach requires no shared memory usage and removes the  $11 \times M$  partial reductions. More important, accessing the register buffer should be faster than accessing the shared memory buffer. This various is proved to be the most efficient approach by experiment which can accelerate the stratm on the 30%.

#### 4.5 Evaluation

The evaluation is to test out the effectiveness of the discussed optimization techniques. There are two sets of images to be evaluated – one with  $256 \times 256$  resolution and the other with  $512 \times 512$ . They simulate the low workload and the high work load conditions securetic.

Because the execution duration of the same kernel differs from times to times, we count, the average execution time to compare the performance difference, where the execution time is measured using the number of frames can be processed in one second (FPS).

# 4.5.1 Evaluation of the General Optimization Techniques

The algorithm independent optimization techniques including simulating sequential execution in GPU, parallel reduction and hybrid processing are evaluated in this section.

Figure 4-10 illustrates the performance of the different versions of the proposed implementation.

| GEFORE     | CE GTX 480, | Xeon E5440 | 2.53GHz CPL | J, 4GB Memor | ty .  |
|------------|-------------|------------|-------------|--------------|-------|
| Image size | 1           | 2          | 3           | 4            | 5     |
| 256*256    | 0.9         | 31.92      | 63.7        | 64.01        | 91.83 |
| 512 * 512  | 0.34        | 10.63      | 21.02       | 29.81        | 34.11 |

1: Sequential ICA

2: Parallel ICA using atomic functions and CPU warp matrix updater

3: Parallel ICA using atomic functions and GPU warp matrix updater

4: Parallel ICA using parallel reduction and GPU warp matrix updater





Version 1 is the sequential ICA. Both the high workload and the low workload results are less than 1 FPS and hence cannot meet the real-time requirement. From the comparison between Version 2 and version 3, we can clearly see that the GPU warp updater in much faster than the supported one, even though this warp updater models itself runs faster on CPU. This is because the updater is called repetitively where updating warp matrix, the 1O cost for transporting intermediates between CPU and GPU coversign the acceleration of CPU.

The comparison between version 3 and version 4 shows the power of parallel reduction. While the its advantage is not evident under low workload testing because it requires fewer instruction throughput, the parallel reduction yield 40% (8 FPS) speed up for high workload testing.

The performance is further speeded up after applying hybrid processing strategy to improve Version 4 to Version 5. The low workload testing gains 43% (27.81FPS) speed or while 14% (4.3FPS) for the high workload test.

#### 4.5.2 Evaluation of the Optimization Approaches for Bottleneck

The extantion of the splitituding approaches for botheneck starts at the previous isomolocal hybrid proceeding version as the original approach. Houses the memory access methods are different GUDA devices, we select two graphic earls as the directorizes. The Geb based Quadra YX4000 preposents the currently our C1 s architecture while the new released GID0-based GRI/BRCC GIX400 represents the more powerful C2 db advices. Notice that the davice memory accenting is called in C2 db advices bashes and them the devices memory accenting is called in C2 db advices bashes and them the devices memory accenting is called in C2 db advices bashes and them the devices memory accenting is called in C2 db advices bashes and them the devices memory accenting is called in C2 db advices bashes and them the devices memory accenting is called in C2 db advices bashes and them the devices memory accenting is called in C2 db advices bashes and them the devices memory accenting is called in C2 db advices bashes and them the devices memory accenting is called in C2 db advices bashes and them the devices memory accenting is called in C2 db advices bashes and them the devices memory accenting is called in C2 db advices bashes and them devices.

| Approaches        | Compiling                          | esults                  | Maximum active blocks per<br>multiprocessor<br>{ number (occupancy, limitation) } |             |  |
|-------------------|------------------------------------|-------------------------|-----------------------------------------------------------------------------------|-------------|--|
|                   | Shared memory<br>per block (bytes) | Registers<br>per thread | G90                                                                               | 6100        |  |
| Original          | 2212+16                            | 15                      | 2(100%)                                                                           | 3(100%)     |  |
| Array buffered    | 2212+16                            | 15                      | 2(100%)                                                                           | 3(100%)     |  |
| Partial reduction | 10304+16                           | 15                      | 1(50%, smem)                                                                      | 3(100%)     |  |
| Unrolled          | 2112+16                            | 22                      | 1(50%, reg)                                                                       | 2(67%, reg) |  |
| Unrolled(C2.0)    | 2080+0                             | 20                      | N/A                                                                               | 3(100%)     |  |

smem: shared memory, reg: registers

Occupancies

| Quadra FX480      | 0(C1.3, 24 mult<br>2              | iprocessors, 1<br>53GHz CPU, 4                                    | 5G8 384-bit G<br>G8 Memory     | PU memory), 1           | Keon ES        | 140   |  |
|-------------------|-----------------------------------|-------------------------------------------------------------------|--------------------------------|-------------------------|----------------|-------|--|
| Approach          | FPS un                            | FPS under different active blocks per<br>multiprocessor (256/512) |                                |                         | Maximum FPS    |       |  |
|                   | 1                                 | 2                                                                 | 3                              | 256                     |                | 12    |  |
| Original          | 36.48/12                          | 41.30/13                                                          | 43 39.22/1                     | 3.57 40.76              | 14.3           | 2     |  |
| Array buffered    | 36.70/14                          | 51 31.22/12                                                       | 53 28.10/1                     | 2.29 36.7               | 6.7 14.51      |       |  |
| Partial reduction | 47.84/18                          | 99 42.74/18                                                       | 123 39.04/1                    | 7.25 42.84              | .84 18.99      |       |  |
| Unrolled          | 59.82/26                          | 94 55.68/25                                                       | 37 50.33/2                     | 1.73 59.82              | 26.94          |       |  |
| GEFORCE GTX48     | 90(C2.0, 15 mul<br>bandwidth ), 1 | tiprocessors, 1<br>Seon E5440 2.5                                 | 5G8 384-bit 0<br>13GHz CPU, 40 | PU memory w<br>8 Memory | ith 177.4      | GB/s  |  |
| Approach          | TPS under d                       | IPS under different active blocks per multiprocessor<br>(256/512) |                                |                         | Maximum<br>FPS |       |  |
|                   | 1                                 | 2                                                                 | 3                              | 4                       | 256            | 512   |  |
| Original          | 57.07/17.44                       | 86.72/29.79                                                       | 91.82/34.11                    | 86.72/29.79             | 91.82          | 34.11 |  |
| Array buffered    | 77.80/28.54                       | 83.25/32.16                                                       | 85.46/32.43                    | 72.95/30.04             | 85.45          | 32.43 |  |
| Partial reduction | 76.86/28.55                       | 91.24/41.13                                                       | 93.14/48.12                    | 92.91/38.92             | 93.14          | 48.12 |  |
| Unrolled          | 90.88/37.56                       | 92.11/56.46                                                       | 92.14/61.41                    | 91.64/50.70             | 92.14          | 61.41 |  |
| Unrolled(C2.0)    | 90.97/37.75                       | 91.18/56.85                                                       | 94.51/61.62                    | 89.88/50.62             | 94.51          | 61.62 |  |

256: use 256 \* 256 images as inputs, 512; use 512\*512 images as input, underline: result under MAB, FPS, red/red: number per blocks over-saturated

Performance results

Table 4-1 Evaluation of bottleneck optimization approaches

Moreover, under C2.0 setting we found that the compiling result of the unrelied register buffered approach yields a higher MAB which could provide more parallelism. We list the result under this setting in a separate entry to text whether it helps to improve performance.

The occupancy (occupancy is defined as the ratio of the number of resident warps to the maximum number of resident warps[34] of the stream processor) and performance results of each approach are shown by Table 4-1.

# 4.5.2.1 Workload Distribution of Blocks

To test how the smaller of blocks per multiprecesson affects the porfermance, we record the results under different number stratings. The results show the same pattern, illustrated by Figure 4.11. Car dis the approaches, the portenance interastics first from the undersattrated blocks per multiprecessors to the maximum values and then decreases. The maximum values appear under the AMM number sating. This proves string the blocks



Rentral conferm









#### 4.5.2.2Performance Comparison

Compared to the original approach, all except the local array buffered approach gain faster speed. This approach frequently accesses the local array and therefore decreases

the entire performance. Even running on the C2.0 device which device memory accessing is cached, there is still a performance drop. It suggests that avoiding device memory access is still an important consideration even under this latest platform. The annular paper eached approach jubic the best recall under any confidion (or by they increase composed the origics) shallong their company of the annular regime cached approach is lower than other approaches under CLA devices (Occupantics, Table 44) Decome of the hancy usage of regimes, the gain of memory throughout approach the draw of interaction throughest and theories that is the imprevention of the performance. The result suggests that for memory handwidtheout applications, we are the increasion memory handwidth to interface throughout applications, we are the increasion memory handwidth to interface throughout applications, we are the increasion memory handwidth to interface throughout.

Particularly, compared to CLA compiling version, the C2.0 compiling version doesn't significantly improve the speed although the compiler reports higher occupancy under this setting. It is possible because the C2.0 device could optimize the CLA compiled kernel in metime.

#### 4.6 Discussion

This chapter discussed the techniques that optimize the parallel implementation of ICA on CUDA. They are capable of accelerating the parallel system form 10.63 FPS to 61.62 FPS, which accounts for 600% speed up.

Those optimization techniques are also applicable to other applications on CUDA. The low workload sequential execution is highly recommended being simulated in OPU with aingle threads or several threads to avoid the 1/O communication. The 1/O cost incurs a low enversish that will lowanticable resolute the utilization of CPU.

For massive inter-thread summation, the parallel reduction combined with sequentialthen-parallel processing is preferred because of its high efficiency and compatibility. If
atomic functions must be used because of lack of shared memory or other reasons, we recommend using the modified local function proposed in Appendix A.1.

For is dread caching, the register is the first choice for performance. If the caching task requires a small array, the array should be replaced by several hard-coding variables to somere the elements are resided in register instead of the drevice memory. Even if mass usage of register could pretorially reduce the occupancy of the GPU resources, it is still a good task-off for read-ing device memory accessing.

# Chapter 5 Motion and Visual Controller

Nonadays, controllera otter fran for attalistical gamepale are attaly velocited in anging contexts. They previous france and control for the destructure or will prepose a series of holding-and-pointing structures. The induced structures for visual based anometry are ness france france imperiations (R2) module presented in Control 4, the Nessearch metrics through controllera usa accomberness and growspore. Moreover, we will prepose how to hold a new accurate hybrid controller by utilizing both visual and metrics information.

### 5.1 Visual Based Controller

The visual based controller is the device that controls the cursor movement in 2dimensional space using images captured by a video camera. The key to the controller is to use image registration algorithm to track the local movement of the video camera. For any given two successive captured frames of the video camera, if they can be registrated toocher, then we can use the registrated buy to update the roution of the focus.

Figure 5-1 shows how we use the registration result to estimate the focal movement. The red rectangle indicates the image captured in last finane while the blue rectangle indicates the correct finane. The colored does are the center of the two images respectively. From the figure, we can clearly identify how the focus moves. After spdating the new focal position, the current learn is used as the previous financ for the next registration for repeating this step, we can track the focal movement and use it to control the cursor in real-time.



Figure 5-1 Tracking focal movement by the registration result

#### 5.1.1 Threading image capturing and image registration

The most important building block for the proposed device is the IR module. It iteratively fetches the image from the video camera, performs the image registration and control the cursor position using the mouse API with the registration result.

Video cameras usually capture the image stream in a constant frequency. If image registrations and image capturing are nn under the same thread, the final processing speed will be showed down. For example, if the sampling frequency of the video camera is vide and the R models in capable of performing 60 registrations per second. Ignoring the cost of other executions, the total execution time is 1000ms/30 + 1000ms/60 = 50ms, i.e., 20 registrations per second.

To address the problems, we implace a multithread source finding approach. Video screem copring is conduced by an independent frend. This first approach Video and a spectra of the foregarming of the sciences. If the source is upplied, it hadness an updating figure with "three" indicating the image is received. In another thread, IR model quarties whether the image is upplied, If yes, it will fields the image and then set the updating figure "three" indicating priorities. Otherwise, this thread will shop head first the updating figure "three" and constraint graphical sciences and the spectra of the value and two character query.

Resource locking is used for unintended image updating. When the IR module is fetching image, it will lock the image to disable updating from the input handling thread. After fetching, it will unlock the image to enable updating.

Figure 5-2 illustrates the multi-threading approach.



In this way, multithreading allows its meduic furth taster than image capturing. Notework, it parallels the image capturing and the image registration process, allowing the CPU to handle image capturing while the GPU performs registration.

## 5.1.2 Mouse control

After every registration, the result warp matrix is used to control the cursor of the mouse. If a warp matrix is,

$$W = \begin{bmatrix} P_0 & P_1 & P_2 \\ P_2 & P_4 & P_5 \\ P_5 & P_7 & 1 \end{bmatrix}$$

then, the displacement s of the focal movement is

$$\begin{split} s_x &= \frac{p_y f_x + p_y f_y + p_y}{p_y f_x + p_y f_y + 1} - f_x \\ s_y &= \frac{p_y f_x + p_y f_y + p_y}{p_y f_x + p_y f_y + 1} - f_y \end{split}$$

where  $(f_x, f_y)$  is the coordinate of the focus of the captured image. Usually, it is the center of the image.

These are no ways to control the outer. The first is pauling a, and a, as the displacement of currant to the manue API. This method is simply to implementation and these has proposed activate comparing with other discrimes to any enterthan second in simplicing a, and a, turne the shoulder position of the first and at it as the absolute constance of the current. The absolutes is hinger constitution of the current of the current of the current or distribution of the based of the current conditions of instage is shown in the current could be detended of the results of the application.

## 5.1 Motion Based Controller

In this section we will present an efficient wireless air-morse using accelerometers and groscopes. When using the controller, the user simply rotates it horizontally to control the enzyor movement in horizontal direction and vertically to control the corror movement in vertical direction.

## 5.1.1 Introduction of Wiimote

The device to provide acceleratories and generopes is the WF sensor controller (Winneck). A hubban ADXX339 acceleratories measures accelerations along the properdicular accelerations marge from 2 yeb 10 vg generationed from. If analoch ym and ac device called Wi Mosiellan (abort for MotionPlan), Winnet can sense angular velocities along 3 directions - yaw, pick and row with the MotionPlan incorporated to sensity taming fish generospe. **Figure 5.3** likettator the two sets of the motion data.



The Wilmote also provides the necessary buttons and the connection to PC. It can be paired with Bluenouth receiver and programmed by driver APIs to retrieve the motion readines.

#### 5.1.2 Motion Estimation

The movement of the cursor is controlled by the horizontal and vertical angular velocities when saving the Winness, If we define the h and v as the magnitudes of the current horizontal and vertical angular velocities, then the displacement of the cursor  $\mathbf{s}(s_x, s_y)$ can be calculated by

$$s_g = ah_c s_g = av$$
 (5.2)

where  $\alpha$  is the coefficient to decide how fast the cursor move.



Denote the current readings of yaw and pitch as y and p. If the rotation angle along the Yaxis of the Wiimote (angle of roll) is  $\theta$  (Figure 5-4), then we can calculate h and v by,

$$h = y\cos\theta + p\sin\theta$$
  
 $v = -y\sin\theta + p\cos\theta$  (5.3)

## 5.1.3 Estimating Angle of Roll

The only auknown in equation 5.3 in the angle of roll *B*. It can be measured by the acceleratoristics: If we arrange the readings of acceleratoristics to a vector (x, y, z), when hading with, this vector is smally the approximated wetters. The values of the elements are the projections of gravity on the three perpendicular news and the length of the vector is equal to 1. When the Wilness is patient for grand surface, the value of the vector is 10.1 for use ends the Wilness is a planted to grand surface, the value of the vector is 10.1 for use ends the Wilness is a shared and wetter. The values of the vector is the starts of the starts of the starts of the starts of roll is cannot be the starts of the starts in the start starts in the starts

$$= \arctan \frac{x}{x}$$

(5.4)



#### 5.1.4 Implementation

Figure 5.5 shows the implementation of the motion based controller. In each humition of the mouse controlling, the first step is to opdate the angle of rull according to Equation 10.5. Notice that the first step is to opdate the angle of rull, the updating is not repreferented for each intraction [35]. The typions will wank the langh of the accordant scenar to moure its infrared equal to one for a small period (e.g., secured instantion). If its, the Wilnish as assumed to be almost attaining yield therefore the accordantion. We not the Wilnish as assumed to be almost attaining yield therefore the accordantion to each or secure transmission want. A this tim, the adje of of all updated.

The second step is to estimate the motion by simply following Equation 5.3 to compute the horizontal and vertical angular velocities.

To provide stable feedbacks for mosse APF, we need to filter out unstable values. Shake of hands and dovice reading errors result in noises. If we let noises prox through, controlling cursor will be hand. Here, we add a threshold to filter out the noise and applicit the signation noised ratio.

At the last step, we output the motion parameters tuned by Equation 5.2 to mouse API to move the cursor.

#### 5.2 Hybrid Controller

The visual based controller provides high accuracy for slow or small range motion, but the accuracy degenerates under fast motion due to the low sampling frequency of the video camera. Captured at 20 FPs studer fast motion, two adjacent finnes may have very small overlapped region, which may cause the image registration process fail. Thus the tochrace of fast more speed in finding to a relative small range. Thus the controller provides reasonable real-time motion feedbacks, even for the fast movement, but it has a relatively low accuracy because of the error accumulation of the motion measurements and the estimation of the angle of roll. If we combine the merits of the two devices, we could build a more efficient controller.

In this section, we propose the hybrid controller. It uses the motion based estimation to provide pre-knowledge to the visual based image registration and thus yields a high accuracy even under fast motion.

## 5.2.1 Motion Based Pre-Knowledge

To improve the robustness of the visual based controller under fast motion, we need to provide images with larger everdaped region to the IR module. If we extend the field of vision of the vision sames and extends an areas as the current image to be registrared corresponding to the current fical moving direction which is estimated by the motion and approach, the region should have a larger everaphend prior with the next forms.



Figure 4.6 shoutes this case, where the ord recouple indicate the extended area in the extended apparent image shoure position is corresponding to the current field review of the extension. The takes compared indicates the experiment on the next finner which should be the same as the use extended image. Compared to the two firstners used in visual based approach such rise same confident (bits in Figure 5.4), they have larger extremely extremely extremely one of the extremely of the two firstners and the strength of the same same confidence (bits in Figure 5.4), they have larger extremely extremely extremely one of the same confidence (bits in Figure 5.4).

#### 5.2.2 Floating Window

Here we use term "floating window" to refer the approach for extracting the informative area using the motion based pre-sknowledge. It uses a fixed-size window floating in the extended region to decide which area to be extracted as the input image for the next explanation. The postion of the window represented by the condinate (x,y) of the topleft corner) is decided by the current focal velocity estimated by the motion based approach.

If the resolution of the camera is extended by 2m pixels both in width and height, the coordinate (x, y) is updated according to Equation 5.5

| $x = \begin{cases} m \\ m + \frac{h}{A} \\ m + \frac{ h }{h} m \end{cases}$ | $\begin{split} h &= 0 \\ h \neq 0, \left  \frac{h}{A} \right  < m \\ h \neq 0, \left  \frac{h}{A} \right  \geq m \end{split}$            |
|-----------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------|
| $y = \begin{cases} m \\ m + \frac{y}{A} \\ m + \frac{ y }{y} m \end{cases}$ | $\begin{split} & \nu = 0 \\ \nu \neq 0, \left  \frac{\nu}{A} \right  < m \\ \nu \neq 0, \left  \frac{\nu}{A} \right  \geq m \end{split}$ |

where h and v is the horizontal and vertical focal velocities and A is a positive integer to reduce the range of the h and v. A should be set according to the screen resolution of the target application.

When holding cill, h and  $\tau$  are equal to 0, the floating window is located in the center of the image with (x, y) is equal to  $(\tau, m)$  (**Hgure 5.7**, a). If the floats is moving, the floating window will move to the coordinate propertient to the estimated flocal velocities, h and  $\nu$  (**Hgure 5.7**, b), but it cannot exceed the border of the estanded region (**Hgure 5.7**, c).



## 5.2.3 Implementation

Figure 4.8 also the flow chard of the hybrid controller. The provodent of each litting follows the arrows bladed from 1 to 6 is the figure. The humation begins when the video stands in the standing the current flows. See, the current area of the newly outputed image in stand as the integrate image and the region tractated from the previous flows in source and previous flow in the two of the region tractated from the previous. We use the regionarian reach to out the dBR models to perform the registration. We use the registration reach to out the care in the same may as the visual based approach. At last, we use the current fload angular velocities to heart the first windows to be used for createncies the level mode for the next registration.



## 5.3 Accuracy Evaluation

The evaluation is to true the accurscy of the visual based controller, the motion based controller and the hybrid-standing. During the tes, the tester halds the centrol har is an initial task. Thes, we let the basics to do a point and the p basek to the hindin star. The corner will more composeding to the controller motion. Using the truth almost, the corner handler at the matching based to the partice. However, there will be a drill is practice due to other the energy control and the partice. However, there will be a drill in practice due to other the energy control of the controller or the town fulder means and measure the more accumalization style are the image controller by the control and the hindin pointies. The south is added the controller back to the initial datas alter the genere. The south is added the controller back to the initial datas alter the genere. The south is added to be pointers hisks to the initial datas alter the datas of the neutral maps and the generes and so are databasen of the datas of the mediang positions that all the generes and so the databasen of the datas of the mediang positions to measure the accuracy of the tread controller. The procedure is illustrated by Fager 5-3. The store aims for constiller with the image captured video. At first, we let the tester aim at the image center indicated by a dot. Then we recet the curses the durating position and the tester du due goaters. After the genture, the tester should points the centrel brok again at the center dot. Then, we record the ending position of the curses. This procedure is repeated multiple times to down the distribution of the duration.



The first test uses gestures of small motion. The tester waves the Wiimote towards a random direction and waves back.



Figure 5-18 shows a set of the testing result in which 25 gentners were performed for each controller. The distribution of drifts of the motion based controller is much spacer and far away from the starting position while the distribution of the either two controllers are both located near the starting position. This shows that the image registration specoch provides much ladger accuracy that mesolution heads much.

In this text, we can hardly tell the difference between the visual based controller and the hybrid controller. The variances of the distribution may be both due to the operation error of the texters. We would like to introduce more complex motions to text the difference. In the second text, the texter shakes the Wiimste very fast for about two seconds before going back to initial ease. The texting result is shown in Figure 5-11.



In this text the distribution of the motion based controller is even worse and we can clearly identify the improvement of the hybrid controller over the visual based controller. The distribution is more dense and closes to the expected destination. Moreover, because of the registration failures, outliers appear in the visual based case but there is note in the hybrid case.

## 5.4 Discussion

We show in this chapter how to build a controller using the real-time IR module. The evaluation shows that, even using visual information only, the visual based controller can provide more accurate pointing control over the widely-used motion based controller. Threads combining both visual based and motion based information, the bybeid controller further improves the robustness of system under fast motion. Compared with the conversional sensor-har-based Wiinstee pointing, the hybrid controller presented here does not restrict users to pointing toward the sensor har. The hubfing and pointing controlling method provides users visid gaming experimence, which could make it a more preferred approaches where humfalioning analyse of the moute.

## **Chapter 6 Other Applications and Conclusions**

The proposed image registration system can not only be used to build the proposed hybrid game controllers, but also be applied to other real-time vision applications.

## 6.1 Light-Gun for LCD

The light pen has been a popular pointing device for shoring game. The realistional light pass only such skit CHT monitor since they use calculate my intring information. The statistical for pointing spotiation on LCD monitors, Winner uses an additional informed light entiting device, called same bar, which is pleted stare the system. Nevertheless, this approach endorg games because, in order to short an edget on strengthey have present on different location. Memory, the resultation of these devices are much loss that the strength of the strength of the strength present.

To improve the pointing precision, the proposed image registration system can be used. Basically, more we register the image captured by the handhed camera with image displayed on the secure, we will have where the entert of the camera points to. To reduce the computational not and improve the registration robustness, we can also porterm the registration using a registe cartiscal around the pointing position provided by the machinol attribute. The transformation for the resulting a limit does, instead of the their inset.

75

6.2 Real-Time Image Mosaic



#### Figure 6-1 Real time image mosaic

With the high registering speed, the proposed image registration system could be used to generate an image mesaic from real-time video input. That is, we can use the transformation matrix calculated to waap the current frame and stitch it with previous formers.

To warp the images in real time, we use DirectX 3D (D3D) to accelerate the processing speed. The quadrangle frames are represented by of the D3D geometries. During registration, the vertices of the geometry are warped by the warp matrix and then pixels are filled in through hardware texturing.

To remove the high cost of 100 transportation between device and host, we design an interface to access image in graphic memory directly. Currently, it is capable of reading images from DirectX 3D instances or OpenGL buffers. The captured images are sent to D3D first and then the image regularization system fitches the image from the interface. And reactivities, the images regularized on transfer for the image from the interface.

Figure 6-1 shows such a real-time image stitching result. Our current implementation is capable of handling more than 30 images per second in a PC with a mid-range graphic card.

#### 6.3 Conclusions

The thesis proposes to use both visual and motion information for designing hybrid controllers for next-generation game consoles. To achieve this goal, we implement the key building block, the image registration module, in parallel for real-time processing.

Besides implementing the image registration algorithm on GPUs, we also implement two multi-evolution variants, each has its own advantages under difference strations. MRIR is suitable for high accuracy beomgraphic image registration. MRIR-SR has lower computational cost and can be used to register the images that have small overlapped region.

Compared to traditional motion based controller such as Wilmote, the proposed motionvisual-hybrid controller uses image registration result to provide high accuracy egomotion information, as well as using input of accelerometer to provide pre-knowledge of the movement. Experimental results show that the hybrid controller performs much better than the controllers that use either motion or visual information only.

In summary, the following techniques/algorithms are discussed in this thesis:

· Real time image registration module

The proposed image registration module can register 512 × 512 images at 60 FPS with high accuracy. It can be integrated to other application requiring real-time image alignment.

Multi-Resolution Image Registration with Selected Regions

This technique dramatically reduces the computational cost of multi-resolution image registration. Moreover, it increases the adaptability of the homographic image registration to inhomogeneous images.

Discussed CUDA optimization techniques

We present a series of optimization techniques to improve the performance of the image registration module. Those techniques are not only applicable to image registration, but also useful for optimizing other CUDA applications. We also discuss how to design orderined data structure for accessing CUDA memory.

· CUDA in-thread buffering

We propose a method that unrolls small loops and hard codes the small vector with senarate variables to enforce using registers to buffer intermediate results (Section 4.3.3 ). By comparing with the array or shared memory buffering approach, we find that this method is much more efficient. Therefore, we recommend using this method to buffer small vector or array for other applications.

· Motion based pointing device

We propose how to design a pointing device using Wiimste based on only the motion feedbacks. Compared so the traditional infrared ray solution which needs pointing Wiimote towards a sensor bar, this approach requires no additional device. It also has very lwo comparisonal cost but lacks of pointing accuracy.

· Hybrid pointing device

We propose how to integrate the focal tracking based on image registration with motion information to build a high accuracy hybrid pointing device. It provides pixellevel resolution that may eastend usage of the motion controller to the high competitive games.

# Chapter 7 References

- Zitova, B.a.F., J. Image registration methods: a survey, in Image and Vision Computing. 2003. p. 977-1000.
- Moravec, H.P., Rover visual obstacle avoidance, in International Joint Conference on Artificial Intelligence. 1981: Vancouver, Canada. p. 785-790.
- Lowe, D.G., Distinctive Image Features from Scale-Invariant Keypoints, in International Journal of Computer Vision. 2004.
- Fischler, M.A. and R.C. Bolles, Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography, in Communications of the ACM, 1981, p. 381-395.
- Davison, A.J., Real-time simultaneous localisation and mapping with a single carnera. 2003.
- Davison, A.J., Active search for real-time vision, in Tenth IEEE International Conference on Computer Vision. 2005. p. 66-73.
- Pratt, W.K., Digital image processing (2nd ed.). 1991, New York, NY, USA: John Wiley & Sons, Inc.
- Barnea, D.J. and H.F. Siverman, A class of algorithms for fast digital image registration, in IEEE Transactions on Computing 21, 1972. p. 179-186.
- Lucas, B. and T. Kanade. An Iterative Image Registration Technique with an Application to Stereo Vision. in Imaging Understanding Workshop. 1981.
- Baker, S. and I. Matthews, Lucas-kanade 20 years on: A unifying framework: Part 1: The quantity approximated, the warp update rule, and the gradient descent approximation. International Journal of Computer Vision, 2004. 56(3): p. 221– 255.
- Burt, P.J., Fast filter transforms for image processing, in Computer Graphics and Image Processing, 1981, p. 20-51.
- Wong, R.Y. and E.L. Hall, Sequential Hierarchical Scene Matching, in IEEE Transactions on Computers. 1978, IEEE Computer Society p. 359-366.

- Kumar, R., et al. Registration of video to geo-referenced imagery. in Fourteenth International Conference on Pattern Recognition. 1998.
- Sarvaiya, J.N., S. Patnaik, and S. Bombaywala, Image Registration by Template Matching Using Normalized Cross-Correlation, in International Conference on Advances in Computing, Control, and Telecommunication Technologies. 2019
- Stefano, L.D., S. Mattoccia, and M. Mola. An efficient algorithm for exhaustive template matching based on normalized cross correlation. In 12th International Conference on Image Analysis and Processing. 2003.
- 16. Lewis, J.P., Fast Normalized Cross-Correlation, in Vision Interface, 1995.
- Wong, E.L., W.Y.F. Yuen, and C.S.T. Choy. Designing Wii Controller As A Powerful Musical Instrument In An Interactive Music Performance System. in The 6th International Conference on Advances in Mobile Computing and Multimedia. 2008.
- Schlömer, T., et al. Gesture recognition with a Wii controller. in the 2nd international conference on Tangible and embedded interaction. 2008. Born, Germany
- Cheorng, S.N., W.J. Yap, and M.L. Chan. Cost-Effective Wiimote-Based Technology-Enhanced Teaching and Learning Platform. in the 10th Pacific Rim Conference on Multimedia: Advances in Multimedia Information Processing. 2009.
- Luinge, H.J., P.H. Veltink, and C.T.M. Baten. Estimating orientation with groscopes and accelerometers. in The First Joint BMES/EMBS Conference. 1999.
- Sakaguchi, T., et al., in IEEE/SICE/RSJ International Conference on Multisensor Fusion and Integration for Intelligent Systems. 1996. p. 470-475.
- 22. Harris, M., GPGPU: General-Purpose Computation on GPUs.
- 23. Rosado, G., Motion Blur as a Post-Processing Effect, in GPU Gem 3. 2007.
- Fung, J. and S. Mann, Using graphics devices in reverse: GPU-based Image Processing and Computer Vision, in 2008 IEEE International Conference on Multimedia and Exeo. 2008. n. 9 - 12.

- Yu, W. and T. Chen, High Performance Stereo Vision Designed for Massively Data Parallel Platforms, in IEEE Transactions on Circuits and Systems for Video Technology
- 26. Jargstorff, F. and E. Young, CUDA Video Decoder API. 2008.
- Shen, G., et al. Accelerating video decoding using GPU. in 2003 IEEE International Conference on Acoustics. Speech, and Signal Processing. 2003.
- Nyland, L., M. Harris, and J. Prins, Fast N-Body Simulation with CUDA, in GPU Gents 3, 2007.
- 29. Grand, S.L., Broad-Phase Collision Detection with CUDA, in GPU Gems 3, 2007.
- Garcia, V., E. Debreuve, and M. Barlaud, Fast k nearest neighbor search using GPU, in IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops. 2008. p. 1–6.
- Sintorn, E. and U. Assarsson, Fast parallel GPU-sorting using a hybrid algorithm Parallel and Distributed Computing, 2008. 68(10): p. 1381-1388.
- 32. Harris, M. Oetimizing Parallel Reduction in CUDA.
- Harris, M., S. Sengupta, and J.D. O'Wens, Parallel Prefix Sum (Scan) with CUDA, in CPU Gens 3, 2007.
- 34. NVIDIA. NVIDIA CUDA Programming Guide V3.0.
- 35. gl.mer. Wijvourselft! wijvourself.gl.tter.org

## Appendix A

## A.1 Bank conflict

Due to the hadrourd dedge, similarationally accenting the shared memory of CUDA devices from mainping processing subs rung cases back confilers. The on-big barder denomy is divided into event back of the back carbox denoming the back for C23 devices.) 32-bit words will be distributed into those backs in reporting sequential order. The following figures above how the demonsts of an integer array are distributed to the shared memory.



The following discussion is hand on the C1 a devices. If the memory accors it a vary finite indicates back, you be due in administration, Historer, when the our more accesses full into the same bank, they will be wridlend, which cannot lattery. If two accesses full into the same bank, the due due the interfault as 2-along back conflict. Because cannot be experimented as an away back conflict. Because cannot breach in a way, there are so sufficient. The due tamotry access is a wary is separated into two half, so the first is financh in the wary will nearce emfort which be committed in the bank in the source due that are sufficient in a 16-way back conflict source.

## A.2 Atomic Reduction Without Bank Conflicts on CUDA Device

To avoid bank conflict (Appendix A.1.), we introduce a buffered shared memory access pattern to perform atomic instruction. An integer array of 16 elements is allocated in shared memory to store the intermediate results. The sum of all the elements is calculated by adding together all the intermediate results. The code below shows an example of buffered shared memory access.

\_shared\_\_int s\_buff[16]; if(threadfar,x < 16) s\_buff(threadfar,x] = 0; \_syncthreadfar,y] = 0; \_syncthreadfar,y] \_syncthreadfar,y]; \_syncthreadfar,y]; & rst = Reduce(s buff);

84

A.3 Optimized Parallel Reduction on CUDA Device

device float reduce512(volatile float\* s\_sum. float elem) s sum[threadIdx.x] = elem; \_\_syncthreads(); if(threadIdx.x < 256)[s\_sum[threadIdx.x] = elem = elem + s complete addidy y a 2561: } sumrthreads(): if/threadIdy x < 1281/s sum[threadIdy.x] = elem = elem + s sum[threadIdx.x + 128];} \_\_syncthreads(); if/threadIdx x < 64)(s sum[threadIdx.x] = elem = elem + s sum[threadIdy x + 64]: ] sum(threads()); if/threadIdy x ( 32) s sum[threadIdx.x] = elem = elem + s sum[threadIdx.x + 321 s sum[threadIdx.x] = elem = elem + s sum[threadIdx.x + 16]; s sum[threadIdx.x] = elem = elem + s sum[threadIdx.x + s sum[threadIdx.x] = elem = elem + s sum[threadIdx.x + s sum[threadIdx.x] = elem = elem + s sum[threadIdx.x + eles as a sum[threadIds.x + 1]; neturn elem: shared float s sum[512]; float elem = 0: elem = vector[threadIdx.x]; \_\_syncthreads(); float sum = reduce512(s sum, elem);

The optimized parallel reduction presented by the above figure is the modified version based on Harris' work [32]. It introduces sequential addressing to avoid bank conflicts and fully unrolls codes to reduce the complexity of flow control. The valuable elon is used for ensuring register is used for storing intermediate result that reduce shared memory accessing. The number of adding operations for each thread is 9. The shared memory usage is  $512 \times 4 = 2008$ .

## A.4 Parallel Reduction for Vectors on CUDA Device

The basic date of parallel reduction for vectors in to such the life freedu-to reduce multiple dimensions at one time. If we consider the breads sampe of the optimisation parallel models and the life of breads that during the proceedant. To laid 312 chosent from global ememory, we and 312 shows. The first may of reduction model. 364 threads the 252 densites life, the source along use and 12 chosent in the 254 threads life, and we see The parallel reduction for vectors model a special strange pattern to exolute sequentity from growth and there are as back confide. Vectors should be source sequentity in time strape. The distribution of the strand sequentity for the strape. The distribution of the strand sequentity is the strape. The distribution of the strand sequentity for the strape. The distribution of the strand sequentity is the strape. The distribution of the strand sequentity is the strape. The distribution of the strand sequentity is the strape. The distribution of the strand sequentity is the strape. The distribution of the strand sequentity is the strape. The distribution of the strand sequentity is the strape. The distribution of the strand sequentity is the strape. The distribution of the strand sequentity is the strape sequentity in the strape sequentity is the str

| ν,                      | w, | ¥3 |  |
|-------------------------|----|----|--|
| 5+ 5+ F1 5+ 5+ 5+ 5+ 5+ |    |    |  |
|                         |    |    |  |
|                         |    |    |  |
|                         |    |    |  |

If the vectors are saved in this pattern, we can use the same sequential addressing method as the optimized parallel reduction to sum the vectors but stop the reduction earlier. For example, the following device function can be used for adding 64 8-dimentional vectors.

device\_\_ void reduce64for8(volatile float\* s\_sum, float elem, float\* s rst) s sum[threadIdx.x] = elem: syncthreads(); if(threadIdx.x < 256)(s sum[threadIdx.x] = elem = elem + s sum(threadIdx.x + 128]:) syncthreads(): if(threadIdx.x < 64)(s sum[threadIdx.x] = elem = elem + s sum(threadIdx.x + 641: ) syncthreads(): s sum[threadIdx.x] = elem = elem + s sum[threadIdx.x + s sum[threadIdx.x] = elem = elem + s sum[threadIdx.x + 16]: elem - elem + s sum[threadIdx.x + 8]; if(threadIdx.x < 8) s rst[threadIdx.x] = elem; syncthreads(); \_\_shared\_\_float s\_sum(512]; shared\_float s\_rst[#]; float elem = 0; elem = vector[threadIdx.x]; syncthreads(); reduce64for8(s sum, elem, s rst);

The number of addition operation is 6. Compared to 48 operations, it is a significant improvement. If more than 64 vectors of 8 elements are to be added, sequential-then parallel processing can be introduced.

#### A.5 Local Array Cached Parallel Reduction Approach

To apply the hybrid processing strategy, the whole image is partition into the segments of 512 horizontally adjacent pixels in row major. Each blocks sequentially accumulating M segments.

```
__shared__float s_sum[512];
_shared_ float s_dp[8];
int tid - blockTdy y t blockflip y + threadfdy y:
int i - ide X imme width int i - ide / imme height:
fondint c = 0: c c H: ++c)
     id/idy by image givel size/break;
     float x, y;
     Harofi i By Bull
     float diff = 0:
     16/ x + H & X + K & X + 0 & X + 2 0 )
           diff = I(x, x) = I(1, 1))
     for(int s = 0: s ( E: ++s)
           float do = diff == 0 ? 0 : diff * SDI(1, 1, 5);
           float sum = reduce512(s sum, dp);
           if(threadIdx.x == 0)s.dp[s] += sum;
     idx += gridDim.x * blockDim.x;
} syncthreads();
if(threadIdx,x < 8) local parameter[blockIdx,x * 8 + threadIdx,x] =
s do[threadIdy_w]:
```














