Optimal Control Problem and Power-Efficient Medical Image Processing Using PUMA

Himadri Nath Moulick (1), Moumita Ghosh (2)
(1) CSE, Aryabhatta Institute of Engg & Management, Durgapur, PIN-713148, India
(2) CSE, University Institute of Technology (The University of Burdwan), PIN-712104, India

ABSTRACT: As a starting point of this paper we present a problem from mammographic image processing. We show how it can be formulated as an optimal control problem for PDEs and illustrate that it leads to penalty terms which are nonstandard in the theory of optimal control of PDEs. To solve this control problem we use a generalization of the conditional gradient method which is especially suitable for non-convex problems. We apply this method to our control problem and illustrate that this method also covers the recently proposed method of surrogate functionals from the theory of inverse problems. Graphics processing units (GPUs) are becoming an increasingly popular platform to run applications that require a high computation throughput. They are limited, however, by memory bandwidth and power and, as such, cannot always achieve their full potential. This paper presents the PUMA architecture - a domain-specific accelerator designed specifically for medical imaging applications, but with sufficient generality to make it programmable. The goal is to closely match the performance achieved by GPUs in this domain but at a fraction of the power consumption. The results are quite promising - PUMA achieves up to 2X the performance of a modern GPU architecture and has up to 54X improved efficiency on a floating-point- and memory-intensive MRI reconstruction algorithm.

KEYWORDS: generalized conditional gradient method, surrogate functionals, image processing, optimal control of PDEs

I. INTRODUCTION
For many years medical imaging has aimed at developing fully automatic, software-based diagnostic systems. However, the success of those automatic systems is rather limited and the human expert is as much responsible for the final diagnosis as in previous years. Hence, growing effort has been devoted to enhancing the techniques for presenting the medical images as well as additional information. In Germany a particular effort is made in mammography, i.e. X-ray scans of the female breast for early detection of breast cancer. The process of examination by the medical experts is divided into a very short recognition phase (< 1 sec.) and a second verification phase (≈ 1 min.). During the recognition phase, the expert first recognizes the coarse features, then more and more fine features. Tests have shown that the experts usually form their decisions during this very short recognition phase. Nevertheless, the verification phase is the more critical one. The critical and difficult cases, where the recognition phase does not end with a preliminary diagnosis, most often apply to women in the early stages of cancer. During the verification phase the expert shifts forwards and backwards, thereby alternating between examining small details and catching an overall impression of the location of critical patterns within the organ. The advent of programmable graphics processing units, or GPUs, for general-purpose computing is one of the major steps taken in computing over the last few years.
These GPGPUs which, in the past, have been predominantly used for gaming and advanced image and video editing, are now being used by many developers to accelerate inherently parallel programs in several other domains. Indeed, considerable amounts of time and engineering effort are often spent to modify programs so that they run effectively on GPUs. Several different application domains observe remarkable speedups when using GPUs, including the following [18]:
• 4X to 100X speedup on medical applications, such as biomedical image analysis, 3D reconstruction of tissue structures from large sets of microscopic images, and accelerating MRI reconstructions.
• 8X to 260X speedup on electronic design automation, such as power grid analysis and statistical static timing analysis.
• 4X to 327X speedup on physics applications, such as weather prediction and astrophysics.
• 11X to 100X speedup on financial applications, such as instrument pricing using Monte-Carlo methods.

II. MOTIVATION FROM MEDICAL IMAGING
This process can be supported by presenting the expert different versions of the original image during close-up and normal sub-phases. More precisely, the expert sees a version with contrast-enhanced small details during a close-up phase ("fine scale"), while he sees an image which preserves all major edges but smoothes within regions ("coarse scale") during the normal phase. For enhancing fine details in mammography images a variety of algorithms have been proposed. Many of them are based on the wavelet transform due to its property of dividing an image into different scale representations; see for example [7] and the references therein. In this work we deal with the development of an optimized presentation for one cycle of the verification phase. To put the problem in mathematical terms, we start with a given image y_0, assumed to be a function defined on a domain Ω. The fine scale and the coarse scale images are denoted y_f and y_c, respectively. Under the natural assumption of finite-energy images we model them as functions in L²(Ω). The goal is to produce a movie (i.e. a time-dependent function) y: [0,1] → L²(Ω) from the given images y_0, y_f and y_c such that
• the movie starts in y_0, i.e. y(0) = y_0,
• the movie sweeps to the fine scale image and to the coarse scale image, e.g. y(t) ≈ y_f for t ∈ [.2, .4] and y(t) ≈ y_c for t ∈ [.6, .8],
• the movie sweeps in a "natural" way.
An example of a mammography image, a fine scale image, and a coarse scale image is shown in Fig 1. As a first guess one could try a linear interpolation between the fine scale and the coarse scale representation. This method has one serious drawback: it does not take the scale sweep into account, i.e. all fine details are just faded in rather than developing one after another.

Fig 1: A mammography image. Left: original image y_0, middle: fine scale image y_f, right: coarse scale image y_c.

III. PDES AND CONTROL PROBLEMS IN IMAGE PROCESSING
Parabolic partial differential equations are a widely used tool in image processing. Diffusion equations like the heat equation [14], the Perona-Malik equation [10] or anisotropic equations [13] are used for smoothing, denoising and edge enhancing. The smoothing of a given image y_0 with the heat equation is done by solving

\partial_t y - \Delta y = 0 \text{ in } [0,1] \times \Omega, \quad \partial_\nu y = 0 \text{ on } [0,1] \times \partial\Omega, \quad y(0) = y_0,    (1)

where \partial_\nu y stands for the normal derivative, i.e. we impose homogeneous Neumann boundary conditions.
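To make equation (1) concrete, the following minimal sketch (an illustration, not code from the paper; it assumes unit grid spacing and explicit Euler time stepping) evolves an image under the heat equation with homogeneous Neumann boundary conditions; frames of the scale-space movie are the snapshots y(t) at increasing t.

```python
import numpy as np

def heat_smooth(y0, t_end, dt=0.2):
    """Explicit-Euler solution of y_t = Laplacian(y) on an image grid.

    Edge replication in np.pad makes the one-sided difference at the
    boundary vanish, i.e. homogeneous Neumann conditions as in (1).
    Stability requires dt <= 0.25 for grid spacing h = 1.
    """
    y = np.asarray(y0, dtype=float).copy()
    for _ in range(int(t_end / dt)):
        p = np.pad(y, 1, mode='edge')
        lap = (p[:-2, 1:-1] + p[2:, 1:-1] +
               p[1:-1, :-2] + p[1:-1, 2:] - 4.0 * y)
        y += dt * lap
    return y

# a scale-space "movie": progressively smoother versions of y0
# frames = [heat_smooth(y0, t) for t in (0.0, 0.5, 1.0, 2.0, 4.0)]
```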
The solution y: [0,1] → L²(Ω) gives a movie which starts at the image y_0 and becomes smoother with time t. This evolution is also called scale space and has been analyzed in detail by the image processing community since the 1980s. In particular, the heat equation does not create new features with increasing time, see e.g. [5] and the references therein. Thus, the heat equation is well suited to model a sweep from a fine scale image y_f to a coarse scale image y_c. Hence, we take the image y_f as initial value.

Modeling as an optimal control problem: To make the movie y end at a certain coarse scale image y_c instead of its given endpoint y(1), we propose the following optimal control problem:

\min_u \tfrac{1}{2}\|y(1) - y_c\|_{L^2(\Omega)}^2 + \tfrac{\lambda}{2}\|u\|^2 \quad \text{subject to} \quad \partial_t y - \Delta y = u, \; \partial_\nu y = 0, \; y(0) = y_f.    (2)

In other words, the diffusion process is forced to end in y_c with the help of a heat source u.

III.1: Adaptation to image processing: The problem described above is classical in the theory of optimal control of PDEs, though not well adapted to image processing. The solution of this problem may have several drawbacks: the control u will be smooth due to the regularization and will have a large support. This will result in very smooth changes in the image sequence y and, worse, in global changes in the whole image. To overcome these difficulties, different norms can be used for regularization. A widely used choice in image processing is Besov norms, because they are appropriate to model images. Besov norms can be defined in different ways, e.g. in terms of moduli of smoothness [12] or in terms of Littlewood-Paley decompositions [6]. Here we take another viewpoint and define the Besov spaces via norms of wavelet expansions [2, 9]. For a sufficiently smooth wavelet ψ, the Besov seminorm of a function f on a set M is defined as

|f|_{B^s_{p,q}(M)} = \Big( \sum_j 2^{jq(s + d(1/2 - 1/p))} \Big( \sum_{k,i} |\langle f, \psi^i_{j,k} \rangle|^p \Big)^{q/p} \Big)^{1/q},    (3)

where j is the scale index, k indicates translation and i stands for the directions. The Besov space B^s_{p,q}(M) consists of the functions f ∈ L^p(M) that have a finite Besov seminorm. See [6, 9] for a more detailed introduction to wavelets and Besov spaces.

III.2: The solution of the PDE and the control-to-state mapping: The solution of the heat equation is a classical task. If we assume that the initial value y_f is in L²(Ω) and the control u is in L²([0,1] × Ω), the solution y is in L²([0,1], H¹(Ω)) ∩ C([0,1], L²(Ω)). In particular, y is continuous with respect to time and the point evaluation y(1) makes sense, see e.g. [8]. In our case the solution operator is affine linear, due to the nonzero initial value. We make the following modification to come back to a linear problem: we split the solution into two parts. The non-controlled part y_n and the homogeneous part y_h are the solutions of

\partial_t y_n - \Delta y_n = 0, \; y_n(0) = y_f \qquad \text{and} \qquad \partial_t y_h - \Delta y_h = u, \; y_h(0) = 0    (4)

(both with homogeneous Neumann boundary conditions). Then the solution operator G: u ↦ y_h is linear and continuous from L²([0,1], L²(Ω)) to L²([0,1], H¹(Ω)) ∩ C([0,1], L²(Ω)). With the help of the point evaluation operator we obtain the control-to-state mapping K: u ↦ y_h(1), linear and continuous from L²([0,1], L²(Ω)) to L²(Ω). The solution is then y = y_n + y_h and we can focus on the control problem for y_h. Together with the thoughts of the previous subsection we end up with the following minimization problem:

\min_u \tfrac{1}{2}\|y_n(1) + Ku - y_c\|_{L^2(\Omega)}^2 + \lambda \Phi(u),    (5)

where the penalty Φ is of Besov type, built from the seminorm (3).
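As an illustration of how a penalty of the form (3) can be evaluated in practice, here is a small sketch (ours, not the paper's) using the PyWavelets package. The weight 2^{jp(s+1-2/p)} assumes d = 2, q = p and L²-normalized wavelets; normalization conventions differ between references, so treat the exponent as indicative.

```python
import numpy as np
import pywt  # PyWavelets

def besov_penalty(u, s=1.5, p=1.0, wavelet='db2', level=4):
    """|u|_{B^s_{p,p}}^p as a weighted l^p norm of wavelet detail
    coefficients, following (3) with d = 2 and q = p."""
    coeffs = pywt.wavedec2(np.asarray(u, dtype=float), wavelet, level=level)
    total = 0.0
    # coeffs[1:] holds (cH, cV, cD) detail triples, coarsest scale first;
    # j = 0 is the coarsest detail scale, j = level - 1 the finest
    for j, details in enumerate(coeffs[1:]):
        weight = 2.0 ** (j * p * (s + 1.0 - 2.0 / p))
        for c in details:  # the three directions i
            total += weight * np.sum(np.abs(c) ** p)
    return total
```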
III.3. Solution of the optimal control problem: The minimization of the functional (5) is not straightforward. The non-quadratic penalty term leads to a nonlinear normal equation which cannot be solved explicitly. Here we use a generalization of the conditional gradient method for the minimization.

IV. THE GENERALIZED CONDITIONAL GRADIENT METHOD
The classical conditional gradient method deals with minimization problems of the form

\min_{u \in C} F(u),    (6)

where C is a bounded convex set and F is a possibly non-linear functional. One notices that this constrained problem can actually be written as an "unconstrained" one with the help of the indicator functional I_C (which is 0 on C and ∞ elsewhere). With Φ = I_C, problem (6) can thus be reformulated as

\min_{u \in H} F(u) + \Phi(u).    (7)

To illustrate the proposed generalization, we summarize the key properties of F and Φ: F is smooth while Φ may contain non-differentiable parts; the minimization problem with Φ alone is considered easy to solve, while the minimization of F is comparatively hard; and the influence of Φ is rather small in comparison to F. With these assumptions in mind, the conditional gradient method can also be motivated as follows. Let u ∈ H be given such that Φ(u) < ∞. We seek an update direction from a linearized problem. Since Φ is not differentiable, we only linearize F:

\min_{v \in H} \langle F'(u), v \rangle + \Phi(v).    (8)

The minimizer of this problem serves as an update direction. So this "generalized conditional gradient method" in the (n+1)-st step reads as follows. Let u_n ∈ H be given such that Φ(u_n) < ∞.
1. Determine the solution of (8), linearized at u_n, and denote it v_n.
2. Set s_n as a solution of \min_{s \in [0,1]} F(u_n + s(v_n - u_n)) + \Phi(u_n + s(v_n - u_n)).
3. Let u_{n+1} = u_n + s_n(v_n - u_n).
To ensure existence of a solution in Step 1 we state the following condition.

Assumption 1: Let the functional Φ: H → ]−∞, ∞] be proper, convex, lower semi-continuous and coercive with respect to the norm.

Standard arguments from convex analysis yield the existence of a minimizer in Step 1 of the algorithm [4]. So if F is Gâteaux-differentiable in H, the algorithm is well defined. The convergence of the generalized conditional gradient method is analyzed in detail by the authors in [1]. The main result there is the following theorem.

Theorem 2: Let Φ satisfy Assumption 1 and let every set E_t = {u ∈ H | Φ(u) ≤ t} be compact. Let F be continuously Fréchet differentiable, let F + Φ be coercive, and let u_0 be given such that Φ(u_0) < ∞. Denote by (u_n) the sequence generated by the generalized conditional gradient method. Then every convergent subsequence of (u_n) converges to a stationary point of F + Φ, and at least one such subsequence exists.

Two remarks are in order. First, we notice that the theorem is also valid if the functional F is not convex. Second, the theorem only gives convergence to a stationary point, which may seem unsatisfactory, especially if one wants to minimize non-convex functionals. But this does not have to be a drawback, as we will see in the next section.
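The iteration above translates almost literally into code. The sketch below is our illustration, not the authors' implementation: it replaces the exact line search of step 2 by a grid search, and leaves the linearized subproblem (8) to a user-supplied solver, since its form depends on Φ. For instance, for Φ(v) = λ‖v‖₁ + (μ/2)‖v‖² the subproblem reduces to componentwise soft-thresholding, as in the helper at the end.

```python
import numpy as np

def gcg(F, dF, Phi, linearized_argmin, u0, n_iter=100, ls_grid=65):
    """Generalized conditional gradient for min_u F(u) + Phi(u).

    linearized_argmin(g) must return argmin_v <g, v> + Phi(v); this is
    well-posed when Phi is proper, convex, l.s.c. and coercive
    (Assumption 1 above).
    """
    u = np.asarray(u0, dtype=float).copy()
    for _ in range(n_iter):
        v = linearized_argmin(dF(u))           # step 1: direction v_n
        d = v - u
        ss = np.linspace(0.0, 1.0, ls_grid)    # step 2: crude line search
        s = ss[np.argmin([F(u + s * d) + Phi(u + s * d) for s in ss])]
        u = u + s * d                          # step 3: update
    return u

# example subproblem solver for Phi(v) = lam*||v||_1 + 0.5*mu*||v||^2:
# componentwise soft-thresholding of the gradient, scaled by 1/mu
def soft_threshold_argmin(g, lam=0.1, mu=1.0):
    return -np.sign(g) * np.maximum(np.abs(g) - lam, 0.0) / mu
```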
1. Application: Here we show an application of the methodology described above. Since the effects can be seen more clearly in artificial images, we do not use original images. The artificial images we used are shown in Fig 2.

Fig 2: Images used for illustration. Left: fine scale image, right: coarse scale image.

For illustration we use the values p = 1 and s = 3/2 + ε > 3/2 in the minimization problem (5), since this is close to the BV norm and the corresponding Besov space embeds compactly. The results are presented in Fig 3. The figure shows a comparison of the linear interpolation, the pure result of the application of the heat equation, and the result of the optimal control problem. One sees that the linear interpolation only fades out the details. In the uncontrolled result (middle column) the details vanish one after another, but the process does not end in the desired endpoint. The result of the optimal control problem (right column) exhibits both a nice successive vanishing of the details and ends in the given endpoint.

V. THE ADVANTAGES OF GPUs
GPUs have many appealing hardware features. Firstly, they lend themselves very well to both thread-level and data-level parallelism. Thread-level parallelism (TLP) is exploited by having a large number of independent processing elements (PEs) on the GPU, each with its own set of functional units (FUs) and local storage. Individual threads can quite cleanly be assigned, either statically by the programmer or dynamically by the hardware, to each of these PEs, and inter-thread communication is made possible by some form of interconnect fabric or through local storage such as caches. Programs with a large amount of data-level parallelism (DLP) can make use of vector-SIMD units in these PEs, which allow a single instruction to perform an operation on several data elements at the same time. DLP can also be extracted in programs with compute-intensive loops that have little or no inter-iteration dependencies by executing operations from different iterations within a single SIMD instruction. Secondly, GDDR RAM and its increasingly fast successors have given GPUs access to an immense amount of memory bandwidth. The AMD Radeon HD 4870 - the first GPU to support GDDR5 memory - has a memory bandwidth of up to 115 GB/s. Above all, GPUs are commodity hardware products commonly available as part of many desktop and laptop computers. The tools to program them are also easily available; NVIDIA's Compute Unified Device Architecture (CUDA) package, for example, is free to download from their website [15]. CUDA is a general-purpose parallel computing architecture which consists of the CUDA instruction set and the compute engine in the GPU. It provides a small set of extensions to the C programming language, which enables straightforward implementation of parallel algorithms on the GPU. CUDA also supports scheduling the computation between CPU and GPU, such that serial portions of applications run on the CPU and parallel portions are mapped to the GPU. Individual cores in Intel's upcoming Larrabee processor implement the ubiquitous x86 ISA [23], allowing users to use a host of already-existing development tools to port their applications to it. Server products like Tesla [17] with even more compute power are also available.
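To give a flavor of the CUDA execution model described above, here is an illustrative sketch of ours, written in Python with the Numba CUDA bindings rather than in CUDA C (it needs an NVIDIA GPU to run): each thread handles one array element, and the block/grid launch configuration supplies the thread-level parallelism.

```python
import numpy as np
from numba import cuda

@cuda.jit
def saxpy(a, x, y, out):
    """out[i] = a * x[i] + y[i]; one GPU thread per element."""
    i = cuda.grid(1)          # global thread index
    if i < out.size:          # guard for the last, partially filled block
        out[i] = a * x[i] + y[i]

n = 1 << 20
x = np.ones(n, dtype=np.float32)
y = np.full(n, 2.0, dtype=np.float32)
out = np.zeros(n, dtype=np.float32)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
saxpy[blocks, threads_per_block](np.float32(3.0), x, y, out)  # kernel launch
```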
VI. THE QUEST FOR PROGRAMMABLE AND SPECIALIZED HARDWARE
A wide range of architectures, in addition to GPUs, have been designed to address the problem of providing high-performance computation efficiently. These solutions maintain or sacrifice programmability to various degrees depending on the domain they target. Fig 3 shows the performance (on the y-axis) and programmability (on the x-axis) expectations of various architecture styles. The numbers next to each of the ovals show the approximate performance-power ratio offered by each of these solutions. General-purpose processors (GPPs), which fall in the lower right corner of the figure, are highly programmable solutions but are limited in terms of the peak performance they can achieve. Further, structures like instruction decoders and caches that are needed to support programmability consume energy. This results in a very low computational efficiency of about 1 MIPS-per-mW, for example, for the Intel Pentium-M processor. On the other end of the spectrum are application-specific integrated circuits (ASICs). ASICs are custom-designed for a particular problem, without extraneous hardware structures. Thus, ASICs have a high computational density with hardwired control, resulting in a computation efficiency up to 1,000 to 10,000 times that of GPPs. The space between these two extremes is populated by different solutions with varying degrees of programmability. Application-specific instruction-set processors (ASIPs) are processors with custom extensions for a particular application or application domain. They can be quite efficient when running the applications for which they are designed, and they are also capable of running any other application, though with reduced efficiency. Examples include processors from Tensilica and ARC, transport-triggered architectures [3] and custom-fit processors [9]. Domain loop accelerators (LAs) are designed to execute computation-intensive loops present in the media and signal processing domains. Their design is close to that of VLIW processors, but with a much higher number of function units and, consequently, a higher peak performance. Very long instruction words in a control memory direct all FUs every cycle. However, domain loop accelerators have less flexibility than GPPs because only highly computation-intensive loops map well to them. Some examples of architectures in this design space are RSVP [1] and CGRAs [14]. Coarse-grain adaptable architectures have coarser-grain building blocks compared to FPGAs but, like FPGAs, still maintain bit-level reconfigurability. The coarser reconfiguration granularity improves the computation efficiency of these solutions. However, non-standard tools are needed to map computations onto them, and their success has been limited to the multimedia domain. PipeRench [10] and RaPiD [6] are some examples of coarse-grain adaptable architectures.

Fig. 3. Comparison of peak performance, power efficiency, and programmability of different architecture design styles.

1. Programmable Loop Accelerators: The programmable solutions shown in Fig 3 are all "universally" programmable, allowing any loop to be mapped onto them, although at varying degrees of efficiency. There is a wide gap between the efficiency that can be achieved by ASICs and the efficiency that can be achieved by these programmable solutions. There are, for example, instances where only a narrow range of flexibility is required. Using any of the above solutions is overkill for such instances, as they sacrifice too much efficiency for the needed flexibility. Further, most of the middle-ground solutions listed above do not offer any support for fast floating-point computation, which limits the number of applications they can be used for. This work advocates an open area in the design space where a non-trivial amount of programmability is provided in terms of intra-processor communication, functionality and storage, but where the application- and domain-specific design, as a whole, resembles an ASIC more than a processor. This design point is labeled Programmable Loop Accelerator, or PLA (not to be confused with a programmable logic array).
TABLE 1. Medical application characteristics

VII. TARGETING MEDICAL APPLICATIONS
Medical imaging devices are generally large, bulky and expensive machines that have very limited portability and consume large amounts of power. There is an increasing focus on reducing the power of these medical imaging devices [20]. In order to address this issue, this work focuses on principal components of Computed Tomography (CT) and Magnetic Resonance Imaging (MRI) image processing and reconstruction. A CT scan involves capturing a composite image from a series of X-ray images taken from various angles around a subject. It produces a very large amount of data that can be manipulated using a variety of techniques to best arrive at a diagnosis. Oftentimes this involves separating different layers of the captured image based on their radiodensities. A common way of accomplishing this is the well-known image-processing algorithm known as image segmentation. In essence, image segmentation partitions a given image into multiple regions based on any of a number of different criteria, such as edges, colors or textures. Being able to partition images in this manner allows for studying isolated sections of the image rather than all the information that was captured. The segmentation algorithm used in this work has three main floating-point-intensive components: graph segmenting (CT.segment), Laplacian filtering (CT.laplace) and Gaussian convolution (CT.gauss). Laplacian filtering highlights portions of the image that exhibit a rapid change of intensity and is used in the segmentation algorithm for edge detection. Gaussian convolution is used to smooth textures in an image to allow for better partitioning of the image into different regions; a sketch of both filters is given below. An MRI scan, instead of using X-rays, uses strong magnetic and radio-frequency fields to align, and alter the alignment of, hydrogen atoms in the body. The hydrogen atoms then produce a rotating magnetic field that can be detected by the MRI scanner and converted to an image. The main computational component of reconstructing an MRI image is calculating the value of two different vectors, known here as MRI.FH and MRI.Q, respectively (explained in more detail in [13], [24]). Table I shows some characteristics of the benchmarks in consideration. All of these benchmarks are floating-point-intensive and require large amounts of data for the computation they perform, especially when compared to the 0.15 bytes/instruction supported by the GTX 280 GPU mentioned earlier. The main loops in these benchmarks are "do-all" loops: there are no inter-iteration dependences.
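For reference, the two filtering components named above have standard library equivalents; the following sketch (ours, using SciPy, not the paper's accelerated code) shows what CT.gauss and CT.laplace compute on an image.

```python
import numpy as np
from scipy import ndimage

def ct_filters(img, sigma=2.0):
    """Gaussian convolution (texture smoothing) and Laplacian filtering
    (highlighting rapid intensity changes for edge detection)."""
    img = np.asarray(img, dtype=float)
    smoothed = ndimage.gaussian_filter(img, sigma=sigma)   # CT.gauss
    edges = ndimage.laplace(smoothed)                      # CT.laplace
    return smoothed, edges
```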
Prior work in this field has predominantly focused on using commercial products to accelerate medical imaging. For instance, in [11] the authors port large-scale biomedical image analysis applications to multi-core CPUs and GPUs, and compare different implementation strategies with each other. In [21], the authors study image registration and segmentation and accelerate those applications by using CUDA on a GPU. In [24], the authors use both the hardware parallelism and the special function units available on an NVIDIA GPU to dramatically improve the performance of an advanced MRI reconstruction algorithm. There are several other such examples of novel work in this field. In contrast with such research, this work focuses on designing a new, highly efficient microarchitecture and architecture with the specific hardware requirements of medical imaging in consideration.

VIII. PUMA
PUMA, a Parallel Micro-architecture for Medical Applications, is a tiled architecture as shown in Fig 4. It is specifically designed to maximize power efficiency when executing medical imaging applications while still retaining full programmability. Each tile in PUMA is an instance of a specialized PLA - a generalized loop accelerator. The PLA tiles are connected to their neighboring tiles and to the external interface through a high-bandwidth mesh of on-chip routers.

1. Background: Fig 5 shows the hardware schema of the single-function loop accelerator [7], [5]. The LA is designed to efficiently execute a modulo-scheduled loop [19] in hardware. The length of the schedule, and the corresponding run time of the loop, are determined by the initiation interval (II) - the number of cycles between the beginnings of successive iterations of the loop. Thus, a lower II corresponds to a shorter schedule and increased performance. The modulo schedule contains a kernel that repeats every II cycles and may include operations from multiple loop iterations. The LA is designed to exploit the high degree of parallelism available in modulo-scheduled loops with a large number of function units (FUs). Each FU performs a specific set of functions that is tailored to the particular loop. Each FU writes to a dedicated shift register file (SRF); in each cycle, the contents of the registers shift downwards to the next register. Point-to-point wires from the registers to FU inputs allow data transfer from producers directly to consumers. Multiple registers may be connected to each FU input; a multiplexer (MUX) is used to select the appropriate one. Since the operations executing in a modulo-scheduled loop are periodic, the selector for this MUX is essentially a modulo counter. In addition, a central register file (CRF) holds static live-in register values that cannot be stored in the SRFs. The schema described is a template that is customized for the particular loop being accelerated. The number, types, and widths of the FUs, the widths and depths of the SRFs, and the connections from the SRFs to the FUs are all determined from the loop. During synthesis, the loop is first modulo scheduled to meet a given performance requirement, and then the details of the LA datapath are determined from the communication patterns in the scheduled loop. The control path for the single-function LA consists of a finite state machine with II states corresponding to the time slots in the kernel of the modulo schedule. In each state, control signals direct the execution of FUs (for FUs capable of multiple operations) and control the MUXes at the FU inputs. Finally, a Verilog HDL realization of the accelerator is generated by emitting modules with pre-defined behavioral Verilog descriptions that correspond to the datapath elements. A simulation environment is used to verify that the Verilog properly implements the loop. Finally, gate-level synthesis, placement, routing, power analysis and post-synthesis verification are performed on the design.
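To make the role of the II concrete, here is a small sketch (our illustration; the operation and FU counts in the example are hypothetical) of the resource-constrained lower bound on the II that a modulo scheduler must respect. Loops with cross-iteration dependences are additionally bounded by a recurrence term, but the "do-all" loops targeted here have none.

```python
import math

def res_mii(op_counts, fu_counts):
    """Resource-constrained minimum initiation interval: each class of
    FU can start at most fu_counts[r] operations per cycle, so the
    kernel needs at least ceil(ops / units) cycles for every class."""
    return max(math.ceil(op_counts[r] / fu_counts[r]) for r in op_counts)

# hypothetical loop body: 6 FP multiplies, 4 FP adds, 3 loads per iteration
# mapped onto 2 FP multipliers, 2 FP adders and 1 load/store unit
print(res_mii({'fmul': 6, 'fadd': 4, 'ld': 3},
              {'fmul': 2, 'fadd': 2, 'ld': 1}))   # -> 3 cycles between iterations
```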
Fig. 4. PUMA. Each tile comprises a programmable loop accelerator (template pictured) and the control and data memories required for its operation. On-chip routers transfer data between each tile and the external interface.

Fig. 5. Template for the single-function loop accelerator.

2. PUMA Architecture
2.1. Baseline PLA Design: A PLA is a generalized loop accelerator, created by modifying the template datapath shown in Fig 5. A generic datapath template for the PLA is illustrated on the right side of Fig 4. The accelerator is designed for a specific loop at a specific throughput, but contains a more general datapath than the single-function LA to allow different loops to be mapped onto the hardware [8]. These generalizations provide the LA with flexibility in functionality, storage, control and communication. To provide functionality, simple modifications were made to FUs in order for them to support more operations: adders (both integer and floating-point) are generalized to adder/subtracter units, left-shift units are generalized to left/right rotators, every FU can execute an identity operation to act like a move instruction, and so on. Even load-store units can be generalized to integer adder/subtracter units if they already had the functionality to compute indirect addresses. Further, the input MUXes of the FUs are redesigned to allow operand swapping as well. The SRFs used in the LA have limited addressability and fixed lifetimes for variables. To overcome these constraints and provide more generality, the SRFs are replaced with rotating register files (RRs). To improve controllability, the LA's finite state machine is replaced with a control memory, the contents of which can be changed based on the loop that needs to be executed. Further, numerical constants which were hard-coded in the LA are instead stored in a literal register file. To generalize communication, the PLA has a bus (in addition to the point-to-point connections) that connects all the RRs to all the FUs. To reduce the hardware cost of a potentially long bus, its width is limited to one operand, but it has a predictable latency of one cycle.

Fig. 6. ILP formulation for FU arrangement on the PUMA ring.

2.2. PUMA PLA: The PLA bus is not always a viable solution. One main disadvantage of the bus is that it does not scale to larger PLAs with many FUs. Further, the bus only carries a single operand per cycle, limiting the amount of programmability available in the PLA and the sequences of opcodes that can be executed in parallel. To overcome these limitations, the intra-PLA communication fabric in PUMA is changed to a ring. A ring allows as many operands to be transferred as there are connections to FUs. It does have its limitations, however. The assumed single-cycle latency to transfer data between two arbitrary points in the PLA using the bus is no longer valid, as it takes one cycle to transfer an operand from one ring connection (or ring stop) to another. Also, the longer the distance an operand needs to travel on the ring, the more FUs have to execute move instructions to propagate the operand along at each ring stop. These added instructions can potentially increase the loop's schedule length and reduce the accelerator's performance. In PUMA, the ring architecture actually consists of six rings (three sets of two rings going in opposite directions).
The first set of rings has a Bus/FU connector (or ring-stop) at every single FU. The second set of rings has a ring-stop at all the odd-numbered FUs, and the third set of rings has a ring-stop at all the even-numbered FUs. This effectively connects an FU/RF pair to its two neighbors and also to its neighbors' neighbors; i.e., every FU can communicate with itself or with other FUs one or two positions on either side of it on the ring. With this configuration, the number of cycles required to transmit data between any two arbitrary FUs is at most ⌈N/4⌉ for a ring of N FUs, and regardless of the ordering of FUs on the ring, every possible producer-consumer pairing can be executed, provided sufficient time. In order to best maintain generality, we chose to arrange the FUs along the ring to allow maximum connectivity and to distribute the various types of FUs as evenly as possible. This was done by formulating the problem as an integer linear program (ILP), as shown in Fig 6. In the objective function, T_a and T_b are two different sets of FUs, each set having all and only the FUs of a particular type. The subscripts i and j are FU IDs, and C_ij is a binary variable that is 1 if a connection exists between FU i and FU j. Essentially, this objective function aims to maximize the number of connections between different types of FUs, subject to the following constraints:
• In constraint set (1), X_ij is a binary variable that is 1 if FU i is "positioned" adjacent to FU j, implying that they are connected by the ring. Every FU, therefore, is "positioned" next to 5 FUs: itself, its two neighbors and the two additional FUs neighboring its neighbors.
• Constraint sets (2) and (3) specify that every FU is "positioned" next to, and connected to, itself.
• Constraint sets (4) and (5) specify that all added connections are bidirectional.
• In constraint set (6), I_ij is a binary value that is 1 if a connection between FU i and FU j has already been inserted by the synthesis tool. This constraint enforces the rule that a connection between FU i and FU j can only exist if they are either "positioned" next to each other or are already connected.
A 7th set of constraints was initially used which specified that there must always be a path between any two FUs. This constraint was used to prevent insular sets of 5 FUs, or sets of 5 FUs connected linearly rather than in a ring (i.e. without a direct connection between the two ends). While this problem might occur in theory, the pre-existing connections put in place by the synthesis system prevent it from happening in practice, and these constraints were removed to reduce the size of the ILP. Once the optimal solution is obtained, the values of the X_ij variables provide a unique ring arrangement; a sketch of the formulation appears below.
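The constraints above can be transcribed into a few lines of an off-the-shelf ILP modeler. The sketch below is our loose rendering of Fig 6 with the PuLP package, not the authors' exact formulation; variable names follow the text, and the function and its arguments are illustrative.

```python
from itertools import product
import pulp

def arrange_ring(fu_types, preexisting):
    """fu_types: dict mapping FU id -> type name; preexisting: set of
    (i, j) pairs already connected by the synthesis tool (the I_ij of
    constraint (6)). Returns the chosen adjacency pairs X_ij = 1."""
    ids = list(fu_types)
    prob = pulp.LpProblem("puma_ring", pulp.LpMaximize)
    X = pulp.LpVariable.dicts("X", (ids, ids), cat="Binary")  # positioned adjacent
    C = pulp.LpVariable.dicts("C", (ids, ids), cat="Binary")  # connected
    # objective: number of connections between FUs of *different* types
    prob += pulp.lpSum(C[i][j] for i, j in product(ids, ids)
                       if fu_types[i] != fu_types[j])
    for i in ids:
        prob += pulp.lpSum(X[i][j] for j in ids) == 5        # (1) self + 2 + 2
        prob += X[i][i] == 1                                  # (2)
        prob += C[i][i] == 1                                  # (3)
        for j in ids:
            prob += X[i][j] == X[j][i]                        # (4)
            prob += C[i][j] == C[j][i]                        # (5)
            pre = int((i, j) in preexisting or (j, i) in preexisting)
            prob += C[i][j] <= X[i][j] + pre                  # (6)
    prob.solve()
    return [(i, j) for i, j in product(ids, ids)
            if i < j and pulp.value(X[i][j]) == 1]
```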
IX. EXPERIMENTS AND RESULTS
1. Setup: All the PLAs in this work were synthesized for (and run at) a frequency of 450 MHz. The logic synthesis was done using Synopsys Design Compiler 2006-06 and Synopsys Physical Compiler 2006-06, targeting a 65 nm process technology with a nominal supply voltage of 0.9 V. Energy numbers were obtained using Synopsys PrimeTime PX 2006-12. For the purposes of this study, we assume that a peak memory bandwidth of 142 GB/s is available to each PUMA system. This is the same amount of bandwidth afforded to the NVIDIA GTX 280 processor.

2. PLA Characteristics: PUMA systems were built using PLAs for each of the five benchmarks in consideration (five systems, each composed entirely of multiple tiles of a single type of PLA). Table II shows various characteristics of these accelerators. The "Peak Perf." columns show the throughput when executing floating-point operations and integer operations, respectively, in billions of operations per second. The next column shows the minimum bandwidth required by each application to prevent starvation. Finally, the last column shows the total number of tiles of each PLA that would be present in a PUMA system. The number of tiles was chosen to prevent data starvation and to make the most efficient use of the resources available; for example, a system built from MRI.FH tiles has 9 tiles.

Fig. 7. Normalized performance of benchmarks on the LA and the PUMA PLA designed for MRI.FH.
Fig. 8. Normalized efficiency of benchmarks relative to MRI.FH.

Fig 7 shows the normalized performance difference between the non-generalized and generalized loop accelerators across the various benchmarks, to illustrate the effects of the modifications made to the baseline accelerator to increase programmability. Each of the benchmarks was compiled for the MRI.FH accelerator. The left column for each benchmark shows its normalized performance. The benchmarks MRI.Q, CT.laplace and CT.gauss suffered a 50% reduction in performance; i.e., their II values had to be doubled, from 1 to 2, in order for them to execute on the baseline loop accelerator. The benchmark CT.segment could not be compiled for the MRI.FH accelerator at all. For each benchmark, the column on the right shows the achieved performance on the generalized accelerator, with the hardware modifications described in the PUMA architecture section above. As shown, these modifications allowed all the benchmarks to run at full performance, at minimum II. Fig 8 shows a graph similar to that in Fig 7, but with the normalized efficiency in terms of the accelerator's performance-to-power ratio. Due to the increase in overall performance provided by the generalizations, the benchmarks MRI.Q, CT.laplace and CT.gauss show a significant increase in efficiency despite the power overhead of the additions. The MRI.FH benchmark, however, which does not experience any improved performance from the generalizations, loses efficiency due to the increase in the accelerator's power consumption. On average, the generalizations increased the accelerator's efficiency by approximately 40%.

3. Commodity GPGPU Comparison: While other architectures may certainly be used for this domain, GPGPUs are the solutions currently in use in many medical imaging applications and, therefore, the most suitable comparison point. For this reason, we assessed the performance and efficiency of five NVIDIA GPUs.

Fig. 9. Average energy consumed (per iteration) by each benchmark while running on PUMA systems designed around different PLAs.
Fig. 10. Achieved performance of the MRI.FH benchmark (in trillions of operations) on the MRI.FH PUMA system and on various NVIDIA GPUs based on the GT200 architecture.

Fig 9 shows the result of direct performance comparisons between an MRI.FH PUMA system and the GPUs in consideration. The column on the left shows the total compute capability of each of the processors. The column on the right shows the realized performance while executing the MRI.FH benchmark, accounting for bandwidth restrictions.
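The bandwidth adjustment works roofline-style: attainable throughput is the smaller of the peak compute rate and the rate at which the memory system can feed operands. A sketch of ours (the numbers in the example are hypothetical, not measurements from the paper):

```python
def realized_perf(peak_ops_per_s, bytes_per_op, mem_bw_bytes_per_s):
    """Bandwidth-capped throughput: compute cannot run faster than the
    memory system can deliver operands."""
    return min(peak_ops_per_s, mem_bw_bytes_per_s / bytes_per_op)

# e.g. a hypothetical 1 Tops/s device needing 0.5 bytes/op on a 142 GB/s
# interface is held to 284 Gops/s, well below its compute peak
print(realized_perf(1e12, 0.5, 142e9))  # -> 2.84e11
```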
PUMA achieves a very small fraction of the peak performance offered by the GPUs, between 8.6% of the dual-GPU GTX 295 and 21.8% of the GTS 250. This gap changes dramatically, however, when accounting for the bandwidth-intensive nature of the application in question. PUMA delivers between 63% of the performance of the GPUs (on the dual-GPU GTX 295) and 2X their performance (on the GTS 250). The case for PUMA is further underscored by examining the GPUs' power efficiency, as shown in Fig 10. This graph shows how many times more efficient, in terms of the number of operations per Watt, PUMA systems are relative to the GPUs in consideration. These values range from 20X, for the most complex benchmark running on the most efficient GPU, to 54X, for the least complex benchmark running on the least efficient GPU.

X. CONCLUSION
We have seen that the application of the theory of optimal control of PDEs to image processing problems is a fruitful field of research. Besides promising results, even for simple models like the linear heat equation, new and interesting mathematical problems arise, such as the treatment of non-quadratic penalty terms. For future research, better adapted PDEs (like the anisotropic diffusion equations) could be investigated. The PUMA architecture is a power-efficient accelerator system designed specifically for efficient medical image reconstruction. It consists of tiles of programmable loop accelerators - ASICs with added hardware to support general-purpose computing - designed around the computation requirements of the image reconstruction domain. As applications in this domain are normally executed on very high-performance GPGPUs, the latest NVIDIA GPU architecture was used to gauge the performance and efficiency of PUMA. The results are very encouraging: PUMA achieves up to 2 times the performance of a modern GPU architecture and has up to 54 times the power efficiency.

REFERENCES
[1]. K. Bredies, D. A. Lorenz, and P. Maass. Equivalence of a generalized conditional gradient method and the method of surrogate functionals. Preprint of the DFG Priority Program 1114, University of Bremen, 2005.
[2]. A. Cohen. Numerical Analysis of Wavelet Methods. Elsevier Science B.V., 2003.
[3]. I. Daubechies, M. Defrise, and C. De Mol. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics, 57(11):1413-1457, 2004.
[4]. I. Ekeland and R. Temam. Convex Analysis and Variational Problems. North-Holland, Amsterdam, 1976.
[5]. L. Florack and A. Kuijper. The topological structure of scale-space images. Journal of Mathematical Imaging and Vision, 12:65-79, 2000.
[6]. M. Frazier, B. Jawerth, and G. Weiss. Littlewood-Paley Theory and the Study of Function Spaces. Number 79 in Regional Conference Series in Mathematics. American Mathematical Society, 1991.
[7]. P. Heinlein, J. Drexl, and W. Schneider. Integrated wavelets for enhancement of microcalcifications in digital mammography. IEEE Transactions on Medical Imaging, 22(3):402-413, March 2003.
[8]. J.-L. Lions. Optimal Control of Systems Governed by Partial Differential Equations. Springer, 1971.
[9]. Y. Meyer. Wavelets and Operators, volume 37 of Cambridge Studies in Advanced Mathematics. Cambridge University Press, 1992.
[10]. P. Perona and J. Malik. Scale-space and edge detection using anisotropic diffusion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(7):629-639, 1990.
[11]. L. I. Rudin, S. J. Osher, and E. Fatemi. Nonlinear total variation based noise removal algorithms. Physica D, 60:259-268, 1992.
[12]. H. Triebel. Theory of Function Spaces II. Monographs in Mathematics. Birkhäuser, 1992.
[13]. J. Weickert. Anisotropic Diffusion in Image Processing. Teubner, Stuttgart, 1998.
[14]. A. P. Witkin. Scale-space filtering. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 1019-1021, 1983.
[15]. S. Ciricescu et al. The reconfigurable streaming vector processor (RSVP). In Proc. of the 36th Annual International Symposium on Microarchitecture, pages 141-150, 2003.
[16]. CNET. The Gizmo Report: NVIDIA's GeForce GTX 280 GPU - introduction, 2008. http://news.cnet.com/8301-13512 39969234-23.html.
[17]. H. Corporaal. TTAs: Missing the ILP complexity wall. Journal of Systems Architecture, 45(1):949-973, 1999.
[18]. W. Dally et al. Merrimac: Supercomputing with streams. In Proceedings of the 2003 ACM/IEEE Conference on Supercomputing, pages 35-42, 2003.
[19]. G. Dasika, S. Das, K. Fan, S. Mahlke, and D. Bull. DVFS in loop accelerators using BLADES. In Proc. of the 45th Design Automation Conference, pages 894-897, June 2008.
[20]. C. Ebeling et al. Mapping applications to the RaPiD configurable architecture. In Proc. of the 5th IEEE Symposium on Field-Programmable Custom Computing Machines, pages 106-115, Apr. 1997.
[21]. K. Fan et al. Cost sensitive modulo scheduling in a loop accelerator synthesis system. In Proc. of the 38th Annual International Symposium on Microarchitecture, pages 219-230, Nov. 2005.
[22]. K. Fan et al. Modulo scheduling for highly customized datapaths to increase hardware reusability. In Proc. of the 2008 International Symposium on Code Generation and Optimization, pages 124-133, Apr. 2008.
[23]. A. Fisher et al. Custom-fit processors: Letting applications define architectures. In Proc. of the 29th Annual International Symposium on Microarchitecture, pages 324-335, Dec. 1996.
[24]. S. Goldstein et al. PipeRench: A coprocessor for streaming multimedia acceleration. In Proc. of the 26th Annual International Symposium on Computer Architecture, pages 28-39, June 1999.
[25]. T. D. Hartley, U. Catalyurek, A. Ruiz, F. Igual, R. Mayo, and M. Ujaldon. Biomedical image analysis on a cooperative cluster of GPUs and multicores. In Proc. of the 2008 International Conference on Supercomputing, pages 15-25, 2008.
[26]. G. Lu et al. The MorphoSys parallel reconfigurable system. In Proc. of the 5th International Euro-Par Conference, pages 727-734, 1999.
[27]. A. Mahesri et al. Tradeoffs in designing accelerator architectures for visual computing. In Proc. of the 41st Annual International Symposium on Microarchitecture, pages 164-175, Nov. 2008.
[28]. B. Mei et al. Exploiting loop-level parallelism on coarse-grained reconfigurable architectures using modulo scheduling. In Proc. of the 2003 Design, Automation and Test in Europe, pages 296-301, Mar. 2003.
[29]. NVIDIA. CUDA Programming Guide, June 2007. http://developer.download.nvidia.com/compute/cuda.
[30]. NVIDIA. GeForce GTX 280, 2008. http://www.nvidia.com/object/product geforcegtx 280 us.html.
[31]. NVIDIA. NVIDIA Tesla S1070, 2008. http://www.nvidia.com/object/product tesla s1070 us.html.
[32]. NVIDIA. CUDA Zone, 2009. http://www.nvidia.com/object/cuda home.html.
[33]. B. R. Rau. Iterative modulo scheduling: An algorithm for software pipelining loops. In Proc. of the 27th Annual International Symposium on Microarchitecture, pages 63-74, Nov. 1994.
[34]. Technology Review. Cheap, portable MRI, 2006. http://www.technologyreview.com/computing/17499/?a=f.
[35]. A. Ruiz, M. Ujaldon, L. Cooper, and K. Huang. Non-rigid registration for large sets of microscopic images on graphics processors. Springer Journal of Signal Processing, May 2008.
[36]. K. Sankaralingam et al. Exploiting ILP, TLP, and DLP using polymorphism in the TRIPS architecture. In Proc. of the 30th Annual International Symposium on Computer Architecture, pages 422-433, June 2003.
[37]. L. Seiler et al. Larrabee: A many-core x86 architecture for visual computing. ACM Transactions on Graphics, 27(3):1-15, 2008.
[38]. S. S. Stone et al. Accelerating advanced MRI reconstructions on GPUs. In 2008 Symposium on Computing Frontiers, pages 261-272, 2008.
[39]. M. Taylor et al. Evaluation of the Raw microprocessor: An exposed wire-delay architecture for ILP and streams. In Proc. of the 31st Annual International Symposium on Computer Architecture, pages 2-13, June 2004.