# Manycore and Accelerator-based High-performance Scientific Computing

# Abstracts

**UC Berkeley/LBNL, Berkeley CA, USA**

**Tutorials** January 24 – January 25, 2011

**Workshop** January 26 – January 28, 2011

**Copperhead**

Bryan Catanzaro

Data parallelism is abundant in real applications, scalable, and useful. However, efficiently exploiting data parallelism can be complicated, and often requires programmers to express computations at a low level. Copperhead is a data-parallel Python runtime, which aims to make data parallel programming more productive. Using standard Python constructs like map and reduce, we embed data parallel computations in standard Python programs, capable of interoperating with numerical and visualization libraries such as NumPy, SciPy and Matplotlib. The Copperhead runtime efficiently compiles data parallel computations to highly parallel processors. We achieve between 45-100% of handcoded, optimized CUDA performance using our Nvidia GPU backend.

### Quantum chemistry many-body methods on GPUs and multicore CPUs

Jeff R. Hammond [1] and A. Eugene DePrince III [2]

[1] Leadership Computing Facility (jhammond@anl.gov)

[2] Center for Nanoscale Materials (adeprince@anl.gov)

Quantum chemistry many-body methods (QMBM) solve the Schroedinger approximately, but at relatively high accuracy as compared to molecular dynamics (MD) and density-functional theory (DFT). While both MD and DFT have been implemented on GPGPUs by a number of groups, very little work has been done with QMBM on GPGPUs despite the excellent match between their regular and floating-point intensive algorithmic structure and the data-parallel capability of GPGPUs. In this talk we discuss the implementation of one variant of QMBM – the coupled-cluster method - on NVIDIA Tesla and Fermi processors. The performance is compared to an identically structured multicore CPU implementation as well as many existing software packages (NWChem, GAMESS, Molpro, PSI, Dalton and Aces II). The performance of our prototype implementation is 7 to 10 times faster than the fastest available CPU implementation (Molpro) and our mixed-precision algorithm achieves over 500 GF while converging to the same double precision accuracy achievable with a CPU code. Finally, we will compare our single-node CPU/GPU implementation to the massively-parallel implementation in NWChem, highlighting the challenges and opportunities for realizing petaflop QMBM using large GPU clusters.

### MultiFit and Shear Measurement for LSST

Jim Bosch

jbosch@physics.ucdavis.edu

UC Davis / LSST

One of the principal ways the Large Synoptic Survey Telescope will constrain cosmology is through weak gravitational lensing, an analysis of the spatial correlation of the ellipticities of billions of galaxies. The systematic errors on individual ellipticity measurements must be controlled to unprecedented levels, a requirement that suggests a galaxy modeling framework in which a model for each galaxy is simultaneously fit to data from hundreds of exposures. This "multifit" procedure could very easily dominate LSST's data-release computational budget, and it is a natural candidate for a GPU-based implementation, albeit with a data flow that is significantly different from most astronomical image processing algorithms.

### Liszt: A Domain Specific Language for solving PDEs on unstructured meshes

Z. DeVito, N. Joubert, M. Medina, M. Barrientos, F. Ham, E. Darve, J. Alonso, P. Hanrahan

Many physical simulations, such as computational fluid dynamics, involve solving partial differential equations over a discrete space. Programs written for this domain are traditionally either written in a high level language such as MATLAB for productivity, or in a low-level language targeting a particular platform such as C with MPI, or CUDA, for performance. In the era of power limited computing, heterogeneous hardware provides the potential for continued gains in performance, but will require new programming models. Rather than re-architect low-level code for every new programming model, we proposes a language - Liszt - that raises the level of abstraction for mesh-based PDEs. By restricting the language to a specific domain - mesh-based computation - we enable the generation of performant code using domain-level knowledge embedded in our compiler. This paper will demonstrate how a user of the Liszt language can write a single program that achieves performance on both a distributed memory cluster and a modern GPU.

### Using GPUs for signal correlation in Radio Astronomy

Mike Clark

mikec@seas.harvard.edu

We present a software correlator implemented using graphics processing units suitable for the cross correlation of signals arising from radio astronomy. We demonstrate that the computational dominant part of the cross-correlation algorithm, the {\it X-engine}, is ideal for deployment upon graphics processing units since it involves \(O(n^2)\) computation, but requires only \(O(n)\) memory traffic. With the appropriate tiling strategy we achieve up to 78\% of peak performance of Nvidia's Fermi architecture, which corresponds to an excess of one sustained teraflop per second. The flexibility, short development time and low cost of the GPU correlator has the potential to revolutionize future instrument away from traditional fixed-function hardware implementations.

### The Development of Mellanox - NVIDIA GPUDirect over InfiniBand – a New Model for GPU to GPU Communications

Gilad Shainer

Shainer@mellanox.com

The usage and adoption of General Purpose GPUs (GPGPU) in HPC systems is increasing due to the unparallel performance advantage of the GPUs and the ability to fulfill the ever-increasing demands for floating points operations. While the GPU can offload many of the application parallel computations, the system architecture of a GPU-CPU-InfiniBand server does require the CPU to initiate and manage memory transfers between remote GPUs via the high speed InfiniBand network. In this session we introduce a new innovative technology - GPUDirect that enables Tesla GPUs to transfer data via InfiniBand without the involvement of the CPU or buffer copies, hence dramatically reducing the GPU communication time and increasing overall system performance and efficiency. We also explore the performance benefits of GPUDirect using Amber, a molecular dynamics software package and LAMMP, a Large-scale Atomic/Molecular Massively Parallel Simulator software.

### Derivative Discretization on GPUs

Paulius Micikevicius

pauliusm@nvidia.com

Derivative discretization is fundamental to finite difference computations in time domain and is amenable to GPU acceleration. The goal of this paper is to provide scientists with a deterministic framework for choosing the best GPU implementation for their application. We review derivative computation approaches for 1D, 2D, and 3D data and a range of orders in space. In addition to partial derivatives along one dimension, we also consider mixed second derivatives for 2D and 3D data. Performance bottlenecks are analyzed for both single- and multiple-pass implementations, thus allowing one to select the right approach based on operator size.

### Accelerated Monte Carlo Simulation of the Xray Radiography : Geant 4 revisited for GPU

Alain Bonissent

bonissen@cppm.in2p3.fr

Computer simulation of the interaction of X ray photons with live material is widely used by the medical imaging community. It permits the preparation of well controlled data sets which can be used for the development of image processing algorithms, setting the requirements on detector performance, estimate of the radiation dose received by the patient etc. Technology evolution towards higher resolution and higher detector size makes these simulations more and more computer intensive.

The High Energy Physics community has developed over many years the general purpose software GEANT (Generation of Events and Tracks), which implements all known physical phenomena which can happen when a particle travels throughmatter, including the complex geometry which characterizes the HEP detectors. Therefore it was natural to use the GEANT framework for the simulation of the radiography process, which is the current and most popular approach, together with the use of EGS, an equivalent code specific of electromagnetic interactions.

In order to improve the execution speed, we investigated how the GEANT code could take advantage of the GPU type hardware. I will describe why and how the structure of the code had to be reconsidered while preserving the scientific content, namely the cross sections, so that the physics output of the simulations could remain unchanged.

### NPGPU: Network Processing on Graphics Processing Units

Yangdong Steve Deng

dengyd@tsinghua.edu.cn

Throughput and programmability have always been two conflicting design requirements for modern networking devices. Now network processing units (NPUs) are widely used to achieve programmable network processing at a high throughput. However, the lack of a widely accepted programming model and diversification of NPU architectures make it extremely difficult to develop NPU programs in a timely manner. Modern GPUs are offering significant computing power, and its data-parallel computing model well matches the typical patterns of Internet packet processing. Accordingly, in this research we investigated the potential of GPUs for network processing. Two families of applications, IP routing and deep packet inspection, were implemented are GPUs. Besides identifying GPU oriented algorithmic structures and optimizations, a meta-programming technique was also introduced to expedite the handling of large rule sets that are common in today’s network applications. Experimental results proved that GPU could accelerate typical packet processing applications by one order of magnitude. On the other hand, the communication bottleneck between CPU and GPU as well as the lack of quality-of-service control pose new challenges for GPUs to deliver a higher level of performance. Accordingly, we introduced microarchitectural enhancements to GPUs to make them a better network processing engine.

### Multithreaded Graph Coloring Algorithms for Scientific Computing on Many-core Architectures

Assefaw Gebremedhin

agebreme@purdue.edu

Assefaw Gebremedhin * Md. Mostofa Ali Patwary † Alex Pothen ‡ Extended Abstract

*Department of Computer Science, Purdue University. agebreme@purdue.edu

†Department of Informatics, University of Bergen, Norway. This work was done in part while this author was visiting Purdue University the Fall of 2010. Mostofa.Patwary@ii.uib.no

‡Department of Computer Science, Purdue University. apothen@purdue.edu

We present a collection of new, efficient and scalable multithreaded algorithms, suitable for many-core architectures, for distance-k (k = 1,2) graph coloring problems. Distance-2 coloring is an archetypal model used in the efficient computation of sparse Jacobian and Hessian matrices using automatic differentiation [3]. Distance-1 coloring and related abstractions are used, among others, in parallel scientific computing to identify subtasks that can be carried out or data elements that can be updated concurrently. The coloring algorithms we have developed are implemented and their high-performance demonstrated on various multicore platforms, including an 8-core Intel Nehalem, a 16-core Sun Niagara 2 and a 128-processor Cray XMT system. These platforms cover a broad spectrum of multithreading capabilities and memory structure: the Intel Nehalem relies primarily on a cache-based, hierarchical memory system as a means for hiding latencies in data access and supports only two hyper-threads per processor, the Sun Niagara 2 features a moderate number of hardware-threads (8 per processor) along with a hierarchical memory system to achieve high performance, while the Cray XMT has a flat, cache-less memory system and utilizes massive multithreading (128 threads per processor) as the sole mechanism for tolerating latencies in data access. We attain high-performance on the various platforms by paying careful attention to and taking advantage of the programming abstractions and hardware features the machines provide.

The distance-k graph coloring problem is known to be NP-hard to solve optimally for every fixed integer k ≥ 1 [3]. We rely on greedy algorithms that provide approximate solutions fast (linear time in the case of distance-1 coloring). When combined with good vertex ordering techniques, the greedy algorithms yield solutions that are extremely close to optimal for graphs that arise in practice, despite the pessimistic theoretical inapproximability results that are known for coloring problems [4]. We utilize in this work several effective degree-based vertex ordering techniques, including Smallest Last, Incidence Degree and Dynamic Largest First. And we parallelize not only the encompassing greedy coloring framework but also the embedded vertex ordering stage.

The greedy coloring algorithms are inherently sequential in nature and thus challenging to parallelize. We overcome this challenge by using speculation and iteration, where concurrency is maximized by tentatively allowing inconsistencies and resolving them later (iteratively), as our primary ingredients. The speculation-and-iteration paradigm was successfully used in the design of parallel distance-1 and distance-2 coloring algorithms in the context of distributed-memory ar- chitectures [1, 2]. The algorithms presented here differ in their specifics since the performance

requirements and the available hardware features on multithreaded architectures are significantly different from those on distributed-memory, message-passing based architectures. The vertex or- dering techniques used within the greedy coloring algorithms are in turn inherently sequential in nature. We circumvent this difficulty by relaxing the requirements in the orderings and using ap- proximate variants that are more amenable for parallelization. Additional elements we incorporate to enhance concurrency include randomization and various conflict-reducing color choice strategies in each step of the greedy coloring algorithms.

We assess the scalability and performance of the algorithms using a set of synthetically generated massive graphs carefully designed to include instances that test-stress the algorithms. We also use in our experiments real-world graphs drawn from various application areas. We show that the algorithms scale well on the various platforms, nearly ideally for certain classes of graphs, with increasing relative performance on platforms with greater thread concurrency. Further, the number of colors used by the parallel algorithms is fairly close to what the sequential algorithms output. In turn, the solutions the sequential algorithms provide is only a small factor of the optimal, as can be shown by computing appropriate lower bounds. Hence there is negligible loss in quality of solution due to parallelization.

The techniques employed in and the lessons learnt from this work will likely be helpful for the design of high-performance algorithms suitable for many-core architectures for many other graph problems as well.

References

[1] D. Bozda ̆g, U.V. Catalyurek, A. H. Gebremedhin, F. Manne, E. G. Boman, and F. Ozgun- ner. Distributed-memory parallel algorithms for distance-2 coloring and related problems in derivative computation. SIAM J. Sci.Comput., 32(4):2418–2446, 2010.

[2] D. Bozda ̆g, A. H. Gebremedhin, F. Manne, E. G. Boman, and U. V. Catalyurek. A framework for scalable greedy coloring on distributed-memory parallel computers. Journal of Parallel and Distributed Computing, 68(4):515–535, 2008.

[3] A.H. Gebremedhin, F. Manne, and A. Pothen. What color is your Jacobian? Graph coloring for computing derivatives. SIAM Review, 47(4):629–705, 2005.

[4] A.H. Gebremedhin, D. Nguyen, M.M.A Patwary, and A. Pothen. ColPack: Graph coloring software for derivative computation and beyond. Submitted to ACM Trans. on Math. Softw., 2010.

### Lattice QCD on GPU Clusters, using the QUDA Library and the Chroma software system

Michael A. Clark, Balint Joo (speaker)

bjoo@jlab.org

The QUDA library for lattice quantum chromodynamics, combined with a high level application framework such as the Chroma software system provides a powerful tool for computing quark propagators, which is a key step in current calculations of hadron spectroscopy, nuclear structure and nuclear forces. In this contribution we will discuss our experiences, including performance and strong scaling of the QUDA library and Chroma on a variety of GPU clusters, such as the NERSC Dirac Cluster, the LLNL Edge Cluster and on the various Jefferson Lab clusters. We'll highlight some scientific successes and consider future directions for GPUs in lattice QCD calculations.

### Using Automated Code Generation to Support High Performance Extended MHD Integration in OpenGGCM

Kai Germaschewski and Joachim Raeder

kai.germaschewski@unh.edu

Space Science Center and Department of Physics University of New Hampshire, Durham, NH 03824

Automatic code generation is a technique that takes the specification of an algorithm at a high abstraction level and turns it into a well-tuned computer code. For finite-volume / finite-difference based discretizations, this higher abstraction level can be a stencil computation. At the backend, the code generator features modules which generate optimal code for specific hardware architectures, for example conventional ar- chitectures (x86) using SIMD instructions (e.g. SSE2), or heterogeneous architectures like the Cell processor or GPGPUs. The definition of the computation is agnostic to the actual hardware used, as a high-performance implementation tailored to the specific ar- chitecture will be generated automatically. While a conventionally written stencil-based code will typically only achieve a few percent of peak performance, the automatically generated version improves performance even on conventional hardware, and provide very significant gains on heterogeneous computing architectures.

The OpenGGCM code, a global magnetosphere model, has been converted to use an automatically generated implementation of its magnetohydrodynamics (MHD) integrator. We present performance analysis results for the new version of OpenGGCM and describe the recent extension of the MHD model to incorporate Hall physics.

### GPU Application to Real-time Astronomical Image Pipelines

Steven Hartung

steve@integratedwork.com

James Cook University, Centre for Astronomy 30 minute talk.

Observational astronomy has entered the era of gigapixel cameras and sub-minute exposure times where each image requires TFLOP scale computation. The GPU offers a scalable approach to keep pace with the increasing image size and rate of data generation. The burgeoning field of temporal astronomy relies heavily on computationally intensive image processing to extract results. A common analytic tool in this field is image differencing, the pixel by pixel subtraction of images taken at different times. While simple in concept, in practice, the vagaries of atmospheric and instrument optics make image subtraction a computationally dense operation. Amdahl’s Law places restrictions on the amount of performance gain that is possible in any parallel processing, but it has nothing to say about the scalability of the algorithm in question to the size of the problem. Restating Amdahl’s Law as a time of execution rather than a speed-up multiplier provides a potential tool for characterizing the scalability of an algorithm to increases in problem size. Observational astronomy is facing just such an increase in problem size. Image differencing code adapted for GPU has been examined for execution time as a function of image size. Indications are that a many-core approach, such as the low-cost GPU, can be applied so that processing times are controlled and do not grow at the rate that data acquisition is growing.

### Support Operator Rupture Dynamics on GPU

Dave Yuen

daveyuen@gmail.com

Tingxing Dong^{1, 2, 4}, Yichen Zhou^{1}, Dr. David A. Yuen^{1, 3}, Shenyin Song^{2}

^{1} Minnesota Supercomputing Institute, University of Minnesota

^{2} Supercomputing Center, Chinese Academy of Sciences (SCCAS)

^{3} Department of Geology and Geophysics, University of Minnesota

^{4} Innovative Computing Laboratory, University of Tennessee

SORD (Support Operator Rupture Dynamics) is an open-source software based on a fourth-order finite-difference method which can simulate 3D elastic wave propagation and spontaneous rupture on hexahedral mesh. It can be used for various kinds of surface boundary conditions, including free surface. The original software is developed by Geoffrey Ely from USC and modified by us for acceleration on GPU with NVIDIA CUDA. We applied it on GPU to achieve to accelerate its computing speed.

Our motivation on accelerating SORD on GPU is inspired by new generation GPU’s superior ability on general purpose computing and NVIDIA CUDA’s user-friendly developing environment for academic users. After translating the code from Fortran 95 to CUDA and implementing the transformed CUDA SORD code on the Nvidia Tesla C1060, we obtained a factor of 6 speedup as compared to the original Fortran 95 version code, which was run on Intel Xeon X5570 2.9GHz. Our 3D wave solutions show explicitly visually in 3D format the different propagating wave fronts associated with the P and S waves according to the appropriate elastic parameter ratios. Because of the limitation of the global memory of NVIDIA Tesla C1060, too many more grid points would require the use of main memory thus slow down the calculation. However, by using the new NVIDIA Tesla C2070, which has 6 GBytes global memory, we can increase the simulation data size into 350X350X350.

### Large Scale Computations in Electromagnetic Geophysics

Gregory Newman, Michael Commer, Fillipe Maia

Large-scale three-dimensional (3D) modeling and imaging is receiving considerable attention in the interpretation of controlled source electromagnetic (CSEM) and magnetotelluric (MT) data in offshore hydrocarbon exploration and geothermal exploration in complex volcanic terrains. The need for 3D modeling and imaging is necessary as the search for energy resources now increasingly occurs in highly complex situations where hydrocarbon effects and geothermal fluids are subtle aspects of their particular geological environment. Further complicating matters is the realization that electrical anisotropy also needs to be incorporated directly into the imaging process. Failure to properly treat anisotropy can produce misleading and sometimes un-interpretable results when broadside/wide azimuth CSEM data is included. Merely excluding broadside data detecting antennas is frequently an issue when 3D coverage is desired. In this workshop we discuss a 3D modeling and imaging approaches that treats both CSEM and MT data jointly as well as transverse anisotropy, which appears to be relevant for many practical exploration scenarios. We discuss effective strategies for large scale modeling and imaging using multiple levels of parallelization on distributed machines, and new developments in porting our 3D modeling and imaging algorithms to GPU computational platforms.

### Mantle Convection on GPU Using the Rayleigh-Benard Paradigm

David Sanchez

sanc0174@umn.edu

Gregory A. Barnett, University of Colorado Denver, Department of Applied Mathematics

Amidst sophisticated mantle convection models that take a wide range of physical and rheological parameters into consideration, Rayleigh-Benard convection may be thought of as too simple to be competitive. However, by virtue of this simplicity the Rayleigh-Benard paradigm enjoys two particular advantages. Firstly, for models that can be thought of as modifications to Rayleigh-Benard convection, it can provide a significant insight by showing which features are due to the underlying fluid mechanics and which are due to geophysical refinements. Secondly, its solution by way of finite- difference method is excellently suited to implementation on manycore, SIMD architectures, such as GPU. Because the computational landscape is changing so rapidly, thanks to the introduction of GPU and manycore computing, these advantages make the Rayleigh-Benard model noteworthy both as a computational and as a geophysical benchmark.

Accordingly, we implement two- and three-dimensional Rayleigh-Benard thermal convection solved using a 2nd-order finite-difference scheme on GPU using CUDA for C. As we show, the 2D implementation enables us to achieve performance on the order of 535 GFLOP/s on a single Tesla C2070. Additionally, since the 3D implementation can be given a non-overlapping domain decomposition whose subdomains strongly represent the numerical characteristics of the 2D problem, the scaling features of the 2D code on a single GPU are quite encouraging as to the scaling of the 3D code along several GPUs—such as NSF’s Keeneland cluster, on which we would be capable of modeling 3000x3000x1500 finite-difference grid points.

### Automatically Generating High Performance Source Code for a Many-Core CPU or Accelerator from a Simplified Input Code Expression

Paul Woodward

paul@lcse.umn.edu

Laboratory for Computational Science & Engineering University of Minnesota

To achieve exascale performance on future systems will require an overall 1000X performance increase. We believe that a significant fraction of this increase can come from better utilization of each CPU chip, with its many cores. For the last three years we have been working with Cell-processor accelerated systems, especially the Los Alamos Roadrunner machine (see recent NECDC 2010 proceedings [1]). From this experience we have distilled key elements of code design and numerical algorithm expression that (1) make efficient use of SIMD engines at each processor or accelerator core, (2) use on- chip data stores to dramatically reduce the requirement for main memory bandwidth to the chip, and (3) address the message passing complexities of systems with thousands of accelerators interconnected through host CPUs. We are now engaged in building tools that can enable others to restructure their codes in order to achieve these three goals. We are now focusing primarily on the first two goals, and are writing code translation tools for this purpose. In this talk, we will review the techniques we have found effective on Roadrunner, explain why these same techniques deliver similar performance gains on standard systems, such as the NSF’s Kraken machine at ORNL, and describe the tools now under construction. We will illustrate the use of our tools via simplified examples that nevertheless capture some of the primary challenges to high performance on many-core systems.

### Multithreaded Algorithms for Approximation and Exact Matching in Graphs

Halappanavar, Mahantesh M

Mahantesh.Halappanavar@pnl.gov

Ariful Azad1 , Florin Dobrian 2, John Feo3 , Mahantesh Halappanavar3,† , and Alex Pothen1 ,

We describe multithreaded algorithms for computing matchings in graphs. The graph matching problem has several key applications in scientific computing such as matrix permutation and scaling for linear solvers, computing block triangular forms, computing bases for column and null spaces, coarsening for multilevel graph partitioners, etc. Other applications of interest include bioinformatics, networking, image analysis and web applications.

The graph matching problem presents significant challenges for parallelization because of low amounts of concurrency at coarser levels that lead to load balancing problems, and a low ratio between computation and data accesses leading to inefficient use of cache hierarchy. On a distributed memory platform, communication is characterized by a large number of small messages that are irregular and therefore difficult to aggregate. Communication dominates the run time leading to poor utilization of resources. All the above problems are exacerbated for the emerging informatics applications such as social network analysis where data is large and sparse, has power law degree distribution, and demonstrates small world phenomenon of short paths between any two vertices.

In order to address the challenges of data distribution and to exploit fine-grained parallelism we target commodity multicore architectures as well as the massively multithreaded Cray XMT.

For the matching algorithms, our main candidate is a half approximation algorithm for computing maximum weighted matchings in general graphs that uses the idea of matching locally dominant edges. Locally dominant edges are those edges that are heavier than the their neighboring edges, and need to process edges in a global order can be avoided by using them. We provide two implementations. The first implementation is a natural extension of the sequential algorithm and is targeted for multicore architectures using OpenMP. The second implementation modifies the computation using dataflow principles and is exclusively targeted for the XMT.

While both parallel algorithms have the same sequential algorithm as the starting point, they differ in the level of synchronization. The OpenMP version synchronizes the threads through a global queue, while the dataflow version synchronizes them more efficiently through the memory tag-bits available on the XMT. Using the massive number of threads on the Cray XMT we demonstrate scalable performance for over 12, 000 threads on massive graphs with over a billion edges.

As part of ongoing work, we also consider exact algorithms for computing maximum cardinality matchings. Both single and multiple path (Hopcroft-Karp) approaches are considered. The challenges in implementing exact algorithms arise due to the dynamic nature of the amount of exploitable parallelism. The graph structure plays a critical role in this dynamics. We propose a disjoint forest approach for the Hopcroft-Karp algorithm as an efficient method for parallelization. We also explore the use of futures on the XMT in order to parallelize the depth-first kind of searches in the two algorithms.

To conclude, we use graph matching as a test case in an investigation along three dimensions: • Algorithms: using dataflow principles we exploit the extended memory semantics of the Cray XMT and demonstrate the need for algorithmic innovation in conjunction with the hardware platform. •

Architecture: we use an 8 core (16 threads) Intel Nehalem, a 16 core (128 threads) Sun Niagara-2, and a 128 processor (16, 384 threads) Cray XMT to demonstrate the differences between architectures, particularly the efficiency of the memory hierarchy versus hardware multithreading in tolerating latency. • Graph classes: Using R-MAT algorithms for graph generation we experiment with different classes of graphs from classical Erdo ̋s-Re ́nyi random graphs to graphs with power law distribution. The known characteristics of these graphs (degree distribution, average path-length and clustering coefficient) are used to answer various

questions on the performance of algorithms.

aazad@purdue.edu, dobrian@cs.odu.edu, john.feo@pnl.gov, mahantesh.halappanavar@pnl.gov and apothen@purdue.edu

1 Purdue University.

2 Conviva Inc. 3 Pacific Northwest National Laboratory. † Corresponding author.

### The basis and perspectives of an exascale algorithm: our ExaFMM project

Lorena Barba

labarba@bu.edu

Linearly scaling algorithms will be crucial for the problem sizes that will be tackled in capability exascale systems. It is interesting to note that many of the most successful algorithms are hierarchical in nature, such as multi-grid methods and fast multipole methods (FMM). We have been leading development efforts for open-source FMM software for some time, and recently produced GPU implementations of the various computational kernels involved in the FMM algorithm. Most recently, we have produced a multi-GPU code, and performed scalability studies showing high parallel efficiency in strong scaling. These results have pointed to several features of the FMM that make it a particularly favorable algorithm for the emerging heterogeneous, many-core architectural landscape. We propose that the FMM algorithm offers exceptional opportunities to enable exascale applications. Among its exascale-suitable features are: (i) it has intrinsic geometric locality, and access patterns are made local via particle indexing techniques; (ii) we can achieve temporal locality via an efficient queuing of GPU tasks before execution, and at a fine level by means of memory coalescing based on the natural index-sorting techniques; (iii) global data communication and synchronization, often a significant impediment to scalability, is a soft barrier for the FMM, where the most time-consuming kernels are, respectively, purely local (particle-to-particle interactions) and "hierarchically synchronized" (multipole-to-local interactions, which happen simultaneously at every level of the tree). In addition, we suggest a strategy for achieving the best algorithmic performance, based on two key ideas: (i) hybridize the FMM with treecode by choosing on-the-fly between particle-particle, particle-box, and box-box interactions, according to a work estimate; (ii) apply a dynamic error-control technique, effected on the treecode by means of a variable "box-opening angle" and on the FMM by means of a variable order of the multipole expansion. We have carried out preliminary implementation of these ideas/techniques, achieving a 14x speed-up with respect to our current published version of the FMM. Considering that this effort was only exploratory, we are certain to possess the potential for unprecedented performance with these algorithms.

### Adaptive Real Time Imaging for Radio Astronomy

Melvyn Wright

melvyn.wright@gmail.com

Data processing poses significant problems for arrays of 100’s of anten-nas. Aperture synthesis imaging in radio astronomy has been developed with arrays of 10 to 30 antennas. In the current data processing paradigm, digital signal processing is performed in on-line, custom designed DSP hardware. Data are transferred to off-line computers for calibrations and imaging using data reduction packages. There is a severe mismatch between the data rates in the on-line DSP off-line processing which can typically handle only a few percent of the data rates large telescopes are capable of producing.

This can be resolved by integrating the calibration and imaging into the data acquisition process. Adaptive Real Time Imaging (ARTI) can simultaneously image multiple regions within the field of view by integrating the DSP with the off-line data processing. Calibration in close to real time uses a model of the sky brightness distribution. Derived calibration parameters are fed back into the DSP. New data update and improve the a-priori model, which becomes the final calibrated image by the time the observations are complete.

ARTI facilitates integrating the calibration and imaging into the data acquisition process by using standard 10 Gb packetized data and switches. Data from beamformers and correlators can be routed to CPU clusters to calibrate the data in close to real time. The original data can be stored and re-used at high bandwidths until the best possible calibration has been obtained.

There are several additional reasons for pursuing this approach. At decimeter wavelengths there are problems with non-coplanar array geom- etry and non-isoplanicity of the atmosphere. The whole field of view must be imaged in order to remove the sidelobes of all the sources within primary beam pattern of the antenna. The off-line data processing needed to decon- volve the sidelobes of sources in a large field using the current algorithms is prohibitively expensive.

The delayed calibration and analysis of the data are currently limiting the science which can be done. Variable sources, targets of opportunity, and RFI are more easily handled as the data are being acquired.

The next generation of aperture synthesis telescopes should implement cal- ibration and imaging in close to real time in order to reduce the burden of expert data reduction on the end user, and to make best use of both tele- scope and human resources. Telescopes such as ALMA, LOFAR, MWA and SKA are powerful instruments, and will only achieve full potential if they can easily be used by non radio astronomers. These instruments should produce final, calibrated images.There are no major technical obstacles to this approach, which is facili- tated by the high data bandwidth between FPGAs, GPUs and CPUs which implement the highly parallel data processing. The current generation of instruments built using CASPER boards use this approach to record data using 10 Gb links.

### Simulating Electromagnetic Cavities on Next-Generation Computing Architectures in VORPAL

Paul Mullowney, Peter Messmer, David Smithe, Travis Austin, John R. Cary

Tech-X Corporation, Boulder, CO 80302

paulm@txcorp.com

Effective design of radio-frequency (RF) cavities is fundamental to meeting the physics performance requirements in particle accelerators. Amongst those desired quantities is the accurate computation of cavity modes that transfer power to electron bunches traversing the cavity. Instrumental to the cavity design are complex geometric considerations such as the placement of dampers, couplers, and of course, the shape of the device. Determining the most efficient modes results is a parameter optimization problem where each point in the high-dimensional space corresponds to an expensive cavity simulation. Fortunately, these simulations can be efficiently performed on modern Graphics Processing Units (GPUs), which offer exceptional floating point power at a modest operational cost. Indeed, these devices are particularly well-suited to memory bandwidth limited, finite-difference-time-domain (FDTD) computations useful for simulating RF cavities. In this poster, we present results of a parallel GPU implementation of the VORPAL Dey-Mittra algorithm for simulating complex electromagnetic cavities. In particular, we'll demonstrate strong scaling results for simulations run on the NERSC Dirac GPU cluster. Improvements to the implementation for full 3D domain decomposition plus future plans are also discussed.

### Redesigning Combustion Modeling Algorithms for the Graphics Processing Unit (GPU): Chemical Kinetic Rate Evaluation and Ordinary Differential Equation Integration

Yu Shi^{1*}, William H. Green Jr.^{1}, Hsi-Wu Wong^{2}, Oluwayemisi O. Oluwole^{2*}

- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02149, USA

Aerodyne Research, Inc., Billerica, MA 01821, USA

*Corresponding authors: yushi@mit.edu, oluwoleo@aerodyne.com

Detailed modeling of complex combustion kinetics remains challenging and often intractable, due to prohibitive computational costs incurred when solving the associated large kinetic mechanisms. Here, we present methods for exploiting the highly parallel structure of GPUs for combustion modeling. This poster outlines simple algorithm revisions that can be applied to the majority of existing combustion modeling algorithms for GPU computations. Significant simulation acceleration and predictive capability enhancements were obtained by using these GPU-enhanced algorithms for reaction rate evaluation and in ODE integration. The simulations using the revised algorithms are more than an order of magnitude faster than the corresponding CPU-only simulations, even for a low-end (double-precision) graphics card. An analysis of the growth rates of combustion mechanism sizes versus computational capabilities of CPUs and GPUs further reveals the important role that GPUs are expected to play in the future of combustion modeling. We also demonstrate that the GPU-assisted chemistry solver, when coupled with other existing acceleration methods, enables the incorporation of even larger chemical kinetic mechanisms for multidimensional modeling of the emerging Homogeneous Charge Compression Ignition (HCCI) engine technology, which was previously impractical due to the computational expense.

### PDE-based Analysis of Void Space of Porous Materials on Multicore CPUs

Richard L. Martin, Maciej Haranczyk, Prabhat, James A. Sethian

Computational Research Division, Lawrence Berkeley National Laboratory One Cyclotron Road, Mail Stop 50F-1650,

Berkeley, CA 94720-8139

mharanczyk@lbl.gov

Porous materials such as zeolites and metal organic frameworks have been of growing impor- tance as materials for energy-related applications such as CO2 capture, hydrogen storage, and as catalysts for oil refinement. Databases of possible structures are being developed, and so there is a requirement for methods to screen such databases to discover materials with certain properties important to these applications. However, without automated tools, for example for the analysis of void space within the materials, researchers are restricted to visual analysis and simulation of individual structures.

Hence we have developed a partial differential equation (PDE) based tool for the automated analysis of void space, which utilizes the Fast Marching Method (FMM). Our tool can predict whether a guest molecule can traverse a structure, and determine the size and shape of accessible and inaccessible regions and calculate accessible volumes and surfaces. Unlike other tools of this kind, we can model the guest molecule not as a single sphere but as a probe built from multiple spheres (atoms, separated by some bond length). Then, by studying the available movement of this probe inside a chemical system we incorporate not only three translational degrees of freedom as in the case of a single spherical probe, but also up to three rotational degrees of freedom.

Our tool is intended to be the most complex application of FMM to the problem of robotic navigation, as at present we consider up to a six-dimensional marching space. The most time-consuming step is determining whether the guest molecule can access each position of discretized space in each orientation. We have developed a parallel implementation of this step, and demonstrate that our software achieves nearly linear scaling on multi-core CPUs and is expected to achieve similar scaling on GPUs.

**NVIDIA Experiences with Porting Large-Scale Engineering Codes to GPUs**

Stan Posey, NVIDIA, Santa Clara, CA, USA

sposey@nvidia.com

This presentation will examine the development and performance results of large-scale engineering application software for the Tesla Fermi GPU architecture. Examples will include GPU implementations of computational structural mechanics and fluid dynamics software that support the mechanical product design industry such as ANSYS, Abaqus, and LS-DYNA. The critical requirements of memory optimization and storage formats for highly irregular matrices will be discussed for grid-based iterative solvers and multi-frontal direct solvers. For each example, the choice of CPU-GPU system software and hardware configurations will be described.

**Harnessing heterogeneous systems for N-body problems**

Felipe A. Cruz

Nagasaki Advanced Computing Center, Nagasaki University (Japan)

fcruz@nacc.nagasaki-u.ac.jp

The N-body problem is the problem of predicting the motion of a group of objects that interact with each other over long distances. A classic example of such problem is the motion of celestial objects due to the gravitational forces. However, N-body systems are also of great importance to a variety of fields, such as, fluid dynamics, electrostatics, acoustics, and many others. One important tool to study N-body systems are the N-body simulations.

Direct N-body simulations are characterized by its computational complexity of $O(N^2)$ for systems with N particles. Due to this, direct N-body simulations with millions of particles would take unreasonable amounts of time. Alternatively, fast summation algorithms can solve the N-body problem in a $O(N log N)$ or $O(N)$ time. Fast N-body algorithms approximately evaluate the pair-wise interactions, effectively trading accuracy by speed in the calculations.

In this work, we present an heterogeneous implementation of the fast multiple method (a fast summation algorithm), that has been designed to support more parallel and compute-dense architectures as in the case of heterogeneous GPU accelerated or many-core systems. Due to the irregular data-patterns of the original algorithm formulation, the design strategy for many-core and GPU technologies is based on a reorganization of the computations that take place on the most time consuming routines of the algorithm. To do this, the computations of these stages are passed into a queue and scheduling module that reorganizes and submit the work to the processing elements.

**GPU Computing on a New Frontier in Cosmology**

Lincoln Greenhill

Harvard / Smithsonian

greenhill@cfa.harvard.edu

The cosmological Dark Age and Epoch of Reionization that followed the Big Bang reflect the last gap in the "historical" record. During this billion-year interval, the first stars, galaxies, and supermassive black holes formed. Though critical to development of the Universe as we know it today, there is very little data available to constrain theory. The redshifted 21 cm hyperfine transition of atomic Hydrogen promises to be a unique tracer of the then rapidly evolving, natal intergalactic medium and an indicator of what objects populated the early Universe.

Detection and study of the HI line requires new generations of low-frequency radio interferometer arrays (< 200 MHz). Mass deployments of dipoles enable assembly of very large collecting areas at low cost. However, sheer numbers, wide field of view, and rough tolerances demand use of advanced algorithms and potentially extreme signal processing costs.

I will describe a unique GPU-enabled one-pass, real-time, wide-field, full-polarization calibration and imaging pipeline now in field use. Long-term build out of the underlying array will drive the computing budget toward 100 TFlop/s by late 2012. I will discuss generalization of this pipeline to serve more advanced, iterative calibration and imaging with other apertures. I will also discuss scaling smf application of GPU computing to a planned next-generation array over 10x larger and for which the budget may approach 100 PFlop/s.

**Dynamical Evolution of and Gravitational Waves from Binary Black Holes in Galactic Nuclei and Prospects of General Purpose Computing using GPU**

Rainer Spurzem, Peter Berczik, et al.

National Astronomical Observatories, Chinese Academy of Sciences, Beijing and Astronomisches Rechen-Inst., Zentrum fuer Astronomie, Univ. Heidelberg

We study the dynamical evolution of dense star clusters (globular clusters and galactic nuclei with single and binary supermassive black holes). Rotation accelerates the collisional dynamics of the system, and if Post-Newtonian relativistic forces between black holes are included, SMBH coalescence and gravitational wave emission is triggered without any stalling problem for the last parsec after galaxy mergers. The treatment of relativistic corrections to Newtonian gravity is reviewed and we discuss gravitational wave emission in pulsar timing and LISA bands,

In the second part of the talk our new computational instruments are presented. New powerful supercomputers have been built using graphical processing units (GPU) for general purpose computing. China has just obtained the top rank in the list of the fastest supercomputers in the world with a GPU based system, and holds further ranks in the top 20 (all with GPU clusters).

The research of Chinese Academy of Sciences and National Astronomical Observatory with such GPU clusters will be reviewed, present and future applications in computer simulation and data processing discussed. GPU and other 'green' supercomputing hardware is one stepping stone on the path to reach Exascale supercomputing.

**CUDA-free programming of 3D stencil methods with Mint**

Didem Unat1, Xing Cai^{2}, and Scott B. Baden^{1}

^{1} Computer Science and Engineering, University of California-San Diego, USA

^{2} Simula Research Laboratory, and Department of Informatics, University of Oslo, Norway

We have developed Mint, a domain-specific programming model and translator that generates highly optimized CUDA C from annotated C source. Using just 5 different Mint pragmas, non expert CUDA programmer can benefit from GPU technology without having to be an expert. Mint includes an optimizer that targets 3D stencil methods. The translator generates both host and device code and handles the memory management details including host-device and shared memory optimizations. Mint parallelizes loop-nests in appropriately annotated C source, performing domain-specific optimizations important in three-dimensional problems. Preliminary results on a set of commonly-used 3D stencil methods reveal that Mint can achieve on average 76% (between 65% and 89%) of the performance achieved on the NVIDIA Tesla C2050 using hand-written CUDA.

**Scalability and the real world: lessons learned optimizing ATLAS reconstruction and simulation performance on multicore CPUs.**

Mous Tatarkhanov, Sebastien Binet, Paolo Calafiura, Keith Jackson, Wim Lavrijsen, Charles Leggett, David Levinthal, Yushu Yao.

Modern computer architectures have evolved towards multi-core, multi-socket CPUs. Exploiting optimally the CPU resources of these machines is a major challenge for complex scientific applications. The software of ATLAS experiment at the LHC [1], is an example of a large, complex and resource-hungry scientific application. Given the amount of existing ATLAS code (millions of lines of code in thousands of libraries) any optimization that requires major code overhauls is not practicable. AthenaMP [2] is a non-intrusive approach to coarse-grained parallelism designed primarily with the goal to reduce memory usage (ATLAS applications require up to 2GB of memory to run). AthenaMP makes use of Linux process forking to create event-processing workers farm and utilizes Copy-on-Write physical memory sharing mechanism among workers and parent. The parent process also takes the role of a work scheduler and merger of the output. The use of multiprocessing instead of multithreading is what makes our approach non-intrusive: since each process exists in its own memory address space the existing code can run in parallel without the notorious synchronization problems of multi-threaded applications.

Having demonstrated the feasibility of the AthenaMP, we have studied its performance on various many-core platforms. A major effort has been taken to boost the performance of the AthenaMP on Intel quad core Nehalem CPU systems, which appears to be the platform of choice for the next round of ATLAS hardware procurement. In this study, we showed that the significant scaling enhancements are possible for large-scale scientific applications without major change in the software design. The major improvements are also available from exploiting the new architectural and OS features. The thorough understanding of non-ideal scaling can be reached by utilizing tools like Intel's Performance Tuning Utility (PTU) to profile and analyze all aspects of the complex software on many-core machines.

[1] P. Mato, Gaudi-architecture design document, Tech. Rep. LHCb-98-064 Geneva

[2] S. Binet et al., Harnessing multicores: strategies and implementations in ATLAS, CHEP-2009

**Cosmological Simulations on Large, Accelerated Supercomputers**

Adrian Pope

Los Alamos National Laboratory

pope@lanl.gov

The advent of powerful cosmological surveys demands a new generation of high-precision, large-volume, and high dynamic range simulations of structure formation in the Universe. Key aims of these simulations understand why the expansion of the Universe is accelerating and what dark matter is made of. The availability of Roadrunner, the world’s ﬁrst petaﬂop platform, led us to develop a new hybrid cosmology simulation code that makes essential use of hardware acceleration and can scale to very large supercomputers. The modular design of our code

has enabled us to develop variants for both the IBM Cell Broadband Engine using the IBM Cell SDK and for GPGPU using OpenCL. We describe the strategies underlying the code, aspects of its implementation, initial results from code commissioning, and plans for variants on other architectures.

**Performance Evaluation of the 48-core Single Chip Cloud Computer**

Aparna Chandramowlishwaran

aparna@gatech.edu

This work presents a performance and scalability study of Intel's experimental 48-core Single-Chip Cloud (SCC) research system. Using a combination of simple microbenchmarks and collective communication routines, we evaluate processor and

interconnection network characteristics such as on-chip messaging latency,message bandwidths, and sustained bisection bandwidth. We also outline a performance model which guides the design of collective communication algorithms on the SCC.