
STARS-H is a high-performance, parallel, open-source Software for Testing Accuracy, Reliability and Scalability of Hierarchical computations. STARS-H provides a hierarchical matrix market for benchmarking the performance of various libraries for hierarchical matrix compression and computation. It introduces a standard for assessing the accuracy and performance of hierarchical matrix libraries on a given hardware architecture. The STARS-H package is available online at https://github.com/ecrc/stars-h.

Why an H-matrix market?

• H-matrices appear in N-body problems and integral equations as a result of the discretization process (e.g. astrophysics, aerodynamics, drug design).

• There are no standard tests to assess the performance of software that exploits hierarchically low-rank structure.

• To meet our own needs and those of the community, we offer a parameterizable library.

Goals of STARS-H

• Provide a set of applications to standardize comparison of hierarchical libraries.

• Enable standardized input for computations of different H-matrix libraries with optimized performance on a range of hardware.

STARS-H 0.1.0

• Data formats: Tile Low-Rank (TLR) with more to come.

• Operations: approximation, matvec product, CG solve.

• Synthetic applications: random TLR, Cauchy.

• Real applications: electrostatics, electrodynamics, spatial statistics.

• Programming models: OpenMP, MPI and task-based (StarPU).

• Approximation techniques: SVD, Rank-Revealing QR, Randomized SVD.
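
To make the TLR format and the SVD-based approximation listed above more concrete, here is a minimal NumPy sketch. It is not the STARS-H API (STARS-H is a C library); every function name below is hypothetical. It also makes assumptions of its own: the Cauchy matrix is built on two shifted integer grids, diagonal tiles are kept dense, and each off-diagonal tile is truncated at a relative Frobenius-norm tolerance.

```python
import numpy as np

def cauchy_matrix(n):
    # Assumed synthetic kernel: A[i, j] = 1 / (x_i - y_j). The second point
    # set is shifted by half a step so the denominator never vanishes.
    x = np.arange(n, dtype=float)
    y = np.arange(n, dtype=float) + 0.5
    return 1.0 / (x[:, None] - y[None, :])

def compress_tile(tile, tol):
    # Truncated SVD: keep the smallest rank whose discarded singular values
    # have a 2-norm of at most tol * ||tile||_F.
    u, s, vt = np.linalg.svd(tile, full_matrices=False)
    tail = np.sqrt(np.cumsum(s[::-1] ** 2))[::-1]  # tail[k] = ||s[k:]||_2
    below = np.nonzero(tail <= tol * tail[0])[0]
    rank = int(below[0]) if below.size else len(s)
    return u[:, :rank] * s[:rank], vt[:rank, :]

def tlr_compress(a, nb, tol):
    # Split a square matrix into nb-by-nb tiles. Diagonal tiles stay dense
    # (an assumption of this sketch); the others are stored as (U, V) factors.
    n = a.shape[0]
    if a.shape[1] != n or n % nb != 0:
        raise ValueError("expected a square matrix with n divisible by nb")
    nt = n // nb
    tiles = {}
    for i in range(nt):
        for j in range(nt):
            block = a[i * nb:(i + 1) * nb, j * nb:(j + 1) * nb]
            tiles[i, j] = block.copy() if i == j else compress_tile(block, tol)
    return tiles

def tlr_matvec(tiles, nb, x):
    # y = A x, applying each low-rank tile as U @ (V @ x_j).
    y = np.zeros_like(x, dtype=float)
    for (i, j), t in tiles.items():
        xj = x[j * nb:(j + 1) * nb]
        if i == j:
            y[i * nb:(i + 1) * nb] += t @ xj
        else:
            u, v = t
            y[i * nb:(i + 1) * nb] += u @ (v @ xj)
    return y

if __name__ == "__main__":
    n, nb, tol = 256, 64, 1e-8
    a = cauchy_matrix(n)
    tiles = tlr_compress(a, nb, tol)
    ranks = [t[0].shape[1] for (i, j), t in tiles.items() if i != j]
    x = np.ones(n)
    err = np.linalg.norm(tlr_matvec(tiles, nb, x) - a @ x) / np.linalg.norm(a @ x)
    print("off-diagonal tile ranks:", sorted(ranks))
    print("relative matvec error:", err)
```

All tiles have the same shape, so the work per tile differs only through its rank. Rank-revealing QR or randomized SVD could replace the body of `compress_tile` without changing the rest of the sketch.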

Shaheen-II performance

STARS-H was tested on Shaheen-II, a Cray XC40 system. The system has 6,174 dual-socket compute nodes based on 16-core Intel Haswell processors running at 2.3 GHz. Each node has 128 GB of DDR4 memory running at 2300 MHz. Overall, the system has 197,568 processor cores and 790 TB of aggregate memory.
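
The aggregate figures follow directly from the per-node numbers. The short check below recomputes them; the variable names are illustrative, and the memory total assumes decimal units (1 TB = 1000 GB).

```python
# Per-node figures quoted for Shaheen-II.
nodes = 6174
sockets_per_node = 2
cores_per_socket = 16
mem_per_node_gb = 128

# Aggregate core count and memory (decimal terabytes, an assumption).
total_cores = nodes * sockets_per_node * cores_per_socket
total_mem_tb = nodes * mem_per_node_gb / 1000

print("total cores:", total_cores)
print("aggregate memory (TB):", total_mem_tb)
```

Both results agree with the quoted totals once the memory figure is rounded to the nearest terabyte.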

Summary

Matrices generated by the STARS-H library have various rank distributions over tiles. This provides varied inputs for further operations, such as Cholesky factorization or matrix-matrix multiplication, which makes debugging and the search for an optimal implementation much easier. Although the TLR format is the simplest from a theoretical point of view, it scales nearly perfectly due to its intrinsic load balance. TLR could potentially be the best choice for emerging architectures, where the trend is to increase the number of computational cores while sometimes decreasing the power of a single core. As compute nodes with several GPUs grow in popularity and number, TLR makes even more sense!

Future work

• Extend to other problems in a matrix-free form.

• Support HODLR, HSS, H and H2 data formats.

• Implement other approximation schemes (e.g., ACA).

• Port to GPU accelerators.

• Apply other dynamic runtime systems and programming models (e.g., PARSEC).
