“Implementation of HPC libraries for linear algebra / matrix operations, with a comparative study analysing speed and memory metrics”
Welcome guys! In this blog I will talk about some HPC libraries for linear algebra and matrix operations, and present a comparative study analysing their speed and memory metrics.
So let’s get started.
First, let’s look at an
Overview of Dense Numerical Linear Algebra Libraries
BLAS → These are the dense linear algebra kernels.
LAPACK → This is used for sequential dense linear algebra.
ScaLAPACK → This is used for parallel, distributed dense linear algebra.
So let’s dive into HPC for a bit by discussing:
What do you mean by performance?
1. What is an xflop/s?
An xflop/s measures the execution rate of an algorithm or program: the number of floating-point operations performed per second. Whenever the term is used, it refers to 64-bit floating-point arithmetic, where each operation is either a multiplication or an addition. A Tflop/s is 10^12 (a trillion) floating-point operations per second, and a Pflop/s is 10^15 floating-point operations per second.
2. What is the theoretical peak performance?
Theoretical peak performance is not based on real-world measurements from running a benchmark; it is a paper calculation of the maximum rate of floating-point operations a machine could possibly deliver. It is determined by counting the number of floating-point additions and multiplications (at full precision) that can be completed in a given period of time, typically one machine cycle. For example, a 2.93 GHz quad-core Intel Xeon 5570 processor can perform 4 floating-point operations per clock cycle, so its theoretical peak is 2.93 × 4 = 11.72 Gflop/s per core, or 46.88 Gflop/s per socket.
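That arithmetic is easy to script. Below is a minimal sketch of the calculation in C, using the example figures above; swap in your own processor’s clock rate, flops per cycle, and core count:

```c
#include <stdio.h>

int main(void) {
    /* Example figures from the Xeon 5570 discussion above --
       substitute the values for your own processor. */
    double ghz             = 2.93; /* clock rate in GHz              */
    double flops_per_cycle = 4.0;  /* FP adds/muls issued per cycle  */
    int    cores           = 4;    /* cores per socket               */

    double peak_per_core   = ghz * flops_per_cycle;   /* Gflop/s */
    double peak_per_socket = peak_per_core * cores;   /* Gflop/s */

    printf("Theoretical peak: %.2f Gflop/s per core, %.2f Gflop/s per socket\n",
           peak_per_core, peak_per_socket);
    return 0;
}
```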
What Is LINPACK?
LINPACK is a mathematical software package for solving linear algebra problems, mainly dense systems of linear equations. LINPACK, the “LINear algebra PACKage”, was written in Fortran 66; the project started in 1974 and had four main contributors: Jack Dongarra of Argonne National Laboratory, Jim Bunch of UC San Diego, Cleve Moler of the University of New Mexico, and Pete Stewart of the University of Maryland. As a software package, LINPACK has largely been superseded by LAPACK, which is designed to run efficiently on shared-memory vector supercomputers.
Computing in 1974:
The high-performance computers of the day were the IBM 370/195, CDC 7600, Univac 1110, DEC PDP-10 and Honeywell 6030. Fortran 66 was the standard, and real effort went into software portability and efficiency, building on the Level-1 BLAS vector operations. The LINPACK software was released in 1979, around the time of the Cray-1.
The LINPACK Benchmark:
The LINPACK benchmark measures a computer’s floating-point rate of execution. It is based on a program that solves a dense system of linear equations. The characteristics of the benchmark have changed slightly over the years; in fact, the LINPACK benchmark report describes three separate benchmarks. The benchmark solves a dense linear system using LU factorization with partial pivoting. Operation count: 2/3 n^3 + O(n^2). Benchmark measure: MFlop/s. The earliest benchmark measured the execution rate of a Fortran program on a 100 × 100 matrix.
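To get a feel for what the benchmark measures, here is a rough, LINPACK-style sketch in C that uses LAPACK’s dgesv through the LAPACKE interface. The matrix size and the random data are my own choices, and a real HPL run does much more (distributed memory, tuned block sizes, residual checks), so treat this purely as an illustration of the timing and the 2/3 n^3 flop count:

```c
/* Build with something like:  cc linpack_sketch.c -llapacke -llapack -lblas */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <lapacke.h>

int main(void) {
    const int n = 2000, nrhs = 1;      /* problem size: my own choice */
    double *A = malloc(sizeof(double) * (size_t)n * n);
    double *b = malloc(sizeof(double) * n);
    lapack_int *ipiv = malloc(sizeof(lapack_int) * n);

    srand(42);                          /* fill with random data */
    for (long i = 0; i < (long)n * n; i++) A[i] = rand() / (double)RAND_MAX;
    for (int i = 0; i < n; i++)            b[i] = rand() / (double)RAND_MAX;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    /* LU factorization with partial pivoting, then solve Ax = b */
    lapack_int info = LAPACKE_dgesv(LAPACK_COL_MAJOR, n, nrhs, A, n, ipiv, b, n);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    /* LINPACK-style flop count: 2/3 n^3 + O(n^2); lower-order terms ignored here */
    double gflops = (2.0 / 3.0) * n * (double)n * n / secs / 1e9;

    printf("info = %d, time = %.3f s, ~%.2f Gflop/s\n", (int)info, secs, gflops);
    free(A); free(b); free(ipiv);
    return 0;
}
```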
Accidental Benchmarker:
Appendix B of the LINPACK Users’ Guide was designed to help users estimate the running time of the LINPACK software package. The first comparative report appeared in 1977, covering machines from the Cray-1 down to the DEC PDP-10.
High Performance Linpack (HPL):
Compiler parallelization is allowed; this variant is also known as Toward Peak Performance (TPP) or Best Effort. Multiprocessor implementations are allowed as well. The highly parallel LINPACK benchmark is also known as the NxN LINPACK benchmark, or the Highly Parallel Computing (HPC) benchmark.
A brief history of (Dense) Linear Algebra software:
But BLAS-1 wasn’t enough. Consider
AXPY ( y = α x + y ): 2n flops on 3n reads/writes,
so the computational intensity is (2n)/(3n) = 2/3.
That is too low to run near peak speed (the reads and writes dominate).
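In double precision, that AXPY is the Level-1 BLAS routine daxpy. A minimal sketch through the CBLAS interface (the vector length and values are arbitrary):

```c
/* Build with something like:  cc axpy_sketch.c -lcblas -lblas */
#include <stdio.h>
#include <cblas.h>

int main(void) {
    const int n = 4;                       /* tiny size, just for illustration */
    double alpha = 2.0;
    double x[] = {1.0, 2.0, 3.0, 4.0};
    double y[] = {10.0, 10.0, 10.0, 10.0};

    /* y = alpha*x + y : 2n flops, but 3n memory accesses (read x, read y, write y) */
    cblas_daxpy(n, alpha, x, 1, y, 1);

    for (int i = 0; i < n; i++) printf("y[%d] = %.1f\n", i, y[i]);
    return 0;
}
```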
So the BLAS-2 were developed (1984–1986):
a standard library of 25 operations (mostly) on matrix/vector pairs, e.g.
“GEMV”: y = α A x + β y, “GER”: A = A + α x y^T, triangular solve: x = T^(-1) x,
with up to 4 versions of each (S/D/C/Z): 66 routines, 18K LOC.
Why BLAS-2? They do O(n^2) operations on O(n^2) data,
so the computational intensity is still only ~(2n^2)/(n^2) = 2.
That is fine for vector machines, but not for machines with caches.
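As a small illustration, the double-precision GEMV is cblas_dgemv; the 2 × 3 matrix and the scalars below are arbitrary example values:

```c
/* Build with something like:  cc gemv_sketch.c -lcblas -lblas */
#include <stdio.h>
#include <cblas.h>

int main(void) {
    /* A is 2x3, stored row-major; x has 3 entries, y has 2 */
    double A[] = {1.0, 2.0, 3.0,
                  4.0, 5.0, 6.0};
    double x[] = {1.0, 1.0, 1.0};
    double y[] = {0.5, 0.5};
    double alpha = 1.0, beta = 2.0;

    /* y = alpha*A*x + beta*y : O(n^2) flops on O(n^2) data */
    cblas_dgemv(CblasRowMajor, CblasNoTrans, 2, 3, alpha, A, 3, x, 1, beta, y, 1);

    printf("y = [%.1f, %.1f]\n", y[0], y[1]);   /* expect [7.0, 16.0] */
    return 0;
}
```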
The next step: BLAS-3 (1987–1988),
a standard library of 9 operations (mostly) on matrix/matrix pairs, e.g.
“GEMM”: C = α A B + β C, C = α A A^T + β C, B = T^(-1) B,
with up to 4 versions of each (S/D/C/Z): 30 routines, 10K LOC.
Why BLAS-3? They do O(n^3) operations on O(n^2) data,
so the computational intensity is (2n^3)/(4n^2) = n/2 — big at last!
That is good for machines with caches and other memory-hierarchy levels.
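Calling the tuned GEMM from C goes through cblas_dgemm. A tiny sketch with arbitrary 2 × 2 matrices:

```c
/* Build with something like:  cc gemm_sketch.c -lcblas -lblas */
#include <stdio.h>
#include <cblas.h>

int main(void) {
    /* 2x2 matrices, row-major, arbitrary values */
    double A[] = {1.0, 2.0,
                  3.0, 4.0};
    double B[] = {5.0, 6.0,
                  7.0, 8.0};
    double C[] = {1.0, 1.0,
                  1.0, 1.0};

    /* C = 1.0*A*B + 1.0*C : O(n^3) flops on O(n^2) data, which is what
       lets a tuned BLAS-3 run near peak on machines with caches */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 2, 1.0, A, 2, B, 2, 1.0, C, 2);

    printf("C = [%.1f %.1f; %.1f %.1f]\n", C[0], C[1], C[2], C[3]);
    return 0;
}
```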
How much BLAS-1/2/3 code is there so far? (It is all at www.netlib.org/blas.)
Source: 142 routines, 31K LOC; tests: 28K LOC.
This is the reference implementation only (not optimized).
Example: GEMM is just 3 nested loops.
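Here is roughly what that reference-style triple loop looks like, written as a C sketch rather than the actual netlib Fortran source (square, row-major matrices assumed):

```c
#include <stddef.h>

/* Reference-style GEMM: C = alpha*A*B + beta*C for square n x n,
   row-major matrices. Three nested loops, no blocking or vectorization,
   so it makes poor use of the cache compared with a tuned BLAS. */
void naive_dgemm(size_t n, double alpha, const double *A, const double *B,
                 double beta, double *C)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double acc = 0.0;
            for (size_t k = 0; k < n; k++)
                acc += A[i * n + k] * B[k * n + j];
            C[i * n + j] = alpha * acc + beta * C[i * n + j];
        }
    }
}
```

Tuned BLAS libraries replace this triple loop with blocked, vectorized, multithreaded code, which is exactly where the BLAS-3 speedups come from.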
Memory Hierarchy:
By taking advantage of the principle of locality, a memory hierarchy can present the user with as much memory as is available in the cheapest technology, while providing access at the speed offered by the fastest technology.
A brief history of (Dense) Linear Algebra software (continued):

LAPACK — the “Linear Algebra PACKage” — uses BLAS-3 (1989 – now).
- The obvious way to express Gaussian Elimination (GE) is to add multiples of one row to other rows — that is BLAS-1. The trick is reorganizing GE to use BLAS-3 (details later).
- Contents of LAPACK (summary):
- Algorithms that can be made (nearly) 100% BLAS-3: linear systems (solve Ax = b for x) and least squares (choose x to minimize ||Ax − b||_2).
- Algorithms that are only about 50% BLAS-3 (so far): “eigenproblems” (find λ and x where Ax = λ x) and the singular value decomposition, SVD ((A^T A)x = σ^2 x).
- Generalized problems (e.g. Ax = λ Bx), error bounds for everything, and lots of variants depending on A’s structure (banded, A = A^T, etc.).
- How much code? (Release 3.4, Nov 2011, www.netlib.org/lapack) Source: 1674 routines, 490K LOC; tests: 448K LOC.
- Is LAPACK parallel? Only if the BLAS are parallel (which is possible in shared memory).

ScaLAPACK — “Scalable LAPACK” (1995 – now):
- For distributed memory — uses MPI.
- More complex data structures and algorithms than LAPACK.
- Only a (small) subset of LAPACK’s functionality is available.
- All at www.netlib.org/scalapack.

LAPACK (http://www.netlib.org/lapack/) provides routines for
- solving systems of simultaneous linear equations,
- least-squares solutions of linear systems of equations,
- eigenvalue problems,
- and singular value problems.
LAPACK relies on the BLAS. The associated matrix factorizations (LU, Cholesky, QR, SVD, Schur, generalized Schur) are also provided, as are related computations such as reordering the Schur factorizations and estimating condition numbers. Dense and banded matrices are handled, but not general sparse matrices. Similar functionality is provided for real and complex matrices, in both single and double precision. LAPACK is written in Fortran with column-major storage, it is SEQUENTIAL, and it is a REFERENCE implementation.

New generation algorithms? Algorithms follow hardware evolution over time: LINPACK (70’s) relied on vector operations and Level-1 BLAS; LAPACK (80’s) relies on blocking, cache-friendly Level-3 BLAS operations.

GESV example: dgesv( n, nrhs, A, lda, ipiv, b, ldb, info ) solves a system of linear equations Ax = b by LU factorization. On input it takes the n × n matrix A and the n × nrhs right-hand side B; on output it returns the L and U factors, the pivot vector ipiv, the solution X, and the status flag info.

Functionality in LAPACK (type of problem → acronym):
- Linear system of equations → SV
- Linear least-squares problems → LLS
- Least-squares problem with linear equality constraints → LSE
- General linear model problem → GLM
- Symmetric eigenproblems → SEP
- Nonsymmetric eigenproblems → NEP
- Singular value decomposition → SVD
- Generalized symmetric definite eigenproblems → GSEP
- Generalized nonsymmetric eigenproblems → GNEP
- Generalized (or quotient) singular value decomposition → GSVD (QSVD)

LAPACK is included in many commercial software packages; vendors simply have to credit the authors correctly. Development happens in an open SVN repository, with support for multiple operating systems (*nix, Mac OS X, Windows) and multiple build systems through cmake (make, Xcode, nmake, VS Studio, Eclipse, etc.).
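As a taste of the LLS functionality listed above, here is a minimal sketch that fits a straight line in the least-squares sense with LAPACKE’s dgels; the 4 × 2 data matrix and right-hand side are made-up example values:

```c
/* Build with something like:  cc lls_sketch.c -llapacke -llapack -lblas */
#include <stdio.h>
#include <lapacke.h>

int main(void) {
    /* Overdetermined system: fit y = c0 + c1*t to 4 data points (my example data) */
    double A[] = {1.0, 1.0,
                  1.0, 2.0,
                  1.0, 3.0,
                  1.0, 4.0};               /* 4x2, row-major */
    double b[] = {6.0, 5.0, 7.0, 10.0};

    /* Minimize ||A*x - b||_2 via QR factorization (dgels). On exit the first
       two entries of b contain the least-squares solution. */
    lapack_int info = LAPACKE_dgels(LAPACK_ROW_MAJOR, 'N', 4, 2, 1, A, 2, b, 1);

    printf("info = %d, c0 = %.2f, c1 = %.2f\n", (int)info, b[0], b[1]);
    return 0;
}
```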
LAPACKE is a standard C-language LAPACK API (developed in cooperation with Intel), with two interface levels: a high-level interface that handles workspace allocation and NaN checking, and a low-level interface. There are ready-made libraries for Windows, an extensive test suite, and forums / user support at http://icl.cs.utk.edu/lapack-forum/.

Latest algorithms (since LAPACK 3.0):
- The multishift Hessenberg QR algorithm with more aggressive early deflation [2003 SIAM SIAG/LA Prize: Braman, Byers, Mathias].
- Improved Hessenberg reduction routines [Quintana-Ortí & van de Geijn].
- The new MRRR eigenvalue algorithm [2006 SIAM SIAG/LA Prize: Dhillon and Parlett].
- A new strategy for updating partial column norms in QR factorization with column pivoting [Drmač and Bujanović].
- Mixed-precision iterative refinement for GE and PO, exploiting fast single-precision hardware [Langou].
- The Rectangular Full Packed (RFP) format [Gustavson, Langou].
- Extra-precise iterative refinement with the XBLAS for GESV [Demmel et al.].
ScaLAPACK:
- A software library of dense and banded routines.
- Distributed memory, message passing.
- Targets networks of workstations, MIMD computers, and clusters of SMPs.
- Solves systems of linear equations, least-squares problems, eigenvalue problems, and singular value problems.
- Depends on LAPACK/BLAS and BLACS/MPI, and includes the PBLAS (Parallel BLAS).
- Written in Fortran and C. ScaLAPACK is the PARALLEL DISTRIBUTED library, and it builds on proven layers — the BLAS and LAPACK — for software quality, software approach, and numerical methods.

The design is object-based: array descriptors describe how each distributed matrix is created and laid out, giving a flexible framework in which additional data distributions and matrix types can easily be specified (currently dense, banded, and out-of-core), and the library uses the concept of a context.

PBLAS:
- Functionality and nomenclature are similar to the BLAS.
- Built on the BLAS and the BLACS.
- For a matrix operation, CALL DGEXXX( M, N, A( IA, JA ), LDA, ... ) becomes CALL PDGEXXX( M, N, A, IA, JA, DESCA, ... ).

ScaLAPACK structure: ScaLAPACK sits on top of LAPACK and the PBLAS, which in turn rely on the BLAS and the BLACS over PVM/MPI.
The PBLAS calls look global — you pass the whole distributed matrix plus its descriptor — while the actual computation happens on each process’s local pieces.

Data distribution: ScaLAPACK uses a two-dimensional block-cyclic distribution for dense matrices. Compare the interfaces:
- LAPACK: dgesv( n, nrhs, a(ia,ja), lda, ipiv, b(ib,jb), ldb, info )
- ScaLAPACK: pdgesv( n, nrhs, a, ia, ja, desca, ipiv, b, ib, jb, descb, info )
In LAPACK, A, B, the L and U factors, ipiv, and the solution X live in ordinary column-major arrays on one node. In ScaLAPACK, the same objects are cut into blocks (A11, A12, ..., with ipiv split into ip1, ip2, ...) and scattered block-cyclically over a process grid. [Figure in the original: a 5 × 5 block matrix shown in its global layout and then redistributed over a 2 × 2 process grid.] Routines are provided to redistribute data between different distributions.

Parallelism in ScaLAPACK:
- Level-3 BLAS block operations in all the reduction routines.
- Pipelining in the QR algorithm, triangular solvers, and the classic factorizations.
- Redundant computations in the condition estimators.
- Static work assignment in bisection.
- Task parallelism in the sign-function eigenvalue computations.
- Divide and conquer for the tridiagonal and band solvers, the symmetric eigenvalue problem, and the sign function.
- Cyclic reduction for the reduced system in the band solver.
- Data parallelism in the sign function.

Functionality in LAPACK and ScaLAPACK: both libraries cover the same classes of problems — linear systems of equations (SV), linear least-squares problems (LLS), linear equality-constrained least squares (LSE), the general linear model problem (GLM), symmetric eigenproblems (SEP), nonsymmetric eigenproblems (NEP), the singular value decomposition (SVD), generalized symmetric definite eigenproblems (GSEP), generalized nonsymmetric eigenproblems (GNEP), and the generalized (or quotient) singular value decomposition GSVD (QSVD).

Major changes to software: we must rethink the design of our software. This is another disruptive technology, similar to what happened with cluster computing and message passing — applications, algorithms, and software have to be rethought and rewritten. Numerical libraries in particular will change: both LAPACK and ScaLAPACK will undergo major changes to accommodate this.

A New Generation of Software: Parallel Linear Algebra Software for Multicore Architectures (PLASMA). Software and algorithms follow hardware evolution in time: LINPACK (70’s) relied on vector operations and Level-1 BLAS; LAPACK (80’s) on blocking and cache-friendly Level-3 BLAS; ScaLAPACK (90’s) on the PBLAS and message passing for distributed memory; and PLASMA (00’s) introduces new, many-core-friendly algorithms that rely on a DAG scheduler, a block data layout, and some extra kernels. These new algorithms have very low granularity, so they scale very well (multicore, petascale computing, ...); they remove a lot of the dependencies among tasks (multicore, distributed computing); they avoid latency (distributed computing, out-of-core); and they rely on fast kernels. They need new kernels and depend on efficient scheduling algorithms.
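Before moving on to the hardware trends that motivated PLASMA, here is a small self-contained sketch (plain C, not ScaLAPACK code) of the 2-D block-cyclic idea: every block of the global matrix is assigned to a process in a P × Q grid by cycling over block indices. The matrix size, block size, and grid shape are arbitrary choices of mine:

```c
#include <stdio.h>

/* Which process row/column owns global block (bi, bj) in a 2-D
   block-cyclic distribution over a P x Q process grid?          */
static int owner_row(int bi, int P) { return bi % P; }
static int owner_col(int bj, int Q) { return bj % Q; }

int main(void) {
    const int n = 8, nb = 2;        /* 8x8 matrix, 2x2 blocks (my choices) */
    const int P = 2, Q = 2;         /* 2x2 process grid                    */
    int nblocks = n / nb;

    /* Print, for each block, the (p,q) coordinates of the owning process. */
    for (int bi = 0; bi < nblocks; bi++) {
        for (int bj = 0; bj < nblocks; bj++)
            printf("(%d,%d) ", owner_row(bi, P), owner_col(bj, Q));
        printf("\n");
    }
    return 0;
}
```

Each process ends up holding an interleaved sample of the whole matrix, which keeps the work balanced as a factorization sweeps down the diagonal; ScaLAPACK’s array descriptors carry the parameters (block sizes, grid origin, leading dimension, etc.) that define this kind of mapping.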
Moore’s Law is Alive and Well

[Figure in the original: transistors (in thousands), clock frequency (MHz), power (W), and number of cores per chip, plotted from 1970 to 2010; data from Kunle Olukotun, Lance Hammond, Herb Sutter, Burton Smith, Chris Batten, and Krste Asanović.]

Transistor counts keep growing exponentially, but clock-frequency scaling has been replaced by core/chip scaling. Power is the root cause of all this — a hardware issue just became a software problem.

Power Cost of Frequency:
- Power ∝ Voltage^2 × Frequency (V^2 F)
- Frequency ∝ Voltage
- Therefore Power ∝ Frequency^3
In other words, doubling the clock costs roughly eight times the power, while halving the clock and doubling the core count delivers the same nominal flop/s for about a quarter of the power — which is why vendors now add cores instead of GHz.

Moore’s Law Reinterpreted:
- The number of cores per chip doubles every 2 years, while clock speed decreases (not increases).
- We need to deal with systems with millions of concurrent threads; future generations will have billions of threads.
- We need to be able to easily replace inter-chip parallelism with intra-chip parallelism.
- The number of threads of execution doubles every 2 years.
[Figure in the original: total number of cores in the Top20 systems over time, climbing toward 3.5 million.]

Example of a typical parallel machine:
- A chip/socket contains several cores.
- A node/board contains several chips/sockets, possibly with GPUs attached.
- A cabinet contains many nodes/boards; programming is shared-memory between processes on a board, and a combination of shared-memory and distributed-memory between nodes and cabinets.
- The full system connects many cabinets through a switch, again programmed with a combination of shared memory and distributed memory.

November 2011: The TOP10 (cores, Rmax in Pflop/s, % of peak, power in MW, MFlop/s per watt):
1. RIKEN Advanced Institute for Computational Science, K computer (Fujitsu SPARC64 VIIIfx, custom interconnect), Japan — 705,024 cores, 10.5 Pflop/s, 93% of peak, 12.7 MW, 826 MFlop/s/W.
2. National Supercomputer Center in Tianjin, Tianhe-1A (NUDT, Intel + NVIDIA GPU, custom), China — 186,368 cores, 2.57 Pflop/s, 55% of peak, 4.04 MW, 636 MFlop/s/W.
3. DOE / Office of Science, Oak Ridge National Lab, Jaguar (Cray, AMD, custom), USA — 224,162 cores, 1.76 Pflop/s, 75% of peak, 7.0 MW, 251 MFlop/s/W.
4. National Supercomputer Center in Shenzhen, Nebulae (Dawning, Intel + NVIDIA GPU, InfiniBand), China — 120,640 cores, 1.27 Pflop/s, 43% of peak, 2.58 MW, 493 MFlop/s/W.
5. GSIC Center, Tokyo Institute of Technology, Tsubame 2.0 (HP, Intel + NVIDIA GPU, InfiniBand), Japan — 73,278 cores, 1.19 Pflop/s, 52% of peak.
6. DOE / NNSA, LANL & SNL, Cielo (Cray, AMD, custom), USA — 142,272 cores, 1.11 Pflop/s, 81% of peak.