Overview of Dense Numerical Linear Algebra Libraries

BLAS → Kernels for dense linear algebra
LAPACK → Sequential dense linear algebra
ScaLAPACK → Parallel distributed dense linear algebra

So let's dive into HPC for a bit by discussing:

What do you mean by performance?

  1. What is a xflop/s?

xflop/s is a rate of execution: the number of floating-point operations an algorithm or program performs per second. Whenever this term is used, it refers to 64-bit floating-point arithmetic, where each operation is either an addition or a multiplication. Tflop/s means trillions (10^12) of floating-point operations per second. Pflop/s means 10^15 floating-point operations per second.

  2. What is the theoretical peak performance?

Theoretical peak is not based on performance measured by running a benchmark; it is a paper calculation of the maximum rate of floating-point operations for a machine. It is determined by counting the number of floating-point additions and multiplications (in full precision) that can be completed in a given period of time, usually the cycle time of the machine. For example, a 2.93 GHz quad-core Intel Xeon 5570 processor can complete 4 floating-point operations per cycle, so its theoretical peak is 2.93 x 4 = 11.72 Gflop/s per core, or 4 x 11.72 = 46.88 Gflop/s per socket.
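The peak figure above is plain arithmetic: clock rate times floating-point operations per cycle times number of cores. A minimal Python sketch follows; the function name `theoretical_peak_gflops` is an illustrative assumption, not part of any library.

```python
def theoretical_peak_gflops(clock_ghz, flops_per_cycle, cores=1):
    """Paper peak in Gflop/s: clock rate (GHz) x flops per cycle x cores."""
    return clock_ghz * flops_per_cycle * cores


# Values taken from the Xeon 5570 example in the text.
per_core = theoretical_peak_gflops(2.93, 4)
per_socket = theoretical_peak_gflops(2.93, 4, cores=4)
print(f"{per_core:.2f} Gflop/s per core, {per_socket:.2f} Gflop/s per socket")
```

Real codes rarely reach this number, which is why measured benchmarks are reported alongside it.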


LINPACK is a package of mathematical software for solving problems in linear algebra, mainly dense systems of linear equations. The name stands for “LINear algebra PACKage”, and the package was written in Fortran 66. The project started in 1974 and had four main contributors: Jack Dongarra (Argonne National Laboratory), Jim Bunch (University of California, San Diego), Cleve Moler (University of New Mexico), and Pete Stewart (University of Maryland). As a software package, LINPACK has been largely superseded by LAPACK, which is designed to run efficiently on shared-memory vector supercomputers.

Computing in 1974:

High-performance computers: IBM 370/195, CDC 7600, Univac 1110, DEC PDP-10, Honeywell 6030
Fortran 66
Goals: software portability and efficient execution
BLAS (Level 1): vector operations
Software released in 1979, around the time of the Cray 1

LINPACK Benchmark?

The Linpack benchmark measures a computer's floating-point rate of execution. It is determined by running a computer program that solves a dense system of linear equations. Over the years the characteristics of the benchmark have changed slightly; in fact, the Linpack benchmark report contains three benchmarks. The Linpack benchmark solves a dense linear system using LU factorization with partial pivoting. Operation count: 2/3 n^3 + O(n^2). Benchmark measure: Mflop/s. The original benchmark measured the execution rate of a Fortran program on a 100x100 matrix.
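To make the benchmark idea concrete, here is a small pure-Python sketch: it factors a random 100x100 matrix by LU with partial pivoting, solves the system, and converts the elapsed time into Mflop/s. It is not the benchmark code itself. The function names, the random test matrix, and the use of 2/3 n^3 + 2 n^2 as the operation count (the conventional choice for the O(n^2) term) are assumptions made for illustration.

```python
import random
import time


def lu_factor(a):
    """LU factorization with partial pivoting; returns (packed LU, row permutation)."""
    n = len(a)
    lu = [row[:] for row in a]
    perm = list(range(n))
    for k in range(n):
        # Partial pivoting: move the largest entry of column k onto the diagonal.
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        if lu[p][k] == 0.0:
            raise ValueError("matrix is singular")
        lu[k], lu[p] = lu[p], lu[k]
        perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]  # multiplier, stored where L lives
            m = lu[i][k]
            for j in range(k + 1, n):
                lu[i][j] -= m * lu[k][j]
    return lu, perm


def lu_solve(lu, perm, b):
    """Solve A x = b given the packed LU factors and the row permutation."""
    n = len(lu)
    y = [b[p] for p in perm]  # apply the row permutation to b
    for i in range(n):  # forward substitution with unit lower triangle
        for j in range(i):
            y[i] -= lu[i][j] * y[j]
    for i in range(n - 1, -1, -1):  # back substitution with upper triangle
        for j in range(i + 1, n):
            y[i] -= lu[i][j] * y[j]
        y[i] /= lu[i][i]
    return y


n = 100
random.seed(0)
a = [[random.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
b = [sum(row) for row in a]  # right-hand side chosen so the exact solution is all ones

start = time.perf_counter()
lu, perm = lu_factor(a)
x = lu_solve(lu, perm, b)
elapsed = time.perf_counter() - start

flops = 2.0 / 3.0 * n**3 + 2.0 * n**2  # assumed operation count
mflops = flops / elapsed / 1e6
max_err = max(abs(xi - 1.0) for xi in x)
print(f"n={n}: {mflops:.1f} Mflop/s, max error {max_err:.2e}")
```

An interpreted loop like this reaches only a small fraction of a machine's peak, which is exactly the gap the benchmark is meant to expose.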

Accidental Benchmarker:

The benchmark originated as Appendix B of the Linpack Users' Guide, which was designed to let users estimate the execution time of the Linpack software package. The first benchmark report dates from 1977 and covered machines from the Cray 1 to the DEC PDP-10.

High Performance Linpack (HPL):

The Linpack benchmark report also covers less restricted tests. In the variant known as Toward Peak Performance (TPP) or Best Effort, compiler parallelization is possible. In the highly parallel LINPACK test, also known as the NxN Linpack benchmark or the Highly Parallel Computing (HPC) benchmark, multiprocessor implementations are allowed.

A brief history of (Dense) Linear Algebra software:

But BLAS-1 wasn't enough. Consider
AXPY ( y = α x + y ): 2n flops on 3n reads/writes
computational intensity = (2n)/(3n) = 2/3
Too low to run near peak speed (reads/writes dominate)
So the BLAS-2 were developed (1984–1986)
Standard library of 25 operations (mostly) on matrix/vector pairs
“GEMV”: y = α A x + β y, “GER”: A = A + α x y^T,
x = T^(-1) x (triangular solve)
up to 4 versions of each (S/D/C/Z), 66 routines, 18K LOC
Why BLAS-2? They perform O(n^2) operations on O(n^2) data
so computational intensity is still only ~(2n^2)/(n^2) = 2
OK for vector machines, but not for machines with caches
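The intensity figures above can be checked with a short sketch. The Python functions below are unoptimized illustrations of AXPY and of GEMV in its standard form y = α A x + β y. They are not the BLAS routines themselves, and the helper name `intensity` is an assumption.

```python
def axpy(alpha, x, y):
    """BLAS-1 style AXPY: returns alpha*x + y (2n flops, 3n reads/writes)."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def gemv(alpha, a, x, beta, y):
    """BLAS-2 style GEMV: returns alpha*A*x + beta*y for a row-major matrix A."""
    return [
        alpha * sum(aij * xj for aij, xj in zip(row, x)) + beta * yi
        for row, yi in zip(a, y)
    ]


def intensity(flops, memory_refs):
    """Computational intensity: flops per memory reference."""
    return flops / memory_refs


n = 1000
axpy_intensity = intensity(2 * n, 3 * n)      # 2/3, independent of n
gemv_intensity = intensity(2 * n * n, n * n)  # about 2, leading terms only

print(axpy(2.0, [1.0, 2.0], [3.0, 4.0]))
print(gemv(1.0, [[1.0, 2.0], [3.0, 4.0]], [1.0, 1.0], 1.0, [1.0, 1.0]))
print(axpy_intensity, gemv_intensity)
```

Both ratios stay constant as n grows, so neither operation can hide memory traffic behind arithmetic.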

Next step: BLAS-3 (1987–1988)
Standard library of 9 operations (mostly) on matrix/matrix pairs
“GEMM”: C = α A B + β C, C = α A A^T + β C, B = T^(-1) B
Up to 4 versions of each (S/D/C/Z), 30 routines, 10K LOC
Why BLAS-3? They perform O(n^3) operations on O(n^2) data
So computational intensity is (2n^3)/(4n^2) = n/2, large at last!
Good for machines with caches and other memory hierarchy levels
Amount of BLAS-1/2/3 code so far:
Sources: 142 routines, 31K LOC, Tests: 28K LOC
Reference implementation only (not optimized)
Example: GEMM as 3 nested loops
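The three nested loops can be sketched as follows. This is a plain Python illustration of the unoptimized reference idea for C = α A B + β C, not the reference BLAS source; the function name `gemm` and the list-of-lists matrix layout are assumptions.

```python
def gemm(alpha, a, b, beta, c):
    """Reference-style GEMM: returns alpha*A*B + beta*C using three nested loops."""
    m, k, n = len(a), len(b), len(b[0])
    out = [[beta * c[i][j] for j in range(n)] for i in range(m)]
    for i in range(m):          # rows of A and C
        for j in range(n):      # columns of B and C
            for p in range(k):  # inner (dot product) dimension
                out[i][j] += alpha * a[i][p] * b[p][j]
    return out


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c = [[1, 1], [1, 1]]
result = gemm(1, a, b, 2, c)
print(result)  # [[21, 24], [45, 52]]

# Intensity for square n x n matrices: 2n^3 flops over 4n^2 memory references.
n = 1000
gemm_intensity = (2 * n**3) / (4 * n**2)  # n/2
```

Because the intensity grows with n, an optimized implementation can block the loops so that most operands are reused from cache.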

• LAPACKE: standard C language API for LAPACK (in collaboration with Intel), with 2 interface levels
  - High-level interface: workspace allocation and NaN checking
  - Low-level interface
• Prebuilt libraries for Windows
• Extensive test suite
• Forum and user support: http://icl.cs.utk.edu/lapack-forum/

Latest algorithms since LAPACK release 3.0:
- Hessenberg QR algorithm with small-bulge multishift QR and aggressive early deflation [2003 SIAM SIAG/LA Prize-winning algorithm of Braman, Byers, and Mathias]
- Improved Hessenberg reduction routines [G. Quintana-Ortí and van de Geijn]
- New MRRR eigenvalue algorithms [2006 SIAM SIAG/LA Prize-winning algorithm of Dhillon and Parlett]
- New strategy for updating partial column norms in QR factorization with column pivoting [Drmač and Bujanović]
- Mixed-precision iterative refinement that exploits fast single-precision hardware, for GE and PO [Langou]
- Rectangular Full Packed (RFP) format [Gustavson, Langou]
- Extra-precise iterative refinement for GESV using XBLAS [Demmel et al.]

ScaLAPACK
• Software library of dense and banded routines
• Distributed memory, message passing
• MIMD computers and networks of workstations
• Clusters of SMPs
• Provides:
  - Solution of systems of linear equations
  - Least squares solutions of systems of linear equations
  - Eigenvalue problems
  - Singular value problems
• Depends on LAPACK/BLAS and BLACS/MPI
• Fortran and C
• Includes PBLAS (Parallel BLAS)
• BLACS for interprocessor communication (PVM, MPI, ..., Intel, TMC)
• Provides the right level of notation
• BLAS
• LAPACK software expertise/quality
  - Software approach
  - Numerical methods

Software structure
• Object-based: an array descriptor records how a global array is mapped to processes and memory
  - Provides a flexible framework to easily specify additional data distributions and matrix types
  - Currently dense, banded, and out-of-core
• Uses the concept of context

PBLAS
• Functionality and naming are similar to the BLAS
• Built on the BLAS and BLACS
• Global view of the matrix:
    CALL DGEXXX ( M, N, A( IA, JA ), LDA, ... )
    CALL PDGEXXX( M, N, A, IA, JA, DESCA, ... )

ScaLAPACK structure: ScaLAPACK, PBLAS, LAPACK, BLAS, BLACS, PVM/MPI/...
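The PBLAS call takes global indices IA, JA together with a descriptor, rather than a reference into local memory. To illustrate the kind of mapping a descriptor has to encode, here is a sketch of a one-dimensional block-cyclic distribution, which is a layout commonly used for dense matrices in ScaLAPACK. The function name, the 0-based indexing, and the parameter names are assumptions for illustration and are not ScaLAPACK's API.

```python
def block_cyclic_owner(g, block, nprocs):
    """Map a 0-based global index to (owning process, 0-based local index).

    Indices are grouped into blocks of size `block`, and the blocks are
    dealt out to processes 0..nprocs-1 in round-robin order.
    """
    b = g // block                             # which global block holds g
    proc = b % nprocs                          # blocks are dealt cyclically
    local = (b // nprocs) * block + g % block  # position in the owner's storage
    return proc, local


block, nprocs = 2, 3
layout = [block_cyclic_owner(g, block, nprocs) for g in range(8)]
for g, (proc, local) in enumerate(layout):
    print(f"global {g} -> process {proc}, local {local}")
```

For a two-dimensional process grid, the same mapping would be applied independently to the row index and the column index.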
