- languages, parallel paradigms, hardware, applications, HPC libraries (vendor-optimized libraries)
- performance optimization looks very different for different codes

Plan for performance!

Faster scientific computing:
- improved algorithms (FFT, FMM, O(n^3) vs. Strassen matrix multiplication, approximations, precomputation, etc.)
- improved hardware utilization (usually parallelism, but not always: cache optimization / data locality)
  - sometimes a work-suboptimal method is faster because it utilizes the available hardware better
- co-design?

Hardware architecture overview:
- distributed memory
- shared memory
  - NUMA, cache lines (false sharing)
- caches
- instruction-level parallelism: pipelining, vectorization
- accelerators (GPUs - Wenda Zhou)

Trends in HPC hardware and software

Programming models for different architectures

Profiling/instrumentation tools:
- sometimes you need to design experiments in order to profile
- perf stat (https://bluss.github.io/rust/2016/03/28/a-gemmed-rabbit-hole/)

Parallelism, arithmetic intensity, roofline model, Amdahl's law (worked formulas below)
- strong and weak scaling

Cilk, CUDA, OpenMP, Threading Building Blocks

Effect of programming languages:
- interpreted: slow when processing element by element, but can be fast/efficient when operating on large blocks
  - e.g. MATLAB
- JIT
  - https://www.intel.com/content/www/us/en/developer/articles/technical/onemkl-improved-small-matrix-performance-using-just-in-time-jit-code.html
  - Julia
- low-level
  - C/C++, Fortran, assembly

Parallel debugging?

Performance may vary across processors, compilers, and operating systems:
- but a well-thought-out algorithm and well-written code will generally perform well everywhere

Plot memory bandwidth & latency gains alongside flops over the last two decades:
- latency is expensive: avoid random (unpredictable) memory accesses
- main-memory bandwidth is limited: reuse data in caches (tiling sketch below)

Give examples of NUMA allocation (first-touch sketch below), caches, conditionals, memory allocations, false sharing (padding sketch below)

Vector libraries: SLEEF, Agner Fog's VCL, HPX, Blaze
Other libraries: Baobzi, libxsmm, Intel MKL JIT, (module avail)

OpenMP common pitfalls (reduction sketch below)

Be roughly aware of how expensive different operations are: avoid divisions, exp, trig, and other special functions where possible (reciprocal sketch below).

Low-level optimization: benchmark, benchmark, benchmark!
https://www.embedded.com/common-multicore-programming-problems-part-4-memory-cache-issues-and-consistency/

SoA / AoS (sketch below)

gcc -march=native -E -v - </dev/null 2>&1 | grep march

"Polynomial Evaluation on Superscalar Architecture, Applied to the Elementary Function e^x" (Horner vs. Estrin sketch below)
https://dl.acm.org/doi/fullHtml/10.1145/3408893
https://gavinreynolds.scot/docs/msc-dissertation.pdf

Work-depth / work-span / work-time models, PRAM (Brent's bound below)
Lectures: https://www.cse.wustl.edu/~angelee/archive/cse341/fall14/handouts/lecture03.pdf
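
Tiling sketch: a minimal C example of the cache-reuse point above, not a tuned kernel. Blocking a matrix transpose keeps both the source rows and the destination columns resident in cache; the block size B is a placeholder to tune by benchmarking on the target machine.

```c
/* Loop tiling for cache reuse: minimal sketch, not a tuned kernel. */
#include <stddef.h>

enum { B = 64 }; /* block size: assumption, tune per machine */

void transpose_tiled(size_t n, const double *restrict a, double *restrict t)
{
    for (size_t ii = 0; ii < n; ii += B)
        for (size_t jj = 0; jj < n; jj += B)
            /* work on one BxB tile at a time so its lines stay cached */
            for (size_t i = ii; i < ii + B && i < n; ++i)
                for (size_t j = jj; j < jj + B && j < n; ++j)
                    t[j * n + i] = a[i * n + j];
}
```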
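
Padding sketch for false sharing: per-thread counters packed in one array land on the same cache line, so every increment invalidates the other cores' copies. Padding each counter to its own line avoids the ping-pong. The 64-byte line size and the cap of 64 threads are assumptions; compile with -fopenmp.

```c
#include <omp.h>

#define CACHE_LINE 64 /* assumption: check your CPU's line size */

struct padded_counter {
    long value;
    char pad[CACHE_LINE - sizeof(long)]; /* keep neighbors off this line */
};

long count_odds(int nthreads, long trials)
{
    struct padded_counter counts[64] = {0}; /* assumes nthreads <= 64 */

    #pragma omp parallel num_threads(nthreads)
    {
        int tid = omp_get_thread_num();
        for (long i = 0; i < trials; ++i)
            counts[tid].value += (i & 1); /* stand-in for real work */
    }

    long total = 0;
    for (int t = 0; t < nthreads; ++t)
        total += counts[t].value;
    return total;
}
```

Dropping the pad member reproduces the false-sharing version for a before/after benchmark.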
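
First-touch sketch for NUMA allocation: on Linux, a page is typically placed on the NUMA node of the thread that first writes it, so initializing in parallel with the same static schedule as the compute loop keeps each thread's data local. Sketch only; real placement also depends on thread pinning (e.g. OMP_PROC_BIND=true).

```c
#include <stddef.h>

void axpy_numa_aware(size_t n, double alpha, double *x, double *y)
{
    /* first touch: initialize with the same distribution as the compute */
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; ++i) {
        x[i] = 1.0;
        y[i] = 2.0;
    }

    /* each thread now mostly reads/writes pages on its own node */
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; ++i)
        y[i] += alpha * x[i];
}
```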
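
Reduction sketch for the OpenMP pitfalls item: accumulating into a shared variable without a reduction clause is a data race that silently gives wrong answers; the fix is one clause. Minimal sketch.

```c
#include <stddef.h>

double dot(size_t n, const double *x, const double *y)
{
    double sum = 0.0;

    /* WRONG: all threads update the shared `sum` concurrently
       #pragma omp parallel for
       for (size_t i = 0; i < n; ++i) sum += x[i] * y[i];  */

    /* RIGHT: each thread accumulates privately, combined at the end */
    #pragma omp parallel for reduction(+ : sum)
    for (size_t i = 0; i < n; ++i)
        sum += x[i] * y[i];

    return sum;
}
```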
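
Worked formulas for the scaling items above, in their standard forms: f is the parallelizable fraction, p the processor count, I the arithmetic intensity in flops per byte moved, and B_mem the main-memory bandwidth.

```latex
% Amdahl's law (strong scaling) and Gustafson's law (weak scaling):
\[
  S_{\mathrm{Amdahl}}(p) = \frac{1}{(1-f) + f/p}
  \;\xrightarrow{p\to\infty}\; \frac{1}{1-f},
  \qquad
  S_{\mathrm{Gustafson}}(p) = (1-f) + f\,p.
\]
% Roofline bound on attainable performance:
\[
  P = \min\bigl(P_{\mathrm{peak}},\; I \cdot B_{\mathrm{mem}}\bigr).
\]
```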
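
SoA / AoS sketch: with an array of structs, a loop that reads only one field drags the unused fields through the cache and hinders vectorization; a struct of arrays keeps the accessed field contiguous. Minimal sketch.

```c
#include <stddef.h>

struct particle  { double x, y, z; };     /* AoS: fields interleaved  */
struct particles { double *x, *y, *z; };  /* SoA: fields contiguous   */

double sum_x_aos(size_t n, const struct particle *p)
{
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += p[i].x;   /* 24-byte stride: only 1/3 of each line is used */
    return s;
}

double sum_x_soa(size_t n, const struct particles *p)
{
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += p->x[i];  /* unit stride: full lines used, vectorizes well */
    return s;
}
```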
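
Reciprocal sketch for the expensive-operations item: division costs many more cycles than multiplication on most cores, so when dividing many values by the same denominator, hoist one reciprocal out of the loop. Note this can change the last bits of the result, so apply it only where that tolerance is acceptable.

```c
#include <stddef.h>

void scale(size_t n, double *v, double denom)
{
    const double inv = 1.0 / denom;  /* one division instead of n */
    for (size_t i = 0; i < n; ++i)
        v[i] *= inv;
}
```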
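
Horner vs. Estrin sketch for the polynomial-evaluation paper above: Horner's rule is a serial dependency chain (each step waits on the previous one), while Estrin's scheme exposes independent sub-products that a superscalar core can execute in parallel. Degree-7 example; the coefficient array c[] is a placeholder, not the e^x coefficients from the paper.

```c
double horner7(double x, const double c[8])
{
    double r = c[7];
    for (int i = 6; i >= 0; --i)
        r = r * x + c[i];             /* each step depends on the last */
    return r;
}

double estrin7(double x, const double c[8])
{
    double x2 = x * x, x4 = x2 * x2;  /* powers computed up front */
    double lo = (c[1] * x + c[0]) + x2 * (c[3] * x + c[2]);
    double hi = (c[5] * x + c[4]) + x2 * (c[7] * x + c[6]);
    return lo + x4 * hi;              /* the pairs evaluate independently */
}
```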
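
Work-span bounds for the last item, in their standard statement: T_1 is the total work, T_infinity the span (critical-path length), and greedy schedulers on p processors satisfy Brent's bound.

```latex
\[
  \max\!\left(\frac{T_1}{p},\, T_\infty\right) \;\le\; T_p \;\le\; \frac{T_1}{p} + T_\infty,
  \qquad
  \text{speedup} = \frac{T_1}{T_p} \;\le\; \frac{T_1}{T_\infty},
\]
% so the parallelism T_1 / T_infinity caps the achievable speedup.
```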