- - languages, parallel paradigms, hardware, applications, HPC libraries (vendor optimized libraries)
- - performance optimization looks very different for different codes
- Plan for performance!
- Faster scientific computing:
- - Improved algorithms (FFT, FMM, O(n^3) vs. Strassen matrix multiplication, approximations, precomputing, etc.)
- - Improved hardware utilization (usually parallelism, but not always: cache optimization / data locality)
- - sometimes a work-suboptimal method is faster because it makes better use of the available hardware (see the sketch below)
- - co-design?
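- - a small illustration of the "more work but better hardware use" point, as a sketch (C++; not from these notes, and the size at which the linear scan wins is machine-dependent):

      // For small sorted arrays, a branch-predictable linear scan (O(n) work)
      // often beats binary search (O(log n) work): it touches memory
      // sequentially and its loop branch is easy to predict.
      #include <vector>
      #include <cstddef>

      // O(n) comparisons, sequential access, well-predicted branch.
      std::size_t linear_lower_bound(const std::vector<int>& a, int key) {
          std::size_t i = 0;
          while (i < a.size() && a[i] < key) ++i;
          return i;  // index of first element >= key
      }

      // O(log n) comparisons, but each branch is hard to predict.
      std::size_t binary_lower_bound(const std::vector<int>& a, int key) {
          std::size_t lo = 0, hi = a.size();
          while (lo < hi) {
              std::size_t mid = lo + (hi - lo) / 2;
              if (a[mid] < key) lo = mid + 1; else hi = mid;
          }
          return lo;
      }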
- Hardware architecture overview
- - distributed
- - shared memory and caches: NUMA, cache lines (false sharing)
- - instruction-level parallelism: pipelining, vectorization
- - accelerators (GPUs - Wenda Zhou)
- Trends in HPC hardware and software
- Programming models for different architectures
- Profiling/instrumentation tools
- - sometimes need to design experiments to profile
- - perf stat (https://bluss.github.io/rust/2016/03/28/a-gemmed-rabbit-hole/)
- Parallelism, arithmetic intensity, roofline model, Amdahl's law
- strong and weak scaling
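- - the standard formulas, written out for reference (LaTeX; s = serial fraction, p = processors, I = arithmetic intensity in flops/byte, B = memory bandwidth, P_peak = peak flop rate):

      % Strong scaling (Amdahl's law): the serial fraction s bounds the speedup
      S_{\mathrm{strong}}(p) = \frac{1}{s + (1 - s)/p} \le \frac{1}{s}

      % Weak scaling (Gustafson): the problem size grows with p
      S_{\mathrm{weak}}(p) = s + (1 - s)\,p

      % Roofline: attainable performance at arithmetic intensity I
      P(I) = \min\bigl(P_{\mathrm{peak}},\; I \cdot B\bigr)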
- Cilk, CUDA, OpenMP, Threading Building Blocks
- Effect of programming languages:
- - interpreted
- - slow when processing element by element
- - but can be fast/efficient when operating on large blocks.
- - e.g. MATLAB
- - JIT
- - https://www.intel.com/content/www/us/en/developer/articles/technical/onemkl-improved-small-matrix-performance-using-just-in-time-jit-code.html
- - Julia
- - low-level
- - C/C++, FORTRAN, assembly
- Parallel debugging?
- Performance may vary across processors, compilers, operating systems:
- - but a well-thought-out algorithm and well-written code will generally perform well
- Plot memory-bandwidth & latency improvements vs. flop-rate improvements over the last two decades
- - latency is expensive: avoid random (unpredictable) memory accesses
- - main-memory bandwidth is limited: reuse data already in the caches
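- - loop-order sketch for the data-locality point (C++; row-major storage, as in C/C++, is assumed):

      // Summing a row-major matrix: putting j in the inner loop walks memory
      // contiguously (cache-friendly); swapping the loops jumps `cols` doubles
      // between accesses and wastes most of each cache line.
      #include <vector>
      #include <cstddef>

      double sum_row_order(const std::vector<double>& a, std::size_t rows, std::size_t cols) {
          double s = 0.0;
          for (std::size_t i = 0; i < rows; ++i)        // good: unit-stride inner loop
              for (std::size_t j = 0; j < cols; ++j)
                  s += a[i * cols + j];
          return s;
      }

      double sum_column_order(const std::vector<double>& a, std::size_t rows, std::size_t cols) {
          double s = 0.0;
          for (std::size_t j = 0; j < cols; ++j)        // bad: stride of `cols` doubles per access
              for (std::size_t i = 0; i < rows; ++i)
                  s += a[i * cols + j];
          return s;
      }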
- Give examples of NUMA allocation, cache effects, conditionals/branching, memory allocation cost, and false sharing
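- - a minimal false-sharing sketch for that last item (C++/OpenMP; the 64-byte cache line and the thread-count cap are assumptions; compile with -fopenmp):

      // Per-thread counters in adjacent array slots share a cache line, so every
      // update invalidates that line on the other cores (false sharing).
      // Giving each counter its own cache line removes the contention.
      #include <omp.h>
      #include <cstdio>

      struct alignas(64) Padded { volatile long value = 0; };  // one counter per cache line

      int main() {
          constexpr int kMaxThreads = 64;           // sketch assumption: <= 64 threads
          volatile long packed[kMaxThreads] = {};   // adjacent slots -> false sharing
          Padded padded[kMaxThreads];               // padded slots   -> no false sharing

          #pragma omp parallel                      // slow version: contended cache line
          {
              const int t = omp_get_thread_num();
              for (long i = 0; i < 50000000; ++i)
                  packed[t] = packed[t] + 1;        // volatile keeps the update in the loop
          }

          #pragma omp parallel                      // fast version: private cache lines
          {
              const int t = omp_get_thread_num();
              for (long i = 0; i < 50000000; ++i)
                  padded[t].value = padded[t].value + 1;
          }

          std::printf("%ld %ld\n", packed[0], padded[0].value);
      }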
- Vector libraries: Sleef, Agner Fog, HPX, Blaze
- Other libraries: Baobzi, libxsmm, Intel JIT,
- (module avail)
- OpenMP common pitfalls
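- - the classic pitfall, as a sketch (C++/OpenMP): variables declared outside the parallel region are shared by default, so an accumulator updated without a reduction is a data race:

      #include <omp.h>
      #include <vector>
      #include <cstdio>

      int main() {
          std::vector<double> x(1 << 20, 1.0);

          double racy = 0.0;
          #pragma omp parallel for                    // WRONG: unsynchronized updates to `racy`
          for (long i = 0; i < (long)x.size(); ++i)
              racy += x[i];

          double sum = 0.0;
          #pragma omp parallel for reduction(+ : sum) // RIGHT: per-thread copies combined at the end
          for (long i = 0; i < (long)x.size(); ++i)
              sum += x[i];

          std::printf("racy=%g correct=%g\n", racy, sum);
      }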
- Be roughly aware of how expensive different operations are: avoid divisions, exp, trig, and other special functions where possible
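- - a small sketch of the division point (C++); whether the compiler already does this depends on flags like -ffast-math, and the two versions can differ in the last bits because rounding differs:

      // Repeated division by the same value: compute the reciprocal once and
      // multiply. Division has much higher latency and lower throughput than
      // multiplication and often blocks vectorization of the loop.
      #include <vector>

      void scale_div(std::vector<double>& x, double d) {
          for (double& v : x) v /= d;      // one division per element
      }

      void scale_mul(std::vector<double>& x, double d) {
          const double inv = 1.0 / d;      // one division total
          for (double& v : x) v *= inv;    // cheap, vectorizable multiplies
      }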
- low-level-optimization: benchmark, benchmark, benchmark!
- https://www.embedded.com/common-multicore-programming-problems-part-4-memory-cache-issues-and-consistency/
- SoA vs. AoS (structure of arrays vs. array of structures)
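- - layout sketch (C++); a loop that reads only one field streams contiguous memory with SoA, but with AoS it loads whole structs and wastes most of each cache line:

      #include <vector>

      // Array of Structures: x, y, z, mass interleaved in memory.
      struct ParticleAoS { double x, y, z, mass; };
      using AoS = std::vector<ParticleAoS>;

      // Structure of Arrays: each field is contiguous.
      struct ParticlesSoA {
          std::vector<double> x, y, z, mass;
      };

      double sum_x_aos(const AoS& p) {       // loads 32 bytes per particle to use 8
          double s = 0.0;
          for (const auto& q : p) s += q.x;
          return s;
      }

      double sum_x_soa(const ParticlesSoA& p) {  // dense, unit-stride, vectorizes cleanly
          double s = 0.0;
          for (double v : p.x) s += v;
          return s;
      }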
- gcc -march=native -E -v - </dev/null 2>&1 | grep march  (shows what -march=native resolves to on this machine)
- Polynomial Evaluation on Superscalar Architecture, Applied to the Elementary Function e^x
- https://dl.acm.org/doi/fullHtml/10.1145/3408893
- https://gavinreynolds.scot/docs/msc-dissertation.pdf
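- - the idea in the paper above is to shorten the dependency chain so a superscalar core can overlap independent multiply-adds; a hedged sketch for a degree-7 polynomial (the coefficient array c is a placeholder, not the paper's scheme verbatim):

      // Horner: 7 dependent multiply-add steps -> latency-bound.
      double horner7(const double c[8], double x) {
          double r = c[7];
          for (int i = 6; i >= 0; --i) r = r * x + c[i];
          return r;
      }

      // Estrin-style: pair coefficients so independent multiply-adds can issue
      // in parallel; same polynomial, much shorter dependency chain.
      double estrin7(const double c[8], double x) {
          const double x2 = x * x, x4 = x2 * x2;
          const double p01 = c[0] + c[1] * x;
          const double p23 = c[2] + c[3] * x;
          const double p45 = c[4] + c[5] * x;
          const double p67 = c[6] + c[7] * x;
          const double q0 = p01 + p23 * x2;   // c0..c3 terms
          const double q1 = p45 + p67 * x2;   // c4..c7 terms (missing x^4 factor)
          return q0 + q1 * x4;
      }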
- Work-depth, work-span, work-time model, PRAM
- Lectures:
- https://www.cse.wustl.edu/~angelee/archive/cse341/fall14/handouts/lecture03.pdf
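- - the standard work-span bounds, for reference (LaTeX; T_1 = work, T_infinity = span/depth, p = processors):

      % any schedule:              T_p \ge T_1 / p  \quad\text{and}\quad  T_p \ge T_\infty
      % greedy scheduling (Brent): T_p \le \frac{T_1}{p} + T_\infty
      % parallelism (useful processor count / speedup bound): T_1 / T_\infty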