- - languages, parallel paradigms, hardware, applications, HPC libraries (vendor optimized libraries)
- - performance optimization looks very different for different codes
- Plan for performance!
- Faster scientific computing:
- - Improved algorithms (FFT, FMM, O(n^3) vs. Strassen matrix multiplication, approximations, precomputing, etc.)
- - Improved hardware utilization (usually parallelism, but not always: cache optimization / data locality)
- - sometimes a work-suboptimal method is faster because it makes better use of the available hardware (see the sketch below)
- - co-design?
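- - a small illustration of the "more work but better hardware use" point, as a sketch (C++; not from these notes, and the size at which the linear scan wins is machine-dependent):

      // For small sorted arrays, a branch-predictable linear scan (O(n) work)
      // often beats binary search (O(log n) work): it touches memory
      // sequentially and its loop branch is easy to predict.
      #include <vector>
      #include <cstddef>

      // O(n) comparisons, sequential access, well-predicted branch.
      std::size_t linear_lower_bound(const std::vector<int>& a, int key) {
          std::size_t i = 0;
          while (i < a.size() && a[i] < key) ++i;
          return i;  // index of first element >= key
      }

      // O(log n) comparisons, but each branch is hard to predict.
      std::size_t binary_lower_bound(const std::vector<int>& a, int key) {
          std::size_t lo = 0, hi = a.size();
          while (lo < hi) {
              std::size_t mid = lo + (hi - lo) / 2;
              if (a[mid] < key) lo = mid + 1; else hi = mid;
          }
          return lo;
      }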
- Hardware architecture overview
- - distributed
- - shared memory and caches: NUMA, cache lines (false sharing)
- - instruction-level parallelism: pipelining, vectorization
- - accelerators (GPUs - Wenda Zhou)
- Trends in HPC hardware and software
- Programming models for different architectures
- Profiling/instrumentation tools
- - sometimes need to design experiments to profile
- - perf stat (https://bluss.github.io/rust/2016/03/28/a-gemmed-rabbit-hole/)
- Parallelism, arithmetic intensity, roofline model, Amdahl's law
- strong and weak scaling
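- - the standard formulas, written out for reference (LaTeX; s = serial fraction, p = processors, I = arithmetic intensity in flops/byte, B = memory bandwidth, P_peak = peak flop rate):

      % Strong scaling (Amdahl's law): the serial fraction s bounds the speedup
      S_{\mathrm{strong}}(p) = \frac{1}{s + (1 - s)/p} \le \frac{1}{s}

      % Weak scaling (Gustafson): the problem size grows with p
      S_{\mathrm{weak}}(p) = s + (1 - s)\,p

      % Roofline: attainable performance at arithmetic intensity I
      P(I) = \min\bigl(P_{\mathrm{peak}},\; I \cdot B\bigr)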
- Cilk, CUDA, OpenMP, Threading Building Blocks
- Effect of programming languages:
- - interpreted
- - slow when processing element by element
- - but can be fast/efficient when operating on large blocks.
- - e.g. MATLAB
- - JIT
- - https://www.intel.com/content/www/us/en/developer/articles/technical/onemkl-improved-small-matrix-performance-using-just-in-time-jit-code.html
- - Julia
- - low-level
- - C/C++, FORTRAN, assembly
- Parallel debugging?
- Performance may vary across processors, compilers, operating systems:
- - but a well-thought-out algorithm and well-written code will generally perform well
- Plot memory-bandwidth & latency improvements vs. flop-rate improvements over the last two decades
- - latency is expensive: avoid random (unpredictable) memory accesses
- - main-memory bandwidth is limited: reuse data already in the caches
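- - loop-order sketch for the data-locality point (C++; row-major storage, as in C/C++, is assumed):

      // Summing a row-major matrix: putting j in the inner loop walks memory
      // contiguously (cache-friendly); swapping the loops jumps `cols` doubles
      // between accesses and wastes most of each cache line.
      #include <vector>
      #include <cstddef>

      double sum_row_order(const std::vector<double>& a, std::size_t rows, std::size_t cols) {
          double s = 0.0;
          for (std::size_t i = 0; i < rows; ++i)        // good: unit-stride inner loop
              for (std::size_t j = 0; j < cols; ++j)
                  s += a[i * cols + j];
          return s;
      }

      double sum_column_order(const std::vector<double>& a, std::size_t rows, std::size_t cols) {
          double s = 0.0;
          for (std::size_t j = 0; j < cols; ++j)        // bad: stride of `cols` doubles per access
              for (std::size_t i = 0; i < rows; ++i)
                  s += a[i * cols + j];
          return s;
      }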
- Give examples of NUMA allocation, cache effects, conditionals/branching, memory allocation cost, and false sharing
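- - a minimal false-sharing sketch for that last item (C++/OpenMP; the 64-byte cache line and the thread-count cap are assumptions; compile with -fopenmp):

      // Per-thread counters in adjacent array slots share a cache line, so every
      // update invalidates that line on the other cores (false sharing).
      // Giving each counter its own cache line removes the contention.
      #include <omp.h>
      #include <cstdio>

      struct alignas(64) Padded { volatile long value = 0; };  // one counter per cache line

      int main() {
          constexpr int kMaxThreads = 64;           // sketch assumption: <= 64 threads
          volatile long packed[kMaxThreads] = {};   // adjacent slots -> false sharing
          Padded padded[kMaxThreads];               // padded slots   -> no false sharing

          #pragma omp parallel                      // slow version: contended cache line
          {
              const int t = omp_get_thread_num();
              for (long i = 0; i < 50000000; ++i)
                  packed[t] = packed[t] + 1;        // volatile keeps the update in the loop
          }

          #pragma omp parallel                      // fast version: private cache lines
          {
              const int t = omp_get_thread_num();
              for (long i = 0; i < 50000000; ++i)
                  padded[t].value = padded[t].value + 1;
          }

          std::printf("%ld %ld\n", packed[0], padded[0].value);
      }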
- Vector libraries: Sleef, Agner Fog, HPX, Blaze
- Other libraries: Baobzi, libxsmm, Intel JIT,
- (module avail)
- OpenMP common pitfalls
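- - the classic pitfall, as a sketch (C++/OpenMP): variables declared outside the parallel region are shared by default, so an accumulator updated without a reduction is a data race:

      #include <omp.h>
      #include <vector>
      #include <cstdio>

      int main() {
          std::vector<double> x(1 << 20, 1.0);

          double racy = 0.0;
          #pragma omp parallel for                    // WRONG: unsynchronized updates to `racy`
          for (long i = 0; i < (long)x.size(); ++i)
              racy += x[i];

          double sum = 0.0;
          #pragma omp parallel for reduction(+ : sum) // RIGHT: per-thread copies combined at the end
          for (long i = 0; i < (long)x.size(); ++i)
              sum += x[i];

          std::printf("racy=%g correct=%g\n", racy, sum);
      }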
- Be roughly aware of how expensive different operations are: avoid divisions, exp, trig, and other special functions where possible
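- - a small sketch of the division point (C++); whether the compiler already does this depends on flags like -ffast-math, and the two versions can differ in the last bits because rounding differs:

      // Repeated division by the same value: compute the reciprocal once and
      // multiply. Division has much higher latency and lower throughput than
      // multiplication and often blocks vectorization of the loop.
      #include <vector>

      void scale_div(std::vector<double>& x, double d) {
          for (double& v : x) v /= d;      // one division per element
      }

      void scale_mul(std::vector<double>& x, double d) {
          const double inv = 1.0 / d;      // one division total
          for (double& v : x) v *= inv;    // cheap, vectorizable multiplies
      }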
- low-level-optimization: benchmark, benchmark, benchmark!
- https://www.embedded.com/common-multicore-programming-problems-part-4-memory-cache-issues-and-consistency/
- SoA vs. AoS (structure of arrays vs. array of structures)
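- - layout sketch (C++); a loop that reads only one field streams contiguous memory with SoA, but with AoS it loads whole structs and wastes most of each cache line:

      #include <vector>

      // Array of Structures: x, y, z, mass interleaved in memory.
      struct ParticleAoS { double x, y, z, mass; };
      using AoS = std::vector<ParticleAoS>;

      // Structure of Arrays: each field is contiguous.
      struct ParticlesSoA {
          std::vector<double> x, y, z, mass;
      };

      double sum_x_aos(const AoS& p) {       // loads 32 bytes per particle to use 8
          double s = 0.0;
          for (const auto& q : p) s += q.x;
          return s;
      }

      double sum_x_soa(const ParticlesSoA& p) {  // dense, unit-stride, vectorizes cleanly
          double s = 0.0;
          for (double v : p.x) s += v;
          return s;
      }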
- gcc -march=native -E -v - </dev/null 2>&1 | grep march  (shows what -march=native resolves to on this machine)
- Polynomial Evaluation on Superscalar Architecture, Applied to the Elementary Function e^x
- https://dl.acm.org/doi/fullHtml/10.1145/3408893
- https://gavinreynolds.scot/docs/msc-dissertation.pdf
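- - the idea in the paper above is to shorten the dependency chain so a superscalar core can overlap independent multiply-adds; a hedged sketch for a degree-7 polynomial (the coefficient array c is a placeholder, not the paper's scheme verbatim):

      // Horner: 7 dependent multiply-add steps -> latency-bound.
      double horner7(const double c[8], double x) {
          double r = c[7];
          for (int i = 6; i >= 0; --i) r = r * x + c[i];
          return r;
      }

      // Estrin-style: pair coefficients so independent multiply-adds can issue
      // in parallel; same polynomial, much shorter dependency chain.
      double estrin7(const double c[8], double x) {
          const double x2 = x * x, x4 = x2 * x2;
          const double p01 = c[0] + c[1] * x;
          const double p23 = c[2] + c[3] * x;
          const double p45 = c[4] + c[5] * x;
          const double p67 = c[6] + c[7] * x;
          const double q0 = p01 + p23 * x2;   // c0..c3 terms
          const double q1 = p45 + p67 * x2;   // c4..c7 terms (missing x^4 factor)
          return q0 + q1 * x4;
      }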
- Work-depth, work-span, work-time model, PRAM
- Lectures:
- https://www.cse.wustl.edu/~angelee/archive/cse341/fall14/handouts/lecture03.pdf
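- - the standard work-span bounds, for reference (LaTeX; T_1 = work, T_infinity = span/depth, p = processors):

      % any schedule:              T_p \ge T_1 / p  \quad\text{and}\quad  T_p \ge T_\infty
      % greedy scheduling (Brent): T_p \le \frac{T_1}{p} + T_\infty
      % parallelism (useful processor count / speedup bound): T_1 / T_\infty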