- languages, parallel paradigms, hardware, applications, HPC libraries (vendor-optimized libraries)
- performance optimization looks very different for different codes
Plan for performance!
Faster scientific computing:
- improved algorithms (FFT, FMM, O(n^3) vs. Strassen matrix multiplication, approximations, precomputation, etc.)
- improved hardware utilization (usually parallelism, but not always: cache optimization / data locality)
- sometimes a work-suboptimal method may be faster because it is better at utilizing the available hardware
- co-design?
Hardware architecture overview
- distributed memory
- shared memory and caches: NUMA, cache lines (false sharing)
- instruction-level parallelism: pipelining, vectorization (see the SIMD sketch after this list)
- accelerators (GPUs - Wenda Zhou)
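
The SIMD sketch referenced above: a minimal example, assuming an OpenMP-capable compiler (e.g. g++ -O3 -fopenmp-simd); the saxpy kernel is just illustrative:

    #include <cstddef>

    // saxpy: y[i] += a * x[i]. The iterations are independent, so the
    // compiler can issue SIMD instructions and keep the pipeline full.
    void saxpy(std::size_t n, float a, const float* x, float* y) {
        #pragma omp simd
        for (std::size_t i = 0; i < n; ++i)
            y[i] += a * x[i];
    }
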
Trends in HPC hardware and software
Programming models for different architectures
Profiling/instrumentation tools
- sometimes need to design experiments to profile
- perf stat (https://bluss.github.io/rust/2016/03/28/a-gemmed-rabbit-hole/)
Parallelism, arithmetic intensity, roofline model, Amdahl's law
Strong and weak scaling
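
In formulas (standard statements, with f the parallelizable fraction, p the number of processors, and T(p) the runtime on p processors):

    % Amdahl's law: upper bound on strong-scaling speedup
    S(p) = \frac{1}{(1 - f) + f/p}, \qquad \lim_{p \to \infty} S(p) = \frac{1}{1 - f}
    % strong scaling (fixed problem size) and parallel efficiency
    S(p) = \frac{T(1)}{T(p)}, \qquad E(p) = \frac{S(p)}{p}
    % weak scaling: work per processor held fixed, ideal efficiency is 1
    E_{\mathrm{weak}}(p) = \frac{T(1)}{T(p)}
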
Cilk, CUDA, OpenMP, Threading Building Blocks
Effect of programming languages:
- interpreted
  - slow when processing element by element
  - but can be fast/efficient when operating on large blocks
  - e.g. MATLAB
- JIT
  - https://www.intel.com/content/www/us/en/developer/articles/technical/onemkl-improved-small-matrix-performance-using-just-in-time-jit-code.html
  - Julia
- low-level
  - C/C++, Fortran, assembly
Parallel debugging?
Performance may vary across processors, compilers, and operating systems:
- but a well-thought-out algorithm and well-written code will generally perform well
Plot memory-bandwidth and latency gains alongside flops over the last two decades
- latency is expensive: avoid random (unpredictable) memory accesses
- main-memory bandwidth is limited: reuse data in caches (see the loop-order sketch below)
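
A minimal data-locality sketch (C/C++ stores 2-D arrays row-major; the flat-vector layout and the name sum_row_major are just for illustration):

    #include <cstddef>
    #include <vector>

    // Sums an n x n matrix stored row-major in a flat vector.
    // Rows outer / columns inner walks memory contiguously and reuses
    // each cache line; swapping the loops strides by n doubles per
    // access and is typically much slower for large n.
    double sum_row_major(const std::vector<double>& a, std::size_t n) {
        double s = 0.0;
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < n; ++j)
                s += a[i * n + j];
        return s;
    }
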
Give examples of NUMA allocation, caches, conditionals, memory allocations, and false sharing
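
A minimal false-sharing sketch, assuming OpenMP, a 64-byte cache line, C++17 (for the over-aligned vector elements), and that each counts vector holds at least one slot per thread; compile with e.g. g++ -O2 -fopenmp. Names are illustrative:

    #include <omp.h>
    #include <vector>

    // Each thread increments its own counter, but neighboring counters
    // sit on the same cache line, so the line ping-pongs between cores.
    void count_false_sharing(long iters, std::vector<long>& counts) {
        #pragma omp parallel
        {
            const int t = omp_get_thread_num();
            for (long i = 0; i < iters; ++i)
                ++counts[t];              // adjacent elements: false sharing
        }
    }

    // Padding each counter to its own cache line removes the contention.
    struct alignas(64) PaddedCount { long value = 0; };

    void count_padded(long iters, std::vector<PaddedCount>& counts) {
        #pragma omp parallel
        {
            const int t = omp_get_thread_num();
            for (long i = 0; i < iters; ++i)
                ++counts[t].value;        // one cache line per thread
        }
    }
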
Vector libraries: SLEEF, Agner Fog's VCL (vectorclass), HPX, Blaze
Other libraries: Baobzi, libxsmm, Intel oneMKL JIT (module avail)
Common OpenMP pitfalls
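
One classic pitfall as a minimal sketch: accumulating into a shared variable inside a parallel for is a data race; the reduction clause is the fix (dot is just an illustrative kernel):

    #include <cstddef>

    double dot(const double* x, const double* y, std::size_t n) {
        double s = 0.0;
        // WRONG: with a plain "#pragma omp parallel for", every thread
        // updates the shared 's' concurrently -- a data race.
        // RIGHT: reduction gives each thread a private copy and combines
        // them at the end.
        #pragma omp parallel for reduction(+ : s)
        for (std::size_t i = 0; i < n; ++i)
            s += x[i] * y[i];
        return s;
    }
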
Be roughly aware of how expensive different operations are: avoid divisions, exp, trig functions, and other special functions.
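
A small sketch of the division point: hoist a reciprocal out of the loop (assuming the usual floating-point rounding caveats are acceptable; scale is an illustrative name):

    #include <cstddef>

    void scale(double* x, std::size_t n, double h) {
        const double inv_h = 1.0 / h;     // one division instead of n
        for (std::size_t i = 0; i < n; ++i)
            x[i] *= inv_h;                // multiplies are much cheaper
    }
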
Low-level optimization: benchmark, benchmark, benchmark!
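
A minimal benchmarking sketch with std::chrono: repeat the kernel, keep the best time, and make sure the result is actually used so the compiler cannot drop the work (time_best_seconds is an illustrative helper, not a library routine):

    #include <algorithm>
    #include <chrono>

    template <class F>
    double time_best_seconds(F&& f, int reps = 5) {
        double best = 1e300;
        for (int r = 0; r < reps; ++r) {
            const auto t0 = std::chrono::steady_clock::now();
            f();                          // the kernel under test
            const auto t1 = std::chrono::steady_clock::now();
            best = std::min(best, std::chrono::duration<double>(t1 - t0).count());
        }
        return best;
    }
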
https://www.embedded.com/common-multicore-programming-problems-part-4-memory-cache-issues-and-consistency/
SoA vs. AoS (structure of arrays vs. array of structures)
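
A minimal layout sketch (AoS vs. SoA for a particle update; the SoA form streams and vectorizes well when only some fields are touched; the types are illustrative):

    #include <cstddef>
    #include <vector>

    // Array of structures: x, y, z, mass interleaved in memory.
    struct ParticleAoS { double x, y, z, mass; };

    void push_aos(std::vector<ParticleAoS>& p, double dx) {
        for (std::size_t i = 0; i < p.size(); ++i)
            p[i].x += dx;                 // touches only every 4th double
    }

    // Structure of arrays: each field contiguous, cache lines fully used.
    struct ParticlesSoA { std::vector<double> x, y, z, mass; };

    void push_soa(ParticlesSoA& p, double dx) {
        for (std::size_t i = 0; i < p.x.size(); ++i)
            p.x[i] += dx;                 // unit stride, vectorizes cleanly
    }
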
gcc -march=native -E -v - </dev/null 2>&1 | grep march
Polynomial Evaluation on Superscalar Architecture, Applied to the Elementary Function e^x
https://dl.acm.org/doi/fullHtml/10.1145/3408893
https://gavinreynolds.scot/docs/msc-dissertation.pdf
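
A rough sketch of the superscalar angle (not code from the references): Horner's rule is a fully sequential dependency chain, while an Estrin-style split exposes independent operations the pipeline can overlap. Degree 7, coefficients c[0..7]:

    // Horner: each step depends on the previous one.
    double horner7(const double c[8], double x) {
        double r = c[7];
        for (int i = 6; i >= 0; --i)
            r = r * x + c[i];
        return r;
    }

    // Estrin-style: the pairwise terms are independent, trading a few
    // extra multiplies for instruction-level parallelism.
    double estrin7(const double c[8], double x) {
        const double x2 = x * x, x4 = x2 * x2;
        const double p01 = c[0] + c[1] * x;
        const double p23 = c[2] + c[3] * x;
        const double p45 = c[4] + c[5] * x;
        const double p67 = c[6] + c[7] * x;
        return (p01 + p23 * x2) + (p45 + p67 * x2) * x4;
    }
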
Work-depth, work-span, work-time model, PRAM
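
The standard bounds in that model, in LaTeX (T_1 = work, T_infinity = span/depth, p processors):

    % lower bounds on the p-processor running time
    T_p \ge T_1 / p, \qquad T_p \ge T_\infty
    % greedy-scheduler (Brent-type) upper bound and available parallelism
    T_p \le \frac{T_1}{p} + T_\infty, \qquad \text{parallelism} = \frac{T_1}{T_\infty}
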
Lectures:
- https://www.cse.wustl.edu/~angelee/archive/cse341/fall14/handouts/lecture03.pdf