Title:
HPC for scientific computing
Introduction to HPC
HPC: An overview
What Every Programmer Should Know About HPC
The art of HPC
How to address the diversity of user expectations in a single talk?
- languages, parallel paradigms, hardware, applications, HPC libraries (vendor-optimized libraries)
- performance optimization looks very different for different codes
Plan for performance!
Faster scientific computing:
- Improved algorithms (FFT, FMM, O(n^3) vs. Strassen matrix multiplication, approximations, precomputation, etc.)
- Improved hardware utilization (usually parallelism, but not always: also cache optimization / data locality)
- sometimes a work-suboptimal method may be faster because it is better at utilizing the available hardware
- co-design?
Hardware architecture overview
- distributed memory
- shared memory, caches: NUMA, cache lines (false sharing)
- instruction-level parallelism: pipelining, vectorization (see the sketch after this list)
- accelerators (GPUs - Wenda Zhou)
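A minimal vectorization sketch for the ILP/vectorization bullet above (a sketch, assuming a compiler with OpenMP SIMD support, e.g. -O3 -fopenmp-simd; the axpy name is just illustrative): independent, unit-stride iterations are exactly what the vector units want.

    #include <cstddef>

    // y <- a*x + y. Every iteration is independent and accesses memory with
    // unit stride, so the compiler can emit packed (SIMD) instructions.
    void axpy(std::size_t n, double a, const double* x, double* y) {
        #pragma omp simd
        for (std::size_t i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }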
Trends in HPC hardware and software
Programming models for different architectures
Profiling/instrumentation tools
- sometimes need to design experiments to profile
- perf stat (https://bluss.github.io/rust/2016/03/28/a-gemmed-rabbit-hole/)
Parallelism, arithmetic intensity, roofline model, Amdahl's law (formulas below)
strong and weak scaling
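Standard textbook forms of the two quantities named above (nothing machine-specific):

    Amdahl's law: speedup(N) = 1 / ((1 - p) + p/N), where p is the parallelizable fraction and N the number of processors; as N -> infinity the speedup approaches 1/(1 - p).
    Roofline: attainable FLOP/s = min(peak FLOP/s, arithmetic intensity [FLOP/byte] * peak memory bandwidth [byte/s]).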
Cilk, CUDA, OpenMP, Threading Building Blocks
Effect of programming languages:
- interpreted
  - slow when processing element by element
  - but can be fast/efficient when operating on large blocks
  - ex. MATLAB
- JIT
  - https://www.intel.com/content/www/us/en/developer/articles/technical/onemkl-improved-small-matrix-performance-using-just-in-time-jit-code.html
  - Julia
- low-level
  - C/C++, Fortran, assembly
Parallel debugging?
Performance may vary across processors, compilers, and operating systems:
- but a well-thought-out algorithm and well-written code will generally perform well
Plot memory bandwidth & latency improvements alongside FLOP/s improvements over the last two decades
- latency is expensive: avoid random (unpredictable) memory accesses
- main-memory bandwidth is limited: reuse data in caches
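A minimal sketch of the data-locality points above (function names are made up): both functions do identical arithmetic on a row-major n x n matrix, but one streams through memory while the other jumps a whole row between consecutive accesses.

    #include <cstddef>
    #include <vector>

    // Row-major traversal: consecutive accesses are adjacent in memory,
    // so each cache line fetched is fully used.
    double sum_row_major(const std::vector<double>& a, std::size_t n) {
        double s = 0.0;
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < n; ++j)
                s += a[i * n + j];
        return s;
    }

    // Column-major traversal of the same row-major data: consecutive
    // accesses are n*8 bytes apart, so once the matrix exceeds the cache
    // almost every access misses.
    double sum_col_major(const std::vector<double>& a, std::size_t n) {
        double s = 0.0;
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t i = 0; i < n; ++i)
                s += a[i * n + j];
        return s;
    }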
Give examples of NUMA allocation, caches, conditionals, memory allocations, false sharing
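A possible false-sharing illustration (a sketch assuming OpenMP; the 64-byte line size and iteration count are illustrative): each thread writes only its own counter, yet in the unpadded layout neighbouring counters share a cache line, so every increment forces coherence traffic between cores.

    #include <omp.h>
    #include <vector>

    struct Unpadded { long count; };                              // counters packed together
    struct Padded   { long count; char pad[64 - sizeof(long)]; }; // one counter per cache line

    template <class Slot>
    long count_up(long iters) {
        std::vector<Slot> slot(omp_get_max_threads());
        #pragma omp parallel
        {
            int t = omp_get_thread_num();
            for (long i = 0; i < iters; ++i)
                slot[t].count++;   // logically private, but may share a line
        }
        long total = 0;
        for (const auto& s : slot) total += s.count;
        return total;
    }

    // count_up<Unpadded>(100000000) and count_up<Padded>(100000000) return the
    // same answer; on most multicore machines the padded version is much faster.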
Vector libraries: SLEEF, Agner Fog's VCL, HPX, Blaze
Other libraries: Baobzi, libxsmm, Intel MKL JIT
(module avail)
OpenMP common pitfalls
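One classic pitfall as a sketch (assuming OpenMP, compiled with -fopenmp): accumulating into a shared variable without a reduction is a data race and silently produces wrong answers.

    #include <cstdio>

    int main() {
        const int n = 1000000;
        double sum = 0.0;

        // Pitfall: with plain "#pragma omp parallel for", 'sum' is shared and
        // the unsynchronized += races; the result is wrong and varies per run.

        // Fix: a reduction gives each thread a private partial sum that is
        // combined at the end.
        #pragma omp parallel for reduction(+ : sum)
        for (int i = 0; i < n; ++i)
            sum += 1.0 / (i + 1);

        std::printf("harmonic(1e6) ~= %.6f\n", sum);
        return 0;
    }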
Be roughly aware of how expensive different operations are: avoid divisions, exp, trig, and other special functions where possible.
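A small sketch of the "avoid divisions" point (function names are illustrative): hoist the division out of the loop and multiply by a precomputed reciprocal; note that x * (1/s) can differ from x / s in the last bit, which is acceptable in many codes but not all.

    #include <cstddef>
    #include <vector>

    // One division per element: the divide unit is slow and poorly pipelined.
    void scale_slow(std::vector<double>& x, double s) {
        for (std::size_t i = 0; i < x.size(); ++i)
            x[i] = x[i] / s;
    }

    // One division total; the per-element multiplies are cheap and vectorize.
    void scale_fast(std::vector<double>& x, double s) {
        const double inv = 1.0 / s;
        for (std::size_t i = 0; i < x.size(); ++i)
            x[i] = x[i] * inv;
    }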
Low-level optimization: benchmark, benchmark, benchmark!
https://www.embedded.com/common-multicore-programming-problems-part-4-memory-cache-issues-and-consistency/
SoA / AoS (structure of arrays vs. array of structures)
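A minimal AoS-vs-SoA sketch (the particle types and field names are made up): a kernel that touches only x drags the whole record through the cache in the AoS layout, but is a pure unit-stride stream in the SoA layout.

    #include <cstddef>
    #include <vector>

    // Array of structures: one record per particle.
    struct ParticleAoS { double x, y, z, mass; };

    void push_x_aos(std::vector<ParticleAoS>& p, double dx) {
        for (std::size_t i = 0; i < p.size(); ++i)
            p[i].x += dx;              // stride = sizeof(ParticleAoS) = 32 bytes
    }

    // Structure of arrays: each field stored contiguously.
    struct ParticlesSoA { std::vector<double> x, y, z, mass; };

    void push_x_soa(ParticlesSoA& p, double dx) {
        for (std::size_t i = 0; i < p.x.size(); ++i)
            p.x[i] += dx;              // stride = 8 bytes, vectorizes trivially
    }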
gcc -march=native -E -v - </dev/null 2>&1 | grep march   (shows what -march=native expands to on the current machine)
Polynomial Evaluation on Superscalar Architecture, Applied to the Elementary Function e^x
https://dl.acm.org/doi/fullHtml/10.1145/3408893
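A simplified sketch of the idea behind the paper above (not the paper's exact scheme): plain Horner is one long dependency chain, while splitting the polynomial into even and odd parts gives two shorter, independent chains that a superscalar core can overlap.

    // p(x) = c[0] + c[1]*x + ... + c[7]*x^7, evaluated two ways.

    // Horner: each step depends on the previous one.
    double horner(const double c[8], double x) {
        double r = c[7];
        for (int i = 6; i >= 0; --i)
            r = r * x + c[i];
        return r;
    }

    // Even/odd split: two independent Horner chains in x^2, combined at the end.
    double horner_even_odd(const double c[8], double x) {
        const double x2 = x * x;
        double even = c[6], odd = c[7];
        for (int i = 4; i >= 0; i -= 2) {
            even = even * x2 + c[i];       // chain 1
            odd  = odd  * x2 + c[i + 1];   // chain 2, independent of chain 1
        }
        return even + odd * x;
    }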
https://gavinreynolds.scot/docs/msc-dissertation.pdf
Work-depth / work-span model, work-time framework, PRAM
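Textbook example to anchor the definitions (not taken from the lecture linked below): summing n numbers with a balanced binary reduction tree has work W(n) = n - 1 additions and span/depth D(n) = ceil(log2 n), so Brent's bound gives T_P <= W/P + D on P processors.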
Lectures:
https://www.cse.wustl.edu/~angelee/archive/cse341/fall14/handouts/lecture03.pdf