Kurt Keutzer, EECS, University of California, Berkeley
Tim Mattson, Intel Research
Fall 2014
As the basic computing device in everything from individual cell phones to racks of hardware in cloud computing, parallel processors are emerging as the pervasive computing platform of our time. This course will enable advanced undergraduate students to design, implement, optimize, and verify programs that run on present generations of parallel processors.
Four principal themes are pursued in this course:
- Software Engineering
- Performance Programming
- Programming in Parallel Languages
- Course Project
Software Engineering and Software Architecture
Our approach to this course reflects our view that a well-designed software architecture is key to designing parallel software, and that design patterns and a pattern language are key to software architecture. The course will use Our Pattern Language as the basis for describing how to design, implement, verify, and optimize parallel programs. Following this approach, we will introduce each of the major patterns used in developing the high-level architecture of a program. Descriptions of these ten structural and thirteen computational patterns, together with other readings, may be found at https://patterns.eecs.berkeley.edu/.
Performance Programming
Writing efficient parallel programs requires insight into the hardware architecture of contemporary parallel processors as well as an understanding of how to write efficient code in general. As a result, a significant amount of time in the course will be spent looking “under the hood” of contemporary sequential processors and multiprocessors and identifying the key architectural details, such as non-uniform memory access (NUMA), that are necessary to write high-performance code.
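As one example of how such a detail surfaces in code, the sketch below relies on the usual first-touch page placement on Linux: pages of an array are physically placed on the NUMA node of the thread that first writes them, so initializing data with the same thread layout that later computes on it tends to improve memory bandwidth. The array size and the dot-product loop are illustrative assumptions, not course code.

    #include <cstdio>

    int main() {
        const long n = 1L << 25;       // ~256 MB of doubles per array (illustrative)
        double *a = new double[n];     // allocation alone does not place pages
        double *b = new double[n];

        // First touch: each thread initializes the chunk it will later work on,
        // so those pages land on that thread's local NUMA node.
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < n; ++i) { a[i] = 1.0; b[i] = 2.0; }

        // The same static schedule reuses the placement established above.
        double sum = 0.0;
        #pragma omp parallel for schedule(static) reduction(+ : sum)
        for (long i = 0; i < n; ++i) sum += a[i] * b[i];

        std::printf("dot = %.1f\n", sum);
        delete[] a;
        delete[] b;
        return 0;
    }

Built with g++ -O2 -fopenmp, the second loop typically runs noticeably faster than a version in which a single thread initializes both arrays.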
Programming in Parallel Languages
Other lectures and laboratories will focus on implementation in contemporary parallel programming languages, verification of parallel software using invariants and testing, and performance tuning and optimization. The languages covered typically include OpenMP, MPI, and OpenCL.
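As a taste of one of these languages, here is a minimal MPI sketch (assuming an MPI implementation such as Open MPI or MPICH is installed); it is illustrative only and not taken from the course labs. Each rank sums part of a range and the partial sums are combined with MPI_Reduce.

    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank = 0, size = 1;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        // Each rank sums its own block of [0, n); the blocks partition the range.
        const long n = 1000000;
        const long begin = rank * n / size;
        const long end = (rank + 1) * n / size;
        double local = 0.0;
        for (long i = begin; i < end; ++i)
            local += static_cast<double>(i);

        // Combine the partial sums on rank 0.
        double total = 0.0;
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            std::printf("sum = %.0f\n", total);

        MPI_Finalize();
        return 0;
    }

Built with mpicxx and launched with, for example, mpirun -np 4, the four processes each compute a quarter of the range and rank 0 prints the combined result.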
Course Projects
The final third of the course is devoted to an open-ended course project, which allows students to demonstrate their mastery of the concepts described above. Students will define their own projects and work in teams of 4-6 students.
Prerequisites
Students should have taken the following courses, or their equivalents:
- Basic programming course using Java, C or C++
- Undergraduate course on computer organization
- Linear algebra
It is recommended that students have taken:
- At least one upper division course that includes significant programming assignments (e.g. Compilers, Operating Systems, or Software Engineering)
Course Work and Grading
The course consists of twice-weekly lectures and a weekly lab session. During the first two-thirds of the course there will be a series of programming assignments, and there will be two examinations.
Course Staff
Professor: Kurt Keutzer
Guest Lecturer: Tim Mattson, Intel
TAs: Paden Tomasello, Peter Jin
Recommended Course Textbook
Patterns for Parallel Programming, T. Mattson, B. Sanders, and B. Massingill, Addison-Wesley, 2005.
Course Assignments Will Be Selected from the Following List
- Computer Architecture – Measure L1/L2/L3 bandwidth and latency on our lab machines. Also, investigate measured instruction-level parallelism (ILP) for a handful of different SGEMM implementations: performance in MFlop/s increases, but ILP drops. Also serves as a warm-up/refresher for the small subset of C++ we use for the lab assignments. Follows the material from lecture 3 (sequential processor performance).
- Parallel Matrix Multiply (DGEMM) – Write a naive parallel DGEMM using OpenMP for loops, OpenMP tasks, and Pthreads; a sketch of the OpenMP version appears after this list. Serves as a simple warm-up for the basic threading libraries. An advanced question asks how GCC converts code with OpenMP pragmas into parallel code. Follows the material from lectures 2 and 4 (parallel programming on shared-memory computers).
- Optimize Matrix Multiply (DGEMM) – Optimize the naive parallel matrix multiply for both locality and data parallelism (using SSE2); a cache-blocking sketch appears after this list. Students become familiar with SSE2 intrinsics, which they may want to use in their final projects. Follows the material from lectures 6 and 8 (memory-subsystem performance).
- Introduction to OpenCL – Students write both VVADD and SGEMM kernels in OpenCL. Follows lectures 9 and 10 (data parallelism and CUDA).
- OpenCL + OpenGL – Students perform a handful of simple graphics operations on an image. Follows lectures 9 and 10 (data parallelism and CUDA).
- Advanced OpenCL – Students write a reduction routine using the ideas presented in class, and array compaction using scan. Follows lectures 9 and 10 (data parallelism and CUDA).
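For concreteness, the following is a minimal sketch of the kind of naive parallel DGEMM the second assignment asks for, with the outer loop parallelized by an OpenMP "parallel for". The row-major layout and the function name dgemm_naive are illustrative assumptions, not part of the assignment handout.

    #include <vector>

    // C = A * B for square n x n matrices stored row-major in length n*n vectors.
    // Each OpenMP thread owns a disjoint set of rows of C, so no synchronization
    // is needed.
    void dgemm_naive(int n, const std::vector<double> &A,
                     const std::vector<double> &B, std::vector<double> &C) {
        #pragma omp parallel for
        for (int i = 0; i < n; ++i) {
            for (int j = 0; j < n; ++j) {
                double sum = 0.0;
                for (int k = 0; k < n; ++k)
                    sum += A[i * n + k] * B[k * n + j];
                C[i * n + j] = sum;
            }
        }
    }

Compiled with g++ -fopenmp, the iterations of the outer loop are divided among threads; because each entry of C is written by exactly one thread, the loop needs no locks or atomics.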
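And here is a minimal sketch of the locality side of the third assignment: cache blocking, with vectorization (e.g., SSE2 intrinsics) left to be layered on top. The block size BS and the function name are illustrative assumptions; a real submission would tune BS for the lab machines.

    #include <algorithm>
    #include <vector>

    // C += A * B for square n x n row-major matrices, computed block by block so
    // the working set of the inner loops stays in cache. Assumes C starts zeroed.
    // Each (ii, jj) block of C is owned by one thread, so there are no write
    // conflicts across threads.
    void dgemm_blocked(int n, const std::vector<double> &A,
                       const std::vector<double> &B, std::vector<double> &C) {
        const int BS = 64;  // illustrative block edge; tune for the lab machines
        #pragma omp parallel for collapse(2)
        for (int ii = 0; ii < n; ii += BS)
            for (int jj = 0; jj < n; jj += BS)
                for (int kk = 0; kk < n; kk += BS)
                    for (int i = ii; i < std::min(ii + BS, n); ++i)
                        for (int k = kk; k < std::min(kk + BS, n); ++k) {
                            const double aik = A[i * n + k];
                            for (int j = jj; j < std::min(jj + BS, n); ++j)
                                C[i * n + j] += aik * B[k * n + j];
                        }
    }

The design choice here is to keep three BS x BS blocks resident in cache while the innermost loops run, trading a little loop overhead for far fewer misses to main memory.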
Syllabus, Fall 2014: Classes are at 2:00–3:30 PM PDT/PST