December 26, 2006
What’s the story with Photoshop & multi-core?
Much has been written about the fact that the speed of individual CPU cores isn’t increasing at the rate it did from 1980 through 2004 or so. Instead, chip makers are now turning to multi-core designs to boost performance. (See this brief primer from Jason Snell at Macworld.) Thus a lot of people have been asking whether Photoshop takes advantage of these new systems. The short answer is yes, Photoshop has included optimizations for multi-processor machines (of which multi-core systems are a type) for many years.
What may not be obvious to a non-engineer like me, however, is that not all operations can or should be split among multiple cores, as doing so can actually make them slower. Because memory bandwidth hasn’t kept pace with CPU speed (see Scott Byer’s 64-bit article for more info), the cost of moving data to and from each CPU can be significant. To borrow a factory metaphor from Photoshop co-architect Russell Williams, "The workers run out of materials & end up standing around." The memory bottleneck means that multi-core can’t make everything faster, and we’ll need to think about doing new kinds of processing specifically geared towards heavy computing/low memory usage.
Because Russell has forgotten more than I will ever know about this stuff, I’ve asked him to share some info and insights in the extended entry. Read on for more.
Intel-based architectures don’t necessarily add memory bandwidth as they add cores. A single CPU on a system with limited memory bandwidth can often saturate the memory bandwidth if it just moves a big chunk of memory from here to there. It even has time to do several arithmetic operations in between and still saturate the memory. If your system is bandwidth-limited and the operation you want to do involves moving a big chunk of data (bigger than the caches) from here to there while doing a limited number of arithmetic operations on it, adding cores cannot speed it up no matter how clever the software is. Many Photoshop operations are in this category, for instance.
AMD’s architecture adds memory bandwidth as you add CPU chips, but taking advantage of it can be dependent on placement of the data into different areas of physical RAM attached to the different chips. It doesn’t do any good if all your data gets put into one of the memory banks; then you’re right back where you started. So, the memory system and how it’s used will have a big effect on how many things speed up when you add more cores to a computer.
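The bandwidth argument above can be put into a rough back-of-the-envelope model. The sketch below is only an illustration: the function name, the per-core compute rate, and the shared memory bandwidth are made-up numbers, not measurements of any real machine or of Photoshop. It assumes every core shares one fixed memory bus, so throughput is capped by whichever runs out first, the cores' arithmetic or the bytes the bus can deliver.

```python
def attainable_speedup(cores, flops_per_byte, core_gflops=10.0, mem_gb_per_s=10.0):
    """Speedup over one core when all cores share a fixed memory bandwidth.

    All parameters are hypothetical illustration values:
      flops_per_byte -- arithmetic done per byte moved to or from main memory
      core_gflops    -- compute rate of a single core
      mem_gb_per_s   -- total memory bandwidth, which does NOT grow with cores
    """
    def throughput(n):
        compute_limit = n * core_gflops               # what n cores could compute
        memory_limit = mem_gb_per_s * flops_per_byte  # what the memory bus can feed
        return min(compute_limit, memory_limit)

    return throughput(cores) / throughput(1)


# A copy-like operation (little arithmetic per byte) versus a compute-heavy one.
for intensity in (0.5, 4.0, 100.0):
    print(intensity, attainable_speedup(8, intensity))
```

With these assumed numbers, an operation doing very little arithmetic per byte gains nothing from eight cores, while one doing a lot of arithmetic per byte scales with the core count.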
The other issue is Amdahl’s Law, described by computer architect Gene Amdahl in the 1960s. Almost all algorithms that can be parallelized also have some portion that must be done sequentially — setup (deciding how to divide the problem up among multiple cores) or synchronization, or collecting and summarizing the results. At those times each step depends on the step before being completed. As you add processors and speed up the parallel part, the sequential part inevitably takes up a larger percentage of the time. If 10% of the problem is sequential, then even if you add an infinite number of processors and get the other 90% of the problem done in zero time, you can achieve at most a 10X speedup. And some algorithms are just really hard or impossible to parallelize: calculating text layout on a page is a commonly cited example.
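As a small worked example of Amdahl’s Law, the sketch below uses the standard formula, speedup = 1 / (s + (1 - s) / n), where s is the sequential fraction and n is the number of processors. The function name is just an illustrative choice, and the 10% figure is the one used in the paragraph above.

```python
def amdahl_speedup(sequential_fraction, processors):
    """Ideal speedup when only the parallel part benefits from more processors."""
    parallel_fraction = 1.0 - sequential_fraction
    return 1.0 / (sequential_fraction + parallel_fraction / processors)


# With 10% sequential work, the speedup approaches 10x but never reaches it.
for n in (1, 2, 4, 8, 16, 1024):
    print(f"{n:5d} processors: {amdahl_speedup(0.10, n):.2f}x")
```

Even at 16 processors the ideal speedup in this example is only 6.4x, and that is before any memory bandwidth limits are taken into account.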
These two basic issues are why the giant massively parallel machines have RAM attached to each node and are used to solve only a small set of specially selected, specially coded problems — usually ones where the parallel part of the problem itself has been scaled up to enormous sizes. As the number of cores goes up, the likelihood that a particular problem will hit one of the above limits goes up.
Why does video rendering scale better than Photoshop? Rendering video is typically done by taking some source image material for a frame and performing a stack of adjustments and filters on it. Each frame is only a few hundred thousand pixels (for standard definition) or at most 2 megapixels or 8MB in 8-bit (for HD). Thus, particularly for standard definition images, the cache gives a lot more benefit as a sequence of operations is performed on each frame: for each frame, you fetch the data, do several operations, and write the final result. Different frames can usually be rendered in parallel, one per processor, so each processor does a fair chunk of computation for each byte read from or written to memory.
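The frame sizes quoted above can be checked with simple arithmetic. The sketch below assumes frame dimensions of 720x480 for standard definition and 1920x1080 for HD, with four 8-bit channels per pixel; those specifics are assumptions chosen to match the "few hundred thousand pixels" and "2 megapixels or 8MB" figures, not values stated in the text.

```python
def frame_bytes(width, height, channels=4, bytes_per_channel=1):
    """Uncompressed size of one frame in bytes (channel count is an assumption)."""
    return width * height * channels * bytes_per_channel


# Assumed frame dimensions: 720x480 (SD) and 1920x1080 (HD).
for name, (w, h) in {"SD": (720, 480), "HD": (1920, 1080)}.items():
    pixels = w * h
    size_mb = frame_bytes(w, h) / 1_000_000
    print(f"{name}: {pixels:,} pixels, about {size_mb:.1f} MB per 8-bit frame")
```

Under these assumptions an SD frame is about 1.4 MB and an HD frame about 8.3 MB, which is small enough that a stack of operations on one frame has a much better chance of reusing data already in the cache.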
By contrast, in Photoshop most time-consuming operations are performed on a single image layer, and the problem is the size of that layer: 30MB for an 8-bit image from a 10MP digital camera, or 60MB if you keep all the information by converting the raw file to 16-bit. If you’ve merged some Canon 1DSMkII images to HDR, that’s over 200MB. And of course the people most concerned with speeding up Photoshop with more cores are the ones with the giant images. When you run a Gaussian Blur on that giant image, the processor has to read all of it from memory, perform relatively few calculations, and then write the result into newly allocated memory (so you can undo it). You can work on different pieces of the image on different processors, but you’re not doing nearly as much computation on each byte fetched from memory as in the video case. The operations that scale best in Photoshop are those that:
- Do a lot of computation for each pixel fetched. Shadow/Highlight correction is an example of an operation that has to do a lot of computation on each byte fetched, while normal blending does very little. A giant-radius blur is an example of the opposite extreme: lots of pixels have to be fetched to do a simple computation and produce one output pixel.
- Do pixel-based operations that take advantage of Photoshop’s framework for parallel computation. Most filters and adjustments fall into this category. But many text tool operations and the solution of partial differential equations required for the healing brush are examples of things that don’t fit this model.
To take good advantage of 8- or 16-core machines (for things other than servers), we’ll need machines whose memory bandwidth increases with the number of cores, and we’ll need problems that depend on doing relatively large amounts of computation for each byte fetched from main memory (yes, re-reading the same data you’ve already fetched into the caches counts). Complex video and audio signal processing are good examples of these kinds of tasks. And we’re always looking for more useful things that Photoshop can do that are more computationally intensive.
— Russell Williams