AMD Master Plan Pt.2 – Heterogeneous Revolution
Originally published on August 26 2020
The death of Dennard scaling and the rising of the phenomenon of Dark silicon took the IC industry to a crossroads: How to keep improving microprocessors with current power constraints without excessively increasing heat densities on the chip, which could decrease its reliability and useful life? How can we continue to improve single and multithreaded performance without increasing the cooling cost of the processor excessively, and thus continue to bring innovations to environments with significant cooling limitations? In the real world where no financial horsepower is capable of breaking the laws of physics, a fundamental rethink in architectural projects is necessary step to remain capable of bringing innovation to the world, even with such limitations.
Therefore, in 2012, AMD dared to bring a new approach to the development of its architecture to the world: The Heterogeneous Systems Architecture, which using virtual memory, memory coherence, architected dispatch mechanisms, would allow GPU to operate as a true peer of the host CPU, making the CPU and GPU work together much more efficiently and in a way that it was much easier to write applications that took advantage of both. However, AMD's plans have been stalled because of the Bulldozer's absolute failure and many poor management decisions that have seriously put the company's future at risk. Having as only option "innovate to survive", AMD continued to work on the development of its heterogeneous vision, and now, through its most recent published patents, we can get a real insight into the size of the revolution AMD is planning to bring to the world.
AMD heterogeneous CPU - Enabling Multi-ISA heterogeneity using x86 as ISA superset.
The first major disruptive movement that AMD is bringing in its patents is the development of a new heterogeneous CPU, which takes the level of heterogeneity to the limit. Unlike ARM big.LITTLE, this new heterogeneous processor has as main feature to bring a big "high-feature" processor configured to support entirety the set of x86 ISA features and a little "low-feature" processor configured to support just a compact subset of the x86 ISA features (fig. 1). The way in which this new heterogeneous processor is described in that patent reminded me of a brilliant work presented last year in HPCA, which was even included in the symposium's best paper nominees (fig. 2 & 3).
In the HPCA2019, researchers have proposed a composite ISA architecture that employs compact cores implementing fully customized x86 ISAs, derived from a single large x86 ISA superset, showing that with this approach it was possible outperform fully heterogeneous-ISA designs, due to greatly increased flexibility in creating cores that mix and match specific sets of features. They proposed two different sets of designs: In designs optimized for multithreaded mixed workloads, 18% performance improvement and 35% reduction in energy-delay product was achieved over single-ISA heterogeneous designs, without sacrificing most of the benefits of a single ISA. In designs optimized for single thread performance/efficiency, it was achieved a speedup of 20% and an EDP reduction of 28% on average.
Thus, looking at the results obtained from this massive design space exploration carried out in this work with due attention to all the proposed compiler and runtime strategies, we can have a indicative insight of what we can expect in terms of performance improvement and power consumption reduction that future AMD heterogeneous processors will bring to the mobile market. In addition, there is already a wide range of related patents covering pipeline recovery (fig. 4), cache coherence (fig. 5) and wakeup latency (fig. 6) solutions that could be included in this new design approach and which could bring even greater improvements to the architecture as a whole.
However, AMD went even further in exploring new heterogeneous solutions for its products. In the high-performance computing environment, where energy efficiency and chip utilization need to be pushed to the limit, AMD has also brought its heterogeneous revolution, and this time in GPUs.
Heterogeneous GPU - Maximizing chip utilization through the use of variable width SIMD units
In an even more impressive work, AMD filed a new GPU patent aiming to improve the chip utilization in its Exascale projects. As you may know, many GPU workloads are non-uniform and have numerous wavefronts with predicated-off threads. These instructions still take up space in the pipeline. Unfortunately, the predicated instructions take up space, waste power, produce heat, and produce no useful output. Even the most modern GPU microarchitectures are unable to cope with certain dynamic runtime behaviors which are very difficult to know at compile time.
Therefore, to solve this problem AMD have proposed a disruptive approach to push the chip utilization level to the limit: A new GPU architecture, in which its SIMD units have different numbers of ALUs, so that each SIMD unit can run a different number of threads (fig. 7). Thus, by providing a set of execution resources within each GPU compute unit tailored to a range of execution profiles, the GPU can handle irregular workloads more efficiently.
This approach also works very well with branch divergence in a wavefront. Because of branch divergence, some threads follow a control flow path, and other threads will not follow the control flow path, which means that many threads are predicated off. So effectively, there will only be a few subsets of threads running. When it is determined that the active threads can be run in a smaller width SIMD unit, then the threads will be moved to the smaller width SIMD unit, and any unused SIMD units will not be powered up. Likewise, if divergence of control or other problems reduces the number of active threads on a wavefront, the more restricted execution feature can also be more efficient.
To properly support this new GPU architecture, AMD has already filed two other patents: First, the systems, apparatuses, and methods for abstracting tasks in virtual memory identifier containers, a implementation of multiple instances proposed by AMD, which includes multi-tasking support in each pipeline stage, already mentioned and duly explained in my previous article (fig. 8), and a new methods for processing variable wavefront sizes on GPU, which provides the dynamic warp subdivision necessary for the proper use of the resources of this new architecture (fig. 9).
The arduous path for revolution…
A patent is a profound indicator of long-term R&D planning which, when we analyze its details, can provide valuable clues about the evolution path in the search for innovation. Anyone who ignores these clear signs and considers them insignificant for a proper analysis would demonstrate total ignorance about the process of developing an intellectual property as well as technological development as a whole.
That said, given the level of difficulty of the proposals presented and their reachability, it is possible to say that the evolution path that AMD is taking to bring its heterogeneous revolution to the world is arduous. Nothing presented here is necessarily a breaking news. Even in 2014, VIA already had a patent for an asymmetric multi-core processor, which was completely abandoned in subsequent years. The fundamental point discussed here is the huge effort that AMD is putting into the development of heterogeneous solutions that previously would have been avoided. It is very clear that AMD realized that the emergence of Dark Silicon could no longer simply be ignored for years to come and that there would be no Deus Ex Machina that could provide a disruptive solution to these problems. After all, the era of Silicon Tricks is over and the only way to not leave any transistors behind is to face challenges previously avoided to bring real innovation to the world in the future.
Some references and further readings:
Mednick, E. H., Mclellan, E., "Instruction subset implementation for low power operation", US10698472, 2020
Meng, J., Tarjan, D., Skadron, K., Dynamic Warp Subdivision for Integrated Branch and Memory Divergence Tolerance, ACM SIGARCH Computer Architecture News, 2010
Venkat, A., Basavaraj, H., Tullsen, DM., Composite-ISA Cores: Enabling Multi-ISA Heterogeneity Using a Single ISA, HPCA, 2019.
Shekhar Borkar and Andrew A. Chien, “The future of microprocessors”, Communications of the ACM - 2011