REclock: Reverse-engineer and implement NVA3/5/8 Voltage- and Frequency Scaling in Nouveau

NVIDIA graphics cards often support running at a variety of different performance "levels". This aids in reducing the power demand and heat dissipation of the devices when idle, while unleashing full potential under load. A performance level comprises the clock speed and voltage for several subcomponents in the GPU. The difference between the lowest and highest performance level can be as much as a factor of 10 in clock speed.

Despite hard work from many developers, reclocking support in Nouveau still has quite a few loose ends: engine reclocking is mostly in place but not always reliable, several routines related to memory reclocking are missing, and in general the actions required to perform voltage and frequency scaling are understood only partially or not at all. Because of this, NVIDIA GPUs driven by Nouveau are limited to using the boot speed and voltage only, severely limiting performance and usability.

For this project, I aim to tie these loose ends together for NVIDIA's NVA3/5/8 GPUs. I intend to fully reverse engineer several subcomponents related to voltage and frequency scaling, gain a full understanding of the clock tree, and use this knowledge to further improve the Nouveau voltage and frequency scaling implementation for these GPUs.

Personal information

My name is Roy Spliet. I hold a master's degree from Delft University of Technology (TU Delft) and plan to continue my academic career as a PhD student in computer architecture. My background includes kernel/driver development (nouveau, LITMUS^RT) and GPGPU programming in OpenCL.

My previous involvement in Nouveau led to successfully reverse engineering and implementing reclocking support for the memory-less NVIDIA NVAA and NVAC chipsets, alongside many contributions to memory reclocking for pre-NVC0 (Fermi) GPUs. For more details about my personal background, please consult http://roy.spliet.org.

Background

NVIDIA GPUs feature a complex multi-layer clock tree that allows for per-subcomponent alteration of clock speeds. The precise clock tree is a complex network consisting of one or more input clocks, several fixed dividers, and a lot of routing to distribute these clocks to every subcomponent. On the last level there is usually a Phase-Locked Loop (PLL) that can take either the original clock or one of several divided clocks as an input, and bring this clock up to the desired level for the associated subcomponent. Control registers select the precise input of these PLLs, and can in addition be configured to bypass the PLLs.
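
As a rough illustration of the structure described above, the following sketch models a single leaf of such a clock tree: an input clock, one fixed divider, and a PLL that can be bypassed. The coefficient names (n, m, p), the output formula and all numbers are generic textbook assumptions for illustration only; they are not taken from NVIDIA hardware or from the Nouveau source.

```python
from dataclasses import dataclass


@dataclass
class Pll:
    """Generic integer PLL model (assumed form, not NVIDIA-specific)."""
    n: int  # feedback multiplier
    m: int  # input divider
    p: int  # post divider, applied as a power of two

    def output_khz(self, input_khz: int) -> int:
        return input_khz * self.n // (self.m * (1 << self.p))


def leaf_clock_khz(input_khz: int, divider: int, pll: Pll, bypass: bool) -> int:
    """Clock seen by one subcomponent.

    The input clock first passes a fixed divider. The PLL then brings it
    up to the desired level, unless the control register selects bypass.
    """
    source_khz = input_khz // divider
    return source_khz if bypass else pll.output_khz(source_khz)


# Hypothetical numbers: a 27 MHz input clock brought up to 405 MHz.
example_pll = Pll(n=30, m=1, p=1)
normal_khz = leaf_clock_khz(27_000, 1, example_pll, bypass=False)
bypass_khz = leaf_clock_khz(27_000, 1, example_pll, bypass=True)
print(normal_khz, bypass_khz)
```

In bypass mode the subcomponent simply runs from the divided input clock, which is what makes it possible to reprogram the PLL without stopping the subcomponent.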

The video BIOS (VBIOS) provides two services: it takes care of bringing the GPU into an initial valid state, and it contains crucial information regarding reclocking. Most importantly, the VBIOS describes the ranges of each PLL in the system. On a higher level, the VBIOS also contains several "performance levels". Each level consists of a clock speed for each subcomponent. NVIDIA's driver switches between these performance levels based on the load. For most engines this routine consists of bypassing the PLL, setting it to a new value, testing the newly set values, and then re-enabling the PLL.
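
The engine reclocking routine described above (bypass, reprogram, test, re-enable) can be sketched as follows. This is a minimal model under stated assumptions: FakePll and its methods are hypothetical stand-ins for a PLL and its control register, and the lock check always succeeds. None of these names exist in Nouveau or in NVIDIA's driver.

```python
class FakePll:
    """Hypothetical stand-in for one PLL and its control register."""

    def __init__(self, freq_khz: int):
        self.freq_khz = freq_khz
        self.bypassed = False
        self.log = []

    def set_bypass(self, on: bool) -> None:
        self.bypassed = on
        self.log.append("bypass on" if on else "bypass off")

    def program(self, freq_khz: int) -> None:
        # Only reprogram while the engine runs from the bypass clock.
        if not self.bypassed:
            raise RuntimeError("PLL must be bypassed before reprogramming")
        self.freq_khz = freq_khz
        self.log.append(f"program {freq_khz}")

    def is_locked(self) -> bool:
        self.log.append("check lock")
        return True  # assumption: real hardware would have to be polled


def reclock_engine(pll: FakePll, target_khz: int) -> bool:
    """Bypass, reprogram, test, re-enable. Returns True on success."""
    pll.set_bypass(True)
    pll.program(target_khz)
    if not pll.is_locked():
        return False  # stay on the bypass clock rather than use a bad PLL
    pll.set_bypass(False)
    return True


core_pll = FakePll(freq_khz=135_000)
ok = reclock_engine(core_pll, 405_000)
print(ok, core_pll.log)
```

The ordering is the point of the sketch: the PLL output is never used while its coefficients are being changed.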

Memory reclocking

Reclocking memory is a bit more difficult than reclocking the other engines. Besides an input clock, the memory controller also needs to know a variety of latencies, which are usually programmed in clock ticks but mandated in nanoseconds. These latencies, or timings, are described in the VBIOS.
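
Because the latencies are mandated in nanoseconds but counted in clock ticks, every change of the memory clock requires the tick values to be recomputed. A minimal sketch of that conversion, assuming the usual rule of rounding up to the next whole tick; the function name and the example latencies are illustrative and not taken from any VBIOS:

```python
import math


def ns_to_ticks(latency_ns: float, clock_mhz: float) -> int:
    """Smallest whole number of clock ticks lasting at least latency_ns."""
    # One tick lasts 1000 / clock_mhz nanoseconds.
    return math.ceil(latency_ns * clock_mhz / 1000)


# A hypothetical 15 ns latency at two different memory clocks.
ticks_low = ns_to_ticks(15, 400)
ticks_high = ns_to_ticks(15, 800)
# A latency that does not divide evenly is rounded up.
ticks_rounded = ns_to_ticks(13.75, 533)
print(ticks_low, ticks_high, ticks_rounded)
```

Doubling the clock doubles the number of ticks needed to cover the same latency, so keeping the old tick values at a higher clock would violate the timing requirement.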

To keep the memory controller and the engines running in sync, a form of link training is also required. Updating all this information must be done according to strict timing requirements, and failure to meet these deadlines results in corrupted memory with all its associated consequences. Although the memory itself is often well documented publicly, NVIDIA's memory controller is not. Reverse engineering it is a difficult challenge, as there is very little feedback beyond either a working system or a complete crash.

Reclocking engine

To facilitate the action of reclocking from within the GPU itself, increasing stability on operating system failures, NVIDIA added a subcomponent called PDAEMON. This component has full access to many registers accessible through MMIO, including the registers controlling the clocks, latencies and other power-management related features. PDAEMON is a programmable engine supporting the Falcon or fμc ISA. NVIDIA's driver uploads the firmware for this engine, dubbed PMU.

PMU is responsible for many power-management related functions, including monitoring temperature, controlling fan speed and monitoring the load on the GPU. To alter clock speeds, the NVIDIA driver can upload special scripts in a language called "seq" that are interpreted by PMU. These scripts contain sequences of registers that need to be adjusted in order, along with required pause commands and other logic. A full understanding of the seq ISA therefore gives a full understanding of the actions executed by NVIDIA's driver on a reclock operation, and of their timing.
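
To make the idea of such a script concrete, the sketch below interprets a toy script consisting of ordered register writes and pauses. The opcode names, the instruction layout and the register addresses are invented for illustration; they do not reflect the real seq encoding, which is exactly what this project sets out to document.

```python
from typing import Callable, Dict, List, Tuple

# Each instruction is (opcode, operand_a, operand_b). Hypothetical format.
Instruction = Tuple[str, int, int]


def run_script(script: List[Instruction],
               regs: Dict[int, int],
               delay_ns: Callable[[int], None]) -> None:
    """Execute register writes and pauses strictly in script order."""
    for op, a, b in script:
        if op == "write":    # store value b in register a
            regs[a] = b
        elif op == "wait":   # pause for a nanoseconds, b is unused
            delay_ns(a)
        else:
            raise ValueError(f"unknown opcode: {op}")


# Hypothetical script: bypass a PLL, wait, reprogram it, re-enable it.
toy_script = [
    ("write", 0x1000, 0x1),
    ("wait", 2000, 0),
    ("write", 0x1004, 0x1E01),
    ("write", 0x1000, 0x0),
]
toy_regs: Dict[int, int] = {}
waits: List[int] = []
run_script(toy_script, toy_regs, waits.append)  # record waits, do not sleep
print(toy_regs, waits)
```

A decoder for the real ISA would have the same overall shape, which is why documenting the instruction set and writing the decoding tool are planned as consecutive tasks.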

Nouveau has its own implementation of the PMU microcode, including a scriptable engine offering many of the capabilities implemented in older hardware. However, its capabilities might be insufficient to perform all the tasks that NVIDIA's driver performs through PMU.

Current state

Nouveau has a lot of code in place for engine reclocking. Many of the PLLs have been identified, and some of the control registers have been reverse engineered either partially or completely. Although known to work on some GPUs, engine reclocking does not work reliably everywhere, at least not on my NVA8.

For memory reclocking, some code exists to determine the latencies that the memory and the memory controller need to know. Still, there are some other features vital for memory reclocking that are ill-understood, unimplemented and/or incorrect. In addition, the order of events is likely wrong. As a result, clocking memory to any performance level higher than the boot clocks likely results in memory corruption. The link training unit found on some GPUs with DDR3 is one important example of a feature not handled by Nouveau currently.

Large parts of the VBIOS are well understood and parsed both by the Nouveau kernel driver and the envytools VBIOS parsing tool. The parts that are not yet understood could provide interesting clues about the actions required for reclocking.

Project

Scope

In this project I aim to get a better understanding of the reclocking features of the NVA3/5/8, as utilised by NVIDIA's official device driver. The eventual goal of this project is complete voltage and frequency scaling for these GPUs in nouveau. Gained knowledge could benefit the implementation of newer generations of cards as well.

I limit myself to the core features and aim for a manual control of the voltage and clock frequencies based on profiles in the VBIOS; dynamic reclocking based on load information is beyond the scope of this project.

Initial code contributions will not make use of Nouveau's PMU engine. If it is established that this is absolutely necessary, the firmware could be extended to support the desired functionality. Until then, reclocking through PDAEMON is considered a nice-to-have feature with low priority.

Benefits to the community

Users will benefit from the increased performance that Nouveau can offer at higher clocks, while having the capability to save energy when the processing power is not required. This could lead to prolonged battery life for mobile systems using the open source NVIDIA driver stack.

This work combined with the GSoC project on performance counters provides the prerequisites for implementing dynamic frequency scaling in future work, enabling all users of the open source graphics driver stack to profit from these benefits without manual intervention.

Deliverables

Implementation will be done entirely in the Nouveau kernel module, forked from an upstream kernel. The resulting patches are intended to be merged back into the mainline kernel at the end of the project, but might require some after-care if conflicting maintenance is done on Nouveau in the meantime. Controls are exposed through sysfs.

Documentation will be added to the "envytools" Git repository where applicable.

Mentor

Ilia Mirkin

Schedule

My availability is roughly full time between now and the start of the new academic year in October. Tentative planning:

| Description | Deliverable | Timeframe | Required |
|---|---|---|---|
| Reverse engineer seq ISA | Documentation (envytools) | 1 week | X |
| Write seq script decoder | Decoding tool (envytools) | 1 week | X |
| RE clock tree for NVA3/5/8 | Documentation (envytools), full graph | 1-2 week(s) | X |
| Finish/fix engine reclocking for NVA3/5/8 | Kernel code allowing users to successfully select any performance level through sysfs | 1 week | X |
| RE+implement DDR3 link training unit | Documentation (envytools) + kernel code (no directly visible changes) | 1 week | X |
| RE+implement DDR3 memory reclocking | Kernel code, observable performance improvements for highest performance level on affected GPUs | 3 weeks | X |
| RE+implement GDDR3 memory reclocking | Kernel code, observable performance improvements for highest performance level on affected GPUs | 3 weeks | X |
| RE+implement GDDR5 memory reclocking* | Kernel code, observable performance improvements for highest performance level on affected GPUs | ? | |
| RE+implement DDR2 memory reclocking* | Kernel code, observable performance improvements for highest performance level on affected GPUs | ? | |

* If hardware available

Risks

There is little risk attached to the tasks that result in documentation of the clock tree. Patches to the Nouveau kernel tree are expected, but there is a chance that the code does not generalise to all cards. Earlier experience makes me confident that engine reclocking can be implemented with low risk. Achievements for memory reclocking are not guaranteed given the complexity of the job, although progress is definitely expected.

Hardware

I currently possess one NVA8 GPU with DDR3 memory. More NVA3/5/8 hardware is available through Martin Peres and is accessible remotely. Possibly missing from our combined collection are NVA3/5/8 graphics cards with DDR2. If budget is available, such a card could be purchased (approximately €50 new) by either Martin Peres or myself for reverse-engineering purposes.