# Debugging Vulkan driver crash - equivalent of NVIDIA Aftermath
New generation, explcit graphics APIs (Vulkan and DirectX 12) are more efficient, involve less CPU overhead. Part of it is that they don't check most errors. In old APIs (Direct3D 9, OpenGL) every function call was validated internally, returned success of failure code, while driver crash indicated a bug in driver code. New APIs, on the other hand, rely on developer doing the right thing. Of course some functions still return error code (especially ones that allocate memory or create some resource), but those that record commands into a command buffer just return
void. If you do something illegal, you can expect undefined behavior. You can use Validation Layers / Debug Layer to do some checks, but otherwise everything may work fine on some GPUs, you may get incorrect result, or you may experience driver crash or timeout (called "TDR"). Good thing is that (contrary to old Windows XP), crash inside graphics driver doesn't cause "blue screen of death" or machine restart. System just restarts graphics hardware and driver, while your program receives
VK_ERROR_DEVICE_LOST code from one of functions like
vkQueueSubmit. Unfortunately, you then don't know which specific draw call or other command caused the crash.
NVIDIA proposed solution for that: they created NVIDIA Aftermath library. It lets you (among other things) record commands that write custom "marker" data to a buffer that survives driver crash, so you can later read it and see which command was successfully executed last. Unfortunately, this library works only with NVIDIA graphics cards and only in D3D11 and D3D12.
I was looking for similar solution for Vulkan. When I saw that Vulkan can "import" external memory, I thought that maybe I could use function
vkCmdFillBuffer to write immediate value to such buffer and this way implement the same logic. I then started experimenting with extensions: VK_KHR_get_physical_device_properties_2, VK_KHR_external_memory_capabilities, VK_KHR_external_memory, VK_KHR_external_memory_win32, VK_KHR_dedicated_allocation. I was basically trying to somehow allocate a piece of system memory and import it to Vulkan to write to it as Vulkan buffer. I tried many things:
HeapAlloc and other ways, with various flags, but nothing worked for me. I also couldn't find any description or sample code of how these extensions could be used in Windows to import some system memory as Vulkan buffer.
Everything changed when I learned that creating normal device memory and buffer inside Vulkan is enough! It survives driver crash, so its content can be read later via mapped pointer. No extensions required. I don't think this is guaranteed by specification, but it seems to work on both AMD and NVIDIA cards. So my current solution to write makers that survive driver crash in Vulkan is:
VkDeviceMemoryfrom memory type that has
HOST_VISIBLE + HOST_COHERENTflags. (This is system RAM. Spec guarantees that you can always find such type.)
vkMapMemoryto get raw CPU pointer to its data.
VK_BUFFER_USAGE_TRANSFER_DST_BITand bind it to that memory using
vkCmdFillBufferto write immediate data with your custom "markers" to the buffer.
VK_ERROR_DEVICE_LOST), read data under the pointer to see what marker values were successfully written last and deduce which one of your commands might cause the crash.
There is also a new extension available on latest AMD drivers: VK_AMD_buffer_marker. It adds just one function:
vkCmdWriteBufferMarkerAMD. It works similar to beforementioned
vkCmdFillBuffer, but it adds two good things that let you write your markers with much better granularity:
vkCmdFillBuffermust be called outside render pass.
I created a simple library that implements all this logic under easy interface, which I called "Vulkan AfterCrash". All you need to use it is just this single file: VulkanAfterCrash.h.
Update 4 April 2018: In GDC 2018 talk "Aftermath: Advances in GPU Crash Debugging (Presented by NVIDIA)", Alex Dunn announced that a Vulkan extension from NVIDIA will also be available, called VK_NV_device_diagnostic_checkpoints, but I can see it's not publicly accessible yet.