Lumina's Vulkan Memory Management

Deep Dive: Memory Management in Lumina Engine

When building a Vulkan renderer, one of the most deceptively complex challenges isn’t the graphics pipeline or shader compilation; it’s memory management. Raw Vulkan gives you complete control over GPU memory, but with that power comes significant responsibility. Today, I want to share how Lumina handles this through the Vulkan Memory Allocator (VMA) library and why it’s been a game-changer for our engine.

The Problem with Raw Vulkan Memory

Let me paint a picture of what memory management looks like without VMA. In raw Vulkan, every buffer and image you create needs memory allocated from the GPU. This sounds simple until you hit these constraints:

1. Allocation Limits

Most GPUs have a hard limit on the number of concurrent memory allocations, exposed as VkPhysicalDeviceLimits::maxMemoryAllocationCount (the spec guarantees only 4,096). If you naively call vkAllocateMemory() for every buffer and texture, you’ll hit this limit almost immediately in a real game; you can check the ceiling at startup, as the snippet after this list shows. A modern scene might have:

  • 10,000+ vertex buffers
  • 5,000+ textures
  • Hundreds of uniform buffers
  • Shadow maps, G-buffers, render targets
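
The ceiling itself is easy to query; this minimal snippet is illustrative, not taken from Lumina’s source:

VkPhysicalDeviceProperties props;
vkGetPhysicalDeviceProperties(physicalDevice, &props);

// The Vulkan spec only guarantees 4,096 concurrent allocations
uint32_t maxAllocations = props.limits.maxMemoryAllocationCount;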

2. Manual Suballocation

The solution? Suballocation. You allocate large memory blocks and manually partition them for smaller resources (a bare-bones version is sketched after this list). This means:

  • Tracking free regions within blocks
  • Handling alignment requirements (buffers need specific byte alignment)
  • Implementing a free-list or buddy allocator
  • Defragmentation when memory becomes fragmented
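
To give a feel for the work involved, here is a deliberately bare-bones first-fit suballocator. It is a hedged sketch for illustration only (no merging of freed ranges, no defragmentation), not Lumina’s code:

struct FFreeRange { VkDeviceSize Offset; VkDeviceSize Size; };
std::vector<FFreeRange> FreeList = { { 0, 256ull * 1024 * 1024 } }; // one 256MB block

// Returns an aligned offset within the block, or UINT64_MAX on failure
VkDeviceSize Suballocate(VkDeviceSize size, VkDeviceSize alignment)
{
    for (FFreeRange& range : FreeList)
    {
        VkDeviceSize aligned = (range.Offset + alignment - 1) & ~(alignment - 1);
        VkDeviceSize padding = aligned - range.Offset;
        if (range.Size >= padding + size)
        {
            range.Offset += padding + size; // padding bytes are simply wasted here
            range.Size   -= padding + size;
            return aligned;
        }
    }
    return UINT64_MAX; // real code would grow the block list or defragment
}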

3. Memory Type Selection

Vulkan exposes multiple memory heaps with different properties:

  • Device-local: Fast GPU memory, not CPU-accessible
  • Host-visible: CPU-accessible, but slower for GPU
  • Host-cached: CPU-accessible with caching
  • Device-local + Host-visible: Historically a small 256MB window; the full heap with Resizable BAR (ReBAR) / Smart Access Memory

For each allocation, you need to query memory types, check compatibility flags, and select the optimal heap. Get it wrong, and performance tanks.
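
For reference, the canonical raw-Vulkan selection loop looks roughly like this (a sketch of the standard pattern, not Lumina code):

// typeBits comes from VkMemoryRequirements::memoryTypeBits for the resource
uint32_t FindMemoryType(VkPhysicalDevice gpu, uint32_t typeBits, VkMemoryPropertyFlags required)
{
    VkPhysicalDeviceMemoryProperties props;
    vkGetPhysicalDeviceMemoryProperties(gpu, &props);
    
    for (uint32_t i = 0; i < props.memoryTypeCount; ++i)
    {
        if ((typeBits & (1u << i)) && 
            (props.memoryTypes[i].propertyFlags & required) == required)
        {
            return i;
        }
    }
    return UINT32_MAX; // no compatible memory type found
}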

4. Synchronization Complexity

You can’t free memory while the GPU is using it. This requires:

  • Tracking which command buffers reference which allocations
  • Waiting for GPU work to complete before freeing
  • Managing deferred deletion queues (a minimal version is sketched below)
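
A minimal deferred-deletion queue, keyed by frame index, might look like the following sketch (the names and frame-tracking scheme are assumptions, not Lumina’s actual implementation):

struct FPendingFree
{
    VkBuffer Buffer;
    VkDeviceMemory Memory;
    uint64_t FrameIndex; // last frame that referenced this resource
};
std::deque<FPendingFree> PendingFrees;

void FlushPendingFrees(VkDevice device, uint64_t lastCompletedFrame)
{
    // Only destroy resources whose last-use frame has fully completed on the GPU
    while (!PendingFrees.empty() && PendingFrees.front().FrameIndex <= lastCompletedFrame)
    {
        vkDestroyBuffer(device, PendingFrees.front().Buffer, nullptr);
        vkFreeMemory(device, PendingFrees.front().Memory, nullptr);
        PendingFrees.pop_front();
    }
}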

This is hundreds of lines of error-prone code before you’ve even rendered a triangle.

Vulkan Memory Allocator

VMA is a single-header library from AMD that handles all of this complexity. Here’s why it’s brilliant:

Automatic Suballocation

VMA maintains large memory blocks internally and suballocates from them automatically. When you request a buffer:

VkBufferCreateInfo bufferInfo = { VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO };
bufferInfo.size  = 65536; // example values
bufferInfo.usage = VK_BUFFER_USAGE_VERTEX_BUFFER_BIT;
VmaAllocationCreateInfo allocInfo = {};
allocInfo.usage = VMA_MEMORY_USAGE_AUTO;

VkBuffer buffer;
VmaAllocation allocation;
vmaCreateBuffer(allocator, &bufferInfo, &allocInfo, &buffer, &allocation, nullptr);

Behind the scenes, VMA:

  1. Finds an existing memory block with free space
  2. Suballocates from that block with proper alignment
  3. Falls back to creating a new block if needed
  4. Tracks the allocation for later freeing

No manual bookkeeping required.
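
Freeing is equally simple: one symmetric call returns the suballocation to its block.

vmaDestroyBuffer(allocator, buffer, allocation);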

Smart Memory Type Selection

The VMA_MEMORY_USAGE_AUTO usage value tells VMA: “Pick the best memory type for this resource based on usage flags.” It considers:

  • Buffer/image usage flags
  • Read/write patterns (sequential vs random)
  • Host visibility requirements
  • Performance characteristics

For example, a staging buffer requested with VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT automatically lands in host-visible memory optimized for sequential writes, while a GPU-only vertex buffer lands in device-local memory for maximum GPU performance.
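
A hedged sketch of the two cases (flag choices follow the standard VMA 3.x pattern; buffer-info setup is as above):

// Staging buffer: CPU writes, GPU reads once
VmaAllocationCreateInfo stagingAlloc = {};
stagingAlloc.usage = VMA_MEMORY_USAGE_AUTO;
stagingAlloc.flags = VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT | 
                     VMA_ALLOCATION_CREATE_MAPPED_BIT;

// Vertex buffer: GPU-only access; with no host-access flags, AUTO prefers device-local
VmaAllocationCreateInfo vertexAlloc = {};
vertexAlloc.usage = VMA_MEMORY_USAGE_AUTO;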

Lumina’s VMA Integration

Our FVulkanMemoryAllocator class wraps VMA and adds engine-specific policies. Let’s break down the key components:

Initialization

VmaAllocatorCreateInfo info = {};
info.vulkanApiVersion = VK_API_VERSION_1_3;
info.instance = instance;
info.physicalDevice = physicalDevice;
info.device = device;
info.flags = VMA_ALLOCATOR_CREATE_EXT_MEMORY_BUDGET_BIT | 
             VMA_ALLOCATOR_CREATE_EXT_MEMORY_PRIORITY_BIT;

vmaCreateAllocator(&info, &Allocator);

We enable two extensions (both must also be enabled on the VkDevice for these flags to be valid):

  • EXT_MEMORY_BUDGET: Provides real-time memory usage statistics to avoid over-allocation
  • EXT_MEMORY_PRIORITY: Lets us prioritize critical allocations (like render targets)

Custom Memory Pools (Probably Overkill)

Initially, I created custom pools for different buffer types:

void FVulkanMemoryAllocator::CreateCommonPools()
{
    CreateBufferPool(VK_BUFFER_USAGE_UNIFORM_BUFFER_BIT, 
                     VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT, 
                     16 * 1024 * 1024); // 16MB blocks
    
    CreateBufferPool(VK_BUFFER_USAGE_VERTEX_BUFFER_BIT, 
                     VMA_ALLOCATION_CREATE_DEDICATED_MEMORY_BIT, 
                     64 * 1024 * 1024); // 64MB blocks
    
    CreateBufferPool(VK_BUFFER_USAGE_TRANSFER_SRC_BIT, 
                     VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT | 
                     VMA_ALLOCATION_CREATE_MAPPED_BIT, 
                     32 * 1024 * 1024); // 32MB blocks
}

However, VMA’s documentation suggests this is unnecessary. VMA already maintains internal pools per memory type and performs sophisticated allocation strategies. Custom pools only make sense if you need:

  • Separate budgets for different resource categories
  • Specific defragmentation policies
  • Memory isolation for debugging

For most engines, VMA’s default pooling is sufficient and likely better tuned than anything we’d write ourselves. I’m keeping these pools for now, but they may be removed in a future refactor.

Buffer Allocation Strategy

Our buffer allocation logic demonstrates VMA’s flexibility:

VmaAllocation FVulkanMemoryAllocator::AllocateBuffer(
    const VkBufferCreateInfo* createInfo, 
    VmaAllocationCreateFlags flags, 
    VkBuffer* vkBuffer, 
    const char* allocationName)
{
    VmaAllocationCreateInfo info = {};
    info.usage = VMA_MEMORY_USAGE_AUTO;
    info.flags = flags;
    
    // Use custom pool if allocation is small enough
    uint64 poolKey = createInfo->usage;
    auto poolIt = BufferPools.find(poolKey);
    if (poolIt != BufferPools.end() && createInfo->size < poolIt->second.BlockSize / 2)
    {
        info.pool = poolIt->second.Pool;
    }
    
    // Large buffers get dedicated allocations for performance
    if (createInfo->size > 256 * 1024 * 1024) // 256MB
    {
        info.flags |= VMA_ALLOCATION_CREATE_DEDICATED_MEMORY_BIT;
        info.priority = 1.0f; // Highest priority
    }
    
    // Persistently map host-visible buffers
    if (flags & VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT)
    {
        info.flags |= VMA_ALLOCATION_CREATE_MAPPED_BIT;
    }
    
    VmaAllocation allocation;
    VmaAllocationInfo allocInfo;
    vmaCreateBuffer(Allocator, createInfo, &info, vkBuffer, &allocation, &allocInfo);
    
    AllocatedBuffers[*vkBuffer] = {allocation, allocInfo};
    return allocation;
}

Key optimizations:

  1. Small buffers (< half pool block size): Suballocate from pools
  2. Large buffers (> 256MB): Get dedicated allocations to avoid fragmenting pools
  3. Staging buffers: Persistently mapped with VMA_ALLOCATION_CREATE_MAPPED_BIT to eliminate map/unmap overhead

Image Allocation Strategy

Images have different requirements:

VmaAllocation FVulkanMemoryAllocator::AllocateImage(
    VkImageCreateInfo* createInfo, 
    VmaAllocationCreateFlags flags, 
    VkImage* vkImage, 
    const char* allocationName)
{
    constexpr uint64 DEDICATED_MEMORY_THRESHOLD = 2048 * 2048; // ~4.2M texels (a 2048x2048 image)
    
    VmaAllocationCreateInfo info = {};
    info.usage = VMA_MEMORY_USAGE_AUTO;
    info.flags = flags;
    
    // Approximate size in texels; ignores per-texel byte size and mip levels
    VkDeviceSize imageSize = createInfo->extent.width * 
                             createInfo->extent.height * 
                             createInfo->extent.depth * 
                             createInfo->arrayLayers;
    
    // Render targets and large images get dedicated allocations
    if (imageSize > DEDICATED_MEMORY_THRESHOLD || 
        createInfo->usage & VK_IMAGE_USAGE_COLOR_ATTACHMENT_BIT || 
        createInfo->usage & VK_IMAGE_USAGE_DEPTH_STENCIL_ATTACHMENT_BIT)
    {
        info.flags |= VMA_ALLOCATION_CREATE_DEDICATED_MEMORY_BIT;
        info.priority = 0.75f;
    }
    
    VmaAllocation allocation;
    VmaAllocationInfo allocInfo;
    vmaCreateImage(Allocator, createInfo, &info, vkImage, &allocation, &allocInfo);
    
    AllocatedImages[*vkImage] = {allocation, allocInfo};
    return allocation;
}

Why dedicated allocations for attachments?

Render targets (color and depth attachments) are accessed constantly during rendering. Giving them dedicated memory:

  • Improves GPU cache locality
  • Enables driver-specific optimizations (tile memory on mobile, compression buffers)
  • Prevents memory aliasing with other resources
  • Reduces bandwidth contention

The threshold of 2048x2048 texels (16MB for an RGBA8 image) balances memory efficiency with performance. Smaller textures can share memory pools without significant overhead.

How VMA Optimizes Small Buffers

VMA’s internal allocation strategy is sophisticated. Here’s what happens when you allocate a small buffer (say, 256 bytes for a uniform buffer):

1. Block Selection

VMA maintains a list of memory blocks per memory type. Each block is typically 256MB and contains a free-list of available ranges.

2. Allocation Algorithm

VMA adapts its strategy to allocation size (recent versions default to a TLSF, two-level segregated fit, allocator):

  • Small allocations (< 4KB): Uses a dedicated small allocation pool to avoid fragmenting large blocks
  • Medium allocations (4KB - 1MB): Standard best-fit from general-purpose blocks
  • Large allocations (> 1MB): May use dedicated blocks to prevent fragmentation

3. Alignment Handling

Vulkan requires specific alignment for different resource types:

  • Uniform buffers: 256-byte alignment (on most GPUs)
  • Storage buffers: 16-byte alignment
  • Textures: Implementation-defined (often 64KB)

VMA automatically pads allocations to meet these requirements, wasting minimal space through clever packing.
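
The underlying arithmetic is the classic power-of-two align-up; for example, a 68-byte uniform block at 256-byte alignment occupies a full 256-byte slot:

// Rounds size up to the next multiple of alignment (alignment must be a power of two)
VkDeviceSize AlignUp(VkDeviceSize size, VkDeviceSize alignment)
{
    return (size + alignment - 1) & ~(alignment - 1);
}
// AlignUp(68, 256) == 256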

4. Linear vs. General-Purpose Allocation

For pools marked with VMA_POOL_CREATE_LINEAR_ALGORITHM_BIT (creation is sketched after this comparison):

  • Allocations are tightly packed sequentially
  • Fast allocation (just increment an offset)
  • No free-list overhead
  • Perfect for temporary/staging buffers

For general pools:

  • Uses a TLSF-based free list (older VMA releases also offered a buddy allocator, removed in 3.0)
  • Handles fragmentation better for long-lived allocations
  • Slightly slower allocation but more flexible
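
Creating a linear pool is straightforward; this is a hedged sketch (block size and the representative buffer parameters are illustrative, not Lumina’s values):

// Describe a representative staging buffer so VMA can pick the memory type
VkBufferCreateInfo sampleBuffer = { VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO };
sampleBuffer.size  = 1024;
sampleBuffer.usage = VK_BUFFER_USAGE_TRANSFER_SRC_BIT;

VmaAllocationCreateInfo sampleAlloc = {};
sampleAlloc.usage = VMA_MEMORY_USAGE_AUTO;
sampleAlloc.flags = VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT;

uint32_t memTypeIndex;
vmaFindMemoryTypeIndexForBufferInfo(allocator, &sampleBuffer, &sampleAlloc, &memTypeIndex);

VmaPoolCreateInfo poolInfo = {};
poolInfo.memoryTypeIndex = memTypeIndex;
poolInfo.flags           = VMA_POOL_CREATE_LINEAR_ALGORITHM_BIT; // bump-pointer allocation
poolInfo.blockSize       = 32 * 1024 * 1024; // 32MB blocks

VmaPool stagingPool;
vmaCreatePool(allocator, &poolInfo, &stagingPool);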

Persistent Mapping: A Hidden Gem

One of VMA’s cool features is persistent mapping:

void* FVulkanMemoryAllocator::GetMappedMemory(FVulkanBuffer* buffer)
{
    // Wait for GPU to finish if needed
    if (buffer->LastUseCommandListID != 0)
    {
        FQueue* queue = RenderContext->GetQueue(buffer->LastUseQueue);
        queue->WaitCommandList(buffer->LastUseCommandListID, UINT64_MAX);
    }

    // Return persistently mapped pointer
    return AllocatedBuffers[buffer->GetBuffer()].second.pMappedData;
}

Buffers created with VMA_ALLOCATION_CREATE_MAPPED_BIT remain mapped for their entire lifetime; a per-frame update sketch follows this list. This eliminates:

  • vkMapMemory() / vkUnmapMemory() call overhead
  • Driver validation overhead
  • Risk of mapping failures under memory pressure
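
A typical per-frame update through the persistent mapping looks like this sketch (frameData is a placeholder; the flush is only required for non-HOST_COHERENT memory but is cheap to call unconditionally):

VmaAllocationInfo info;
vmaGetAllocationInfo(allocator, allocation, &info);

// Write directly through the persistent mapping: no map/unmap round trip
memcpy(info.pMappedData, &frameData, sizeof(frameData));

// No-op on HOST_COHERENT memory; flushes CPU writes to the GPU otherwise
vmaFlushAllocation(allocator, allocation, 0, sizeof(frameData));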

Memory Budgeting

VMA tracks detailed statistics for its own allocations and, through the VK_EXT_memory_budget extension, can also report real-time OS-level heap budgets (sketched after the list below):

void FVulkanMemoryAllocator::LogMemoryStats()
{
    VmaTotalStatistics stats;
    vmaCalculateStatistics(Allocator, &stats);
    
    LOG_INFO("=== Vulkan Memory Statistics ===");
    LOG_INFO("Total Allocated: %.2f MB", 
             stats.total.statistics.allocationBytes / (1024.0f * 1024.0f));
    LOG_INFO("Total Block Count: %u", stats.total.statistics.blockCount);
    LOG_INFO("Allocation Count: %u", stats.total.statistics.allocationCount);
}

This lets us:

  • Monitor memory pressure in real-time
  • Detect memory leaks during development
  • Implement adaptive quality settings based on available VRAM
  • Profile memory usage per frame
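
The per-heap budgets come from a separate query; a hedged sketch reusing LOG_INFO from above:

VmaBudget budgets[VK_MAX_MEMORY_HEAPS];
vmaGetHeapBudgets(Allocator, budgets);

const VkPhysicalDeviceMemoryProperties* memProps = nullptr;
vmaGetMemoryProperties(Allocator, &memProps);

for (uint32_t i = 0; i < memProps->memoryHeapCount; ++i)
{
    LOG_INFO("Heap %u: %.2f MB used of %.2f MB budget", i,
             budgets[i].usage / (1024.0f * 1024.0f),
             budgets[i].budget / (1024.0f * 1024.0f));
}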

Thread Safety

Notice the mutexes in our class:

FMutex ImageAllocationMutex;
FMutex BufferAllocationMutex;

VMA itself is thread-safe by default: its entry points take an internal mutex unless the allocator is created with VMA_ALLOCATOR_CREATE_EXTERNALLY_SYNCHRONIZED_BIT. Our own bookkeeping maps (AllocatedBuffers and AllocatedImages) are not, however, so we protect them with per-resource-type mutexes, allowing:

  • Simultaneous buffer and image allocation
  • Fine-grained locking for better parallelism

In a future refactor, we might set VMA_ALLOCATOR_CREATE_EXTERNALLY_SYNCHRONIZED_BIT and rely entirely on our own locking to avoid synchronizing twice, but for now, simple mutexes work well.
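
The pattern is simply to hold the matching mutex around the map update; a sketch with std::mutex standing in for the engine’s FMutex:

std::unordered_map<VkBuffer, std::pair<VmaAllocation, VmaAllocationInfo>> AllocatedBuffers;
std::mutex BufferAllocationMutex;

void TrackBufferAllocation(VkBuffer buffer, VmaAllocation alloc, const VmaAllocationInfo& info)
{
    // Guards the AllocatedBuffers map; VMA's own internals are already synchronized
    std::lock_guard<std::mutex> lock(BufferAllocationMutex);
    AllocatedBuffers[buffer] = { alloc, info };
}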

Performance Impact

Before VMA, our prototype renderer had:

  • Manual block allocator: ~2000 lines of code
  • Frequent allocation failures under load
  • Memory fragmentation
  • Complex synchronization bugs

After VMA:

  • ~200 lines of wrapper code
  • Zero allocation failures in testing
  • Smooth performance even with aggressive allocation patterns
  • Trivial to debug with VMA’s built-in statistics

The difference comes from VMA’s optimized data structures and careful internal synchronization, both of which our naive implementation lacked.

Conclusion

Vulkan Memory Allocator is one of those libraries that seems simple on the surface but reveals its genius over time. By handling the tedious, error-prone aspects of GPU memory management, it lets us focus on building a renderer instead of debugging allocation bugs at 2 AM.

Key takeaways:

  1. Don’t roll your own memory allocator unless you have very specific requirements. VMA is battle-tested across thousands of projects.

  2. Custom pools are optional. VMA’s default pooling is excellent. Only add custom pools if you need isolation or specific budgets.

  3. Persistent mapping is a free performance win for host-visible buffers.

  4. Dedicated allocations for render targets improve performance by eliminating memory aliasing.

  5. VMA handles complexity you didn’t even know existed (alignment, memory types, defragmentation, budgeting).

If you’re building a Vulkan renderer, do yourself a favor: integrate VMA on day one. Your future self will thank you.


Lumina Engine is an open-source game engine built with modern Vulkan best practices. Check out the full source code on GitHub to see VMA integration in action.