Lumina's Vulkan Memory Management

Deep Dive: Memory Management in Lumina Engine

When building a Vulkan renderer, one of the most deceptively complex challenges isn’t the graphics pipeline or shader compilation; it’s memory management. Raw Vulkan gives you complete control over GPU memory, but with that power comes significant responsibility. Today, I want to share how Lumina handles this through the Vulkan Memory Allocator (VMA) library and why it’s been a game-changer for our engine.

The Problem with Raw Vulkan Memory

Let me paint a picture of what memory management looks like without VMA. In raw Vulkan, every buffer and image you create needs memory allocated from the GPU. This sounds simple until you hit these constraints:

1. Allocation Limits

Most GPUs have a hard limit on the number of concurrent memory allocations, exposed as VkPhysicalDeviceLimits::maxMemoryAllocationCount (the spec guarantees only 4,096). If you naively call vkAllocateMemory() for every buffer and texture, you’ll hit this limit almost immediately in a real game; you can check the ceiling at startup, as the snippet after this list shows. A modern scene might have:

  • 10,000+ vertex buffers
  • 5,000+ textures
  • Hundreds of uniform buffers
  • Shadow maps, G-buffers, render targets
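
The ceiling itself is easy to query; this minimal snippet is illustrative, not taken from Lumina’s source:

VkPhysicalDeviceProperties props;
vkGetPhysicalDeviceProperties(physicalDevice, &props);

// The Vulkan spec only guarantees 4,096 concurrent allocations
uint32_t maxAllocations = props.limits.maxMemoryAllocationCount;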

2. Manual Suballocation

The solution? Suballocation. You allocate large memory blocks and manually partition them for smaller resources (a bare-bones version is sketched after this list). This means:

  • Tracking free regions within blocks
  • Handling alignment requirements (buffers need specific byte alignment)
  • Implementing a free-list or buddy allocator
  • Defragmentation when memory becomes fragmented
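
To give a feel for the work involved, here is a deliberately bare-bones first-fit suballocator. It is a hedged sketch for illustration only (no merging of freed ranges, no defragmentation), not Lumina’s code:

struct FFreeRange { VkDeviceSize Offset; VkDeviceSize Size; };
std::vector<FFreeRange> FreeList = { { 0, 256ull * 1024 * 1024 } }; // one 256MB block

// Returns an aligned offset within the block, or UINT64_MAX on failure
VkDeviceSize Suballocate(VkDeviceSize size, VkDeviceSize alignment)
{
    for (FFreeRange& range : FreeList)
    {
        VkDeviceSize aligned = (range.Offset + alignment - 1) & ~(alignment - 1);
        VkDeviceSize padding = aligned - range.Offset;
        if (range.Size >= padding + size)
        {
            range.Offset += padding + size; // padding bytes are simply wasted here
            range.Size   -= padding + size;
            return aligned;
        }
    }
    return UINT64_MAX; // real code would grow the block list or defragment
}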

3. Memory Type Selection

Vulkan exposes multiple memory heaps with different properties:

  • Device-local: Fast GPU memory, not CPU-accessible
  • Host-visible: CPU-accessible, but slower for GPU
  • Host-cached: CPU-accessible with caching
  • Device-local + Host-visible: Historically a small 256MB window; the full heap with Resizable BAR (ReBAR) / Smart Access Memory

For each allocation, you need to query memory types, check compatibility flags, and select the optimal heap. Get it wrong, and performance tanks.
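
For reference, the canonical raw-Vulkan selection loop looks roughly like this (a sketch of the standard pattern, not Lumina code):

// typeBits comes from VkMemoryRequirements::memoryTypeBits for the resource
uint32_t FindMemoryType(VkPhysicalDevice gpu, uint32_t typeBits, VkMemoryPropertyFlags required)
{
    VkPhysicalDeviceMemoryProperties props;
    vkGetPhysicalDeviceMemoryProperties(gpu, &props);
    
    for (uint32_t i = 0; i < props.memoryTypeCount; ++i)
    {
        if ((typeBits & (1u << i)) && 
            (props.memoryTypes[i].propertyFlags & required) == required)
        {
            return i;
        }
    }
    return UINT32_MAX; // no compatible memory type found
}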

4. Synchronization Complexity

You can’t free memory while the GPU is using it. This requires:

  • Tracking which command buffers reference which allocations
  • Waiting for GPU work to complete before freeing
  • Managing deferred deletion queues (a minimal version is sketched below)
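
A minimal deferred-deletion queue, keyed by frame index, might look like the following sketch (the names and frame-tracking scheme are assumptions, not Lumina’s actual implementation):

struct FPendingFree
{
    VkBuffer Buffer;
    VkDeviceMemory Memory;
    uint64_t FrameIndex; // last frame that referenced this resource
};
std::deque<FPendingFree> PendingFrees;

void FlushPendingFrees(VkDevice device, uint64_t lastCompletedFrame)
{
    // Only destroy resources whose last-use frame has fully completed on the GPU
    while (!PendingFrees.empty() && PendingFrees.front().FrameIndex <= lastCompletedFrame)
    {
        vkDestroyBuffer(device, PendingFrees.front().Buffer, nullptr);
        vkFreeMemory(device, PendingFrees.front().Memory, nullptr);
        PendingFrees.pop_front();
    }
}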

This is hundreds of lines of error-prone code before you’ve even rendered a triangle.

Vulkan Memory Allocator

VMA is a single-header library from AMD that handles all of this complexity. Here’s why it’s brilliant:

Automatic Suballocation

VMA maintains large memory blocks internally and suballocates from them automatically. When you request a buffer:

VkBufferCreateInfo bufferInfo = { VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO };
bufferInfo.size  = 65536; // example values
bufferInfo.usage = VK_BUFFER_USAGE_VERTEX_BUFFER_BIT;
VmaAllocationCreateInfo allocInfo = {};
allocInfo.usage = VMA_MEMORY_USAGE_AUTO;

VkBuffer buffer;
VmaAllocation allocation;
vmaCreateBuffer(allocator, &bufferInfo, &allocInfo, &buffer, &allocation, nullptr);

Behind the scenes, VMA:

  1. Finds an existing memory block with free space
  2. Suballocates from that block with proper alignment
  3. Falls back to creating a new block if needed
  4. Tracks the allocation for later freeing

No manual bookkeeping required.
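
Freeing is equally simple: one symmetric call returns the suballocation to its block.

vmaDestroyBuffer(allocator, buffer, allocation);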

Smart Memory Type Selection

The VMA_MEMORY_USAGE_AUTO usage value tells VMA: “Pick the best memory type for this resource based on usage flags.” It considers:

  • Buffer/image usage flags
  • Read/write patterns (sequential vs random)
  • Host visibility requirements
  • Performance characteristics

For example, a staging buffer requested with VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT automatically lands in host-visible memory optimized for sequential writes, while a GPU-only vertex buffer lands in device-local memory for maximum GPU performance.
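
A hedged sketch of the two cases (flag choices follow the standard VMA 3.x pattern; buffer-info setup is as above):

// Staging buffer: CPU writes, GPU reads once
VmaAllocationCreateInfo stagingAlloc = {};
stagingAlloc.usage = VMA_MEMORY_USAGE_AUTO;
stagingAlloc.flags = VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT | 
                     VMA_ALLOCATION_CREATE_MAPPED_BIT;

// Vertex buffer: GPU-only access; with no host-access flags, AUTO prefers device-local
VmaAllocationCreateInfo vertexAlloc = {};
vertexAlloc.usage = VMA_MEMORY_USAGE_AUTO;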

Lumina’s VMA Integration

Our FVulkanMemoryAllocator class wraps VMA and adds engine-specific policies. Let’s break down the key components:

Initialization

VmaAllocatorCreateInfo info = {};
info.vulkanApiVersion = VK_API_VERSION_1_3;
info.instance = instance;
info.physicalDevice = physicalDevice;
info.device = device;
info.flags = VMA_ALLOCATOR_CREATE_EXT_MEMORY_BUDGET_BIT | 
             VMA_ALLOCATOR_CREATE_EXT_MEMORY_PRIORITY_BIT;

vmaCreateAllocator(&info, &Allocator);

We enable two extensions (both must also be enabled on the VkDevice for these flags to be valid):

  • EXT_MEMORY_BUDGET: Provides real-time memory usage statistics to avoid over-allocation
  • EXT_MEMORY_PRIORITY: Lets us prioritize critical allocations (like render targets)

Custom Memory Pools (Probably Overkill)

Initially, I created custom pools for different buffer types:

void FVulkanMemoryAllocator::CreateCommonPools()
{
    CreateBufferPool(VK_BUFFER_USAGE_UNIFORM_BUFFER_BIT, 
                     VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT, 
                     16 * 1024 * 1024); // 16MB blocks
    
    CreateBufferPool(VK_BUFFER_USAGE_VERTEX_BUFFER_BIT, 
                     VMA_ALLOCATION_CREATE_DEDICATED_MEMORY_BIT, 
                     64 * 1024 * 1024); // 64MB blocks
    
    CreateBufferPool(VK_BUFFER_USAGE_TRANSFER_SRC_BIT, 
                     VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT | 
                     VMA_ALLOCATION_CREATE_MAPPED_BIT, 
                     32 * 1024 * 1024); // 32MB blocks
}

However, VMA’s documentation suggests this is unnecessary. VMA already maintains internal pools per memory type and performs sophisticated allocation strategies. Custom pools only make sense if you need:

  • Separate budgets for different resource categories
  • Specific defragmentation policies
  • Memory isolation for debugging

For most engines, VMA’s default pooling is sufficient and likely better tuned than anything we’d write ourselves. I’m keeping these pools for now, but they may be removed in a future refactor.

Buffer Allocation Strategy

Our buffer allocation logic demonstrates VMA’s flexibility:

VmaAllocation FVulkanMemoryAllocator::AllocateBuffer(
    const VkBufferCreateInfo* createInfo, 
    VmaAllocationCreateFlags flags, 
    VkBuffer* vkBuffer, 
    const char* allocationName)
{
    VmaAllocationCreateInfo info = {};
    info.usage = VMA_MEMORY_USAGE_AUTO;
    info.flags = flags;
    
    // Use custom pool if allocation is small enough
    uint64 poolKey = createInfo->usage;
    auto poolIt = BufferPools.find(poolKey);
    if (poolIt != BufferPools.end() && createInfo->size < poolIt->second.BlockSize / 2)
    {
        info.pool = poolIt->second.Pool;
    }
    
    // Large buffers get dedicated allocations for performance
    if (createInfo->size > 256 * 1024 * 1024) // 256MB
    {
        info.flags |= VMA_ALLOCATION_CREATE_DEDICATED_MEMORY_BIT;
        info.priority = 1.0f; // Highest priority
    }
    
    // Persistently map host-visible buffers
    if (flags & VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT)
    {
        info.flags |= VMA_ALLOCATION_CREATE_MAPPED_BIT;
    }
    
    VmaAllocation allocation;
    VmaAllocationInfo allocInfo;
    vmaCreateBuffer(Allocator, createInfo, &info, vkBuffer, &allocation, &allocInfo);
    
    AllocatedBuffers[*vkBuffer] = {allocation, allocInfo};
    return allocation;
}

Key optimizations:

  1. Small buffers (< half pool block size): Suballocate from pools
  2. Large buffers (> 256MB): Get dedicated allocations to avoid fragmenting pools
  3. Staging buffers: Persistently mapped with VMA_ALLOCATION_CREATE_MAPPED_BIT to eliminate map/unmap overhead

Image Allocation Strategy

Images have different requirements:

VmaAllocation FVulkanMemoryAllocator::AllocateImage(
    VkImageCreateInfo* createInfo, 
    VmaAllocationCreateFlags flags, 
    VkImage* vkImage, 
    const char* allocationName)
{
    constexpr uint64 DEDICATED_MEMORY_THRESHOLD = 2048 * 2048; // ~4.2M texels (a 2048x2048 image)
    
    VmaAllocationCreateInfo info = {};
    info.usage = VMA_MEMORY_USAGE_AUTO;
    info.flags = flags;
    
    // Approximate size in texels; ignores per-texel byte size and mip levels
    VkDeviceSize imageSize = createInfo->extent.width * 
                             createInfo->extent.height * 
                             createInfo->extent.depth * 
                             createInfo->arrayLayers;
    
    // Render targets and large images get dedicated allocations
    if (imageSize > DEDICATED_MEMORY_THRESHOLD || 
        createInfo->usage & VK_IMAGE_USAGE_COLOR_ATTACHMENT_BIT || 
        createInfo->usage & VK_IMAGE_USAGE_DEPTH_STENCIL_ATTACHMENT_BIT)
    {
        info.flags |= VMA_ALLOCATION_CREATE_DEDICATED_MEMORY_BIT;
        info.priority = 0.75f;
    }
    
    VmaAllocation allocation;
    VmaAllocationInfo allocInfo;
    vmaCreateImage(Allocator, createInfo, &info, vkImage, &allocation, &allocInfo);
    
    AllocatedImages[*vkImage] = {allocation, allocInfo};
    return allocation;
}

Why dedicated allocations for attachments?

Render targets (color and depth attachments) are accessed constantly during rendering. Giving them dedicated memory:

  • Improves GPU cache locality
  • Enables driver-specific optimizations (tile memory on mobile, compression buffers)
  • Prevents memory aliasing with other resources
  • Reduces bandwidth contention

The threshold of 2048x2048 texels (16MB for an RGBA8 image) balances memory efficiency with performance. Smaller textures can share memory pools without significant overhead.

How VMA Optimizes Small Buffers

VMA’s internal allocation strategy is sophisticated. Here’s what happens when you allocate a small buffer (say, 256 bytes for a uniform buffer):

1. Block Selection

VMA maintains a list of memory blocks per memory type. Each block is typically 256MB and contains a free-list of available ranges.

2. Allocation Algorithm

VMA adapts its strategy to allocation size (recent versions default to a TLSF, two-level segregated fit, allocator):

  • Small allocations (< 4KB): Uses a dedicated small allocation pool to avoid fragmenting large blocks
  • Medium allocations (4KB - 1MB): Standard best-fit from general-purpose blocks
  • Large allocations (> 1MB): May use dedicated blocks to prevent fragmentation

3. Alignment Handling

Vulkan requires specific alignment for different resource types:

  • Uniform buffers: 256-byte alignment (on most GPUs)
  • Storage buffers: 16-byte alignment
  • Textures: Implementation-defined (often 64KB)

VMA automatically pads allocations to meet these requirements, wasting minimal space through clever packing.
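
The underlying arithmetic is the classic power-of-two align-up; for example, a 68-byte uniform block at 256-byte alignment occupies a full 256-byte slot:

// Rounds size up to the next multiple of alignment (alignment must be a power of two)
VkDeviceSize AlignUp(VkDeviceSize size, VkDeviceSize alignment)
{
    return (size + alignment - 1) & ~(alignment - 1);
}
// AlignUp(68, 256) == 256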

4. Linear vs. General-Purpose Allocation

For pools marked with VMA_POOL_CREATE_LINEAR_ALGORITHM_BIT (creation is sketched after this comparison):

  • Allocations are tightly packed sequentially
  • Fast allocation (just increment an offset)
  • No free-list overhead
  • Perfect for temporary/staging buffers

For general pools:

  • Uses a TLSF-based free list (older VMA releases also offered a buddy allocator, removed in 3.0)
  • Handles fragmentation better for long-lived allocations
  • Slightly slower allocation but more flexible
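
Creating a linear pool is straightforward; this is a hedged sketch (block size and the representative buffer parameters are illustrative, not Lumina’s values):

// Describe a representative staging buffer so VMA can pick the memory type
VkBufferCreateInfo sampleBuffer = { VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO };
sampleBuffer.size  = 1024;
sampleBuffer.usage = VK_BUFFER_USAGE_TRANSFER_SRC_BIT;

VmaAllocationCreateInfo sampleAlloc = {};
sampleAlloc.usage = VMA_MEMORY_USAGE_AUTO;
sampleAlloc.flags = VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT;

uint32_t memTypeIndex;
vmaFindMemoryTypeIndexForBufferInfo(allocator, &sampleBuffer, &sampleAlloc, &memTypeIndex);

VmaPoolCreateInfo poolInfo = {};
poolInfo.memoryTypeIndex = memTypeIndex;
poolInfo.flags           = VMA_POOL_CREATE_LINEAR_ALGORITHM_BIT; // bump-pointer allocation
poolInfo.blockSize       = 32 * 1024 * 1024; // 32MB blocks

VmaPool stagingPool;
vmaCreatePool(allocator, &poolInfo, &stagingPool);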

Persistent Mapping: A Hidden Gem

One of VMA’s cool features is persistent mapping:

void* FVulkanMemoryAllocator::GetMappedMemory(FVulkanBuffer* buffer)
{
    // Wait for GPU to finish if needed
    if (buffer->LastUseCommandListID != 0)
    {
        FQueue* queue = RenderContext->GetQueue(buffer->LastUseQueue);
        queue->WaitCommandList(buffer->LastUseCommandListID, UINT64_MAX);
    }

    // Return persistently mapped pointer
    return AllocatedBuffers[buffer->GetBuffer()].second.pMappedData;
}

Buffers created with VMA_ALLOCATION_CREATE_MAPPED_BIT remain mapped for their entire lifetime; a per-frame update sketch follows this list. This eliminates:

  • vkMapMemory() / vkUnmapMemory() call overhead
  • Driver validation overhead
  • Risk of mapping failures under memory pressure
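
A typical per-frame update through the persistent mapping looks like this sketch (frameData is a placeholder; the flush is only required for non-HOST_COHERENT memory but is cheap to call unconditionally):

VmaAllocationInfo info;
vmaGetAllocationInfo(allocator, allocation, &info);

// Write directly through the persistent mapping: no map/unmap round trip
memcpy(info.pMappedData, &frameData, sizeof(frameData));

// No-op on HOST_COHERENT memory; flushes CPU writes to the GPU otherwise
vmaFlushAllocation(allocator, allocation, 0, sizeof(frameData));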

Memory Budgeting

VMA tracks detailed statistics for its own allocations and, through the VK_EXT_memory_budget extension, can also report real-time OS-level heap budgets (sketched after the list below):

void FVulkanMemoryAllocator::LogMemoryStats()
{
    VmaTotalStatistics stats;
    vmaCalculateStatistics(Allocator, &stats);
    
    LOG_INFO("=== Vulkan Memory Statistics ===");
    LOG_INFO("Total Allocated: %.2f MB", 
             stats.total.statistics.allocationBytes / (1024.0f * 1024.0f));
    LOG_INFO("Total Block Count: %u", stats.total.statistics.blockCount);
    LOG_INFO("Allocation Count: %u", stats.total.statistics.allocationCount);
}

This lets us:

  • Monitor memory pressure in real-time
  • Detect memory leaks during development
  • Implement adaptive quality settings based on available VRAM
  • Profile memory usage per frame
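
The per-heap budgets come from a separate query; a hedged sketch reusing LOG_INFO from above:

VmaBudget budgets[VK_MAX_MEMORY_HEAPS];
vmaGetHeapBudgets(Allocator, budgets);

const VkPhysicalDeviceMemoryProperties* memProps = nullptr;
vmaGetMemoryProperties(Allocator, &memProps);

for (uint32_t i = 0; i < memProps->memoryHeapCount; ++i)
{
    LOG_INFO("Heap %u: %.2f MB used of %.2f MB budget", i,
             budgets[i].usage / (1024.0f * 1024.0f),
             budgets[i].budget / (1024.0f * 1024.0f));
}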

Thread Safety

Notice the mutexes in our class:

FMutex ImageAllocationMutex;
FMutex BufferAllocationMutex;

VMA itself is thread-safe by default: its entry points take an internal mutex unless the allocator is created with VMA_ALLOCATOR_CREATE_EXTERNALLY_SYNCHRONIZED_BIT. Our own bookkeeping maps (AllocatedBuffers and AllocatedImages) are not, however, so we protect them with per-resource-type mutexes, allowing:

  • Simultaneous buffer and image allocation
  • Fine-grained locking for better parallelism

In a future refactor, we might set VMA_ALLOCATOR_CREATE_EXTERNALLY_SYNCHRONIZED_BIT and rely entirely on our own locking to avoid synchronizing twice, but for now, simple mutexes work well.
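
The pattern is simply to hold the matching mutex around the map update; a sketch with std::mutex standing in for the engine’s FMutex:

std::unordered_map<VkBuffer, std::pair<VmaAllocation, VmaAllocationInfo>> AllocatedBuffers;
std::mutex BufferAllocationMutex;

void TrackBufferAllocation(VkBuffer buffer, VmaAllocation alloc, const VmaAllocationInfo& info)
{
    // Guards the AllocatedBuffers map; VMA's own internals are already synchronized
    std::lock_guard<std::mutex> lock(BufferAllocationMutex);
    AllocatedBuffers[buffer] = { alloc, info };
}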

Performance Impact

Before VMA, our prototype renderer had:

  • Manual block allocator: ~2000 lines of code
  • Frequent allocation failures under load
  • Memory fragmentation
  • Complex synchronization bugs

After VMA:

  • ~200 lines of wrapper code
  • Zero allocation failures in testing
  • Smooth performance even with aggressive allocation patterns
  • Trivial to debug with VMA’s built-in statistics

The difference comes from VMA’s optimized data structures and careful internal synchronization, both of which our naive implementation lacked.

Conclusion

Vulkan Memory Allocator is one of those libraries that seems simple on the surface but reveals its genius over time. By handling the tedious, error-prone aspects of GPU memory management, it lets us focus on building a renderer instead of debugging allocation bugs at 2 AM.

Key takeaways:

  1. Don’t roll your own memory allocator unless you have very specific requirements. VMA is battle-tested across thousands of projects.

  2. Custom pools are optional. VMA’s default pooling is excellent. Only add custom pools if you need isolation or specific budgets.

  3. Persistent mapping is a free performance win for host-visible buffers.

  4. Dedicated allocations for render targets improve performance by eliminating memory aliasing.

  5. VMA handles complexity you didn’t even know existed (alignment, memory types, defragmentation, budgeting).

If you’re building a Vulkan renderer, do yourself a favor: integrate VMA on day one. Your future self will thank you.


Lumina Engine is an open-source game engine built with modern Vulkan best practices. Check out the full source code on GitHub to see VMA integration in action.