浅谈Linux内存alloc: kmalloc, vmalloc, malloc

浅谈Linux内存alloc: kmalloc, vmalloc, malloc

kmalloc

定义于include/linux/slab.h,用于内核分配大小小于页大小的内存, 物理上连续。

参数:

  • size: 分配的内存大小,最小值:SLAB分配器为32Bytes,SLOB和SLUB为8Bytes;最大值:SLAB分配器为32MB(25阶)或者最大分配页的阶数-1,SLOB和SLUB的阶数为:MAX_ORDER + PAGE_SHIFT - 1(11+12-1),4MB

  • flags:

    • GFP_KERNEL,表示当当前没有足够内存分配时,进程进入睡眠
    • GFP_ATOMIC, 用来从中断处理和进程上下文之外的其他代码中分配内存. 从不睡眠.
    • GFP_USER, 用来为用户空间页来分配内存; 它可能睡眠.
    • GFP_HIGHUSER, 如同 GFP_USER, 但是从高端内存分配, 如果有. 高端内存在下一个子节描述.
/**
 * kmalloc - allocate memory
 * @size: how many bytes of memory are required.
 * @flags: the type of memory to allocate.
 *
 * kmalloc is the normal method of allocating memory
 * for objects smaller than page size in the kernel.
 *
 * The allocated object address is aligned to at least ARCH_KMALLOC_MINALIGN
 * bytes. For @size of power of two bytes, the alignment is also guaranteed
 * to be at least to the size.
 *
 * The @flags argument may be one of the GFP flags defined at
 * include/linux/gfp.h and described at
 * :ref:`Documentation/core-api/mm-api.rst <mm-api-gfp-flags>`
 *
 * The recommended usage of the @flags is described at
 * :ref:`Documentation/core-api/memory-allocation.rst <memory_allocation>`
 *
 * Below is a brief outline of the most useful GFP flags
 *
 * %GFP_KERNEL
 *  Allocate normal kernel ram. May sleep.
 *
 * %GFP_NOWAIT
 *  Allocation will not sleep.
 *
 * %GFP_ATOMIC
 *  Allocation will not sleep.  May use emergency pools.
 *
 * %GFP_HIGHUSER
 *  Allocate memory from high memory on behalf of user.
 *
 * Also it is possible to set different flags by OR'ing
 * in one or more of the following additional @flags:
 *
 * %__GFP_HIGH
 *  This allocation has high priority and may use emergency pools.
 *
 * %__GFP_NOFAIL
 *  Indicate that this allocation is in no way allowed to fail
 *  (think twice before using).
 *
 * %__GFP_NORETRY
 *  If memory is not immediately available,
 *  then give up at once.
 *
 * %__GFP_NOWARN
 *  If allocation fails, don't issue any warnings.
 *
 * %__GFP_RETRY_MAYFAIL
 *  Try really hard to succeed the allocation but fail
 *  eventually.
 */
static __always_inline void *kmalloc(size_t size, gfp_t flags)
{
    if (__builtin_constant_p(size)) {
#ifndef CONFIG_SLOB
        unsigned int index;
#endif
        if (size > KMALLOC_MAX_CACHE_SIZE)
            return kmalloc_large(size, flags);
#ifndef CONFIG_SLOB
        index = kmalloc_index(size);

        if (!index)
            return ZERO_SIZE_PTR;

        return kmem_cache_alloc_trace(
                kmalloc_caches[kmalloc_type(flags)][index],
                flags, size);
#endif
    }
    return __kmalloc(size, flags);
}

在调用具体分配器之前,内核用编译器方法首先判断是否是常量,以便于追踪。

SLAB使用__do_kmalloc来分配内存,caller用于追踪。

/**
 * __do_kmalloc - allocate memory
 * @size: how many bytes of memory are required.
 * @flags: the type of memory to allocate (see kmalloc).
 * @caller: function caller for debug tracking of the caller
 *
 * Return: pointer to the allocated memory or %NULL in case of error
 */
static __always_inline void *__do_kmalloc(size_t size, gfp_t flags,
                      unsigned long caller)
{
    struct kmem_cache *cachep;
    void *ret;

    if (unlikely(size > KMALLOC_MAX_CACHE_SIZE))
        return NULL;
    cachep = kmalloc_slab(size, flags);
    if (unlikely(ZERO_OR_NULL_PTR(cachep)))
        return cachep;
    ret = slab_alloc(cachep, flags, size, caller);

    ret = kasan_kmalloc(cachep, ret, size, flags);
    trace_kmalloc(caller, ret,
              size, cachep->size, flags);

    return ret;
}

首先找到指定大小的kmem_cache, 在缓存里面能找到就直接返回。此处认为这个指针是不大可能是空指针,即优化倾向于缓存命中。

最后调用slab_alloc, 使用slab分配。

static __always_inline void *
slab_alloc(struct kmem_cache *cachep, gfp_t flags, size_t orig_size, unsigned long caller)
{
    unsigned long save_flags;
    void *objp;
    struct obj_cgroup *objcg = NULL;
    bool init = false;

    flags &= gfp_allowed_mask;
    cachep = slab_pre_alloc_hook(cachep, &objcg, 1, flags);
    if (unlikely(!cachep))
        return NULL;

    objp = kfence_alloc(cachep, orig_size, flags);
    if (unlikely(objp))
        goto out;

    cache_alloc_debugcheck_before(cachep, flags);
    local_irq_save(save_flags);
    objp = __do_cache_alloc(cachep, flags);
    local_irq_restore(save_flags);
    objp = cache_alloc_debugcheck_after(cachep, flags, objp, caller);
    prefetchw(objp);
    init = slab_want_init_on_alloc(flags, cachep);

out:
    slab_post_alloc_hook(cachep, objcg, flags, 1, &objp, init);
    return objp;
}

vmalloc

vmalloc用于内核分配在虚拟内存中连续但是在物理内存不一定连续的内存,目的是在内核内存空间提供一个内存区,能够让这个内存区映射到 896MB 之外的物理内存。定义于mm/vmalloc.c

/**
 * vmalloc - allocate virtually contiguous memory
 * @size:    allocation size
 *
 * Allocate enough pages to cover @size from the page level
 * allocator and map them into contiguous kernel virtual space.
 *
 * For tight control over page level allocator and protection flags
 * use __vmalloc() instead.
 *
 * Return: pointer to the allocated memory or %NULL on error
 */
void *vmalloc(unsigned long size)
{
    return __vmalloc_node(size, 1, GFP_KERNEL, NUMA_NO_NODE,
                __builtin_return_address(0));
}
EXPORT_SYMBOL(vmalloc);

__vmalloc_node_range用于分配虚拟的连续内存,调用__get_vm_area_node,首先使用__alloc_vmap_area来获取预估适当的区域,然后分配各个页,最后将这些页连续地映射到函数输入的vmalloc区域。

/**
 * __vmalloc_node_range - allocate virtually contiguous memory
 * @size:         allocation size
 * @align:        desired alignment
 * @start:        vm area range start
 * @end:          vm area range end
 * @gfp_mask:         flags for the page level allocator
 * @prot:         protection mask for the allocated pages
 * @vm_flags:         additional vm area flags (e.g. %VM_NO_GUARD)
 * @node:         node to use for allocation or NUMA_NO_NODE
 * @caller:       caller's return address
 *
 * Allocate enough pages to cover @size from the page level
 * allocator with @gfp_mask flags.  Map them into contiguous
 * kernel virtual space, using a pagetable protection of @prot.
 *
 * Return: the address of the area or %NULL on failure
 */
void *__vmalloc_node_range(unsigned long size, unsigned long align,
            unsigned long start, unsigned long end, gfp_t gfp_mask,
            pgprot_t prot, unsigned long vm_flags, int node,
            const void *caller)
{
    struct vm_struct *area;
    void *addr;
    unsigned long real_size = size;
    unsigned long real_align = align;
    unsigned int shift = PAGE_SHIFT;

    if (WARN_ON_ONCE(!size))
        return NULL;

    if ((size >> PAGE_SHIFT) > totalram_pages()) {
        warn_alloc(gfp_mask, NULL,
            "vmalloc error: size %lu, exceeds total pages",
            real_size);
        return NULL;
    }

    if (vmap_allow_huge && !(vm_flags & VM_NO_HUGE_VMAP)) {
        unsigned long size_per_node;

        /*
         * Try huge pages. Only try for PAGE_KERNEL allocations,
         * others like modules don't yet expect huge pages in
         * their allocations due to apply_to_page_range not
         * supporting them.
         */

        size_per_node = size;
        if (node == NUMA_NO_NODE)
            size_per_node /= num_online_nodes();
        if (arch_vmap_pmd_supported(prot) && size_per_node >= PMD_SIZE)
            shift = PMD_SHIFT;
        else
            shift = arch_vmap_pte_supported_shift(size_per_node);

        align = max(real_align, 1UL << shift);
        size = ALIGN(real_size, 1UL << shift);
    }

again:
    area = __get_vm_area_node(real_size, align, shift, VM_ALLOC |
                  VM_UNINITIALIZED | vm_flags, start, end, node,
                  gfp_mask, caller);
    if (!area) {
        warn_alloc(gfp_mask, NULL,
            "vmalloc error: size %lu, vm_struct allocation failed",
            real_size);
        goto fail;
    }

    addr = __vmalloc_area_node(area, gfp_mask, prot, shift, node);
    if (!addr)
        goto fail;

    /*
     * In this function, newly allocated vm_struct has VM_UNINITIALIZED
     * flag. It means that vm_struct is not fully initialized.
     * Now, it is fully initialized, so remove this flag here.
     */
    clear_vm_uninitialized_flag(area);

    size = PAGE_ALIGN(size);
    if (!(vm_flags & VM_DEFER_KMEMLEAK))
        kmemleak_vmalloc(area, size, gfp_mask);

    return addr;

fail:
    if (shift > PAGE_SHIFT) {
        shift = PAGE_SHIFT;
        align = real_align;
        size = real_size;
        goto again;
    }

    return NULL;
}
/*
 * Returns a start address of the newly allocated area, if success.
 * Otherwise a vend is returned that indicates failure.
 */
static __always_inline unsigned long
__alloc_vmap_area(unsigned long size, unsigned long align,
    unsigned long vstart, unsigned long vend)
{
    unsigned long nva_start_addr;
    struct vmap_area *va;
    enum fit_type type;
    int ret;

    va = find_vmap_lowest_match(size, align, vstart);
    if (unlikely(!va))
        return vend;

    if (va->va_start > vstart)
        nva_start_addr = ALIGN(va->va_start, align);
    else
        nva_start_addr = ALIGN(vstart, align);

    /* Check the "vend" restriction. */
    if (nva_start_addr + size > vend)
        return vend;

    /* Classify what we have found. */
    type = classify_va_fit_type(va, nva_start_addr, size);
    if (WARN_ON_ONCE(type == NOTHING_FIT))
        return vend;

    /* Update the free vmap_area. */
    ret = adjust_va_to_fit_type(va, nva_start_addr, size, type);
    if (ret)
        return vend;

#if DEBUG_AUGMENT_LOWEST_MATCH_CHECK
    find_vmap_lowest_match_check(size);
#endif

    return nva_start_addr;
}

malloc

malloc是用于申请动态内存(堆内存)的方法,一般使用的是glibc提供的内存分配器,分配器算法是ptmalloc2,其他也有jemalloc, tcmalloc, mimalloc等分配算法。

malloc会调用brk和mmap从进程的虚拟内存开辟和交还空间。

ptmalloc2

通用内存分配,适用于绝大多数场景。其他号称更快更少碎片更低占用的分配器大都为特定场景优化。

flowchart TB
    subgraph A
        dentry_obj_1(arena)
        dentry_obj_2(堆区)
        subgraph B
            bin
            空闲列表
            subgraph C
                chunk
                内存块
            end
        end
    end
  • Arena: 通过sbrk或mmap系统调用为线程分配的堆区(128KB为界限)
    • main arena:主线程建立的arena
    • thread arena:子线程建立的arena
    • 最大值(x86_64): 8 * number of cores
    • 竞争:每个线程使用arena时会自旋申请上锁,上锁后可以使用,每个时刻仅有一个线程使用一个arena,再次使用arena时会尝试使用上一次的arena,提高缓存命中率。
  • Chunk: 逻辑上划分的一小块内存
    • fast chunk: 大小为 16 ~ 80 字节的 chunk
    • small chunk: 大小小于 512 字节的 chunk
    • large chunk: 大小大于等于 512 字节的 chunk
    • Allocated chunk: 分配给用户的Chunk
    • Free chunk: 用户释放的Chunk(free)
    • Top Chunk: 一个 arena 中最顶部的 chunk
      • 当 top chunk 的大小比用户请求的大小大的时候,top chunk 会分割为两个部分: User chunk,返回给用户; Remainder chunk,剩余部分,将成为新的 top chunk
      • 当 top chunk 的大小比用户请求的大小小的时候,top chunk 就通过 sbrk(main arena)或 mmap( thread arena)系统调用扩容
    • Last Remainder Chunk: 最后一次 small request 中因分割而得到的剩余部分,它有利于改进引用局部性,也即后续对 small chunk 的 malloc 请求可能最终被分配得彼此靠近。
  • Bin:一个用以保存Free chunk链表的表头信息的指针数组(free()之后的内存空间并不会立即返回给系统)
    • Fast Bin:存放free(fast chunk)的单链表, fast bins 路径享有最快的内存分配及释放速度。
    • Small Bin: 存放free(small chunk)的双向循环列表,在内存分配回收的速度上,small bin 比 large bin 更快。
    • Large Bin:存放free(large chunk)的双向循环列表
    • Unsorted Bin: 未排序的垃圾桶,所有Chuck被free后首先进入这里,在内存分配的时候,检索没有结果,才会将对应的内存放入对应的Bin

jemalloc

FreeBSD项目中使用,侧重于减少内存碎片和提升高并发场景下内存的分配效率,其目标是能够替代malloc。

  • arena: 用于分配和回收 extent,线程采用 round-robin 轮询的方式选择可用的 arena 进行内存分配,为了减少线程之间的锁竞争,默认每个 CPU 会分配 4 个 arena

  • base:用于分配 jemalloc 元数据内存

  • bin: 用于管理不同档位的内存单元

  • extent: 管理 jemalloc 内存块(即用于用户分配的内存)

  • tcache: 每个线程私有的缓存,用于 small 和 large 场景下的内存分配,大多数内存申请都可以在 tcache 中直接得到,从而避免加锁

  • slab: 当 extent 用于分配 small_class 内存时,称其为 slab

  • extents: 管理extent

  • rtree: 全局唯一的存放每个 extent 信息的 Radix Tree(基数树)

  • cache_bin: 每个线程独有的用于分配小内存的缓存

  • tsd: Thread Specific Data,每个线程独有,用于存放与这个线程相关的结构

  • GC: 内存回收策略

    • tcache GC: 针对 small_class,防止某个线程预先分配了内存但是却没有实际分配给用户,会定期将缓存 flush 到 extent
      • 每228次malloc或free之后触发,产生一个cache_bin
    • extent GC: 合并extent进入extents_retained

tcmalloc

tcmalloc 出身于 Google,带有线程缓存,例如:Chrome。

flowchart LR
    left(User code)
    subgraph TCMalloc
        subgraph "Front end"
            t(Per Thread Cache)
            c(Per CPU Cache)
        end
        subgraph "Middle end"
            subgraph "Tansfer Cache"
                cfl(Central free list)
            end
        end
        subgraph "Back end"
            lph(Legacy page heap)
            haph(Hugepage aware page heap)
        end
    end
    right(OS)
    left(User code) <--> t <--> cfl <--> lph <--> right
    left(User code) <--> c <--> cfl <--> haph <--> right

mimalloc

微软研究院出品。目前应用不多,常见于较新的与整活项目:

知识共享许可协议
本作品采用知识共享署名-相同方式共享 4.0 国际许可协议进行许可。
上一篇