    [IPV4]: rt_cache_stat can be statically defined · 2f970d83
    Committed by Eric Dumazet
    Using __get_cpu_var(obj) is slightly faster than per_cpu_ptr(obj, 
    raw_smp_processor_id()).
    
    1) Smaller code and memory use
    For static and small objects, DEFINE_PER_CPU(type, object) is preferred over 
    alloc_percpu(): the code that accesses them is better and smaller, and no extra 
    memory is needed (for storing the pointer and the per-CPU array of pointers).
    
    x86_64 code before patch
    
    mov    1237577(%rip),%rax        # ffffffff803e5990 <rt_cache_stat>
    not    %rax  # part of per_cpu machinery
    mov    %gs:0x3c,%edx # get cpu number
    movslq %edx,%rdx # extend 32 bits cpu number to 64 bits
    mov    (%rax,%rdx,8),%rax # get the pointer for this cpu
    incl   0x38(%rax)
    
    x86_64 code after patch
    
    mov    $per_cpu__rt_cache_stat,%rdx
    mov    %gs:0x48,%rax # get percpu data offset
    incl   0x38(%rax,%rdx,1)
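    
    The difference between the two listings can be modelled in plain userspace C.
    This is only a sketch, not kernel code: NR_CPUS, struct rt_cache_stat_model and
    the helper names are hypothetical stand-ins, and ordinary arrays replace the
    real per-CPU machinery. The dynamic variant needs two dependent loads (the
    global pointer, then the per-CPU pointer) before the increment; the static
    variant is a base-plus-offset access.

```c
#include <assert.h>
#include <stdlib.h>

#define NR_CPUS 4                      /* hypothetical CPU count for this model */

/* Stand-in for the kernel's struct rt_cache_stat (assumed layout). */
struct rt_cache_stat_model {
    unsigned int in_hit;
    unsigned int pad[15];              /* 16 ints: 64 bytes with 4-byte int */
};

/* "Before": alloc_percpu()-like layout, a pointer to an array of per-CPU pointers. */
static struct rt_cache_stat_model **dyn_stat;

/* "After": DEFINE_PER_CPU()-like layout, storage reserved at link time. */
static struct rt_cache_stat_model static_stat[NR_CPUS];

/* Allocate the pointer array and one object per CPU; returns 0 on success.
 * On failure the partial allocations are not freed (acceptable for a sketch). */
static int dyn_stat_init(void)
{
    dyn_stat = calloc(NR_CPUS, sizeof(*dyn_stat));
    if (!dyn_stat)
        return -1;
    for (int cpu = 0; cpu < NR_CPUS; cpu++) {
        dyn_stat[cpu] = calloc(1, sizeof(**dyn_stat));
        if (!dyn_stat[cpu])
            return -1;
    }
    return 0;
}

/* Load dyn_stat, load dyn_stat[cpu], then increment the counter. */
static void dyn_inc(int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    dyn_stat[cpu]->in_hit++;
}

/* Address is computed from a link-time base plus an offset; no pointer loads. */
static void static_inc(int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    static_stat[cpu].in_hit++;
}
```

    In this model the dynamic variant also costs NR_CPUS * sizeof(void *) bytes for
    the pointer array plus the pointer itself, which the static variant avoids.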
    
    2) False sharing avoidance for SMP:
    For a small NR_CPUS, the array of per-CPU pointers allocated in alloc_percpu() 
    can be <= 32 bytes, which lets the slab code hand out only part of a cache line. 
    If the other part of this 64-byte (or 128-byte) cache line is used by a 
    frequently written object, we can have false sharing and expensive 
    per_cpu_ptr() operations.
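    
    The size argument above is simple arithmetic, sketched below. The helper names
    are hypothetical, and the check only says that the pointer array is smaller
    than one cache line, so an allocator may place an unrelated object in the rest
    of that line; it does not model the slab allocator itself.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes occupied by an array holding one pointer per CPU. */
static size_t percpu_ptr_array_bytes(unsigned int nr_cpus)
{
    return (size_t)nr_cpus * sizeof(void *);
}

/* Nonzero when the pointer array is smaller than one cache line, meaning
 * another object could end up in the remainder of the same line. */
static int may_share_cache_line(unsigned int nr_cpus, size_t cache_line_bytes)
{
    assert(cache_line_bytes > 0);
    return percpu_ptr_array_bytes(nr_cpus) < cache_line_bytes;
}
```

    With 8-byte pointers, NR_CPUS = 4 gives a 32-byte array, half of a 64-byte
    cache line.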
    
    rt_cache_stat is 64 bytes, so this patch does not risk too large an 
    increase of bss (in UP mode) or of static per-CPU data for SMP 
    (PERCPU_ENOUGH_ROOM is currently 32768 bytes).
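    
    To put the two figures above in proportion, here is a minimal sketch using
    only the sizes quoted in this message; the macro and function names are
    hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define PERCPU_ROOM_BYTES   32768u     /* PERCPU_ENOUGH_ROOM value quoted above */
#define RT_CACHE_STAT_BYTES 64u        /* size of rt_cache_stat quoted above */

/* Number of objects of obj_bytes that fit in an area of room_bytes. */
static size_t objects_that_fit(size_t room_bytes, size_t obj_bytes)
{
    assert(obj_bytes > 0);
    return room_bytes / obj_bytes;
}
```

    objects_that_fit(PERCPU_ROOM_BYTES, RT_CACHE_STAT_BYTES) is 512, so the new
    static object takes 1/512 (about 0.2%) of the static per-CPU area.
    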
    
    Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
route.c 78.1 KB