• D
    bpf: avoid false sharing of map refcount with max_entries · be95a845
    Daniel Borkmann 提交于
    In addition to commit b2157399 ("bpf: prevent out-of-bounds
    speculation") also change the layout of struct bpf_map such that
    false sharing of fast-path members like max_entries is avoided
    when the maps reference counter is altered. Therefore enforce
    them to be placed into separate cachelines.
    
    pahole dump after change:
    
      struct bpf_map {
            const struct bpf_map_ops  * ops;                 /*     0     8 */
            struct bpf_map *           inner_map_meta;       /*     8     8 */
            void *                     security;             /*    16     8 */
            enum bpf_map_type          map_type;             /*    24     4 */
            u32                        key_size;             /*    28     4 */
            u32                        value_size;           /*    32     4 */
            u32                        max_entries;          /*    36     4 */
            u32                        map_flags;            /*    40     4 */
            u32                        pages;                /*    44     4 */
            u32                        id;                   /*    48     4 */
            int                        numa_node;            /*    52     4 */
            bool                       unpriv_array;         /*    56     1 */
    
            /* XXX 7 bytes hole, try to pack */
    
            /* --- cacheline 1 boundary (64 bytes) --- */
            struct user_struct *       user;                 /*    64     8 */
            atomic_t                   refcnt;               /*    72     4 */
            atomic_t                   usercnt;              /*    76     4 */
            struct work_struct         work;                 /*    80    32 */
            char                       name[16];             /*   112    16 */
            /* --- cacheline 2 boundary (128 bytes) --- */
    
            /* size: 128, cachelines: 2, members: 17 */
            /* sum members: 121, holes: 1, sum holes: 7 */
      };
    
    Now all entries in the first cacheline are read only throughout
    the life time of the map, set up once during map creation. Overall
    struct size and number of cachelines doesn't change from the
    reordering. struct bpf_map is usually first member and embedded
    in map structs in specific map implementations, so also avoid those
    members to sit at the end where it could potentially share the
    cacheline with first map values e.g. in the array since remote
    CPUs could trigger map updates just as well for those (easily
    dirtying members like max_entries intentionally as well) while
    having subsequent values in cache.
    
    Quoting from Google's Project Zero blog [1]:
    
      Additionally, at least on the Intel machine on which this was
      tested, bouncing modified cache lines between cores is slow,
      apparently because the MESI protocol is used for cache coherence
      [8]. Changing the reference counter of an eBPF array on one
      physical CPU core causes the cache line containing the reference
      counter to be bounced over to that CPU core, making reads of the
      reference counter on all other CPU cores slow until the changed
      reference counter has been written back to memory. Because the
      length and the reference counter of an eBPF array are stored in
      the same cache line, this also means that changing the reference
      counter on one physical CPU core causes reads of the eBPF array's
      length to be slow on other physical CPU cores (intentional false
      sharing).
    
    While this doesn't 'control' the out-of-bounds speculation through
    masking the index as in commit b2157399, triggering a manipulation
    of the map's reference counter is really trivial, so lets not allow
    to easily affect max_entries from it.
    
    Splitting to separate cachelines also generally makes sense from
    a performance perspective anyway in that fast-path won't have a
    cache miss if the map gets pinned, reused in other progs, etc out
    of control path, thus also avoids unintentional false sharing.
    
      [1] https://googleprojectzero.blogspot.ch/2018/01/reading-privileged-memory-with-side.htmlSigned-off-by: NDaniel Borkmann <daniel@iogearbox.net>
    Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
    be95a845
bpf.h 17.9 KB