1. 13 10月, 2015 7 次提交
    • E
      tcp: shrink tcp_timewait_sock by 8 bytes · d475f090
      Eric Dumazet 提交于
      Reducing tcp_timewait_sock from 280 bytes to 272 bytes
      allows SLAB to pack 15 objects per page instead of 14 (on x86)
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d475f090
    • E
      net: shrink struct sock and request_sock by 8 bytes · ed53d0ab
      Eric Dumazet 提交于
      One 32bit hole is following skc_refcnt, use it.
      skc_incoming_cpu can also be an union for request_sock rcv_wnd.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ed53d0ab
    • E
      net: align sk_refcnt on 128 bytes boundary · 8e5eb54d
      Eric Dumazet 提交于
      sk->sk_refcnt is dirtied for every TCP/UDP incoming packet.
      This is a performance issue if multiple cpus hit a common socket,
      or multiple sockets are chained due to SO_REUSEPORT.
      
      By moving sk_refcnt 8 bytes further, first 128 bytes of sockets
      are mostly read. As they contain the lookup keys, this has
      a considerable performance impact, as cpus can cache them.
      
      These 8 bytes are not wasted, we use them as a place holder
      for various fields, depending on the socket type.
      
      Tested:
       SYN flood hitting a 16 RX queues NIC.
       TCP listener using 16 sockets and SO_REUSEPORT
       and SO_INCOMING_CPU for proper siloing.
      
       Could process 6.0 Mpps SYN instead of 4.2 Mpps
      
       Kernel profile looked like :
          11.68%  [kernel]  [k] sha_transform
           6.51%  [kernel]  [k] __inet_lookup_listener
           5.07%  [kernel]  [k] __inet_lookup_established
           4.15%  [kernel]  [k] memcpy_erms
           3.46%  [kernel]  [k] ipt_do_table
           2.74%  [kernel]  [k] fib_table_lookup
           2.54%  [kernel]  [k] tcp_make_synack
           2.34%  [kernel]  [k] tcp_conn_request
           2.05%  [kernel]  [k] __netif_receive_skb_core
           2.03%  [kernel]  [k] kmem_cache_alloc
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8e5eb54d
    • E
      net: SO_INCOMING_CPU setsockopt() support · 70da268b
      Eric Dumazet 提交于
      SO_INCOMING_CPU as added in commit 2c8c56e1 was a getsockopt() command
      to fetch incoming cpu handling a particular TCP flow after accept()
      
      This commits adds setsockopt() support and extends SO_REUSEPORT selection
      logic : If a TCP listener or UDP socket has this option set, a packet is
      delivered to this socket only if CPU handling the packet matches the specified
      one.
      
      This allows to build very efficient TCP servers, using one listener per
      RX queue, as the associated TCP listener should only accept flows handled
      in softirq by the same cpu.
      This provides optimal NUMA behavior and keep cpu caches hot.
      
      Note that __inet_lookup_listener() still has to iterate over the list of
      all listeners. Following patch puts sk_refcnt in a different cache line
      to let this iteration hit only shared and read mostly cache lines.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      70da268b
    • E
      sock: support per-packet fwmark · f28ea365
      Edward Jee 提交于
      It's useful to allow users to set fwmark for an individual packet,
      without changing the socket state. The function this patch adds in
      sock layer can be used by the protocols that need such a feature.
      Signed-off-by: NEdward Hyunkoo Jee <edjee@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f28ea365
    • A
      bpf: charge user for creation of BPF maps and programs · aaac3ba9
      Alexei Starovoitov 提交于
      since eBPF programs and maps use kernel memory consider it 'locked' memory
      from user accounting point of view and charge it against RLIMIT_MEMLOCK limit.
      This limit is typically set to 64Kbytes by distros, so almost all
      bpf+tracing programs would need to increase it, since they use maps,
      but kernel charges maximum map size upfront.
      For example the hash map of 1024 elements will be charged as 64Kbyte.
      It's inconvenient for current users and changes current behavior for root,
      but probably worth doing to be consistent root vs non-root.
      
      Similar accounting logic is done by mmap of perf_event.
      Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      aaac3ba9
    • A
      bpf: enable non-root eBPF programs · 1be7f75d
      Alexei Starovoitov 提交于
      In order to let unprivileged users load and execute eBPF programs
      teach verifier to prevent pointer leaks.
      Verifier will prevent
      - any arithmetic on pointers
        (except R10+Imm which is used to compute stack addresses)
      - comparison of pointers
        (except if (map_value_ptr == 0) ... )
      - passing pointers to helper functions
      - indirectly passing pointers in stack to helper functions
      - returning pointer from bpf program
      - storing pointers into ctx or maps
      
      Spill/fill of pointers into stack is allowed, but mangling
      of pointers stored in the stack or reading them byte by byte is not.
      
      Within bpf programs the pointers do exist, since programs need to
      be able to access maps, pass skb pointer to LD_ABS insns, etc
      but programs cannot pass such pointer values to the outside
      or obfuscate them.
      
      Only allow BPF_PROG_TYPE_SOCKET_FILTER unprivileged programs,
      so that socket filters (tcpdump), af_packet (quic acceleration)
      and future kcm can use it.
      tracing and tc cls/act program types still require root permissions,
      since tracing actually needs to be able to see all kernel pointers
      and tc is for root only.
      
      For example, the following unprivileged socket filter program is allowed:
      int bpf_prog1(struct __sk_buff *skb)
      {
        u32 index = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol));
        u64 *value = bpf_map_lookup_elem(&my_map, &index);
      
        if (value)
      	*value += skb->len;
        return 0;
      }
      
      but the following program is not:
      int bpf_prog1(struct __sk_buff *skb)
      {
        u32 index = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol));
        u64 *value = bpf_map_lookup_elem(&my_map, &index);
      
        if (value)
      	*value += (u64) skb;
        return 0;
      }
      since it would leak the kernel address into the map.
      
      Unprivileged socket filter bpf programs have access to the
      following helper functions:
      - map lookup/update/delete (but they cannot store kernel pointers into them)
      - get_random (it's already exposed to unprivileged user space)
      - get_smp_processor_id
      - tail_call into another socket filter program
      - ktime_get_ns
      
      The feature is controlled by sysctl kernel.unprivileged_bpf_disabled.
      This toggle defaults to off (0), but can be set true (1).  Once true,
      bpf programs and maps cannot be accessed from unprivileged process,
      and the toggle cannot be set back to false.
      Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
      Reviewed-by: NKees Cook <keescook@chromium.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1be7f75d
  2. 12 10月, 2015 2 次提交
  3. 11 10月, 2015 5 次提交
  4. 09 10月, 2015 10 次提交
  5. 08 10月, 2015 16 次提交