1. 13 Oct, 2015 10 commits
    • tcp: shrink tcp_timewait_sock by 8 bytes · d475f090
      Committed by Eric Dumazet
      Reducing tcp_timewait_sock from 280 bytes to 272 bytes
      allows SLAB to pack 15 objects per page instead of 14 (on x86)
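The packing claim can be checked with integer division. This is a minimal sketch that assumes a 4096-byte page and ignores any per-slab metadata or alignment padding; `objs_per_page` is a hypothetical helper for illustration, not a kernel function.

```c
#include <assert.h>
#include <stddef.h>

/* Number of obj_size-byte objects that fit in one page.
 * Assumption: no per-slab metadata and no alignment padding. */
size_t objs_per_page(size_t page_size, size_t obj_size)
{
    assert(obj_size > 0);
    return page_size / obj_size;
}
```

With a 4096-byte page, 14 objects of 280 bytes occupy 3920 bytes and leave too little room for a 15th, while 15 objects of 272 bytes occupy 4080 bytes.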
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: shrink struct sock and request_sock by 8 bytes · ed53d0ab
      Committed by Eric Dumazet
      A 32-bit hole follows skc_refcnt; use it.
      skc_incoming_cpu can also share a union with the request_sock rcv_wnd.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: align sk_refcnt on 128 bytes boundary · 8e5eb54d
      Committed by Eric Dumazet
      sk->sk_refcnt is dirtied for every TCP/UDP incoming packet.
      This is a performance issue if multiple cpus hit a common socket,
      or multiple sockets are chained due to SO_REUSEPORT.
      
      By moving sk_refcnt 8 bytes further, the first 128 bytes of a socket
      become read-mostly. As they contain the lookup keys, this has
      a considerable performance impact, as cpus can cache them.
      
      These 8 bytes are not wasted; we use them as a placeholder
      for various fields, depending on the socket type.
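The layout idea can be illustrated with a toy structure. This is only a sketch: `struct toy_sock` and its field names are hypothetical and do not reproduce the real `struct sock`; it merely shows a frequently written counter pushed to offset 128 so that the bytes before it are only read during lookups.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout, not the kernel's struct sock. */
struct toy_sock {
    uint32_t lookup_keys[30]; /* 120 bytes, read-mostly during lookup */
    uint64_t per_type_slot;   /* 8 bytes reused per socket type       */
    int      refcnt;          /* written on every incoming packet     */
};

/* Fail the build if the hot counter shares the first 128 bytes. */
_Static_assert(offsetof(struct toy_sock, refcnt) == 128,
               "refcnt must start on the 128-byte boundary");

size_t toy_refcnt_offset(void)
{
    return offsetof(struct toy_sock, refcnt);
}
```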
      
      Tested:
       SYN flood hitting a 16 RX queues NIC.
       TCP listener using 16 sockets and SO_REUSEPORT
       and SO_INCOMING_CPU for proper siloing.
      
       Could process 6.0 Mpps SYN instead of 4.2 Mpps
      
       Kernel profile looked like :
          11.68%  [kernel]  [k] sha_transform
           6.51%  [kernel]  [k] __inet_lookup_listener
           5.07%  [kernel]  [k] __inet_lookup_established
           4.15%  [kernel]  [k] memcpy_erms
           3.46%  [kernel]  [k] ipt_do_table
           2.74%  [kernel]  [k] fib_table_lookup
           2.54%  [kernel]  [k] tcp_make_synack
           2.34%  [kernel]  [k] tcp_conn_request
           2.05%  [kernel]  [k] __netif_receive_skb_core
           2.03%  [kernel]  [k] kmem_cache_alloc
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: SO_INCOMING_CPU setsockopt() support · 70da268b
      Committed by Eric Dumazet
      SO_INCOMING_CPU as added in commit 2c8c56e1 was a getsockopt() command
      to fetch the incoming cpu handling a particular TCP flow after accept().
      
      This commit adds setsockopt() support and extends SO_REUSEPORT selection
      logic : If a TCP listener or UDP socket has this option set, a packet is
      delivered to this socket only if CPU handling the packet matches the specified
      one.
      
      This allows building very efficient TCP servers, using one listener per
      RX queue, as the associated TCP listener should only accept flows handled
      in softirq by the same cpu.
      This provides optimal NUMA behavior and keeps cpu caches hot.
      
      Note that __inet_lookup_listener() still has to iterate over the list of
      all listeners. The following patch puts sk_refcnt in a different cache line
      to let this iteration hit only shared and read mostly cache lines.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • packet: support per-packet fwmark for af_packet sendmsg · c7d39e32
      Committed by Edward Jee
      Signed-off-by: Edward Hyunkoo Jee <edjee@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sock: support per-packet fwmark · f28ea365
      Committed by Edward Jee
      It's useful to allow users to set fwmark for an individual packet,
      without changing the socket state. The function this patch adds in
      sock layer can be used by the protocols that need such a feature.
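A per-packet mark of this kind is usually passed as ancillary data on sendmsg(). The sketch below builds such a control message, assuming it is expressed as an `SOL_SOCKET`/`SO_MARK` cmsg carrying a 32-bit value; `attach_fwmark` is a hypothetical helper, and the fallback value 36 for `SO_MARK` is the asm-generic definition.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

#ifndef SO_MARK
#define SO_MARK 36 /* asm-generic value; some architectures differ */
#endif

/* Attach one SOL_SOCKET/SO_MARK control message carrying `mark` to msg.
 * `buf` must be suitably aligned and hold at least
 * CMSG_SPACE(sizeof(uint32_t)) bytes. */
void attach_fwmark(struct msghdr *msg, void *buf, size_t buflen, uint32_t mark)
{
    struct cmsghdr *cmsg;

    assert(buflen >= CMSG_SPACE(sizeof(mark)));
    memset(buf, 0, buflen);
    msg->msg_control = buf;
    msg->msg_controllen = CMSG_SPACE(sizeof(mark));

    cmsg = CMSG_FIRSTHDR(msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SO_MARK;
    cmsg->cmsg_len = CMSG_LEN(sizeof(mark));
    memcpy(CMSG_DATA(cmsg), &mark, sizeof(mark));
}
```

The socket's own mark is left untouched; only the message sent with this `msghdr` carries the value. Setting a mark typically requires CAP_NET_ADMIN.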
      Signed-off-by: Edward Hyunkoo Jee <edjee@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'bpf-unprivileged' · c1bf5fe0
      Committed by David S. Miller
      Alexei Starovoitov says:
      
      ====================
      bpf: unprivileged
      
      v1-v2:
      - this set logically depends on cb patch
        "bpf: fix cb access in socket filter programs":
        http://patchwork.ozlabs.org/patch/527391/
        which is a must-have to allow unprivileged programs.
        Thanks Daniel for finding that issue.
      - refactored sysctl to be similar to 'modules_disabled'
      - dropped bpf_trace_printk
      - split tests into separate patch and added more tests
        based on discussion
      
      v1 cover letter:
      I think it is time to liberate eBPF from CAP_SYS_ADMIN.
      As was discussed when eBPF was first introduced two years ago
      the only piece missing in eBPF verifier is 'pointer leak detection'
      to make it available to non-root users.
      Patch 1 adds this pointer analysis.
      The eBPF programs, obviously, need to see and operate on kernel addresses,
      but with these extra checks they won't be able to pass these addresses
      to user space.
      Patch 2 adds accounting of kernel memory used by programs and maps.
      It changes behavior for existing root users, but I think it needs
      to be done consistently for both root and non-root, since today
      programs and maps are only limited by number of open FDs (RLIMIT_NOFILE).
      Patch 2 accounts program's and map's kernel memory as RLIMIT_MEMLOCK.
      
      Unprivileged eBPF is only meaningful for 'socket filter'-like programs.
      eBPF programs for tracing and TC classifiers/actions will stay root only.
      
      In parallel the bpf fuzzing effort is ongoing and so far
      we've found only one verifier bug and that was already fixed.
      The 'constant blinding' pass is also being worked on.
      It will obfuscate constant-like values that are part of eBPF ISA
      to make jit spraying attacks even harder.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: add unprivileged bpf tests · bf508877
      Committed by Alexei Starovoitov
      Add new tests samples/bpf/test_verifier:
      
      unpriv: return pointer
        checks that pointer cannot be returned from the eBPF program
      
      unpriv: add const to pointer
      unpriv: add pointer to pointer
      unpriv: neg pointer
        checks that pointer arithmetic is disallowed
      
      unpriv: cmp pointer with const
      unpriv: cmp pointer with pointer
        checks that comparison of pointers is disallowed
        The only case allowed is 'void *value = bpf_map_lookup_elem(..); if (value == 0) ...'
      
      unpriv: check that printk is disallowed
        since bpf_trace_printk is not available to unprivileged
      
      unpriv: pass pointer to helper function
        checks that pointers cannot be passed to functions that expect integers
        If function expects a pointer the verifier allows only that type of pointer.
        Like 1st argument of bpf_map_lookup_elem() must be pointer to map.
        (applies to non-root as well)
      
      unpriv: indirectly pass pointer on stack to helper function
        checks that pointer stored into stack cannot be used as part of key
        passed into bpf_map_lookup_elem()
      
      unpriv: mangle pointer on stack 1
      unpriv: mangle pointer on stack 2
        checks that writing into stack slot that already contains a pointer
        is disallowed
      
      unpriv: read pointer from stack in small chunks
        checks that < 8 byte read from stack slot that contains a pointer is
        disallowed
      
      unpriv: write pointer into ctx
        checks that storing pointers into skb->fields is disallowed
      
      unpriv: write pointer into map elem value
        checks that storing pointers into element values is disallowed
        For example:
        int bpf_prog(struct __sk_buff *skb)
        {
          u32 key = 0;
          u64 *value = bpf_map_lookup_elem(&map, &key);
          if (value)
             *value = (u64) skb;
          return 0;
        }
        will be rejected.
      
      unpriv: partial copy of pointer
        checks that doing 32-bit register mov from register containing
        a pointer is disallowed
      
      unpriv: pass pointer to tail_call
        checks that passing pointer as an index into bpf_tail_call
        is disallowed
      
      unpriv: cmp map pointer with zero
        checks that comparing map pointer with constant is disallowed
      
      unpriv: write into frame pointer
        checks that frame pointer is read-only (applies to root too)
      
      unpriv: cmp of frame pointer
        checks that R10 cannot be used in a comparison
      
      unpriv: cmp of stack pointer
        checks that Rx = R10 - imm is ok, but comparing Rx is not
      
      unpriv: obfuscate stack pointer
        checks that Rx = R10 - imm is ok, but Rx -= imm is not
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: charge user for creation of BPF maps and programs · aaac3ba9
      Committed by Alexei Starovoitov
      Since eBPF programs and maps use kernel memory, consider it 'locked' memory
      from the user accounting point of view and charge it against the RLIMIT_MEMLOCK limit.
      This limit is typically set to 64 Kbytes by distros, so almost all
      bpf+tracing programs would need to increase it, since they use maps
      and the kernel charges the maximum map size upfront.
      For example, a hash map of 1024 elements will be charged as 64 Kbytes.
      It's inconvenient for current users and changes current behavior for root,
      but probably worth doing to be consistent root vs non-root.
      
      Similar accounting logic is done by mmap of perf_event.
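Whether a given charge fits under the limit can be estimated from user space before creating a map. This is a rough sketch: `memlock_charge_fits` and `current_memlock_limit` are hypothetical helpers, the charge is supplied by the caller rather than derived from a map definition, and the kernel's own accounting is not reproduced here.

```c
#include <stdbool.h>
#include <sys/resource.h>

/* True if `charge` more bytes fit under `limit` when `in_use` bytes
 * are already charged. An unlimited rlimit always fits. */
bool memlock_charge_fits(rlim_t in_use, rlim_t charge, rlim_t limit)
{
    if (limit == RLIM_INFINITY)
        return true;
    return in_use <= limit && charge <= limit - in_use;
}

/* Soft RLIMIT_MEMLOCK of the calling process, or 0 if it cannot be read. */
rlim_t current_memlock_limit(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
        return 0;
    return rl.rlim_cur;
}
```

With a 64 Kbyte limit, the 64 Kbyte charge from the example above fits only if nothing else is already charged, which is why most map users would need to raise the limit.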
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: enable non-root eBPF programs · 1be7f75d
      Committed by Alexei Starovoitov
      In order to let unprivileged users load and execute eBPF programs,
      teach the verifier to prevent pointer leaks.
      The verifier will prevent
      - any arithmetic on pointers
        (except R10+Imm which is used to compute stack addresses)
      - comparison of pointers
        (except if (map_value_ptr == 0) ... )
      - passing pointers to helper functions
      - indirectly passing pointers in stack to helper functions
      - returning pointer from bpf program
      - storing pointers into ctx or maps
      
      Spill/fill of pointers into stack is allowed, but mangling
      of pointers stored in the stack or reading them byte by byte is not.
      
      Within bpf programs the pointers do exist, since programs need to
      be able to access maps, pass skb pointer to LD_ABS insns, etc
      but programs cannot pass such pointer values to the outside
      or obfuscate them.
      
      Only allow BPF_PROG_TYPE_SOCKET_FILTER unprivileged programs,
      so that socket filters (tcpdump), af_packet (quic acceleration)
      and future kcm can use it.
      tracing and tc cls/act program types still require root permissions,
      since tracing actually needs to be able to see all kernel pointers
      and tc is for root only.
      
      For example, the following unprivileged socket filter program is allowed:
      int bpf_prog1(struct __sk_buff *skb)
      {
        u32 index = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol));
        u64 *value = bpf_map_lookup_elem(&my_map, &index);
      
        if (value)
      	*value += skb->len;
        return 0;
      }
      
      but the following program is not:
      int bpf_prog1(struct __sk_buff *skb)
      {
        u32 index = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol));
        u64 *value = bpf_map_lookup_elem(&my_map, &index);
      
        if (value)
      	*value += (u64) skb;
        return 0;
      }
      since it would leak the kernel address into the map.
      
      Unprivileged socket filter bpf programs have access to the
      following helper functions:
      - map lookup/update/delete (but they cannot store kernel pointers into them)
      - get_random (it's already exposed to unprivileged user space)
      - get_smp_processor_id
      - tail_call into another socket filter program
      - ktime_get_ns
      
      The feature is controlled by sysctl kernel.unprivileged_bpf_disabled.
      This toggle defaults to off (0), but can be set true (1).  Once true,
      bpf programs and maps cannot be accessed from unprivileged process,
      and the toggle cannot be set back to false.
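The toggle can be inspected from user space through procfs. The sketch below is a generic reader for single-integer sysctl files; `read_sysctl_int` is a hypothetical helper, and the path `/proc/sys/kernel/unprivileged_bpf_disabled` follows the usual mapping of `kernel.*` sysctls onto `/proc/sys/kernel/`.

```c
#include <stdio.h>

/* Read one integer from a sysctl-style file.
 * Returns 0 and stores the value in *out on success,
 * -1 if the file is missing or does not start with an integer. */
int read_sysctl_int(const char *path, int *out)
{
    FILE *f = fopen(path, "r");
    int ok;

    if (!f)
        return -1;
    ok = (fscanf(f, "%d", out) == 1);
    fclose(f);
    return ok ? 0 : -1;
}
```

Calling `read_sysctl_int("/proc/sys/kernel/unprivileged_bpf_disabled", &v)` returns -1 on kernels without the sysctl; otherwise `v` holds the current setting.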
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 12 Oct, 2015 9 commits
  3. 11 Oct, 2015 11 commits
  4. 09 Oct, 2015 10 commits
    • Merge branch 'net-non-modular' · d49ae37c
      Committed by David S. Miller
      Paul Gortmaker says:
      
      ====================
      make non-modular code explicitly non-modular
      
      [v2: drop m68k patches that Geert converted to modules; add one ARM
       driver patch; update net-next baseline to today; switch to ARM
       for build testing.]
      
      In a previous merge window, we made changes to allow better
      delineation between modular and non-modular code in commit
      0fd972a7 ("module: relocate module_init
      from init.h to module.h").  This allows us to now ensure module code
      looks modular and non-modular code does not accidentally look modular
      just to avoid suffering build breakage.
      
      Here we target code that is, by nature of their Makefile and/or
      Kconfig settings, only available to be built-in, but implicitly
      presenting itself as being possibly modular by way of using modular
      headers, macros, and functions.
      
      The goal here is to remove that illusion of modularity from these
      files, but in a way that leaves the actual runtime unchanged.
      In doing so, we remove code that has never been tested and adds
      no value to the tree.  And we continue the process of expecting a
      level of consistency between the Kconfig/Makefile of code and the
      code in use itself.
      
      Fortunately the net subsystem has relatively few instances, given
      the overall amount of code and drivers it contains.  For comparison
      there are over 300 instances tree wide, resulting in a possible net
      removal of on the order of 5000 lines of unused code.
      
      Build tested on net-next from today, on ARM, since that is the arch
      where the one ethernet driver changed here is available.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • drivers/net/ethernet: make ti/cpsw-phy-sel.c explicitly non-modular · b3c8ec35
      Committed by Paul Gortmaker
      The Kconfig currently controlling compilation of this code is:
      
      drivers/net/ethernet/ti/Kconfig:config TI_CPSW_PHY_SEL
      drivers/net/ethernet/ti/Kconfig:        bool "TI CPSW Switch Phy sel Support"
      
      ...meaning that it currently is not being built as a module by anyone.
      
      Let's remove the couple of traces of modularity so that when reading the
      driver there is no doubt it is builtin-only.
      
      Since module_platform_driver() uses the same init level priority as
      builtin_platform_driver() the init ordering remains unchanged with
      this commit.
      
      Also note that MODULE_DEVICE_TABLE is a no-op for non-modular code.
      
      We also delete the MODULE_LICENSE tag etc. since all that information
      was (or is now) contained at the top of the file in the comments.
      
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Varka Bhadram <varkabhadram@gmail.com>
      Cc: netdev@vger.kernel.org
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/sched: make sch_blackhole.c explicitly non-modular · 075640e3
      Committed by Paul Gortmaker
      The Kconfig currently controlling compilation of this code is:
      
      net/sched/Kconfig:menuconfig NET_SCHED
      net/sched/Kconfig:      bool "QoS and/or fair queueing"
      
      ...meaning that it currently is not being built as a module by anyone.
      
      Let's remove the modular code that is essentially orphaned, so that
      when reading the driver there is no doubt it is builtin-only.
      
      Since module_init translates to device_initcall in the non-modular
      case, the init ordering remains unchanged with this commit.  We can
      change to one of the other priority initcalls (subsys?) at any later
      date, if desired.
      
      We also delete the MODULE_LICENSE tag since all that information
      is already contained at the top of the file in the comments.
      
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: netdev@vger.kernel.org
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/dcb: make dcbnl.c explicitly non-modular · 36b9ad80
      Committed by Paul Gortmaker
      The Kconfig currently controlling compilation of this code is:
      
      net/dcb/Kconfig:config DCB
      net/dcb/Kconfig:        bool "Data Center Bridging support"
      
      ...meaning that it currently is not being built as a module by anyone.
      
      Let's remove the modular code that is essentially orphaned, so that
      when reading the driver there is no doubt it is builtin-only.
      
      Since module_init translates to device_initcall in the non-modular
      case, the init ordering remains unchanged with this commit.  We can
      change to one of the other priority initcalls (subsys?) at any later
      date, if desired.
      
      We also delete the MODULE_LICENSE tag etc. since all that information
      is (or is now) already contained at the top of the file in the comments.
      
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Or Gerlitz <ogerlitz@mellanox.com>
      Cc: Anish Bhatt <anish@chelsio.com>
      Cc: John Fastabend <john.r.fastabend@intel.com>
      Cc: Shani Michaeli <shanim@mellanox.com>
      Cc: netdev@vger.kernel.org
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/core: make sock_diag.c explicitly non-modular · b6191aee
      Committed by Paul Gortmaker
      The Makefile currently controlling compilation of this code lists
      it under "obj-y" ...meaning that it currently is not being built as
      a module by anyone.
      
      Let's remove the modular code that is essentially orphaned, so that
      when reading the driver there is no doubt it is builtin-only.
      
      Since module_init translates to device_initcall in the non-modular
      case, the init ordering remains unchanged with this commit.  We can
      change to one of the other priority initcalls (subsys?) at any later
      date, if desired.
      
      We can't remove module.h since the file uses other module related
      stuff even though it is not modular itself.
      
      We move the information from the MODULE_LICENSE tag to the top of the
      file, since that information is not captured anywhere else.  The
      MODULE_ALIAS_NET_PF_PROTO becomes a no-op in the non modular case, so
      it is removed.
      
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Alexei Starovoitov <ast@plumgrid.com>
      Cc: Craig Gallek <kraig@google.com>
      Cc: netdev@vger.kernel.org
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'net-bool' · 4d886d65
      Committed by David S. Miller
      Yaowei Bai says:
      
      ====================
      net: small improvement
      
      This patchset makes several functions in net return bool to improve
      readability and/or simplicity because these functions only use one
      or zero as their return value.
      
      No functional changes.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/core: lockdep_rtnl_is_held can be boolean · 0cbf3343
      Committed by Yaowei Bai
      This patch makes lockdep_rtnl_is_held return bool due to this
      particular function only using either one or zero as its return
      value.
      
      In another patch lockdep_is_held is also made to return bool.
      
      No functional change.
      Signed-off-by: Yaowei Bai <bywxiaobai@163.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/inetdevice: bad_mask can be boolean · f06cc7b2
      Committed by Yaowei Bai
      This patch makes bad_mask return bool due to this particular function
      only using either one or zero as its return value.
      
      No functional change.
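A predicate of this kind reads naturally with a bool return type. The following user-space sketch is an illustration only: `bad_mask_sketch` is a hypothetical function, written on the assumption that a mask is considered bad when it is not a contiguous run of leading one bits or when the address has bits set outside the mask. It is not a copy of the kernel helper.

```c
#include <arpa/inet.h>
#include <stdbool.h>
#include <stdint.h>

/* Both arguments are in network byte order.
 * Returns true if the mask is non-contiguous or the address
 * has bits set in the host part. */
bool bad_mask_sketch(uint32_t mask_be, uint32_t addr_be)
{
    uint32_t host_bits_be = ~mask_be;
    uint32_t h;

    if (addr_be & host_bits_be)
        return true;

    /* For a contiguous mask the host bits are 2^n - 1, so h + 1 is a
     * power of two (or wraps to 0) and shares no bits with h. */
    h = ntohl(host_bits_be);
    return (h & (h + 1)) != 0;
}
```

Callers can then write `if (bad_mask_sketch(mask, addr))` without comparing against 0 or 1.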
      Signed-off-by: Yaowei Bai <bywxiaobai@163.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/inetdevice: inet_ifa_match can be boolean · c3225164
      Committed by Yaowei Bai
      This patch makes inet_ifa_match return bool due to this
      particular function only using either one or zero as its return
      value.
      
      No functional change.
      Signed-off-by: Yaowei Bai <bywxiaobai@163.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/dccp: dccp_bad_service_code can be boolean · 45ae74f5
      Committed by Yaowei Bai
      This patch makes dccp_bad_service_code return bool due to this
      particular function only using either one or zero as its return
      value.
      
      dccp_list_has_service is also made to return bool in this patchset.
      
      No functional change.
      Signed-off-by: Yaowei Bai <bywxiaobai@163.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>