1. 10 Nov, 2015 1 commit
  2. 05 Nov, 2015 1 commit
    • net/core: ensure features get disabled on new lower devs · e7868a85
      Committed by Jarod Wilson
      With moving netdev_sync_lower_features() after the .ndo_set_features
      calls, I neglected to verify that devices added *after* a flag had been
      disabled on an upper device were properly added with that flag disabled as
      well. Currently they are not, because we exit __netdev_update_features()
      when we see dev->features == features for the upper dev. We can retain the
      optimization of leaving without calling .ndo_set_features with a bit of
      tweaking and a goto here.
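
A minimal user-space sketch of the control flow described above, not the kernel code itself. When the requested features already match, the simulated driver callback is skipped, but the jump still lands on the step that syncs lower devices. The `ex_` names, the struct layout, and the bit value are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EX_F_LRO          (UINT64_C(1) << 15) /* hypothetical feature bit */
#define EX_UPPER_DISABLES EX_F_LRO            /* features that propagate down */

struct ex_dev {
	uint64_t features;
	struct ex_dev *lower[4]; /* devices stacked beneath this one */
	size_t n_lower;
	int set_calls;           /* counts simulated driver callbacks */
};

/* Clear, on every lower device, each propagating feature that is off on upper. */
static void ex_sync_lower(struct ex_dev *upper)
{
	uint64_t off = EX_UPPER_DISABLES & ~upper->features;

	for (size_t i = 0; i < upper->n_lower; i++)
		upper->lower[i]->features &= ~off;
}

void ex_update_features(struct ex_dev *dev, uint64_t wanted)
{
	assert(dev != NULL);
	if (dev->features == wanted)
		goto sync_lower; /* nothing to set, but lower devs still need a sync */

	dev->set_calls++;        /* stands in for the driver callback */
	dev->features = wanted;

sync_lower:
	ex_sync_lower(dev);
}
```

In this model, a device attached under an upper that already has the flag off gets the flag cleared on the next update, even though the upper's own feature set did not change.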
      
      Fixes: fd867d51 ("net/core: generic support for disabling netdev features down stack")
      CC: "David S. Miller" <davem@davemloft.net>
      CC: Eric Dumazet <edumazet@google.com>
      CC: Jay Vosburgh <j.vosburgh@gmail.com>
      CC: Veaceslav Falico <vfalico@gmail.com>
      CC: Andy Gospodarek <gospo@cumulusnetworks.com>
      CC: Jiri Pirko <jiri@resnulli.us>
      CC: Nikolay Aleksandrov <razor@blackwall.org>
      CC: Michal Kubecek <mkubecek@suse.cz>
      CC: Alexander Duyck <alexander.duyck@gmail.com>
      CC: netdev@vger.kernel.org
      Reported-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: Jarod Wilson <jarod@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 04 Nov, 2015 2 commits
    • net/core: fix for_each_netdev_feature · 5ba3f7d6
      Committed by Jarod Wilson
      As pointed out by Nikolay and further explained by Geert, the initial
      for_each_netdev_feature macro was broken, as feature would get set outside
      of the block of code it was intended to run in, thus only ever working for
      the first feature bit in the mask. While less pretty this way, this is
      tested and confirmed functional with multiple feature bits set in
      NETIF_F_UPPER_DISABLES.
      
      [root@dell-per730-01 ~]# ethtool -K bond0 lro off
      ...
      [  242.761394] bond0: Disabling feature 0x0000000000008000 on lower dev p5p2.
      [  243.552178] bnx2x 0000:06:00.1 p5p2: using MSI-X  IRQs: sp 74  fp[0] 76 ... fp[7] 83
      [  244.353978] bond0: Disabling feature 0x0000000000008000 on lower dev p5p1.
      [  245.147420] bnx2x 0000:06:00.0 p5p1: using MSI-X  IRQs: sp 62  fp[0] 64 ... fp[7] 71
      
      [root@dell-per730-01 ~]# ethtool -K bond0 gro off
      ...
      [  251.925645] bond0: Disabling feature 0x0000000000004000 on lower dev p5p2.
      [  252.713693] bnx2x 0000:06:00.1 p5p2: using MSI-X  IRQs: sp 74  fp[0] 76 ... fp[7] 83
      [  253.499085] bond0: Disabling feature 0x0000000000004000 on lower dev p5p1.
      [  254.290922] bnx2x 0000:06:00.0 p5p1: using MSI-X  IRQs: sp 62  fp[0] 64 ... fp[7] 71
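
A rough user-space illustration of iterating over every set bit in a feature mask, with the feature value computed inside the loop body so that each bit is handled in turn. This is not the kernel macro; the `ex_` names are hypothetical, and the two bit positions are simply read off the log output above (0x4000 and 0x8000).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the two feature bits seen in the log. */
#define EX_F_GRO (UINT64_C(1) << 14)
#define EX_F_LRO (UINT64_C(1) << 15)

/* Visit each set bit in mask, lowest first. */
#define ex_for_each_feature_bit(mask, bit) \
	for ((bit) = 0; (bit) < 64; (bit)++) \
		if ((mask) & (UINT64_C(1) << (bit)))

/* Store each feature found in mask into out[]; returns how many were stored. */
size_t ex_collect_features(uint64_t mask, uint64_t *out, size_t cap)
{
	size_t n = 0;
	int bit;

	assert(out != NULL);
	ex_for_each_feature_bit(mask, bit) {
		/* The feature value is derived inside the body, once per bit. */
		uint64_t feature = UINT64_C(1) << bit;

		if (n < cap)
			out[n++] = feature;
	}
	return n;
}
```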
      
      Fixes: fd867d51 ("net/core: generic support for disabling netdev features down stack")
      CC: "David S. Miller" <davem@davemloft.net>
      CC: Eric Dumazet <edumazet@google.com>
      CC: Jay Vosburgh <j.vosburgh@gmail.com>
      CC: Veaceslav Falico <vfalico@gmail.com>
      CC: Andy Gospodarek <gospo@cumulusnetworks.com>
      CC: Jiri Pirko <jiri@resnulli.us>
      CC: Nikolay Aleksandrov <razor@blackwall.org>
      CC: Michal Kubecek <mkubecek@suse.cz>
      CC: Alexander Duyck <alexander.duyck@gmail.com>
      CC: Geert Uytterhoeven <geert@linux-m68k.org>
      CC: netdev@vger.kernel.org
      Signed-off-by: Jarod Wilson <jarod@redhat.com>
      Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ptp: Change ptp_class to a proper bitmask · 5f94c943
      Committed by Stefan Sørensen
      Change the definition of PTP_CLASS_L2 to not have any bits overlapping with
      the other defined protocol values, allowing the PTP_CLASS_* definitions to
      be used for simple filtering on packet type.
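
To see why non-overlapping bits matter for filtering, here is a small sketch. The constants are illustrative only and are not copied from the kernel header; the `EX_` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative values, chosen only to show the overlap problem. */
#define EX_CLASS_IPV4       0x10u
#define EX_CLASS_IPV6       0x20u
#define EX_CLASS_L2_OVERLAP 0x30u /* shares bits with IPV4 and IPV6 */
#define EX_CLASS_L2         0x40u /* has a bit of its own */

/* Simple filter: does the classification contain this protocol bit? */
bool ex_class_has(unsigned int ptp_class, unsigned int proto)
{
	assert(proto != 0);
	return (ptp_class & proto) != 0;
}
```

With the overlapping value, an IPv4 packet also matches the L2 test; with a dedicated bit, a plain AND is enough to tell the protocols apart.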
      Signed-off-by: Stefan Sørensen <stefan.sorensen@spectralink.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 03 Nov, 2015 2 commits
    • net/core: generic support for disabling netdev features down stack · fd867d51
      Committed by Jarod Wilson
      There are some netdev features, which when disabled on an upper device,
      such as a bonding master or a bridge, must be disabled and cannot be
      re-enabled on underlying devices.
      
      This is a rework of an earlier, more heavy-handed approach, which simply
      disables and prevents re-enabling of netdev features listed in a new
      define in include/net/netdev_features.h, NETIF_F_UPPER_DISABLES. When any
      upper device disables a flag in that feature mask, the change propagates
      down the stack, and any lower device that has an upper device with one of
      those flags disabled should not be able to enable that flag.
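
The "cannot be re-enabled" rule can be sketched as a pure function over feature masks. This is a user-space model under stated assumptions, not the kernel implementation: the `ex_`/`EX_` names are hypothetical, and only LRO is placed in the propagating mask.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EX_F_GRO          (UINT64_C(1) << 14)
#define EX_F_LRO          (UINT64_C(1) << 15)
#define EX_UPPER_DISABLES EX_F_LRO /* assumption: only LRO propagates */

/* Return the feature set a lower device may actually enable, given what it
 * asked for and the feature sets of all devices stacked above it. */
uint64_t ex_fix_lower_features(uint64_t requested,
			       const uint64_t *upper_features, size_t n_upper)
{
	assert(upper_features != NULL || n_upper == 0);
	for (size_t i = 0; i < n_upper; i++) {
		/* Propagating features that this upper has switched off. */
		uint64_t off = EX_UPPER_DISABLES & ~upper_features[i];

		requested &= ~off;
	}
	return requested;
}
```

A single upper device with the flag off is enough to veto it; features outside the propagating mask are left alone.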
      
      Initially, only LRO is included for proof of concept, and because this
      code effectively does the same thing as dev_disable_lro(), though it will
      also activate from the ethtool path, which was one of the goals here.
      
      [root@dell-per730-01 ~]# ethtool -k bond0 |grep large
      large-receive-offload: on
      [root@dell-per730-01 ~]# ethtool -k p5p1 |grep large
      large-receive-offload: on
      [root@dell-per730-01 ~]# ethtool -K bond0 lro off
      [root@dell-per730-01 ~]# ethtool -k bond0 |grep large
      large-receive-offload: off
      [root@dell-per730-01 ~]# ethtool -k p5p1 |grep large
      large-receive-offload: off
      
      dmesg dump:
      
      [ 1033.277986] bond0: Disabling feature 0x0000000000008000 on lower dev p5p2.
      [ 1034.067949] bnx2x 0000:06:00.1 p5p2: using MSI-X  IRQs: sp 74  fp[0] 76 ... fp[7] 83
      [ 1034.753612] bond0: Disabling feature 0x0000000000008000 on lower dev p5p1.
      [ 1035.591019] bnx2x 0000:06:00.0 p5p1: using MSI-X  IRQs: sp 62  fp[0] 64 ... fp[7] 71
      
      This has been successfully tested with bnx2x, qlcnic and netxen network
      cards as slaves in a bond interface. Turning LRO on or off on the master
      also turns it on or off on each of the slaves, new slaves are added with
      LRO in the same state as the master, and LRO can't be toggled on the
      slaves.
      
      Also, this should largely remove the need for dev_disable_lro(), and most,
      if not all, of its call sites can be replaced by simply making sure
      NETIF_F_LRO isn't included in the relevant device's feature flags.
      
      Note that this patch is driven by bug reports from users saying it was
      confusing that bonds and slaves had different settings for the same
      features, and while it won't be 100% in sync if a lower device doesn't
      support a feature like LRO, I think this is a good step in the right
      direction.
      
      CC: "David S. Miller" <davem@davemloft.net>
      CC: Eric Dumazet <edumazet@google.com>
      CC: Jay Vosburgh <j.vosburgh@gmail.com>
      CC: Veaceslav Falico <vfalico@gmail.com>
      CC: Andy Gospodarek <gospo@cumulusnetworks.com>
      CC: Jiri Pirko <jiri@resnulli.us>
      CC: Nikolay Aleksandrov <razor@blackwall.org>
      CC: Michal Kubecek <mkubecek@suse.cz>
      CC: Alexander Duyck <alexander.duyck@gmail.com>
      CC: netdev@vger.kernel.org
      Signed-off-by: Jarod Wilson <jarod@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: make skb_set_owner_w() more robust · 9e17f8a4
      Committed by Eric Dumazet
      skb_set_owner_w() is called from various places that assume
      skb->sk always points to a full-blown socket (as it changes
      sk->sk_wmem_alloc).
      
      We'd like to attach skbs to request sockets, and in the future
      to timewait sockets as well. For these kinds of pseudo sockets,
      we need to take a traditional refcount and use sock_edemux()
      as the destructor.
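
A toy model of the two ownership paths described above, assuming simplified structs in place of the kernel's `struct sock` and `struct sk_buff`. All `ex_` names are hypothetical; `ex_ref_put` only stands in for the role the text assigns to sock_edemux().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct ex_sock {
	bool full;      /* false for request/timewait-like pseudo sockets */
	int refcnt;
	int wmem_alloc; /* write memory charged to a full socket */
};

struct ex_skb {
	struct ex_sock *sk;
	int truesize;
	void (*destructor)(struct ex_skb *);
};

/* Full socket: give back the write memory that was charged. */
static void ex_write_free(struct ex_skb *skb) { skb->sk->wmem_alloc -= skb->truesize; }
/* Pseudo socket: just drop the plain reference. */
static void ex_ref_put(struct ex_skb *skb) { skb->sk->refcnt--; }

void ex_set_owner_w(struct ex_skb *skb, struct ex_sock *sk)
{
	assert(skb != NULL && sk != NULL);
	skb->sk = sk;
	if (!sk->full) {
		/* No write-memory accounting exists here; hold a refcount. */
		sk->refcnt++;
		skb->destructor = ex_ref_put;
		return;
	}
	sk->wmem_alloc += skb->truesize;
	skb->destructor = ex_write_free;
}
```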
      
      It is now time to un-inline skb_set_owner_w(), as it has become too big.
      
      Fixes: ca6fb065 ("tcp: attach SYNACK messages to request sockets instead of listener")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Bisected-by: Haiyang Zhang <haiyangz@microsoft.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 28 Oct, 2015 1 commit
  6. 27 Oct, 2015 1 commit
  7. 23 Oct, 2015 2 commits
  8. 22 Oct, 2015 1 commit
    • netlink: Rightsize IFLA_AF_SPEC size calculation · b1974ed0
      Committed by Arad, Ronen
      if_nlmsg_size() overestimates the minimum allocation size of netlink
      dump request (when called from rtnl_calcit()) or the size of the
      message (when called from rtnl_getlink()). This is because
      ext_filter_mask is not supported by rtnl_link_get_af_size() and
      rtnl_link_get_size().
      
      The over-estimation is significant when at least one netdev has many
      VLANs configured (8 bytes for each configured VLAN).
      
      This patch-set "rightsizes" the protocol-specific attribute size
      calculation by propagating ext_filter_mask to rtnl_link_get_af_size()
      and adding it as an argument to the get_link_af_size op in rtnl_af_ops.
      
      The bridge module already used filtering-aware sizing for notifications.
      br_get_link_af_size_filtered() is consistent with the modified
      get_link_af_size op, so it replaces br_get_link_af_size() in br_af_ops.
      br_get_link_af_size() becomes unused and is thus removed.
      Signed-off-by: Ronen Arad <ronen.arad@intel.com>
      Acked-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. 16 Oct, 2015 1 commit
  10. 15 Oct, 2015 1 commit
  11. 13 Oct, 2015 3 commits
    • net: SO_INCOMING_CPU setsockopt() support · 70da268b
      Committed by Eric Dumazet
      SO_INCOMING_CPU, as added in commit 2c8c56e1, was a getsockopt() command
      to fetch the incoming cpu handling a particular TCP flow after accept().
      
      This commit adds setsockopt() support and extends the SO_REUSEPORT selection
      logic: if a TCP listener or UDP socket has this option set, a packet is
      delivered to this socket only if the CPU handling the packet matches the
      specified one.
      
      This allows building very efficient TCP servers, using one listener per
      RX queue, as the associated TCP listener should only accept flows handled
      in softirq by the same cpu.
      This provides optimal NUMA behavior and keeps cpu caches hot.
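
From user space, the option would be set and read back roughly as follows. This is a sketch that assumes a Linux system with kernel and header support for SO_INCOMING_CPU; the fallback value 49 is the asm-generic definition and may differ on some architectures. The `ex_` helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>

#ifndef SO_INCOMING_CPU
#define SO_INCOMING_CPU 49 /* asm-generic value; assumption for older headers */
#endif

/* Bind the socket's preference to one CPU.
 * Returns 0 on success, -1 with errno set on failure. */
int ex_set_incoming_cpu(int fd, int cpu)
{
	assert(cpu >= 0);
	return setsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, sizeof(cpu));
}

/* Read the current value into *cpu.
 * Returns 0 on success, -1 with errno set on failure. */
int ex_get_incoming_cpu(int fd, int *cpu)
{
	socklen_t len = sizeof(*cpu);

	assert(cpu != NULL);
	return getsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, cpu, &len);
}
```

A server following the one-listener-per-RX-queue idea would presumably open one SO_REUSEPORT listener per queue and call the setter on each with the CPU that services that queue.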
      
      Note that __inet_lookup_listener() still has to iterate over the list of
      all listeners. Following patch puts sk_refcnt in a different cache line
      to let this iteration hit only shared and read mostly cache lines.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sock: support per-packet fwmark · f28ea365
      Committed by Edward Jee
      It's useful to allow users to set fwmark for an individual packet,
      without changing the socket state. The function this patch adds in
      sock layer can be used by the protocols that need such a feature.
      Signed-off-by: Edward Hyunkoo Jee <edjee@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: enable non-root eBPF programs · 1be7f75d
      Committed by Alexei Starovoitov
      In order to let unprivileged users load and execute eBPF programs,
      teach the verifier to prevent pointer leaks.
      The verifier will prevent:
      - any arithmetic on pointers
        (except R10+Imm which is used to compute stack addresses)
      - comparison of pointers
        (except if (map_value_ptr == 0) ... )
      - passing pointers to helper functions
      - indirectly passing pointers in stack to helper functions
      - returning pointer from bpf program
      - storing pointers into ctx or maps
      
      Spill/fill of pointers into stack is allowed, but mangling
      of pointers stored in the stack or reading them byte by byte is not.
      
      Within bpf programs the pointers do exist, since programs need to
      be able to access maps, pass skb pointer to LD_ABS insns, etc
      but programs cannot pass such pointer values to the outside
      or obfuscate them.
      
      Only allow BPF_PROG_TYPE_SOCKET_FILTER unprivileged programs,
      so that socket filters (tcpdump), af_packet (quic acceleration)
      and future kcm can use it.
      tracing and tc cls/act program types still require root permissions,
      since tracing actually needs to be able to see all kernel pointers
      and tc is for root only.
      
      For example, the following unprivileged socket filter program is allowed:
      int bpf_prog1(struct __sk_buff *skb)
      {
        u32 index = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol));
        u64 *value = bpf_map_lookup_elem(&my_map, &index);
      
        if (value)
          *value += skb->len;  /* accumulates packet length per IP protocol */
        return 0;
      }
      
      but the following program is not:
      int bpf_prog1(struct __sk_buff *skb)
      {
        u32 index = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol));
        u64 *value = bpf_map_lookup_elem(&my_map, &index);
      
        if (value)
          *value += (u64) skb;  /* writes a kernel pointer value into the map */
        return 0;
      }
      since it would leak the kernel address into the map.
      
      Unprivileged socket filter bpf programs have access to the
      following helper functions:
      - map lookup/update/delete (but they cannot store kernel pointers into them)
      - get_random (it's already exposed to unprivileged user space)
      - get_smp_processor_id
      - tail_call into another socket filter program
      - ktime_get_ns
      
      The feature is controlled by sysctl kernel.unprivileged_bpf_disabled.
      This toggle defaults to off (0), but can be set to true (1).  Once true,
      bpf programs and maps cannot be accessed from an unprivileged process,
      and the toggle cannot be set back to false.
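
The one-way behaviour of the toggle can be modelled as a small latch. This is only an illustration of the semantics stated above, not the sysctl handler; the `ex_` names and the specific error codes are assumptions.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical model of a one-way toggle: 0 -> 1 is allowed, 1 -> 0 is not. */
struct ex_toggle {
	int value;
};

/* Returns 0 on success or a negative errno-style code on rejection. */
int ex_toggle_write(struct ex_toggle *t, int new_value)
{
	assert(t != NULL);
	if (new_value != 0 && new_value != 1)
		return -EINVAL; /* assumption: out-of-range writes are rejected */
	if (t->value == 1 && new_value == 0)
		return -EPERM;  /* assumption: the text gives no error code */
	t->value = new_value;
	return 0;
}
```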
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  12. 11 Oct, 2015 1 commit
    • bpf: fix cb access in socket filter programs · ff936a04
      Committed by Alexei Starovoitov
      eBPF socket filter programs may see junk in 'u32 cb[5]' area,
      since it could have been used by protocol layers earlier.
      
      For socket filter programs used in af_packet we need to clean
      20 bytes of skb->cb area if it could be used by the program.
      For programs attached to TCP/UDP sockets we need to save/restore
      these 20 bytes, since it's used by protocol layers.
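
The two strategies described above (clear only, versus save and restore) can be sketched in user space as follows. The struct, the function-pointer type, and every `ex_` name are hypothetical; the real kernel structures are different.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define EX_CB_WORDS 5 /* the 'u32 cb[5]' area: 20 bytes */

struct ex_skb {
	uint32_t cb[EX_CB_WORDS];
};

typedef int (*ex_filter_fn)(struct ex_skb *);

/* af_packet-like case: nothing to preserve, just hide stale contents. */
int ex_run_clear_cb(ex_filter_fn prog, struct ex_skb *skb)
{
	assert(prog != NULL && skb != NULL);
	memset(skb->cb, 0, sizeof(skb->cb));
	return prog(skb);
}

/* TCP/UDP-like case: protocol state lives in cb, so save and restore it. */
int ex_run_save_cb(ex_filter_fn prog, struct ex_skb *skb)
{
	uint32_t saved[EX_CB_WORDS];
	int ret;

	assert(prog != NULL && skb != NULL);
	memcpy(saved, skb->cb, sizeof(saved));
	memset(skb->cb, 0, sizeof(skb->cb));
	ret = prog(skb);
	memcpy(skb->cb, saved, sizeof(saved));
	return ret;
}

/* Example program: reports what it sees in cb[0], then scribbles on it. */
int ex_peek_and_scribble(struct ex_skb *skb)
{
	int seen = (int)skb->cb[0];

	skb->cb[0] = 0xdeadbeefu;
	return seen;
}
```

In both wrappers the program starts from zeroed scratch space, so it never observes leftover protocol data; only the second wrapper guarantees the protocol layer gets its bytes back.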
      
      Remove SK_RUN_FILTER macro, since it's no longer used.
      
      Long term we may move this bpf cb area to per-cpu scratch, but that
      requires addition of new 'per-cpu load/store' instructions,
      so not suitable as a short term fix.
      
      Fixes: d691f9e8 ("bpf: allow programs to write to certain skb fields")
      Reported-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  13. 09 Oct, 2015 2 commits
    • net/core: make sock_diag.c explicitly non-modular · b6191aee
      Committed by Paul Gortmaker
      The Makefile currently controlling compilation of this code lists
      it under "obj-y" ...meaning that it currently is not being built as
      a module by anyone.
      
      Let's remove the modular code that is essentially orphaned, so that
      when reading the driver there is no doubt it is builtin-only.
      
      Since module_init translates to device_initcall in the non-modular
      case, the init ordering remains unchanged with this commit.  We can
      change to one of the other priority initcalls (subsys?) at any later
      date, if desired.
      
      We can't remove module.h since the file uses other module related
      stuff even though it is not modular itself.
      
      We move the information from the MODULE_LICENSE tag to the top of the
      file, since that information is not captured anywhere else.  The
      MODULE_ALIAS_NET_PF_PROTO becomes a no-op in the non modular case, so
      it is removed.
      
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Alexei Starovoitov <ast@plumgrid.com>
      Cc: Craig Gallek <kraig@google.com>
      Cc: netdev@vger.kernel.org
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/core: lockdep_rtnl_is_held can be boolean · 0cbf3343
      Committed by Yaowei Bai
      This patch makes lockdep_rtnl_is_held return bool, because this
      function only ever returns one or zero.
      
      In another patch, lockdep_is_held is also made to return bool.
      
      No functional change.
      Signed-off-by: Yaowei Bai <bywxiaobai@163.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  14. 08 Oct, 2015 5 commits
  15. 07 Oct, 2015 1 commit
  16. 05 Oct, 2015 4 commits
    • bpf, seccomp: prepare for upcoming criu support · bab18991
      Committed by Daniel Borkmann
      The current ongoing effort to dump existing cBPF seccomp filters back
      to user space requires holding the pre-transformed instructions, like
      we do in case of socket filters from sk_attach_filter() side, so they
      can be reloaded in original form at a later point in time by utilities
      such as criu.
      
      To prepare for this, simply extend the bpf_prog_create_from_user()
      API to hold a flag that tells whether we should store the original
      or not. Also, fanout filters could make use of that in future for
      things like diag. While fanout filters already use bpf_prog_destroy(),
      move seccomp over to them as well to handle original programs when
      present.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Tycho Andersen <tycho.andersen@canonical.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Alexei Starovoitov <ast@plumgrid.com>
      Tested-by: Tycho Andersen <tycho.andersen@canonical.com>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: fix panic in SO_GET_FILTER with native ebpf programs · 93d08b69
      Committed by Daniel Borkmann
      When sockets have a native eBPF program attached through
      setsockopt(sk, SOL_SOCKET, SO_ATTACH_BPF, ...), and then try to
      dump these over getsockopt(sk, SOL_SOCKET, SO_GET_FILTER, ...),
      the following panic appears:
      
        [49904.178642] BUG: unable to handle kernel NULL pointer dereference at (null)
        [49904.178762] IP: [<ffffffff81610fd9>] sk_get_filter+0x39/0x90
        [49904.182000] PGD 86fc9067 PUD 531a1067 PMD 0
        [49904.185196] Oops: 0000 [#1] SMP
        [...]
        [49904.224677] Call Trace:
        [49904.226090]  [<ffffffff815e3d49>] sock_getsockopt+0x319/0x740
        [49904.227535]  [<ffffffff812f59e3>] ? sock_has_perm+0x63/0x70
        [49904.228953]  [<ffffffff815e2fc8>] ? release_sock+0x108/0x150
        [49904.230380]  [<ffffffff812f5a43>] ? selinux_socket_getsockopt+0x23/0x30
        [49904.231788]  [<ffffffff815dff36>] SyS_getsockopt+0xa6/0xc0
        [49904.233267]  [<ffffffff8171b9ae>] entry_SYSCALL_64_fastpath+0x12/0x71
      
      The underlying issue is the very same as in commit b382c086
      ("sock, diag: fix panic in sock_diag_put_filterinfo"), that is,
      native eBPF programs don't store an original program since this
      is only needed in cBPF ones.
      
      However, sk_get_filter() wasn't updated to test for this at the
      time when eBPF could be attached. Just throw an error to the user
      to indicate that eBPF cannot be dumped over this interface.
      That way, it can also be known that a program _is_ attached (as
      opposed to just return 0), and a different (future) method needs
      to be consulted for a dump.
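
The distinction drawn above (no program attached versus a program that cannot be dumped) can be sketched as below. This is a simplified user-space model, not sk_get_filter() itself; the struct, the `ex_` names, and the choice of -EACCES and -EINVAL as error codes are assumptions.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

struct ex_prog {
	const unsigned char *orig; /* original cBPF image; NULL for native eBPF */
	size_t orig_len;
};

/* Returns bytes copied, 0 if no program is attached, or a negative errno. */
int ex_get_filter(const struct ex_prog *attached, unsigned char *dst, size_t cap)
{
	assert(dst != NULL || cap == 0);
	if (attached == NULL)
		return 0;        /* nothing attached */
	if (attached->orig == NULL)
		return -EACCES;  /* attached, but no original image to dump */
	if (cap < attached->orig_len)
		return -EINVAL;  /* destination too small */
	memcpy(dst, attached->orig, attached->orig_len);
	return (int)attached->orig_len;
}
```

Checking for the missing original image before touching it is what avoids the NULL dereference, and the distinct error return lets a caller tell "attached but not dumpable" apart from "nothing attached".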
      
      Fixes: 89aa0758 ("net: sock: allow eBPF programs to be attached to sockets")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: restore fastopen operations · ac8cfc7b
      Committed by Eric Dumazet
      I accidentally cleared fastopenq.max_qlen in reqsk_queue_alloc()
      while max_qlen can be set before listen() is called,
      using TCP_FASTOPEN socket option for example.
      
      Fixes: 0536fcc0 ("tcp: prepare fastopen code for upcoming listener changes")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: use sk_fullsock() in __netdev_pick_tx() · 004a5d01
      Committed by Eric Dumazet
      SYN_RECV & TIMEWAIT sockets are not full blown, they do not have a
      sk_dst_cache pointer.
      
      Fixes: ca6fb065 ("tcp: attach SYNACK messages to request sockets instead of listener")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  17. 04 Oct, 2015 1 commit
    • tcp/dccp: add SLAB_DESTROY_BY_RCU flag for request sockets · e96f78ab
      Committed by Eric Dumazet
      Before letting request sockets be put in the regular TCP/DCCP
      ehash table, we need to either:
      
      - add the SLAB_DESTROY_BY_RCU flag to their kmem_cache, or
      - add an RCU grace period before freeing them.
      
      Since we carefully respected the SLAB_DESTROY_BY_RCU protocol,
      as with ESTABLISHED and TIMEWAIT sockets, use it here.
      
      req_prot_init() being only used by TCP and DCCP, I did not add
      a new slab_flags into their rsk_prot, but reuse prot->slab_flags.
      
      Since all reqsk_alloc() users are correctly dealing with a failure,
      add the __GFP_NOWARN flag to avoid traces under pressure.
      
      Fixes: 079096f1 ("tcp/dccp: install syn_recv requests into ehash table")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  18. 03 Oct, 2015 10 commits