1. 24 Jun, 2017 3 commits
  2. 21 Jun, 2017 5 commits
    • D
      net: introduce SO_PEERGROUPS getsockopt · 28b5ba2a
      David Herrmann committed
      This adds the new getsockopt(2) option SO_PEERGROUPS on SOL_SOCKET to
      retrieve the auxiliary groups of the remote peer. It is designed to
      naturally extend SO_PEERCRED. That is, the underlying data is from the
      same credentials. Regarding its syntax, it is based on SO_PEERSEC. That
      is, if the provided buffer is too small, ERANGE is returned and @optlen
      is updated. Otherwise, the information is copied, @optlen is set to the
      actual size, and 0 is returned.
      
      While SO_PEERCRED (and thus `struct ucred') already returns the primary
      group, it lacks the auxiliary group vector. However, nearly all access
      controls (including kernel side VFS and SYSVIPC, but also user-space
      polkit, DBus, ...) consider the entire set of groups, rather than just
      the primary group. But this is currently not possible with pure
      SO_PEERCRED. Instead, user-space has to work around this and query the
      system database for the auxiliary groups of a UID retrieved via
      SO_PEERCRED.
      
      Unfortunately, there is no race-free way to query the auxiliary groups
      of the PID/UID retrieved via SO_PEERCRED. Hence, the current user-space
      solution is to use getgrouplist(3p), which itself falls back to NSS and
      whatever is configured in nsswitch.conf(3). This effectively checks
      which groups we *would* assign to the user if it logged in *now*. On
      normal systems it is as easy as reading /etc/group, but with NSS it can
      resort to querying network databases (e.g., LDAP), using IPC or network
      communication.
      
      Long story short: Whenever we want to use auxiliary groups for access
      checks on IPC, we need further IPC to talk to the user/group databases,
      rather than just relying on SO_PEERCRED and the incoming socket. This
      is unfortunate, and might even result in dead-locks if the database
      query uses the same IPC as the original request.
      
      So far, those recursions / dead-locks have been avoided by using
      primitive IPC for all crucial NSS modules. However, we want to avoid
      re-inventing the wheel for each NSS module that might be involved in
      user/group queries. Hence, we would preferably make DBus (and other IPC
      that supports access-management based on groups) work without resorting
      to the user/group database. This new SO_PEERGROUPS socket option would
      allow us to make dbus-daemon work without ever calling into NSS.
      
      Cc: Michal Sekletar <msekleta@redhat.com>
      Cc: Simon McVittie <simon.mcvittie@collabora.co.uk>
      Reviewed-by: Tom Gundersen <teg@jklm.no>
      Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      28b5ba2a
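The ERANGE/@optlen protocol described in the SO_PEERGROUPS commit can be sketched from user space roughly as follows. This is a minimal sketch, not code from the patch: the helper name `peer_groups` is hypothetical, and the fallback value 59 for `SO_PEERGROUPS` is an assumption that matches the asm-generic headers (some architectures use a different number).

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/types.h>

#ifndef SO_PEERGROUPS
#define SO_PEERGROUPS 59 /* assumption: asm-generic value; some arches differ */
#endif

/*
 * Hypothetical helper: fetch the peer's auxiliary groups for a connected
 * AF_UNIX socket. Returns the number of groups and stores a malloc()ed
 * array in *out (may be NULL when the peer has no groups; caller frees).
 * Returns -1 with errno set on failure.
 */
static int peer_groups(int fd, gid_t **out)
{
    socklen_t len = 0;
    gid_t *buf = NULL;

    assert(out != NULL);
    for (;;) {
        socklen_t want = len;

        if (getsockopt(fd, SOL_SOCKET, SO_PEERGROUPS, buf, &want) == 0) {
            *out = buf;
            return (int)(want / sizeof(gid_t));
        }
        if (errno != ERANGE) {
            int saved = errno;

            free(buf);
            errno = saved;
            return -1;
        }
        /* Buffer too small: the kernel stored the required size in want. */
        gid_t *tmp = realloc(buf, want);
        if (tmp == NULL) {
            free(buf);
            errno = ENOMEM;
            return -1;
        }
        buf = tmp;
        len = want;
    }
}
```

Starting with an empty buffer and retrying on ERANGE avoids guessing a size up front. On kernels without this option the call is expected to fail, typically with ENOPROTOOPT.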
    • J
      rtnetlink: add restricted rtnl groups for ipv4 and ipv6 mroute · 5f729eaa
      Julien Gomes committed
      Add RTNLGRP_{IPV4,IPV6}_MROUTE_R as two new restricted groups for the
      NETLINK_ROUTE family.
      Binding to these groups specifically requires CAP_NET_ADMIN to allow
      multicast of sensitive messages (e.g. mroute cache reports).
      Suggested-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: Julien Gomes <julien@arista.com>
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5f729eaa
    • S
      rtnetlink: add IFLA_GROUP to ifla_policy · db833d40
      Serhey Popovych committed
      Network interface group support was added a while ago; however,
      there has been no IFLA_GROUP attribute description in the policy
      or in the netlink message size calculations until now.
      
      Add IFLA_GROUP attribute to the policy.
      
      Fixes: cbda10fa ("net_device: add support for network device groups")
      Signed-off-by: Serhey Popovych <serhe.popovych@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      db833d40
    • S
      net/core: remove explicit do_softirq() from busy_poll_stop() · fe420d87
      Sebastian Siewior committed
      Since commit 217f6974 ("net: busy-poll: allow preemption in
      sk_busy_loop()") there is an explicit do_softirq() invocation after
      local_bh_enable() has been invoked.
      I don't understand why we need this because local_bh_enable() will
      invoke do_softirq() once the softirq counter reached zero and we have
      softirq-related work pending.
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fe420d87
    • S
      fib_rules: Resolve goto rules target on delete · bdaf32c3
      Serhey Popovych committed
      We should avoid marking goto rules unresolved when their
      target is actually reachable after rule deletion.
      
      Consider the following sample scenario:
      
        # ip -4 ru sh
        0:      from all lookup local
        32000:  from all goto 32100
        32100:  from all lookup main
        32100:  from all lookup default
        32766:  from all lookup main
        32767:  from all lookup default
      
        # ip -4 ru del pref 32100 table main
        # ip -4 ru sh
        0:      from all lookup local
        32000:  from all goto 32100 [unresolved]
        32100:  from all lookup default
        32766:  from all lookup main
        32767:  from all lookup default
      
      After removal of the first rule with preference 32100, we
      mark all goto rules as unresolved, even when a rule with the
      same preference as the removed one is still present.
      
      Check if the next rule with the same preference is available
      and make all rules with a goto action point to it.
      Signed-off-by: Serhey Popovych <serhe.popovych@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bdaf32c3
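The re-resolution step in the fib_rules commit can be illustrated with a small user-space model. This is only a sketch under assumptions: `struct rule`, `rule_delete` and the field names are hypothetical and do not mirror the kernel's data structures; the one property taken from the commit message is that a goto rule is retargeted to the next rule with the same preference when one exists.

```c
#include <assert.h>
#include <stddef.h>

enum action { ACT_LOOKUP, ACT_GOTO };

/* Hypothetical model of a rule list kept sorted by ascending preference. */
struct rule {
    unsigned int pref;
    enum action action;
    unsigned int target;   /* goto only: preference to jump to */
    struct rule *ctarget;  /* goto only: resolved target, NULL if unresolved */
    struct rule *next;
};

/* Unlink 'victim' and re-resolve the goto rules that pointed at it. */
static void rule_delete(struct rule **head, struct rule *victim)
{
    struct rule **pp = head;

    while (*pp && *pp != victim)
        pp = &(*pp)->next;
    assert(*pp == victim);
    *pp = victim->next;

    /* Sorted list: a surviving rule with the same pref follows directly. */
    struct rule *repl = NULL;
    if (victim->next && victim->next->pref == victim->pref)
        repl = victim->next;

    for (struct rule *r = *head; r; r = r->next)
        if (r->action == ACT_GOTO && r->ctarget == victim)
            r->ctarget = repl; /* NULL means "[unresolved]" */
}
```

With the rule set from the scenario above, deleting the first rule at preference 32100 leaves the goto rule pointing at the second one; only deleting that one as well makes it unresolved.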
  3. 18 Jun, 2017 6 commits
    • W
      net: remove DST_NOCACHE flag · a4c2fd7f
      Wei Wang committed
      DST_NOCACHE flag check has been removed from dst_release() and
      dst_hold_safe() in a previous patch because all the dst are now ref
      counted properly and can be released based on refcnt only.
      Looking at the rest of the DST_NOCACHE use, all of them can now be
      removed or replaced with other checks.
      So this patch gets rid of all the DST_NOCACHE usage and removes this flag
      completely.
      Signed-off-by: Wei Wang <weiwan@google.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a4c2fd7f
    • W
      net: remove DST_NOGC flag · b2a9c0ed
      Wei Wang committed
      Now that all the components have been changed to release dst based on
      refcnt only and not depend on dst gc anymore, we can remove the
      temporary flag DST_NOGC.
      
      Note that we also need to remove the DST_NOCACHE check in dst_release()
      and dst_hold_safe() because now all the dst are released based on refcnt
      and behave as DST_NOCACHE.
      Signed-off-by: Wei Wang <weiwan@google.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b2a9c0ed
    • W
      net: remove dst gc related code · 5b7c9a8f
      Wei Wang committed
      This patch removes all dst gc related code and all the dst free
      functions.
      Signed-off-by: Wei Wang <weiwan@google.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5b7c9a8f
    • W
      xfrm: take refcnt of dst when creating struct xfrm_dst bundle · 52df157f
      Wei Wang committed
      During the creation of xfrm_dst bundle, always take ref count when
      allocating the dst. This way, xfrm_bundle_create() will form a linked
      list of dst with dst->child pointing to a ref counted dst child. And
      the returned dst pointer is also ref counted. This makes the link from
      the flow cache to this dst now ref counted properly.
      As the dst is always ref counted properly, we can safely mark
      DST_NOGC flag so dst_release() will release dst based on refcnt only.
      And dst gc is no longer needed and all dst_free() and its related
      function calls should be replaced with dst_release() or
      dst_release_immediate().
      
      The special handling logic for dst->child in dst_destroy() can be
      replaced with a simple dst_release_immediate() call on the child to
      release the whole list linked by dst->child pointer.
      Previously used DST_NOHASH flag is not needed anymore as well. The
      reason that DST_NOHASH is used in the existing code is mainly to prevent
      the dst inserted in the fib tree to be wrongly destroyed during the
      deletion of the xfrm_dst bundle. So in the existing code, DST_NOHASH
      flag is marked in all the dst children except the one which is in the
      fib tree.
      However, with this patch series to remove dst gc logic and release dst
      only based on ref count, it is safe to release all the children from a
      xfrm_dst bundle as long as the dst children are all ref counted
      properly which is already the case in the existing code.
      So, this patch removes the use of DST_NOHASH flag.
      Signed-off-by: Wei Wang <weiwan@google.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      52df157f
    • W
      net: introduce a new function dst_dev_put() · 4a6ce2b6
      Wei Wang committed
      This function should be called when removing routes from fib tree after
      the dst gc is no longer in use.
      We first mark DST_OBSOLETE_DEAD on this dst to make sure next
      dst_ops->check() fails and returns NULL.
      Secondly, as we no longer keep the gc_list, we need to properly
      release dst->dev right at the moment when the dst is removed from
      the fib/fib6 tree.
      It does the following:
      1. change dst->input and output pointers to dst_discard/dst_discard_out to
         discard all packets
      2. replace dst->dev with loopback interface
      Signed-off-by: Wei Wang <weiwan@google.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4a6ce2b6
    • W
      net: introduce DST_NOGC in dst_release() to destroy dst based on refcnt · 5f56f409
      Wei Wang committed
      The current mechanism of freeing dst is a bit complicated. dst has its
      ref count and when user grabs the reference to the dst, the ref count is
      properly taken in most cases except in IPv4/IPv6/decnet/xfrm routing
      code due to some historic reasons.
      
      If the reference to dst is always taken properly, we should be able to
      simplify the logic in dst_release() to destroy dst when dst->__refcnt
      drops from 1 to 0. And this should be the only condition to determine
      if we can call dst_destroy().
      And as dst is always ref counted, there is no need for a dst garbage
      list to hold the dst entries that already get removed by the routing
      code but are still held by other users. And the task to periodically
      check the list to free dst if ref count becomes 0 is also not needed
      anymore.
      
      This patch introduces a temporary flag DST_NOGC(no garbage collector).
      If it is set in the dst, dst_release() will call dst_destroy() when
      dst->__refcnt drops to 0. dst_hold_safe() will also check for this flag
      and do atomic_inc_not_zero(), similarly to DST_NOCACHE, to avoid a double free
      issue.
      This temporary flag is mainly used so that we can make the transition
      component by component without breaking other parts.
      This flag will be removed after all components are properly transitioned.
      
      This patch also introduces a new function dst_release_immediate() which
      destroys dst without waiting on the rcu when refcnt drops to 0. It will
      be used in later patches.
      
      Follow-up patches will correct all the places to properly take ref count
      on dst and mark DST_NOGC. dst_release() or dst_release_immediate() will
      be used to release the dst instead of dst_free() and its related
      functions.
      A final clean-up patch will then remove the DST_NOGC flag.
      Signed-off-by: Wei Wang <weiwan@google.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5f56f409
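The refcount-only release rule from the DST_NOGC commit (destroy when the count drops from 1 to 0, and take new references with an increment-if-not-zero) can be modelled in user space with C11 atomics. This is a sketch under assumptions: `struct toy_dst` and both functions are hypothetical stand-ins, and RCU deferral is deliberately left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a dst entry. */
struct toy_dst {
    atomic_int refcnt;
    bool destroyed; /* stands in for dst_destroy() having run */
};

/* Drop one reference; destroy when the count goes from 1 to 0. */
static void toy_dst_release(struct toy_dst *dst)
{
    int old = atomic_fetch_sub(&dst->refcnt, 1);

    assert(old > 0); /* releasing an unreferenced entry is a bug */
    if (old == 1)
        dst->destroyed = true;
}

/* Take a reference only if the entry is still alive (inc-not-zero). */
static bool toy_dst_hold_safe(struct toy_dst *dst)
{
    int cur = atomic_load(&dst->refcnt);

    while (cur != 0) {
        /* On failure, cur is reloaded with the current value. */
        if (atomic_compare_exchange_weak(&dst->refcnt, &cur, cur + 1))
            return true;
    }
    return false; /* count already hit zero: must not resurrect */
}
```

The increment-if-not-zero loop is what prevents a double free in this model: once the last reference is gone, a late lookup cannot bring the entry back to life.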
  4. 16 Jun, 2017 5 commits
    • M
      net: Add IFLA_XDP_PROG_ID · 58038695
      Martin KaFai Lau committed
      Expose prog_id through IFLA_XDP_PROG_ID.  This patch
      makes modification to generic_xdp.  The later patches will
      modify other xdp-supported drivers.
      
      prog_id is added to struct net_dev_xdp.
      
      An iproute2 patch will follow. Here is what the 'ip link'
      output will look like:
      > ip link show eth0
      3: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdp(prog_id:1) qdisc fq_codel state UP mode DEFAULT group default qlen 1000
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@fb.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      58038695
    • J
      networking: make skb_push & __skb_push return void pointers · d58ff351
      Johannes Berg committed
      It seems like a historic accident that these return unsigned char *,
      and in many places that means casts are required, more often than not.
      
      Make these functions return void * and remove all the casts across
      the tree, adding a (u8 *) cast only where the unsigned char pointer
      was used directly, all done with the following spatch:
      
          @@
          expression SKB, LEN;
          typedef u8;
          identifier fn = { skb_push, __skb_push, skb_push_rcsum };
          @@
          - *(fn(SKB, LEN))
          + *(u8 *)fn(SKB, LEN)
      
          @@
          expression E, SKB, LEN;
          identifier fn = { skb_push, __skb_push, skb_push_rcsum };
          type T;
          @@
          - E = ((T *)(fn(SKB, LEN)))
          + E = fn(SKB, LEN)
      
          @@
          expression SKB, LEN;
          identifier fn = { skb_push, __skb_push, skb_push_rcsum };
          @@
          - fn(SKB, LEN)[0]
          + *(u8 *)fn(SKB, LEN)
      
      Note that the last part there converts from push(...)[0] to the
      more idiomatic *(u8 *)push(...).
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d58ff351
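The effect of the `void *` return type in the skb_push commit can be seen in a toy user-space buffer. This is a sketch only: `struct toy_skb`, `toy_skb_push` and `struct toy_hdr` are hypothetical and merely imitate the headroom behaviour; the point is that a typed header pointer needs no cast, while direct byte access uses the `*(u8 *)` form from the spatch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint8_t u8;

/* Hypothetical buffer with headroom in front of the payload. */
struct toy_skb {
    unsigned char buf[64];
    unsigned char *data; /* start of the used area */
    size_t len;          /* bytes in use from data onwards */
};

struct toy_hdr {
    uint16_t type;
    uint16_t flags;
};

/* Prepend len bytes and return a pointer to the new start of data. */
static void *toy_skb_push(struct toy_skb *skb, size_t len)
{
    assert((size_t)(skb->data - skb->buf) >= len); /* enough headroom */
    skb->data -= len;
    skb->len += len;
    return skb->data;
}

/* Push a header and a one-byte tag in front of it, without struct casts. */
static void toy_add_headers(struct toy_skb *skb, uint16_t type, u8 tag)
{
    struct toy_hdr *h = toy_skb_push(skb, sizeof(*h)); /* no cast needed */

    h->type = type;
    h->flags = 0;
    *(u8 *)toy_skb_push(skb, 1) = tag; /* direct byte use keeps a u8 cast */
}
```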
    • J
      networking: make skb_pull & friends return void pointers · af72868b
      Johannes Berg committed
      It seems like a historic accident that these return unsigned char *,
      and in many places that means casts are required, more often than not.
      
      Make these functions return void * and remove all the casts across
      the tree, adding a (u8 *) cast only where the unsigned char pointer
      was used directly, all done with the following spatch:
      
          @@
          expression SKB, LEN;
          typedef u8;
          identifier fn = {
                  skb_pull,
                  __skb_pull,
                  skb_pull_inline,
                  __pskb_pull_tail,
                  __pskb_pull,
                  pskb_pull
          };
          @@
          - *(fn(SKB, LEN))
          + *(u8 *)fn(SKB, LEN)
      
          @@
          expression E, SKB, LEN;
          identifier fn = {
                  skb_pull,
                  __skb_pull,
                  skb_pull_inline,
                  __pskb_pull_tail,
                  __pskb_pull,
                  pskb_pull
          };
          type T;
          @@
          - E = ((T *)(fn(SKB, LEN)))
          + E = fn(SKB, LEN)
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      af72868b
    • J
      networking: make skb_put & friends return void pointers · 4df864c1
      Johannes Berg committed
      It seems like a historic accident that these return unsigned char *,
      and in many places that means casts are required, more often than not.
      
      Make these functions (skb_put, __skb_put and pskb_put) return void *
      and remove all the casts across the tree, adding a (u8 *) cast only
      where the unsigned char pointer was used directly, all done with the
      following spatch:
      
          @@
          expression SKB, LEN;
          typedef u8;
          identifier fn = { skb_put, __skb_put };
          @@
          - *(fn(SKB, LEN))
          + *(u8 *)fn(SKB, LEN)
      
          @@
          expression E, SKB, LEN;
          identifier fn = { skb_put, __skb_put };
          type T;
          @@
          - E = ((T *)(fn(SKB, LEN)))
          + E = fn(SKB, LEN)
      
      which actually doesn't cover pskb_put since there are only three
      users overall.
      
      A handful of stragglers were converted manually, notably a macro in
      drivers/isdn/i4l/isdn_bsdcomp.c and, oddly enough, one of the many
      instances in net/bluetooth/hci_sock.c. In the former file, I also
      had to fix one whitespace problem spatch introduced.
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4df864c1
    • J
      networking: convert many more places to skb_put_zero() · b080db58
      Johannes Berg committed
      There were many places that my previous spatch didn't find,
      as pointed out by yuan linyu in various patches.
      
      The following spatch found many more and also removes the
      now unnecessary casts:
      
          @@
          identifier p, p2;
          expression len;
          expression skb;
          type t, t2;
          @@
          (
          -p = skb_put(skb, len);
          +p = skb_put_zero(skb, len);
          |
          -p = (t)skb_put(skb, len);
          +p = skb_put_zero(skb, len);
          )
          ... when != p
          (
          p2 = (t2)p;
          -memset(p2, 0, len);
          |
          -memset(p, 0, len);
          )
      
          @@
          type t, t2;
          identifier p, p2;
          expression skb;
          @@
          t *p;
          ...
          (
          -p = skb_put(skb, sizeof(t));
          +p = skb_put_zero(skb, sizeof(t));
          |
          -p = (t *)skb_put(skb, sizeof(t));
          +p = skb_put_zero(skb, sizeof(t));
          )
          ... when != p
          (
          p2 = (t2)p;
          -memset(p2, 0, sizeof(*p));
          |
          -memset(p, 0, sizeof(*p));
          )
      
          @@
          expression skb, len;
          @@
          -memset(skb_put(skb, len), 0, len);
          +skb_put_zero(skb, len);
      
      Apply it to the tree (with one manual fixup to keep the
      comment in vxlan.c, which spatch removed.)
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b080db58
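The pattern that the skb_put_zero() spatch collapses (extend the buffer, then memset the new area) looks roughly like this in a toy user-space model. The names `toy_skb`, `toy_skb_put` and `toy_skb_put_zero` are hypothetical; only the put-then-zero relationship is taken from the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical buffer that grows at the tail. */
struct toy_skb {
    unsigned char data[64];
    size_t len; /* bytes currently in use */
};

/* Extend the used area by len bytes; return a pointer to the new space. */
static void *toy_skb_put(struct toy_skb *skb, size_t len)
{
    assert(len <= sizeof(skb->data) - skb->len); /* enough tailroom */
    void *tail = skb->data + skb->len;

    skb->len += len;
    return tail;
}

/* Same as toy_skb_put(), but the new space is zero-filled. */
static void *toy_skb_put_zero(struct toy_skb *skb, size_t len)
{
    void *tail = toy_skb_put(skb, len);

    memset(tail, 0, len);
    return tail;
}
```

A caller that used to write `p = toy_skb_put(skb, n); memset(p, 0, n);` becomes the single call `p = toy_skb_put_zero(skb, n);`, which is the rewrite the semantic patch performs on the real helpers.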
  5. 15 Jun, 2017 2 commits
    • P
      net: use skb_unref() in napi_consume_skb() · 7608894e
      Paolo Abeni committed
      The commit 83ada39bb79d ("net: factor out a helper to decrement the
      skb refcount") provided and used a helper for decrementing skb usage,
      but I missed at least a spot for it.
      
      This change removes some more duplicated code by reusing skb_unref() in
      napi_consume_skb(), too. The helper uses an additional, unneeded
      unlikely(!skb) test - napi_consume_skb() already checks it a few lines
      above - but the compiler is smart enough to optimize the duplicated
      test out.
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7608894e
    • Y
      bpf: permits narrower load from bpf program context fields · 31fd8581
      Yonghong Song committed
      Currently, the verifier will reject a program if it contains a
      narrower load from the bpf context structure. For example,
              __u8 h = __sk_buff->hash, or
              __u16 p = __sk_buff->protocol
              __u32 sample_period = bpf_perf_event_data->sample_period
      which are narrower loads of 4-byte or 8-byte fields.
      
      This patch solves the issue by:
        . Introduce a new parameter ctx_field_size to carry the
          field size of narrower load from prog type
          specific *__is_valid_access validator back to verifier.
        . The non-zero ctx_field_size for a memory access indicates
          (1). underlying prog type specific convert_ctx_accesses
               supporting non-whole-field access
          (2). the current insn is a narrower or whole field access.
        . In verifier, for such loads where load memory size is
          less than ctx_field_size, verifier transforms it
          to a full field load followed by proper masking.
        . Currently, __sk_buff and bpf_perf_event_data->sample_period
          are supporting narrowing loads.
        . Narrower stores are still not allowed as typical ctx stores
          are just normal stores.
      
      Because of this change, some tests in verifier will fail and
      these tests are removed. As a bonus, rename some out of bound
      __sk_buff->cb access to proper field name and remove two
      redundant "skb cb oob" tests.
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      31fd8581
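The "full field load followed by proper masking" transformation from the bpf commit can be illustrated numerically. This is a sketch under assumptions: `narrow_load` is a hypothetical helper, not verifier code, and it assumes the narrow access starts at the low-order byte of the field.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: emulate a size-byte load from a wider context
 * field by taking the full field value and masking it down to the
 * requested width.
 */
static uint64_t narrow_load(uint64_t full_field, unsigned int size)
{
    assert(size == 1 || size == 2 || size == 4 || size == 8);
    if (size == 8)
        return full_field; /* whole-field access: nothing to mask */
    return full_field & ((UINT64_C(1) << (size * 8)) - 1);
}
```

For a 4-byte field holding 0x12345678, a 1-byte load yields 0x78 and a 2-byte load yields 0x5678 in this model.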
  6. 14 Jun, 2017 1 commit
  7. 13 Jun, 2017 1 commit
  8. 12 Jun, 2017 2 commits
  9. 11 Jun, 2017 4 commits
  10. 10 Jun, 2017 2 commits
    • A
      net: rps: send out pending IPI's on CPU hotplug · 773fc8f6
      ashwanth@codeaurora.org committed
      IPI's from the victim cpu are not handled in dev_cpu_callback.
      So these pending IPI's would be sent to the remote cpu only when
      NET_RX is scheduled on the victim cpu and since this trigger is
      unpredictable it would result in packet latencies on the remote cpu.
      
      This patch adds support for sending the pending IPIs of the victim cpu.
      Signed-off-by: Ashwanth Goli <ashwanth@codeaurora.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      773fc8f6
    • K
      Fix an intermittent pr_emerg warning about lo becoming free. · f186ce61
      Krister Johansen committed
      It looks like this:
      
      Message from syslogd@flamingo at Apr 26 00:45:00 ...
       kernel:unregister_netdevice: waiting for lo to become free. Usage count = 4
      
      They seem to coincide with net namespace teardown.
      
      The message is emitted by netdev_wait_allrefs().
      
      Forced a kdump in netdev_run_todo, but found that the refcount on the lo
      device was already 0 at the time we got to the panic.
      
      Used bcc to check the blocking in netdev_run_todo.  The only places
      where we're off cpu there are in the rcu_barrier() and msleep() calls.
      That behavior is expected.  The msleep time coincides with the amount of
      time we spend waiting for the refcount to reach zero; the rcu_barrier()
      wait times are not excessive.
      
      After looking through the list of callbacks that the netdevice notifiers
      invoke in this path, it appears that the dst_dev_event is the most
      interesting.  The dst_ifdown path places a hold on the loopback_dev as
      part of releasing the dev associated with the original dst cache entry.
      Most of our notifier callbacks are straight-forward, but this one a)
      looks complex, and b) places a hold on the network interface in
      question.
      
      I constructed a new bcc script that watches various events in the
      lifetime of a dst cache entry.  Note that dst_ifdown will take a hold on
      the loopback device until the invalidated dst entry gets freed.
      
      [      __dst_free] on DST: ffff883ccabb7900 IF tap1008300eth0 invoked at 1282115677036183
          __dst_free
          rcu_nocb_kthread
          kthread
          ret_from_fork
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f186ce61
  11. 09 Jun, 2017 1 commit
  12. 08 Jun, 2017 5 commits
    • E
      tcp: add TCPMemoryPressuresChrono counter · 06044751
      Eric Dumazet committed
      DRAM supply shortage and poor memory pressure tracking in the TCP
      stack make any change in SO_SNDBUF/SO_RCVBUF (or equivalent autotuning
      limits) and tcp_mem[] quite hazardous.
      
      TCPMemoryPressures SNMP counter is an indication of tcp_mem sysctl
      limits being hit, but only tracking number of transitions.
      
      If TCP stack behavior under stress was perfect :
      1) It would maintain memory usage close to the limit.
      2) Memory pressure state would be entered for short times.
      
      We certainly prefer 100 events lasting 10ms compared to one event
      lasting 200 seconds.
      
      This patch adds a new SNMP counter tracking cumulative duration of
      memory pressure events, given in ms units.
      
      $ cat /proc/sys/net/ipv4/tcp_mem
      3088    4117    6176
      $ grep TCP /proc/net/sockstat
      TCP: inuse 180 orphan 0 tw 2 alloc 234 mem 4140
      $ nstat -n ; sleep 10 ; nstat |grep Pressure
      TcpExtTCPMemoryPressures        1700
      TcpExtTCPMemoryPressuresChrono  5209
      
      v2: Used EXPORT_SYMBOL_GPL() instead of EXPORT_SYMBOL() as David
      instructed.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      06044751
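The difference between counting transitions and accumulating duration, which motivates TCPMemoryPressuresChrono, can be shown with a small model. This is a sketch only: `struct pressure_stats` and its helpers are hypothetical and use caller-supplied millisecond timestamps rather than any kernel clock.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the two counters. */
struct pressure_stats {
    uint64_t transitions;   /* like TCPMemoryPressures: entries into pressure */
    uint64_t chrono_ms;     /* like TCPMemoryPressuresChrono: total ms spent */
    uint64_t entered_at_ms; /* 0 while not under pressure */
};

static void pressure_enter(struct pressure_stats *s, uint64_t now_ms)
{
    assert(now_ms != 0); /* 0 is reserved for "not under pressure" */
    if (s->entered_at_ms)
        return; /* already under pressure: not a new transition */
    s->entered_at_ms = now_ms;
    s->transitions++;
}

static void pressure_leave(struct pressure_stats *s, uint64_t now_ms)
{
    if (!s->entered_at_ms)
        return;
    s->chrono_ms += now_ms - s->entered_at_ms;
    s->entered_at_ms = 0;
}
```

In this model, 100 events of 10 ms each give 100 transitions but only 1000 ms of accumulated pressure, whereas one 200-second event gives a single transition and 200000 ms. The duration counter is what separates the two cases.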
    • M
      net: Zero ifla_vf_info in rtnl_fill_vfinfo() · 0eed9cf5
      Mintz, Yuval committed
      Some of the structure's fields are not initialized by
      rtnetlink. If the driver doesn't set them in ndo_get_vf_config(),
      they would leak memory to user space.
      Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
      CC: Michal Schmidt <mschmidt@redhat.com>
      Reviewed-by: Greg Rose <gvrose8192@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0eed9cf5
    • E
      tcp: Namespaceify sysctl_tcp_timestamps · 5d2ed052
      Eric Dumazet committed
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5d2ed052
    • D
      net: Fix inconsistent teardown and release of private netdev state. · cf124db5
      David S. Miller committed
      Network devices can allocate resources and private memory using
      netdev_ops->ndo_init().  However, the release of these resources
      can occur in one of two different places.
      
      Either netdev_ops->ndo_uninit() or netdev->destructor().
      
      The decision of which operation frees the resources depends upon
      whether it is necessary for all netdev refs to be released before it
      is safe to perform the freeing.
      
      netdev_ops->ndo_uninit() presumably can occur right after the
      NETDEV_UNREGISTER notifier completes and the unicast and multicast
      address lists are flushed.
      
      netdev->destructor(), on the other hand, does not run until the
      netdev references all go away.
      
      Further complicating the situation is that netdev->destructor()
      almost universally also does a free_netdev().
      
      This creates a problem for the logic in register_netdevice(),
      because all callers of register_netdevice() manage the freeing
      of the netdev and invoke free_netdev(dev) if register_netdevice()
      fails.
      
      If netdev_ops->ndo_init() succeeds, but something else fails inside
      of register_netdevice(), it does call ndo_ops->ndo_uninit().  But
      it is not able to invoke netdev->destructor().
      
      This is because netdev->destructor() will do a free_netdev() and
      then the caller of register_netdevice() will do the same.
      
      However, this means that the resources that would normally be released
      by netdev->destructor() will not be.
      
      Over the years drivers have added local hacks to deal with this, by
      invoking their destructor parts by hand when register_netdevice()
      fails.
      
      Many drivers do not try to deal with this, and instead we have leaks.
      
      Let's close this hole by formalizing the distinction between what
      private things need to be freed up by netdev->destructor() and whether
      the driver needs unregister_netdevice() to perform the free_netdev().
      
      netdev->priv_destructor() performs all actions to free up the private
      resources that used to be freed by netdev->destructor(), except for
      free_netdev().
      
      netdev->needs_free_netdev is a boolean that indicates whether
      free_netdev() should be done at the end of unregister_netdevice().
      
      Now, register_netdevice() can sanely release all resources after
      ndo_ops->ndo_init() succeeds, by invoking both ndo_ops->ndo_uninit()
      and netdev->priv_destructor().
      
      And at the end of unregister_netdevice(), we invoke
      netdev->priv_destructor() and optionally call free_netdev().
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cf124db5
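The split between priv_destructor() and needs_free_netdev described in this commit can be modelled in user space as follows. This is a sketch under assumptions: `struct toy_netdev`, `toy_register`, `toy_unregister` and the `fail_late` knob are hypothetical, and the model ignores reference counting and notifiers entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical model of a device with private state. */
struct toy_netdev {
    void *priv;                                   /* allocated by ndo_init */
    int  (*ndo_init)(struct toy_netdev *);
    void (*ndo_uninit)(struct toy_netdev *);
    void (*priv_destructor)(struct toy_netdev *); /* frees private state only */
    bool needs_free_netdev;                       /* unregister frees dev too */
    bool fail_late;                               /* knob: fail after ndo_init */
};

/* On failure, undo ndo_init fully but leave freeing dev to the caller. */
static int toy_register(struct toy_netdev *dev)
{
    assert(dev != NULL);
    int err = dev->ndo_init ? dev->ndo_init(dev) : 0;

    if (err)
        return err;
    if (dev->fail_late) {
        if (dev->ndo_uninit)
            dev->ndo_uninit(dev);
        if (dev->priv_destructor)
            dev->priv_destructor(dev);
        return -1;
    }
    return 0;
}

/* Tear down private state, then free dev only if the driver asked for it. */
static void toy_unregister(struct toy_netdev *dev)
{
    if (dev->ndo_uninit)
        dev->ndo_uninit(dev);
    if (dev->priv_destructor)
        dev->priv_destructor(dev);
    if (dev->needs_free_netdev)
        free(dev);
}

static int toy_init(struct toy_netdev *dev)
{
    dev->priv = malloc(16);
    return dev->priv ? 0 : -1;
}

static void toy_priv_free(struct toy_netdev *dev)
{
    free(dev->priv);
    dev->priv = NULL;
}
```

Because the private destructor no longer frees the device itself, the failure path in `toy_register` can release everything `toy_init` allocated, and the caller's own `free(dev)` is neither a double free nor a leak.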
    • A
      net: don't call strlen on non-terminated string in dev_set_alias() · c28294b9
      Alexander Potapenko committed
      KMSAN reported a use of uninitialized memory in dev_set_alias(),
      which was caused by calling strlcpy() (which in turn called strlen())
      on the user-supplied non-terminated string.
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c28294b9
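The hazard in the dev_set_alias() commit is that a string routine which calls strlen() reads until it finds a NUL byte, which a length-delimited user buffer may not contain. A bounded copy avoids this; the sketch below is a user-space illustration with a hypothetical helper name, not the kernel function.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy exactly len bytes from a buffer that may
 * lack a NUL terminator, then terminate the destination explicitly.
 * Returns len, or -1 if the result would not fit in dst.
 */
static long copy_alias(char *dst, size_t dst_size, const char *src, size_t len)
{
    assert(dst != NULL && src != NULL);
    if (len >= dst_size)
        return -1;
    memcpy(dst, src, len); /* never reads past src[len - 1] */
    dst[len] = '\0';
    return (long)len;
}
```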
  13. 07 Jun, 2017 1 commit
  14. 05 Jun, 2017 2 commits
    • H
      devlink: fix potential memory leak · 6044bd4a
      Haishuang Yan committed
      We must free the allocated skb when genlmsg_put() fails.
      
      Fixes: 1555d204 ("devlink: Support for pipeline debug (dpipe)")
      Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6044bd4a
    • J
      skbuff: return -EMSGSIZE in skb_to_sgvec to prevent overflow · 48a1df65
      Jason A. Donenfeld committed
      This is a defense-in-depth measure in response to bugs like
      4d6fa57b ("macsec: avoid heap overflow in skb_to_sgvec"). There's
      not only a potential overflow of sglist items, but also a stack overflow
      potential, so we fix this by limiting the amount of recursion this function
      is allowed to do. Not actually providing a bounded base case is a future
      disaster that we can easily avoid here.
      
      As a small matter of house keeping, we take this opportunity to move the
      documentation comment over the actual function the documentation is for.
      
      While this could be implemented by using an explicit stack of skbuffs,
      when implementing this, the function complexity increased considerably,
      and I don't think such complexity and bloat is actually worth it. So,
      instead I built this and tested it on x86, x86_64, ARM, ARM64, and MIPS,
      and measured the stack usage there. I also reverted the recent MIPS
      changes that give it a separate IRQ stack, so that I could experience
      some worst-case situations. I found that limiting it to 24 layers deep
      yielded a good stack usage with room for safety, as well as being much
      deeper than any driver actually ever creates.
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Sabrina Dubroca <sd@queasysnail.net>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      48a1df65
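The two guards described in the skb_to_sgvec commit (do not overflow the scatterlist, do not recurse without bound) can be sketched on a toy structure. The names `toy_skb` and `toy_to_sgvec` are hypothetical; the depth limit of 24 and the -EMSGSIZE return value are the ones named in the commit.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MAX_NEST 24 /* the depth limit named in the commit message */

/* Hypothetical nested buffer. */
struct toy_skb {
    int nr_frags;              /* scatterlist entries this buffer adds */
    struct toy_skb *frag_list; /* nested buffer, or NULL */
};

/*
 * Count the scatterlist entries needed, refusing to exceed 'room'
 * entries or to recurse MAX_NEST levels deep. Returns the entry count
 * or -EMSGSIZE.
 */
static int toy_to_sgvec(const struct toy_skb *skb, int room, unsigned int level)
{
    assert(skb != NULL);
    if (level >= MAX_NEST)
        return -EMSGSIZE; /* bounded base case for the recursion */
    if (skb->nr_frags > room)
        return -EMSGSIZE; /* would overflow the scatterlist */

    int used = skb->nr_frags;

    if (skb->frag_list) {
        int ret = toy_to_sgvec(skb->frag_list, room - used, level + 1);

        if (ret < 0)
            return ret;
        used += ret;
    }
    return used;
}
```

Returning an error instead of writing past the list means that callers have to check the return value, which is the behavioural change that the commit title refers to.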