1. 27 1月, 2015 1 次提交
  2. 18 1月, 2015 1 次提交
    • J
      netlink: make nlmsg_end() and genlmsg_end() void · 053c095a
      Johannes Berg 提交于
      Contrary to common expectations for an "int" return, these functions
      return only a positive value -- if used correctly they cannot even
      return 0 because the message header will necessarily be in the skb.
      
      This makes the very common pattern of
      
        if (genlmsg_end(...) < 0) { ... }
      
      be a whole bunch of dead code. Many places also simply do
      
        return nlmsg_end(...);
      
      and the caller is expected to deal with it.
      
      This also commonly (at least for me) causes errors, because it is very
      common to write
      
        if (my_function(...))
          /* error condition */
      
      and if my_function() does "return nlmsg_end()" this is of course wrong.
      
      Additionally, there's not a single place in the kernel that actually
      needs the message length returned, and if anyone needs it later then
      it'll be very easy to just use skb->len there.
      
      Remove this, and make the functions void. This removes a bunch of dead
      code as described above. The patch adds lines because I did
      
      -	return nlmsg_end(...);
      +	nlmsg_end(...);
      +	return 0;
      
      I could have preserved all the function's return values by returning
      skb->len, but instead I've audited all the places calling the affected
      functions and found that none cared. A few places actually compared
      the return value with <= 0 in dump functionality, but that could just
      be changed to < 0 with no change in behaviour, so I opted for the more
      efficient version.
      
      One instance of the error I've made numerous times now is also present
      in net/phonet/pn_netlink.c in the route_dumpit() function - it didn't
      check for <0 or <=0 and thus broke out of the loop every single time.
      I've preserved this since it will (I think) have caused the messages to
      userspace to be formatted differently with just a single message for
      every SKB returned to userspace. It's possible that this isn't needed
      for the tools that actually use this, but I don't even know what they
      are so couldn't test that changing this behaviour would be acceptable.
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      053c095a
  3. 16 1月, 2015 1 次提交
  4. 14 1月, 2015 1 次提交
  5. 04 1月, 2015 4 次提交
    • T
      netlink: Lockless lookup with RCU grace period in socket release · 21e4902a
      Thomas Graf 提交于
      Defers the release of the socket reference using call_rcu() to
      allow using an RCU read-side protected call to rhashtable_lookup()
      
      This restores behaviour and performance gains as previously
      introduced by e341694e ("netlink: Convert netlink_lookup() to use
      RCU protected hash table") without the side effect of severely
      delayed socket destruction.
      Signed-off-by: NThomas Graf <tgraf@suug.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      21e4902a
    • T
      rhashtable: Per bucket locks & deferred expansion/shrinking · 97defe1e
      Thomas Graf 提交于
      Introduces an array of spinlocks to protect bucket mutations. The number
      of spinlocks per CPU is configurable and selected based on the hash of
      the bucket. This allows for parallel insertions and removals of entries
      which do not share a lock.
      
      The patch also defers expansion and shrinking to a worker queue which
      allows insertion and removal from atomic context. Insertions and
      deletions may occur in parallel to it and are only held up briefly
      while the particular bucket is linked or unzipped.
      
      Mutations of the bucket table pointer is protected by a new mutex, read
      access is RCU protected.
      
      In the event of an expansion or shrinking, the new bucket table allocated
      is exposed as a so called future table as soon as the resize process
      starts.  Lookups, deletions, and insertions will briefly use both tables.
      The future table becomes the main table after an RCU grace period and
      initial linking of the old to the new table was performed. Optimization
      of the chains to make use of the new number of buckets follows only the
      new table is in use.
      
      The side effect of this is that during that RCU grace period, a bucket
      traversal using any rht_for_each() variant on the main table will not see
      any insertions performed during the RCU grace period which would at that
      point land in the future table. The lookup will see them as it searches
      both tables if needed.
      
      Having multiple insertions and removals occur in parallel requires nelems
      to become an atomic counter.
      Signed-off-by: NThomas Graf <tgraf@suug.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      97defe1e
    • T
      rhashtable: Convert bucket iterators to take table and index · 88d6ed15
      Thomas Graf 提交于
      This patch is in preparation to introduce per bucket spinlocks. It
      extends all iterator macros to take the bucket table and bucket
      index. It also introduces a new rht_dereference_bucket() to
      handle protected accesses to buckets.
      
      It introduces a barrier() to the RCU iterators to the prevent
      the compiler from caching the first element.
      
      The lockdep verifier is introduced as stub which always succeeds
      and properly implement in the next patch when the locks are
      introduced.
      Signed-off-by: NThomas Graf <tgraf@suug.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      88d6ed15
    • T
      rhashtable: Do hashing inside of rhashtable_lookup_compare() · 8d24c0b4
      Thomas Graf 提交于
      Hash the key inside of rhashtable_lookup_compare() like
      rhashtable_lookup() does. This allows to simplify the hashing
      functions and keep them private.
      Signed-off-by: NThomas Graf <tgraf@suug.ch>
      Cc: netfilter-devel@vger.kernel.org
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8d24c0b4
  6. 30 12月, 2014 1 次提交
  7. 27 12月, 2014 5 次提交
  8. 19 12月, 2014 2 次提交
    • T
      netlink: Don't reorder loads/stores before marking mmap netlink frame as available · a18e6a18
      Thomas Graf 提交于
      Each mmap Netlink frame contains a status field which indicates
      whether the frame is unused, reserved, contains data or needs to
      be skipped. Both loads and stores may not be reordeded and must
      complete before the status field is changed and another CPU might
      pick up the frame for use. Use an smp_mb() to cover needs of both
      types of callers to netlink_set_status(), callers which have been
      reading data frame from the frame, and callers which have been
      filling or releasing and thus writing to the frame.
      
      - Example code path requiring a smp_rmb():
        memcpy(skb->data, (void *)hdr + NL_MMAP_HDRLEN, hdr->nm_len);
        netlink_set_status(hdr, NL_MMAP_STATUS_UNUSED);
      
      - Example code path requiring a smp_wmb():
        hdr->nm_uid	= from_kuid(sk_user_ns(sk), NETLINK_CB(skb).creds.uid);
        hdr->nm_gid	= from_kgid(sk_user_ns(sk), NETLINK_CB(skb).creds.gid);
        netlink_frame_flush_dcache(hdr);
        netlink_set_status(hdr, NL_MMAP_STATUS_VALID);
      
      Fixes: f9c228 ("netlink: implement memory mapped recvmsg()")
      Reported-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NThomas Graf <tgraf@suug.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a18e6a18
    • D
      netlink: Always copy on mmap TX. · 4682a035
      David Miller 提交于
      Checking the file f_count and the nlk->mapped count is not completely
      sufficient to prevent the mmap'd area contents from changing from
      under us during netlink mmap sendmsg() operations.
      
      Be careful to sample the header's length field only once, because this
      could change from under us as well.
      
      Fixes: 5fd96123 ("netlink: implement memory mapped sendmsg()")
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Acked-by: NDaniel Borkmann <dborkman@redhat.com>
      Acked-by: NThomas Graf <tgraf@suug.ch>
      4682a035
  9. 11 12月, 2014 1 次提交
  10. 10 12月, 2014 1 次提交
    • A
      put iov_iter into msghdr · c0371da6
      Al Viro 提交于
      Note that the code _using_ ->msg_iter at that point will be very
      unhappy with anything other than unshifted iovec-backed iov_iter.
      We still need to convert users to proper primitives.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      c0371da6
  11. 24 11月, 2014 1 次提交
  12. 20 11月, 2014 1 次提交
  13. 14 11月, 2014 3 次提交
  14. 13 11月, 2014 1 次提交
  15. 06 11月, 2014 1 次提交
    • D
      net: Add and use skb_copy_datagram_msg() helper. · 51f3d02b
      David S. Miller 提交于
      This encapsulates all of the skb_copy_datagram_iovec() callers
      with call argument signature "skb, offset, msghdr->msg_iov, length".
      
      When we move to iov_iters in the networking, the iov_iter object will
      sit in the msghdr.
      
      Having a helper like this means there will be less places to touch
      during that transformation.
      
      Based upon descriptions and patch from Al Viro.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      51f3d02b
  16. 22 10月, 2014 1 次提交
  17. 09 10月, 2014 1 次提交
    • A
      fix misuses of f_count() in ppp and netlink · 24dff96a
      Al Viro 提交于
      we used to check for "nobody else could start doing anything with
      that opened file" by checking that refcount was 2 or less - one
      for descriptor table and one we'd acquired in fget() on the way to
      wherever we are.  That was race-prone (somebody else might have
      had a reference to descriptor table and do fget() just as we'd
      been checking) and it had become flat-out incorrect back when
      we switched to fget_light() on those codepaths - unlike fget(),
      it doesn't grab an extra reference unless the descriptor table
      is shared.  The same change allowed a race-free check, though -
      we are safe exactly when refcount is less than 2.
      
      It was a long time ago; pre-2.6.12 for ioctl() (the codepath leading
      to ppp one) and 2.6.17 for sendmsg() (netlink one).  OTOH,
      netlink hadn't grown that check until 3.9 and ppp used to live
      in drivers/net, not drivers/net/ppp until 3.1.  The bug existed
      well before that, though, and the same fix used to apply in old
      location of file.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      24dff96a
  18. 15 8月, 2014 1 次提交
  19. 08 8月, 2014 1 次提交
  20. 07 8月, 2014 1 次提交
  21. 05 8月, 2014 1 次提交
  22. 03 8月, 2014 1 次提交
    • T
      netlink: Convert netlink_lookup() to use RCU protected hash table · e341694e
      Thomas Graf 提交于
      Heavy Netlink users such as Open vSwitch spend a considerable amount of
      time in netlink_lookup() due to the read-lock on nl_table_lock. Use of
      RCU relieves the lock contention.
      
      Makes use of the new resizable hash table to avoid locking on the
      lookup.
      
      The hash table will grow if entries exceeds 75% of table size up to a
      total table size of 64K. It will automatically shrink if usage falls
      below 30%.
      
      Also splits nl_table_lock into a separate mutex to protect hash table
      mutations and allow synchronize_rcu() to sleep while waiting for readers
      during expansion and shrinking.
      
      Before:
         9.16%  kpktgend_0  [openvswitch]      [k] masked_flow_lookup
         6.42%  kpktgend_0  [pktgen]           [k] mod_cur_headers
         6.26%  kpktgend_0  [pktgen]           [k] pktgen_thread_worker
         6.23%  kpktgend_0  [kernel.kallsyms]  [k] memset
         4.79%  kpktgend_0  [kernel.kallsyms]  [k] netlink_lookup
         4.37%  kpktgend_0  [kernel.kallsyms]  [k] memcpy
         3.60%  kpktgend_0  [openvswitch]      [k] ovs_flow_extract
         2.69%  kpktgend_0  [kernel.kallsyms]  [k] jhash2
      
      After:
        15.26%  kpktgend_0  [openvswitch]      [k] masked_flow_lookup
         8.12%  kpktgend_0  [pktgen]           [k] pktgen_thread_worker
         7.92%  kpktgend_0  [pktgen]           [k] mod_cur_headers
         5.11%  kpktgend_0  [kernel.kallsyms]  [k] memset
         4.11%  kpktgend_0  [openvswitch]      [k] ovs_flow_extract
         4.06%  kpktgend_0  [kernel.kallsyms]  [k] _raw_spin_lock
         3.90%  kpktgend_0  [kernel.kallsyms]  [k] jhash2
         [...]
         0.67%  kpktgend_0  [kernel.kallsyms]  [k] netlink_lookup
      Signed-off-by: NThomas Graf <tgraf@suug.ch>
      Reviewed-by: NNikolay Aleksandrov <nikolay@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e341694e
  23. 01 8月, 2014 1 次提交
  24. 17 7月, 2014 1 次提交
  25. 10 7月, 2014 1 次提交
    • B
      netlink: Fix handling of error from netlink_dump(). · ac30ef83
      Ben Pfaff 提交于
      netlink_dump() returns a negative errno value on error.  Until now,
      netlink_recvmsg() directly recorded that negative value in sk->sk_err, but
      that's wrong since sk_err takes positive errno values.  (This manifests as
      userspace receiving a positive return value from the recv() system call,
      falsely indicating success.) This bug was introduced in the commit that
      started checking the netlink_dump() return value, commit b44d211e (netlink:
      handle errors from netlink_dump()).
      
      Multithreaded Netlink dumps are one way to trigger this behavior in
      practice, as described in the commit message for the userspace workaround
      posted here:
          http://openvswitch.org/pipermail/dev/2014-June/042339.html
      
      This commit also fixes the same bug in netlink_poll(), introduced in commit
      cd1df525 (netlink: add flow control for memory mapped I/O).
      Signed-off-by: NBen Pfaff <blp@nicira.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ac30ef83
  26. 08 7月, 2014 1 次提交
  27. 03 6月, 2014 2 次提交
    • E
      netlink: Only check file credentials for implicit destinations · 2d7a85f4
      Eric W. Biederman 提交于
      It was possible to get a setuid root or setcap executable to write to
      it's stdout or stderr (which has been set made a netlink socket) and
      inadvertently reconfigure the networking stack.
      
      To prevent this we check that both the creator of the socket and
      the currentl applications has permission to reconfigure the network
      stack.
      
      Unfortunately this breaks Zebra which always uses sendto/sendmsg
      and creates it's socket without any privileges.
      
      To keep Zebra working don't bother checking if the creator of the
      socket has privilege when a destination address is specified.  Instead
      rely exclusively on the privileges of the sender of the socket.
      
      Note from Andy: This is exactly Eric's code except for some comment
      clarifications and formatting fixes.  Neither I nor, I think, anyone
      else is thrilled with this approach, but I'm hesitant to wait on a
      better fix since 3.15 is almost here.
      
      Note to stable maintainers: This is a mess.  An earlier series of
      patches in 3.15 fix a rather serious security issue (CVE-2014-0181),
      but they did so in a way that breaks Zebra.  The offending series
      includes:
      
          commit aa4cf945
          Author: Eric W. Biederman <ebiederm@xmission.com>
          Date:   Wed Apr 23 14:28:03 2014 -0700
      
              net: Add variants of capable for use on netlink messages
      
      If a given kernel version is missing that series of fixes, it's
      probably worth backporting it and this patch.  if that series is
      present, then this fix is critical if you care about Zebra.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NAndy Lutomirski <luto@amacapital.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2d7a85f4
    • D
      genetlink: remove superfluous assignment · 2f91abd4
      Denis ChengRq 提交于
      the local variable ops and n_ops were just read out from family,
      and not changed, hence no need to assign back.
      
      Validation functions should operate on const parameters and not
      change anything.
      Signed-off-by: NCheng Renquan <crquan@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2f91abd4
  28. 25 4月, 2014 2 次提交