1. 22 Jun, 2015 1 commit
    • D
      netlink: add API to retrieve all group memberships · b42be38b
      David Herrmann committed
      This patch adds getsockopt(SOL_NETLINK, NETLINK_LIST_MEMBERSHIPS) to
      retrieve all groups a socket is a member of. Currently, we have to use
      getsockname() and look at the nl.nl_groups bitmask. However, this mask is
      limited to 32 groups. Hence, similar to NETLINK_ADD_MEMBERSHIP and
      NETLINK_DROP_MEMBERSHIP, this adds a separate sockopt to manage group
      IDs higher than 32.
      
      This new NETLINK_LIST_MEMBERSHIPS option takes a pointer to __u32 and the
      size of the array. The array is filled with the full membership-set of the
      socket, and the required array size is returned in optlen. Hence,
      user-space can retry with a properly sized array in case it was too small.
      Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b42be38b
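The retry protocol described in this commit can be sketched from user space as follows. This is a minimal sketch, not code from the patch: `list_memberships()` and `group_is_member()` are hypothetical helper names, the fallback constant values are taken from the mainline UAPI headers, and the code assumes that the returned array is a bitmap in which group N (counting from 1) is bit N - 1 and that the size reported in optlen is in bytes.

```c
#include <stdint.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <linux/netlink.h>

/* Fallbacks for older headers; values assumed from the mainline UAPI headers. */
#ifndef SOL_NETLINK
#define SOL_NETLINK 270
#endif
#ifndef NETLINK_LIST_MEMBERSHIPS
#define NETLINK_LIST_MEMBERSHIPS 9
#endif

/* Assumption: group N (1-based) is bit (N - 1) of the returned __u32 array. */
int group_is_member(const uint32_t *groups, size_t n_words, unsigned int group)
{
    if (group == 0 || (group - 1) / 32 >= n_words)
        return 0;
    return (groups[(group - 1) / 32] >> ((group - 1) % 32)) & 1u;
}

/*
 * Hypothetical helper: fetch the membership set, retrying with the size
 * the kernel reports in optlen. Returns a malloc()ed array (caller frees)
 * and stores the number of 32-bit words in *n_words, or NULL on error.
 */
uint32_t *list_memberships(int fd, size_t *n_words)
{
    socklen_t len = sizeof(uint32_t);   /* start with room for 32 groups */

    for (;;) {
        socklen_t have = len;
        uint32_t *groups = calloc(1, have);

        if (!groups)
            return NULL;
        if (getsockopt(fd, SOL_NETLINK, NETLINK_LIST_MEMBERSHIPS,
                       groups, &len) < 0) {
            free(groups);
            return NULL;
        }
        if (len <= have) {              /* the array was large enough */
            *n_words = len / sizeof(uint32_t);
            return groups;
        }
        free(groups);                   /* too small: len holds the required size */
    }
}
```

A caller would open an AF_NETLINK socket, join groups with NETLINK_ADD_MEMBERSHIP, and then test individual groups with `group_is_member()` on the array returned by `list_memberships()`.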
  2. 18 May, 2015 1 commit
    • H
      netlink: Use random autobind rover · b9fbe709
      Herbert Xu committed
      Currently we use a global rover to select a port ID that is unique.
      This used to work consistently when it was protected with a global
      lock.  However as we're now lockless, the global rover can exhibit
      pathological behaviour should multiple threads all stomp on it at
      the same time.
      
      Granted this will eventually resolve itself but the process is
      suboptimal.
      
      This patch replaces the global rover with a pseudorandom starting
      point to avoid this issue.
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b9fbe709
  3. 17 May, 2015 1 commit
  4. 15 May, 2015 1 commit
    • E
      netlink: move nl_table in read_mostly section · 91dd93f9
      Eric Dumazet committed
      Netlink socket creation and deletion heavily modify nl_table_users
      and nl_table_lock.
      
      If nl_table is sharing one cache line with one of them, netlink
      performance is really bad on SMP.
      
      ffffffff81ff5f00 B nl_table
      ffffffff81ff5f0c b nl_table_users
      
      Putting nl_table in the read_mostly section increased the performance
      of my open/delete netlink sockets test by about 80%.
      
      This came up while diagnosing a getaddrinfo() problem.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      91dd93f9
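The false-sharing problem in this commit can be illustrated in portable C11. This is only a user-space analogue: the kernel patch moves nl_table into a separate read-mostly section, whereas the sketch below forces the two kinds of data onto different cache lines with explicit alignment. The struct names and the 64-byte line size are assumptions.

```c
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE 64   /* assumed line size; real values are CPU-specific */

/* Hypothetical stand-in for state written on every socket create/delete. */
struct hot_state {
    unsigned int users;
    int lock;
};

struct netlink_globals {
    alignas(CACHE_LINE) void *table;          /* read-mostly pointer */
    alignas(CACHE_LINE) struct hot_state hot; /* frequently written  */
};

/* Return the index of the cache line that holds a given member offset. */
size_t cache_line_of(size_t offset)
{
    return offset / CACHE_LINE;
}
```

With this layout, writes to `hot` no longer invalidate the line that holds `table` on other CPUs, which is the effect the commit attributes its speedup to.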
  5. 11 May, 2015 2 commits
  6. 10 May, 2015 2 commits
    • N
      netlink: allow to listen "all" netns · 59324cf3
      Nicolas Dichtel committed
      More accurately, listen to all netns that have a nsid assigned in the netns
      where the netlink socket is opened.
      For this purpose, a netlink socket option is added:
      NETLINK_LISTEN_ALL_NSID. When this option is set on a netlink socket, this
      socket will receive netlink notifications from all netns that have a nsid
      assigned in the netns where the socket has been opened. The nsid is sent
      to userland via ancillary data.
      
      With this patch, a daemon needs only one socket to listen to many netns. This
      is useful when the number of netns is high.
      
      Because 0 is a valid value for a nsid, the field nsid_is_set indicates if
      the field nsid is valid or not. skb->cb is initialized to 0 on skb
      allocation, thus we are sure that we will never send a nsid 0 by error to
      the userland.
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Acked-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      59324cf3
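On the receiving side, the nsid travels as ancillary data, so a listener has to walk the control messages returned by recvmsg(). The following is a sketch under stated assumptions: `find_nsid()` is a hypothetical helper, the fallback constant values come from the mainline UAPI headers, and the payload is assumed to be a single int carried in a control message of level SOL_NETLINK and type NETLINK_LISTEN_ALL_NSID.

```c
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/netlink.h>

/* Fallbacks for older headers; values assumed from the mainline UAPI headers. */
#ifndef SOL_NETLINK
#define SOL_NETLINK 270
#endif
#ifndef NETLINK_LISTEN_ALL_NSID
#define NETLINK_LISTEN_ALL_NSID 8
#endif

/*
 * Hypothetical helper: scan the control messages of a received msghdr.
 * Returns 1 and stores the nsid if one is attached, 0 otherwise.
 * A return of 0 means "no nsid", which is distinct from nsid 0.
 */
int find_nsid(struct msghdr *msg, int *nsid)
{
    struct cmsghdr *cmsg;

    for (cmsg = CMSG_FIRSTHDR(msg); cmsg; cmsg = CMSG_NXTHDR(msg, cmsg)) {
        if (cmsg->cmsg_level == SOL_NETLINK &&
            cmsg->cmsg_type == NETLINK_LISTEN_ALL_NSID &&
            cmsg->cmsg_len >= CMSG_LEN(sizeof(int))) {
            memcpy(nsid, CMSG_DATA(cmsg), sizeof(int));
            return 1;
        }
    }
    return 0;
}
```

The option itself would be enabled beforehand with setsockopt(fd, SOL_NETLINK, NETLINK_LISTEN_ALL_NSID, &one, sizeof(one)), and the msghdr passed to recvmsg() needs a control buffer of at least CMSG_SPACE(sizeof(int)) bytes. The separate found/not-found return value mirrors the nsid_is_set flag on the kernel side.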
    • N
      netlink: rename private flags and states · cc3a572f
      Nicolas Dichtel committed
      These flags and states have the same prefix (NETLINK_) as netlink socket
      options. To avoid confusion and to be able to name a flag like a socket
      option, let's use another prefix: NETLINK_[S|F]_.
      
      Note: a comment has been fixed, it was talking about
      NETLINK_RECV_NO_ENOBUFS socket option instead of NETLINK_NO_ENOBUFS.
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Acked-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cc3a572f
  7. 04 May, 2015 1 commit
  8. 26 Apr, 2015 1 commit
    • E
      net: fix crash in build_skb() · 2ea2f62c
      Eric Dumazet committed
      When I added pfmemalloc support in build_skb(), I forgot netlink
      was using build_skb() with a vmalloc() area.
      
      In this patch I introduce __build_skb() for netlink use,
      and build_skb() is a wrapper handling both skb->head_frag and
      skb->pfmemalloc
      
      This means netlink no longer has to hack skb->head_frag
      
      [ 1567.700067] kernel BUG at arch/x86/mm/physaddr.c:26!
      [ 1567.700067] invalid opcode: 0000 [#1] PREEMPT SMP KASAN
      [ 1567.700067] Dumping ftrace buffer:
      [ 1567.700067]    (ftrace buffer empty)
      [ 1567.700067] Modules linked in:
      [ 1567.700067] CPU: 9 PID: 16186 Comm: trinity-c182 Not tainted 4.0.0-next-20150424-sasha-00037-g4796e21 #2167
      [ 1567.700067] task: ffff880127efb000 ti: ffff880246770000 task.ti: ffff880246770000
      [ 1567.700067] RIP: __phys_addr (arch/x86/mm/physaddr.c:26 (discriminator 3))
      [ 1567.700067] RSP: 0018:ffff8802467779d8  EFLAGS: 00010202
      [ 1567.700067] RAX: 000041000ed8e000 RBX: ffffc9008ed8e000 RCX: 000000000000002c
      [ 1567.700067] RDX: 0000000000000004 RSI: 0000000000000000 RDI: ffffffffb3fd6049
      [ 1567.700067] RBP: ffff8802467779f8 R08: 0000000000000019 R09: ffff8801d0168000
      [ 1567.700067] R10: ffff8801d01680c7 R11: ffffed003a02d019 R12: ffffc9000ed8e000
      [ 1567.700067] R13: 0000000000000f40 R14: 0000000000001180 R15: ffffc9000ed8e000
      [ 1567.700067] FS:  00007f2a7da3f700(0000) GS:ffff8801d1000000(0000) knlGS:0000000000000000
      [ 1567.700067] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 1567.700067] CR2: 0000000000738308 CR3: 000000022e329000 CR4: 00000000000007e0
      [ 1567.700067] Stack:
      [ 1567.700067]  ffffc9000ed8e000 ffff8801d0168000 ffffc9000ed8e000 ffff8801d0168000
      [ 1567.700067]  ffff880246777a28 ffffffffad7c0a21 0000000000001080 ffff880246777c08
      [ 1567.700067]  ffff88060d302e68 ffff880246777b58 ffff880246777b88 ffffffffad9a6821
      [ 1567.700067] Call Trace:
      [ 1567.700067] build_skb (include/linux/mm.h:508 net/core/skbuff.c:316)
      [ 1567.700067] netlink_sendmsg (net/netlink/af_netlink.c:1633 net/netlink/af_netlink.c:2329)
      [ 1567.774369] ? sched_clock_cpu (kernel/sched/clock.c:311)
      [ 1567.774369] ? netlink_unicast (net/netlink/af_netlink.c:2273)
      [ 1567.774369] ? netlink_unicast (net/netlink/af_netlink.c:2273)
      [ 1567.774369] sock_sendmsg (net/socket.c:614 net/socket.c:623)
      [ 1567.774369] sock_write_iter (net/socket.c:823)
      [ 1567.774369] ? sock_sendmsg (net/socket.c:806)
      [ 1567.774369] __vfs_write (fs/read_write.c:479 fs/read_write.c:491)
      [ 1567.774369] ? get_lock_stats (kernel/locking/lockdep.c:249)
      [ 1567.774369] ? default_llseek (fs/read_write.c:487)
      [ 1567.774369] ? vtime_account_user (kernel/sched/cputime.c:701)
      [ 1567.774369] ? rw_verify_area (fs/read_write.c:406 (discriminator 4))
      [ 1567.774369] vfs_write (fs/read_write.c:539)
      [ 1567.774369] SyS_write (fs/read_write.c:586 fs/read_write.c:577)
      [ 1567.774369] ? SyS_read (fs/read_write.c:577)
      [ 1567.774369] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
      [ 1567.774369] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2594 kernel/locking/lockdep.c:2636)
      [ 1567.774369] ? trace_hardirqs_on_thunk (arch/x86/lib/thunk_64.S:42)
      [ 1567.774369] system_call_fastpath (arch/x86/kernel/entry_64.S:261)
      
      Fixes: 79930f58 ("net: do not deplete pfmemalloc reserve")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2ea2f62c
  9. 26 Mar, 2015 1 commit
  10. 25 Mar, 2015 1 commit
  11. 24 Mar, 2015 1 commit
  12. 21 Mar, 2015 2 commits
  13. 19 Mar, 2015 1 commit
  14. 03 Mar, 2015 1 commit
  15. 28 Feb, 2015 1 commit
  16. 05 Feb, 2015 1 commit
  17. 04 Feb, 2015 1 commit
    • A
      netlink: make the check for "send from tx_ring" deterministic · a8866ff6
      Al Viro committed
      As it is, zero msg_iovlen means that the first iovec in the kernel
      array of iovecs is left uninitialized, so checking if its ->iov_base
      is NULL is random.  Since the real users of that thing are doing
      sendto(fd, NULL, 0, ...), they are getting msg_iovlen = 1 and
      msg_iov[0] = {NULL, 0}, which is what this test is trying to catch.
      As suggested by davem, let's just check that msg_iovlen was 1 and
      msg_iov[0].iov_base was NULL - _that_ is well-defined and it catches
      what we want to catch.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      a8866ff6
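The deterministic check described here can be mirrored on the user-space msghdr layout. This is an illustrative sketch only: the kernel operates on its own message structure, and `is_tx_ring_request()` is a hypothetical name.

```c
#include <stddef.h>
#include <sys/socket.h>
#include <sys/uio.h>

/*
 * Returns 1 only for the well-defined "send from the TX ring" shape:
 * exactly one iovec whose base pointer is NULL. With msg_iovlen == 0
 * the first iovec is never read, so no uninitialized data is inspected.
 */
int is_tx_ring_request(const struct msghdr *msg)
{
    return msg->msg_iovlen == 1 && msg->msg_iov[0].iov_base == NULL;
}
```

Checking the length first is the point of the fix: the short-circuit guarantees that iov_base is examined only when an iovec actually exists.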
  18. 31 Jan, 2015 1 commit
  19. 29 Jan, 2015 1 commit
    • C
      net: remove sock_iocb · 7cc05662
      Christoph Hellwig committed
      The sock_iocb structure is allocated on the stack for each read/write-like
      operation on sockets, and contains various fields of which only the
      embedded msghdr and sometimes a pointer to the scm_cookie is ever used.
      Get rid of the sock_iocb and put a msghdr directly on the stack and pass
      the scm_cookie explicitly to netlink_mmap_sendmsg.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7cc05662
  20. 27 Jan, 2015 1 commit
  21. 18 Jan, 2015 1 commit
    • J
      netlink: make nlmsg_end() and genlmsg_end() void · 053c095a
      Johannes Berg committed
      Contrary to common expectations for an "int" return, these functions
      return only a positive value -- if used correctly they cannot even
      return 0 because the message header will necessarily be in the skb.
      
      This makes the very common pattern of
      
        if (genlmsg_end(...) < 0) { ... }
      
      be a whole bunch of dead code. Many places also simply do
      
        return nlmsg_end(...);
      
      and the caller is expected to deal with it.
      
      This also commonly (at least for me) causes errors, because it is very
      common to write
      
        if (my_function(...))
          /* error condition */
      
      and if my_function() does "return nlmsg_end()" this is of course wrong.
      
      Additionally, there's not a single place in the kernel that actually
      needs the message length returned, and if anyone needs it later then
      it'll be very easy to just use skb->len there.
      
      Remove this, and make the functions void. This removes a bunch of dead
      code as described above. The patch adds lines because I did
      
      -	return nlmsg_end(...);
      +	nlmsg_end(...);
      +	return 0;
      
      I could have preserved all the function's return values by returning
      skb->len, but instead I've audited all the places calling the affected
      functions and found that none cared. A few places actually compared
      the return value with <= 0 in dump functionality, but that could just
      be changed to < 0 with no change in behaviour, so I opted for the more
      efficient version.
      
      One instance of the error I've made numerous times now is also present
      in net/phonet/pn_netlink.c in the route_dumpit() function - it didn't
      check for <0 or <=0 and thus broke out of the loop every single time.
      I've preserved this since it will (I think) have caused the messages to
      userspace to be formatted differently with just a single message for
      every SKB returned to userspace. It's possible that this isn't needed
      for the tools that actually use this, but I don't even know what they
      are so couldn't test that changing this behaviour would be acceptable.
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      053c095a
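The pitfall this commit removes can be reproduced with plain C stand-ins. Nothing below is kernel code: `struct fake_msg` and the four functions are hypothetical, and the constant 16 merely stands in for a message header length.

```c
#include <stddef.h>

struct fake_msg { size_t len; };   /* stand-in for an skb, not a kernel type */

/* Old convention: finalize and return the (always positive) length. */
int old_msg_end(struct fake_msg *m)
{
    m->len += 16;                  /* pretend a 16-byte header was accounted */
    return (int)m->len;
}

/* New convention: finalize only; callers read m->len if they need it. */
void new_msg_end(struct fake_msg *m)
{
    m->len += 16;
}

/* Fill helper in the old style: leaks the length as its return value. */
int fill_old(struct fake_msg *m)
{
    return old_msg_end(m);
}

/* Fill helper in the new style: 0 on success, negative on error. */
int fill_new(struct fake_msg *m)
{
    new_msg_end(m);
    return 0;
}
```

A caller that writes `if (fill_old(&m)) { /* error */ }` takes the error branch on every successful call, because the result is a positive length. With `fill_new()` the same test behaves as intended.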
  22. 17 Jan, 2015 2 commits
    • J
      genetlink: synchronize socket closing and family removal · ee1c2442
      Johannes Berg committed
      In addition to the problem Jeff Layton reported, I looked at the code
      and reproduced the same warning by subscribing and removing the genl
      family with a socket still open. This is a fairly tricky race which
      originates in the fact that generic netlink allows the family to go
      away while sockets are still open - unlike regular netlink which has
      a module refcount for every open socket so in general this cannot be
      triggered.
      
      Trying to resolve this issue by the obvious locking isn't possible as
      it will result in deadlocks between unregistration and group unbind
      notification (which incidentally lockdep doesn't find due to the home
      grown locking in the netlink table.)
      
      To really resolve this, introduce a "closing socket" reference counter
      (for generic netlink only, as it's the only affected family) in the
      core netlink code and use that in generic netlink to wait for all the
      sockets that are being closed at the same time as a generic netlink
      family is removed.
      
      This fixes the race where, when a socket is closed, it should call
      the unbind, but if the family is removed at the same time the unbind
      will not find it, leading to the warning. The real problem though is
      that in this case the unbind could actually find a new family that is
      registered to have a multicast group with the same ID, and call its
      mcast_unbind(), leading to confusion.
      
      Also remove the warning since it would still trigger, but is now no
      longer a problem.
      
      This also moves the code in af_netlink.c to before unreferencing the
      module to avoid having the same problem in the normal non-genl case.
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ee1c2442
    • J
      genetlink: disallow subscribing to unknown mcast groups · 5ad63005
      Johannes Berg committed
      Jeff Layton reported that he could trigger the multicast unbind warning
      in generic netlink using trinity. I originally thought it was a race
      condition between unregistering the generic netlink family and closing
      the socket, but there's a far simpler explanation: genetlink currently
      allows subscribing to groups that don't (yet) exist, and the warning is
      triggered when unsubscribing again while the group still doesn't exist.
      
      Originally, I had a warning in the subscribe case and accepted it out of
      userspace API concerns, but the warning was of course wrong and removed
      later.
      
      However, I now think that allowing userspace to subscribe to groups that
      don't exist is wrong and could possibly become a security problem:
      Consider a (new) genetlink family implementing a permission check in
      the mcast_bind() function similar to what the audit code does today;
      it would be possible to bypass the permission check by guessing the ID
      and subscribing to the group before it exists. This is only possible in case a
      family like that would be dynamically loaded, but it doesn't seem like a
      huge stretch, for example wireless may be loaded when you plug in a USB
      device.
      
      To avoid this, reject such subscription attempts.
      
      If this ends up causing userspace issues we may need to add a workaround
      in af_netlink to deny such requests but not return an error.
      Reported-by: Jeff Layton <jeff.layton@primarydata.com>
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5ad63005
  23. 16 Jan, 2015 1 commit
  24. 14 Jan, 2015 1 commit
  25. 04 Jan, 2015 4 commits
    • T
      netlink: Lockless lookup with RCU grace period in socket release · 21e4902a
      Thomas Graf committed
      Defers the release of the socket reference using call_rcu() to
      allow using an RCU read-side protected call to rhashtable_lookup()
      
      This restores behaviour and performance gains as previously
      introduced by e341694e ("netlink: Convert netlink_lookup() to use
      RCU protected hash table") without the side effect of severely
      delayed socket destruction.
      Signed-off-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      21e4902a
    • T
      rhashtable: Per bucket locks & deferred expansion/shrinking · 97defe1e
      Thomas Graf committed
      Introduces an array of spinlocks to protect bucket mutations. The number
      of spinlocks per CPU is configurable and selected based on the hash of
      the bucket. This allows for parallel insertions and removals of entries
      which do not share a lock.
      
      The patch also defers expansion and shrinking to a worker queue which
      allows insertion and removal from atomic context. Insertions and
      deletions may occur in parallel to it and are only held up briefly
      while the particular bucket is linked or unzipped.
      
      Mutations of the bucket table pointer are protected by a new mutex; read
      access is RCU protected.
      
      In the event of an expansion or shrinking, the new bucket table allocated
      is exposed as a so called future table as soon as the resize process
      starts.  Lookups, deletions, and insertions will briefly use both tables.
      The future table becomes the main table after an RCU grace period and
      initial linking of the old to the new table was performed. Optimization
      of the chains to make use of the new number of buckets follows only once the
      new table is in use.
      
      The side effect of this is that during that RCU grace period, a bucket
      traversal using any rht_for_each() variant on the main table will not see
      any insertions performed during the RCU grace period which would at that
      point land in the future table. The lookup will see them as it searches
      both tables if needed.
      
      Having multiple insertions and removals occur in parallel requires nelems
      to become an atomic counter.
      Signed-off-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      97defe1e
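The idea of choosing a lock from the bucket hash can be sketched with C11 atomics. This is a simplified user-space illustration, not the rhashtable implementation: the names are hypothetical, the lock count is fixed at compile time rather than scaled per CPU, and a bare atomic_flag spin loop stands in for a kernel spinlock.

```c
#include <stdatomic.h>
#include <stdint.h>

#define N_LOCKS 8   /* assumed power of two, so a mask can replace modulo */

struct striped_locks {
    atomic_flag locks[N_LOCKS];
};

/* Put every lock into the unlocked state. */
void striped_init(struct striped_locks *s)
{
    for (unsigned int i = 0; i < N_LOCKS; i++)
        atomic_flag_clear(&s->locks[i]);
}

/* Buckets whose hashes share the low bits share a lock. */
unsigned int lock_index(uint32_t hash)
{
    return hash & (N_LOCKS - 1);
}

void bucket_lock(struct striped_locks *s, uint32_t hash)
{
    while (atomic_flag_test_and_set_explicit(&s->locks[lock_index(hash)],
                                             memory_order_acquire))
        ;   /* spin until the current holder releases the lock */
}

void bucket_unlock(struct striped_locks *s, uint32_t hash)
{
    atomic_flag_clear_explicit(&s->locks[lock_index(hash)],
                               memory_order_release);
}
```

Two mutations contend only when their bucket hashes map to the same lock index, which is what allows insertions and removals on unrelated buckets to proceed in parallel.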
    • T
      rhashtable: Convert bucket iterators to take table and index · 88d6ed15
      Thomas Graf committed
      This patch is in preparation to introduce per bucket spinlocks. It
      extends all iterator macros to take the bucket table and bucket
      index. It also introduces a new rht_dereference_bucket() to
      handle protected accesses to buckets.
      
      It introduces a barrier() to the RCU iterators to prevent
      the compiler from caching the first element.
      
      The lockdep verifier is introduced as a stub which always succeeds
      and is properly implemented in the next patch when the locks are
      introduced.
      Signed-off-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      88d6ed15
    • T
      rhashtable: Do hashing inside of rhashtable_lookup_compare() · 8d24c0b4
      Thomas Graf committed
      Hash the key inside of rhashtable_lookup_compare() like
      rhashtable_lookup() does. This allows the hashing functions to be
      simplified and kept private.
      Signed-off-by: Thomas Graf <tgraf@suug.ch>
      Cc: netfilter-devel@vger.kernel.org
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8d24c0b4
  26. 30 Dec, 2014 1 commit
  27. 27 Dec, 2014 5 commits
  28. 19 Dec, 2014 2 commits
    • T
      netlink: Don't reorder loads/stores before marking mmap netlink frame as available · a18e6a18
      Thomas Graf committed
      Each mmap Netlink frame contains a status field which indicates
      whether the frame is unused, reserved, contains data or needs to
      be skipped. Both loads and stores may not be reordered and must
      complete before the status field is changed and another CPU might
      pick up the frame for use. Use an smp_mb() to cover needs of both
      types of callers to netlink_set_status(), callers which have been
      reading data from the frame, and callers which have been
      filling or releasing and thus writing to the frame.
      
      - Example code path requiring a smp_rmb():
        memcpy(skb->data, (void *)hdr + NL_MMAP_HDRLEN, hdr->nm_len);
        netlink_set_status(hdr, NL_MMAP_STATUS_UNUSED);
      
      - Example code path requiring a smp_wmb():
        hdr->nm_uid	= from_kuid(sk_user_ns(sk), NETLINK_CB(skb).creds.uid);
        hdr->nm_gid	= from_kgid(sk_user_ns(sk), NETLINK_CB(skb).creds.gid);
        netlink_frame_flush_dcache(hdr);
        netlink_set_status(hdr, NL_MMAP_STATUS_VALID);
      
      Fixes: f9c228 ("netlink: implement memory mapped recvmsg()")
      Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a18e6a18
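The ordering rule in this commit has a rough user-space analogue in C11 atomics: issue a full fence, then update the status word. The sketch below is an assumption-laden illustration, not the kernel code. `struct frame`, the status values, and the helper names are hypothetical, and a sequentially consistent fence stands in for smp_mb().

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

enum { FRAME_UNUSED, FRAME_VALID };   /* hypothetical status values */

struct frame {
    _Atomic uint32_t status;
    uint32_t len;
    unsigned char data[64];
};

/* Full fence first, then publish: neither earlier loads nor earlier
 * stores on the payload may be reordered past the status update. */
void frame_set_status(struct frame *f, uint32_t status)
{
    atomic_thread_fence(memory_order_seq_cst);
    atomic_store_explicit(&f->status, status, memory_order_relaxed);
}

/* Writer path: fill the frame, then mark it valid. Returns 0 on success. */
int frame_publish(struct frame *f, const void *buf, uint32_t len)
{
    if (len > sizeof(f->data))
        return -1;
    memcpy(f->data, buf, len);
    f->len = len;
    frame_set_status(f, FRAME_VALID);
    return 0;
}

/* Reader path: copy the payload out, then hand the frame back. */
uint32_t frame_consume(struct frame *f, void *out)
{
    uint32_t len = f->len;

    memcpy(out, f->data, len);
    frame_set_status(f, FRAME_UNUSED);
    return len;
}
```

One fence in `frame_set_status()` serves both callers: the writer needs its stores to land before the status changes, and the reader needs its loads to finish before the frame is released. The side that polls the status word would need a matching acquire-style load before touching the payload, which this sketch leaves out.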
    • D
      netlink: Always copy on mmap TX. · 4682a035
      David Miller committed
      Checking the file f_count and the nlk->mapped count is not completely
      sufficient to prevent the mmap'd area contents from changing from
      under us during netlink mmap sendmsg() operations.
      
      Be careful to sample the header's length field only once, because this
      could change from under us as well.
      
      Fixes: 5fd96123 ("netlink: implement memory mapped sendmsg()")
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Acked-by: Daniel Borkmann <dborkman@redhat.com>
      Acked-by: Thomas Graf <tgraf@suug.ch>
      4682a035
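The two rules in this commit, always copy and read the length only once, can be sketched in user-space C. This is a hedged illustration rather than the patch itself: `struct shared_hdr` and `copy_from_shared()` are hypothetical, and a volatile field stands in for memory that another party can modify concurrently.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct shared_hdr {
    volatile uint32_t len;   /* may be rewritten by the other side at any time */
};

/*
 * Hypothetical helper: copy a payload out of a shared frame.
 * Returns a private malloc()ed copy (caller frees), or NULL if the
 * sampled length is invalid or allocation fails.
 */
void *copy_from_shared(const struct shared_hdr *hdr, const void *payload,
                       uint32_t max_len, uint32_t *out_len)
{
    uint32_t len = hdr->len;   /* single read; hdr->len is not read again */
    void *copy;

    if (len == 0 || len > max_len)
        return NULL;
    copy = malloc(len);
    if (!copy)
        return NULL;
    memcpy(copy, payload, len);   /* uses the validated local value */
    *out_len = len;
    return copy;
}
```

Because the bounds check and the copy both use the local `len`, a concurrent change to the shared header cannot make the copy exceed what was validated, and later processing works on the private copy instead of the shared area.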