1. 22 3月, 2017 1 次提交
  2. 28 1月, 2017 1 次提交
    • E
      net: adjust skb->truesize in pskb_expand_head() · 158f323b
      Eric Dumazet 提交于
      Slava Shwartsman reported a warning in skb_try_coalesce(), when we
      detect skb->truesize is completely wrong.
      
      In his case, issue came from IPv6 reassembly coping with malicious
      datagrams, that forced various pskb_may_pull() to reallocate a bigger
      skb->head than the one allocated by NIC driver before entering GRO
      layer.
      
      Current code does not change skb->truesize, leaving this burden to
      callers if they care enough.
      
      Blindly changing skb->truesize in pskb_expand_head() is not
      easy, as some producers might track skb->truesize, for example
      in xmit path for back pressure feedback (sk->sk_wmem_alloc)
      
      We can detect the cases where it should be safe to change
      skb->truesize :
      
      1) skb is not attached to a socket.
      2) If it is attached to a socket, destructor is sock_edemux()
      
      My audit gave only two callers doing their own skb->truesize
      manipulation.
      
      I had to remove skb parameter in sock_edemux macro when
      CONFIG_INET is not set to avoid a compile error.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NSlava Shwartsman <slavash@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      158f323b
  3. 17 1月, 2017 1 次提交
  4. 25 12月, 2016 1 次提交
  5. 11 12月, 2016 1 次提交
  6. 06 12月, 2016 1 次提交
  7. 30 11月, 2016 1 次提交
  8. 04 11月, 2016 2 次提交
    • W
      genetlink: fix a memory leak on error path · 00ffc1ba
      WANG Cong 提交于
      In __genl_register_family(), when genl_validate_assign_mc_groups()
      fails, we forget to free the memory we possibly allocate for
      family->attrbuf.
      
      Note, some callers call genl_unregister_family() to clean up
      on error path, it doesn't work because the family is inserted
      to the global list in the nearly last step.
      
      Cc: Jakub Kicinski <kubakici@wp.pl>
      Cc: Johannes Berg <johannes@sipsolutions.net>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      00ffc1ba
    • E
      netlink: netlink_diag_dump() runs without locks · 93636d1f
      Eric Dumazet 提交于
      A recent commit removed locking from netlink_diag_dump() but forgot
      one error case.
      
      =====================================
      [ BUG: bad unlock balance detected! ]
      4.9.0-rc3+ #336 Not tainted
      -------------------------------------
      syz-executor/4018 is trying to release lock ([   36.220068] nl_table_lock
      ) at:
      [<ffffffff82dc8683>] netlink_diag_dump+0x1a3/0x250 net/netlink/diag.c:182
      but there are no more locks to release!
      
      other info that might help us debug this:
      3 locks held by syz-executor/4018:
       #0: [   36.220068]  (
      sock_diag_mutex[   36.220068] ){+.+.+.}
      , at: [   36.220068] [<ffffffff82c3873b>] sock_diag_rcv+0x1b/0x40
       #1: [   36.220068]  (
      sock_diag_table_mutex[   36.220068] ){+.+.+.}
      , at: [   36.220068] [<ffffffff82c38e00>] sock_diag_rcv_msg+0x140/0x3a0
       #2: [   36.220068]  (
      nlk->cb_mutex[   36.220068] ){+.+.+.}
      , at: [   36.220068] [<ffffffff82db6600>] netlink_dump+0x50/0xac0
      
      stack backtrace:
      CPU: 1 PID: 4018 Comm: syz-executor Not tainted 4.9.0-rc3+ #336
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
       ffff8800645df688 ffffffff81b46934 ffffffff84eb3e78 ffff88006ad85800
       ffffffff82dc8683 ffffffff84eb3e78 ffff8800645df6b8 ffffffff812043ca
       dffffc0000000000 ffff88006ad85ff8 ffff88006ad85fd0 00000000ffffffff
      Call Trace:
       [<     inline     >] __dump_stack lib/dump_stack.c:15
       [<ffffffff81b46934>] dump_stack+0xb3/0x10f lib/dump_stack.c:51
       [<ffffffff812043ca>] print_unlock_imbalance_bug+0x17a/0x1a0
      kernel/locking/lockdep.c:3388
       [<     inline     >] __lock_release kernel/locking/lockdep.c:3512
       [<ffffffff8120cfd8>] lock_release+0x8e8/0xc60 kernel/locking/lockdep.c:3765
       [<     inline     >] __raw_read_unlock ./include/linux/rwlock_api_smp.h:225
       [<ffffffff83fc001a>] _raw_read_unlock+0x1a/0x30 kernel/locking/spinlock.c:255
       [<ffffffff82dc8683>] netlink_diag_dump+0x1a3/0x250 net/netlink/diag.c:182
       [<ffffffff82db6947>] netlink_dump+0x397/0xac0 net/netlink/af_netlink.c:2110
      
      Fixes: ad202074 ("netlink: Use rhashtable walk interface in diag dump")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NAndrey Konovalov <andreyknvl@google.com>
      Tested-by: NAndrey Konovalov <andreyknvl@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      93636d1f
  9. 02 11月, 2016 1 次提交
  10. 30 10月, 2016 1 次提交
    • P
      genetlink: Fix generic netlink family unregister · 0e82c763
      pravin shelar 提交于
      This patch fixes a typo in unregister operation.
      
      Following crash is fixed by this patch. It can be easily reproduced
      by repeating modprobe and rmmod module that uses genetlink.
      
      [  261.446686] BUG: unable to handle kernel paging request at ffffffffa0264088
      [  261.448921] IP: [<ffffffff813cb70e>] strcmp+0xe/0x30
      [  261.450494] PGD 1c09067
      [  261.451266] PUD 1c0a063
      [  261.452091] PMD 8068d5067
      [  261.452525] PTE 0
      [  261.453164]
      [  261.453618] Oops: 0000 [#1] SMP
      [  261.454577] Modules linked in: openvswitch(+) ...
      [  261.480753] RIP: 0010:[<ffffffff813cb70e>]  [<ffffffff813cb70e>] strcmp+0xe/0x30
      [  261.483069] RSP: 0018:ffffc90003c0bc28  EFLAGS: 00010282
      [  261.510145] Call Trace:
      [  261.510896]  [<ffffffff816f10ca>] genl_family_find_byname+0x5a/0x70
      [  261.512819]  [<ffffffff816f2319>] genl_register_family+0xb9/0x630
      [  261.514805]  [<ffffffffa02840bc>] dp_init+0xbc/0x120 [openvswitch]
      [  261.518268]  [<ffffffff8100217d>] do_one_initcall+0x3d/0x160
      [  261.525041]  [<ffffffff811808a9>] do_init_module+0x60/0x1f1
      [  261.526754]  [<ffffffff8110687f>] load_module+0x22af/0x2860
      [  261.530144]  [<ffffffff81107026>] SYSC_finit_module+0x96/0xd0
      [  261.531901]  [<ffffffff8110707e>] SyS_finit_module+0xe/0x10
      [  261.533605]  [<ffffffff8100391e>] do_syscall_64+0x6e/0x180
      [  261.535284]  [<ffffffff817c2faf>] entry_SYSCALL64_slow_path+0x25/0x25
      [  261.546512] RIP  [<ffffffff813cb70e>] strcmp+0xe/0x30
      [  261.550198] ---[ end trace 76505a814dd68770 ]---
      
      Fixes: 2ae0f17d ("genetlink: use idr to track families").
      Reported-by: NJarno Rajahalme <jarno@ovn.org>
      CC: Johannes Berg <johannes@sipsolutions.net>
      Signed-off-by: NPravin B Shelar <pshelar@ovn.org>
      Reviewed-by: NJohannes Berg <johannes@sipsolutions.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0e82c763
  11. 28 10月, 2016 5 次提交
  12. 07 10月, 2016 1 次提交
    • E
      netlink: do not enter direct reclaim from netlink_dump() · d35c99ff
      Eric Dumazet 提交于
      Since linux-3.15, netlink_dump() can use up to 16384 bytes skb
      allocations.
      
      Due to struct skb_shared_info ~320 bytes overhead, we end up using
      order-3 (on x86) page allocations, that might trigger direct reclaim and
      add stress.
      
      The intent was really to attempt a large allocation but immediately
      fallback to a smaller one (order-1 on x86) in case of memory stress.
      
      On recent kernels (linux-4.4), we can remove __GFP_DIRECT_RECLAIM to
      meet the goal. Old kernels would need to remove __GFP_WAIT
      
      While we are at it, since we do an order-3 allocation, allow to use
      all the allocated bytes instead of 16384 to reduce syscalls during
      large dumps.
      
      iproute2 already uses 32KB recvmsg() buffer sizes.
      
      Alexei provided an initial patch downsizing to SKB_WITH_OVERHEAD(16384)
      
      Fixes: 9063e21f ("netlink: autosize skb lengthes")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NAlexei Starovoitov <ast@kernel.org>
      Cc: Greg Thelen <gthelen@google.com>
      Reviewed-by: NGreg Rose <grose@lightfleet.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d35c99ff
  13. 08 9月, 2016 1 次提交
    • A
      netlink: don't forget to release a rhashtable_iter structure · 733ade23
      Andrey Vagin 提交于
      This bug was detected by kmemleak:
      unreferenced object 0xffff8804269cc3c0 (size 64):
        comm "criu", pid 1042, jiffies 4294907360 (age 13.713s)
        hex dump (first 32 bytes):
          a0 32 cc 2c 04 88 ff ff 00 00 00 00 00 00 00 00  .2.,............
          00 01 00 00 00 00 ad de 00 02 00 00 00 00 ad de  ................
        backtrace:
          [<ffffffff8184dffa>] kmemleak_alloc+0x4a/0xa0
          [<ffffffff8124720f>] kmem_cache_alloc_trace+0x10f/0x280
          [<ffffffffa02864cc>] __netlink_diag_dump+0x26c/0x290 [netlink_diag]
      
      v2: don't remove a reference on a rhashtable_iter structure to
          release it from netlink_diag_dump_done
      
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Fixes: ad202074 ("netlink: Use rhashtable walk interface in diag dump")
      Signed-off-by: NAndrei Vagin <avagin@openvz.org>
      Acked-by: NHerbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      733ade23
  14. 02 9月, 2016 1 次提交
  15. 20 8月, 2016 1 次提交
  16. 10 6月, 2016 1 次提交
  17. 17 5月, 2016 1 次提交
  18. 11 4月, 2016 1 次提交
    • D
      netlink: don't send NETLINK_URELEASE for unbound sockets · e2726020
      Dmitry Ivanov 提交于
      All existing users of NETLINK_URELEASE use it to clean up resources that
      were previously allocated to a socket via some command. As a result, no
      users require getting this notification for unbound sockets.
      
      Sending it for unbound sockets, however, is a problem because any user
      (including unprivileged users) can create a socket that uses the same ID
      as an existing socket. Binding this new socket will fail, but if the
      NETLINK_URELEASE notification is generated for such sockets, the users
      thereof will be tricked into thinking the socket that they allocated the
      resources for is closed.
      
      In the nl80211 case, this will cause destruction of virtual interfaces
      that still belong to an existing hostapd process; this is the case that
      Dmitry noticed. In the NFC case, it will cause a poll abort. In the case
      of netlink log/queue it will cause them to stop reporting events, as if
      NFULNL_CFG_CMD_UNBIND/NFQNL_CFG_CMD_UNBIND had been called.
      
      Fix this problem by checking that the socket is bound before generating
      the NETLINK_URELEASE notification.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NDmitry Ivanov <dima@ubnt.com>
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e2726020
  19. 05 4月, 2016 1 次提交
  20. 23 3月, 2016 1 次提交
  21. 19 2月, 2016 3 次提交
    • F
      nfnetlink: Revert "nfnetlink: add support for memory mapped netlink" · c5b0db32
      Florian Westphal 提交于
      reverts commit 3ab1f683 ("nfnetlink: add support for memory mapped
      netlink")'
      
      Like previous commits in the series, remove wrappers that are not needed
      after mmapped netlink removal.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c5b0db32
    • F
      Revert "genl: Add genlmsg_new_unicast() for unicast message allocation" · 263ea090
      Florian Westphal 提交于
      This reverts commit bb9b18fb ("genl: Add genlmsg_new_unicast() for
      unicast message allocation")'.
      
      Nothing wrong with it; its no longer needed since this was only for
      mmapped netlink support.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      263ea090
    • F
      netlink: remove mmapped netlink support · d1b4c689
      Florian Westphal 提交于
      mmapped netlink has a number of unresolved issues:
      
      - TX zerocopy support had to be disabled more than a year ago via
        commit 4682a035 ("netlink: Always copy on mmap TX.")
        because the content of the mmapped area can change after netlink
        attribute validation but before message processing.
      
      - RX support was implemented mainly to speed up nfqueue dumping packet
        payload to userspace.  However, since commit ae08ce00
        ("netfilter: nfnetlink_queue: zero copy support") we avoid one copy
        with the socket-based interface too (via the skb_zerocopy helper).
      
      The other problem is that skbs attached to mmaped netlink socket
      behave different from normal skbs:
      
      - they don't have a shinfo area, so all functions that use skb_shinfo()
      (e.g. skb_clone) cannot be used.
      
      - reserving headroom prevents userspace from seeing the content as
      it expects message to start at skb->head.
      See for instance
      commit aa3a0220 ("netlink: not trim skb for mmaped socket when dump").
      
      - skbs handed e.g. to netlink_ack must have non-NULL skb->sk, else we
      crash because it needs the sk to check if a tx ring is attached.
      
      Also not obvious, leads to non-intuitive bug fixes such as 7c7bdf35
      ("netfilter: nfnetlink: use original skbuff when acking batches").
      
      mmaped netlink also didn't play nicely with the skb_zerocopy helper
      used by nfqueue and openvswitch.  Daniel Borkmann fixed this via
      commit 6bb0fef4 ("netlink, mmap: fix edge-case leakages in nf queue
      zero-copy")' but at the cost of also needing to provide remaining
      length to the allocation function.
      
      nfqueue also has problems when used with mmaped rx netlink:
      - mmaped netlink doesn't allow use of nfqueue batch verdict messages.
        Problem is that in the mmap case, the allocation time also determines
        the ordering in which the frame will be seen by userspace (A
        allocating before B means that A is located in earlier ring slot,
        but this also means that B might get a lower sequence number then A
        since seqno is decided later.  To fix this we would need to extend the
        spinlocked region to also cover the allocation and message setup which
        isn't desirable.
      - nfqueue can now be configured to queue large (GSO) skbs to userspace.
        Queing GSO packets is faster than having to force a software segmentation
        in the kernel, so this is a desirable option.  However, with a mmap based
        ring one has to use 64kb per ring slot element, else mmap has to fall back
        to the socket path (NL_MMAP_STATUS_COPY) for all large packets.
      
      To use the mmap interface, userspace not only has to probe for mmap netlink
      support, it also has to implement a recv/socket receive path in order to
      handle messages that exceed the size of an rx ring element.
      
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Ken-ichirou MATSUZAWA <chamaken@gmail.com>
      Cc: Pablo Neira Ayuso <pablo@netfilter.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Cc: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d1b4c689
  22. 11 2月, 2016 1 次提交
    • T
      openvswitch: allow management from inside user namespaces · 4a92602a
      Tycho Andersen 提交于
      Operations with the GENL_ADMIN_PERM flag fail permissions checks because
      this flag means we call netlink_capable, which uses the init user ns.
      
      Instead, let's introduce a new flag, GENL_UNS_ADMIN_PERM for operations
      which should be allowed inside a user namespace.
      
      The motivation for this is to be able to run openvswitch in unprivileged
      containers. I've tested this and it seems to work, but I really have no
      idea about the security consequences of this patch, so thoughts would be
      much appreciated.
      
      v2: use the GENL_UNS_ADMIN_PERM flag instead of a check in each function
      v3: use separate ifs for UNS_ADMIN_PERM and ADMIN_PERM, instead of one
          massive one
      Reported-by: NJames Page <james.page@canonical.com>
      Signed-off-by: NTycho Andersen <tycho.andersen@canonical.com>
      CC: Eric Biederman <ebiederm@xmission.com>
      CC: Pravin Shelar <pshelar@ovn.org>
      CC: Justin Pettit <jpettit@nicira.com>
      CC: "David S. Miller" <davem@davemloft.net>
      Acked-by: NPravin B Shelar <pshelar@ovn.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4a92602a
  23. 30 1月, 2016 1 次提交
  24. 13 1月, 2016 2 次提交
  25. 16 12月, 2015 1 次提交
  26. 07 11月, 2015 1 次提交
    • M
      mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep... · d0164adc
      Mel Gorman 提交于
      mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
      
      __GFP_WAIT has been used to identify atomic context in callers that hold
      spinlocks or are in interrupts.  They are expected to be high priority and
      have access one of two watermarks lower than "min" which can be referred
      to as the "atomic reserve".  __GFP_HIGH users get access to the first
      lower watermark and can be called the "high priority reserve".
      
      Over time, callers had a requirement to not block when fallback options
      were available.  Some have abused __GFP_WAIT leading to a situation where
      an optimisitic allocation with a fallback option can access atomic
      reserves.
      
      This patch uses __GFP_ATOMIC to identify callers that are truely atomic,
      cannot sleep and have no alternative.  High priority users continue to use
      __GFP_HIGH.  __GFP_DIRECT_RECLAIM identifies callers that can sleep and
      are willing to enter direct reclaim.  __GFP_KSWAPD_RECLAIM to identify
      callers that want to wake kswapd for background reclaim.  __GFP_WAIT is
      redefined as a caller that is willing to enter direct reclaim and wake
      kswapd for background reclaim.
      
      This patch then converts a number of sites
      
      o __GFP_ATOMIC is used by callers that are high priority and have memory
        pools for those requests. GFP_ATOMIC uses this flag.
      
      o Callers that have a limited mempool to guarantee forward progress clear
        __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
        into this category where kswapd will still be woken but atomic reserves
        are not used as there is a one-entry mempool to guarantee progress.
      
      o Callers that are checking if they are non-blocking should use the
        helper gfpflags_allow_blocking() where possible. This is because
        checking for __GFP_WAIT as was done historically now can trigger false
        positives. Some exceptions like dm-crypt.c exist where the code intent
        is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
        flag manipulations.
      
      o Callers that built their own GFP flags instead of starting with GFP_KERNEL
        and friends now also need to specify __GFP_KSWAPD_RECLAIM.
      
      The first key hazard to watch out for is callers that removed __GFP_WAIT
      and was depending on access to atomic reserves for inconspicuous reasons.
      In some cases it may be appropriate for them to use __GFP_HIGH.
      
      The second key hazard is callers that assembled their own combination of
      GFP flags instead of starting with something like GFP_KERNEL.  They may
      now wish to specify __GFP_KSWAPD_RECLAIM.  It's almost certainly harmless
      if it's missed in most cases as other activity will wake kswapd.
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d0164adc
  27. 22 10月, 2015 1 次提交
    • D
      netlink: fix locking around NETLINK_LIST_MEMBERSHIPS · 47191d65
      David Herrmann 提交于
      Currently, NETLINK_LIST_MEMBERSHIPS grabs the netlink table while copying
      the membership state to user-space. However, grabing the netlink table is
      effectively a write_lock_irq(), and as such we should not be triggering
      page-faults in the critical section.
      
      This can be easily reproduced by the following snippet:
          int s = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
          void *p = mmap(0, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANON, -1, 0);
          int r = getsockopt(s, 0x10e, 9, p, (void*)((char*)p + 4092));
      
      This should work just fine, but currently triggers EFAULT and a possible
      WARN_ON below handle_mm_fault().
      
      Fix this by reducing locking of NETLINK_LIST_MEMBERSHIPS to a read-side
      lock. The write-lock was overkill in the first place, and the read-lock
      allows page-faults just fine.
      Reported-by: NDmitry Vyukov <dvyukov@google.com>
      Signed-off-by: NDavid Herrmann <dh.herrmann@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      47191d65
  28. 19 10月, 2015 1 次提交
    • A
      netlink: Trim skb to alloc size to avoid MSG_TRUNC · db65a3aa
      Arad, Ronen 提交于
      netlink_dump() allocates skb based on the calculated min_dump_alloc or
      a per socket max_recvmsg_len.
      min_alloc_size is maximum space required for any single netdev
      attributes as calculated by rtnl_calcit().
      max_recvmsg_len tracks the user provided buffer to netlink_recvmsg.
      It is capped at 16KiB.
      The intention is to avoid small allocations and to minimize the number
      of calls required to obtain dump information for all net devices.
      
      netlink_dump packs as many small messages as could fit within an skb
      that was sized for the largest single netdev information. The actual
      space available within an skb is larger than what is requested. It could
      be much larger and up to near 2x with align to next power of 2 approach.
      
      Allowing netlink_dump to use all the space available within the
      allocated skb increases the buffer size a user has to provide to avoid
      truncaion (i.e. MSG_TRUNG flag set).
      
      It was observed that with many VLANs configured on at least one netdev,
      a larger buffer of near 64KiB was necessary to avoid "Message truncated"
      error in "ip link" or "bridge [-c[ompressvlans]] vlan show" when
      min_alloc_size was only little over 32KiB.
      
      This patch trims skb to allocated size in order to allow the user to
      avoid truncation with more reasonable buffer size.
      Signed-off-by: NRonen Arad <ronen.arad@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      db65a3aa
  29. 09 10月, 2015 1 次提交
  30. 25 9月, 2015 2 次提交
    • J
      genetlink: simplify genl_notify · 92c14d9b
      Jiri Benc 提交于
      The genl_notify function has too many arguments for no real reason - all
      callers use genl_info to get them anyway. Just pass the genl_info down to
      genl_notify.
      Signed-off-by: NJiri Benc <jbenc@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      92c14d9b
    • H
      netlink: Replace rhash_portid with bound · da314c99
      Herbert Xu 提交于
      On Mon, Sep 21, 2015 at 02:20:22PM -0400, Tejun Heo wrote:
      >
      > store_release and load_acquire are different from the usual memory
      > barriers and can't be paired this way.  You have to pair store_release
      > and load_acquire.  Besides, it isn't a particularly good idea to
      
      OK I've decided to drop the acquire/release helpers as they don't
      help us at all and simply pessimises the code by using full memory
      barriers (on some architectures) where only a write or read barrier
      is needed.
      
      > depend on memory barriers embedded in other data structures like the
      > above.  Here, especially, rhashtable_insert() would have write barrier
      > *before* the entry is hashed not necessarily *after*, which means that
      > in the above case, a socket which appears to have set bound to a
      > reader might not visible when the reader tries to look up the socket
      > on the hashtable.
      
      But you are right we do need an explicit write barrier here to
      ensure that the hashing is visible.
      
      > There's no reason to be overly smart here.  This isn't a crazy hot
      > path, write barriers tend to be very cheap, store_release more so.
      > Please just do smp_store_release() and note what it's paired with.
      
      It's not about being overly smart.  It's about actually understanding
      what's going on with the code.  I've seen too many instances of
      people simply sprinkling synchronisation primitives around without
      any knowledge of what is happening underneath, which is just a recipe
      for creating hard-to-debug races.
      
      > > @@ -1539,7 +1546,7 @@ static int netlink_bind(struct socket *sock, struct sockaddr *addr,
      > >  		}
      > >  	}
      > >
      > > -	if (!nlk->portid) {
      > > +	if (!nlk->bound) {
      >
      > I don't think you can skip load_acquire here just because this is the
      > second deref of the variable.  That doesn't change anything.  Race
      > condition could still happen between the first and second tests and
      > skipping the second would lead to the same kind of bug.
      
      The reason this one is OK is because we do not use nlk->portid or
      try to get nlk from the hash table before we return to user-space.
      
      However, there is a real bug here that none of these acquire/release
      helpers discovered.  The two bound tests here used to be a single
      one.  Now that they are separate it is entirely possible for another
      thread to come in the middle and bind the socket.  So we need to
      repeat the portid check in order to maintain consistency.
      
      > > @@ -1587,7 +1594,7 @@ static int netlink_connect(struct socket *sock, struct sockaddr *addr,
      > >  	    !netlink_allowed(sock, NL_CFG_F_NONROOT_SEND))
      > >  		return -EPERM;
      > >
      > > -	if (!nlk->portid)
      > > +	if (!nlk->bound)
      >
      > Don't we need load_acquire here too?  Is this path holding a lock
      > which makes that unnecessary?
      
      Ditto.
      
      ---8<---
      The commit 1f770c0a ("netlink:
      Fix autobind race condition that leads to zero port ID") created
      some new races that can occur due to inconcsistencies between the
      two port IDs.
      
      Tejun is right that a barrier is unavoidable.  Therefore I am
      reverting to the original patch that used a boolean to indicate
      that a user netlink socket has been bound.
      
      Barriers have been added where necessary to ensure that a valid
      portid and the hashed socket is visible.
      
      I have also changed netlink_insert to only return EBUSY if the
      socket is bound to a portid different to the requested one.  This
      combined with only reading nlk->bound once in netlink_bind fixes
      a race where two threads that bind the socket at the same time
      with different port IDs may both succeed.
      
      Fixes: 1f770c0a ("netlink: Fix autobind race condition that leads to zero port ID")
      Reported-by: NTejun Heo <tj@kernel.org>
      Reported-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
      Nacked-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      da314c99
  31. 21 9月, 2015 1 次提交