1. 07 Sep 2017, 2 commits
    • netlink: access nlk groups safely in netlink bind and getname · f7736080
      Authored by Xin Long
      There is currently no lock protecting access to nlk ngroups/groups in
      netlink bind and getname. This is safe against nlk groups being set in
      netlink_release, but not against netlink_realloc_groups called by
      netlink_setsockopt.
      
      netlink_lock_table is needed in both netlink bind and getname when
      accessing nlk groups.
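      
      As a rough sketch (not the exact upstream diff), the getname path would
      only touch nlk->groups while holding the read-side table lock, assuming
      the existing netlink_lock_table()/netlink_unlock_table() helpers:
      
          /* netlink_getname(): hold the table lock so a concurrent
           * netlink_realloc_groups() cannot free or replace nlk->groups
           * while we read it. */
          netlink_lock_table();
          nladdr->nl_groups = nlk->groups ? nlk->groups[0] : 0;
          netlink_unlock_table();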
      Acked-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netlink: fix an use-after-free issue for nlk groups · be82485f
      Authored by Xin Long
      ChunYu found a netlink use-after-free issue by syzkaller:
      
      [28448.842981] BUG: KASAN: use-after-free in __nla_put+0x37/0x40 at addr ffff8807185e2378
      [28448.969918] Call Trace:
      [...]
      [28449.117207]  __nla_put+0x37/0x40
      [28449.132027]  nla_put+0xf5/0x130
      [28449.146261]  sk_diag_fill.isra.4.constprop.5+0x5a0/0x750 [netlink_diag]
      [28449.176608]  __netlink_diag_dump+0x25a/0x700 [netlink_diag]
      [28449.202215]  netlink_diag_dump+0x176/0x240 [netlink_diag]
      [28449.226834]  netlink_dump+0x488/0xbb0
      [28449.298014]  __netlink_dump_start+0x4e8/0x760
      [28449.317924]  netlink_diag_handler_dump+0x261/0x340 [netlink_diag]
      [28449.413414]  sock_diag_rcv_msg+0x207/0x390
      [28449.432409]  netlink_rcv_skb+0x149/0x380
      [28449.467647]  sock_diag_rcv+0x2d/0x40
      [28449.484362]  netlink_unicast+0x562/0x7b0
      [28449.564790]  netlink_sendmsg+0xaa8/0xe60
      [28449.661510]  sock_sendmsg+0xcf/0x110
      [28449.865631]  __sys_sendmsg+0xf3/0x240
      [28450.000964]  SyS_sendmsg+0x32/0x50
      [28450.016969]  do_syscall_64+0x25c/0x6c0
      [28450.154439]  entry_SYSCALL64_slow_path+0x25/0x25
      
      It was caused by the lack of protection between freeing nlk groups in
      netlink_release and accessing nlk groups in sk_diag_dump_groups. A
      similar issue also exists in netlink_seq_show().
      
      This patch defers freeing nlk groups to deferred_put_nlk_sk.
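      
      As a hedged sketch of that idea (simplified; the refcount and teardown
      details of the real function are elided), the groups bitmap is released
      in the RCU-deferred callback rather than directly in netlink_release():
      
          static void deferred_put_nlk_sk(struct rcu_head *head)
          {
                  struct netlink_sock *nlk = container_of(head, struct netlink_sock, rcu);
                  struct sock *sk = &nlk->sk;
      
                  kfree(nlk->groups);     /* moved here from netlink_release() */
                  nlk->groups = NULL;
      
                  /* ... existing refcount handling and sk_free(sk) follow ... */
          }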
      Reported-by: ChunYu Wang <chunwang@redhat.com>
      Acked-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 01 Jul 2017, 3 commits
  3. 16 Jun 2017, 2 commits
    • networking: make skb_put & friends return void pointers · 4df864c1
      Authored by Johannes Berg
      It seems like a historic accident that these return unsigned char *,
      and in many places that means casts are required, more often than not.
      
      Make these functions (skb_put, __skb_put and pskb_put) return void *
      and remove all the casts across the tree, adding a (u8 *) cast only
      where the unsigned char pointer was used directly, all done with the
      following spatch:
      
          @@
          expression SKB, LEN;
          typedef u8;
          identifier fn = { skb_put, __skb_put };
          @@
          - *(fn(SKB, LEN))
          + *(u8 *)fn(SKB, LEN)
      
          @@
          expression E, SKB, LEN;
          identifier fn = { skb_put, __skb_put };
          type T;
          @@
          - E = ((T *)(fn(SKB, LEN)))
          + E = fn(SKB, LEN)
      
      which actually doesn't cover pskb_put since there are only three
      users overall.
      
      A handful of stragglers were converted manually, notably a macro in
      drivers/isdn/i4l/isdn_bsdcomp.c and, oddly enough, one of the many
      instances in net/bluetooth/hci_sock.c. In the former file, I also
      had to fix one whitespace problem spatch introduced.
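      
      For illustration only (struct foo_hdr is a made-up example type), the
      kind of cast this conversion removes looks like:
      
          /* before: skb_put() returned unsigned char *, so a cast was needed */
          struct foo_hdr *hdr = (struct foo_hdr *)skb_put(skb, sizeof(*hdr));
      
          /* after: skb_put() returns void *, the cast goes away */
          struct foo_hdr *hdr = skb_put(skb, sizeof(*hdr));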
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • networking: introduce and use skb_put_data() · 59ae1d12
      Authored by Johannes Berg
      A common pattern with skb_put() is to just memcpy() some data into
      the newly reserved space; introduce skb_put_data() for this.
      
      An spatch similar to the one for skb_put_zero() converts many
      of the places using it:
      
          @@
          identifier p, p2;
          expression len, skb, data;
          type t, t2;
          @@
          (
          -p = skb_put(skb, len);
          +p = skb_put_data(skb, data, len);
          |
          -p = (t)skb_put(skb, len);
          +p = skb_put_data(skb, data, len);
          )
          (
          p2 = (t2)p;
          -memcpy(p2, data, len);
          |
          -memcpy(p, data, len);
          )
      
          @@
          type t, t2;
          identifier p, p2;
          expression skb, data;
          @@
          t *p;
          ...
          (
          -p = skb_put(skb, sizeof(t));
          +p = skb_put_data(skb, data, sizeof(t));
          |
          -p = (t *)skb_put(skb, sizeof(t));
          +p = skb_put_data(skb, data, sizeof(t));
          )
          (
          p2 = (t2)p;
          -memcpy(p2, data, sizeof(*p));
          |
          -memcpy(p, data, sizeof(*p));
          )
      
          @@
          expression skb, len, data;
          @@
          -memcpy(skb_put(skb, len), data, len);
          +skb_put_data(skb, data, len);
      
      (again, manually post-processed to retain some comments)
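      
      As a small usage illustration (hypothetical skb/data/len variables),
      the transformation boils down to:
      
          /* before: reserve space with skb_put(), then copy into it */
          memcpy(skb_put(skb, len), data, len);
      
          /* after: one helper does both steps */
          skb_put_data(skb, data, len);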
      Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 01 Jun 2017, 1 commit
  5. 14 Apr 2017, 2 commits
  6. 05 Apr 2017, 1 commit
  7. 22 Mar 2017, 1 commit
  8. 28 Jan 2017, 1 commit
    • net: adjust skb->truesize in pskb_expand_head() · 158f323b
      Authored by Eric Dumazet
      Slava Shwartsman reported a warning in skb_try_coalesce(), when we
      detect skb->truesize is completely wrong.
      
      In his case, the issue came from IPv6 reassembly coping with malicious
      datagrams that forced various pskb_may_pull() calls to reallocate a
      bigger skb->head than the one allocated by the NIC driver before
      entering the GRO layer.
      
      Current code does not change skb->truesize, leaving this burden to
      callers if they care enough.
      
      Blindly changing skb->truesize in pskb_expand_head() is not easy, as
      some producers might track skb->truesize, for example in the xmit path
      for back-pressure feedback (sk->sk_wmem_alloc).
      
      We can detect the cases where it should be safe to change
      skb->truesize:
      
      1) skb is not attached to a socket.
      2) If it is attached to a socket, its destructor is sock_edemux().
      
      My audit gave only two callers doing their own skb->truesize
      manipulation.
      
      I had to remove the skb parameter from the sock_edemux macro when
      CONFIG_INET is not set, to avoid a compile error.
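      
      A minimal sketch of the check described above (size/osize being the new
      and old head sizes inside pskb_expand_head(); not the verbatim patch):
      
          /* Only adjust truesize when it is known to be safe: the skb is
           * orphaned, or the only socket reference is the early-demux one. */
          if (!skb->sk || skb->destructor == sock_edemux)
                  skb->truesize += size - osize;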
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Slava Shwartsman <slavash@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. 17 Jan 2017, 1 commit
  10. 25 Dec 2016, 1 commit
  11. 11 Dec 2016, 1 commit
  12. 06 Dec 2016, 1 commit
  13. 30 Nov 2016, 1 commit
  14. 07 Oct 2016, 1 commit
    • netlink: do not enter direct reclaim from netlink_dump() · d35c99ff
      Authored by Eric Dumazet
      Since linux-3.15, netlink_dump() can use skb allocations of up to
      16384 bytes.
      
      Due to the ~320 byte overhead of struct skb_shared_info, we end up
      using order-3 (on x86) page allocations, which might trigger direct
      reclaim and add stress.
      
      The intent was really to attempt a large allocation but immediately
      fallback to a smaller one (order-1 on x86) in case of memory stress.
      
      On recent kernels (linux-4.4), we can remove __GFP_DIRECT_RECLAIM to
      meet the goal. Old kernels would need to remove __GFP_WAIT instead.
      
      While we are at it, since we do an order-3 allocation, allow using all
      the allocated bytes instead of 16384 to reduce syscalls during large
      dumps.
      
      iproute2 already uses 32KB recvmsg() buffer sizes.
      
      Alexei provided an initial patch downsizing to SKB_WITH_OVERHEAD(16384).
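      
      A hedged sketch of the resulting allocation strategy in netlink_dump()
      (variable names follow the surrounding code but are not guaranteed to
      match the final patch exactly):
      
          /* Try the large allocation first, but refuse to enter direct
           * reclaim; silently fall back to the minimal size on failure. */
          if (alloc_min_size < nlk->max_recvmsg_len) {
                  alloc_size = nlk->max_recvmsg_len;
                  skb = alloc_skb(alloc_size,
                                  (GFP_KERNEL & ~__GFP_DIRECT_RECLAIM) |
                                  __GFP_NOWARN | __GFP_NORETRY);
          }
          if (!skb) {
                  alloc_size = alloc_min_size;
                  skb = alloc_skb(alloc_size, GFP_KERNEL);
          }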
      
      Fixes: 9063e21f ("netlink: autosize skb lengthes")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Alexei Starovoitov <ast@kernel.org>
      Cc: Greg Thelen <gthelen@google.com>
      Reviewed-by: Greg Rose <grose@lightfleet.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  15. 17 May 2016, 1 commit
  16. 11 Apr 2016, 1 commit
    • netlink: don't send NETLINK_URELEASE for unbound sockets · e2726020
      Authored by Dmitry Ivanov
      All existing users of NETLINK_URELEASE use it to clean up resources that
      were previously allocated to a socket via some command. As a result, no
      users require getting this notification for unbound sockets.
      
      Sending it for unbound sockets, however, is a problem because any user
      (including unprivileged users) can create a socket that uses the same ID
      as an existing socket. Binding this new socket will fail, but if the
      NETLINK_URELEASE notification is generated for such sockets, the users
      thereof will be tricked into thinking the socket that they allocated the
      resources for is closed.
      
      In the nl80211 case, this will cause destruction of virtual interfaces
      that still belong to an existing hostapd process; this is the case that
      Dmitry noticed. In the NFC case, it will cause a poll abort. In the case
      of netlink log/queue it will cause them to stop reporting events, as if
      NFULNL_CFG_CMD_UNBIND/NFQNL_CFG_CMD_UNBIND had been called.
      
      Fix this problem by checking that the socket is bound before generating
      the NETLINK_URELEASE notification.
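      
      Roughly, the check amounts to the following in netlink_release()
      (sketch only; the notifier call is shown as it existed at the time):
      
          /* Only tell NETLINK_URELEASE listeners about sockets that
           * actually completed a bind. */
          if (nlk->portid && nlk->bound) {
                  struct netlink_notify n = {
                          .net      = sock_net(sk),
                          .protocol = sk->sk_protocol,
                          .portid   = nlk->portid,
                  };
                  atomic_notifier_call_chain(&netlink_chain,
                                             NETLINK_URELEASE, &n);
          }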
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Dmitry Ivanov <dima@ubnt.com>
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  17. 05 Apr 2016, 1 commit
  18. 23 Mar 2016, 1 commit
  19. 19 Feb 2016, 2 commits
    • nfnetlink: Revert "nfnetlink: add support for memory mapped netlink" · c5b0db32
      Authored by Florian Westphal
      This reverts commit 3ab1f683 ("nfnetlink: add support for memory mapped
      netlink").
      
      Like previous commits in the series, remove wrappers that are not needed
      after mmapped netlink removal.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netlink: remove mmapped netlink support · d1b4c689
      Authored by Florian Westphal
      mmapped netlink has a number of unresolved issues:
      
      - TX zerocopy support had to be disabled more than a year ago via
        commit 4682a035 ("netlink: Always copy on mmap TX.")
        because the content of the mmapped area can change after netlink
        attribute validation but before message processing.
      
      - RX support was implemented mainly to speed up nfqueue dumping packet
        payload to userspace.  However, since commit ae08ce00
        ("netfilter: nfnetlink_queue: zero copy support") we avoid one copy
        with the socket-based interface too (via the skb_zerocopy helper).
      
      The other problem is that skbs attached to an mmaped netlink socket
      behave differently from normal skbs:
      
      - they don't have a shinfo area, so all functions that use skb_shinfo()
      (e.g. skb_clone) cannot be used.
      
      - reserving headroom prevents userspace from seeing the content, as
      it expects the message to start at skb->head.
      See for instance
      commit aa3a0220 ("netlink: not trim skb for mmaped socket when dump").
      
      - skbs handed e.g. to netlink_ack must have non-NULL skb->sk, else we
      crash because it needs the sk to check if a tx ring is attached.
      
      This is also not obvious and leads to non-intuitive bug fixes such as
      7c7bdf35 ("netfilter: nfnetlink: use original skbuff when acking
      batches").
      
      mmaped netlink also didn't play nicely with the skb_zerocopy helper
      used by nfqueue and openvswitch.  Daniel Borkmann fixed this via
      commit 6bb0fef4 ("netlink, mmap: fix edge-case leakages in nf queue
      zero-copy"), but at the cost of also needing to provide the remaining
      length to the allocation function.
      
      nfqueue also has problems when used with mmaped rx netlink:
      - mmaped netlink doesn't allow use of nfqueue batch verdict messages.
        The problem is that in the mmap case, the allocation time also
        determines the ordering in which the frame will be seen by userspace
        (A allocating before B means that A is located in an earlier ring
        slot, but this also means that B might get a lower sequence number
        than A, since the seqno is decided later).  To fix this we would need
        to extend the spinlocked region to also cover the allocation and
        message setup, which isn't desirable.
      - nfqueue can now be configured to queue large (GSO) skbs to userspace.
        Queueing GSO packets is faster than having to force a software
        segmentation in the kernel, so this is a desirable option.  However,
        with an mmap-based ring one has to use 64KB per ring slot element,
        else mmap has to fall back to the socket path (NL_MMAP_STATUS_COPY)
        for all large packets.
      
      To use the mmap interface, userspace not only has to probe for mmap netlink
      support, it also has to implement a recv/socket receive path in order to
      handle messages that exceed the size of an rx ring element.
      
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Ken-ichirou MATSUZAWA <chamaken@gmail.com>
      Cc: Pablo Neira Ayuso <pablo@netfilter.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Cc: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  20. 30 Jan 2016, 1 commit
  21. 16 Dec 2015, 1 commit
  22. 07 Nov 2015, 1 commit
    • mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep... · d0164adc
      Authored by Mel Gorman
      mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
      
      __GFP_WAIT has been used to identify atomic context in callers that hold
      spinlocks or are in interrupts.  They are expected to be high priority and
      have access to one of two watermarks lower than "min", which can be referred
      to as the "atomic reserve".  __GFP_HIGH users get access to the first
      lower watermark and can be called the "high priority reserve".
      
      Over time, callers had a requirement to not block when fallback options
      were available.  Some have abused __GFP_WAIT, leading to a situation where
      an optimistic allocation with a fallback option can access atomic
      reserves.
      
      This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
      cannot sleep and have no alternative.  High priority users continue to use
      __GFP_HIGH.  __GFP_DIRECT_RECLAIM identifies callers that can sleep and
      are willing to enter direct reclaim.  __GFP_KSWAPD_RECLAIM identifies
      callers that want to wake kswapd for background reclaim.  __GFP_WAIT is
      redefined as a caller that is willing to enter direct reclaim and wake
      kswapd for background reclaim.
      
      This patch then converts a number of sites:
      
      o __GFP_ATOMIC is used by callers that are high priority and have memory
        pools for those requests. GFP_ATOMIC uses this flag.
      
      o Callers that have a limited mempool to guarantee forward progress clear
        __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
        into this category where kswapd will still be woken but atomic reserves
        are not used as there is a one-entry mempool to guarantee progress.
      
      o Callers that are checking if they are non-blocking should use the
        helper gfpflags_allow_blocking() where possible; a short illustration
        follows this list. This is because checking for __GFP_WAIT, as was
        done historically, can now trigger false positives. Some exceptions
        like dm-crypt.c exist where the code intent
        is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
        flag manipulations.
      
      o Callers that built their own GFP flags instead of starting with GFP_KERNEL
        and friends now also need to specify __GFP_KSWAPD_RECLAIM.
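      
      A brief illustration of the gfpflags_allow_blocking() point above
      (do_nonblocking_path() is a placeholder, not a real kernel function):
      
          /* historically: absence of __GFP_WAIT was read as "cannot sleep" */
          if (!(gfp_mask & __GFP_WAIT))
                  return do_nonblocking_path();
      
          /* now: ask the helper, which tests __GFP_DIRECT_RECLAIM */
          if (!gfpflags_allow_blocking(gfp_mask))
                  return do_nonblocking_path();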
      
      The first key hazard to watch out for is callers that removed __GFP_WAIT
      and were depending on access to atomic reserves for inconspicuous reasons.
      In some cases it may be appropriate for them to use __GFP_HIGH.
      
      The second key hazard is callers that assembled their own combination of
      GFP flags instead of starting with something like GFP_KERNEL.  They may
      now wish to specify __GFP_KSWAPD_RECLAIM.  It's almost certainly harmless
      if it's missed in most cases as other activity will wake kswapd.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  23. 22 Oct 2015, 1 commit
    • netlink: fix locking around NETLINK_LIST_MEMBERSHIPS · 47191d65
      Authored by David Herrmann
      Currently, NETLINK_LIST_MEMBERSHIPS grabs the netlink table while copying
      the membership state to user-space. However, grabbing the netlink table is
      effectively a write_lock_irq(), and as such we should not be triggering
      page-faults in the critical section.
      
      This can be easily reproduced by the following snippet:
          int s = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
          void *p = mmap(0, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANON, -1, 0);
          int r = getsockopt(s, 0x10e, 9, p, (void*)((char*)p + 4092));
      
      This should work just fine, but currently triggers EFAULT and a possible
      WARN_ON below handle_mm_fault().
      
      Fix this by reducing locking of NETLINK_LIST_MEMBERSHIPS to a read-side
      lock. The write-lock was overkill in the first place, and the read-lock
      allows page-faults just fine.
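      
      A minimal sketch of the new locking shape (copy_groups_to_user() is a
      hypothetical stand-in for the copy loop, not a real helper):
      
          netlink_lock_table();           /* read-side: page faults are fine */
          err = copy_groups_to_user(nlk, optval, optlen);
          netlink_unlock_table();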
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  24. 19 Oct 2015, 1 commit
    • netlink: Trim skb to alloc size to avoid MSG_TRUNC · db65a3aa
      Authored by Ronen Arad
      netlink_dump() allocates an skb based on the calculated min_dump_alloc
      or a per-socket max_recvmsg_len.
      min_alloc_size is the maximum space required for any single netdev's
      attributes, as calculated by rtnl_calcit().
      max_recvmsg_len tracks the user-provided buffer passed to
      netlink_recvmsg. It is capped at 16KiB.
      The intention is to avoid small allocations and to minimize the number
      of calls required to obtain dump information for all net devices.
      
      netlink_dump packs as many small messages as can fit within an skb
      that was sized for the largest single netdev's information. The actual
      space available within an skb is larger than what is requested. It can
      be much larger, up to nearly 2x, with the align-to-next-power-of-2
      approach.
      
      Allowing netlink_dump to use all the space available within the
      allocated skb increases the buffer size a user has to provide to avoid
      truncation (i.e. the MSG_TRUNC flag being set).
      
      It was observed that with many VLANs configured on at least one netdev,
      a larger buffer of nearly 64KiB was necessary to avoid the "Message
      truncated" error in "ip link" or "bridge [-c[ompressvlans]] vlan show"
      when min_alloc_size was only a little over 32KiB.
      
      This patch trims the skb to the allocated size in order to allow the
      user to avoid truncation with a more reasonable buffer size.
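      
      A one-line sketch of the trim, assuming alloc_size holds the size that
      was actually requested for the dump skb:
      
          /* Give back the surplus tailroom so the dump cannot pack more
           * than a matching user buffer would be able to receive. */
          skb_reserve(skb, skb_tailroom(skb) - alloc_size);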
      Signed-off-by: Ronen Arad <ronen.arad@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  25. 25 Sep 2015, 1 commit
    • netlink: Replace rhash_portid with bound · da314c99
      Authored by Herbert Xu
      On Mon, Sep 21, 2015 at 02:20:22PM -0400, Tejun Heo wrote:
      >
      > store_release and load_acquire are different from the usual memory
      > barriers and can't be paired this way.  You have to pair store_release
      > and load_acquire.  Besides, it isn't a particularly good idea to
      
      OK I've decided to drop the acquire/release helpers as they don't
      help us at all and simply pessimises the code by using full memory
      barriers (on some architectures) where only a write or read barrier
      is needed.
      
      > depend on memory barriers embedded in other data structures like the
      > above.  Here, especially, rhashtable_insert() would have write barrier
      > *before* the entry is hashed not necessarily *after*, which means that
      > in the above case, a socket which appears to have set bound to a
      > reader might not visible when the reader tries to look up the socket
      > on the hashtable.
      
      But you are right we do need an explicit write barrier here to
      ensure that the hashing is visible.
      
      > There's no reason to be overly smart here.  This isn't a crazy hot
      > path, write barriers tend to be very cheap, store_release more so.
      > Please just do smp_store_release() and note what it's paired with.
      
      It's not about being overly smart.  It's about actually understanding
      what's going on with the code.  I've seen too many instances of
      people simply sprinkling synchronisation primitives around without
      any knowledge of what is happening underneath, which is just a recipe
      for creating hard-to-debug races.
      
      > > @@ -1539,7 +1546,7 @@ static int netlink_bind(struct socket *sock, struct sockaddr *addr,
      > >  		}
      > >  	}
      > >
      > > -	if (!nlk->portid) {
      > > +	if (!nlk->bound) {
      >
      > I don't think you can skip load_acquire here just because this is the
      > second deref of the variable.  That doesn't change anything.  Race
      > condition could still happen between the first and second tests and
      > skipping the second would lead to the same kind of bug.
      
      The reason this one is OK is because we do not use nlk->portid or
      try to get nlk from the hash table before we return to user-space.
      
      However, there is a real bug here that none of these acquire/release
      helpers discovered.  The two bound tests here used to be a single
      one.  Now that they are separate it is entirely possible for another
      thread to come in the middle and bind the socket.  So we need to
      repeat the portid check in order to maintain consistency.
      
      > > @@ -1587,7 +1594,7 @@ static int netlink_connect(struct socket *sock, struct sockaddr *addr,
      > >  	    !netlink_allowed(sock, NL_CFG_F_NONROOT_SEND))
      > >  		return -EPERM;
      > >
      > > -	if (!nlk->portid)
      > > +	if (!nlk->bound)
      >
      > Don't we need load_acquire here too?  Is this path holding a lock
      > which makes that unnecessary?
      
      Ditto.
      
      ---8<---
      The commit 1f770c0a ("netlink:
      Fix autobind race condition that leads to zero port ID") created
      some new races that can occur due to inconsistencies between the
      two port IDs.
      
      Tejun is right that a barrier is unavoidable.  Therefore I am
      reverting to the original patch that used a boolean to indicate
      that a user netlink socket has been bound.
      
      Barriers have been added where necessary to ensure that a valid
      portid and the hashed socket is visible.
      
      I have also changed netlink_insert to only return EBUSY if the
      socket is bound to a portid different to the requested one.  This
      combined with only reading nlk->bound once in netlink_bind fixes
      a race where two threads that bind the socket at the same time
      with different port IDs may both succeed.
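      
      A condensed sketch of the barrier pairing described above (field and
      variable names follow the discussion, not necessarily the final diff):
      
          /* writer, netlink_insert(): make the hashed socket visible
           * before advertising the bound state */
          smp_wmb();
          nlk_sk(sk)->bound = portid;
      
          /* reader, netlink_bind(): read bound once into a local, then
           * pair with a read barrier before trusting nlk->portid */
          bound = nlk->bound;
          if (bound) {
                  smp_rmb();
                  if (nladdr->nl_pid != nlk->portid)
                          return -EINVAL;
          }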
      
      Fixes: 1f770c0a ("netlink: Fix autobind race condition that leads to zero port ID")
      Reported-by: Tejun Heo <tj@kernel.org>
      Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Nacked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  26. 21 Sep 2015, 1 commit
  27. 12 Sep 2015, 1 commit
    • netlink, mmap: transform mmap skb into full skb on taps · 1853c949
      Authored by Daniel Borkmann
      Ken-ichirou reported that running netlink in mmap mode for receive in
      combination with nlmon will throw a NULL pointer dereference in
      __kfree_skb() on nlmon_xmit(); in my case I can also trigger an "unable
      to handle kernel paging request". The problem is the skb_clone() in
      __netlink_deliver_tap_skb() for skbs that are mmaped.
      
      I.e. the cloned skb doesn't have a destructor, whereas the mmap netlink
      skb has it pointed to netlink_skb_destructor(), set in the handler
      netlink_ring_setup_skb(). There, skb->head is being set to NULL, so
      that in such cases, __kfree_skb() doesn't perform a skb_release_data()
      via skb_release_all(), where skb->head is possibly being freed through
      kfree(head) into slab allocator, although netlink mmap skb->head points
      to the mmap buffer. Similarly, the same has to be done also for large
      netlink skbs where the data area is vmalloced. Therefore, as discussed,
      make a copy for these rather rare cases for now. This fixes the issue
      on my and Ken-ichirou's test-cases.
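      
      Sketch of the resulting tap delivery (netlink_to_full_skb() being the
      copy helper this patch introduces; simplified):
      
          /* mmaped or vmalloc-backed netlink skbs cannot be cloned safely,
           * so hand the tap a freshly copied full skb instead. */
          if (netlink_skb_is_mmaped(skb) || is_vmalloc_addr(skb->head))
                  nskb = netlink_to_full_skb(skb, GFP_ATOMIC);
          else
                  nskb = skb_clone(skb, GFP_ATOMIC);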
      
      Reference: http://thread.gmane.org/gmane.linux.network/371129
      Fixes: bcbde0d4 ("net: netlink: virtual tap device management")
      Reported-by: Ken-ichirou MATSUZAWA <chamaken@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Tested-by: Ken-ichirou MATSUZAWA <chamaken@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  28. 10 Sep 2015, 2 commits
    • netlink, mmap: fix edge-case leakages in nf queue zero-copy · 6bb0fef4
      Authored by Daniel Borkmann
      When netlink mmap on receive side is the consumer of nf queue data,
      it can happen that in some edge cases, we write skb shared info into
      the user space mmap buffer:
      
      Assume a possible rx ring frame size of only 4096, and the network skb,
      which is being zero-copied into the netlink skb, contains page frags
      with an overall skb->len larger than the linear part of the netlink
      skb.
      
      skb_zerocopy(), which is generic and thus not aware of the fact that
      shared info cannot be accessed for such skbs then tries to write and
      fill frags, thus leaking kernel data/pointers and in some corner cases
      possibly writing out of bounds of the mmap area (when filling the
      last slot in the ring buffer this way).
      
      I.e. the ring buffer slot is then of status NL_MMAP_STATUS_VALID, has
      an advertised length larger than 4096, where the linear part is visible
      at the slot beginning, and the leaked sizeof(struct skb_shared_info)
      has been written to the beginning of the next slot (also corrupting
      the struct nl_mmap_hdr slot header incl. status etc), since skb->end
      points to skb->data + ring->frame_size - NL_MMAP_HDRLEN.
      
      The fix lets __netlink_alloc_skb() take the actually needed linear
      room for the network skb + meta data into account. It's completely
      irrelevant for non-mmaped netlink sockets, but in case mmap sockets
      are used, it can be decided whether the available skb_tailroom() is
      really large enough for the buffer, or whether it needs to internally
      fallback to a normal alloc_skb().
      
      From nf queue side, the information whether the destination port is
      an mmap RX ring is not really available without extra port-to-socket
      lookup, thus it can only be determined in lower layers i.e. when
      __netlink_alloc_skb() is called that checks internally for this. I
      chose to add the extra ldiff parameter as mmap will then still work:
      We have data_len and hlen in nfqnl_build_packet_message(), data_len
      is the full length (capped at queue->copy_range) for skb_zerocopy()
      and hlen some possible part of data_len that needs to be copied; the
      rem_len variable indicates the needed remaining linear mmap space.
      
      The only other workaround in nf queue internally would be after
      allocation time by f.e. cap'ing the data_len to the skb_tailroom()
      iff we deal with an mmap skb, but that would 1) expose the fact that
      we use a mmap skb to upper layers, and 2) trim the skb where we
      otherwise could just have moved the full skb into the normal receive
      queue.
      
      After the patch, in my test case the ring slot doesn't fit and therefore
      shows NL_MMAP_STATUS_COPY, where a full skb carries all the data and
      thus needs to be picked up via recv().
      
      Fixes: 3ab1f683 ("nfnetlink: add support for memory mapped netlink")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netlink, mmap: don't walk rx ring on poll if receive queue non-empty · a66e3656
      Authored by Daniel Borkmann
      In case of netlink mmap, there can be situations where received frames
      have to be placed into the normal receive queue. The ring buffer indicates
      this through NL_MMAP_STATUS_COPY, so the user is asked to pick them up
      via recvmsg(2) syscall, and to put the slot back to NL_MMAP_STATUS_UNUSED.
      
      Commit 0ef70770 ("netlink: rx mmap: fix POLLIN condition") changed
      polling so that, in the worst case, we walk the whole ring through the
      new netlink_has_valid_frame(), for example when the ring has no
      NL_MMAP_STATUS_VALID frame but at least one NL_MMAP_STATUS_COPY frame.
      
      Since we do a datagram_poll() already earlier to pick up a mask that could
      possibly contain POLLIN | POLLRDNORM already (due to NL_MMAP_STATUS_COPY),
      we can skip checking the rx ring entirely.
      
      In case the kernel is compiled with !CONFIG_NETLINK_MMAP, then all this is
      irrelevant anyway as netlink_poll() is just defined as datagram_poll().
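      
      Roughly, the poll path then becomes (sketch; identifiers as used in the
      mmap-era code):
      
          /* datagram_poll() already reported readable frames queued as
           * NL_MMAP_STATUS_COPY, so only walk the ring when it did not. */
          if (!(mask & (POLLIN | POLLRDNORM)) && nlk->rx_ring.pg_vec) {
                  if (netlink_has_valid_frame(&nlk->rx_ring))
                          mask |= POLLIN | POLLRDNORM;
          }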
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  29. 31 Aug 2015, 1 commit
    • netlink: rx mmap: fix POLLIN condition · 0ef70770
      Authored by Ken-ichirou MATSUZAWA
      poll() returns immediately after user space sets the kernel's current
      frame (ring->head) to SKIP, even though there is no new frame. And in
      the case where all frames are VALID, if the user space program
      unintentionally sets (only) the kernel's current frame to UNUSED and
      then calls poll(), it will not return immediately even though there
      are VALID frames.
      
      To avoid situations like the above, I think we need to scan all frames
      at poll() to find VALID frames, just like netlink_alloc_skb() /
      netlink_forward_ring() find an UNUSED frame at skb allocation.
      Signed-off-by: Ken-ichirou MATSUZAWA <chamas@h4.dion.ne.jp>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  30. 29 Aug 2015, 2 commits
  31. 24 Aug 2015, 1 commit
  32. 11 Aug 2015, 1 commit
    • netlink: make sure -EBUSY won't escape from netlink_insert · 4e7c1330
      Authored by Daniel Borkmann
      Linus reports the following deadlock on rtnl_mutex; triggered only
      once so far (extract):
      
      [12236.694209] NetworkManager  D 0000000000013b80     0  1047      1 0x00000000
      [12236.694218]  ffff88003f902640 0000000000000000 ffffffff815d15a9 0000000000000018
      [12236.694224]  ffff880119538000 ffff88003f902640 ffffffff81a8ff84 00000000ffffffff
      [12236.694230]  ffffffff81a8ff88 ffff880119c47f00 ffffffff815d133a ffffffff81a8ff80
      [12236.694235] Call Trace:
      [12236.694250]  [<ffffffff815d15a9>] ? schedule_preempt_disabled+0x9/0x10
      [12236.694257]  [<ffffffff815d133a>] ? schedule+0x2a/0x70
      [12236.694263]  [<ffffffff815d15a9>] ? schedule_preempt_disabled+0x9/0x10
      [12236.694271]  [<ffffffff815d2c3f>] ? __mutex_lock_slowpath+0x7f/0xf0
      [12236.694280]  [<ffffffff815d2cc6>] ? mutex_lock+0x16/0x30
      [12236.694291]  [<ffffffff814f1f90>] ? rtnetlink_rcv+0x10/0x30
      [12236.694299]  [<ffffffff8150ce3b>] ? netlink_unicast+0xfb/0x180
      [12236.694309]  [<ffffffff814f5ad3>] ? rtnl_getlink+0x113/0x190
      [12236.694319]  [<ffffffff814f202a>] ? rtnetlink_rcv_msg+0x7a/0x210
      [12236.694331]  [<ffffffff8124565c>] ? sock_has_perm+0x5c/0x70
      [12236.694339]  [<ffffffff814f1fb0>] ? rtnetlink_rcv+0x30/0x30
      [12236.694346]  [<ffffffff8150d62c>] ? netlink_rcv_skb+0x9c/0xc0
      [12236.694354]  [<ffffffff814f1f9f>] ? rtnetlink_rcv+0x1f/0x30
      [12236.694360]  [<ffffffff8150ce3b>] ? netlink_unicast+0xfb/0x180
      [12236.694367]  [<ffffffff8150d344>] ? netlink_sendmsg+0x484/0x5d0
      [12236.694376]  [<ffffffff810a236f>] ? __wake_up+0x2f/0x50
      [12236.694387]  [<ffffffff814cad23>] ? sock_sendmsg+0x33/0x40
      [12236.694396]  [<ffffffff814cb05e>] ? ___sys_sendmsg+0x22e/0x240
      [12236.694405]  [<ffffffff814cab75>] ? ___sys_recvmsg+0x135/0x1a0
      [12236.694415]  [<ffffffff811a9d12>] ? eventfd_write+0x82/0x210
      [12236.694423]  [<ffffffff811a0f9e>] ? fsnotify+0x32e/0x4c0
      [12236.694429]  [<ffffffff8108cb70>] ? wake_up_q+0x60/0x60
      [12236.694434]  [<ffffffff814cba09>] ? __sys_sendmsg+0x39/0x70
      [12236.694440]  [<ffffffff815d4797>] ? entry_SYSCALL_64_fastpath+0x12/0x6a
      
      So far it seems plausible that the recursive call into rtnetlink_rcv()
      is the suspicious part. One way this could trigger is if the sender's
      NETLINK_CB(skb).portid was wrongly 0 (which is the rtnetlink socket),
      so the rtnl_getlink() request's answer would be sent to the kernel
      instead of to the actual user process, thus grabbing rtnl_mutex twice.
      
      One theory would be that netlink_autobind(), triggered via netlink_sendmsg(),
      internally overwrites the -EBUSY error to 0, when that error wrongly
      originates from __netlink_insert() instead. That would reset the
      socket's portid to 0, which is then filled into NETLINK_CB(skb).portid
      later on. As commit d470e3b4 ("[NETLINK]: Fix two socket hashing bugs.")
      also puts it, -EBUSY should not be propagated from netlink_insert().
      
      It looks like it's very unlikely to reproduce. We need to trigger the
      rhashtable_insert_rehash() handler under a situation where rehashing
      currently occurs (one /rare/ way would be to hit ht->elasticity limits
      while not filled enough to expand the hashtable, but that would rather
      require a specifically crafted bind() sequence with knowledge about
      destination slots, seems unlikely). It probably makes sense to guard
      __netlink_insert() in any case and remap that error. It was suggested
      that EOVERFLOW might be better than an already overloaded ENOMEM.
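      
      A short sketch of that guard in netlink_insert() (error remapping as
      suggested above; surrounding cleanup omitted):
      
          err = __netlink_insert(table, sk);
          if (unlikely(err == -EBUSY))
                  err = -EOVERFLOW;       /* never leak -EBUSY to callers */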
      
      Reference: http://thread.gmane.org/gmane.linux.network/372676
      Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
      Acked-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>