1. 19 Jan 2018, 1 commit
  2. 16 Jan 2018, 1 commit
  3. 12 Dec 2017, 1 commit
    • K
      netlink: Add netns check on taps · 93c64764
      Authored by Kevin Cernekee
      Currently, a nlmon link inside a child namespace can observe systemwide
      netlink activity.  Filter the traffic so that nlmon can only sniff
      netlink messages from its own netns.
      
      Test case:
      
          vpnns -- bash -c "ip link add nlmon0 type nlmon; \
                            ip link set nlmon0 up; \
                            tcpdump -i nlmon0 -q -w /tmp/nlmon.pcap -U" &
          sudo ip xfrm state add src 10.1.1.1 dst 10.1.1.2 proto esp \
              spi 0x1 mode transport \
              auth sha1 0x6162633132330000000000000000000000000000 \
              enc aes 0x00000000000000000000000000000000
          grep --binary abc123 /tmp/nlmon.pcap
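
      The check itself is tiny; a condensed sketch of the patch in
      __netlink_deliver_tap_skb() (surrounding error handling elided):

          /* Skip taps whose device lives in a different netns than
           * the socket that generated this message.
           */
          if (!net_eq(dev_net(dev), sock_net(skb->sk)))
                  return 0;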
      Signed-off-by: Kevin Cernekee <cernekee@chromium.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      93c64764
  4. 14 Nov 2017, 1 commit
  5. 13 Nov 2017, 1 commit
    • J
      af_netlink: ensure that NLMSG_DONE never fails in dumps · 0642840b
      Authored by Jason A. Donenfeld
      The way people generally use netlink_dump is that they fill in the skb
      as much as possible, breaking when nla_put returns an error. Then, they
      get called again and start filling out the next skb, and again, and so
      forth. The mechanism at work here is the ability for the iterative
      dumping function to detect when the skb is filled up and not fill it
      past the brim, waiting for a fresh skb for the rest of the data.
      
      However, if the attributes are small and nicely packed, it is possible
      that a dump callback function successfully fills in attributes until the
      skb is of size 4080 (libmnl's default page-sized receive buffer size).
      The dump function completes, satisfied, and then, if it happens to be
      that this is actually the last skb, and no further ones are to be sent,
      then netlink_dump will add on the NLMSG_DONE part:
      
        nlh = nlmsg_put_answer(skb, cb, NLMSG_DONE, sizeof(len), NLM_F_MULTI);
      
      It is very important that netlink_dump does this, of course. However, in
      this example, that call to nlmsg_put_answer will fail, because the
      previous filling by the dump function did not leave it enough room. And
      how could it possibly have done so? All of the nla_put variety of
      functions simply check to see if the skb has enough tailroom,
      independent of the context it is in.
      
      In order to keep the important assumptions of all netlink dump users, it
      is therefore important to give them an skb that has this end part of the
      tail already reserved, so that the call to nlmsg_put_answer does not
      fail. Otherwise, library authors are forced to find some bizarrely sized
      receive buffer that has a large modulo relative to the common sizes of
      messages received, which is ugly and buggy.
      
      This patch thus saves the NLMSG_DONE for an additional message, for the
      case that things are dangerously close to the brim. This requires
      keeping track of the errno from ->dump() across calls.
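
      Condensed from the patch, the resulting netlink_dump() flow looks
      roughly like this (the send/cleanup path and the goto label are
      illustrative):

          if (nlk->dump_done_errno > 0)
                  nlk->dump_done_errno = cb->dump(skb, cb);

          /* Not done yet, or no room left for NLMSG_DONE: ship this skb
           * and emit NLMSG_DONE in the next, possibly otherwise empty, one.
           */
          if (nlk->dump_done_errno > 0 ||
              skb_tailroom(skb) < nlmsg_total_size(sizeof(nlk->dump_done_errno)))
                  goto send_skb;

          nlh = nlmsg_put_answer(skb, cb, NLMSG_DONE,
                                 sizeof(nlk->dump_done_errno), NLM_F_MULTI);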
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0642840b
  6. 01 Nov 2017, 1 commit
  7. 18 Oct 2017, 2 commits
    • J
      netlink: fix netlink_ack() extack race · 48044eb4
      Authored by Johannes Berg
      It seems that it's possible to toggle NETLINK_F_EXT_ACK
      through setsockopt() while another thread/CPU is building
      a message inside netlink_ack(), which could then trigger
      the WARN_ON()s I added: if the flag flips from off to on
      between allocating and filling the message, the skb could
      end up being too small.
      
      Avoid this whole situation by storing the value of this
      flag in a separate variable and using that throughout the
      function instead.
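
      In code, that amounts to a one-time snapshot at the top of
      netlink_ack(); a condensed sketch:

          /* Read the flag once; a concurrent setsockopt() can no longer
           * change the answer between sizing and filling the skb.
           */
          bool nlk_has_extack = nlk->flags & NETLINK_F_EXT_ACK;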
      
      Fixes: 2d4bc933 ("netlink: extended ACK reporting")
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      48044eb4
    • J
      netlink: use NETLINK_CB(in_skb).sk instead of looking it up · a2084f56
      Authored by Johannes Berg
      When netlink_ack() reports an allocation error to the sending
      socket, there's no need to look up the sending socket since
      it's available in the SKB's CB. Use that instead of going to
      the trouble of looking it up.
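
      A condensed before/after sketch of the change:

          /* before: look the sender up again by portid */
          sk = netlink_lookup(sock_net(in_skb->sk), in_skb->sk->sk_protocol,
                              NETLINK_CB(in_skb).portid);

          /* after: the sender's socket already travels with the skb */
          sk = NETLINK_CB(in_skb).sk;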
      
      Note that the pointer is only available since Eric Biederman's
      commit 3fbc2905 ("netlink: Make the sending netlink socket availabe in NETLINK_CB")
      which is far newer than the original lookup code (Oct 2003)
      (though the field was called 'ssk' in that commit and only got
      renamed to 'sk' later, I'd actually argue 'ssk' was better - or
      perhaps it should've been 'source_sk' - since there are so many
      different 'sk's involved.)
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a2084f56
  8. 10 Oct 2017, 1 commit
    • J
      netlink: do not set cb_running if dump's start() errs · 41c87425
      Authored by Jason A. Donenfeld
      It turns out that multiple places can call netlink_dump(), which means
      it's still possible to dereference partially initialized values in
      dump() left behind by a start() that returned an error.
      
      This fixes the issue by calling start() _before_ setting cb_running to
      true, so that there's no chance at all of hitting the dump() function
      through any indirect paths.
      
      It also moves the call to start() to be when the mutex is held. This has
      the nice side effect of serializing invocations to start(), which is
      likely desirable anyway. It also prevents any possible other races that
      might come out of this logic.
      
      In testing this with several different pieces of tricky code to trigger
      these issues, this commit fixes all avenues that I'm aware of.
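
      The resulting ordering in __netlink_dump_start(), sketched (locking
      and error paths condensed, label name illustrative):

          /* With nlk->cb_mutex held: run start() first, and only
           * commit cb_running once it has succeeded.
           */
          if (cb->start) {
                  ret = cb->start(cb);
                  if (ret)
                          goto error_put;   /* cb_running stays false */
          }
          nlk->cb_running = true;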
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Cc: Johannes Berg <johannes@sipsolutions.net>
      Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      41c87425
  9. 30 Sep 2017, 1 commit
    • J
      netlink: do not proceed if dump's start() errs · fef0035c
      Authored by Jason A. Donenfeld
      Drivers that use the start method for netlink dumping rely on dumpit not
      being called if start fails. For example, ila_xlat.c allocates memory
      and assigns it to cb->args[0] in its start() function. It might fail to
      do that and return -ENOMEM instead. However, even when returning an
      error, dumpit will be called, which, in the example above, quickly
      dereferences the memory in cb->args[0], which will OOPS the kernel. This
      is but one example of how this goes wrong.
      
      Since start() has always been a function with an int return type, it
      therefore makes sense to use it properly, rather than ignoring it. This
      patch thus returns early and does not call dumpit() when start() fails.
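
      A condensed sketch of the fixed call sequence in
      __netlink_dump_start():

          ret = 0;
          if (cb->start)
                  ret = cb->start(cb);

          if (!ret)
                  ret = netlink_dump(sk);   /* dumpit only runs on success */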
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Cc: Johannes Berg <johannes@sipsolutions.net>
      Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fef0035c
  10. 07 Sep 2017, 2 commits
    • X
      netlink: access nlk groups safely in netlink bind and getname · f7736080
      Authored by Xin Long
      Currently there is no lock protecting access to nlk ngroups/groups in
      netlink bind and getname. It is safe against the groups being set in
      netlink_release, but not against netlink_realloc_groups called by
      netlink_setsockopt.

      netlink_lock_table is needed in both netlink bind and getname when
      accessing nlk groups.
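
      For the getname path, the change is essentially (condensed from the
      patch):

          netlink_lock_table();   /* read side, blocks concurrent realloc */
          nladdr->nl_groups = nlk->groups ? nlk->groups[0] : 0;
          netlink_unlock_table();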
      Acked-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f7736080
    • X
      netlink: fix an use-after-free issue for nlk groups · be82485f
      Authored by Xin Long
      ChunYu found a netlink use-after-free issue by syzkaller:
      
      [28448.842981] BUG: KASAN: use-after-free in __nla_put+0x37/0x40 at addr ffff8807185e2378
      [28448.969918] Call Trace:
      [...]
      [28449.117207]  __nla_put+0x37/0x40
      [28449.132027]  nla_put+0xf5/0x130
      [28449.146261]  sk_diag_fill.isra.4.constprop.5+0x5a0/0x750 [netlink_diag]
      [28449.176608]  __netlink_diag_dump+0x25a/0x700 [netlink_diag]
      [28449.202215]  netlink_diag_dump+0x176/0x240 [netlink_diag]
      [28449.226834]  netlink_dump+0x488/0xbb0
      [28449.298014]  __netlink_dump_start+0x4e8/0x760
      [28449.317924]  netlink_diag_handler_dump+0x261/0x340 [netlink_diag]
      [28449.413414]  sock_diag_rcv_msg+0x207/0x390
      [28449.432409]  netlink_rcv_skb+0x149/0x380
      [28449.467647]  sock_diag_rcv+0x2d/0x40
      [28449.484362]  netlink_unicast+0x562/0x7b0
      [28449.564790]  netlink_sendmsg+0xaa8/0xe60
      [28449.661510]  sock_sendmsg+0xcf/0x110
      [28449.865631]  __sys_sendmsg+0xf3/0x240
      [28450.000964]  SyS_sendmsg+0x32/0x50
      [28450.016969]  do_syscall_64+0x25c/0x6c0
      [28450.154439]  entry_SYSCALL64_slow_path+0x25/0x25
      
      It was caused by the lack of protection between freeing nlk groups in
      netlink_release and accessing them in sk_diag_dump_groups. A similar
      issue also exists in netlink_seq_show().

      This patch defers the freeing of nlk groups to deferred_put_nlk_sk.
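
      Sketched, the free moves from netlink_release() into the RCU
      callback (condensed from the patch):

          static void deferred_put_nlk_sk(struct rcu_head *head)
          {
                  struct netlink_sock *nlk =
                          container_of(head, struct netlink_sock, rcu);

                  kfree(nlk->groups);     /* moved here from netlink_release() */
                  nlk->groups = NULL;
                  /* existing sock_put() path continues below */
          }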
      Reported-by: ChunYu Wang <chunwang@redhat.com>
      Acked-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      be82485f
  11. 01 Jul 2017, 3 commits
  12. 16 Jun 2017, 2 commits
    • J
      networking: make skb_put & friends return void pointers · 4df864c1
      Authored by Johannes Berg
      It seems like a historic accident that these return unsigned char *,
      and in many places that means casts are required, more often than not.
      
      Make these functions (skb_put, __skb_put and pskb_put) return void *
      and remove all the casts across the tree, adding a (u8 *) cast only
      where the unsigned char pointer was used directly, all done with the
      following spatch:
      
          @@
          expression SKB, LEN;
          typedef u8;
          identifier fn = { skb_put, __skb_put };
          @@
          - *(fn(SKB, LEN))
          + *(u8 *)fn(SKB, LEN)
      
          @@
          expression E, SKB, LEN;
          identifier fn = { skb_put, __skb_put };
          type T;
          @@
          - E = ((T *)(fn(SKB, LEN)))
          + E = fn(SKB, LEN)
      
      which actually doesn't cover pskb_put since there are only three
      users overall.
      
      A handful of stragglers were converted manually, notably a macro in
      drivers/isdn/i4l/isdn_bsdcomp.c and, oddly enough, one of the many
      instances in net/bluetooth/hci_sock.c. In the former file, I also
      had to fix one whitespace problem spatch introduced.
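
      For illustration, the effect at a typical call site (struct foo is
      hypothetical):

          /* before: cast needed, skb_put() returned unsigned char * */
          struct foo *f = (struct foo *)skb_put(skb, sizeof(*f));

          /* after: void * converts implicitly to any object pointer */
          struct foo *f = skb_put(skb, sizeof(*f));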
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4df864c1
    • J
      networking: introduce and use skb_put_data() · 59ae1d12
      Authored by Johannes Berg
      A common pattern with skb_put() is to just want to memcpy()
      some data into the new space; introduce skb_put_data() for
      this.
      
      An spatch similar to the one for skb_put_zero() converts many
      of the places using it:
      
          @@
          identifier p, p2;
          expression len, skb, data;
          type t, t2;
          @@
          (
          -p = skb_put(skb, len);
          +p = skb_put_data(skb, data, len);
          |
          -p = (t)skb_put(skb, len);
          +p = skb_put_data(skb, data, len);
          )
          (
          p2 = (t2)p;
          -memcpy(p2, data, len);
          |
          -memcpy(p, data, len);
          )
      
          @@
          type t, t2;
          identifier p, p2;
          expression skb, data;
          @@
          t *p;
          ...
          (
          -p = skb_put(skb, sizeof(t));
          +p = skb_put_data(skb, data, sizeof(t));
          |
          -p = (t *)skb_put(skb, sizeof(t));
          +p = skb_put_data(skb, data, sizeof(t));
          )
          (
          p2 = (t2)p;
          -memcpy(p2, data, sizeof(*p));
          |
          -memcpy(p, data, sizeof(*p));
          )
      
          @@
          expression skb, len, data;
          @@
          -memcpy(skb_put(skb, len), data, len);
          +skb_put_data(skb, data, len);
      
      (again, manually post-processed to retain some comments)
      Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      59ae1d12
  13. 01 Jun 2017, 1 commit
  14. 14 Apr 2017, 2 commits
  15. 05 Apr 2017, 1 commit
  16. 22 Mar 2017, 1 commit
  17. 28 Jan 2017, 1 commit
    • E
      net: adjust skb->truesize in pskb_expand_head() · 158f323b
      Authored by Eric Dumazet
      Slava Shwartsman reported a warning in skb_try_coalesce(), when we
      detect skb->truesize is completely wrong.
      
      In his case, the issue came from IPv6 reassembly coping with malicious
      datagrams that forced various pskb_may_pull() calls to reallocate a
      bigger skb->head than the one allocated by the NIC driver before
      entering the GRO layer.
      
      Current code does not change skb->truesize, leaving this burden to
      callers if they care enough.
      
      Blindly changing skb->truesize in pskb_expand_head() is not
      easy, as some producers might track skb->truesize, for example
      in the xmit path for back-pressure feedback (sk->sk_wmem_alloc).

      We can detect the cases where it should be safe to change
      skb->truesize:
      
      1) skb is not attached to a socket.
      2) If it is attached to a socket, its destructor is sock_edemux()
      
      My audit gave only two callers doing their own skb->truesize
      manipulation.
      
      I had to remove the skb parameter from the sock_edemux macro when
      CONFIG_INET is not set, to avoid a compile error.
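
      The resulting check in pskb_expand_head(), condensed from the patch
      (osize is the pre-expansion skb_end_offset()):

          /* Safe to adjust: nobody is accounting this skb's truesize,
           * or the destructor (sock_edemux) only holds a sk reference.
           */
          if (!skb->sk || skb->destructor == sock_edemux)
                  skb->truesize += size - osize;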
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Slava Shwartsman <slavash@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      158f323b
  18. 17 Jan 2017, 1 commit
  19. 25 Dec 2016, 1 commit
  20. 11 Dec 2016, 1 commit
  21. 06 Dec 2016, 1 commit
  22. 30 Nov 2016, 1 commit
  23. 07 Oct 2016, 1 commit
    • E
      netlink: do not enter direct reclaim from netlink_dump() · d35c99ff
      Authored by Eric Dumazet
      Since linux-3.15, netlink_dump() can use up to 16384-byte skb
      allocations.
      
      Due to the ~320-byte overhead of struct skb_shared_info, we end up
      using order-3 (on x86) page allocations, which might trigger direct
      reclaim and add stress.

      The intent was really to attempt a large allocation but immediately
      fall back to a smaller one (order-1 on x86) in case of memory stress.
      
      On recent kernels (linux-4.4), we can remove __GFP_DIRECT_RECLAIM to
      meet the goal. Old kernels would need to remove __GFP_WAIT.
      
      While we are at it, since we do an order-3 allocation, allow use of
      all the allocated bytes instead of 16384 to reduce syscalls during
      large dumps.
      
      iproute2 already uses 32KB recvmsg() buffer sizes.
      
      Alexei provided an initial patch downsizing to SKB_WITH_OVERHEAD(16384)
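
      Condensed, the allocation in netlink_dump() now looks roughly like
      this:

          if (alloc_min_size < nlk->max_recvmsg_len) {
                  /* Big opportunistic attempt: may fail fast, must not
                   * stall in direct reclaim.
                   */
                  alloc_size = nlk->max_recvmsg_len;
                  skb = alloc_skb(alloc_size,
                                  (GFP_KERNEL & ~__GFP_DIRECT_RECLAIM) |
                                  __GFP_NOWARN | __GFP_NORETRY);
          }
          if (!skb) {
                  /* Small fallback with the normal blocking policy. */
                  alloc_size = alloc_min_size;
                  skb = alloc_skb(alloc_size, GFP_KERNEL);
          }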
      
      Fixes: 9063e21f ("netlink: autosize skb lengthes")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Alexei Starovoitov <ast@kernel.org>
      Cc: Greg Thelen <gthelen@google.com>
      Reviewed-by: Greg Rose <grose@lightfleet.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d35c99ff
  24. 17 May 2016, 1 commit
  25. 11 Apr 2016, 1 commit
    • D
      netlink: don't send NETLINK_URELEASE for unbound sockets · e2726020
      Authored by Dmitry Ivanov
      All existing users of NETLINK_URELEASE use it to clean up resources that
      were previously allocated to a socket via some command. As a result, no
      users require getting this notification for unbound sockets.
      
      Sending it for unbound sockets, however, is a problem because any user
      (including unprivileged users) can create a socket that uses the same ID
      as an existing socket. Binding this new socket will fail, but if the
      NETLINK_URELEASE notification is generated for such sockets, the users
      thereof will be tricked into thinking the socket that they allocated the
      resources for is closed.
      
      In the nl80211 case, this will cause destruction of virtual interfaces
      that still belong to an existing hostapd process; this is the case that
      Dmitry noticed. In the NFC case, it will cause a poll abort. In the case
      of netlink log/queue it will cause them to stop reporting events, as if
      NFULNL_CFG_CMD_UNBIND/NFQNL_CFG_CMD_UNBIND had been called.
      
      Fix this problem by checking that the socket is bound before generating
      the NETLINK_URELEASE notification.
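
      Condensed from the patch, netlink_release() now gates the notifier
      on the bound flag:

          if (nlk->portid && nlk->bound) {    /* unbound sockets: no event */
                  struct netlink_notify n = {
                          .net = sock_net(sk),
                          .protocol = sk->sk_protocol,
                          .portid = nlk->portid,
                  };
                  atomic_notifier_call_chain(&netlink_chain,
                                             NETLINK_URELEASE, &n);
          }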
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Dmitry Ivanov <dima@ubnt.com>
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e2726020
  26. 05 Apr 2016, 1 commit
  27. 23 Mar 2016, 1 commit
  28. 19 Feb 2016, 2 commits
    • F
      nfnetlink: Revert "nfnetlink: add support for memory mapped netlink" · c5b0db32
      Authored by Florian Westphal
      Reverts commit 3ab1f683 ("nfnetlink: add support for memory mapped
      netlink").
      
      Like previous commits in the series, remove wrappers that are not needed
      after mmapped netlink removal.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c5b0db32
    • F
      netlink: remove mmapped netlink support · d1b4c689
      Authored by Florian Westphal
      mmapped netlink has a number of unresolved issues:
      
      - TX zerocopy support had to be disabled more than a year ago via
        commit 4682a035 ("netlink: Always copy on mmap TX.")
        because the content of the mmapped area can change after netlink
        attribute validation but before message processing.
      
      - RX support was implemented mainly to speed up nfqueue dumping packet
        payload to userspace.  However, since commit ae08ce00
        ("netfilter: nfnetlink_queue: zero copy support") we avoid one copy
        with the socket-based interface too (via the skb_zerocopy helper).
      
      The other problem is that skbs attached to mmapped netlink sockets
      behave differently from normal skbs:
      
      - they don't have a shinfo area, so all functions that use skb_shinfo()
      (e.g. skb_clone) cannot be used.
      
      - reserving headroom prevents userspace from seeing the content as
      it expects the message to start at skb->head.
      See for instance
      commit aa3a0220 ("netlink: not trim skb for mmaped socket when dump").
      
      - skbs handed e.g. to netlink_ack must have a non-NULL skb->sk, else we
      crash because it needs the sk to check if a tx ring is attached.
      This is also not obvious and leads to non-intuitive bug fixes, such as
      7c7bdf35 ("netfilter: nfnetlink: use original skbuff when acking batches").
      
      mmapped netlink also didn't play nicely with the skb_zerocopy helper
      used by nfqueue and openvswitch.  Daniel Borkmann fixed this via
      commit 6bb0fef4 ("netlink, mmap: fix edge-case leakages in nf queue
      zero-copy"), but at the cost of also needing to provide the remaining
      length to the allocation function.
      
      nfqueue also has problems when used with mmapped rx netlink:
      - mmapped netlink doesn't allow use of nfqueue batch verdict messages.
        The problem is that in the mmap case, the allocation time also determines
        the ordering in which the frame will be seen by userspace (A
        allocating before B means that A is located in an earlier ring slot,
        but B might still get a lower sequence number than A, since the seqno
        is decided later).  To fix this we would need to extend the
        spinlocked region to also cover the allocation and message setup, which
        isn't desirable.
      - nfqueue can now be configured to queue large (GSO) skbs to userspace.
        Queueing GSO packets is faster than forcing a software segmentation
        in the kernel, so this is a desirable option.  However, with an mmap-based
        ring one has to use 64KB per ring slot element, else mmap has to fall back
        to the socket path (NL_MMAP_STATUS_COPY) for all large packets.
      
      To use the mmap interface, userspace not only has to probe for mmap netlink
      support, it also has to implement a recv/socket receive path in order to
      handle messages that exceed the size of an rx ring element.
      
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Ken-ichirou MATSUZAWA <chamaken@gmail.com>
      Cc: Pablo Neira Ayuso <pablo@netfilter.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Cc: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d1b4c689
  29. 30 Jan 2016, 1 commit
  30. 16 Dec 2015, 1 commit
  31. 07 Nov 2015, 1 commit
    • M
      mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd · d0164adc
      Authored by Mel Gorman
      __GFP_WAIT has been used to identify atomic context in callers that hold
      spinlocks or are in interrupts.  They are expected to be high priority and
      have access to one of two watermarks lower than "min", which can be referred
      to as the "atomic reserve".  __GFP_HIGH users get access to the first
      lower watermark and can be called the "high priority reserve".
      
      Over time, callers had a requirement to not block when fallback options
      were available.  Some have abused __GFP_WAIT, leading to a situation where
      an optimistic allocation with a fallback option can access atomic
      reserves.
      
      This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
      cannot sleep and have no alternative.  High priority users continue to use
      __GFP_HIGH.  __GFP_DIRECT_RECLAIM identifies callers that can sleep and
      are willing to enter direct reclaim.  __GFP_KSWAPD_RECLAIM identifies
      callers that want to wake kswapd for background reclaim.  __GFP_WAIT is
      redefined as a caller that is willing to enter direct reclaim and wake
      kswapd for background reclaim.
      
      This patch then converts a number of sites:
      
      o __GFP_ATOMIC is used by callers that are high priority and have memory
        pools for those requests. GFP_ATOMIC uses this flag.
      
      o Callers that have a limited mempool to guarantee forward progress clear
        __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
        into this category where kswapd will still be woken but atomic reserves
        are not used as there is a one-entry mempool to guarantee progress.
      
      o Callers that are checking if they are non-blocking should use the
        helper gfpflags_allow_blocking() where possible. This is because
        checking for __GFP_WAIT as was done historically now can trigger false
        positives. Some exceptions like dm-crypt.c exist where the code intent
        is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
        flag manipulations.
      
      o Callers that built their own GFP flags instead of starting with GFP_KERNEL
        and friends now also need to specify __GFP_KSWAPD_RECLAIM.
      
      The first key hazard to watch out for is callers that removed __GFP_WAIT
      and were depending on access to atomic reserves for inconspicuous reasons.
      In some cases it may be appropriate for them to use __GFP_HIGH.
      
      The second key hazard is callers that assembled their own combination of
      GFP flags instead of starting with something like GFP_KERNEL.  They may
      now wish to specify __GFP_KSWAPD_RECLAIM.  It's almost certainly harmless
      if it's missed in most cases as other activity will wake kswapd.
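
      The helper reduces to a single flag test; callers ask a question
      instead of poking at flag bits (the cond_resched() caller below is
      illustrative):

          static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
          {
                  return !!(gfp_flags & __GFP_DIRECT_RECLAIM);
          }

          /* illustrative caller */
          if (gfpflags_allow_blocking(gfp_mask))
                  cond_resched();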
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d0164adc
  32. 22 Oct 2015, 1 commit
    • D
      netlink: fix locking around NETLINK_LIST_MEMBERSHIPS · 47191d65
      Authored by David Herrmann
      Currently, NETLINK_LIST_MEMBERSHIPS grabs the netlink table while copying
      the membership state to user-space. However, grabbing the netlink table is
      effectively a write_lock_irq(), and as such we should not be triggering
      page-faults in the critical section.
      
      This can be easily reproduced by the following snippet:
          int s = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
          /* fresh anonymous mapping: not faulted in yet */
          void *p = mmap(0, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANON, -1, 0);
          /* 0x10e is SOL_NETLINK, 9 is NETLINK_LIST_MEMBERSHIPS; writing the
           * membership bitmap to p faults the page in, which the old
           * write-locked section could not tolerate
           */
          int r = getsockopt(s, 0x10e, 9, p, (void*)((char*)p + 4092));
      
      This should work just fine, but currently triggers EFAULT and a possible
      WARN_ON below handle_mm_fault().
      
      Fix this by reducing locking of NETLINK_LIST_MEMBERSHIPS to a read-side
      lock. The write-lock was overkill in the first place, and the read-lock
      allows page-faults just fine.
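
      A condensed sketch of the getsockopt() copy loop after the fix (the
      word-offset bookkeeping is simplified from the real code):

          netlink_lock_table();             /* was netlink_table_grab() */
          for (pos = 0; pos * 8 < nlk->ngroups; pos += sizeof(u32)) {
                  idx = pos / sizeof(unsigned long);
                  shift = (pos % sizeof(unsigned long)) * 8;
                  if (put_user((u32)(nlk->groups[idx] >> shift),
                               (u32 __user *)(optval + pos))) {
                          err = -EFAULT;    /* faulting here is now fine */
                          break;
                  }
          }
          netlink_unlock_table();           /* was netlink_table_ungrab() */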
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      47191d65
  33. 19 Oct 2015, 1 commit
    • A
      netlink: Trim skb to alloc size to avoid MSG_TRUNC · db65a3aa
      Authored by Ronen Arad
      netlink_dump() allocates an skb based on the calculated min_dump_alloc or
      a per-socket max_recvmsg_len.
      min_alloc_size is the maximum space required for any single netdev's
      attributes, as calculated by rtnl_calcit().
      max_recvmsg_len tracks the user-provided buffer passed to netlink_recvmsg.
      It is capped at 16KiB.
      The intention is to avoid small allocations and to minimize the number
      of calls required to obtain dump information for all net devices.
      
      netlink_dump packs as many small messages as can fit within an skb
      that was sized for the largest single netdev's information. The actual
      space available within the skb is larger than what was requested. It can
      be much larger, up to nearly 2x, with the align-to-next-power-of-2 approach.

      Allowing netlink_dump to use all the space available within the
      allocated skb increases the buffer size a user has to provide to avoid
      truncation (i.e. the MSG_TRUNC flag being set).
      
      It was observed that with many VLANs configured on at least one netdev,
      a larger buffer of nearly 64KiB was necessary to avoid the "Message
      truncated" error in "ip link" or "bridge [-c[ompressvlans]] vlan show"
      when min_alloc_size was only a little over 32KiB.
      
      This patch trims the skb to the allocated size in order to allow the user
      to avoid truncation with a more reasonable buffer size.
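
      The trim itself is a single line after the allocation (condensed
      from the patch):

          /* Hand the dump callbacks exactly the requested space, even if
           * the allocator rounded the skb up to the next power of two.
           */
          skb_reserve(skb, skb_tailroom(skb) - alloc_size);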
      Signed-off-by: Ronen Arad <ronen.arad@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      db65a3aa