1. 31 Jan, 2012 1 commit
  2. 24 Dec, 2011 1 commit
    • netlink: Undo const marker in netlink_is_kernel(). · 035c4c16
      David S. Miller committed
      We can't do this without propagating the const to nlk_sk()
      too, otherwise:
      
      net/netlink/af_netlink.c: In function ‘netlink_is_kernel’:
      net/netlink/af_netlink.c:103:2: warning: passing argument 1 of ‘nlk_sk’ discards ‘const’ qualifier from pointer target type [enabled by default]
      net/netlink/af_netlink.c:96:36: note: expected ‘struct sock *’ but argument is of type ‘const struct sock *’
      Signed-off-by: David S. Miller <davem@davemloft.net>
      035c4c16
  3. 23 Dec, 2011 2 commits
  4. 29 Sep, 2011 1 commit
    • af_unix: dont send SCM_CREDENTIALS by default · 16e57262
      Eric Dumazet committed
      Since commit 7361c36c (af_unix: Allow credentials to work across
      user and pid namespaces) af_unix performance dropped a lot.
      
      This is because we now take a reference on pid and cred in each write(),
      and release them in read(), usually done from another process,
      eventually from another cpu. This triggers false sharing.
      
      # Events: 154K cycles
      #
      # Overhead  Command       Shared Object        Symbol
      # ........  .......  ..................  .........................
      #
          10.40%  hackbench  [kernel.kallsyms]   [k] put_pid
           8.60%  hackbench  [kernel.kallsyms]   [k] unix_stream_recvmsg
           7.87%  hackbench  [kernel.kallsyms]   [k] unix_stream_sendmsg
           6.11%  hackbench  [kernel.kallsyms]   [k] do_raw_spin_lock
           4.95%  hackbench  [kernel.kallsyms]   [k] unix_scm_to_skb
           4.87%  hackbench  [kernel.kallsyms]   [k] pid_nr_ns
           4.34%  hackbench  [kernel.kallsyms]   [k] cred_to_ucred
           2.39%  hackbench  [kernel.kallsyms]   [k] unix_destruct_scm
           2.24%  hackbench  [kernel.kallsyms]   [k] sub_preempt_count
           1.75%  hackbench  [kernel.kallsyms]   [k] fget_light
           1.51%  hackbench  [kernel.kallsyms]   [k] __mutex_lock_interruptible_slowpath
           1.42%  hackbench  [kernel.kallsyms]   [k] sock_alloc_send_pskb
      
      This patch includes SCM_CREDENTIALS information in an af_unix message/skb
      only if the sender requests it (see man 7 unix for details on how to
      include ancillary data using the sendmsg() system call).
      
      Note: This might break buggy applications that expect SCM_CREDENTIALS
      from a plain write() system call while the receiver does not use the
      SO_PASSCRED socket option.
      
      If SOCK_PASSCRED is set on source or destination socket, we still
      include credentials for mere write() syscalls.
      
      Performance boost in hackbench : more than 50% gain on a 16 thread
      machine (2 quad-core cpus, 2 threads per core)
      
      hackbench 20 thread 2000
      
      4.228 sec instead of 9.102 sec
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Acked-by: Tim Chen <tim.c.chen@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      16e57262
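
For readers unfamiliar with the sender-side request mentioned in the af_unix commit above, here is a minimal user-space sketch of attaching SCM_CREDENTIALS explicitly with sendmsg() and reading it back with recvmsg(). The helper names `send_with_creds` and `recv_creds` are hypothetical and not part of the patch; the sketch assumes Linux with glibc.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* exposes struct ucred in glibc */
#endif
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one struct ucred. */
union cred_control {
    char buf[CMSG_SPACE(sizeof(struct ucred))];
    struct cmsghdr align;
};

/* Hypothetical helper: send buf plus an explicit SCM_CREDENTIALS message. */
ssize_t send_with_creds(int fd, const void *buf, size_t len)
{
    struct ucred creds = { .pid = getpid(), .uid = getuid(), .gid = getgid() };
    union cred_control control;
    struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&msg, 0, sizeof(msg));
    memset(&control, 0, sizeof(control));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = control.buf;
    msg.msg_controllen = sizeof(control.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_CREDENTIALS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(struct ucred));
    memcpy(CMSG_DATA(cmsg), &creds, sizeof(creds));

    return sendmsg(fd, &msg, 0);
}

/* Hypothetical helper: receive one message.
 * Returns 1 and fills *out if credentials were attached,
 * 0 if none were attached, -1 if recvmsg() failed. */
int recv_creds(int fd, struct ucred *out)
{
    char data[256];
    union cred_control control;
    struct iovec iov = { .iov_base = data, .iov_len = sizeof(data) };
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = control.buf;
    msg.msg_controllen = sizeof(control.buf);

    if (recvmsg(fd, &msg, 0) < 0)
        return -1;

    for (cmsg = CMSG_FIRSTHDR(&msg); cmsg != NULL; cmsg = CMSG_NXTHDR(&msg, cmsg)) {
        if (cmsg->cmsg_level == SOL_SOCKET && cmsg->cmsg_type == SCM_CREDENTIALS) {
            memcpy(out, CMSG_DATA(cmsg), sizeof(*out));
            return 1;
        }
    }
    return 0;
}
```

The receiver still has to enable SO_PASSCRED with setsockopt() before credentials are delivered to it, for example on one end of a socketpair(AF_UNIX, SOCK_DGRAM, 0, sv).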
  5. 12 Aug, 2011 1 commit
  6. 23 Jun, 2011 1 commit
    • netlink: advertise incomplete dumps · 670dc283
      Johannes Berg committed
      Consider the following situation:
       * a dump that would show 8 entries, four in the first
         round, and four in the second
       * between the first and second rounds, 6 entries are
         removed
       * now the second round will not show any entry, and
         even if there is a sequence/generation counter the
         application will not know
      
      To solve this problem, add a new flag NLM_F_DUMP_INTR
      to the netlink header that indicates the dump wasn't
      consistent. This flag can also be set on the MSG_DONE
      message that terminates the dump, so the situation
      above can be detected.
      
      To achieve this, add a sequence counter to the netlink
      callback struct. Of course, netlink code still needs
      to use this new functionality. The correct way to do
      that is to always set cb->seq when a dumpit callback
      is invoked and call nl_dump_check_consistent() for
      each new message. The core code will also call this
      function for the final MSG_DONE message.
      
      To make it usable with generic netlink, a new function
      genlmsg_nlhdr() is needed to obtain the netlink header
      from the genetlink user header.
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: John W. Linville <linville@tuxdriver.com>
      670dc283
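
On the user-space side, detecting an inconsistent dump as described in the commit above could look roughly like the following sketch. The function name `dump_was_interrupted` is hypothetical; it assumes a buffer filled by recv() on a netlink socket and uses only the NLMSG_OK/NLMSG_NEXT macros and the NLM_F_DUMP_INTR flag from <linux/netlink.h>.

```c
#include <linux/netlink.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: walk every netlink message in buf and report
 * whether any of them (including the final NLMSG_DONE) carries
 * NLM_F_DUMP_INTR, meaning the dump was not consistent. */
bool dump_was_interrupted(void *buf, size_t len)
{
    struct nlmsghdr *nlh = buf;
    int remaining = (int)len;

    for (; NLMSG_OK(nlh, remaining); nlh = NLMSG_NEXT(nlh, remaining)) {
        if (nlh->nlmsg_flags & NLM_F_DUMP_INTR)
            return true;
    }
    return false;
}
```

An application that sees the flag would typically discard what it has collected and restart the dump.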
  7. 17 Jun, 2011 1 commit
  8. 10 Jun, 2011 1 commit
    • rtnetlink: Compute and store minimum ifinfo dump size · c7ac8679
      Greg Rose committed
      The message size allocated for rtnl ifinfo dumps was limited to
      a single page.  This is not enough for additional interface info
      available with devices that support SR-IOV and caused a bug in
      which VF info would not be displayed if more than approximately
      40 VFs were created per interface.
      
      Implement a new function pointer for the rtnl_register service that will
      calculate the amount of data required for the ifinfo dump and allocate
      enough data to satisfy the request.
      Signed-off-by: Greg Rose <gregory.v.rose@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      c7ac8679
  9. 24 May, 2011 1 commit
    • net: convert %p usage to %pK · 71338aa7
      Dan Rosenberg committed
      The %pK format specifier is designed to hide exposed kernel pointers,
      specifically via /proc interfaces.  Exposing these pointers provides an
      easy target for kernel write vulnerabilities, since they reveal the
      locations of writable structures containing easily triggerable function
      pointers.  The behavior of %pK depends on the kptr_restrict sysctl.
      
      If kptr_restrict is set to 0, no deviation from the standard %p behavior
      occurs.  If kptr_restrict is set to 1, the default, if the current user
      (intended to be a reader via seq_printf(), etc.) does not have CAP_SYSLOG
      (currently in the LSM tree), kernel pointers using %pK are printed as 0's.
      If kptr_restrict is set to 2, kernel pointers using %pK are printed as
      0's regardless of privileges.  Replacing with 0's was chosen over the
      default "(null)", which cannot be parsed by userland %p, which expects
      "(nil)".
      
      The supporting code for kptr_restrict and %pK are currently in the -mm
      tree.  This patch converts users of %p in net/ to %pK.  Cases of printing
      pointers to the syslog are not covered, since this would eliminate useful
      information for postmortem debugging and the reading of the syslog is
      already optionally protected by the dmesg_restrict sysctl.
      Signed-off-by: Dan Rosenberg <drosenberg@vsecurity.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Thomas Graf <tgraf@infradead.org>
      Cc: Eugene Teo <eugeneteo@kernel.org>
      Cc: Kees Cook <kees.cook@canonical.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Eric Paris <eparis@parisplace.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      71338aa7
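
The three kptr_restrict settings described in the %pK commit above reduce to a small decision table. The sketch below is a user-space model of that description only, not the kernel's vsprintf code; the function name `pK_shows_pointer` and its parameters are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical model of the policy in the commit message:
 *   0 -> %pK behaves like %p
 *   1 -> real value only for readers holding CAP_SYSLOG
 *   2 -> always printed as 0's
 * Returns true when the real pointer value would be printed. */
bool pK_shows_pointer(int kptr_restrict, bool has_cap_syslog)
{
    switch (kptr_restrict) {
    case 0:
        return true;
    case 1:
        return has_cap_syslog;
    default:
        return false;
    }
}
```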
  10. 08 May, 2011 1 commit
  11. 04 Mar, 2011 2 commits
  12. 01 Mar, 2011 1 commit
    • netlink: handle errors from netlink_dump() · b44d211e
      Andrey Vagin committed
      netlink_dump() may fail, but nobody handles its error.
      It generates output data when a previous portion has been returned to
      user space. This mechanism is used when the data does not all fit in
      one skb. If we enter netlink_recvmsg() and there is no skb in the
      receive queue, netlink_dump() will not be executed. So if netlink_dump()
      fails once, new data never appears and the reader sleeps forever.
      
      netlink_dump() is called from two places:
      
      1. from netlink_sendmsg->...->netlink_dump_start().
         In this place we can report error directly and it will be returned
         by sendmsg().
      
      2. from netlink_recvmsg
         There we can't report the error directly, because we have a portion of
         valid output data and call netlink_dump() to prepare the next portion.
         If netlink_dump() fails, the socket is marked with the error and the
         next recvmsg will fail.
      Signed-off-by: Andrey Vagin <avagin@openvz.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b44d211e
  13. 25 Oct, 2010 1 commit
    • netlink: fix netlink_change_ngroups() · 5c398dc8
      Eric Dumazet committed
      commit 6c04bb18 (netlink: use call_rcu for netlink_change_ngroups)
      used a somewhat convoluted and racy way to perform call_rcu().
      
      The old block of memory is freed after a grace period, but the rcu_head
      used to track it is located in new block.
      
      This can clash if netlink_change_ngroups() is called two or more times
      and one block is freed before another: call_rcu() invoked on different
      cpus makes no guarantee about the order of callbacks.
      
      Fix this using a more standard approach: each block of memory contains
      its own rcu_head, so that no 'use after free' can happen.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      CC: Johannes Berg <johannes@sipsolutions.net>
      CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5c398dc8
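
The "each block carries its own rcu_head" layout from the commit above can be illustrated with a user-space model. This is a sketch, not kernel code: `struct rcu_head_model`, `struct listeners_model`, `listeners_alloc` and `run_deferred` are hypothetical stand-ins, and `run_deferred` simply invokes the callback where the kernel would wait for a grace period.

```c
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for the kernel's struct rcu_head (assumption for this model). */
struct rcu_head_model {
    void (*func)(struct rcu_head_model *head);
};

/* Each block embeds the head that tracks *this* block, so the callback
 * never has to touch any other allocation. */
struct listeners_model {
    struct rcu_head_model rcu;
    unsigned long masks[];
};

int freed_blocks; /* counts completed callbacks, for demonstration only */

static void listeners_free_cb(struct rcu_head_model *head)
{
    /* Recover the enclosing block from its embedded head (container_of). */
    struct listeners_model *l = (struct listeners_model *)
        ((char *)head - offsetof(struct listeners_model, rcu));

    free(l);
    freed_blocks++;
}

struct listeners_model *listeners_alloc(size_t nwords)
{
    struct listeners_model *l =
        calloc(1, sizeof(*l) + nwords * sizeof(unsigned long));

    if (l)
        l->rcu.func = listeners_free_cb;
    return l;
}

/* Models a deferred callback firing; callers may run these in any order. */
void run_deferred(struct rcu_head_model *head)
{
    head->func(head);
}
```

Because the head lives inside the block it frees, retiring two blocks and running their callbacks in either order is safe. With the head stored in the successor block, freeing the successor first would leave the older callback pointing at freed memory.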
  14. 01 Sep, 2010 1 commit
  15. 19 Aug, 2010 1 commit
    • netlink: fix compat recvmsg · 68d6ac6d
      Johannes Berg committed
      Since
      commit 1dacc76d
      Author: Johannes Berg <johannes@sipsolutions.net>
      Date:   Wed Jul 1 11:26:02 2009 +0000
      
          net/compat/wext: send different messages to compat tasks
      
      we had a race condition when setting and then
      restoring frag_list. Eric attempted to fix it,
      but the fix created even worse problems.
      
      However, the original motivation I had when I
      added the code that turned out to be racy is
      no longer clear to me, since we only copy up
      to skb->len to userspace, which doesn't include
      the frag_list length. As a result, not doing
      any frag_list clearing and restoring avoids
      the race condition, while not introducing any
      other problems.
      
      Additionally, while preparing this patch I found
      that since none of the remaining netlink code is
      really aware of the frag_list, we need to use the
      original skb's information for packet information
      and credentials. This fixes, for example, the
      group information received by compat tasks.
      
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: stable@kernel.org [2.6.31+, for 2.6.35 revert 1235f504]
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      68d6ac6d
  16. 16 Aug, 2010 1 commit
  17. 27 Jul, 2010 1 commit
  18. 21 Jul, 2010 1 commit
    • drop_monitor: convert some kfree_skb call sites to consume_skb · 70d4bf6d
      Neil Horman committed
      Convert a few calls from kfree_skb to consume_skb
      
      Noticed while I was working on dropwatch that I was detecting lots of internal
      skb drops in several places.  While some are legitimate, several were not,
      freeing skbs that were at the end of their life, rather than being discarded due
      to an error.  This patch converts those call sites from using kfree_skb to
      consume_skb, which stops the in-kernel drop_monitor code from detecting them as
      drops.  Tested successfully by myself.
      Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      70d4bf6d
  19. 17 Jun, 2010 1 commit
  20. 22 May, 2010 1 commit
  21. 02 Apr, 2010 1 commit
  22. 27 Mar, 2010 1 commit
  23. 21 Mar, 2010 1 commit
  24. 28 Feb, 2010 1 commit
  25. 04 Feb, 2010 1 commit
    • netlink: fix for too early rmmod · 974c37e9
      Alexey Dobriyan committed
      Netlink code does module autoload if the protocol userspace is asking for
      is not ready. However, the module can disappear right after it was autoloaded.
      Example: modprobe/rmmod stress-testing and xfrm_user.ko providing NETLINK_XFRM.
      
      netlink_create() in such a situation _will_ create the userspace socket and
      _will_not_ pin the module. Now if the module was removed, we are going to
      call ->netlink_rcv into nothing:
      
      BUG: unable to handle kernel paging request at ffffffffa02f842a
      					       ^^^^^^^^^^^^^^^^
      	modules are loaded near these addresses here
      
      IP: [<ffffffffa02f842a>] 0xffffffffa02f842a
      PGD 161f067 PUD 1623063 PMD baa12067 PTE 0
      Oops: 0010 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      last sysfs file: /sys/devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/uevent
      CPU 1
      Pid: 11515, comm: ip Not tainted 2.6.33-rc5-netns-00594-gaaa5728-dirty #6 P5E/P5E
      RIP: 0010:[<ffffffffa02f842a>]  [<ffffffffa02f842a>] 0xffffffffa02f842a
      RSP: 0018:ffff8800baa3db48  EFLAGS: 00010292
      RAX: ffff8800baa3dfd8 RBX: ffff8800be353640 RCX: 0000000000000000
      RDX: ffffffff81959380 RSI: ffff8800bab7f130 RDI: 0000000000000001
      RBP: ffff8800baa3db58 R08: 0000000000000001 R09: 0000000000000000
      R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000011
      R13: ffff8800be353640 R14: ffff8800bcdec240 R15: ffff8800bd488010
      FS:  00007f93749656f0(0000) GS:ffff880002300000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      CR2: ffffffffa02f842a CR3: 00000000ba82b000 CR4: 00000000000006e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Process ip (pid: 11515, threadinfo ffff8800baa3c000, task ffff8800bab7eb30)
      Stack:
       ffffffff813637c0 ffff8800bd488000 ffff8800baa3dba8 ffffffff8136397d
      <0> 0000000000000000 ffffffff81344adc 7fffffffffffffff 0000000000000000
      <0> ffff8800baa3ded8 ffff8800be353640 ffff8800bcdec240 0000000000000000
      Call Trace:
       [<ffffffff813637c0>] ? netlink_unicast+0x100/0x2d0
       [<ffffffff8136397d>] netlink_unicast+0x2bd/0x2d0
      
      	netlink_unicast_kernel:
      		nlk->netlink_rcv(skb);
      
       [<ffffffff81344adc>] ? memcpy_fromiovec+0x6c/0x90
       [<ffffffff81364263>] netlink_sendmsg+0x1d3/0x2d0
       [<ffffffff8133975b>] sock_sendmsg+0xbb/0xf0
       [<ffffffff8106cdeb>] ? __lock_acquire+0x27b/0xa60
       [<ffffffff810a18c3>] ? might_fault+0x73/0xd0
       [<ffffffff810a18c3>] ? might_fault+0x73/0xd0
       [<ffffffff8106db22>] ? __lock_release+0x82/0x170
       [<ffffffff810a190e>] ? might_fault+0xbe/0xd0
       [<ffffffff810a18c3>] ? might_fault+0x73/0xd0
       [<ffffffff81344c77>] ? verify_iovec+0x47/0xd0
       [<ffffffff8133a509>] sys_sendmsg+0x1a9/0x360
       [<ffffffff813c2be5>] ? _raw_spin_unlock_irqrestore+0x65/0x70
       [<ffffffff8106aced>] ? trace_hardirqs_on+0xd/0x10
       [<ffffffff813c2bc2>] ? _raw_spin_unlock_irqrestore+0x42/0x70
       [<ffffffff81197004>] ? __up_read+0x84/0xb0
       [<ffffffff8106ac95>] ? trace_hardirqs_on_caller+0x145/0x190
       [<ffffffff813c207f>] ? trace_hardirqs_on_thunk+0x3a/0x3f
       [<ffffffff8100262b>] system_call_fastpath+0x16/0x1b
      Code:  Bad RIP value.
      RIP  [<ffffffffa02f842a>] 0xffffffffa02f842a
       RSP <ffff8800baa3db48>
      CR2: ffffffffa02f842a
      
      If module was quickly removed after autoloading, return -E.
      
      Return -EPROTONOSUPPORT if module was quickly removed after autoloading.
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      974c37e9
  26. 26 Nov, 2009 1 commit
  27. 17 Nov, 2009 1 commit
  28. 11 Nov, 2009 1 commit
  29. 06 Nov, 2009 1 commit
  30. 07 Oct, 2009 1 commit
  31. 01 Oct, 2009 1 commit
  32. 27 Sep, 2009 1 commit
  33. 25 Sep, 2009 1 commit
    • genetlink: fix netns vs. netlink table locking (2) · b8273570
      Johannes Berg committed
      Similar to commit d136f1bd,
      there's a bug when unregistering a generic netlink family,
      which is caught by the might_sleep() added in that commit:
      
          BUG: sleeping function called from invalid context at net/netlink/af_netlink.c:183
          in_atomic(): 1, irqs_disabled(): 0, pid: 1510, name: rmmod
          2 locks held by rmmod/1510:
           #0:  (genl_mutex){+.+.+.}, at: [<ffffffff8138283b>] genl_unregister_family+0x2b/0x130
           #1:  (rcu_read_lock){.+.+..}, at: [<ffffffff8138270c>] __genl_unregister_mc_group+0x1c/0x120
          Pid: 1510, comm: rmmod Not tainted 2.6.31-wl #444
          Call Trace:
           [<ffffffff81044ff9>] __might_sleep+0x119/0x150
           [<ffffffff81380501>] netlink_table_grab+0x21/0x100
           [<ffffffff813813a3>] netlink_clear_multicast_users+0x23/0x60
           [<ffffffff81382761>] __genl_unregister_mc_group+0x71/0x120
           [<ffffffff81382866>] genl_unregister_family+0x56/0x130
           [<ffffffffa0007d85>] nl80211_exit+0x15/0x20 [cfg80211]
           [<ffffffffa000005a>] cfg80211_exit+0x1a/0x40 [cfg80211]
      
      Fix in the same way by grabbing the netlink table lock
      before doing rcu_read_lock().
      Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b8273570
  34. 22 Sep, 2009 1 commit
  35. 15 Sep, 2009 1 commit
    • genetlink: fix netns vs. netlink table locking · d136f1bd
      Johannes Berg committed
      Since my commits introducing netns awareness into
      genetlink we can get this problem:
      
      BUG: scheduling while atomic: modprobe/1178/0x00000002
      2 locks held by modprobe/1178:
       #0:  (genl_mutex){+.+.+.}, at: [<ffffffff8135ee1a>] genl_register_mc_grou
       #1:  (rcu_read_lock){.+.+..}, at: [<ffffffff8135eeb5>] genl_register_mc_g
      Pid: 1178, comm: modprobe Not tainted 2.6.31-rc8-wl-34789-g95cb731-dirty #
      Call Trace:
       [<ffffffff8103e285>] __schedule_bug+0x85/0x90
       [<ffffffff81403138>] schedule+0x108/0x588
       [<ffffffff8135b131>] netlink_table_grab+0xa1/0xf0
       [<ffffffff8135c3a7>] netlink_change_ngroups+0x47/0x100
       [<ffffffff8135ef0f>] genl_register_mc_group+0x12f/0x290
      
      because I overlooked that netlink_table_grab() will
      schedule, thinking it was just the rwlock. However,
      in the contention case, that isn't actually true.
      
      Fix this by letting the code grab the netlink table
      lock first and then the RCU for netns protection.
      Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d136f1bd
  36. 25 Aug, 2009 1 commit
  37. 15 Jul, 2009 1 commit
    • net/compat/wext: send different messages to compat tasks · 1dacc76d
      Johannes Berg committed
      Wireless extensions have the unfortunate problem that events
      are multicast netlink messages, and are not independent of
      pointer size. Thus, currently 32-bit tasks on 64-bit platforms
      cannot properly receive events and fail with all kinds of
      strange problems, for instance wpa_supplicant never notices
      disassociations, due to the way the 64-bit event looks (to a
      32-bit process), the fact that the address is all zeroes is
      lost, it thinks instead it is 00:00:00:00:01:00.
      
      The same problem existed with the ioctls, until David Miller
      fixed those some time ago in a heroic effort.
      
      A different problem caused by this is that we cannot send the
      ASSOCREQIE/ASSOCRESPIE events because sending them causes a
      32-bit wpa_supplicant on a 64-bit system to overwrite its
      internal information, which is worse than it not getting the
      information at all -- so we currently resort to sending a
      custom string event that it then parses. This, however, has a
      severe size limitation we are frequently hitting with modern
      access points; this limitation can be lifted after this
      patch by sending the correct binary, not custom, event.
      
      A similar problem apparently happens for some other netlink
      users on x86_64 with 32-bit tasks due to the alignment for
      64-bit quantities.
      
      In order to fix these problems, I have implemented a way to
      send compat messages to tasks. When sending an event, we send
      the non-compat event data together with a compat event data in
      skb_shinfo(main_skb)->frag_list. Then, when the event is read
      from the socket, the netlink code makes sure to pass out only
      the skb that is compatible with the task. This approach was
      suggested by David Miller, my original approach required
      always sending two skbs but that had various small problems.
      
      To determine whether compat is needed or not, I have used the
      MSG_CMSG_COMPAT flag, and adjusted the call path for recv and
      recvfrom to include it, even if those calls do not have a cmsg
      parameter.
      
      I have not solved one small part of the problem, and I don't
      think it is necessary to: if a 32-bit application uses read()
      rather than any form of recvmsg() it will still get the wrong
      (64-bit) event. However, neither do applications actually do
      this, nor would it be a regression.
      Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1dacc76d
  38. 13 Jul, 2009 1 commit
    • netlink: use call_rcu for netlink_change_ngroups · 6c04bb18
      Johannes Berg committed
      For the network namespace work in generic netlink I need
      to be able to call this function under rcu_read_lock(),
      otherwise the locking becomes a nightmare and more locks
      would be needed. Instead, just embed a struct rcu_head
      (actually a struct listeners_rcu_head that also carries
      the pointer to the memory block) into the listeners
      memory so we can use call_rcu() instead of synchronising
      and then freeing. No rcu_barrier() is needed since this
      code cannot be modular.
      Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6c04bb18