1. 18 Dec 2013, 1 commit
  2. 14 Dec 2013, 1 commit
    • packet: fix using smp_processor_id() in preemptible code · 1cbac010
      Li Zhong authored
      This patch fixes the following warning by replacing smp_processor_id()
      with raw_smp_processor_id():
      
      [   11.120893] BUG: using smp_processor_id() in preemptible [00000000] code: arping/3510
      [   11.120913] caller is .packet_sendmsg+0xc14/0xe68
      [   11.120920] CPU: 13 PID: 3510 Comm: arping Not tainted 3.13.0-rc3-next-20131211-dirty #1
      [   11.120926] Call Trace:
      [   11.120932] [c0000001f803f6f0] [c0000000000138dc] .show_stack+0x110/0x25c (unreliable)
      [   11.120942] [c0000001f803f7e0] [c00000000083dd24] .dump_stack+0xa0/0x37c
      [   11.120951] [c0000001f803f870] [c000000000493fd4] .debug_smp_processor_id+0xfc/0x12c
      [   11.120959] [c0000001f803f900] [c0000000007eba78] .packet_sendmsg+0xc14/0xe68
      [   11.120968] [c0000001f803fa80] [c000000000700968] .sock_sendmsg+0xa0/0xe0
      [   11.120975] [c0000001f803fbf0] [c0000000007014d8] .SyS_sendto+0x100/0x148
      [   11.120983] [c0000001f803fd60] [c0000000006fff10] .SyS_socketcall+0x1c4/0x2e8
      [   11.120990] [c0000001f803fe30] [c00000000000a1e4] syscall_exit+0x0/0x9c
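
      For context, a minimal sketch of the pattern being fixed (the helper
      name below is illustrative, not the exact diff): the CPU id here is
      only a queue-selection hint, so a stale value after task migration is
      harmless and raw_smp_processor_id() is safe in preemptible context.

        static u16 pick_tx_queue(struct net_device *dev)
        {
                /* smp_processor_id() fires the debug check when preemption
                 * is enabled; the raw variant skips the check and tolerates
                 * a possibly stale CPU number. */
                return (u16) raw_smp_processor_id() % dev->real_num_tx_queues;
        }
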
      Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 10 Dec 2013, 2 commits
    • packet: introduce PACKET_QDISC_BYPASS socket option · d346a3fa
      Daniel Borkmann authored
      This patch introduces a PACKET_QDISC_BYPASS socket option that
      allows for using a similar xmit() function as in pktgen instead
      of taking the dev_queue_xmit() path. This can be very useful when
      PF_PACKET applications need to run in a pktgen-like scenario,
      but with full, flexible packet payloads that the application
      provides itself.
      
      By default, nothing changes in behaviour for normal PF_PACKET
      TX users, so everything stays as is for applications. New users,
      however, can now set PACKET_QDISC_BYPASS if needed, so that their
      packets i) do not reenter packet_rcv() and ii) are pushed directly
      to the driver.
      
      In doing so we can increase pps (here 64 byte packets) for
      PF_PACKET a bit:
      
        # CPUs -- QDISC_BYPASS   -- qdisc path -- qdisc path[**]
        1 CPU  ==  1,509,628 pps --  1,208,708 --  1,247,436
        2 CPUs ==  3,198,659 pps --  2,536,012 --  1,605,779
        3 CPUs ==  4,787,992 pps --  3,788,740 --  1,735,610
        4 CPUs ==  6,173,956 pps --  4,907,799 --  1,909,114
        5 CPUs ==  7,495,676 pps --  5,956,499 --  2,014,422
        6 CPUs ==  9,001,496 pps --  7,145,064 --  2,155,261
        7 CPUs == 10,229,776 pps --  8,190,596 --  2,220,619
        8 CPUs == 11,040,732 pps --  9,188,544 --  2,241,879
        9 CPUs == 12,009,076 pps -- 10,275,936 --  2,068,447
       10 CPUs == 11,380,052 pps -- 11,265,337 --  1,578,689
       11 CPUs == 11,672,676 pps -- 11,845,344 --  1,297,412
       [...]
       20 CPUs == 11,363,192 pps -- 11,014,933 --  1,245,081
      
       [**]: qdisc path with packet_rcv(), which is probably how most
             people use it (hopefully not anymore where it's not needed)
      
      The test was done using a modified trafgen, sending a simple
      static 64 byte packet on all CPUs. The trick in the fast
      "qdisc path" case is to avoid reentering packet_rcv() by
      setting the RAW socket protocol to zero, like:
      socket(PF_PACKET, SOCK_RAW, 0);
      
      Tradeoffs are documented in this patch as well: clearly, if
      queues are busy, we will drop more packets, tc disciplines are
      ignored, and these packets are no longer visible to taps. For
      a pktgen-like scenario, we argue that this is acceptable.
      
      The pointer to the xmit function has been placed in a hole of the
      packet socket structure between cached_dev and prot_hook, which is
      hot anyway as we're working on cached_dev in each send path.
      
      Done in joint work together with Jesper Dangaard Brouer.
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • packet: fix send path when running with proto == 0 · 66e56cd4
      Daniel Borkmann authored
      Commit e40526cb introduced a cached dev pointer that is maintained
      in register_prot_hook() and __unregister_prot_hook() to update the
      device used for the send path.
      
      We need to fix this up, as otherwise it will not work for sockets
      created with protocol = 0 combined with sll_protocol = 0 passed
      via sockaddr_ll when doing the bind.
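
      For reference, a sketch of this socket-creation variant (the
      interface name is illustrative):

        #include <string.h>
        #include <sys/socket.h>
        #include <linux/if_packet.h>
        #include <net/if.h>

        int main(void)
        {
                int fd = socket(PF_PACKET, SOCK_RAW, 0); /* protocol == 0 */
                struct sockaddr_ll ll;

                memset(&ll, 0, sizeof(ll));
                ll.sll_family   = AF_PACKET;
                ll.sll_protocol = 0;                      /* also 0 at bind */
                ll.sll_ifindex  = if_nametoindex("eth0"); /* example device */
                return bind(fd, (struct sockaddr *)&ll, sizeof(ll));
        }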
      
      So instead, assign the pointer directly. The compiler can inline
      these helper functions automagically.
      
      While at it, also mark the cached dev fast path as likely(),
      and document this variant of socket creation, as it seems it is
      not widely used (it seems not even the author of TX_RING was aware
      of it in his reference example [1]). Tested with the reproducer
      from e40526cb.
      
       [1] http://wiki.ipxwarzone.com/index.php5?title=Linux_packet_mmap#Example
      
      Fixes: e40526cb ("packet: fix use after free race in send path when dev is released")
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Tested-by: Salam Noureddine <noureddine@aristanetworks.com>
      Tested-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 07 Dec 2013, 1 commit
  5. 30 Nov 2013, 1 commit
  6. 22 Nov 2013, 1 commit
    • packet: fix use after free race in send path when dev is released · e40526cb
      Daniel Borkmann authored
      Salam reported a use after free bug in PF_PACKET that occurs when
      we're sending out frames on a socket bound device and suddenly the
      net device is being unregistered. It appears that commit 827d9780
      introduced a possible race condition between {t,}packet_snd() and
      packet_notifier(). In the case of a bound socket, packet_notifier()
      can drop the last reference to the net_device and {t,}packet_snd()
      might end up suddenly sending a packet over a freed net_device.
      
      To avoid reverting 827d9780 and thus introducing a performance
      regression compared to the current state of things, we decided to
      hold a cached RCU-protected pointer to the net device and maintain
      it on the write side via the bind spin_lock-protected
      register_prot_hook() and __unregister_prot_hook() calls.
      
      In the {t,}packet_snd() path, we access this pointer under
      rcu_read_lock() through packet_cached_dev_get(), which takes a
      reference on the device to prevent it from being freed through
      packet_notifier() while we're in the send path. This is okay to do
      as dev_put()/dev_hold() use per-cpu counters, so this should not
      be a performance issue. Also, the code simplifies a bit as we
      don't need need_rls_dev anymore.
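
      A sketch of the accessor described above (close in shape to the
      committed helper, but quoted from memory rather than the diff):

        static struct net_device *packet_cached_dev_get(struct packet_sock *po)
        {
                struct net_device *dev;

                rcu_read_lock();
                dev = rcu_dereference(po->cached_dev);
                if (likely(dev))
                        dev_hold(dev);  /* pin the device for the send */
                rcu_read_unlock();

                return dev;
        }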
      
      Fixes: 827d9780 ("af-packet: Use existing netdev reference for bound sockets.")
      Reported-by: Salam Noureddine <noureddine@aristanetworks.com>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: Salam Noureddine <noureddine@aristanetworks.com>
      Cc: Ben Greear <greearb@candelatech.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 21 Nov 2013, 1 commit
    • net: rework recvmsg handler msg_name and msg_namelen logic · f3d33426
      Hannes Frederic Sowa authored
      This patch now always passes msg->msg_namelen as 0. recvmsg handlers must
      set msg_namelen to the proper size <= sizeof(struct sockaddr_storage)
      to return msg_name to the user.
      
      This prevents numerous uninitialized memory leaks we had in the
      recvmsg handlers and makes it harder for new code to accidentally leak
      uninitialized memory.
      
      Optimize for the case where recvfrom is called with a NULL address.
      We don't need to copy the address at all, so set it to NULL before
      invoking the recvmsg handler. We can do so because all the recvmsg
      handlers must cope with the case that a plain read() is called on
      them; read() also sets msg_name to NULL.
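
      A hedged sketch of the resulting handler contract (PF_PACKET used
      as the example; the field fill-in is elided):

        /* msg_namelen now arrives as 0; only set it when the caller
         * supplied an address buffer, and never beyond
         * sizeof(struct sockaddr_storage). */
        if (msg->msg_name) {
                struct sockaddr_ll *sll = msg->msg_name;

                /* ... fill *sll from the received skb ... */
                msg->msg_namelen = sizeof(*sll);
        }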
      
      Also document these changes in include/linux/net.h as suggested by David
      Miller.
      
      Changes since RFC:
      
      Set msg->msg_name = NULL if the user specified NULL in msg_name but
      a non-zero msg_namelen in verify_iovec/verify_compat_iovec. This
      doesn't affect sendto, as it would bail out earlier while trying to
      copy in the address. It also more naturally reflects the logic of
      the callers of verify_iovec.
      
      With this change in place I could remove "
      if (!uaddr || msg_sys->msg_namelen == 0)
      	msg->msg_name = NULL
      ".
      
      This change does not alter the user visible error logic as we ignore
      msg_namelen as long as msg_name is NULL.
      
      Also remove two unnecessary curly brackets in ___sys_recvmsg and change
      comments to netdev style.
      
      Cc: David Miller <davem@davemloft.net>
      Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 30 Aug 2013, 2 commits
  9. 21 Aug 2013, 1 commit
  10. 10 Aug 2013, 1 commit
    • net: attempt high order allocations in sock_alloc_send_pskb() · 28d64271
      Eric Dumazet authored
      Adding paged frags skbs to af_unix sockets introduced a performance
      regression on large sends because of additional page allocations, even
      if each skb could carry at least 100% more payload than before.
      
      We can instruct sock_alloc_send_pskb() to attempt high order
      allocations.
      
      Most of the time, it does a single page allocation instead of 8.
      
      I added an additional parameter to sock_alloc_send_pskb() to
      let other users opt in to this new feature in followup patches.
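
      A sketch of a caller opting in (the exact prototype and the order
      value are assumptions based on the description above):

        int err;
        struct sk_buff *skb;

        /* last argument: maximum page order to attempt for paged frags;
         * the allocator falls back to order-0 pages under pressure */
        skb = sock_alloc_send_pskb(sk, header_len, data_len,
                                   flags & MSG_DONTWAIT, &err, 3);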
      
      Tested:
      
      Before patch :
      
      $ netperf -t STREAM_STREAM
      STREAM STREAM TEST
      Recv   Send    Send
      Socket Socket  Message  Elapsed
      Size   Size    Size     Time     Throughput
      bytes  bytes   bytes    secs.    10^6bits/sec
      
       2304  212992  212992    10.00    46861.15
      
      After patch :
      
      $ netperf -t STREAM_STREAM
      STREAM STREAM TEST
      Recv   Send    Send
      Socket Socket  Message  Elapsed
      Size   Size    Size     Time     Throughput
      bytes  bytes   bytes    secs.    10^6bits/sec
      
       2304  212992  212992    10.00    57981.11
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 08 Aug 2013, 1 commit
  12. 03 Aug 2013, 3 commits
  13. 23 Jul 2013, 1 commit
  14. 13 Jun 2013, 1 commit
  15. 29 May 2013, 1 commit
  16. 04 May 2013, 1 commit
    • packet: tpacket_v3: do not trigger bug() on wrong header status · 8da3056c
      Daniel Borkmann authored
      Jakub reported that it is fairly easy to trigger the BUG() macro
      from user space with TPACKET_V3's RX_RING by just giving a wrong
      header status flag. We already had a similar situation in commit
      7f5c3e3a (``af_packet: remove BUG statement in
      tpacket_destruct_skb'') where this was the case on the TX_RING
      side and could be triggered from user space. So really, don't use
      BUG() or BUG_ON() unless there's really no way out, and in
      particular don't use it for consistency checking when user space
      is involved, no excuses, especially not if you're slapping the
      user with WARN + dump_stack + BUG all at once. The two functions
      of concern are:
      
        prb_retire_current_block() [when block status != TP_STATUS_KERNEL]
        prb_open_block() [when block_status != TP_STATUS_KERNEL]
      
      Calls to prb_open_block() are guarded by earlier checks that
      block_status really is TP_STATUS_KERNEL (racy!), but the first BUG()
      is easily triggerable from user space. The system remains stable
      after they are removed. Also remove the Yoda condition entirely,
      since it's already guarded.
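
      An illustrative shape of the safer handling (not the literal diff,
      which simply drops the BUG(); BLOCK_STATUS is an af_packet.c macro):

        if (unlikely(BLOCK_STATUS(pbd) != TP_STATUS_KERNEL)) {
                /* user-reachable state: warn once and recover, never BUG() */
                WARN_ONCE(1, "TPACKET_V3: unexpected block status %u\n",
                          BLOCK_STATUS(pbd));
                return;
        }
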
      Reported-by: Jakub Zawadzki <darkjames-ws@darkjames.pl>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  17. 25 Apr 2013, 4 commits
    • packet: account statistics only in tpacket_stats_u · ee80fbf3
      Daniel Borkmann authored
      Currently, packet_sock has a struct tpacket_stats stats member for
      TPACKET_V1 and TPACKET_V2 statistics accounting, and with TPACKET_V3
      ``union tpacket_stats_u stats_u'' was introduced, which however only
      holds statistics for TPACKET_V3; when copying to user space,
      TPACKET_V3 does some hackery and also accesses tpacket_stats' stats,
      although everything could have been done within the union itself.
      
      Unify accounting within the tpacket_stats_u union so that we can
      remove 8 unnecessary bytes from packet_sock. Note that even if we
      switch to TPACKET_V3 and use the non-mmap(2)ed option, this still
      works due to the union with the same types + offsets that are
      exposed to user space.
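
      For reference, the union in question (as declared in
      linux/if_packet.h):

        union tpacket_stats_u {
                struct tpacket_stats    stats1;   /* TPACKET_V1/V2 */
                struct tpacket_stats_v3 stats3;   /* TPACKET_V3 */
        };
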
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • packet: if hw/sw ts enabled in rx/tx ring, report which ts we got · b9c32fb2
      Daniel Borkmann authored
      Currently, there is no way to find out which timestamp is reported in
      tpacket{,2,3}_hdr's tp_sec, tp_{n,u}sec members. It can be one of
      SOF_TIMESTAMPING_SYS_HARDWARE, SOF_TIMESTAMPING_RAW_HARDWARE,
      SOF_TIMESTAMPING_SOFTWARE, or a late software fallback from the
      PF_PACKET code itself.
      
      Therefore, report in the tp_status member of the ring buffer which
      timestamp has been reported for the RX and TX path. This should not
      break anything for the following reasons: i) in the RX ring path,
      the user needs to test for tp_status & TP_STATUS_USER, and later for
      other flags as well, such as TP_STATUS_VLAN_VALID et al, so adding
      other flags will do no harm; ii) in the TX ring path, timestamps
      with the PACKET_TIMESTAMP socket option were not available and had
      no effect, except that an application setting it was buggy. Next to
      TP_STATUS_AVAILABLE, the user also should check for other flags such
      as TP_STATUS_WRONG_FORMAT to reclaim frames to the application.
      Thus, in case TX timestamps are turned off (the default), nothing
      happens to the application logic, and in case we want to use this
      new feature, we can now also check which timestamp source is
      reported in the status field as provided in the docs.
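
      A sketch of a ring consumer using the new status bits (flag names
      as added to linux/if_packet.h by this change):

        #include <stdio.h>
        #include <linux/if_packet.h>

        static void report_ts_source(const struct tpacket2_hdr *hdr)
        {
                if (hdr->tp_status & TP_STATUS_TS_RAW_HARDWARE)
                        puts("raw hardware timestamp");
                else if (hdr->tp_status & TP_STATUS_TS_SOFTWARE)
                        puts("software timestamp");
                else
                        puts("no timestamp flag set");
        }
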
      Reported-by: Richard Cochran <richardcochran@gmail.com>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • packet: enable hardware tx timestamping on tpacket ring · 7a51384c
      Daniel Borkmann authored
      Currently, we only have software timestamping for the TX ring buffer
      path, but this limitation stems rather from the implementation. By
      just reusing tpacket_get_timestamp(), we can also allow hardware
      timestamping just as in the RX path.
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • packet: tx timestamping on tpacket ring · 2e31396f
      Willem de Bruijn authored
      When transmit timestamping is enabled at the socket level, record a
      timestamp on packets written to a PACKET_TX_RING. Tx timestamps are
      always looped to the application over the socket error queue. Software
      timestamps are also written back into the packet frame header in the
      packet ring.
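
      A sketch of enabling TX timestamps at the socket level (standard
      SO_TIMESTAMPING usage, not code taken from this patch):

        #include <sys/socket.h>
        #include <linux/net_tstamp.h>

        static int enable_tx_timestamps(int fd)
        {
                int val = SOF_TIMESTAMPING_TX_SOFTWARE | /* stamp on xmit */
                          SOF_TIMESTAMPING_SOFTWARE;     /* report sw stamps */

                return setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING,
                                  &val, sizeof(val));
        }
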
      Reported-by: Paul Chavent <paul.chavent@onera.fr>
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  18. 20 Apr 2013, 1 commit
  19. 17 Apr 2013, 1 commit
  20. 15 Apr 2013, 1 commit
  21. 28 Mar 2013, 1 commit
  22. 27 Mar 2013, 1 commit
  23. 20 Mar 2013, 1 commit
    • packet: packet fanout rollover during socket overload · 77f65ebd
      Willem de Bruijn authored
      Changes:
        v3->v2: rebase (no other changes)
                passes selftest
        v2->v1: read f->num_members only once
                fix bug: test rollover mode + flag
      
      Minimize packet drop in a fanout group. If one socket is full,
      roll over packets to another from the group. Maintain flow
      affinity during normal load using an rxhash fanout policy, while
      dispersing unexpected traffic storms that hit a single cpu, such
      as spoofed-source DoS flows. Rollover breaks affinity for flows
      arriving at saturated sockets during those conditions.
      
      The patch adds a fanout policy ROLLOVER that rotates between sockets,
      filling each socket before moving to the next. It also adds a fanout
      flag ROLLOVER. If passed along with any other fanout policy, the
      primary policy is applied until the chosen socket is full. Then,
      rollover selects another socket, to delay packet drop until the
      entire system is saturated.
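
      A sketch of joining a fanout group with the new flag (group id is
      arbitrary; constants from linux/if_packet.h):

        #include <sys/socket.h>
        #include <linux/if_packet.h>

        static int join_fanout(int fd)
        {
                /* low 16 bits: group id; high 16 bits: policy plus flags */
                int arg = 42 | ((PACKET_FANOUT_HASH |
                                 PACKET_FANOUT_FLAG_ROLLOVER) << 16);

                return setsockopt(fd, SOL_PACKET, PACKET_FANOUT,
                                  &arg, sizeof(arg));
        }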
      
      Probing sockets is not free. Selecting the last used socket, as
      rollover does, is a greedy approach that maximizes the chance of
      success, at the cost of extreme load imbalance. In practice, with
      sufficiently long queues to absorb bursts, sockets are drained in
      parallel and load balance looks uniform in `top`.
      
      To avoid contention, the patch scales counters with the number of
      sockets and accesses them lock-free. Values are bounds-checked to
      ensure correctness.
      
      Tested using an application with 9 threads pinned to CPUs, one
      socket per thread and sufficient busywork per packet operation to
      limit each thread to handling 32 Kpps. When a 500 Kpps single UDP
      stream was sent, a FANOUT_CPU setup processed 32 Kpps in total
      without this patch and 270 Kpps with it. Tested with read() and
      with a packet ring (V1).
      
      Also, passes psock_fanout.c unit test added to selftests.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  24. 28 Feb 2013, 1 commit
    • hlist: drop the node parameter from iterators · b67bfe0d
      Sasha Levin authored
      I'm not sure why, but the hlist for-each-entry iterators were
      conceived differently from the list one:
      
              list_for_each_entry(pos, head, member)
      
      The hlist ones were greedy and wanted an extra parameter:
      
              hlist_for_each_entry(tpos, pos, head, member)
      
      Why did they need an extra pos parameter? I'm not quite sure. Not
      only do they not really need it, it also prevents the iterator from
      looking exactly like the list iterator, which is unfortunate.
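
      A before/after sketch of a typical call site (names illustrative):

        struct sock *sk;
        struct hlist_node *node;

        /* before: the iterator dragged along its own hlist_node cursor */
        sk_for_each(sk, node, &net->packet.sklist)
                process(sk);

        /* after: the cursor is gone; it reads like list_for_each_entry() */
        sk_for_each(sk, &net->packet.sklist)
                process(sk);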
      
      Besides the semantic patch, there was some manual work required:
      
       - Fix up the actual hlist iterators in linux/list.h
       - Fix up the declaration of other iterators based on the hlist ones.
       - A very small number of places were using the 'node' parameter;
       these were modified to use 'obj->member' instead.
       - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
       properly, so those had to be fixed up manually.
      
      The semantic patch, which is mostly the work of Peter Senna
      Tschudin, is here:
      
      @@
      iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
      
      type T;
      expression a,c,d,e;
      identifier b;
      statement S;
      @@
      
      -T b;
          <+... when != b
      (
      hlist_for_each_entry(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_continue(a,
      - b,
      c) S
      |
      hlist_for_each_entry_from(a,
      - b,
      c) S
      |
      hlist_for_each_entry_rcu(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_rcu_bh(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_continue_rcu_bh(a,
      - b,
      c) S
      |
      for_each_busy_worker(a, c,
      - b,
      d) S
      |
      ax25_uid_for_each(a,
      - b,
      c) S
      |
      ax25_for_each(a,
      - b,
      c) S
      |
      inet_bind_bucket_for_each(a,
      - b,
      c) S
      |
      sctp_for_each_hentry(a,
      - b,
      c) S
      |
      sk_for_each(a,
      - b,
      c) S
      |
      sk_for_each_rcu(a,
      - b,
      c) S
      |
      sk_for_each_from
      -(a, b)
      +(a)
      S
      + sk_for_each_from(a) S
      |
      sk_for_each_safe(a,
      - b,
      c, d) S
      |
      sk_for_each_bound(a,
      - b,
      c) S
      |
      hlist_for_each_entry_safe(a,
      - b,
      c, d, e) S
      |
      hlist_for_each_entry_continue_rcu(a,
      - b,
      c) S
      |
      nr_neigh_for_each(a,
      - b,
      c) S
      |
      nr_neigh_for_each_safe(a,
      - b,
      c, d) S
      |
      nr_node_for_each(a,
      - b,
      c) S
      |
      nr_node_for_each_safe(a,
      - b,
      c, d) S
      |
      - for_each_gfn_sp(a, c, d, b) S
      + for_each_gfn_sp(a, c, d) S
      |
      - for_each_gfn_indirect_valid_sp(a, c, d, b) S
      + for_each_gfn_indirect_valid_sp(a, c, d) S
      |
      for_each_host(a,
      - b,
      c) S
      |
      for_each_host_safe(a,
      - b,
      c, d) S
      |
      for_each_mesh_entry(a,
      - b,
      c, d) S
      )
          ...+>
      
      [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
      [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
      [akpm@linux-foundation.org: checkpatch fixes]
      [akpm@linux-foundation.org: fix warnings]
      [akpm@linux-foundation.org: redo intrusive kvm changes]
      Tested-by: Peter Senna Tschudin <peter.senna@gmail.com>
      Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  25. 19 Feb 2013, 2 commits
  26. 04 Feb 2013, 1 commit
    • packet: fix leakage of tx_ring memory · 9665d5d6
      Phil Sutter authored
      When releasing a packet socket, the routine packet_set_ring() is
      reused to free rings instead of allocating them. But on the first
      call, it fills req->tp_block_nr with the value of rb->pg_vec_len,
      which makes the second invocation bail out, since req->tp_block_nr
      is greater than zero while req->tp_block_size is zero.
      
      This patch solves the problem by passing a zeroed auto-variable to
      packet_set_ring() upon each invocation from packet_release().
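
      The shape of the fix in sketch form (argument names assumed from
      packet_set_ring()'s signature):

        /* in packet_release(): a zeroed request makes packet_set_ring()
         * take the free path instead of re-reading stale ring geometry */
        union tpacket_req_u req_u;

        memset(&req_u, 0, sizeof(req_u));
        if (po->rx_ring.pg_vec)
                packet_set_ring(sk, &req_u, 1 /* closing */, 0 /* rx */);
        if (po->tx_ring.pg_vec)
                packet_set_ring(sk, &req_u, 1 /* closing */, 1 /* tx */);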
      
      As far as I can tell, this issue has existed ever since 69e3c75f
      (net: TX_RING and packet mmap), i.e. the original inclusion of TX
      ring support into af_packet, but it applies only to sockets with
      both RX and TX rings allocated, which is probably why it went
      unnoticed all this time.
      Signed-off-by: Phil Sutter <phil.sutter@viprinet.com>
      Cc: Johann Baudy <johann.baudy@gnu-log.net>
      Cc: Daniel Borkmann <dborkman@redhat.com>
      Acked-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  27. 19 Nov 2012, 1 commit
    • net: Allow userns root to control llc, netfilter, netlink, packet, and xfrm · df008c91
      Eric W. Biederman authored
      Allow an unprivileged user who has created a user namespace, and
      then created a network namespace, to effectively use the new network
      namespace, by reducing capable(CAP_NET_ADMIN) and
      capable(CAP_NET_RAW) calls to ns_capable(net->user_ns,
      CAP_NET_ADMIN) or ns_capable(net->user_ns, CAP_NET_RAW) calls.
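
      The transformation in sketch form (call sites vary):

        /* before: requires the capability in the initial user namespace */
        if (!capable(CAP_NET_RAW))
                return -EPERM;

        /* after: requires it only in the namespace owning this net ns */
        if (!ns_capable(net->user_ns, CAP_NET_RAW))
                return -EPERM;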
      
      Allow creation of af_key sockets.
      Allow creation of llc sockets.
      Allow creation of af_packet sockets.
      
      Allow sending xfrm netlink control messages.
      
      Allow binding to netlink multicast groups.
      Allow sending to netlink multicast groups.
      Allow adding and dropping netlink multicast groups.
      Allow sending to all netlink multicast groups and port ids.
      
      Allow reading the netfilter SO_IP_SET socket option.
      Allow sending netfilter netlink messages.
      Allow setting and getting ip_vs netfilter socket options.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
  28. 08 Nov 2012, 1 commit
  29. 26 Oct 2012, 1 commit
  30. 24 Aug 2012, 1 commit
  31. 23 Aug 2012, 2 commits
    • packet: Protect packet sk list with mutex (v2) · 0fa7fa98
      Pavel Emelyanov authored
      Change since v1:
      
      * Fixed inuse counters access spotted by Eric
      
      In patch eea68e2f (packet: Report socket mclist info via diag
      module) I introduced a "scheduling while atomic" problem in the
      packet diag module -- the socket list is traversed under
      rcu_read_lock(), while the sk mclist access performed under it
      requires the rtnl lock (i.e. a mutex) to be taken.
      
      [152363.820563] BUG: scheduling while atomic: crtools/12517/0x10000002
      [152363.820573] 4 locks held by crtools/12517:
      [152363.820581]  #0:  (sock_diag_mutex){+.+.+.}, at: [<ffffffff81a2dcb5>] sock_diag_rcv+0x1f/0x3e
      [152363.820613]  #1:  (sock_diag_table_mutex){+.+.+.}, at: [<ffffffff81a2de70>] sock_diag_rcv_msg+0xdb/0x11a
      [152363.820644]  #2:  (nlk->cb_mutex){+.+.+.}, at: [<ffffffff81a67d01>] netlink_dump+0x23/0x1ab
      [152363.820693]  #3:  (rcu_read_lock){.+.+..}, at: [<ffffffff81b6a049>] packet_diag_dump+0x0/0x1af
      
      A similar thing was then re-introduced by further packet diag
      patches (fanout mutex and pgvec mutex for rings) :(
      
      Apart from being terribly sorry for the above, I propose to change
      the packet sk list protection from a spinlock to a mutex. This lock
      currently protects two modifications:
      
      * sklist
      * prot inuse counters
      
      The sklist modifications can simply be reprotected with a mutex,
      since they already occur in a sleeping context. The inuse counters
      modifications are trickier -- the __this_cpu_-s are used inside,
      thus requiring the caller to handle the potential context issues
      himself. Since packet sockets' counters are modified in only two
      places (packet_create and packet_release), we only need to protect
      the context from being preempted. BH disabling is not required in
      this case.
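
      A sketch of the resulting protection in packet_create() (close to
      the committed shape):

        mutex_lock(&net->packet.sklist_lock);
        sk_add_node_rcu(sk, &net->packet.sklist);
        mutex_unlock(&net->packet.sklist_lock);

        /* the counters only need preemption off, not BH off */
        preempt_disable();
        sock_prot_inuse_add(net, &packet_proto, 1);
        preempt_enable();
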
      Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • af_packet: use define instead of constant · 9e67030a
      danborkmann@iogearbox.net authored
      Instead of using a hard-coded value for the status variable, it
      makes the code more readable to use its designated define from
      linux/if_packet.h.
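
      An illustrative before/after (the variable name is hypothetical;
      both values are zero, the define merely documents intent):

        ph->tp_status = 0;                    /* before: magic constant */
        ph->tp_status = TP_STATUS_AVAILABLE;  /* after: named define */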
      
      Signed-off-by: daniel.borkmann@tik.ee.ethz.ch
      Signed-off-by: David S. Miller <davem@davemloft.net>