1. 12 3月, 2015 3 次提交
    • W
      sock: fix possible NULL sk dereference in __skb_tstamp_tx · 3a8dd971
      Willem de Bruijn 提交于
      Test that sk != NULL before reading sk->sk_tsflags.
      
      Fixes: 49ca0d8b ("net-timestamp: no-payload option")
      Reported-by: NOne Thousand Gnomes <gnomes@lxorguk.ukuu.org.uk>
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3a8dd971
    • E
      xps: must clear sender_cpu before forwarding · c29390c6
      Eric Dumazet 提交于
      John reported that my previous commit added a regression
      on his router.
      
      This is because sender_cpu & napi_id share a common location,
      so get_xps_queue() can see garbage and perform an out of bound access.
      
      We need to make sure sender_cpu is cleared before doing the transmit,
      otherwise any NIC busy poll enabled (skb_mark_napi_id()) can trigger
      this bug.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NJohn <jw@nuclearfallout.net>
      Bisected-by: NJohn <jw@nuclearfallout.net>
      Fixes: 2bd82484 ("xps: fix xps for stacked devices")
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c29390c6
    • A
      net: sysctl_net_core: check SNDBUF and RCVBUF for min length · b1cb59cf
      Alexey Kodanev 提交于
      sysctl has sysctl.net.core.rmem_*/wmem_* parameters which can be
      set to incorrect values. Given that 'struct sk_buff' allocates from
      rcvbuf, incorrectly set buffer length could result to memory
      allocation failures. For example, set them as follows:
      
          # sysctl net.core.rmem_default=64
            net.core.wmem_default = 64
          # sysctl net.core.wmem_default=64
            net.core.wmem_default = 64
          # ping localhost -s 1024 -i 0 > /dev/null
      
      This could result to the following failure:
      
      skbuff: skb_over_panic: text:ffffffff81628db4 len:-32 put:-32
      head:ffff88003a1cc200 data:ffff88003a1cc200 tail:0xffffffe0 end:0xc0 dev:<NULL>
      kernel BUG at net/core/skbuff.c:102!
      invalid opcode: 0000 [#1] SMP
      ...
      task: ffff88003b7f5550 ti: ffff88003ae88000 task.ti: ffff88003ae88000
      RIP: 0010:[<ffffffff8155fbd1>]  [<ffffffff8155fbd1>] skb_put+0xa1/0xb0
      RSP: 0018:ffff88003ae8bc68  EFLAGS: 00010296
      RAX: 000000000000008d RBX: 00000000ffffffe0 RCX: 0000000000000000
      RDX: ffff88003fdcf598 RSI: ffff88003fdcd9c8 RDI: ffff88003fdcd9c8
      RBP: ffff88003ae8bc88 R08: 0000000000000001 R09: 0000000000000000
      R10: 0000000000000001 R11: 00000000000002b2 R12: 0000000000000000
      R13: 0000000000000000 R14: ffff88003d3f7300 R15: ffff88000012a900
      FS:  00007fa0e2b4a840(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000d0f7e0 CR3: 000000003b8fb000 CR4: 00000000000006f0
      Stack:
       ffff88003a1cc200 00000000ffffffe0 00000000000000c0 ffffffff818cab1d
       ffff88003ae8bd68 ffffffff81628db4 ffff88003ae8bd48 ffff88003b7f5550
       ffff880031a09408 ffff88003b7f5550 ffff88000012aa48 ffff88000012ab00
      Call Trace:
       [<ffffffff81628db4>] unix_stream_sendmsg+0x2c4/0x470
       [<ffffffff81556f56>] sock_write_iter+0x146/0x160
       [<ffffffff811d9612>] new_sync_write+0x92/0xd0
       [<ffffffff811d9cd6>] vfs_write+0xd6/0x180
       [<ffffffff811da499>] SyS_write+0x59/0xd0
       [<ffffffff81651532>] system_call_fastpath+0x12/0x17
      Code: 00 00 48 89 44 24 10 8b 87 c8 00 00 00 48 89 44 24 08 48 8b 87 d8 00
            00 00 48 c7 c7 30 db 91 81 48 89 04 24 31 c0 e8 4f a8 0e 00 <0f> 0b
            eb fe 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 48 83
      RIP  [<ffffffff8155fbd1>] skb_put+0xa1/0xb0
      RSP <ffff88003ae8bc68>
      Kernel panic - not syncing: Fatal exception
      
      Moreover, the possible minimum is 1, so we can get another kernel panic:
      ...
      BUG: unable to handle kernel paging request at ffff88013caee5c0
      IP: [<ffffffff815604cf>] __alloc_skb+0x12f/0x1f0
      ...
      Signed-off-by: NAlexey Kodanev <alexey.kodanev@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b1cb59cf
  2. 11 3月, 2015 2 次提交
  3. 01 3月, 2015 3 次提交
    • E
      net: do not use rcu in rtnl_dump_ifinfo() · cac5e65e
      Eric Dumazet 提交于
      We did a failed attempt in the past to only use rcu in rtnl dump
      operations (commit e67f88dd "net: dont hold rtnl mutex during
      netlink dump callbacks")
      
      Now that dumps are holding RTNL anyway, there is no need to also
      use rcu locking, as it forbids any scheduling ability, like
      GFP_KERNEL allocations that controlling path should use instead
      of GFP_ATOMIC whenever possible.
      
      This should fix following splat Cong Wang reported :
      
       [ INFO: suspicious RCU usage. ]
       3.19.0+ #805 Tainted: G        W
      
       include/linux/rcupdate.h:538 Illegal context switch in RCU read-side critical section!
      
       other info that might help us debug this:
      
       rcu_scheduler_active = 1, debug_locks = 0
       2 locks held by ip/771:
        #0:  (rtnl_mutex){+.+.+.}, at: [<ffffffff8182b8f4>] netlink_dump+0x21/0x26c
        #1:  (rcu_read_lock){......}, at: [<ffffffff817d785b>] rcu_read_lock+0x0/0x6e
      
       stack backtrace:
       CPU: 3 PID: 771 Comm: ip Tainted: G        W       3.19.0+ #805
       Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
        0000000000000001 ffff8800d51e7718 ffffffff81a27457 0000000029e729e6
        ffff8800d6108000 ffff8800d51e7748 ffffffff810b539b ffffffff820013dd
        00000000000001c8 0000000000000000 ffff8800d7448088 ffff8800d51e7758
       Call Trace:
        [<ffffffff81a27457>] dump_stack+0x4c/0x65
        [<ffffffff810b539b>] lockdep_rcu_suspicious+0x107/0x110
        [<ffffffff8109796f>] rcu_preempt_sleep_check+0x45/0x47
        [<ffffffff8109e457>] ___might_sleep+0x1d/0x1cb
        [<ffffffff8109e67d>] __might_sleep+0x78/0x80
        [<ffffffff814b9b1f>] idr_alloc+0x45/0xd1
        [<ffffffff810cb7ab>] ? rcu_read_lock_held+0x3b/0x3d
        [<ffffffff814b9f9d>] ? idr_for_each+0x53/0x101
        [<ffffffff817c1383>] alloc_netid+0x61/0x69
        [<ffffffff817c14c3>] __peernet2id+0x79/0x8d
        [<ffffffff817c1ab7>] peernet2id+0x13/0x1f
        [<ffffffff817d8673>] rtnl_fill_ifinfo+0xa8d/0xc20
        [<ffffffff810b17d9>] ? __lock_is_held+0x39/0x52
        [<ffffffff817d894f>] rtnl_dump_ifinfo+0x149/0x213
        [<ffffffff8182b9c2>] netlink_dump+0xef/0x26c
        [<ffffffff8182bcba>] netlink_recvmsg+0x17b/0x2c5
        [<ffffffff817b0adc>] __sock_recvmsg+0x4e/0x59
        [<ffffffff817b1b40>] sock_recvmsg+0x3f/0x51
        [<ffffffff817b1f9a>] ___sys_recvmsg+0xf6/0x1d9
        [<ffffffff8115dc67>] ? handle_pte_fault+0x6e1/0xd3d
        [<ffffffff8100a3a0>] ? native_sched_clock+0x35/0x37
        [<ffffffff8109f45b>] ? sched_clock_local+0x12/0x72
        [<ffffffff8109f6ac>] ? sched_clock_cpu+0x9e/0xb7
        [<ffffffff810cb7ab>] ? rcu_read_lock_held+0x3b/0x3d
        [<ffffffff811abde8>] ? __fcheck_files+0x4c/0x58
        [<ffffffff811ac556>] ? __fget_light+0x2d/0x52
        [<ffffffff817b376f>] __sys_recvmsg+0x42/0x60
        [<ffffffff817b379f>] SyS_recvmsg+0x12/0x1c
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Fixes: 0c7aecd4 ("netns: add rtnl cmd to add and get peer netns ids")
      Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Reported-by: NCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cac5e65e
    • E
      net: Verify permission to link_net in newlink · 06615bed
      Eric W. Biederman 提交于
      When applicable verify that the caller has permisson to the underlying
      network namespace for a newly created network device.
      
      Similary checks exist for the network namespace a network device will
      be created in.
      
      Fixes: 317f4810 ("rtnl: allow to create device with IFLA_LINK_NETNSID set")
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Acked-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      06615bed
    • E
      net: Verify permission to dest_net in newlink · 505ce415
      Eric W. Biederman 提交于
      When applicable verify that the caller has permision to create a
      network device in another network namespace.  This check is already
      present when moving a network device between network namespaces in
      setlink so all that is needed is to duplicate that check in newlink.
      
      This change almost backports cleanly, but there are context conflicts
      as the code that follows was added in v4.0-rc1
      
      Fixes: b51642f6 net: Enable a userns root rtnl calls that are safe for unprivilged users
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Acked-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      505ce415
  4. 25 2月, 2015 1 次提交
  5. 23 2月, 2015 1 次提交
  6. 22 2月, 2015 1 次提交
  7. 21 2月, 2015 2 次提交
  8. 20 2月, 2015 1 次提交
  9. 16 2月, 2015 1 次提交
  10. 15 2月, 2015 2 次提交
  11. 14 2月, 2015 1 次提交
  12. 12 2月, 2015 1 次提交
  13. 09 2月, 2015 2 次提交
    • E
      net:rfs: adjust table size checking · 93c1af6c
      Eric Dumazet 提交于
      Make sure root user does not try something stupid.
      
      Also make sure mask field in struct rps_sock_flow_table
      does not share a cache line with the potentially often dirtied
      flow table.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Fixes: 567e4b79 ("net: rfs: add hash collision detection")
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      93c1af6c
    • E
      net: rfs: add hash collision detection · 567e4b79
      Eric Dumazet 提交于
      Receive Flow Steering is a nice solution but suffers from
      hash collisions when a mix of connected and unconnected traffic
      is received on the host, when flow hash table is populated.
      
      Also, clearing flow in inet_release() makes RFS not very good
      for short lived flows, as many packets can follow close().
      (FIN , ACK packets, ...)
      
      This patch extends the information stored into global hash table
      to not only include cpu number, but upper part of the hash value.
      
      I use a 32bit value, and dynamically split it in two parts.
      
      For host with less than 64 possible cpus, this gives 6 bits for the
      cpu number, and 26 (32-6) bits for the upper part of the hash.
      
      Since hash bucket selection use low order bits of the hash, we have
      a full hash match, if /proc/sys/net/core/rps_sock_flow_entries is big
      enough.
      
      If the hash found in flow table does not match, we fallback to RPS (if
      it is enabled for the rxqueue).
      
      This means that a packet for an non connected flow can avoid the
      IPI through a unrelated/victim CPU.
      
      This also means we no longer have to clear the table at socket
      close time, and this helps short lived flows performance.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      567e4b79
  14. 08 2月, 2015 2 次提交
  15. 06 2月, 2015 2 次提交
  16. 05 2月, 2015 3 次提交
    • E
      net: remove some sparse warnings · 2ce1ee17
      Eric Dumazet 提交于
      netdev_adjacent_add_links() and netdev_adjacent_del_links()
      are static.
      
      queue->qdisc has __rcu annotation, need to use RCU_INIT_POINTER()
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2ce1ee17
    • M
      net/core: Add event for a change in slave state · 61bd3857
      Moni Shoua 提交于
      Add event which provides an indication on a change in the state
      of a bonding slave. The event handler should cast the pointer to the
      appropriate type (struct netdev_bonding_info) in order to get the
      full info about the slave.
      Signed-off-by: NMoni Shoua <monis@mellanox.com>
      Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      61bd3857
    • E
      xps: fix xps for stacked devices · 2bd82484
      Eric Dumazet 提交于
      A typical qdisc setup is the following :
      
      bond0 : bonding device, using HTB hierarchy
      eth1/eth2 : slaves, multiqueue NIC, using MQ + FQ qdisc
      
      XPS allows to spread packets on specific tx queues, based on the cpu
      doing the send.
      
      Problem is that dequeues from bond0 qdisc can happen on random cpus,
      due to the fact that qdisc_run() can dequeue a batch of packets.
      
      CPUA -> queue packet P1 on bond0 qdisc, P1->ooo_okay=1
      CPUA -> queue packet P2 on bond0 qdisc, P2->ooo_okay=0
      
      CPUB -> dequeue packet P1 from bond0
              enqueue packet on eth1/eth2
      CPUC -> dequeue packet P2 from bond0
              enqueue packet on eth1/eth2 using sk cache (ooo_okay is 0)
      
      get_xps_queue() then might select wrong queue for P1, since current cpu
      might be different than CPUA.
      
      P2 might be sent on the old queue (stored in sk->sk_tx_queue_mapping),
      if CPUC runs a bit faster (or CPUB spins a bit on qdisc lock)
      
      Effect of this bug is TCP reorders, and more generally not optimal
      TX queue placement. (A victim bulk flow can be migrated to the wrong TX
      queue for a while)
      
      To fix this, we have to record sender cpu number the first time
      dev_queue_xmit() is called for one tx skb.
      
      We can union napi_id (used on receive path) and sender_cpu,
      granted we clear sender_cpu in skb_scrub_packet() (credit to Willem for
      this union idea)
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Nandita Dukkipati <nanditad@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2bd82484
  17. 04 2月, 2015 1 次提交
  18. 03 2月, 2015 2 次提交
    • W
      net-timestamp: no-payload only sysctl · b245be1f
      Willem de Bruijn 提交于
      Tx timestamps are looped onto the error queue on top of an skb. This
      mechanism leaks packet headers to processes unless the no-payload
      options SOF_TIMESTAMPING_OPT_TSONLY is set.
      
      Add a sysctl that optionally drops looped timestamp with data. This
      only affects processes without CAP_NET_RAW.
      
      The policy is checked when timestamps are generated in the stack.
      It is possible for timestamps with data to be reported after the
      sysctl is set, if these were queued internally earlier.
      
      No vulnerability is immediately known that exploits knowledge
      gleaned from packet headers, but it may still be preferable to allow
      administrators to lock down this path at the cost of possible
      breakage of legacy applications.
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      
      ----
      
      Changes
        (v1 -> v2)
        - test socket CAP_NET_RAW instead of capable(CAP_NET_RAW)
        (rfc -> v1)
        - document the sysctl in Documentation/sysctl/net.txt
        - fix access control race: read .._OPT_TSONLY only once,
              use same value for permission check and skb generation.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b245be1f
    • W
      net-timestamp: no-payload option · 49ca0d8b
      Willem de Bruijn 提交于
      Add timestamping option SOF_TIMESTAMPING_OPT_TSONLY. For transmit
      timestamps, this loops timestamps on top of empty packets.
      
      Doing so reduces the pressure on SO_RCVBUF. Payload inspection and
      cmsg reception (aside from timestamps) are no longer possible. This
      works together with a follow on patch that allows administrators to
      only allow tx timestamping if it does not loop payload or metadata.
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      
      ----
      
      Changes (rfc -> v1)
        - add documentation
        - remove unnecessary skb->len test (thanks to Richard Cochran)
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      49ca0d8b
  19. 02 2月, 2015 1 次提交
  20. 31 1月, 2015 1 次提交
  21. 30 1月, 2015 2 次提交
  22. 29 1月, 2015 1 次提交
  23. 27 1月, 2015 1 次提交
  24. 24 1月, 2015 2 次提交
  25. 23 1月, 2015 1 次提交