1. 22 5月, 2015 23 次提交
    • E
      sfc: protect filter table against use-after-free · 0d322413
      Edward Cree 提交于
      If MCDI timeouts are encountered during efx_ef10_filter_table_remove(),
      an FLR will be queued, but efx->filter_state will still be kfree()d.
      The queued FLR will then call efx_ef10_filter_table_restore(), which
      will try to use efx->filter_state. This previously caused a panic.
      This patch adds an rwsem to protect the existence of efx->filter_state,
      separately from the spinlock protecting its contents.  Users which can
      race against efx_ef10_filter_table_remove() should down_read this rwsem.
      Signed-off-by: NShradha Shah <sshah@solarflare.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0d322413
    • S
      sfc: Store the efx_nic struct of the current VF in the VF data struct · f1122a34
      Shradha Shah 提交于
      Initialised in efx_probe_vf and removal is dealt with in
      efx_ef10_remove.
      
      vf->efx is needed in future patches to change the MAC address
      of the VF via the parent PF, while the driver is bound to the
      VF.
      Example: ip link set dev vf NUM mac LLADDR
      Signed-off-by: NShradha Shah <sshah@solarflare.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f1122a34
    • S
      sfc: save old MAC address in case sriov_mac_address_changed fails · cfc77c2f
      Shradha Shah 提交于
      Otherwise the PF and VF can disagree on the VF's MAC address and
      this leads to strange behaviour, up to and including kernel panics.
      Signed-off-by: NShradha Shah <sshah@solarflare.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cfc77c2f
    • S
      sfc: Store vf_index in nic_data for Ef10. · 88a37de6
      Shradha Shah 提交于
      Added function efx_ef10_get_vf_index to store the vf_index
      in nic_data during probe
      
      vf_index is needed in future patches to access a particular
      VF in the VF data structure.
      
      Moved efx_ef10_probe_pf and efx_ef10_probe_vf in order to
      used efx_ef10_remove
      Signed-off-by: NShradha Shah <sshah@solarflare.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      88a37de6
    • S
      sfc: MC_CMD_SET_MAC can only be called by the link control Function · 862f894c
      Shradha Shah 提交于
      MC_CMD_SET_MAC is privileged and can only by called by the link
      control function.
      
      This patch adds efx_ef10_mac_reconfigure_vf which avoids the call
      to MC_CMD_SET_MAC by the Virtual function
      Signed-off-by: NShradha Shah <sshah@solarflare.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      862f894c
    • S
      af6a074d
    • S
      sfc: Add permissions to MCDI commands · 75122ec8
      Shradha Shah 提交于
      There is one primary function per adaptor, one link control function
      per port and the rest as categorised as general.
      
      This patch adds privileges to the MCDI commands based on which
      functions are allowed to call them.
      Signed-off-by: NShradha Shah <sshah@solarflare.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      75122ec8
    • V
      stmmac: replace open coded __netdev_alloc_skb_ip_align() with actual call · 4ec49a37
      Vineet Gupta 提交于
      This also matches with the sibling call netdev_alloc_skb_ip_align() made in
      rx fast path.
      Signed-off-by: NVineet Gupta <vgupta@synopsys.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4ec49a37
    • J
      qlge: Move jiffies_to_usecs immediately before loop · 3f6e785f
      Joe Perches 提交于
      30 usecs (or really, 1 jiffy) can go by pretty fast.
      
      Move the set of the timeout immediately before the loop.
      
      Remove the unnecessary max(1ul, usecs_to_jiffies(30)) as
      usecs_to_jiffies with a non-zero constant is guaranteed
      to be non-zero.
      Signed-off-by: NJoe Perches <joe@perches.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3f6e785f
    • D
      Merge branch 'rocker-transaction-fixes' · 4ac2dc89
      David S. Miller 提交于
      Simon Horman says:
      
      ====================
      rocker: transaction fixes
      
      this series addresses what appear to be errors in the handling of
      prepare and then commit transactions in the rocker driver.
      
      In all cases the problem is that data structures visible outside of
      the transaction are modified during the prepare phase.
      
      In the case of the first two patches this results in the kernel reporting a
      BUG. I have noted test-cases in the change logs.
      
      The third patch is also a bug fix, as noted by  Toshiaki Makita,
      however I have not been able to reliably reproduce the problem and
      thus have not provided a test case.
      
      The last patch is a correctness fix that does not fix a bug
      that manifests as far as I can tell.
      
      Changes: v3->v4
      * All patches
        - Add Jiri Pirko's ack
      * "rocker: do not make neighbour entry changes when preparing transactions"
        - Setting of entry values in all transaction phases
          as suggested by Toshiaki Makita
      * "rocker: make rocker_port_internal_vlan_id_{get,put}() non-transactional"
        - Remove Fixes tag as I believe this is a correctness rather than a bug fix
      
      Changes: v2->v3
      * "rocker: do not make neighbour entry changes when preparing transactions"
        - Correct inverted logic
        - Added ack from Scott Feldman
      
      Changes: v1->v2
      * "rocker: do not make neighbour entry changes when preparing transactions"
        - Revised changelog to reflect information from Toshiaki Makita
          that there is a bug that can manifest
        - Update address and ttl regardless of the value of the transaction state
      * All other patches
        - Added acks from Scott Feldman
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4ac2dc89
    • S
      rocker: make rocker_port_internal_vlan_id_{get, put}() non-transactional · df6a2067
      Simon Horman 提交于
      The motivation for this is that rocker_port_internal_vlan_id_{get,put} appear
      to only partially implement the transaction model: memory allocation
      and freeing is transactional, but hash and bitmap manipulation is not.
      
      The latter could be fixed, however, as it is not currently exercised
      due to trans always being SWITCHDEV_TRANS_NONE it seems cleaner
      to make rocker_port_internal_vlan_id_get non-transactional.
      
      This problem was introduced by c4f20321 ("rocker: support
      prepare-commit transaction model").
      
      Found by inspection.
      I do not believe that this change should have any run-time effect.
      Acked-by: NScott Feldman <sfeldma@gmail.com>
      Acked-by: NJiri Pirko <jiri@resnulli.us>
      Signed-off-by: NSimon Horman <simon.horman@netronome.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      df6a2067
    • S
      rocker: do not make neighbour entry changes when preparing transactions · 550ecc92
      Simon Horman 提交于
      rocker_port_ipv4_nh() and in turn rocker_port_ipv4_neigh() may be
      be called with trans == SWITCHDEV_TRANS_PREPARE and then
      trans == SWITCHDEV_TRANS_COMMIT from switchdev_port_obj_set() via
      fib_table_insert().
      
      The first time that rocker_port_ipv4_nh() is called, with
      trans == SWITCHDEV_TRANS_PREPARE, _rocker_neigh_add() adds a new entry to
      the neigh table.
      
      And the second time  rocker_port_ipv4_nh() is called, with
      trans == SWITCHDEV_TRANS_COMMIT, that entry is found. This causes
      rocker_port_ipv4_nh() to believe it is not adding an entry and thus it
      frees "entry", which is still present in rocker driver's neigh table.
      
      This problem does not appear to affect deletion as my analysis is that
      deletion is always performed with trans == SWITCHDEV_TRANS_NONE.
      
      For completeness _rocker_neigh_{add,del,prepare} are updated not to
      manipulate fib table entries if trans == SWITCHDEV_TRANS_PREPARE.
      
      Fixes: c4f20321 ("rocker: support prepare-commit transaction model")
      Reported-by: NToshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Acked-by: NScott Feldman <sfeldma@gmail.com>
      Acked-by: NJiri Pirko <jiri@resnulli.us>
      Signed-off-by: NSimon Horman <simon.horman@netronome.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      550ecc92
    • S
      rocker: do not modify fdb table in rocker_port_fdb() when preparing transactions · 42e94889
      Simon Horman 提交于
      rocker_port_fdb_flush() may be called be called with
      trans == SWITCHDEV_TRANS_PREPARE and then trans == SWITCHDEV_TRANS_COMMIT from
      switchdev_port_attr_set() via switchdev_port_obj_add().
      
      Adding the new entry to the FDB table when trans == SWITCHDEV_TRANS_PREPARE
      may result in a memory leak because when trans == SWITCHDEV_TRANS_PREPARE
      rocker_flow_tbl_bridge() will allocate memory when called via
      rocker_port_fdb_learn(). However, when trans == SWITCHDEV_TRANS_COMMIT
      the presence of the FDB entry in the FDB table causes
      rocker_port_fdb() to set the ROCKER_OP_FLAG_REFRESH flag which results
      in rocker_port_fdb_learn() skipping the call to rocker_flow_tbl_bridge()
      which would free the memory allocated by it when
      trans == SWITCHDEV_TRANS_PREPARE.
      
      ip link add br0 type bridge
      ip link set up dev eth0
      ip link set dev eth0 master br0
      bridge fdb add 52:54:00:12:35:08 dev eth0
      bridge fdb add 52:54:00:12:35:09 dev eth0
      [    2.600730] ------------[ cut here ]------------
      [    2.601002] kernel BUG at drivers/net/ethernet/rocker/rocker.c:4369!
      [    2.601373] invalid opcode: 0000 [#1] SMP
      [    2.601963] Modules linked in:
      [    2.602355] CPU: 0 PID: 64 Comm: bridge Not tainted 4.1.0-rc3-01048-g6d0f50c50211-dirty #1075
      [    2.602721] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.0-0-g4c59f5d-20150219_092859-nilsson.home.kraxel.org 04/01/2014
      [    2.602721] task: ffff880019facef0 ti: ffff88001f96c000 task.ti: ffff88001f96c000
      [    2.602721] RIP: 0010:[<ffffffff811f1470>]  [<ffffffff811f1470>] rocker_port_obj_add+0x150/0x160
      [    2.602721] RSP: 0018:ffff88001f96fa98  EFLAGS: 00000212
      [    2.602721] RAX: ffff880019d4fa68 RBX: ffff88001f96fb18 RCX: 0000000000000000
      [    2.602721] RDX: ffff880019d4f000 RSI: ffff88001f96fb18 RDI: ffff880019d4f000
      [    2.602721] RBP: 0000000000000001 R08: 0000000000000000 R09: ffff88001f904620
      [    2.602721] R10: ffff88001f96fb60 R11: ffff880019e9d100 R12: ffff88001f96fb18
      [    2.602721] R13: ffff880019d4f680 R14: ffff88001f904610 R15: ffff8800198f7b80
      [    2.602721] FS:  00007f3eee917700(0000) GS:ffff88001b000000(0000) knlGS:0000000000000000
      [    2.602721] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [    2.602721] CR2: 00007f3eee4a15cb CR3: 000000001f933000 CR4: 00000000000006b0
      [    2.602721] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [    2.602721] DR3: 0000000000000000 DR6: 0000000000000000 DR7: 0000000000000000
      [    2.602721] Stack:
      [    2.602721]  0000000000000000 ffff88001f96fb18 ffff880019d4f000 ffff88001f96fb18
      [    2.602721]  ffff880019d4f000 ffffffff81332105 ffff88001f96fb50 ffffffff814464c0
      [    2.602721]  ffff88001f96fb18 ffff88001f904600 ffff880019d4f000 ffffffff813326e5
      [    2.602721] Call Trace:
      [    2.602721]  [<ffffffff81332105>] ? __switchdev_port_obj_add+0x25/0x90
      [    2.602721]  [<ffffffff813326e5>] ? switchdev_port_obj_add+0x25/0xc0
      [    2.602721]  [<ffffffff813327b1>] ? switchdev_port_fdb_add+0x31/0x40
      [    2.602721]  [<ffffffff8123911f>] ? rtnl_fdb_add+0xff/0x1e0
      [    2.602721]  [<ffffffff81237d8e>] ? rtnetlink_rcv_msg+0x7e/0x250
      [    2.602721]  [<ffffffff8121d1ce>] ? __skb_recv_datagram+0xfe/0x4b0
      [    2.602721]  [<ffffffff81237d10>] ? rtnetlink_rcv+0x30/0x30
      [    2.602721]  [<ffffffff81247958>] ? netlink_rcv_skb+0xa8/0xd0
      [    2.602721]  [<ffffffff81237cff>] ? rtnetlink_rcv+0x1f/0x30
      [    2.602721]  [<ffffffff81247220>] ? netlink_unicast+0x150/0x200
      [    2.602721]  [<ffffffff81247714>] ? netlink_sendmsg+0x374/0x3e0
      [    2.602721]  [<ffffffff8120f8df>] ? sock_sendmsg+0xf/0x30
      [    2.602721]  [<ffffffff8120ffd3>] ? ___sys_sendmsg+0x1f3/0x200
      [    2.602721]  [<ffffffff812100e5>] ? ___sys_recvmsg+0x105/0x140
      [    2.602721]  [<ffffffff810a36f0>] ? SyS_readahead+0x90/0x90
      [    2.602721]  [<ffffffff81098dfd>] ? filemap_map_pages+0x1ed/0x210
      [    2.602721]  [<ffffffff810b77fc>] ? handle_mm_fault+0x5fc/0xe50
      [    2.602721]  [<ffffffff81210ef9>] ? __sys_sendmsg+0x39/0x70
      [    2.602721]  [<ffffffff8133ce17>] ? system_call_fastpath+0x12/0x6a
      [    2.602721] Code: b7 8f a0 06 00 00 48 83 bf 88 06 00 00 00 74 1d 48 83 c4 08 89 ee 4c 89 ef 5b 5d 41 5c 41 5d 0f b7 c9 45 31 c0 e9 51 db ff ff 90 <0f> 0b b8 ea ff ff ff e9 cf fe ff ff 0f 1f 40 00 41 57 41 56 b9
      [    2.602721] RIP  [<ffffffff811f1470>] rocker_port_obj_add+0x150/0x160
      [    2.602721]  RSP <ffff88001f96fa98>
      [    2.615848] ---[ end trace 4f7b4f1c98077108 ]---
      
      The above is resolved by not adding the new FDB entry to the FDB table
      if trans == SWITCHDEV_TRANS_PREPARE.
      
      For symmetry this patch also skips deleting FDB entries from the FDB
      table trans == SWITCHDEV_TRANS_PREPARE. However, my analysis is that
      this never occurs as trans is always SWITCHDEV_TRANS_NONE when removing
      FDB entries.
      
      Fixes: c4f20321 ("rocker: support prepare-commit transaction model")
      Acked-by: NScott Feldman <sfeldma@gmail.com>
      Acked-by: NJiri Pirko <jiri@resnulli.us>
      Signed-off-by: NSimon Horman <simon.horman@netronome.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      42e94889
    • S
      rocker: do not delete fdb entries in rocker_port_fdb_flush() when preparing transactions · 3098ac39
      Simon Horman 提交于
      rocker_port_fdb_flush() is called by rocker_port_stp_update() which in
      turn may be called with trans == SWITCHDEV_TRANS_PREPARE and then
      trans == SWITCHDEV_TRANS_COMMIT from switchdev_port_attr_set() via
      br_set_state().
      
      When rocker_port_fdb_flush() is called with trans == SWITCHDEV_TRANS_PREPARE
      it calls rocker_port_fdb_learn() for each entry in the FDB table which in
      turn calls rocker_flow_tbl_bridge() which will allocate memory using
      rocker_port_kzalloc(). rocker_port_fdb_learn() will then remove the entry
      from the FDB table.
      
      Then when rocker_port_fdb_learn() is called with
      trans == SWITCHDEV_TRANS_PREPARE no calls are made to rocker_port_fdb_learn()
      because there are no longer any entries present in the FDB table. Thus the
      memory previously allocated by rocker_port_fdb_learn() is leaked resulting
      in the kernel BUG() below.
      
      Furthermore, it looks like the driver ends up with an incorrect view of the
      fdb table as the FDB entries are purged from the driver's table but not the
      hardware's table.
      
      ip link add br0 type bridge
      ip link set up dev eth0
      sleep 1
      ip link set dev eth0 master br0
      [    3.704360] ------------[ cut here ]------------
      [    3.704611] kernel BUG at drivers/net/ethernet/rocker/rocker.c:4289!
      [    3.704962] invalid opcode: 0000 [#1] SMP
      [    3.705537] Modules linked in:
      [    3.705919] CPU: 0 PID: 63 Comm: ip Not tainted 4.1.0-rc3-01046-gb9fbe709 #1044
      [    3.706191] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.0-0-g4c59f5d-20150219_092859-nilsson.home.kraxel.org 04/01/2014
      [    3.706820] task: ffff880019f70150 ti: ffff88001f92c000 task.ti: ffff88001f92c000
      [    3.707138] RIP: 0010:[<ffffffff811f0080>]  [<ffffffff811f0080>] rocker_port_attr_set+0xe0/0xf0
      [    3.707990] RSP: 0018:ffff88001f92f808  EFLAGS: 00000212
      [    3.708200] RAX: ffff880019d4fa68 RBX: ffff880019d4f000 RCX: 0000000000000000
      [    3.708471] RDX: 000000000000000c RSI: ffff88001f92f890 RDI: ffff880019d4f680
      [    3.708740] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000004
      [    3.708999] R10: ffff880000034024 R11: 0000000000000000 R12: ffff88001f92f890
      [    3.709276] R13: ffff88001f8f1c00 R14: 000000000000000b R15: 0000000000000000
      [    3.709303] FS:  00007f8ab66bd700(0000) GS:ffff88001b000000(0000) knlGS:0000000000000000
      [    3.709303] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [    3.709303] CR2: 0000000000654988 CR3: 000000001f8f3000 CR4: 00000000000006b0
      [    3.709303] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [    3.709303] DR3: 0000000000000000 DR6: 0000000000000000 DR7: 0000000000000000
      [    3.709303] Stack:
      [    3.709303]  ffff88001f8f1c00 000000000000000b ffff88001f92f890 ffff880019d4f000
      [    3.709303]  ffff88001f92f890 ffffffff813332f5 ffff88001f92f880 0000000000000000
      [    3.709303]  ffff88001f92f890 0000000000000001 ffff880019d4f000 ffffffff81333627
      [    3.709303] Call Trace:
      [    3.709303]  [<ffffffff813332f5>] ? __switchdev_port_attr_set+0x25/0x90
      [    3.709303]  [<ffffffff81333627>] ? switchdev_port_attr_set+0x27/0x120
      [    3.709303]  [<ffffffff81318e86>] ? br_set_state+0x36/0x50
      [    3.709303]  [<ffffffff8131795c>] ? br_add_if+0x37c/0x400
      [    3.709303]  [<ffffffff81238ce1>] ? do_setlink+0x7e1/0x800
      [    3.709303]  [<ffffffff8111f980>] ? radix_tree_lookup_slot+0x10/0x30
      [    3.709303]  [<ffffffff81136fba>] ? nla_parse+0xaa/0x110
      [    3.709303]  [<ffffffff81239c98>] ? rtnl_newlink+0x548/0x870
      [    3.709303]  [<ffffffff8111f900>] ? __radix_tree_lookup+0x40/0xb0
      [    3.709303]  [<ffffffff81136f3e>] ? nla_parse+0x2e/0x110
      [    3.709303]  [<ffffffff81237d7e>] ? rtnetlink_rcv_msg+0x7e/0x250
      [    3.709303]  [<ffffffff8121d1be>] ? __skb_recv_datagram+0xfe/0x4b0
      [    3.709303]  [<ffffffff81237d00>] ? rtnetlink_rcv+0x30/0x30
      [    3.709303]  [<ffffffff81247948>] ? netlink_rcv_skb+0xa8/0xd0
      [    3.709303]  [<ffffffff81237cef>] ? rtnetlink_rcv+0x1f/0x30
      [    3.709303]  [<ffffffff81247210>] ? netlink_unicast+0x150/0x200
      [    3.709303]  [<ffffffff81247704>] ? netlink_sendmsg+0x374/0x3e0
      [    3.709303]  [<ffffffff8120f8cf>] ? sock_sendmsg+0xf/0x30
      [    3.709303]  [<ffffffff8120ffc3>] ? ___sys_sendmsg+0x1f3/0x200
      [    3.709303]  [<ffffffff812100d5>] ? ___sys_recvmsg+0x105/0x140
      [    3.709303]  [<ffffffff812228d9>] ? dev_get_by_name_rcu+0x69/0x90
      [    3.709303]  [<ffffffff812228d9>] ? dev_get_by_name_rcu+0x69/0x90
      [    3.709303]  [<ffffffff81217b7d>] ? skb_dequeue+0x4d/0x60
      [    3.709303]  [<ffffffff81217bb0>] ? skb_queue_purge+0x20/0x30
      [    3.709303]  [<ffffffff810ebdcf>] ? __inode_wait_for_writeback+0x5f/0xb0
      [    3.709303]  [<ffffffff810648b0>] ? autoremove_wake_function+0x30/0x30
      [    3.709303]  [<ffffffff81210ee9>] ? __sys_sendmsg+0x39/0x70
      [    3.709303]  [<ffffffff8133e097>] ? system_call_fastpath+0x12/0x6a
      [    3.709303] Code: bb 90 06 00 00 48 c7 04 24 00 00 00 00 45 31 c9 45 31 c0 48 c7 c1 c0 b7 1e 81 89 ea e8 da da ff ff eb 95 0f 1f 84 00 00 00 00 00 <0f> 0b 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 48 83 fe 15 75
      [    3.709303] RIP  [<ffffffff811f0080>] rocker_port_attr_set+0xe0/0xf0
      [    3.709303]  RSP <ffff88001f92f808>
      [    3.721409] ---[ end trace b7481fcb7cb032aa ]---
      Segmentation fault
      
      Fixes: c4f20321 ("rocker: support prepare-commit transaction model")
      Acked-by: NScott Feldman <sfeldma@gmail.com>
      Acked-by: NJiri Pirko <jiri@resnulli.us>
      Signed-off-by: NSimon Horman <simon.horman@netronome.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3098ac39
    • J
      spider_net: Use DECLARE_BITMAP · e26cc7ff
      Joe Perches 提交于
      Use the generic mechanism to declare a bitmap instead of unsigned long.
      Signed-off-by: NJoe Perches <joe@perches.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e26cc7ff
    • D
      Merge branch 'ebpf-tail-call' · 3f55b7ed
      David S. Miller 提交于
      Alexei Starovoitov says:
      
      ====================
      bpf: introduce bpf_tail_call() helper
      
      introduce bpf_tail_call(ctx, &jmp_table, index) helper function
      which can be used from BPF programs like:
      int bpf_prog(struct pt_regs *ctx)
      {
        ...
        bpf_tail_call(ctx, &jmp_table, index);
        ...
      }
      that is roughly equivalent to:
      int bpf_prog(struct pt_regs *ctx)
      {
        ...
        if (jmp_table[index])
          return (*jmp_table[index])(ctx);
        ...
      }
      The important detail that it's not a normal call, but a tail call.
      The kernel stack is precious, so this helper reuses the current
      stack frame and jumps into another BPF program without adding
      extra call frame.
      It's trivially done in interpreter and a bit trickier in JITs.
      
      Use cases:
      - simplify complex programs
      - dispatch into other programs
        (for example: index in jump table can be syscall number or network protocol)
      - build dynamic chains of programs
      
      The chain of tail calls can form unpredictable dynamic loops therefore
      tail_call_cnt is used to limit the number of calls and currently is set to 32.
      
      patch 1 - support bpf_tail_call() in interpreter
      patch 2 - support in x64 JIT
      We've discussed what's neccessary to support it in arm64/s390 JITs
      and it looks fine.
      patch 3 - sample example for tracing
      patch 4 - sample example for networking
      
      More details in every patch.
      
      This set went through several iterations of reviews/fixes and older
      attempts can be seen:
      https://git.kernel.org/cgit/linux/kernel/git/ast/bpf.git/log/?h=tail_call_v[123456]
      - tail_call_v1 does it without touching JITs but introduces overhead
        for all programs that don't use this helper function.
      - tail_call_v2 still has some overhead and x64 JIT does full stack
        unwind (prologue skipping optimization wasn't there)
      - tail_call_v3 reuses 'call' instruction encoding and has interpreter
        overhead for every normal call
      - tail_call_v4 fixes above architectural shortcomings and v5,v6 fix few
        more bugs
      
      This last tail_call_v6 approach seems to be the best.
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3f55b7ed
    • A
      samples/bpf: bpf_tail_call example for networking · 530b2c86
      Alexei Starovoitov 提交于
      Usage:
      $ sudo ./sockex3
      IP     src.port -> dst.port               bytes      packets
      127.0.0.1.42010 -> 127.0.0.1.12865         1568            8
      127.0.0.1.59526 -> 127.0.0.1.33778     11422636       173070
      127.0.0.1.33778 -> 127.0.0.1.59526  11260224828       341974
      127.0.0.1.12865 -> 127.0.0.1.42010         1832           12
      IP     src.port -> dst.port               bytes      packets
      127.0.0.1.42010 -> 127.0.0.1.12865         1568            8
      127.0.0.1.59526 -> 127.0.0.1.33778     23198092       351486
      127.0.0.1.33778 -> 127.0.0.1.59526  22972698518       698616
      127.0.0.1.12865 -> 127.0.0.1.42010         1832           12
      
      this example is similar to sockex2 in a way that it accumulates per-flow
      statistics, but it does packet parsing differently.
      sockex2 inlines full packet parser routine into single bpf program.
      This sockex3 example have 4 independent programs that parse vlan, mpls, ip, ipv6
      and one main program that starts the process.
      bpf_tail_call() mechanism allows each program to be small and be called
      on demand potentially multiple times, so that many vlan, mpls, ip in ip,
      gre encapsulations can be parsed. These and other protocol parsers can
      be added or removed at runtime. TLVs can be parsed in similar manner.
      Note, tail_call_cnt dynamic check limits the number of tail calls to 32.
      Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      530b2c86
    • A
      samples/bpf: bpf_tail_call example for tracing · 5bacd780
      Alexei Starovoitov 提交于
      kprobe example that demonstrates how future seccomp programs may look like.
      It attaches to seccomp_phase1() function and tail-calls other BPF programs
      depending on syscall number.
      
      Existing optimized classic BPF seccomp programs generated by Chrome look like:
      if (sd.nr < 121) {
        if (sd.nr < 57) {
          if (sd.nr < 22) {
            if (sd.nr < 7) {
              if (sd.nr < 4) {
                if (sd.nr < 1) {
                  check sys_read
                } else {
                  if (sd.nr < 3) {
                    check sys_write and sys_open
                  } else {
                    check sys_close
                  }
                }
              } else {
            } else {
          } else {
        } else {
      } else {
      }
      
      the future seccomp using native eBPF may look like:
        bpf_tail_call(&sd, &syscall_jmp_table, sd.nr);
      which is simpler, faster and leaves more room for per-syscall checks.
      
      Usage:
      $ sudo ./tracex5
      <...>-366   [001] d...     4.870033: : read(fd=1, buf=00007f6d5bebf000, size=771)
      <...>-369   [003] d...     4.870066: : mmap
      <...>-369   [003] d...     4.870077: : syscall=110 (one of get/set uid/pid/gid)
      <...>-369   [003] d...     4.870089: : syscall=107 (one of get/set uid/pid/gid)
         sh-369   [000] d...     4.891740: : read(fd=0, buf=00000000023d1000, size=512)
         sh-369   [000] d...     4.891747: : write(fd=1, buf=00000000023d3000, size=512)
         sh-369   [000] d...     4.891747: : read(fd=1, buf=00000000023d3000, size=512)
      Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5bacd780
    • A
      x86: bpf_jit: implement bpf_tail_call() helper · b52f00e6
      Alexei Starovoitov 提交于
      bpf_tail_call() arguments:
      ctx - context pointer
      jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table
      index - index in the jump table
      
      In this implementation x64 JIT bypasses stack unwind and jumps into the
      callee program after prologue, so the callee program reuses the same stack.
      
      The logic can be roughly expressed in C like:
      
      u32 tail_call_cnt;
      
      void *jumptable[2] = { &&label1, &&label2 };
      
      int bpf_prog1(void *ctx)
      {
      label1:
          ...
      }
      
      int bpf_prog2(void *ctx)
      {
      label2:
          ...
      }
      
      int bpf_prog1(void *ctx)
      {
          ...
          if (tail_call_cnt++ < MAX_TAIL_CALL_CNT)
              goto *jumptable[index]; ... and pass my 'ctx' to callee ...
      
          ... fall through if no entry in jumptable ...
      }
      
      Note that 'skip current program epilogue and next program prologue' is
      an optimization. Other JITs don't have to do it the same way.
      >From safety point of view it's valid as well, since programs always
      initialize the stack before use, so any residue in the stack left by
      the current program is not going be read. The same verifier checks are
      done for the calls from the kernel into all bpf programs.
      Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b52f00e6
    • A
      bpf: allow bpf programs to tail-call other bpf programs · 04fd61ab
      Alexei Starovoitov 提交于
      introduce bpf_tail_call(ctx, &jmp_table, index) helper function
      which can be used from BPF programs like:
      int bpf_prog(struct pt_regs *ctx)
      {
        ...
        bpf_tail_call(ctx, &jmp_table, index);
        ...
      }
      that is roughly equivalent to:
      int bpf_prog(struct pt_regs *ctx)
      {
        ...
        if (jmp_table[index])
          return (*jmp_table[index])(ctx);
        ...
      }
      The important detail that it's not a normal call, but a tail call.
      The kernel stack is precious, so this helper reuses the current
      stack frame and jumps into another BPF program without adding
      extra call frame.
      It's trivially done in interpreter and a bit trickier in JITs.
      In case of x64 JIT the bigger part of generated assembler prologue
      is common for all programs, so it is simply skipped while jumping.
      Other JITs can do similar prologue-skipping optimization or
      do stack unwind before jumping into the next program.
      
      bpf_tail_call() arguments:
      ctx - context pointer
      jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table
      index - index in the jump table
      
      Since all BPF programs are idenitified by file descriptor, user space
      need to populate the jmp_table with FDs of other BPF programs.
      If jmp_table[index] is empty the bpf_tail_call() doesn't jump anywhere
      and program execution continues as normal.
      
      New BPF_MAP_TYPE_PROG_ARRAY map type is introduced so that user space can
      populate this jmp_table array with FDs of other bpf programs.
      Programs can share the same jmp_table array or use multiple jmp_tables.
      
      The chain of tail calls can form unpredictable dynamic loops therefore
      tail_call_cnt is used to limit the number of calls and currently is set to 32.
      
      Use cases:
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      
      ==========
      - simplify complex programs by splitting them into a sequence of small programs
      
      - dispatch routine
        For tracing and future seccomp the program may be triggered on all system
        calls, but processing of syscall arguments will be different. It's more
        efficient to implement them as:
        int syscall_entry(struct seccomp_data *ctx)
        {
           bpf_tail_call(ctx, &syscall_jmp_table, ctx->nr /* syscall number */);
           ... default: process unknown syscall ...
        }
        int sys_write_event(struct seccomp_data *ctx) {...}
        int sys_read_event(struct seccomp_data *ctx) {...}
        syscall_jmp_table[__NR_write] = sys_write_event;
        syscall_jmp_table[__NR_read] = sys_read_event;
      
        For networking the program may call into different parsers depending on
        packet format, like:
        int packet_parser(struct __sk_buff *skb)
        {
           ... parse L2, L3 here ...
           __u8 ipproto = load_byte(skb, ... offsetof(struct iphdr, protocol));
           bpf_tail_call(skb, &ipproto_jmp_table, ipproto);
           ... default: process unknown protocol ...
        }
        int parse_tcp(struct __sk_buff *skb) {...}
        int parse_udp(struct __sk_buff *skb) {...}
        ipproto_jmp_table[IPPROTO_TCP] = parse_tcp;
        ipproto_jmp_table[IPPROTO_UDP] = parse_udp;
      
      - for TC use case, bpf_tail_call() allows to implement reclassify-like logic
      
      - bpf_map_update_elem/delete calls into BPF_MAP_TYPE_PROG_ARRAY jump table
        are atomic, so user space can build chains of BPF programs on the fly
      
      Implementation details:
      =======================
      - high performance of bpf_tail_call() is the goal.
        It could have been implemented without JIT changes as a wrapper on top of
        BPF_PROG_RUN() macro, but with two downsides:
        . all programs would have to pay performance penalty for this feature and
          tail call itself would be slower, since mandatory stack unwind, return,
          stack allocate would be done for every tailcall.
        . tailcall would be limited to programs running preempt_disabled, since
          generic 'void *ctx' doesn't have room for 'tail_call_cnt' and it would
          need to be either global per_cpu variable accessed by helper and by wrapper
          or global variable protected by locks.
      
        In this implementation x64 JIT bypasses stack unwind and jumps into the
        callee program after prologue.
      
      - bpf_prog_array_compatible() ensures that prog_type of callee and caller
        are the same and JITed/non-JITed flag is the same, since calling JITed
        program from non-JITed is invalid, since stack frames are different.
        Similarly calling kprobe type program from socket type program is invalid.
      
      - jump table is implemented as BPF_MAP_TYPE_PROG_ARRAY to reuse 'map'
        abstraction, its user space API and all of verifier logic.
        It's in the existing arraymap.c file, since several functions are
        shared with regular array map.
      Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      04fd61ab
    • D
      net: dev: reduce both ingress hook ifdefs · e7582bab
      Daniel Borkmann 提交于
      Reduce ifdef pollution slightly, no functional change. We can simply
      remove the extra alternative definition of handle_ing() and nf_ingress().
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e7582bab
    • E
      tcp: add a force_schedule argument to sk_stream_alloc_skb() · eb934478
      Eric Dumazet 提交于
      In commit 8e4d980a ("tcp: fix behavior for epoll edge trigger")
      we fixed a possible hang of TCP sockets under memory pressure,
      by allowing sk_stream_alloc_skb() to use sk_forced_mem_schedule()
      if no packet is in socket write queue.
      
      It turns out there are other cases where we want to force memory
      schedule :
      
      tcp_fragment() & tso_fragment() need to split a big TSO packet into
      two smaller ones. If we block here because of TCP memory pressure,
      we can effectively block TCP socket from sending new data.
      If no further ACK is coming, this hang would be definitive, and socket
      has no chance to effectively reduce its memory usage.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      eb934478
    • E
      neigh: Better handling of transition to NUD_PROBE state · 765c9c63
      Erik Kline 提交于
      [1] When entering NUD_PROBE state via neigh_update(), perhaps received
          from userspace, correctly (re)initialize the probes count to zero.
      
          This is useful for forcing revalidation of a neighbor (for example
          if the host is attempting to do DNA [IPv4 4436, IPv6 6059]).
      
      [2] Notify listeners when a neighbor goes into NUD_PROBE state.
      
          By sending notifications on entry to NUD_PROBE state listeners get
          more timely warnings of imminent connectivity issues.
      
          The current notifications on entry to NUD_STALE have somewhat
          limited usefulness: NUD_STALE is a perfectly normal state, as is
          NUD_DELAY, whereas notifications on entry to NUD_FAILURE come after
          a neighbor reachability problem has been confirmed (typically after
          three probes).
      Signed-off-by: NErik Kline <ek@google.com>
      Acked-By: NLorenzo Colitti <lorenzo@google.com>
      Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      765c9c63
  2. 20 5月, 2015 9 次提交
  3. 19 5月, 2015 8 次提交
    • D
      Merge branch 'icmp_frag' · 76d7c457
      David S. Miller 提交于
      Andy Zhou says:
      
      ====================
      fragmentation ICMP
      
      Currently, we send ICMP packets when errors occur during fragmentation or
      de-fragmentation.  However, it is a bug when sending those ICMP packets
      in the context of using netfilter for bridging.
      
      Those ICMP packets are only expected in the context of routing, not in
      bridging mode.
      
      The local stack is not involved in bridging forward decisions, thus
      should be not used for deciding the reverse path for those ICMP messages.
      
      This bug only affects IPV4, not in IPv6.
      
      v1->v2:  restructure the patches into two patches that fix defragmentation and
               fragmentation respectively.
      
      	 A bit is add in IPCB to control whether ICMP packet should be
      	 generated for defragmentation.
      
      	 Fragmentation ICMP is now removed by restructuring the
      	 ip_fragment() API.
      
      v2->v3:  Add droping icmp for bridging contrack users
               drop exporting ip_fragment() API.
      
      v3->v4:  Remove unnecessary parentheses in 'return' statements
      
      v4->v5:  Drop the patch that sets and checks a bit in IPCB
               that prevents ip_defrag to send ICMP.
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      76d7c457
    • A
      bridge_netfilter: No ICMP packet on IPv4 fragmentation error · 49d16b23
      Andy Zhou 提交于
      When bridge netfilter re-fragments an IP packet for output, all
      packets that can not be re-fragmented to their original input size
      should be silently discarded.
      
      However, current bridge netfilter output path generates an ICMP packet
      with 'size exceeded MTU' message for such packets, this is a bug.
      
      This patch refactors the ip_fragment() API to allow two separate
      use cases. The bridge netfilter user case will not
      send ICMP, the routing output will, as before.
      Signed-off-by: NAndy Zhou <azhou@nicira.com>
      Acked-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      49d16b23
    • A
      IPv4: skip ICMP for bridge contrack users when defrag expires · 8bc04864
      Andy Zhou 提交于
      users in [IP_DEFRAG_CONNTRACK_BRIDGE_IN, __IP_DEFRAG_CONNTRACK_BR_IN]
      should not ICMP message also.
      Reported-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NAndy Zhou <azhou@nicira.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8bc04864
    • A
      ipv4: introduce frag_expire_skip_icmp() · 5cf42280
      Andy Zhou 提交于
      Improve readability of skip ICMP for de-fragmentation expiration logic.
      This change will also make the logic easier to maintain when the
      following patches in this series are applied.
      Signed-off-by: NAndy Zhou <azhou@nicira.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5cf42280
    • W
      selftests/net: expect headroom in psock_fanout rollover · a2ad5d2a
      Willem de Bruijn 提交于
      psock_fanout tests the various fanout modes. Change the test for
      rollover mode to expect early rollover due to socket pressure
      as implemented in 2ccdbaa6 ("packet: rollover lock contention
      avoidance").
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a2ad5d2a
    • E
      sfc: nicer log message on Siena SR-IOV probe fail · 7e91a210
      Edward Cree 提交于
      We expect that MC_CMD_SRIOV will fail if the card has no VFs configured.
      So output a readable message instead of a cryptic MCDI error.
      Signed-off-by: NEdward Cree <ecree@solarflare.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7e91a210
    • D
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next · 0bc4c070
      David S. Miller 提交于
      Pablo Neira Ayuso says:
      
      ====================
      Netfilter updates for net-next
      
      The following patchset contains Netfilter updates for net-next. Briefly
      speaking, cleanups and minor fixes for ipset from Jozsef Kadlecsik and
      Serget Popovich, more incremental updates to make br_netfilter a better
      place from Florian Westphal, ARP support to the x_tables mark match /
      target from and context Zhang Chunyu and the addition of context to know
      that the x_tables runs through nft_compat. More specifically, they are:
      
      1) Fix sparse warning in ipset/ip_set_hash_ipmark.c when fetching the
         IPSET_ATTR_MARK netlink attribute, from Jozsef Kadlecsik.
      
      2) Rename STREQ macro to STRNCMP in ipset, also from Jozsef.
      
      3) Use skb->network_header to calculate the transport offset in
         ip_set_get_ip{4,6}_port(). From Alexander Drozdov.
      
      4) Reduce memory consumption per element due to size miscalculation,
         this patch and follow up patches from Sergey Popovich.
      
      5) Expand nomatch field from 1 bit to 8 bits to allow to simplify
         mtype_data_reset_flags(), also from Sergey.
      
      6) Small clean for ipset macro trickery.
      
      7) Fix error reporting when both ip_set_get_hostipaddr4() and
         ip_set_get_extensions() from per-set uadt functions.
      
      8) Simplify IPSET_ATTR_PORT netlink attribute validation.
      
      9) Introduce HOST_MASK instead of hardcoded 32 in ipset.
      
      10) Return true/false instead of 0/1 in functions that return boolean
          in the ipset code.
      
      11) Validate maximum length of the IPSET_ATTR_COMMENT netlink attribute.
      
      12) Allow to dereference from ext_*() ipset macros.
      
      13) Get rid of incorrect definitions of HKEY_DATALEN.
      
      14) Include linux/netfilter/ipset/ip_set.h in the x_tables set match.
      
      15) Reduce nf_bridge_info size in br_netfilter, from Florian Westphal.
      
      16) Release nf_bridge_info after POSTROUTING since this is only needed
          from the physdev match, also from Florian.
      
      17) Reduce size of ipset code by deinlining ip_set_put_extensions(),
          from Denys Vlasenko.
      
      18) Oneliner to add ARP support to the x_tables mark match/target, from
          Zhang Chunyu.
      
      19) Add context to know if the x_tables extension runs from nft_compat,
          to address minor problems with three existing extensions.
      
      20) Correct return value in several seqfile *_show() functions in the
          netfilter tree, from Joe Perches.
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0bc4c070
    • D
      Merge branch 'qeth-next' · 17032ae3
      David S. Miller 提交于
      Ursula Braun says:
      
      ====================
      s390: network patches for net-next
      
      here are s390 related patches for net-next
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      17032ae3