1. 01 8月, 2015 3 次提交
    • C
      gianfar: Fix suspend/resume for wol magic packet · 614b4242
      Claudiu Manoil 提交于
      If we disable NAPI in the first place we can mask the device's
      interrupts (and halt it) without fearing that imask may be
      concurrently accessed from interrupt context, so there's
      no need to do local_irq_save() around gfar_halt_nodisable().
      lock_rx_qs()/unlock_tx_qs() are just obsoleted and potentially
      buggy routines.  The txlock is currently used in the driver only
      to manage TX congestion, it has nothing to do with halting the
      device.  With these changes, the TX processing is stopped before
      gfar_halt().
      
      Compact gfar_halt() is used instead of gfar_halt_nodisable(),
      as it disables Rx/TX DMA h/w blocks and the Rx/TX h/w queues.
      gfar_start() re-enables all these blocks on resume.  Enabling
      the magic-packet mode remains the same, note that the RX block
      is re-enabled just before entering sleep mode.
      
      Add IRQF_NO_SUSPEND flag for the error interrupt line, to signal
      that the interrupt line must remain active during sleep in order
      to wake the system by magic packet (MAG) reception interrupt.
      (On some systems the MAG interrupt did trigger w/o this flag
      as well, but on others it didn't.)
      
      Without these fixes, when suspended during fair Tx traffic the
      interface occasionally failed to be woken up by magic packet.
      Signed-off-by: NClaudiu Manoil <claudiu.manoil@freescale.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      614b4242
    • C
      gianfar: Fix warning when CONFIG_PM off · 84868305
      Claudiu Manoil 提交于
      CC      drivers/net/ethernet/freescale/gianfar.o
      drivers/net/ethernet/freescale/gianfar.c:568:13: warning: 'lock_tx_qs'
      defined but not used [-Wunused-function]
       static void lock_tx_qs(struct gfar_private *priv)
                   ^
      drivers/net/ethernet/freescale/gianfar.c:576:13: warning: 'unlock_tx_qs'
      defined but not used [-Wunused-function]
       static void unlock_tx_qs(struct gfar_private *priv)
                   ^
      Reported-by: NScott Wood <scottwood@freescale.com>
      Signed-off-by: NClaudiu Manoil <claudiu.manoil@freescale.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      84868305
    • W
      act_pedit: check binding before calling tcf_hash_release() · 5175f710
      WANG Cong 提交于
      When we share an action within a filter, the bind refcnt
      should increase, therefore we should not call tcf_hash_release().
      
      Fixes: 1a29321e ("net_sched: act: Dont increment refcnt on replace")
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: NCong Wang <cwang@twopensource.com>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5175f710
  2. 31 7月, 2015 5 次提交
    • S
      net: sk_clone_lock() should only do get_net() if the parent is not a kernel socket · 8a681736
      Sowmini Varadhan 提交于
      The newsk returned by sk_clone_lock should hold a get_net()
      reference if, and only if, the parent is not a kernel socket
      (making this similar to sk_alloc()).
      
      E.g,. for the SYN_RECV path, tcp_v4_syn_recv_sock->..inet_csk_clone_lock
      sets up the syn_recv newsk from sk_clone_lock. When the parent (listen)
      socket is a kernel socket (defined in sk_alloc() as having
      sk_net_refcnt == 0), then the newsk should also have a 0 sk_net_refcnt
      and should not hold a get_net() reference.
      
      Fixes: 26abe143 ("net: Modify sk_alloc to not reference count the
            netns of kernel sockets.")
      Acked-by: NEric Dumazet <edumazet@google.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: NSowmini Varadhan <sowmini.varadhan@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8a681736
    • D
      net: sched: fix refcount imbalance in actions · 28e6b67f
      Daniel Borkmann 提交于
      Since commit 55334a5d ("net_sched: act: refuse to remove bound action
      outside"), we end up with a wrong reference count for a tc action.
      
      Test case 1:
      
        FOO="1,6 0 0 4294967295,"
        BAR="1,6 0 0 4294967294,"
        tc filter add dev foo parent 1: bpf bytecode "$FOO" flowid 1:1 \
           action bpf bytecode "$FOO"
        tc actions show action bpf
          action order 0: bpf bytecode '1,6 0 0 4294967295' default-action pipe
          index 1 ref 1 bind 1
        tc actions replace action bpf bytecode "$BAR" index 1
        tc actions show action bpf
          action order 0: bpf bytecode '1,6 0 0 4294967294' default-action pipe
          index 1 ref 2 bind 1
        tc actions replace action bpf bytecode "$FOO" index 1
        tc actions show action bpf
          action order 0: bpf bytecode '1,6 0 0 4294967295' default-action pipe
          index 1 ref 3 bind 1
      
      Test case 2:
      
        FOO="1,6 0 0 4294967295,"
        tc filter add dev foo parent 1: bpf bytecode "$FOO" flowid 1:1 action ok
        tc actions show action gact
          action order 0: gact action pass
          random type none pass val 0
           index 1 ref 1 bind 1
        tc actions add action drop index 1
          RTNETLINK answers: File exists [...]
        tc actions show action gact
          action order 0: gact action pass
           random type none pass val 0
           index 1 ref 2 bind 1
        tc actions add action drop index 1
          RTNETLINK answers: File exists [...]
        tc actions show action gact
          action order 0: gact action pass
           random type none pass val 0
           index 1 ref 3 bind 1
      
      What happens is that in tcf_hash_check(), we check tcf_common for a given
      index and increase tcfc_refcnt and conditionally tcfc_bindcnt when we've
      found an existing action. Now there are the following cases:
      
        1) We do a late binding of an action. In that case, we leave the
           tcfc_refcnt/tcfc_bindcnt increased and are done with the ->init()
           handler. This is correctly handeled.
      
        2) We replace the given action, or we try to add one without replacing
           and find out that the action at a specific index already exists
           (thus, we go out with error in that case).
      
      In case of 2), we have to undo the reference count increase from
      tcf_hash_check() in the tcf_hash_check() function. Currently, we fail to
      do so because of the 'tcfc_bindcnt > 0' check which bails out early with
      an -EPERM error.
      
      Now, while commit 55334a5d prevents 'tc actions del action ...' on an
      already classifier-bound action to drop the reference count (which could
      then become negative, wrap around etc), this restriction only accounts for
      invocations outside a specific action's ->init() handler.
      
      One possible solution would be to add a flag thus we possibly trigger
      the -EPERM ony in situations where it is indeed relevant.
      
      After the patch, above test cases have correct reference count again.
      
      Fixes: 55334a5d ("net_sched: act: refuse to remove bound action outside")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: NCong Wang <cwang@twopensource.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      28e6b67f
    • D
      Merge branch 'r8152-fixes' · 990c9b34
      David S. Miller 提交于
      Hayes Wang says:
      
      ====================
      r8152: device reset
      
      v3:
      For patch #2, remove cancel_delayed_work().
      
      v2:
      For patch #1, remove usb_autopm_get_interface(), usb_autopm_put_interface(), and
      the checking of intf->condition.
      
      For patch #2, replace the original method with usb_queue_reset_device() to reset
      the device.
      
      v1:
      Although the driver works normally, we find the device may get all 0xff data when
      transmitting packets on certain platforms. It would break the device and no packet
      could be transmitted. The reset is necessary to recover the hw for this situation.
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      990c9b34
    • H
      r8152: reset device when tx timeout · 37608f3e
      hayeswang 提交于
      The device reset is necessary if the hw becomes abnormal and stops
      transmitting packets.
      Signed-off-by: NHayes Wang <hayeswang@realtek.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      37608f3e
    • H
      r8152: add pre_reset and post_reset · e501139a
      hayeswang 提交于
      Add rtl8152_pre_reset() and rtl8152_post_reset() which are used when
      calling usb_reset_device(). The two functions could reduce the time
      of reset when calling usb_reset_device() after probe().
      Signed-off-by: NHayes Wang <hayeswang@realtek.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e501139a
  3. 30 7月, 2015 23 次提交
    • S
      qlcnic: Fix corruption while copying · 15f1bb1f
      Shahed Shaikh 提交于
      Use proper typecasting while performing byte-by-byte copy
      Signed-off-by: NShahed Shaikh <shahed.shaikh@qlogic.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      15f1bb1f
    • D
      act_bpf: fix memory leaks when replacing bpf programs · f4eaed28
      Daniel Borkmann 提交于
      We currently trigger multiple memory leaks when replacing bpf
      actions, besides others:
      
        comm "tc", pid 1909, jiffies 4294851310 (age 1602.796s)
        hex dump (first 32 bytes):
          01 00 00 00 03 00 00 00 00 00 00 00 00 00 00 00  ................
          18 b0 98 6d 00 88 ff ff 00 00 00 00 00 00 00 00  ...m............
        backtrace:
          [<ffffffff817e623e>] kmemleak_alloc+0x4e/0xb0
          [<ffffffff8120a22d>] __vmalloc_node_range+0x1bd/0x2c0
          [<ffffffff8120a37a>] __vmalloc+0x4a/0x50
          [<ffffffff811a8d0a>] bpf_prog_alloc+0x3a/0xa0
          [<ffffffff816c0684>] bpf_prog_create+0x44/0xa0
          [<ffffffffa09ba4eb>] tcf_bpf_init+0x28b/0x3c0 [act_bpf]
          [<ffffffff816d7001>] tcf_action_init_1+0x191/0x1b0
          [<ffffffff816d70a2>] tcf_action_init+0x82/0xf0
          [<ffffffff816d4d12>] tcf_exts_validate+0xb2/0xc0
          [<ffffffffa09b5838>] cls_bpf_modify_existing+0x98/0x340 [cls_bpf]
          [<ffffffffa09b5cd6>] cls_bpf_change+0x1a6/0x274 [cls_bpf]
          [<ffffffff816d56e5>] tc_ctl_tfilter+0x335/0x910
          [<ffffffff816b9145>] rtnetlink_rcv_msg+0x95/0x240
          [<ffffffff816df34f>] netlink_rcv_skb+0xaf/0xc0
          [<ffffffff816b909e>] rtnetlink_rcv+0x2e/0x40
          [<ffffffff816deaaf>] netlink_unicast+0xef/0x1b0
      
      Issue is that the old content from tcf_bpf is allocated and needs
      to be released when we replace it. We seem to do that since the
      beginning of act_bpf on the filter and insns, later on the name as
      well.
      
      Example test case, after patch:
      
        # FOO="1,6 0 0 4294967295,"
        # BAR="1,6 0 0 4294967294,"
        # tc actions add action bpf bytecode "$FOO" index 2
        # tc actions show action bpf
         action order 0: bpf bytecode '1,6 0 0 4294967295' default-action pipe
         index 2 ref 1 bind 0
        # tc actions replace action bpf bytecode "$BAR" index 2
        # tc actions show action bpf
         action order 0: bpf bytecode '1,6 0 0 4294967294' default-action pipe
         index 2 ref 1 bind 0
        # tc actions replace action bpf bytecode "$FOO" index 2
        # tc actions show action bpf
         action order 0: bpf bytecode '1,6 0 0 4294967295' default-action pipe
         index 2 ref 1 bind 0
        # tc actions del action bpf index 2
        [...]
        # echo "scan" > /sys/kernel/debug/kmemleak
        # cat /sys/kernel/debug/kmemleak | grep "comm \"tc\"" | wc -l
        0
      
      Fixes: d23b8ad8 ("tc: add BPF based action")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f4eaed28
    • D
      Merge branch 'thunderx-fixes' · f68b1231
      David S. Miller 提交于
      Aleksey Makarov says:
      
      ====================
      net: thunderx: Misc fixes
      
      Miscellaneous fixes for the ThunderX VNIC driver
      
      All the patches can be applied individually.
      It's ok to drop some if the maintainer feels uncomfortable
      with applying for 4.2.
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f68b1231
    • T
      net: thunderx: Fix for crash while BGX teardown · 60f83c89
      Thanneeru Srinivasulu 提交于
      Cortina phy does not have kernel driver and we don't attach
      device with phy layer for intefaces like XFI, XLAUI etc,
      Hence check for interface type before calling disconnect.
      Signed-off-by: NThanneeru Srinivasulu <tsrinivasulu@caviumnetworks.com>
      Signed-off-by: NAleksey Makarov <aleksey.makarov@caviumnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      60f83c89
    • S
    • S
      net: thunderx: Fix crash when changing rss with mutliple traffic flows · b49087dd
      Sunil Goutham 提交于
      This fixes a crash when changing rss with multiple traffic flows.
      
      While interface teardown, disable tx queues after all NAPI threads
      are done. If done otherwise tx queues might be woken up inside NAPI
      if any CQE_TX are processed.
      Signed-off-by: NSunil Goutham <sgoutham@cavium.com>
      Signed-off-by: NAleksey Makarov <aleksey.makarov@caviumnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b49087dd
    • S
      net: thunderx: Set watchdog timeout value · 3d7a8aaa
      Sunil Goutham 提交于
      If a txq (SQ) remains in stopped state after this timeout its
      considered as stuck and interface is reinited.
      Signed-off-by: NSunil Goutham <sgoutham@cavium.com>
      Signed-off-by: NAleksey Makarov <aleksey.makarov@caviumnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3d7a8aaa
    • S
      net: thunderx: Wakeup TXQ only if CQE_TX are processed · 74840b83
      Sunil Goutham 提交于
      Previously TXQ is wakedup whenever napi is executed
      and irrespective of if any CQE_TX are processed or not.
      Added 'txq_stop' and 'txq_wake' counters to aid in debugging
      if there are any future issues.
      Signed-off-by: NSunil Goutham <sgoutham@cavium.com>
      Signed-off-by: NAleksey Makarov <aleksey.makarov@caviumnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      74840b83
    • S
      net: thunderx: Suppress alloc_pages() failure warnings · f8ce9666
      Sunil Goutham 提交于
      Suppressing standard alloc_pages() warnings. Some kernel configs limit
      alloc size and the network driver may fail. Do not drop a kernel
      warning in this case, instead just drop a oneliner that the network
      driver could not be loaded since the buffer could not be allocated.
      Signed-off-by: NSunil Goutham <sgoutham@cavium.com>
      Signed-off-by: NAleksey Makarov <aleksey.makarov@caviumnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f8ce9666
    • S
      net: thunderx: Fix TSO packet statistic · 2cb468e0
      Sunil Goutham 提交于
      Fixing TSO packages not being counted.
      Signed-off-by: NSunil Goutham <sgoutham@cavium.com>
      Signed-off-by: NAleksey Makarov <aleksey.makarov@caviumnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2cb468e0
    • S
      net: thunderx: Fix memory leak when changing queue count · c62cd3c4
      Sunil Goutham 提交于
      Fix for memory leak when changing queue/channel count via ethtool
      Signed-off-by: NSunil Goutham <sgoutham@cavium.com>
      Signed-off-by: NAleksey Makarov <aleksey.makarov@caviumnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c62cd3c4
    • S
      net: thunderx: Fix RQ_DROP miscalculation · 32c1b965
      Sunil Goutham 提交于
      With earlier configured value sufficient number of CQEs are not
      being reserved for transmitted packets. Hence under heavy incoming
      traffic load, receive notifications will take away most of the CQ
      thus transmit notifications will be lost resulting in tx skbs not
      being freed.
      
      Finally SQ will be full and it will be stopped, watchdog timer
      will kick in. After this fix receive notifications will not take
      morethan half of CQ reserving the rest for transmit notifications.
      
      Also changed CQ & SQ sizes from 16k to 4k.
      This is also due to the receive notifications taking first half of
      CQ under heavy load and time taken by NAPI to clear transmit notifications
      will increase with higher queue sizes. Again results in SQ being stopped.
      Signed-off-by: NSunil Goutham <sgoutham@cavium.com>
      Signed-off-by: NAleksey Makarov <aleksey.makarov@caviumnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      32c1b965
    • S
      net: thunderx: Fix memory leak while tearing down interface · 143ceb0b
      Sunil Goutham 提交于
      Fixed 'tso_hdrs' memory not being freed properly.
      Also fixed SQ skbuff maintenance issues.
      Signed-off-by: NSunil Goutham <sgoutham@cavium.com>
      Signed-off-by: NAleksey Makarov <aleksey.makarov@caviumnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      143ceb0b
    • S
      net: thunderx: Fix data integrity issues with LDWB · 4b561c17
      Sunil Goutham 提交于
      Switching back to LDD transactions from LDWB.
      
      While transmitting packets out with LDWB transactions
      data integrity issues are seen very frequently.
      hence switching back to LDD.
      Signed-off-by: NSunil Goutham <sgoutham@cavium.com>
      Signed-off-by: NRobert Richter <rrichter@cavium.com>
      Signed-off-by: NAleksey Makarov <aleksey.makarov@caviumnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4b561c17
    • E
      ipv6: flush nd cache on IFF_NOARP change · c8507fb2
      Eric Dumazet 提交于
      This patch is the IPv6 equivalent of commit
      6c8b4e3f ("arp: flush arp cache on IFF_NOARP change")
      
      Without it, we keep buggy neighbours in the cache, with destination
      MAC address equal to our own MAC address.
      
      Tested:
       tcpdump -i eth0 -s 0 ip6 -n -e &
       ip link set dev eth0 arp off
       ping6 remote   // sends buggy frames
       ip link set dev eth0 arp on
       ping6 remote   // should work once kernel is patched
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NMario Fanelli <mariofanelli@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c8507fb2
    • D
      Merge branch 'netcp-fixes' · b2428f94
      David S. Miller 提交于
      Murali Karicheri says:
      
      ====================
      net: netcp: bug fixes for dynamic module support
      
      This series fixes few bugs to allow keystone netcp modules to be
      dynamically loaded and removed. Currently it allows following
      sequence multiple times
      
       insmod cpsw_ale.ko
       insmod davinci_mdio.ko
       insmod keystone_netcp.ko
       insmod keystone_netcp_ethss.ko
       ifup eth0
       ifup eth1
       ping <hosts on eth0>
       ping <hosts on eth1>
       ifdown eth1
       ifdown eth0
       rmmod keystone_netcp_ethss.ko
       rmmod keystone_netcp.ko
       rmmod davinci_mdio.ko
       rmmod cpsw_ale.ko
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b2428f94
    • K
      net: netcp: ethss: cleanup gbe_probe() and gbe_remove() functions · 31a184b7
      Karicheri, Muralidharan 提交于
      This patch clean up error handle code to use goto label properly. In some
      cases, the code unnecessarily use goto instead of just returning the error
      code.  Code also make explicit calls to devm_* APIs on error which is
      not necessary. In the gbe_remove() also it makes similar calls which is
      also unnecessary.
      
      Also fix few checkpatch warnings
      Signed-off-by: NMurali Karicheri <m-karicheri2@ti.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      31a184b7
    • K
      net: netcp: ethss: fix up incorrect use of list api · c20afae7
      Karicheri, Muralidharan 提交于
      The code seems to assume a null is returned when the list is empty
      from first_sec_slave() to break the loop which is incorrect. Fix the
      code by using list_empty().
      Signed-off-by: NMurali Karicheri <m-karicheri2@ti.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c20afae7
    • K
      net: netcp: fix cleanup interface list in netcp_remove() · 01a03099
      Karicheri, Muralidharan 提交于
      Currently if user do rmmod keystone_netcp.ko following warning is
      seen :-
      
      [   59.035891] ------------[ cut here ]------------
      [   59.040535] WARNING: CPU: 2 PID: 1619 at drivers/net/ethernet/ti/
      netcp_core.c:2127 netcp_remove)
      
      This is because the interface list is not cleaned up in netcp_remove.
      This patch fixes this. Also fix some checkpatch related warnings.
      Signed-off-by: NMurali Karicheri <m-karicheri2@ti.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      01a03099
    • D
      ebpf, x86: fix general protection fault when tail call is invoked · 2482abb9
      Daniel Borkmann 提交于
      With eBPF JIT compiler enabled on x86_64, I was able to reliably trigger
      the following general protection fault out of an eBPF program with a simple
      tail call, f.e. tracex5 (or a stripped down version of it):
      
        [  927.097918] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
        [...]
        [  927.100870] task: ffff8801f228b780 ti: ffff880016a64000 task.ti: ffff880016a64000
        [  927.102096] RIP: 0010:[<ffffffffa002440d>]  [<ffffffffa002440d>] 0xffffffffa002440d
        [  927.103390] RSP: 0018:ffff880016a67a68  EFLAGS: 00010006
        [  927.104683] RAX: 5a5a5a5a5a5a5a5a RBX: 0000000000000000 RCX: 0000000000000001
        [  927.105921] RDX: 0000000000000000 RSI: ffff88014e438000 RDI: ffff880016a67e00
        [  927.107137] RBP: ffff880016a67c90 R08: 0000000000000000 R09: 0000000000000001
        [  927.108351] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880016a67e00
        [  927.109567] R13: 0000000000000000 R14: ffff88026500e460 R15: ffff880220a81520
        [  927.110787] FS:  00007fe7d5c1f740(0000) GS:ffff880265000000(0000) knlGS:0000000000000000
        [  927.112021] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [  927.113255] CR2: 0000003e7bbb91a0 CR3: 000000006e04b000 CR4: 00000000001407e0
        [  927.114500] Stack:
        [  927.115737]  ffffc90008cdb000 ffff880016a67e00 ffff88026500e460 ffff880220a81520
        [  927.117005]  0000000100000000 000000000000001b ffff880016a67aa8 ffffffff8106c548
        [  927.118276]  00007ffcdaf22e58 0000000000000000 0000000000000000 ffff880016a67ff0
        [  927.119543] Call Trace:
        [  927.120797]  [<ffffffff8106c548>] ? lookup_address+0x28/0x30
        [  927.122058]  [<ffffffff8113d176>] ? __module_text_address+0x16/0x70
        [  927.123314]  [<ffffffff8117bf0e>] ? is_ftrace_trampoline+0x3e/0x70
        [  927.124562]  [<ffffffff810c1a0f>] ? __kernel_text_address+0x5f/0x80
        [  927.125806]  [<ffffffff8102086f>] ? print_context_stack+0x7f/0xf0
        [  927.127033]  [<ffffffff810f7852>] ? __lock_acquire+0x572/0x2050
        [  927.128254]  [<ffffffff810f7852>] ? __lock_acquire+0x572/0x2050
        [  927.129461]  [<ffffffff8119edfa>] ? trace_call_bpf+0x3a/0x140
        [  927.130654]  [<ffffffff8119ee4a>] trace_call_bpf+0x8a/0x140
        [  927.131837]  [<ffffffff8119edfa>] ? trace_call_bpf+0x3a/0x140
        [  927.133015]  [<ffffffff8119f008>] kprobe_perf_func+0x28/0x220
        [  927.134195]  [<ffffffff811a1668>] kprobe_dispatcher+0x38/0x60
        [  927.135367]  [<ffffffff81174b91>] ? seccomp_phase1+0x1/0x230
        [  927.136523]  [<ffffffff81061400>] kprobe_ftrace_handler+0xf0/0x150
        [  927.137666]  [<ffffffff81174b95>] ? seccomp_phase1+0x5/0x230
        [  927.138802]  [<ffffffff8117950c>] ftrace_ops_recurs_func+0x5c/0xb0
        [  927.139934]  [<ffffffffa022b0d5>] 0xffffffffa022b0d5
        [  927.141066]  [<ffffffff81174b91>] ? seccomp_phase1+0x1/0x230
        [  927.142199]  [<ffffffff81174b95>] seccomp_phase1+0x5/0x230
        [  927.143323]  [<ffffffff8102c0a4>] syscall_trace_enter_phase1+0xc4/0x150
        [  927.144450]  [<ffffffff81174b95>] ? seccomp_phase1+0x5/0x230
        [  927.145572]  [<ffffffff8102c0a4>] ? syscall_trace_enter_phase1+0xc4/0x150
        [  927.146666]  [<ffffffff817f9a9f>] tracesys+0xd/0x44
        [  927.147723] Code: 48 8b 46 10 48 39 d0 76 2c 8b 85 fc fd ff ff 83 f8 20 77 21 83
                             c0 01 89 85 fc fd ff ff 48 8d 44 d6 80 48 8b 00 48 83 f8 00 74
                             0a <48> 8b 40 20 48 83 c0 33 ff e0 48 89 d8 48 8b 9d d8 fd ff
                             ff 4c
        [  927.150046] RIP  [<ffffffffa002440d>] 0xffffffffa002440d
      
      The code section with the instructions that traps points into the eBPF JIT
      image of the root program (the one invoking the tail call instruction).
      
      Using bpf_jit_disasm -o on the eBPF root program image:
      
        [...]
        4e:   mov    -0x204(%rbp),%eax
              8b 85 fc fd ff ff
        54:   cmp    $0x20,%eax               <--- if (tail_call_cnt > MAX_TAIL_CALL_CNT)
              83 f8 20
        57:   ja     0x000000000000007a
              77 21
        59:   add    $0x1,%eax                <--- tail_call_cnt++
              83 c0 01
        5c:   mov    %eax,-0x204(%rbp)
              89 85 fc fd ff ff
        62:   lea    -0x80(%rsi,%rdx,8),%rax  <--- prog = array->prog[index]
              48 8d 44 d6 80
        67:   mov    (%rax),%rax
              48 8b 00
        6a:   cmp    $0x0,%rax                <--- check for NULL
              48 83 f8 00
        6e:   je     0x000000000000007a
              74 0a
        70:   mov    0x20(%rax),%rax          <--- GPF triggered here! fetch of bpf_func
              48 8b 40 20                              [ matches <48> 8b 40 20 ... from above ]
        74:   add    $0x33,%rax               <--- prologue skip of new prog
              48 83 c0 33
        78:   jmpq   *%rax                    <--- jump to new prog insns
              ff e0
        [...]
      
      The problem is that rax has 5a5a5a5a5a5a5a5a, which suggests a tail call
      jump to map slot 0 is pointing to a poisoned page. The issue is the following:
      
      lea instruction has a wrong offset, i.e. it should be ...
      
        lea    0x80(%rsi,%rdx,8),%rax
      
      ... but it actually seems to be ...
      
        lea   -0x80(%rsi,%rdx,8),%rax
      
      ... where 0x80 is offsetof(struct bpf_array, prog), thus the offset needs
      to be positive instead of negative. Disassembling the interpreter, we btw
      similarly do:
      
        [...]
        c88:  lea     0x80(%rax,%rdx,8),%rax  <--- prog = array->prog[index]
              48 8d 84 d0 80 00 00 00
        c90:  add     $0x1,%r13d
              41 83 c5 01
        c94:  mov     (%rax),%rax
              48 8b 00
        [...]
      
      Now the other interesting fact is that this panic triggers only when things
      like CONFIG_LOCKDEP are being used. In that case offsetof(struct bpf_array,
      prog) starts at offset 0x80 and in non-CONFIG_LOCKDEP case at offset 0x50.
      Reason is that the work_struct inside struct bpf_map grows by 48 bytes in my
      case due to the lockdep_map member (which also has CONFIG_LOCK_STAT enabled
      members).
      
      Changing the emitter to always use the 4 byte displacement in the lea
      instruction fixes the panic on my side. It increases the tail call instruction
      emission by 3 more byte, but it should cover us from various combinations
      (and perhaps other future increases on related structures).
      
      After patch, disassembly:
      
        [...]
        9e:   lea    0x80(%rsi,%rdx,8),%rax   <--- CONFIG_LOCKDEP/CONFIG_LOCK_STAT
              48 8d 84 d6 80 00 00 00
        a6:   mov    (%rax),%rax
              48 8b 00
        [...]
      
        [...]
        9e:   lea    0x50(%rsi,%rdx,8),%rax   <--- No CONFIG_LOCKDEP
              48 8d 84 d6 50 00 00 00
        a6:   mov    (%rax),%rax
              48 8b 00
        [...]
      
      Fixes: b52f00e6 ("x86: bpf_jit: implement bpf_tail_call() helper")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2482abb9
    • N
      bridge: mdb: fix delmdb state in the notification · 7ae90a4f
      Nikolay Aleksandrov 提交于
      Since mdb states were introduced when deleting an entry the state was
      left as it was set in the delete request from the user which leads to
      the following output when doing a monitor (for example):
      $ bridge mdb add dev br0 port eth3 grp 239.0.0.1 permanent
      (monitor) dev br0 port eth3 grp 239.0.0.1 permanent
      $ bridge mdb del dev br0 port eth3 grp 239.0.0.1 permanent
      (monitor) dev br0 port eth3 grp 239.0.0.1 temp
      ^^^
      Note the "temp" state in the delete notification which is wrong since
      the entry was permanent, the state in a delete is always reported as
      "temp" regardless of the real state of the entry.
      
      After this patch:
      $ bridge mdb add dev br0 port eth3 grp 239.0.0.1 permanent
      (monitor) dev br0 port eth3 grp 239.0.0.1 permanent
      $ bridge mdb del dev br0 port eth3 grp 239.0.0.1 permanent
      (monitor) dev br0 port eth3 grp 239.0.0.1 permanent
      
      There's one important note to make here that the state is actually not
      matched when doing a delete, so one can delete a permanent entry by
      stating "temp" in the end of the command, I've chosen this fix in order
      not to break user-space tools which rely on this (incorrect) behaviour.
      
      So to give an example after this patch and using the wrong state:
      $ bridge mdb add dev br0 port eth3 grp 239.0.0.1 permanent
      (monitor) dev br0 port eth3 grp 239.0.0.1 permanent
      $ bridge mdb del dev br0 port eth3 grp 239.0.0.1 temp
      (monitor) dev br0 port eth3 grp 239.0.0.1 permanent
      
      Note the state of the entry that got deleted is correct in the
      notification.
      Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Fixes: ccb1c31a ("bridge: add flags to distinguish permanent mdb entires")
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7ae90a4f
    • S
      bridge: mcast: give fast leave precedence over multicast router and querier · 544586f7
      Satish Ashok 提交于
      When fast leave is configured on a bridge port and an IGMP leave is
      received for a group, the group is not deleted immediately if there is
      a router detected or if multicast querier is configured.
      Ideally the group should be deleted immediately when fast leave is
      configured.
      Signed-off-by: NSatish Ashok <sashok@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      544586f7
    • T
      bridge: Fix network header pointer for vlan tagged packets · df356d5e
      Toshiaki Makita 提交于
      There are several devices that can receive vlan tagged packets with
      CHECKSUM_PARTIAL like tap, possibly veth and xennet.
      When (multiple) vlan tagged packets with CHECKSUM_PARTIAL are forwarded
      by bridge to a device with the IP_CSUM feature, they end up with checksum
      error because before entering bridge, the network header is set to
      ETH_HLEN (not including vlan header length) in __netif_receive_skb_core(),
      get_rps_cpu(), or drivers' rx functions, and nobody fixes the pointer later.
      
      Since the network header is exepected to be ETH_HLEN in flow-dissection
      and hash-calculation in RPS in rx path, and since the header pointer fix
      is needed only in tx path, set the appropriate network header on forwarding
      packets.
      Signed-off-by: NToshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      df356d5e
  4. 29 7月, 2015 4 次提交
  5. 28 7月, 2015 4 次提交
  6. 27 7月, 2015 1 次提交