1. 07 8月, 2015 3 次提交
  2. 04 8月, 2015 4 次提交
  3. 03 8月, 2015 1 次提交
    • E
      fq_codel: explicitly reset flows in ->reset() · 3d0e0af4
      Eric Dumazet 提交于
      Alex reported the following crash when using fq_codel
      with htb:
      
        crash> bt
        PID: 630839  TASK: ffff8823c990d280  CPU: 14  COMMAND: "tc"
         [... snip ...]
         #8 [ffff8820ceec17a0] page_fault at ffffffff8160a8c2
            [exception RIP: htb_qlen_notify+24]
            RIP: ffffffffa0841718  RSP: ffff8820ceec1858  RFLAGS: 00010282
            RAX: 0000000000000000  RBX: 0000000000000000  RCX: ffff88241747b400
            RDX: ffff88241747b408  RSI: 0000000000000000  RDI: ffff8811fb27d000
            RBP: ffff8820ceec1868   R8: ffff88120cdeff24   R9: ffff88120cdeff30
            R10: 0000000000000bd4  R11: ffffffffa0840919  R12: ffffffffa0843340
            R13: 0000000000000000  R14: 0000000000000001  R15: ffff8808dae5c2e8
            ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
         #9 [...] qdisc_tree_decrease_qlen at ffffffff81565375
        #10 [...] fq_codel_dequeue at ffffffffa084e0a0 [sch_fq_codel]
        #11 [...] fq_codel_reset at ffffffffa084e2f8 [sch_fq_codel]
        #12 [...] qdisc_destroy at ffffffff81560d2d
        #13 [...] htb_destroy_class at ffffffffa08408f8 [sch_htb]
        #14 [...] htb_put at ffffffffa084095c [sch_htb]
        #15 [...] tc_ctl_tclass at ffffffff815645a3
        #16 [...] rtnetlink_rcv_msg at ffffffff81552cb0
        [... snip ...]
      
      As Jamal pointed out, there is actually no need to call dequeue
      to purge the queued skb's in reset, data structures can be just
      reset explicitly. Therefore, we reset everything except config's
      and stats, so that we would have a fresh start after device flipping.
      
      Fixes: 4b549a2e ("fq_codel: Fair Queue Codel AQM")
      Reported-by: NAlex Gartrell <agartrell@fb.com>
      Cc: Alex Gartrell <agartrell@fb.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      [xiyou.wangcong@gmail.com: added codel_vars_init() and qdisc_qstats_backlog_dec()]
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3d0e0af4
  4. 01 8月, 2015 1 次提交
  5. 31 7月, 2015 2 次提交
    • S
      net: sk_clone_lock() should only do get_net() if the parent is not a kernel socket · 8a681736
      Sowmini Varadhan 提交于
      The newsk returned by sk_clone_lock should hold a get_net()
      reference if, and only if, the parent is not a kernel socket
      (making this similar to sk_alloc()).
      
      E.g,. for the SYN_RECV path, tcp_v4_syn_recv_sock->..inet_csk_clone_lock
      sets up the syn_recv newsk from sk_clone_lock. When the parent (listen)
      socket is a kernel socket (defined in sk_alloc() as having
      sk_net_refcnt == 0), then the newsk should also have a 0 sk_net_refcnt
      and should not hold a get_net() reference.
      
      Fixes: 26abe143 ("net: Modify sk_alloc to not reference count the
            netns of kernel sockets.")
      Acked-by: NEric Dumazet <edumazet@google.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: NSowmini Varadhan <sowmini.varadhan@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8a681736
    • D
      net: sched: fix refcount imbalance in actions · 28e6b67f
      Daniel Borkmann 提交于
      Since commit 55334a5d ("net_sched: act: refuse to remove bound action
      outside"), we end up with a wrong reference count for a tc action.
      
      Test case 1:
      
        FOO="1,6 0 0 4294967295,"
        BAR="1,6 0 0 4294967294,"
        tc filter add dev foo parent 1: bpf bytecode "$FOO" flowid 1:1 \
           action bpf bytecode "$FOO"
        tc actions show action bpf
          action order 0: bpf bytecode '1,6 0 0 4294967295' default-action pipe
          index 1 ref 1 bind 1
        tc actions replace action bpf bytecode "$BAR" index 1
        tc actions show action bpf
          action order 0: bpf bytecode '1,6 0 0 4294967294' default-action pipe
          index 1 ref 2 bind 1
        tc actions replace action bpf bytecode "$FOO" index 1
        tc actions show action bpf
          action order 0: bpf bytecode '1,6 0 0 4294967295' default-action pipe
          index 1 ref 3 bind 1
      
      Test case 2:
      
        FOO="1,6 0 0 4294967295,"
        tc filter add dev foo parent 1: bpf bytecode "$FOO" flowid 1:1 action ok
        tc actions show action gact
          action order 0: gact action pass
          random type none pass val 0
           index 1 ref 1 bind 1
        tc actions add action drop index 1
          RTNETLINK answers: File exists [...]
        tc actions show action gact
          action order 0: gact action pass
           random type none pass val 0
           index 1 ref 2 bind 1
        tc actions add action drop index 1
          RTNETLINK answers: File exists [...]
        tc actions show action gact
          action order 0: gact action pass
           random type none pass val 0
           index 1 ref 3 bind 1
      
      What happens is that in tcf_hash_check(), we check tcf_common for a given
      index and increase tcfc_refcnt and conditionally tcfc_bindcnt when we've
      found an existing action. Now there are the following cases:
      
        1) We do a late binding of an action. In that case, we leave the
           tcfc_refcnt/tcfc_bindcnt increased and are done with the ->init()
           handler. This is correctly handeled.
      
        2) We replace the given action, or we try to add one without replacing
           and find out that the action at a specific index already exists
           (thus, we go out with error in that case).
      
      In case of 2), we have to undo the reference count increase from
      tcf_hash_check() in the tcf_hash_check() function. Currently, we fail to
      do so because of the 'tcfc_bindcnt > 0' check which bails out early with
      an -EPERM error.
      
      Now, while commit 55334a5d prevents 'tc actions del action ...' on an
      already classifier-bound action to drop the reference count (which could
      then become negative, wrap around etc), this restriction only accounts for
      invocations outside a specific action's ->init() handler.
      
      One possible solution would be to add a flag thus we possibly trigger
      the -EPERM ony in situations where it is indeed relevant.
      
      After the patch, above test cases have correct reference count again.
      
      Fixes: 55334a5d ("net_sched: act: refuse to remove bound action outside")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: NCong Wang <cwang@twopensource.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      28e6b67f
  6. 30 7月, 2015 5 次提交
    • D
      act_bpf: fix memory leaks when replacing bpf programs · f4eaed28
      Daniel Borkmann 提交于
      We currently trigger multiple memory leaks when replacing bpf
      actions, besides others:
      
        comm "tc", pid 1909, jiffies 4294851310 (age 1602.796s)
        hex dump (first 32 bytes):
          01 00 00 00 03 00 00 00 00 00 00 00 00 00 00 00  ................
          18 b0 98 6d 00 88 ff ff 00 00 00 00 00 00 00 00  ...m............
        backtrace:
          [<ffffffff817e623e>] kmemleak_alloc+0x4e/0xb0
          [<ffffffff8120a22d>] __vmalloc_node_range+0x1bd/0x2c0
          [<ffffffff8120a37a>] __vmalloc+0x4a/0x50
          [<ffffffff811a8d0a>] bpf_prog_alloc+0x3a/0xa0
          [<ffffffff816c0684>] bpf_prog_create+0x44/0xa0
          [<ffffffffa09ba4eb>] tcf_bpf_init+0x28b/0x3c0 [act_bpf]
          [<ffffffff816d7001>] tcf_action_init_1+0x191/0x1b0
          [<ffffffff816d70a2>] tcf_action_init+0x82/0xf0
          [<ffffffff816d4d12>] tcf_exts_validate+0xb2/0xc0
          [<ffffffffa09b5838>] cls_bpf_modify_existing+0x98/0x340 [cls_bpf]
          [<ffffffffa09b5cd6>] cls_bpf_change+0x1a6/0x274 [cls_bpf]
          [<ffffffff816d56e5>] tc_ctl_tfilter+0x335/0x910
          [<ffffffff816b9145>] rtnetlink_rcv_msg+0x95/0x240
          [<ffffffff816df34f>] netlink_rcv_skb+0xaf/0xc0
          [<ffffffff816b909e>] rtnetlink_rcv+0x2e/0x40
          [<ffffffff816deaaf>] netlink_unicast+0xef/0x1b0
      
      Issue is that the old content from tcf_bpf is allocated and needs
      to be released when we replace it. We seem to do that since the
      beginning of act_bpf on the filter and insns, later on the name as
      well.
      
      Example test case, after patch:
      
        # FOO="1,6 0 0 4294967295,"
        # BAR="1,6 0 0 4294967294,"
        # tc actions add action bpf bytecode "$FOO" index 2
        # tc actions show action bpf
         action order 0: bpf bytecode '1,6 0 0 4294967295' default-action pipe
         index 2 ref 1 bind 0
        # tc actions replace action bpf bytecode "$BAR" index 2
        # tc actions show action bpf
         action order 0: bpf bytecode '1,6 0 0 4294967294' default-action pipe
         index 2 ref 1 bind 0
        # tc actions replace action bpf bytecode "$FOO" index 2
        # tc actions show action bpf
         action order 0: bpf bytecode '1,6 0 0 4294967295' default-action pipe
         index 2 ref 1 bind 0
        # tc actions del action bpf index 2
        [...]
        # echo "scan" > /sys/kernel/debug/kmemleak
        # cat /sys/kernel/debug/kmemleak | grep "comm \"tc\"" | wc -l
        0
      
      Fixes: d23b8ad8 ("tc: add BPF based action")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f4eaed28
    • E
      ipv6: flush nd cache on IFF_NOARP change · c8507fb2
      Eric Dumazet 提交于
      This patch is the IPv6 equivalent of commit
      6c8b4e3f ("arp: flush arp cache on IFF_NOARP change")
      
      Without it, we keep buggy neighbours in the cache, with destination
      MAC address equal to our own MAC address.
      
      Tested:
       tcpdump -i eth0 -s 0 ip6 -n -e &
       ip link set dev eth0 arp off
       ping6 remote   // sends buggy frames
       ip link set dev eth0 arp on
       ping6 remote   // should work once kernel is patched
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NMario Fanelli <mariofanelli@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c8507fb2
    • N
      bridge: mdb: fix delmdb state in the notification · 7ae90a4f
      Nikolay Aleksandrov 提交于
      Since mdb states were introduced when deleting an entry the state was
      left as it was set in the delete request from the user which leads to
      the following output when doing a monitor (for example):
      $ bridge mdb add dev br0 port eth3 grp 239.0.0.1 permanent
      (monitor) dev br0 port eth3 grp 239.0.0.1 permanent
      $ bridge mdb del dev br0 port eth3 grp 239.0.0.1 permanent
      (monitor) dev br0 port eth3 grp 239.0.0.1 temp
      ^^^
      Note the "temp" state in the delete notification which is wrong since
      the entry was permanent, the state in a delete is always reported as
      "temp" regardless of the real state of the entry.
      
      After this patch:
      $ bridge mdb add dev br0 port eth3 grp 239.0.0.1 permanent
      (monitor) dev br0 port eth3 grp 239.0.0.1 permanent
      $ bridge mdb del dev br0 port eth3 grp 239.0.0.1 permanent
      (monitor) dev br0 port eth3 grp 239.0.0.1 permanent
      
      There's one important note to make here that the state is actually not
      matched when doing a delete, so one can delete a permanent entry by
      stating "temp" in the end of the command, I've chosen this fix in order
      not to break user-space tools which rely on this (incorrect) behaviour.
      
      So to give an example after this patch and using the wrong state:
      $ bridge mdb add dev br0 port eth3 grp 239.0.0.1 permanent
      (monitor) dev br0 port eth3 grp 239.0.0.1 permanent
      $ bridge mdb del dev br0 port eth3 grp 239.0.0.1 temp
      (monitor) dev br0 port eth3 grp 239.0.0.1 permanent
      
      Note the state of the entry that got deleted is correct in the
      notification.
      Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Fixes: ccb1c31a ("bridge: add flags to distinguish permanent mdb entires")
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7ae90a4f
    • S
      bridge: mcast: give fast leave precedence over multicast router and querier · 544586f7
      Satish Ashok 提交于
      When fast leave is configured on a bridge port and an IGMP leave is
      received for a group, the group is not deleted immediately if there is
      a router detected or if multicast querier is configured.
      Ideally the group should be deleted immediately when fast leave is
      configured.
      Signed-off-by: NSatish Ashok <sashok@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      544586f7
    • T
      bridge: Fix network header pointer for vlan tagged packets · df356d5e
      Toshiaki Makita 提交于
      There are several devices that can receive vlan tagged packets with
      CHECKSUM_PARTIAL like tap, possibly veth and xennet.
      When (multiple) vlan tagged packets with CHECKSUM_PARTIAL are forwarded
      by bridge to a device with the IP_CSUM feature, they end up with checksum
      error because before entering bridge, the network header is set to
      ETH_HLEN (not including vlan header length) in __netif_receive_skb_core(),
      get_rps_cpu(), or drivers' rx functions, and nobody fixes the pointer later.
      
      Since the network header is exepected to be ETH_HLEN in flow-dissection
      and hash-calculation in RPS in rx path, and since the header pointer fix
      is needed only in tx path, set the appropriate network header on forwarding
      packets.
      Signed-off-by: NToshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      df356d5e
  7. 29 7月, 2015 4 次提交
  8. 28 7月, 2015 3 次提交
  9. 27 7月, 2015 8 次提交
  10. 25 7月, 2015 5 次提交
    • K
      cgroup: net_cls: fix false-positive "suspicious RCU usage" · cc9f4daa
      Konstantin Khlebnikov 提交于
      In dev_queue_xmit() net_cls protected with rcu-bh.
      
      [  270.730026] ===============================
      [  270.730029] [ INFO: suspicious RCU usage. ]
      [  270.730033] 4.2.0-rc3+ #2 Not tainted
      [  270.730036] -------------------------------
      [  270.730040] include/linux/cgroup.h:353 suspicious rcu_dereference_check() usage!
      [  270.730041] other info that might help us debug this:
      [  270.730043] rcu_scheduler_active = 1, debug_locks = 1
      [  270.730045] 2 locks held by dhclient/748:
      [  270.730046]  #0:  (rcu_read_lock_bh){......}, at: [<ffffffff81682b70>] __dev_queue_xmit+0x50/0x960
      [  270.730085]  #1:  (&qdisc_tx_lock){+.....}, at: [<ffffffff81682d60>] __dev_queue_xmit+0x240/0x960
      [  270.730090] stack backtrace:
      [  270.730096] CPU: 0 PID: 748 Comm: dhclient Not tainted 4.2.0-rc3+ #2
      [  270.730098] Hardware name: OpenStack Foundation OpenStack Nova, BIOS Bochs 01/01/2011
      [  270.730100]  0000000000000001 ffff8800bafeba58 ffffffff817ad487 0000000000000007
      [  270.730103]  ffff880232a0a780 ffff8800bafeba88 ffffffff810ca4f2 ffff88022fb23e00
      [  270.730105]  ffff880232a0a780 ffff8800bafebb68 ffff8800bafebb68 ffff8800bafebaa8
      [  270.730108] Call Trace:
      [  270.730121]  [<ffffffff817ad487>] dump_stack+0x4c/0x65
      [  270.730148]  [<ffffffff810ca4f2>] lockdep_rcu_suspicious+0xe2/0x120
      [  270.730153]  [<ffffffff816a62d2>] task_cls_state+0x92/0xa0
      [  270.730158]  [<ffffffffa00b534f>] cls_cgroup_classify+0x4f/0x120 [cls_cgroup]
      [  270.730164]  [<ffffffff816aac74>] tc_classify_compat+0x74/0xc0
      [  270.730166]  [<ffffffff816ab573>] tc_classify+0x33/0x90
      [  270.730170]  [<ffffffffa00bcb0a>] htb_enqueue+0xaa/0x4a0 [sch_htb]
      [  270.730172]  [<ffffffff81682e26>] __dev_queue_xmit+0x306/0x960
      [  270.730174]  [<ffffffff81682b70>] ? __dev_queue_xmit+0x50/0x960
      [  270.730176]  [<ffffffff816834a3>] dev_queue_xmit_sk+0x13/0x20
      [  270.730185]  [<ffffffff81787770>] dev_queue_xmit+0x10/0x20
      [  270.730187]  [<ffffffff8178b91c>] packet_snd.isra.62+0x54c/0x760
      [  270.730190]  [<ffffffff8178be25>] packet_sendmsg+0x2f5/0x3f0
      [  270.730203]  [<ffffffff81665245>] ? sock_def_readable+0x5/0x190
      [  270.730210]  [<ffffffff817b64bb>] ? _raw_spin_unlock+0x2b/0x40
      [  270.730216]  [<ffffffff8173bcbc>] ? unix_dgram_sendmsg+0x5cc/0x640
      [  270.730219]  [<ffffffff8165f367>] sock_sendmsg+0x47/0x50
      [  270.730221]  [<ffffffff8165f42f>] sock_write_iter+0x7f/0xd0
      [  270.730232]  [<ffffffff811fd4c7>] __vfs_write+0xa7/0xf0
      [  270.730234]  [<ffffffff811fe5b8>] vfs_write+0xb8/0x190
      [  270.730236]  [<ffffffff811fe8c2>] SyS_write+0x52/0xb0
      [  270.730239]  [<ffffffff817b6bae>] entry_SYSCALL_64_fastpath+0x12/0x76
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cc9f4daa
    • W
      77e62da6
    • W
      sch_plug: purge buffered packets during reset · fe6bea7f
      WANG Cong 提交于
      Otherwise the skbuff related structures are not correctly
      refcount'ed.
      
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fe6bea7f
    • J
      ipv4: consider TOS in fib_select_default · 2392debc
      Julian Anastasov 提交于
      fib_select_default considers alternative routes only when
      res->fi is for the first alias in res->fa_head. In the
      common case this can happen only when the initial lookup
      matches the first alias with highest TOS value. This
      prevents the alternative routes to require specific TOS.
      
      This patch solves the problem as follows:
      
      - routes that require specific TOS should be returned by
      fib_select_default only when TOS matches, as already done
      in fib_table_lookup. This rule implies that depending on the
      TOS we can have many different lists of alternative gateways
      and we have to keep the last used gateway (fa_default) in first
      alias for the TOS instead of using single tb_default value.
      
      - as the aliases are ordered by many keys (TOS desc,
      fib_priority asc), we restrict the possible results to
      routes with matching TOS and lowest metric (fib_priority)
      and routes that match any TOS, again with lowest metric.
      
      For example, packet with TOS 8 can not use gw3 (not lowest
      metric), gw4 (different TOS) and gw6 (not lowest metric),
      all other gateways can be used:
      
      tos 8 via gw1 metric 2 <--- res->fa_head and res->fi
      tos 8 via gw2 metric 2
      tos 8 via gw3 metric 3
      tos 4 via gw4
      tos 0 via gw5
      tos 0 via gw6 metric 1
      Reported-by: NHagen Paul Pfeifer <hagen@jauu.net>
      Signed-off-by: NJulian Anastasov <ja@ssi.bg>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2392debc
    • J
      ipv4: fib_select_default should match the prefix · 18a912e9
      Julian Anastasov 提交于
      fib_trie starting from 4.1 can link fib aliases from
      different prefixes in same list. Make sure the alternative
      gateways are in same table and for same prefix (0) by
      checking tb_id and fa_slen.
      
      Fixes: 79e5ad2c ("fib_trie: Remove leaf_info")
      Signed-off-by: NJulian Anastasov <ja@ssi.bg>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      18a912e9
  11. 23 7月, 2015 3 次提交
  12. 22 7月, 2015 1 次提交