1. 31 8月, 2015 1 次提交
  2. 31 7月, 2015 1 次提交
    • D
      net: sched: fix refcount imbalance in actions · 28e6b67f
      Daniel Borkmann 提交于
      Since commit 55334a5d ("net_sched: act: refuse to remove bound action
      outside"), we end up with a wrong reference count for a tc action.
      
      Test case 1:
      
        FOO="1,6 0 0 4294967295,"
        BAR="1,6 0 0 4294967294,"
        tc filter add dev foo parent 1: bpf bytecode "$FOO" flowid 1:1 \
           action bpf bytecode "$FOO"
        tc actions show action bpf
          action order 0: bpf bytecode '1,6 0 0 4294967295' default-action pipe
          index 1 ref 1 bind 1
        tc actions replace action bpf bytecode "$BAR" index 1
        tc actions show action bpf
          action order 0: bpf bytecode '1,6 0 0 4294967294' default-action pipe
          index 1 ref 2 bind 1
        tc actions replace action bpf bytecode "$FOO" index 1
        tc actions show action bpf
          action order 0: bpf bytecode '1,6 0 0 4294967295' default-action pipe
          index 1 ref 3 bind 1
      
      Test case 2:
      
        FOO="1,6 0 0 4294967295,"
        tc filter add dev foo parent 1: bpf bytecode "$FOO" flowid 1:1 action ok
        tc actions show action gact
          action order 0: gact action pass
          random type none pass val 0
           index 1 ref 1 bind 1
        tc actions add action drop index 1
          RTNETLINK answers: File exists [...]
        tc actions show action gact
          action order 0: gact action pass
           random type none pass val 0
           index 1 ref 2 bind 1
        tc actions add action drop index 1
          RTNETLINK answers: File exists [...]
        tc actions show action gact
          action order 0: gact action pass
           random type none pass val 0
           index 1 ref 3 bind 1
      
      What happens is that in tcf_hash_check(), we check tcf_common for a given
      index and increase tcfc_refcnt and conditionally tcfc_bindcnt when we've
      found an existing action. Now there are the following cases:
      
        1) We do a late binding of an action. In that case, we leave the
           tcfc_refcnt/tcfc_bindcnt increased and are done with the ->init()
           handler. This is correctly handeled.
      
        2) We replace the given action, or we try to add one without replacing
           and find out that the action at a specific index already exists
           (thus, we go out with error in that case).
      
      In case of 2), we have to undo the reference count increase from
      tcf_hash_check() in the tcf_hash_check() function. Currently, we fail to
      do so because of the 'tcfc_bindcnt > 0' check which bails out early with
      an -EPERM error.
      
      Now, while commit 55334a5d prevents 'tc actions del action ...' on an
      already classifier-bound action to drop the reference count (which could
      then become negative, wrap around etc), this restriction only accounts for
      invocations outside a specific action's ->init() handler.
      
      One possible solution would be to add a flag thus we possibly trigger
      the -EPERM ony in situations where it is indeed relevant.
      
      After the patch, above test cases have correct reference count again.
      
      Fixes: 55334a5d ("net_sched: act: refuse to remove bound action outside")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: NCong Wang <cwang@twopensource.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      28e6b67f
  3. 27 7月, 2015 4 次提交
  4. 25 7月, 2015 1 次提交
    • J
      ipv4: consider TOS in fib_select_default · 2392debc
      Julian Anastasov 提交于
      fib_select_default considers alternative routes only when
      res->fi is for the first alias in res->fa_head. In the
      common case this can happen only when the initial lookup
      matches the first alias with highest TOS value. This
      prevents the alternative routes to require specific TOS.
      
      This patch solves the problem as follows:
      
      - routes that require specific TOS should be returned by
      fib_select_default only when TOS matches, as already done
      in fib_table_lookup. This rule implies that depending on the
      TOS we can have many different lists of alternative gateways
      and we have to keep the last used gateway (fa_default) in first
      alias for the TOS instead of using single tb_default value.
      
      - as the aliases are ordered by many keys (TOS desc,
      fib_priority asc), we restrict the possible results to
      routes with matching TOS and lowest metric (fib_priority)
      and routes that match any TOS, again with lowest metric.
      
      For example, packet with TOS 8 can not use gw3 (not lowest
      metric), gw4 (different TOS) and gw6 (not lowest metric),
      all other gateways can be used:
      
      tos 8 via gw1 metric 2 <--- res->fa_head and res->fi
      tos 8 via gw2 metric 2
      tos 8 via gw3 metric 3
      tos 4 via gw4
      tos 0 via gw5
      tos 0 via gw6 metric 1
      Reported-by: NHagen Paul Pfeifer <hagen@jauu.net>
      Signed-off-by: NJulian Anastasov <ja@ssi.bg>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2392debc
  5. 20 7月, 2015 1 次提交
    • P
      netfilter: fix netns dependencies with conntrack templates · 0838aa7f
      Pablo Neira Ayuso 提交于
      Quoting Daniel Borkmann:
      
      "When adding connection tracking template rules to a netns, f.e. to
      configure netfilter zones, the kernel will endlessly busy-loop as soon
      as we try to delete the given netns in case there's at least one
      template present, which is problematic i.e. if there is such bravery that
      the priviledged user inside the netns is assumed untrusted.
      
      Minimal example:
      
        ip netns add foo
        ip netns exec foo iptables -t raw -A PREROUTING -d 1.2.3.4 -j CT --zone 1
        ip netns del foo
      
      What happens is that when nf_ct_iterate_cleanup() is being called from
      nf_conntrack_cleanup_net_list() for a provided netns, we always end up
      with a net->ct.count > 0 and thus jump back to i_see_dead_people. We
      don't get a soft-lockup as we still have a schedule() point, but the
      serving CPU spins on 100% from that point onwards.
      
      Since templates are normally allocated with nf_conntrack_alloc(), we
      also bump net->ct.count. The issue why they are not yet nf_ct_put() is
      because the per netns .exit() handler from x_tables (which would eventually
      invoke xt_CT's xt_ct_tg_destroy() that drops reference on info->ct) is
      called in the dependency chain at a *later* point in time than the per
      netns .exit() handler for the connection tracker.
      
      This is clearly a chicken'n'egg problem: after the connection tracker
      .exit() handler, we've teared down all the connection tracking
      infrastructure already, so rightfully, xt_ct_tg_destroy() cannot be
      invoked at a later point in time during the netns cleanup, as that would
      lead to a use-after-free. At the same time, we cannot make x_tables depend
      on the connection tracker module, so that the xt_ct_tg_destroy() would
      be invoked earlier in the cleanup chain."
      
      Daniel confirms this has to do with the order in which modules are loaded or
      having compiled nf_conntrack as modules while x_tables built-in. So we have no
      guarantees regarding the order in which netns callbacks are executed.
      
      Fix this by allocating the templates through kmalloc() from the respective
      SYNPROXY and CT targets, so they don't depend on the conntrack kmem cache.
      Then, release then via nf_ct_tmpl_free() from destroy_conntrack(). This branch
      is marked as unlikely since conntrack templates are rarely allocated and only
      from the configuration plane path.
      
      Note that templates are not kept in any list to avoid further dependencies with
      nf_conntrack anymore, thus, the tmpl larval list is removed.
      Reported-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Tested-by: NDaniel Borkmann <daniel@iogearbox.net>
      0838aa7f
  6. 17 7月, 2015 1 次提交
  7. 16 7月, 2015 1 次提交
  8. 29 6月, 2015 2 次提交
  9. 24 6月, 2015 2 次提交
    • A
      net: ipv4 sysctl option to ignore routes when nexthop link is down · 0eeb075f
      Andy Gospodarek 提交于
      This feature is only enabled with the new per-interface or ipv4 global
      sysctls called 'ignore_routes_with_linkdown'.
      
      net.ipv4.conf.all.ignore_routes_with_linkdown = 0
      net.ipv4.conf.default.ignore_routes_with_linkdown = 0
      net.ipv4.conf.lo.ignore_routes_with_linkdown = 0
      ...
      
      When the above sysctls are set, will report to userspace that a route is
      dead and will no longer resolve to this nexthop when performing a fib
      lookup.  This will signal to userspace that the route will not be
      selected.  The signalling of a RTNH_F_DEAD is only passed to userspace
      if the sysctl is enabled and link is down.  This was done as without it
      the netlink listeners would have no idea whether or not a nexthop would
      be selected.   The kernel only sets RTNH_F_DEAD internally if the
      interface has IFF_UP cleared.
      
      With the new sysctl set, the following behavior can be observed
      (interface p8p1 is link-down):
      
      default via 10.0.5.2 dev p9p1
      10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15
      70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1
      80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1 dead linkdown
      90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1 dead linkdown
      90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2
      90.0.0.1 via 70.0.0.2 dev p7p1  src 70.0.0.1
          cache
      local 80.0.0.1 dev lo  src 80.0.0.1
          cache <local>
      80.0.0.2 via 10.0.5.2 dev p9p1  src 10.0.5.15
          cache
      
      While the route does remain in the table (so it can be modified if
      needed rather than being wiped away as it would be if IFF_UP was
      cleared), the proper next-hop is chosen automatically when the link is
      down.  Now interface p8p1 is linked-up:
      
      default via 10.0.5.2 dev p9p1
      10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15
      70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1
      80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1
      90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1
      90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2
      192.168.56.0/24 dev p2p1  proto kernel  scope link  src 192.168.56.2
      90.0.0.1 via 80.0.0.2 dev p8p1  src 80.0.0.1
          cache
      local 80.0.0.1 dev lo  src 80.0.0.1
          cache <local>
      80.0.0.2 dev p8p1  src 80.0.0.1
          cache
      
      and the output changes to what one would expect.
      
      If the sysctl is not set, the following output would be expected when
      p8p1 is down:
      
      default via 10.0.5.2 dev p9p1
      10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15
      70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1
      80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1 linkdown
      90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1 linkdown
      90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2
      
      Since the dead flag does not appear, there should be no expectation that
      the kernel would skip using this route due to link being down.
      
      v2: Split kernel changes into 2 patches, this actually makes a
      behavioral change if the sysctl is set.  Also took suggestion from Alex
      to simplify code by only checking sysctl during fib lookup and
      suggestion from Scott to add a per-interface sysctl.
      
      v3: Code clean-ups to make it more readable and efficient as well as a
      reverse path check fix.
      
      v4: Drop binary sysctl
      
      v5: Whitespace fixups from Dave
      
      v6: Style changes from Dave and checkpatch suggestions
      
      v7: One more checkpatch fixup
      Signed-off-by: NAndy Gospodarek <gospo@cumulusnetworks.com>
      Signed-off-by: NDinesh Dutt <ddutt@cumulusnetworks.com>
      Acked-by: NScott Feldman <sfeldma@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0eeb075f
    • A
      net: track link-status of ipv4 nexthops · 8a3d0316
      Andy Gospodarek 提交于
      Add a fib flag called RTNH_F_LINKDOWN to any ipv4 nexthops that are
      reachable via an interface where carrier is off.  No action is taken,
      but additional flags are passed to userspace to indicate carrier status.
      
      This also includes a cleanup to fib_disable_ip to more clearly indicate
      what event made the function call to replace the more cryptic force
      option previously used.
      
      v2: Split out kernel functionality into 2 patches, this patch simply
      sets and clears new nexthop flag RTNH_F_LINKDOWN.
      
      v3: Cleanups suggested by Alex as well as a bug noticed in
      fib_sync_down_dev and fib_sync_up when multipath was not enabled.
      
      v5: Whitespace and variable declaration fixups suggested by Dave.
      
      v6: Style fixups noticed by Dave; ran checkpatch to be sure I got them
      all.
      Signed-off-by: NAndy Gospodarek <gospo@cumulusnetworks.com>
      Signed-off-by: NDinesh Dutt <ddutt@cumulusnetworks.com>
      Acked-by: NScott Feldman <sfeldma@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8a3d0316
  10. 23 6月, 2015 2 次提交
    • S
      switchdev: rename vlan vid_start to vid_begin · 3e3a78b4
      Scott Feldman 提交于
      Use vid_begin/end to be consistent with BRIDGE_VLAN_INFO_RANGE_BEGIN/END.
      Signed-off-by: NScott Feldman <sfeldma@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3e3a78b4
    • E
      netfilter: nf_qeueue: Drop queue entries on nf_unregister_hook · 8405a8ff
      Eric W. Biederman 提交于
      Add code to nf_unregister_hook to flush the nf_queue when a hook is
      unregistered.  This guarantees that the pointer that the nf_queue code
      retains into the nf_hook list will remain valid while a packet is
      queued.
      
      I tested what would happen if we do not flush queued packets and was
      trivially able to obtain the oops below.  All that was required was
      to stop the nf_queue listening process, to delete all of the nf_tables,
      and to awaken the nf_queue listening process.
      
      > BUG: unable to handle kernel paging request at 0000000100000001
      > IP: [<0000000100000001>] 0x100000001
      > PGD b9c35067 PUD 0
      > Oops: 0010 [#1] SMP
      > Modules linked in:
      > CPU: 0 PID: 519 Comm: lt-nfqnl_test Not tainted
      > task: ffff8800b9c8c050 ti: ffff8800ba9d8000 task.ti: ffff8800ba9d8000
      > RIP: 0010:[<0000000100000001>]  [<0000000100000001>] 0x100000001
      > RSP: 0018:ffff8800ba9dba40  EFLAGS: 00010a16
      > RAX: ffff8800bab48a00 RBX: ffff8800ba9dba90 RCX: ffff8800ba9dba90
      > RDX: ffff8800b9c10128 RSI: ffff8800ba940900 RDI: ffff8800bab48a00
      > RBP: ffff8800b9c10128 R08: ffffffff82976660 R09: ffff8800ba9dbb28
      > R10: dead000000100100 R11: dead000000200200 R12: ffff8800ba940900
      > R13: ffffffff8313fd50 R14: ffff8800b9c95200 R15: 0000000000000000
      > FS:  00007fb91fc34700(0000) GS:ffff8800bfa00000(0000) knlGS:0000000000000000
      > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      > CR2: 0000000100000001 CR3: 00000000babfb000 CR4: 00000000000007f0
      > Stack:
      >  ffffffff8206ab0f ffffffff82982240 ffff8800bab48a00 ffff8800b9c100a8
      >  ffff8800b9c10100 0000000000000001 ffff8800ba940900 ffff8800b9c10128
      >  ffffffff8206bd65 ffff8800bfb0d5e0 ffff8800bab48a00 0000000000014dc0
      > Call Trace:
      >  [<ffffffff8206ab0f>] ? nf_iterate+0x4f/0xa0
      >  [<ffffffff8206bd65>] ? nf_reinject+0x125/0x190
      >  [<ffffffff8206dee5>] ? nfqnl_recv_verdict+0x255/0x360
      >  [<ffffffff81386290>] ? nla_parse+0x80/0xf0
      >  [<ffffffff8206c42c>] ? nfnetlink_rcv_msg+0x13c/0x240
      >  [<ffffffff811b2fec>] ? __memcg_kmem_get_cache+0x4c/0x150
      >  [<ffffffff8206c2f0>] ? nfnl_lock+0x20/0x20
      >  [<ffffffff82068159>] ? netlink_rcv_skb+0xa9/0xc0
      >  [<ffffffff820677bf>] ? netlink_unicast+0x12f/0x1c0
      >  [<ffffffff82067ade>] ? netlink_sendmsg+0x28e/0x650
      >  [<ffffffff81fdd814>] ? sock_sendmsg+0x44/0x50
      >  [<ffffffff81fde07b>] ? ___sys_sendmsg+0x2ab/0x2c0
      >  [<ffffffff810e8f73>] ? __wake_up+0x43/0x70
      >  [<ffffffff8141a134>] ? tty_write+0x1c4/0x2a0
      >  [<ffffffff81fde9f4>] ? __sys_sendmsg+0x44/0x80
      >  [<ffffffff823ff8d7>] ? system_call_fastpath+0x12/0x6a
      > Code:  Bad RIP value.
      > RIP  [<0000000100000001>] 0x100000001
      >  RSP <ffff8800ba9dba40>
      > CR2: 0000000100000001
      > ---[ end trace 08eb65d42362793f ]---
      
      Cc: stable@vger.kernel.org
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8405a8ff
  11. 22 6月, 2015 1 次提交
  12. 19 6月, 2015 9 次提交
  13. 16 6月, 2015 3 次提交
  14. 15 6月, 2015 1 次提交
    • M
      sctp: fix ASCONF list handling · 2d45a02d
      Marcelo Ricardo Leitner 提交于
      ->auto_asconf_splist is per namespace and mangled by functions like
      sctp_setsockopt_auto_asconf() which doesn't guarantee any serialization.
      
      Also, the call to inet_sk_copy_descendant() was backuping
      ->auto_asconf_list through the copy but was not honoring
      ->do_auto_asconf, which could lead to list corruption if it was
      different between both sockets.
      
      This commit thus fixes the list handling by using ->addr_wq_lock
      spinlock to protect the list. A special handling is done upon socket
      creation and destruction for that. Error handlig on sctp_init_sock()
      will never return an error after having initialized asconf, so
      sctp_destroy_sock() can be called without addrq_wq_lock. The lock now
      will be take on sctp_close_sock(), before locking the socket, so we
      don't do it in inverse order compared to sctp_addr_wq_timeout_handler().
      
      Instead of taking the lock on sctp_sock_migrate() for copying and
      restoring the list values, it's preferred to avoid rewritting it by
      implementing sctp_copy_descendant().
      
      Issue was found with a test application that kept flipping sysctl
      default_auto_asconf on and off, but one could trigger it by issuing
      simultaneous setsockopt() calls on multiple sockets or by
      creating/destroying sockets fast enough. This is only triggerable
      locally.
      
      Fixes: 9f7d653b ("sctp: Add Auto-ASCONF support (core).")
      Reported-by: NJi Jianwen <jiji@redhat.com>
      Suggested-by: NNeil Horman <nhorman@tuxdriver.com>
      Suggested-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2d45a02d
  15. 12 6月, 2015 5 次提交
  16. 11 6月, 2015 1 次提交
  17. 10 6月, 2015 3 次提交
  18. 09 6月, 2015 1 次提交