1. 23 5月, 2015 8 次提交
  2. 22 5月, 2015 1 次提交
    • D
      net: sched: fix call_rcu() race on classifier module unloads · c78e1746
      Daniel Borkmann 提交于
      Vijay reported that a loop as simple as ...
      
        while true; do
          tc qdisc add dev foo root handle 1: prio
          tc filter add dev foo parent 1: u32 match u32 0 0  flowid 1
          tc qdisc del dev foo root
          rmmod cls_u32
        done
      
      ... will panic the kernel. Moreover, he bisected the change
      apparently introducing it to 78fd1d0a ("netlink: Re-add
      locking to netlink_lookup() and seq walker").
      
      The removal of synchronize_net() from the netlink socket
      triggering the qdisc to be removed, seems to have uncovered
      an RCU resp. module reference count race from the tc API.
      Given that RCU conversion was done after e341694e ("netlink:
      Convert netlink_lookup() to use RCU protected hash table")
      which added the synchronize_net() originally, occasion of
      hitting the bug was less likely (not impossible though):
      
      When qdiscs that i) support attaching classifiers and,
      ii) have at least one of them attached, get deleted, they
      invoke tcf_destroy_chain(), and thus call into ->destroy()
      handler from a classifier module.
      
      After RCU conversion, all classifier that have an internal
      prio list, unlink them and initiate freeing via call_rcu()
      deferral.
      
      Meanhile, tcf_destroy() releases already reference to the
      tp->ops->owner module before the queued RCU callback handler
      has been invoked.
      
      Subsequent rmmod on the classifier module is then not prevented
      since all module references are already dropped.
      
      By the time, the kernel invokes the RCU callback handler from
      the module, that function address is then invalid.
      
      One way to fix it would be to add an rcu_barrier() to
      unregister_tcf_proto_ops() to wait for all pending call_rcu()s
      to complete.
      
      synchronize_rcu() is not appropriate as under heavy RCU
      callback load, registered call_rcu()s could be deferred
      longer than a grace period. In case we don't have any pending
      call_rcu()s, the barrier is allowed to return immediately.
      
      Since we came here via unregister_tcf_proto_ops(), there
      are no users of a given classifier anymore. Further nested
      call_rcu()s pointing into the module space are not being
      done anywhere.
      
      Only cls_bpf_delete_prog() may schedule a work item, to
      unlock pages eventually, but that is not in the range/context
      of cls_bpf anymore.
      
      Fixes: 25d8c0d5 ("net: rcu-ify tcf_proto")
      Fixes: 9888faef ("net: sched: cls_basic use RCU")
      Reported-by: NVijay Subramanian <subramanian.vijay@gmail.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Cc: John Fastabend <john.r.fastabend@intel.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Thomas Graf <tgraf@suug.ch>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Alexei Starovoitov <ast@plumgrid.com>
      Tested-by: NVijay Subramanian <subramanian.vijay@gmail.com>
      Acked-by: NAlexei Starovoitov <ast@plumgrid.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c78e1746
  3. 21 5月, 2015 4 次提交
    • T
      net: phy: Make sure phy_start() always re-enables the phy interrupts · c15e10e7
      Tim Beale 提交于
      This is an alternative way of fixing:
       commit db9683fb ("net: phy: Make sure PHY_RESUMING state change
                            is always processed")
      
      When the PHY state transitions from PHY_HALTED to PHY_RESUMING, there are
      two things we need to do:
      1). Re-enable interrupts (and power up the physical link, if powered down)
      2). Update the PHY state and net-device based on the link status.
      
      There's no strict reason why #1 has to be done from within the main
      phy_state_machine() function. There is a risk that other changes to the
      PHY (e.g. setting speed/duplex, which calls phy_start_aneg()) could cause
      a subsequent state transition before phy_state_machine() has processed
      the PHY_RESUMING state change. This would leave the PHY with interrupts
      disabled and/or still in the BMCR_PDOWN/low-power mode.
      
      Moving enabling the interrupts and phy_resume() into phy_start() will
      guarantee this work always gets done. As the PHY is already in the HALTED
      state and interrupts are disabled, it shouldn't conflict with any work
      being done in phy_state_machine(). The downside of this change is that if
      the PHY_RESUMING state is ever entered from anywhere else, it'll also have
      to repeat this work.
      Signed-off-by: NTim Beale <tim.beale@alliedtelesis.co.nz>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c15e10e7
    • D
      Merge branch 'ipv6_ecmp_fixes' · 7764b9dd
      David S. Miller 提交于
      Michal Kubecek says:
      
      ====================
      IPv6 ECMP route add/replace fixes
      
      (1) When adding a nexthop of a multipath route fails (e.g. because of a
      conflict with an existing route), we are supposed to delete nexthops
      already added. However, currently we try to also delete all nexthops we
      haven't even tried to add yet so that a "ip route add" command can
      actually remove pre-existing routes if it fails.
      
      (2) Attempt to replace a multipath route results in a broken siblings
      linked list. Following commands (like "ip route del") can then either
      follow a link into freed memory or end in an infinite loop (if the slab
      object has been reused).
      
      v2: fix an omission in first patch
      
      v3: change the semantics of replace operation to better match IPv4
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7764b9dd
    • M
      ipv6: fix ECMP route replacement · 27596472
      Michal Kubeček 提交于
      When replacing an IPv6 multipath route with "ip route replace", i.e.
      NLM_F_CREATE | NLM_F_REPLACE, fib6_add_rt2node() replaces only first
      matching route without fixing its siblings, resulting in corrupted
      siblings linked list; removing one of the siblings can then end in an
      infinite loop.
      
      IPv6 ECMP implementation is a bit different from IPv4 so that route
      replacement cannot work in exactly the same way. This should be a
      reasonable approximation:
      
      1. If the new route is ECMP-able and there is a matching ECMP-able one
      already, replace it and all its siblings (if any).
      
      2. If the new route is ECMP-able and no matching ECMP-able route exists,
      replace first matching non-ECMP-able (if any) or just add the new one.
      
      3. If the new route is not ECMP-able, replace first matching
      non-ECMP-able route (if any) or add the new route.
      
      We also need to remove the NLM_F_REPLACE flag after replacing old
      route(s) by first nexthop of an ECMP route so that each subsequent
      nexthop does not replace previous one.
      
      Fixes: 51ebd318 ("ipv6: add support of equal cost multipath (ECMP)")
      Signed-off-by: NMichal Kubecek <mkubecek@suse.cz>
      Acked-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      27596472
    • M
      ipv6: do not delete previously existing ECMP routes if add fails · 35f1b4e9
      Michal Kubeček 提交于
      If adding a nexthop of an IPv6 multipath route fails, comment in
      ip6_route_multipath() says we are going to delete all nexthops already
      added. However, current implementation deletes even the routes it
      hasn't even tried to add yet. For example, running
      
        ip route add 1234:5678::/64 \
            nexthop via fe80::aa dev dummy1 \
            nexthop via fe80::bb dev dummy1 \
            nexthop via fe80::cc dev dummy1
      
      twice results in removing all routes first command added.
      
      Limit the second (delete) run to nexthops that succeeded in the first
      (add) run.
      
      Fixes: 51ebd318 ("ipv6: add support of equal cost multipath (ECMP)")
      Signed-off-by: NMichal Kubecek <mkubecek@suse.cz>
      Acked-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      35f1b4e9
  4. 20 5月, 2015 7 次提交
  5. 19 5月, 2015 3 次提交
  6. 18 5月, 2015 2 次提交
  7. 17 5月, 2015 4 次提交
    • H
      rhashtable: Add cap on number of elements in hash table · 07ee0722
      Herbert Xu 提交于
      We currently have no limit on the number of elements in a hash table.
      This is a problem because some users (tipc) set a ceiling on the
      maximum table size and when that is reached the hash table may
      degenerate.  Others may encounter OOM when growing and if we allow
      insertions when that happens the hash table perofrmance may also
      suffer.
      
      This patch adds a new paramater insecure_max_entries which becomes
      the cap on the table.  If unset it defaults to max_size * 2.  If
      it is also zero it means that there is no cap on the number of
      elements in the table.  However, the table will grow whenever the
      utilisation hits 100% and if that growth fails, you will get ENOMEM
      on insertion.
      
      As allowing oversubscription is potentially dangerous, the name
      contains the word insecure.
      
      Note that the cap is not a hard limit.  This is done for performance
      reasons as enforcing a hard limit will result in use of atomic ops
      that are heavier than the ones we currently use.
      
      The reasoning is that we're only guarding against a gross over-
      subscription of the table, rather than a small breach of the limit.
      Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      07ee0722
    • T
      net: phy: Make sure PHY_RESUMING state change is always processed · db9683fb
      Tim Beale 提交于
      If phy_start_aneg() was called while the phydev is in the PHY_RESUMING
      state, then its state would immediately transition to PHY_AN (or
      PHY_FORCING). This meant the phy_state_machine() never processed the
      PHY_RESUMING state change, which meant interrupts weren't enabled for the
      PHY. If the PHY used low-power mode (i.e. using BMCR_PDOWN), then the
      physical link wouldn't get powered up again.
      
      There seems no point for phy_start_aneg() to make the PHY_RESUMING -->
      PHY_AN transition, as the state machine will do this anyway. I'm not sure
      about the case where autoneg is disabled, as my patch will change
      behaviour so that the PHY goes to PHY_NOLINK instead of PHY_FORCING. An
      alternative solution would be to move the phy_config_interrupt() and
      phy_resume() work out of the state machine and into phy_start().
      
      The background behind this: we're running linux v3.16.7 and from user-space
      we want to enable the eth port (i.e. do a SIOCSIFFLAGS ioctl with the
      IFF_UP flag) and immediately afterward set the interface's speed/duplex.
      Enabling the interface calls .ndo_open() then phy_start() and the PHY
      transitions PHY_HALTED --> PHY_RESUMING. Setting the speed/duplex ends up
      calling phy_ethtool_sset(), which calls phy_start_aneg() (meanwhile the
      phy_state_machine() hasn't processed the PHY_RESUMING state change yet).
      Signed-off-by: NTim Beale <tim.beale@alliedtelesis.co.nz>
      Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      db9683fb
    • H
      netlink: Reset portid after netlink_insert failure · c0bb07df
      Herbert Xu 提交于
      The commit c5adde94 ("netlink:
      eliminate nl_sk_hash_lock") breaks the autobind retry mechanism
      because it doesn't reset portid after a failed netlink_insert.
      
      This means that should autobind fail the first time around, then
      the socket will be stuck in limbo as it can never be bound again
      since it already has a non-zero portid.
      
      Fixes: c5adde94 ("netlink: eliminate nl_sk_hash_lock")
      Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c0bb07df
    • D
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf · 1d605701
      David S. Miller 提交于
      Pablo Neira Ayuso says:
      
      ====================
      The following patchset contains Netfilter fixes for your net tree, they are:
      
      1) Fix a leak in IPVS, the sysctl table is not released accordingly when
         destroying a netns, patch from Tommi Rantala.
      
      2) Fix a build error when TPROXY and socket are built-in but IPv6 defrag is
         compiled as module, from Florian Westphal.
      
      3) Fix TCP tracket wrt. RFC5961 challenge ACK when in LAST_ACK state, patch
         from Jesper Dangaard Brouer.
      
      4) Fix a bogus WARN_ON() in nf_tables when deleting a set element that stores
         a map, from Mirek Kratochvil.
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1d605701
  8. 16 5月, 2015 6 次提交
  9. 15 5月, 2015 4 次提交
  10. 14 5月, 2015 1 次提交