1. 28 Feb 2015, 1 commit
  2. 02 Feb 2015, 1 commit
  3. 15 Jan 2015, 1 commit
    • team: avoid possible underflow of count_pending value for notify_peers and mcast_rejoin · b0d11b42
      Committed by Jiri Pirko
      This patch fixes a race condition that may set count_pending to -1,
      which results in an unwanted large burst of ARP messages (in the case
      of "notify peers").
      
      Consider following scenario:
      
      count_pending == 2
         CPU0                                           CPU1
      					team_notify_peers_work
      					  atomic_dec_and_test (dec count_pending to 1)
      					  schedule_delayed_work
       team_notify_peers
         atomic_add (adding 1 to count_pending)
      					team_notify_peers_work
      					  atomic_dec_and_test (dec count_pending to 1)
      					  schedule_delayed_work
      					team_notify_peers_work
      					  atomic_dec_and_test (dec count_pending to 0)
         schedule_delayed_work
      					team_notify_peers_work
      					  atomic_dec_and_test (dec count_pending to -1)
      
      Fix this race by using atomic_dec_if_positive, which prevents
      count_pending from dropping below 0.
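
      A minimal sketch of the fixed countdown step, modeled on the team
      driver's worker (rtnl handling and the exact reschedule interval are
      driver specifics and are abbreviated here):

      	static void team_notify_peers_work(struct work_struct *work)
      	{
      		struct team *team = container_of(work, struct team,
      						 notify_peers.dw.work);
      		int val;

      		/* atomic_dec_if_positive() returns -1 without touching the
      		 * counter if it was already 0, so it can never underflow.
      		 */
      		val = atomic_dec_if_positive(&team->notify_peers.count_pending);
      		if (val < 0)
      			return; /* lost the race; another run already finished */
      		call_netdevice_notifiers(NETDEV_NOTIFY_PEERS, team->dev);
      		if (val) /* more notifications still pending */
      			schedule_delayed_work(&team->notify_peers.dw,
      					      msecs_to_jiffies(team->notify_peers.interval));
      	}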
      
      Fixes: fc423ff0 ("team: add peer notification")
      Fixes: 492b200e ("team: add support for sending multicast rejoins")
      Signed-off-by: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: Jiri Benc <jbenc@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 13 Jan 2015, 1 commit
  5. 14 Nov 2014, 1 commit
    • net: generic dev_disable_lro() stacked device handling · fbe168ba
      Committed by Michal Kubeček
      Large receive offloading is known to cause problems if received
      packets are passed on to another host. Therefore the kernel disables
      it by calling dev_disable_lro() whenever a network device is enslaved
      in a bridge or forwarding is enabled for it (or globally). For virtual
      devices we need to disable LRO on the underlying physical device
      (which is actually receiving the packets).
      
      The current dev_disable_lro() code handles this propagation for a
      vlan (including 802.1ad nested vlan), a macvlan, or a vlan on top of
      a macvlan. It doesn't handle other stacked devices and their
      combinations, in particular propagation from a bond to its slaves,
      which often causes problems in virtualization setups.
      
      As we now have generic data structures describing the upper-lower device
      relationship, dev_disable_lro() can be generalized to disable LRO also
      for all lower devices (if any) once it is disabled for the device
      itself.
      
      For bonding and teaming devices, it is necessary to disable LRO not only
      on current slaves at the moment when dev_disable_lro() is called but
      also on any slave (port) added later.
      
      v2: use lower device links for all devices (including vlan and macvlan)
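
      A sketch along the lines of the patch: once LRO is turned off on the
      device itself, walk the generic lower-device list and recurse (only
      the feature bookkeeping is shown, everything else is elided):

      	void dev_disable_lro(struct net_device *dev)
      	{
      		struct net_device *lower_dev;
      		struct list_head *iter;

      		dev->wanted_features &= ~NETIF_F_LRO;
      		netdev_update_features(dev);

      		if (unlikely(dev->features & NETIF_F_LRO))
      			netdev_WARN(dev, "failed to disable LRO!\n");

      		/* covers vlan, macvlan, bond, team and any combination */
      		netdev_for_each_lower_dev(dev, lower_dev, iter)
      			dev_disable_lro(lower_dev);
      	}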
      Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
      Acked-by: Veaceslav Falico <vfalico@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 08 Oct 2014, 1 commit
    • net: better IFF_XMIT_DST_RELEASE support · 02875878
      Committed by Eric Dumazet
      Testing xmit_more support with netperf and connected UDP sockets,
      I found strange dst refcount false sharing.
      
      Current handling of IFF_XMIT_DST_RELEASE is not optimal.
      
      Dropping the dst in validate_xmit_skb() is certainly too late in case
      the packet was queued by CPU X but dequeued by CPU Y.
      
      The logical point to take care of drop/force is in __dev_queue_xmit()
      before even taking qdisc lock.
      
      As Julian Anastasov pointed out, the need for skb_dst() might come
      from some packet schedulers or classifiers.
      
      This patch adds a new helper to cleanly express the needs of various
      drivers or qdiscs/classifiers.

      Drivers that need skb_dst() in their ndo_start_xmit() should call the
      following helper in their setup routine instead of the prior:
      
      	dev->priv_flags &= ~IFF_XMIT_DST_RELEASE;
      ->
      	netif_keep_dst(dev);
      
      Instead of using a single bit, we use two bits: one that may
      eventually be rebuilt in the bonding/team drivers.

      The other one is permanent and blocks IFF_XMIT_DST_RELEASE from being
      rebuilt in bonding/team. Eventually, we could add something smarter
      later.
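
      The helper itself is essentially a one-liner clearing both flags (a
      sketch of the idea; the permanent-bit name follows the description
      above):

      	static inline void netif_keep_dst(struct net_device *dev)
      	{
      		/* clear both the rebuildable and the permanent bit so
      		 * IFF_XMIT_DST_RELEASE is not re-set behind our back
      		 */
      		dev->priv_flags &= ~(IFF_XMIT_DST_RELEASE |
      				     IFF_XMIT_DST_RELEASE_PERM);
      	}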
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 05 Oct 2014, 1 commit
    • team: avoid race condition in scheduling delayed work · 47549650
      Committed by Joe Lawrence
      When team_notify_peers and team_mcast_rejoin are called, they both
      reset their respective .count_pending atomic variable. Then, when the
      actual worker function is executed, the variable is atomically
      decremented. This pattern introduces a potential race condition where
      .count_pending rolls over and the worker function keeps rescheduling
      until .count_pending decrements to zero again:
      
      THREAD 1                           THREAD 2
      
      ========                           ========
      team_notify_peers(teamX)
        atomic_set count_pending = 1
        schedule_delayed_work
                                         team_notify_peers(teamX)
                                         atomic_set count_pending = 1
      team_notify_peers_work
        atomic_dec_and_test
          count_pending = 0
        (return)
                                         schedule_delayed_work
                                         team_notify_peers_work
                                         atomic_dec_and_test
                                           count_pending = -1
                                         schedule_delayed_work
                                         (repeat until count_pending = 0)
      
      Instead of assigning a new value to .count_pending, use atomic_add to
      tack on the additional desired worker function invocations.
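
      A sketch of the scheduling side after the change (team_mcast_rejoin
      is analogous):

      	static void team_notify_peers(struct team *team)
      	{
      		if (!team->notify_peers.count || !netif_running(team->dev))
      			return;
      		/* add to, rather than overwrite, the pending count so a
      		 * concurrent decrement by the worker cannot be lost
      		 */
      		atomic_add(team->notify_peers.count,
      			   &team->notify_peers.count_pending);
      		schedule_delayed_work(&team->notify_peers.dw, 0);
      	}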
      Signed-off-by: Joe Lawrence <joe.lawrence@stratus.com>
      Acked-by: Jiri Pirko <jiri@resnulli.us>
      Fixes: fc423ff0 ("team: add peer notification")
      Fixes: 492b200e ("team: add support for sending multicast rejoins")
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 26 Aug 2014, 1 commit
  9. 06 Aug 2014, 1 commit
  10. 03 Aug 2014, 1 commit
    • net: filter: split 'struct sk_filter' into socket and bpf parts · 7ae457c1
      Committed by Alexei Starovoitov
      Clean up names related to socket filtering and bpf in the following way:
      - everything that deals with sockets keeps the 'sk_*' prefix
      - everything that is pure BPF is changed to the 'bpf_*' prefix
      
      split 'struct sk_filter' into
      struct sk_filter {
      	atomic_t        refcnt;
      	struct rcu_head rcu;
      	struct bpf_prog *prog;
      };
      and
      struct bpf_prog {
              u32                     jited:1,
                                      len:31;
              struct sock_fprog_kern  *orig_prog;
              unsigned int            (*bpf_func)(const struct sk_buff *skb,
                                                  const struct bpf_insn *filter);
              union {
                      struct sock_filter      insns[0];
                      struct bpf_insn         insnsi[0];
                      struct work_struct      work;
              };
      };
      so that 'struct bpf_prog' can be used independently of sockets, which
      cleans up the 'unattached' bpf use cases
      
      split SK_RUN_FILTER macro into:
          SK_RUN_FILTER to be used with 'struct sk_filter *' and
          BPF_PROG_RUN to be used with 'struct bpf_prog *'
      
      __sk_filter_release(struct sk_filter *) gains a
      __bpf_prog_release(struct bpf_prog *) helper function

      also perform related renames for the functions that work
      with 'struct bpf_prog *', along the same lines:
      
      sk_filter_size -> bpf_prog_size
      sk_filter_select_runtime -> bpf_prog_select_runtime
      sk_filter_free -> bpf_prog_free
      sk_unattached_filter_create -> bpf_prog_create
      sk_unattached_filter_destroy -> bpf_prog_destroy
      sk_store_orig_filter -> bpf_prog_store_orig_filter
      sk_release_orig_filter -> bpf_release_orig_filter
      __sk_migrate_filter -> bpf_migrate_filter
      __sk_prepare_filter -> bpf_prepare_filter
      
      API for attaching classic BPF to a socket stays the same:
      sk_attach_filter(prog, struct sock *)/sk_detach_filter(struct sock *)
      and SK_RUN_FILTER(struct sk_filter *, ctx) to execute a program
      which is used by sockets, tun, af_packet
      
      API for 'unattached' BPF programs becomes:
      bpf_prog_create(struct bpf_prog **)/bpf_prog_destroy(struct bpf_prog *)
      and BPF_PROG_RUN(struct bpf_prog *, ctx) to execute a program
      which is used by isdn, ppp, team, seccomp, ptp, xt_bpf, cls_bpf, test_bpf
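
      For illustration, a minimal caller of the renamed unattached API
      might look like this (the one-instruction accept-all program is a
      placeholder; skb comes from the calling context):

      	struct sock_filter insns[] = {
      		BPF_STMT(BPF_RET | BPF_K, 0xffff),	/* accept packet */
      	};
      	struct sock_fprog_kern fprog = {
      		.len	= ARRAY_SIZE(insns),
      		.filter	= insns,
      	};
      	struct bpf_prog *prog;
      	unsigned int res;
      	int err;

      	err = bpf_prog_create(&prog, &fprog);	/* was sk_unattached_filter_create */
      	if (err)
      		return err;
      	res = BPF_PROG_RUN(prog, skb);		/* was SK_RUN_FILTER */
      	bpf_prog_destroy(prog);			/* was sk_unattached_filter_destroy */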
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 01 Aug 2014, 1 commit
  12. 31 Jul 2014, 1 commit
    • net: filter: don't release unattached filter through call_rcu() · 34c5bd66
      Committed by Pablo Neira
      sk_unattached_filter_destroy() does not always need to release the
      filter object via rcu. Since this filter is never attached to the
      socket, the caller should be responsible for releasing the filter
      in a safe way, which may not necessarily imply rcu.
      
      This is a short summary of clients of this function:
      
      1) xt_bpf.c and cls_bpf.c use the bpf matchers from rules; these rules
         are removed from the packet path before the filter is released. Thus,
         the framework makes sure the filter is safely removed.
      
      2) In the ppp driver, the ppp_lock ensures serialization between the
         xmit and filter attachment/detachment path. This doesn't use rcu
         so deferred release via rcu makes no sense.
      
      3) In the isdn/ppp driver, it is called from isdn_ppp_release() and
         isdn_ppp_ioctl(). This driver uses mutexes and spinlocks, no rcu.
         Thus, deferred rcu makes no sense to me either; the deferred releases
         may just be masking the effects of a wrong locking strategy, which
         should be fixed in the driver itself.
      
      4) In the team driver, this is the only place where the rcu
         synchronization with an unattached filter is used. Therefore, this
         patch introduces synchronize_rcu(), which is called from the
         genetlink path to make sure the filter doesn't go away while packets
         are still walking over it (see the sketch below). I think we can
         revisit this once struct bpf_prog (that only wraps specific bpf code
         bits) is in place, then add some specific struct rcu_head in the
         scope of the team driver if Jiri thinks this is needed.
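
      The resulting pattern in the team driver is the classic RCU
      publish-then-wait replace; a simplified sketch (field names
      abbreviated from the loadbalance mode's filter setter):

      	/* publish the new filter, then wait out all packet-path
      	 * readers before freeing the old one
      	 */
      	rcu_assign_pointer(lb_priv->fp, new_fp);
      	if (orig_fp) {
      		synchronize_rcu();
      		sk_unattached_filter_destroy(orig_fp);
      	}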
      
      Deferred rcu release for unattached filters was originally introduced
      in 302d6637 ("filter: Allow to create sk-unattached filters").
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  13. 03 Jun 2014, 1 commit
    • team: fix mtu setting · 9d0d68fa
      Committed by Jiri Pirko
      Currently it is not possible to set the mtu of a team device which
      has a port enslaved to it. The reason is that when team_change_mtu()
      calls dev_set_mtu() for a port device, the notifier for the
      NETDEV_PRECHANGEMTU event is called and team_device_event() returns
      NOTIFY_BAD, forbidding the change. So fix this by returning
      NOTIFY_DONE here in case the team itself is changing the mtu in
      team_change_mtu().
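
      A sketch of the fix as described, with a flag set only while the team
      driver itself walks its ports (names follow the patch; error handling
      and locking elided):

      	/* in team_change_mtu() */
      	team->port_mtu_change_allowed = true;
      	list_for_each_entry(port, &team->port_list, list)
      		dev_set_mtu(port->dev, new_mtu);
      	team->port_mtu_change_allowed = false;

      	/* in team_device_event() */
      	case NETDEV_PRECHANGEMTU:
      		if (!port->team->port_mtu_change_allowed)
      			return NOTIFY_BAD;	/* refuse external mtu changes */
      		return NOTIFY_DONE;		/* team itself is changing it */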
      
      Introduced-by: 3d249d4c "net: introduce ethernet teaming device"
      Signed-off-by: Jiri Pirko <jiri@resnulli.us>
      Acked-by: Flavio Leitner <fbl@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  14. 25 May 2014, 1 commit
  15. 24 May 2014, 1 commit
    • net: filter: let unattached filters use sock_fprog_kern · b1fcd35c
      Committed by Daniel Borkmann
      The sk_unattached_filter_create() API is used by BPF filters that
      are not directly attached or related to sockets, and are used in
      team, ptp, xt_bpf, cls_bpf, etc. As such, all users do their own
      internal management of obtaining filter blocks, and thus already
      have them in kernel memory and set up before calling into
      sk_unattached_filter_create(). As a result, due to the __user
      annotation in sock_fprog, sparse triggers false positives (incorrect
      type in assignment [different address space]) when filters are set up
      before being passed to sk_unattached_filter_create(). Therefore, let
      the sk_unattached_filter_create() API use sock_fprog_kern to overcome
      this issue.
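
      The kernel-space variant is essentially the classic sock_fprog with
      the __user annotation dropped from the filter pointer:

      	struct sock_fprog_kern {
      		u16			len;
      		struct sock_filter	*filter;	/* kernel pointer, not __user */
      	};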
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  16. 23 May 2014, 1 commit
    • teaming: fix vlan_features computing · 3625920b
      Committed by Michal Kubeček
      __team_compute_features() uses netdev_increment_features() to
      combine the vlan_features of the slaves into the vlan_features of the
      team. As netdev_increment_features() mostly only adds features, and
      we start with TEAM_VLAN_FEATURES, we can end up with features none of
      the slaves provided.
      
      Initialize vlan_features only with the flags which are both in
      TEAM_VLAN_FEATURES and NETIF_F_ALL_FOR_ALL. Right now there is no
      such feature, so we actually initialize vlan_features to zero, but
      stating it explicitly makes the code more future-proof.
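
      The change boils down to the initial value fed into the fold over the
      slaves; roughly:

      	/* before: could retain features no slave provides */
      	u32 vlan_features = TEAM_VLAN_FEATURES;

      	/* after: start only from the all-for-all subset */
      	u32 vlan_features = TEAM_VLAN_FEATURES & NETIF_F_ALL_FOR_ALL;

      	list_for_each_entry(port, &team->port_list, list)
      		vlan_features = netdev_increment_features(vlan_features,
      					port->dev->vlan_features,
      					TEAM_VLAN_FEATURES);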
      Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  17. 25 Apr 2014, 1 commit
  18. 30 Mar 2014, 1 commit
  19. 15 Mar 2014, 1 commit
  20. 17 Feb 2014, 1 commit
  21. 15 Feb 2014, 1 commit
  22. 23 Jan 2014, 1 commit
  23. 22 Jan 2014, 1 commit
    • random32: add prandom_u32_max and convert open coded users · f337db64
      Committed by Daniel Borkmann
      Many functions have open coded a routine that returns a random number
      in the range [0, N-1]. Under the assumption that we have a PRNG such
      as taus113 that is well distributed in the [0, ~0U] space, we can
      implement such a function as t = (n * m') >> 32, where m' is a random
      number obtained from the PRNG, n is the right open interval border,
      and t is our resulting random number, with n, m', t all in the u32
      universe.
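
      In code, the multiply-and-shift trick is simply:

      	static inline u32 prandom_u32_max(u32 ep_ro)
      	{
      		/* map a uniform u32 sample into [0, ep_ro) without a
      		 * division: widen to u64, multiply, keep the top 32 bits
      		 */
      		return (u32)(((u64) prandom_u32() * ep_ro) >> 32);
      	}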
      
      Let's go with Joe's suggestion and simply call it prandom_u32_max(),
      although technically we have a right-open interval endpoint; that is
      documented. Other users can be migrated to the new prandom_u32_max()
      function later on; for now, we need to make sure to migrate the
      reciprocal_divide() users for the reciprocal_divide() follow-up fixup,
      since their function signatures are going to change.
      
      Joint work with Hannes Frederic Sowa.
      
      Cc: Jakub Zawadzki <darkjames-ws@darkjames.pl>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  24. 17 Jan 2014, 1 commit
  25. 11 Jan 2014, 1 commit
    • net: core: explicitly select a txq before doing l2 forwarding · f663dd9a
      Committed by Jason Wang
      Currently, the tx queue is selected implicitly in
      ndo_dfwd_start_xmit(). This causes several issues:

      - NETIF_F_LLTX was removed for macvlan, so the txq lock was taken for
        the macvlan device instead of the lower device, which misses the
        necessary txq synchronization for the lower device, such as txq
        stopping or freezing required by the dev watchdog or the control
        path.
      - dev_hard_start_xmit() was called with a NULL txq, which bypasses
        the net device watchdog.
      - dev_hard_start_xmit() does not check txq everywhere, which will
        lead to a crash when tso is disabled for the lower device.
      
      Fix this by introducing a new parameter for .ndo_select_queue() that
      explicitly selects a queue in the case of l2 forwarding offload.
      netdev_pick_tx() was also extended to accept this parameter, and
      dev_queue_xmit_accel() was added to do the l2 forwarding
      transmission.

      With these fixes, NETIF_F_LLTX can be preserved for macvlan, and
      there's no need to check txq against NULL in dev_hard_start_xmit().
      There's also no need to keep a dedicated ndo_dfwd_start_xmit(); we
      can just reuse the dev_queue_xmit() code to do the transmission.

      In the future, this will also be required for macvtap l2 forwarding
      support, since it provides a necessary synchronization method.
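
      The interface change, roughly (accel_priv carries the l2 forwarding
      offload context and is NULL on the normal transmit path):

      	/* in struct net_device_ops */
      	u16 (*ndo_select_queue)(struct net_device *dev,
      				struct sk_buff *skb,
      				void *accel_priv);

      	/* new entry point used by the l2 forwarding path */
      	int dev_queue_xmit_accel(struct sk_buff *skb, void *accel_priv);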
      
      Cc: John Fastabend <john.r.fastabend@intel.com>
      Cc: Neil Horman <nhorman@tuxdriver.com>
      Cc: e1000-devel@lists.sourceforge.net
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Acked-by: John Fastabend <john.r.fastabend@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  26. 30 Nov 2013, 1 commit
  27. 20 Nov 2013, 3 commits
  28. 15 Nov 2013, 1 commit
  29. 06 Nov 2013, 1 commit
    • net: Explicitly initialize u64_stats_sync structures for lockdep · 827da44c
      Committed by John Stultz
      In order to enable lockdep on seqcount/seqlock structures, we
      must explicitly initialize any locks.
      
      The u64_stats_sync structure uses a seqcount, and thus we need to
      introduce a u64_stats_init() function and use it to initialize the
      structure.
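
      A typical initialization site then looks like this sketch (my_stats
      and dev_stats are made-up names for a per-cpu stats struct with an
      embedded u64_stats_sync):

      	struct my_stats {
      		u64			rx_packets;
      		u64			rx_bytes;
      		struct u64_stats_sync	syncp;
      	};
      	int i;

      	/* at device setup, once per possible cpu */
      	for_each_possible_cpu(i) {
      		struct my_stats *s = per_cpu_ptr(dev_stats, i);

      		u64_stats_init(&s->syncp);	/* gives lockdep a key for the seqcount */
      	}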
      
      This unfortunately adds a lot of fairly trivial initialization code
      to a number of drivers. But the benefit of ensuring correctness makes
      it worthwhile.
      
      Because these changes are required for lockdep to be enabled, and the
      changes are quite trivial, I've not yet split this patch out into 30-some
      separate patches, as I figured it would be better to get the various
      maintainers thoughts on how to best merge this change along with
      the seqcount lockdep enablement.
      
      Feedback would be appreciated!
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      Acked-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: James Morris <jmorris@namei.org>
      Cc: Jesse Gross <jesse@nicira.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Mirko Lindner <mlindner@marvell.com>
      Cc: Patrick McHardy <kaber@trash.net>
      Cc: Roger Luethi <rl@hellgate.ch>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Simon Horman <horms@verge.net.au>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
      Cc: Wensong Zhang <wensong@linux-vs.org>
      Cc: netdev@vger.kernel.org
      Link: http://lkml.kernel.org/r/1381186321-4906-2-git-send-email-john.stultz@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  30. 04 Sep 2013, 1 commit
  31. 27 Jul 2013, 1 commit
  32. 24 Jul 2013, 3 commits
  33. 12 Jun 2013, 4 commits