1. 12 Oct, 2012 1 commit
    • S
      9P: Fix race between p9_write_work() and p9_fd_request() · 759f4298
      Simon Derr committed
      Race scenario:
      
      thread A                       thread B
      
      p9_write_work()                p9_fd_request()
      
      if (list_empty
        (&m->unsent_req_list))
        ...
      
                                     spin_lock(&client->lock);
                                     req->status = REQ_STATUS_UNSENT;
                                     list_add_tail(..., &m->unsent_req_list);
                                     spin_unlock(&client->lock);
                                     ....
                                     if (n & POLLOUT &&
                                         !test_and_set_bit(Wworksched, &m->wsched))
                                             schedule_work(&m->wq);
                                     --> not done because Wworksched is set
      
        clear_bit(Wworksched, &m->wsched);
        return;
      
      --> nobody will take care of sending the new request.
      
      This is not very likely to happen though, because p9_write_work()
      being called with an empty unsent_req_list is not frequent.
      But this also means that taking the lock earlier will not be costly.
      Signed-off-by: Simon Derr <simon.derr@bull.net>
      Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
      759f4298
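
The interleaving above can be replayed deterministically in a small single-threaded model. This is only a sketch under stated assumptions: `struct mux_model`, `model_request()`, `worker_try_exit()` and the `replay_*()` helpers are hypothetical names invented for illustration, and the model replaces the real spinlock with the simpler idea that "check the list and clear the bit" either happens as two separate steps (racy) or as one indivisible step (what holding the lock earlier achieves).

```c
#include <stdbool.h>

/* Toy stand-in for the connection state; not the kernel's real structure. */
struct mux_model {
    int  unsent;     /* length of the unsent request list */
    bool worksched;  /* models the Wworksched bit */
};

/* Thread B: queue a request, then try to schedule the write worker.
 * Returns true if a worker run was scheduled. */
static bool model_request(struct mux_model *m)
{
    m->unsent++;
    if (m->worksched)
        return false;      /* bit already set: a worker is assumed active */
    m->worksched = true;
    return true;
}

/* Worker exit with check and clear as ONE step (assumption: this is what
 * taking the lock before the emptiness check provides). */
static bool worker_try_exit(struct mux_model *m)
{
    if (m->unsent != 0)
        return false;      /* work appeared: keep sending */
    m->worksched = false;
    return true;
}

/* A request is stranded if it is queued and no worker is scheduled. */
static bool stranded(const struct mux_model *m)
{
    return m->unsent > 0 && !m->worksched;
}

/* Replays the racy interleaving from the commit message. */
static bool replay_racy(void)
{
    struct mux_model m = { .unsent = 0, .worksched = true };
    bool saw_empty = (m.unsent == 0);  /* A: unlocked emptiness check */
    model_request(&m);                 /* B: queues, cannot schedule */
    if (saw_empty)
        m.worksched = false;           /* A: clears the bit and returns */
    return stranded(&m);
}

/* Replays both orderings that remain once check+clear is indivisible. */
static bool replay_fixed(void)
{
    struct mux_model a = { .unsent = 0, .worksched = true };
    worker_try_exit(&a);               /* A exits first ... */
    model_request(&a);                 /* ... so B schedules a new run */

    struct mux_model b = { .unsent = 0, .worksched = true };
    model_request(&b);                 /* B queues first ... */
    worker_try_exit(&b);               /* ... so A sees the request and stays */

    return stranded(&a) || stranded(&b);
}
```

In this model `replay_racy()` reports a stranded request, while neither ordering in `replay_fixed()` does.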
  2. 10 Oct, 2012 1 commit
    • J
      RDS: fix rds-ping spinlock recursion · 5175a5e7
      jeff.liu committed
      This is the revised patch for fixing rds-ping spinlock recursion,
      following Venkat's suggestions.

      The RDS ping/pong over TCP feature has been broken for years (2.6.39 to
      3.6.0) because we have to set TCP cork and call kernel_sendmsg() between
      ping and pong, both of which need to lock "struct sock *sk". However, this
      lock is already held before the rds_tcp_data_ready() callback is
      triggered. As a result, we always hit spinlock recursion, which
      results in a system panic.
      
      Given that RDS ping is only used to test the connectivity and not for
      serious performance measurements, we can queue the pong transmit to
      rds_wq as a delayed response.
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      CC: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
      CC: David S. Miller <davem@davemloft.net>
      CC: James Morris <james.l.morris@oracle.com>
      Signed-off-by: Jie Liu <jeff.liu@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5175a5e7
  3. 09 Oct, 2012 13 commits
    • M
      rbtree: empty nodes have no color · 4c199a93
      Michel Lespinasse committed
      Empty nodes have no color.  We can make use of this property to simplify
      the code emitted by the RB_EMPTY_NODE and RB_CLEAR_NODE macros.  Also,
      we can get rid of the rb_init_node function which had been introduced by
      commit 88d19cf3 ("timers: Add rb_init_node() to allow for stack
      allocated rb nodes") to avoid some issue with the empty node's color not
      being initialized.
      
      I'm not sure what the RB_EMPTY_NODE checks in rb_prev() / rb_next() are
      doing there, though.  axboe introduced them in commit 10fd48f2
      ("rbtree: fixed reversed RB_EMPTY_NODE and rb_next/prev").  The way I
      see it, the 'empty node' abstraction is only used by rbtree users to
      flag nodes that they haven't inserted in any rbtree, so asking the
      predecessor or successor of such nodes doesn't make any sense.
      
      One final rb_init_node() caller was recently added in sysctl code to
      implement faster sysctl name lookups.  This code doesn't make use of
      RB_EMPTY_NODE at all, and from what I could see it only called
      rb_init_node() under the mistaken assumption that such initialization was
      required before node insertion.
      
      [sfr@canb.auug.org.au: fix net/ceph/osd_client.c build]
      Signed-off-by: Michel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: David Woodhouse <David.Woodhouse@intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Daniel Santos <daniel.santos@pobox.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4c199a93
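
The "empty nodes have no color" idea can be illustrated with a small userspace sketch. Everything here is an assumption made for illustration: `struct toy_rb_node` and the `toy_*` helpers are hypothetical names, not the kernel's macros. The sketch assumes the parent pointer and the color share one word (color in the low bit) and that an empty node is one recorded as its own parent, so both the test and the clear can work on the whole word without masking color bits.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy node: the low bit of parent_color holds the color, the remaining
 * bits hold the parent pointer (nodes are assumed to be word-aligned). */
struct toy_rb_node {
    uintptr_t parent_color;
    struct toy_rb_node *left, *right;
};

#define TOY_RED   0u
#define TOY_BLACK 1u

/* Mark a node as not inserted in any tree: it becomes its own parent.
 * The whole word is written, so no color has to be preserved. */
static void toy_clear_node(struct toy_rb_node *n)
{
    n->parent_color = (uintptr_t)n;
}

/* A plain word comparison is enough; no color bits need masking. */
static bool toy_empty_node(const struct toy_rb_node *n)
{
    return n->parent_color == (uintptr_t)n;
}

/* Linking overwrites the whole word too, so a node does not need any
 * separate initialization before it is inserted. */
static void toy_link_node(struct toy_rb_node *n, struct toy_rb_node *parent)
{
    n->parent_color = (uintptr_t)parent | TOY_RED;
    n->left = n->right = NULL;
}
```

Because linking rewrites the full word, a caller that never tests for emptiness has no reason to initialize the node first, which matches the observation about the sysctl caller.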
    • J
      ipvs: fix ARP resolving for direct routing mode · ad4d3ef8
      Julian Anastasov committed
      After the change "Make neigh lookups directly in output packet path"
      (commit a263b309) IPVS cannot reach the real server in DR mode
      because we resolve the destination address from the IP header, not from
      the route neighbour. Use the new FLOWI_FLAG_KNOWN_NH flag to request
      output routes with a known nexthop, so that the nexthop takes precedence
      when resolving.
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ad4d3ef8
    • J
      ipv4: Add FLOWI_FLAG_KNOWN_NH · c92b9655
      Julian Anastasov committed
      Add a flag to request that the output route be
      returned with a known rt_gateway, in case we want to use
      it as the nexthop for neighbour resolving.

      The returned route can be cached as follows:

      - in NH exception: because the cached routes are not shared
        with other destinations
      - in FIB NH: when using a gateway, because all destinations for
        the NH share the same gateway

      As a last option, to return rt_gateway!=0 we have to
      set DST_NOCACHE.
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c92b9655
    • J
      ipv4: introduce rt_uses_gateway · 155e8336
      Julian Anastasov committed
      Add a new flag to remember when a route is via a gateway.
      We will use it to allow rt_gateway to contain the address of a
      directly connected host for the cases when DST_NOCACHE is
      used or when the NH exception caches a per-destination route
      without the DST_NOCACHE flag, i.e. when routes are not used for
      other destinations. This way we force neighbour
      resolving to work with the routed destination while still
      using a different address in the packet, a feature needed
      for IPVS-DR, where the original packet for the virtual IP is routed
      via the route to the real IP.
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      155e8336
    • J
      ipv4: make sure nh_pcpu_rth_output is always allocated · f8a17175
      Julian Anastasov committed
      Avoid checking nh_pcpu_rth_output in the fast path;
      abort fib_info creation on alloc_percpu failure.
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f8a17175
    • J
      ipv4: fix forwarding for strict source routes · e0adef0f
      Julian Anastasov committed
      After the change "Adjust semantics of rt->rt_gateway"
      (commit f8126f1d) rt_gateway can be 0 but ip_forward() compares
      it directly with the nexthop. What we want here is to check whether traffic
      goes to a directly connected nexthop and to fail if a gateway is used.
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e0adef0f
    • J
      ipv4: fix sending of redirects · e81da0e1
      Julian Anastasov committed
      After "Cache input routes in fib_info nexthops" (commit
      d2d68ba9) and "Elide fib_validate_source() completely when possible"
      (commit 7a9bc9b8) we cannot send ICMP redirects. It seems we
      should not cache the RTCF_DOREDIRECT flag in nh_rth_input because
      the same fib_info can be used for traffic that is not redirected,
      e.g. from other input devices or from sources that are not in the same subnet.

      As a result, we have to disable the caching of the RTCF_DOREDIRECT
      flag and force source validation when forwarding
      traffic to the input device. If traffic comes from a directly connected
      source, we allow redirection as was done before both changes.

      Avoid setting RTCF_DOREDIRECT if IN_DEV_TX_REDIRECTS
      is disabled; this can avoid source address validation and
      helps with caching the routes.

      After the change "Adjust semantics of rt->rt_gateway"
      (commit f8126f1d) we should make sure our ICMP_REDIR_HOST messages
      contain daddr instead of 0.0.0.0 when the target is directly connected.
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e81da0e1
    • E
      ipv6: gro: fix IPV6_GRO_CB(skb)->proto problem · 86347245
      Eric Dumazet committed
      It seems IPV6_GRO_CB(skb)->proto can be destroyed in skb_gro_receive()
      if a new skb is allocated (to serve as an anchor for frag_list).

      We copy NAPI_GRO_CB() only (not the IPv6-specific part) in:

      *NAPI_GRO_CB(nskb) = *NAPI_GRO_CB(p);

      So we leave IPV6_GRO_CB(nskb)->proto at 0 (fresh skb allocation) instead
      of IPPROTO_TCP (6).

      ipv6_gro_complete() isn't able to call ops->gro_complete()
      [ tcp6_gro_complete() ]

      Fix this by moving proto into NAPI_GRO_CB() and getting rid of
      IPV6_GRO_CB.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      86347245
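
The effect of copying only the generic control block can be reproduced in a few lines of userspace C. This is a sketch, not the kernel code: `struct toy_skb`, the `toy_*_cb` layouts, `TOY_CB_SIZE` and the helper functions are all hypothetical, and the field names other than `proto` are made up. It models the scratch area as a byte buffer in which a protocol-specific structure extends a generic one, then shows what survives when only the generic part is copied into a freshly zeroed buffer.

```c
#include <stddef.h>
#include <string.h>

#define TOY_CB_SIZE 48              /* scratch-area size; an assumption */
enum { TOY_IPPROTO_TCP = 6 };

struct toy_skb { unsigned char cb[TOY_CB_SIZE]; };

/* Before the fix: proto lives only in the protocol-specific extension. */
struct toy_napi_cb { int count; int same_flow; };
struct toy_ipv6_cb { struct toy_napi_cb napi; int proto; };

/* After the fix: proto is part of the generic block itself. */
struct toy_napi_cb_fixed { int count; int same_flow; int proto; };

/* Mimics allocating a fresh anchor and copying only the generic part. */
static struct toy_skb make_anchor(const struct toy_skb *p, size_t generic_size)
{
    struct toy_skb nskb;
    memset(&nskb, 0, sizeof(nskb));          /* fresh allocation: all zero */
    memcpy(nskb.cb, p->cb, generic_size);    /* generic part only */
    return nskb;
}

/* proto as seen on the anchor with the old layout. */
static int proto_after_copy_before_fix(void)
{
    struct toy_skb p = {{0}};
    struct toy_ipv6_cb cb = { { 2, 1 }, TOY_IPPROTO_TCP };
    struct toy_ipv6_cb out;

    memcpy(p.cb, &cb, sizeof(cb));
    struct toy_skb n = make_anchor(&p, sizeof(struct toy_napi_cb));
    memcpy(&out, n.cb, sizeof(out));
    return out.proto;                        /* extension was not copied */
}

/* proto as seen on the anchor once it is part of the generic block. */
static int proto_after_copy_after_fix(void)
{
    struct toy_skb p = {{0}};
    struct toy_napi_cb_fixed cb = { 2, 1, TOY_IPPROTO_TCP };
    struct toy_napi_cb_fixed out;

    memcpy(p.cb, &cb, sizeof(cb));
    struct toy_skb n = make_anchor(&p, sizeof(struct toy_napi_cb_fixed));
    memcpy(&out, n.cb, sizeof(out));
    return out.proto;
}
```

With the old layout the anchor's `proto` reads back as 0; with `proto` inside the generic block the same copy carries the TCP value along.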
    • F
      vlan: don't deliver frames for unknown vlans to protocols · 48cc32d3
      Florian Zumbiehl committed
      6a32e4f9 made the vlan code skip marking
      vlan-tagged frames for not locally configured vlans as PACKET_OTHERHOST if
      there was an rx_handler, as the rx_handler could cause the frame to be received
      on a different (virtual) vlan-capable interface where that vlan might be
      configured.
      
      As rx_handlers do not necessarily return RX_HANDLER_ANOTHER, this could cause
      frames for unknown vlans to be delivered to the protocol stack as if they had
      been received untagged.
      
      For example, if an ipv6 router advertisement that's tagged for a locally not
      configured vlan is received on an interface with macvlan interfaces attached,
      macvlan's rx_handler returns RX_HANDLER_PASS after delivering the frame to the
      macvlan interfaces, which caused it to be passed to the protocol stack, leading
      to ipv6 addresses for the announced prefix being configured even though those
      are completely unusable on the underlying interface.
      
      The fix moves marking as PACKET_OTHERHOST after the rx_handler so the
      rx_handler, if there is one, sees the frame unchanged, but afterwards,
      before the frame is delivered to the protocol stack, it gets marked whether
      there is an rx_handler or not.
      Signed-off-by: Florian Zumbiehl <florz@florz.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      48cc32d3
    • E
      net: gro: selective flush of packets · 2e71a6f8
      Eric Dumazet committed
      Current GRO can hold packets in gro_list for an almost unlimited
      time if the napi->poll() handler consumes its budget over and over.

      In this case, napi_complete()/napi_gro_flush() are not called.

      Another problem is that gro_list is flushed in an unfriendly way:
      we scan the list and complete packets in reverse order
      (youngest packets first, oldest packets last).
      This defeats any priorities the sender may have set up.

      Since GRO currently only stores TCP packets, we don't really notice the
      bug because of retransmits, but this behavior can add unexpected
      latencies, particularly on mice flows clamped by elephant flows.

      This patch makes sure no packet can stay more than 1 ms in the queue, and
      only in stress situations.

      It also completes packets in the right order to minimize latencies.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Jesse Gross <jesse@nicira.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2e71a6f8
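
The two behaviours described, an age-bounded flush and oldest-first completion, can be sketched with a plain array. This is a simplified model under stated assumptions: `struct toy_pkt`, its `age` stamp, the `now` tick and `toy_gro_flush()` are hypothetical names, and the held list is assumed to be kept oldest first. A selective flush completes every packet stamped before the current tick and leaves the ones stamped in the current tick in place.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy held packet: an id plus the tick at which it was queued. */
struct toy_pkt {
    int id;
    unsigned long age;
};

/* Flush held[] (ordered oldest first). Completed ids are written to out[]
 * in completion order; survivors are compacted to the front of held[].
 * With flush_old set, packets stamped with the current tick are kept.
 * Returns the number of packets completed. */
static size_t toy_gro_flush(struct toy_pkt *held, size_t *nheld,
                            unsigned long now, bool flush_old, int *out)
{
    size_t done = 0;

    while (done < *nheld) {
        if (flush_old && held[done].age == now)
            break;                   /* this and all later packets are young */
        out[done] = held[done].id;   /* oldest packets complete first */
        done++;
    }

    for (size_t i = done; i < *nheld; i++)
        held[i - done] = held[i];    /* keep the young packets queued */
    *nheld -= done;
    return done;
}
```

Calling it with `flush_old` set on every poll bounds how long a packet can wait even when the poll handler keeps exhausting its budget; calling it with `flush_old` clear empties the list entirely.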
    • S
      ipv4: Don't report stale pmtu values to userspace · ee9a8f7a
      Steffen Klassert committed
      We report cached pmtu values even if they are already expired.
      Change this to not report these values after they are expired
      and fix a race in the expire time calculation, as suggested by
      Eric Dumazet.
      Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ee9a8f7a
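
A minimal sketch of "do not report expired values" follows. It rests on assumptions that the commit message does not spell out: `struct toy_route`, `toy_time_before()` and `toy_report_pmtu()` are hypothetical, an `expires` of 0 is taken to mean "no expiry set", and the race is assumed to come from sampling the clock more than once, which the sketch avoids by taking `now` as a single argument.

```c
#include <stdbool.h>

/* Toy route entry: a learned path MTU and the tick at which it expires. */
struct toy_route {
    unsigned int  pmtu;     /* 0 means no learned value */
    unsigned long expires;  /* 0 means no expiry set (assumption) */
};

/* Wrap-safe "a is earlier than b" for free-running tick counters. */
static bool toy_time_before(unsigned long a, unsigned long b)
{
    return (long)(a - b) < 0;
}

/* Returns the pmtu to report, or 0 when there is none or it has expired.
 * *remaining receives the lifetime left, computed from one clock sample
 * so the check and the subtraction cannot disagree. */
static unsigned int toy_report_pmtu(const struct toy_route *rt,
                                    unsigned long now,
                                    unsigned long *remaining)
{
    unsigned long expires = rt->expires;   /* read once */

    *remaining = 0;
    if (expires && toy_time_before(now, expires))
        *remaining = expires - now;

    return (rt->pmtu && *remaining) ? rt->pmtu : 0;
}
```

A value is reported only while some lifetime remains; at or after the expiry tick the function returns 0 and leaves the remaining lifetime at 0.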
    • S
      ipv4: Don't create nh exception when the device mtu is smaller than the reported pmtu · 7f92d334
      Steffen Klassert committed
      When a local tool like tracepath tries to send packets bigger than
      the device mtu, we create a nh exception and set the pmtu to the device
      mtu. The device mtu does not expire, so check if the device mtu is
      smaller than the reported pmtu and don't create a nh exception in
      that case.
      Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7f92d334
    • S
      ipv4: Always invalidate or update the route on pmtu events · d851c12b
      Steffen Klassert committed
      Some protocols, like IPsec, still cache routes. So we need to invalidate
      the old route on pmtu events to avoid the reuse of stale routes.
      We also need to update the mtu and expire time of the route if we already
      use a nh exception route; otherwise we ignore newly learned pmtu values
      after the first expiration.

      With this patch we always invalidate or update the route on pmtu events.
      Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d851c12b
  4. 08 Oct, 2012 3 commits
  5. 07 Oct, 2012 2 commits
    • E
      net: remove skb recycling · acb600de
      Eric Dumazet committed
      Over time, the skb recycling infrastructure got little interest and
      many bugs. Generic rx path skb allocation now uses page
      fragments for efficient GRO / TCP coalescing, and recycling
      a tx skb for the rx path is not worth the pain.

      The last identified bug is that fat skbs can be recycled
      and can end up using high order pages after a few iterations.

      Thanks to Maxime Bizon, who pointed out that commit
      87151b86 (net: allow pskb_expand_head() to get maximum tailroom)
      introduced this regression for recycled skbs.

      Instead of fixing this bug, let's remove skb recycling.

      Drivers wanting really hot skbs should use build_skb() anyway,
      to allocate/populate the sk_buff right before netif_receive_skb().
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Maxime Bizon <mbizon@freebox.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      acb600de
    • G
      netlink: add reference of module in netlink_dump_start · 6dc878a8
      Gao feng committed
      I get a panic when I run ss -a and rmmod inet_diag at the
      same time.

      This is because netlink_dump uses inet_diag_dump, which belongs to the
      inet_diag module.

      I searched the code and found that many modules have the same problem.  We
      need to add a reference to the module that cb->dump belongs to.

      Thanks for all the help from Stephen, Jan, Eric, Steffen and Pablo.

      Changes from v3:
      make netlink_dump_start inline, as suggested by Pablo and
      Eric.

      Changes from v2:
      delete netlink_dump_done, and call module_put in netlink_dump
      and netlink_sock_destruct.
      Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6dc878a8
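
The idea of pinning the module that owns the dump callback can be modelled with a plain reference count. This is a toy sketch, not the kernel's module API: `struct toy_module`, `struct toy_dump` and all `toy_*` functions are hypothetical stand-ins. The model assumes that unloading is refused while references exist and that taking a reference fails once unloading has begun.

```c
#include <stdbool.h>

/* Toy module with a reference count and an "unload started" flag. */
struct toy_module {
    int  refcnt;
    bool unloading;
};

/* Toy dump state: remembers which module owns the dump callback. */
struct toy_dump {
    struct toy_module *owner;
};

/* Take a reference unless the module is already going away. */
static bool toy_module_get(struct toy_module *m)
{
    if (m->unloading)
        return false;
    m->refcnt++;
    return true;
}

static void toy_module_put(struct toy_module *m)
{
    m->refcnt--;
}

/* Unload succeeds only when nobody holds a reference. */
static bool toy_try_unload(struct toy_module *m)
{
    if (m->refcnt > 0)
        return false;
    m->unloading = true;
    return true;
}

/* Start a dump: pin the callback's owner first. Returns 0 on success,
 * -1 if the owner could not be pinned. */
static int toy_dump_start(struct toy_dump *cb)
{
    return toy_module_get(cb->owner) ? 0 : -1;
}

/* Finish (or tear down) a dump: drop the reference taken at start. */
static void toy_dump_finish(struct toy_dump *cb)
{
    toy_module_put(cb->owner);
}
```

While a dump is in flight the owner cannot be unloaded, and a dump that starts after unloading has begun fails instead of calling into code that is going away.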
  6. 06 Oct, 2012 2 commits
  7. 05 Oct, 2012 7 commits
  8. 03 Oct, 2012 3 commits
  9. 02 Oct, 2012 8 commits