  1. 18 Dec 2008, 1 commit
  2. 17 Dec 2008, 1 commit
  3. 16 Dec 2008, 9 commits
    • ipv6: Add IPV6_PKTINFO sticky option support to setsockopt() · b24a2516
      Yang Hongyang authored
      There are three reasons for me to add this support:
      1. When no interface is specified in an IPV6_PKTINFO ancillary data
         item, the interface specified in an IPV6_PKTINFO sticky option
         is used.
      
      RFC3542:
      6.7.  Summary of Outgoing Interface Selection
      
         This document and [RFC-3493] specify various methods that affect the
         selection of the packet's outgoing interface.  This subsection
         summarizes the ordering among those in order to ensure deterministic
         behavior.
      
         For a given outgoing packet on a given socket, the outgoing interface
         is determined in the following order:
      
         1. if an interface is specified in an IPV6_PKTINFO ancillary data
            item, the interface is used.
      
         2. otherwise, if an interface is specified in an IPV6_PKTINFO sticky
            option, the interface is used.
      
      2. When no IPV6_PKTINFO ancillary data is received, getsockopt() should
         return the sticky option value, i.e. the value set with setsockopt().
      
      RFC 3542:
         Issuing getsockopt() for the above options will return the sticky
         option value i.e., the value set with setsockopt().  If no sticky
         option value has been set getsockopt() will return the following
         values:
      
      3. Make the setsockopt() implementation POSIX compliant.
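      
      As a usage sketch only (not from the patch; plain userspace C, error
      handling omitted, and glibc needs _GNU_SOURCE for struct in6_pktinfo):
      
          #define _GNU_SOURCE
          #include <netinet/in.h>
          #include <string.h>
          #include <sys/socket.h>
          
          /* Pin the outgoing interface via the IPV6_PKTINFO sticky option. */
          static int set_sticky_pktinfo(int sock, unsigned int ifindex)
          {
                  struct in6_pktinfo pi;
          
                  memset(&pi, 0, sizeof(pi));
                  pi.ipi6_ifindex = ifindex;   /* used when no ancillary data overrides it */
                  pi.ipi6_addr = in6addr_any;  /* let the kernel pick the source address */
          
                  /* After this patch, getsockopt(IPV6_PKTINFO) returns this value. */
                  return setsockopt(sock, IPPROTO_IPV6, IPV6_PKTINFO, &pi, sizeof(pi));
          }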
      Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Refactor full duplex flow control resolution · bc02ff95
      Steve Glendinning authored
      These 4 drivers have identical full duplex flow control resolution
      functions.  This patch changes them all to use one common function.
      
      The function in question decides whether a device should enable TX and
      RX flow control in a standard way (IEEE 802.3-2005 table 28B-3), so this
      should also be useful for other drivers.
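      
      A sketch of such a resolution helper, following that table (the
      FLOW_CTRL_* names come from the companion mii.h patch; treat this as
      illustrative rather than the exact merged code):
      
          #include <linux/mii.h>
          
          /* lcladv/rmtadv are the local and link-partner advertisement
           * registers (MII_ADVERTISE / MII_LPA). */
          static u8 resolve_flowctrl_fdx(u16 lcladv, u16 rmtadv)
          {
                  u8 cap = 0;
          
                  if (lcladv & rmtadv & ADVERTISE_PAUSE_CAP) {
                          cap = FLOW_CTRL_TX | FLOW_CTRL_RX;   /* symmetric pause */
                  } else if (lcladv & rmtadv & ADVERTISE_PAUSE_ASYM) {
                          if (lcladv & ADVERTISE_PAUSE_CAP)
                                  cap = FLOW_CTRL_RX;          /* we pause only */
                          else if (rmtadv & ADVERTISE_PAUSE_CAP)
                                  cap = FLOW_CTRL_TX;          /* partner pauses only */
                  }
          
                  return cap;
          }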
      Signed-off-by: Steve Glendinning <steve.glendinning@smsc.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Move flow control definitions to mii.h · e18ce346
      Steve Glendinning authored
      The flags used within drivers to indicate TX and RX flow control are
      defined in 4 drivers (and probably more); move these constants to mii.h.
      
      The 3 SMSC drivers use the same constants (FLOW_CTRL_TX), but TG3 uses
      TG3_FLOW_CTRL_TX, so this patch also renames the constants within TG3.
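      
      The shared definitions in <linux/mii.h> presumably boil down to a pair
      of flag bits (values assumed here):
      
          #define FLOW_CTRL_TX	0x01	/* send pause frames */
          #define FLOW_CTRL_RX	0x02	/* honour received pause frames */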
      Signed-off-by: Steve Glendinning <steve.glendinning@smsc.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netfilter: ctnetlink: fix missing CTA_NAT_SEQ_UNSPEC · 092cab7e
      Pablo Neira Ayuso authored
      This patch fixes an inconsistency in nfnetlink_conntrack.h that
      I introduced myself. The problem is that CTA_NAT_SEQ_UNSPEC is
      missing from enum ctattr_natseq. This inconsistency may lead to
      problems in the message parsing in userspace (if the message
      contains the CTA_NAT_SEQ_* attributes, of course).
      
      This patch breaks backward compatibility, however, the only known
      client of this code is libnetfilter_conntrack which indeed crashes
      because it assumes the existence of CTA_NAT_SEQ_UNSPEC to do
      the parsing.
      
      The CTA_NAT_SEQ_* attributes were introduced in 2.6.25.
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ethtool: Add GGRO and SGRO ops · b240a0e5
      Herbert Xu authored
      This patch adds the ethtool ops to enable and disable GRO.  It also
      makes GRO depend on RX checksum offload much the same as how TSO
      depends on SG support.
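      
      From userspace the new ops are reachable through the usual SIOCETHTOOL
      ioctl plumbing; a hedged sketch (standard interfaces, not code from
      this patch):
      
          #include <linux/ethtool.h>
          #include <linux/sockios.h>
          #include <net/if.h>
          #include <string.h>
          #include <sys/ioctl.h>
          
          /* Toggle GRO on an interface via ETHTOOL_SGRO. */
          static int set_gro(int sock, const char *ifname, int enable)
          {
                  struct ethtool_value ev = { .cmd = ETHTOOL_SGRO, .data = enable };
                  struct ifreq ifr;
          
                  memset(&ifr, 0, sizeof(ifr));
                  strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
                  ifr.ifr_data = (void *)&ev;
          
                  /* Enabling GRO should fail while RX checksum offload is off. */
                  return ioctl(sock, SIOCETHTOOL, &ifr);
          }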
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Add skb_gro_receive · 71d93b39
      Herbert Xu authored
      This patch adds the helper skb_gro_receive to merge packets for
      GRO.  The current method is to allocate a new header skb and then
      chain the original packets to its frag_list.  This is done to
      make it easier to integrate into the existing GSO framework.
      
      In future as GSO is moved into the drivers, we can undo this and
      simply chain the original packets together.
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Add Generic Receive Offload infrastructure · d565b0a1
      Herbert Xu authored
      This patch adds the top-level GRO (Generic Receive Offload) infrastructure.
      This is pretty similar to LRO except that this is protocol-independent.
      Instead of holding packets in an lro_mgr structure, they're now held in
      napi_struct.
      
      For drivers that intend to use this, they can set the NETIF_F_GRO bit and
      call napi_gro_receive instead of netif_receive_skb or just call netif_rx.
      The latter will call napi_receive_skb automatically.  When napi_gro_receive
      is used, the driver must either call napi_complete/napi_rx_complete, or
      call napi_gro_flush in softirq context if the driver uses the primitives
      __napi_complete/__napi_rx_complete.
      
      Protocols will set the gro_receive and gro_complete function pointers in
      order to participate in this scheme.
      
      In addition to the packet, gro_receive will get a list of currently held
      packets.  Each packet in the list has a same_flow field which is non-zero
      if it is a potential match for the new packet.  For each packet that may
      match, they also have a flush field which is non-zero if the held packet
      must not be merged with the new packet.
      
      Once gro_receive has determined that the new skb matches a held packet,
      the held packet may be processed immediately if the new skb cannot be
      merged with it.  In this case gro_receive should return the pointer to
      the existing skb in gro_list.  Otherwise the new skb should be merged into
      the existing packet and NULL should be returned, unless the new skb makes
      it impossible for any further merges to be made (e.g., FIN packet) where
      the merged skb should be returned.
      
      Whenever the skb is merged into an existing entry, the gro_receive
      function should set NAPI_GRO_CB(skb)->same_flow.  Note that if an skb
      merely matches an existing entry but can't be merged with it, then
      this shouldn't be set.
      
      If gro_receive finds it pointless to hold the new skb for future merging,
      it should set NAPI_GRO_CB(skb)->flush.
      
      Held packets will be flushed by napi_gro_flush which is called by
      napi_complete and napi_rx_complete.
      
      Currently held packets are stored in a singly linked list, just like LRO.
      The list is limited to a maximum of 8 entries.  In future, this may be
      expanded to use a hash table to allow more flows to be held for merging.
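      
      Putting the driver-facing pieces together, a poll routine for a
      hypothetical driver (mydrv_* names invented) would use the new entry
      points roughly like this:
      
          struct mydrv_ring {
                  struct napi_struct napi;
                  /* ... rx descriptor state ... */
          };
          
          static struct sk_buff *mydrv_next_rx_skb(struct mydrv_ring *ring);
          
          static int mydrv_poll(struct napi_struct *napi, int budget)
          {
                  struct mydrv_ring *ring = container_of(napi, struct mydrv_ring, napi);
                  struct sk_buff *skb;
                  int done = 0;
          
                  while (done < budget && (skb = mydrv_next_rx_skb(ring)) != NULL) {
                          napi_gro_receive(napi, skb);  /* not netif_receive_skb */
                          done++;
                  }
          
                  if (done < budget)
                          napi_complete(napi);          /* also flushes held GRO packets */
          
                  return done;
          }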
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Add frag_list support to GSO · 1a881f27
      Herbert Xu authored
      This patch allows GSO to handle frag_list in a limited way for the
      purposes of allowing packets merged by GRO to be refragmented on
      output.
      
      Most hardware won't (and aren't expected to) support handling GRO
      frag_list packets directly.  Therefore we will perform GSO in
      software for those cases.
      
      However, for drivers that can support it (such as virtual NICs) we
      may not have to segment the packets at all.
      
      Whether the added overhead of GRO/GSO is worthwhile for bridges
      and routers when weighed against the benefit of potentially
      increasing the MTU within the host is still an open question.
      However, for the case of host nodes this is undoubtedly a win.
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Define smp_call_function_many for UP · d2ff9118
      Rusty Russell authored
      Otherwise those using it in transition patches (eg. kvm) can't compile
      with CONFIG_SMP=n:
      
      arch/x86/kvm/../../../virt/kvm/kvm_main.c: In function 'make_all_cpus_request':
      arch/x86/kvm/../../../virt/kvm/kvm_main.c:380: error: implicit declaration of function 'smp_call_function_many'
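      
      The fix is a do-nothing UP stub; a minimal sketch of its shape (assumed,
      not the literal patch - on UP there are no other CPUs to call):
      
          /* include/linux/smp.h, CONFIG_SMP=n */
          static inline void smp_call_function_many(const struct cpumask *mask,
                                                    void (*func)(void *info),
                                                    void *info, bool wait)
          {
                  /* nothing to do without other CPUs */
          }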
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 13 Dec 2008, 2 commits
  5. 11 Dec 2008, 7 commits
    • netns: ip6mr: enable namespace support in ipv6 multicast forwarding code · 8229efda
      Benjamin Thery authored
      This last patch makes the appropriate changes to use and propagate the
      network namespace where needed in IPv6 multicast forwarding code.
      
      This consists mainly of replacing all the remaining init_net occurrences
      with the current netns pointer retrieved from sockets, net devices or
      mfc6_caches, depending on the routine's context.
      
      Some routines receive a new 'struct net' parameter to propagate the current
      netns:
      * ip6mr_get_route
      * ip6mr_cache_report
      * ip6mr_cache_find
      * ip6mr_cache_unresolved
      * mif6_add/mif6_delete
      * ip6mr_mfc_add/ip6mr_mfc_delete
      * ip6mr_reg_vif
      
      All the IPv6 multicast forwarding variables moved to struct netns_ipv6 by
      the previous patches are now referenced in the correct namespace.
      
      Changelog:
      ==========
      * Take into account the net associated to mfc6_cache when matching entries in
        mfc_unres_queue list.
      * Call mroute_clean_tables() in ip6mr_net_exit() to free memory allocated
        per-namespace.
      * Call dev_net_set() in ip6mr_reg_vif() to initialize dev->nd_net 
        correctly.
      Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netns: ip6mr: store netns in struct mfc6_cache · 58701ad4
      Benjamin Thery authored
      This patch stores into struct mfc6_cache the network namespace each
      mfc6_cache belongs to. The new member is mfc6_net.
      
      mfc6_net is assigned at cache allocation and doesn't change during
      the rest of the cache entry life.
      
      This will help to retrieve the current netns around the IPv6 multicast
      forwarding code.
      
      At the moment, all mfc6_cache are allocated in init_net.
      
      Changelog:
      ==========
      * Use write_pnet()/read_pnet() to set and get mfc6_net.
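      
      A sketch of the accessor pair implied by that changelog entry (assumed
      shape; with CONFIG_NET_NS=n, read_pnet() collapses to &init_net and the
      pointer needn't even be stored):
      
          static inline void mfc6_net_set(struct mfc6_cache *mfc, struct net *net)
          {
                  write_pnet(&mfc->mfc6_net, net);   /* done once, at allocation */
          }
          
          static inline struct net *mfc6_net(struct mfc6_cache *mfc)
          {
                  return read_pnet(&mfc->mfc6_net);
          }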
      Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netns: ip6mr: allocate mroute6_socket per-namespace. · bd91b8bf
      Benjamin Thery authored
      Preliminary work to make IPv6 multicast forwarding netns-aware.
      
      Make the IPv6 multicast forwarding mroute6_socket per-namespace by
      moving it into struct netns_ipv6.
      
      At the moment, mroute6_socket is only referenced in init_net.
      Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • smsc911x: add dynamic bus configuration · 2107fb8b
      Steve Glendinning authored
      Convert the driver to select 16-bit or 32-bit bus access at runtime,
      at a small performance cost.
      Signed-off-by: Steve Glendinning <steve.glendinning@smsc.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • KSYM_SYMBOL_LEN fixes · 9c246247
      Hugh Dickins authored
      Miles Lane tailing /sys files hit a BUG which Pekka Enberg has tracked
      to my commit 966c8c12 "sprint_symbol(): use less stack", which exposed
      a bug in slub's list_locations(): kallsyms_lookup() writes a 0 to
      namebuf[KSYM_NAME_LEN-1], but that was beyond the end of the page
      provided.
      
      The 100-byte slop which list_locations() allows at the end of the page
      looks roughly enough for all the other stuff it might print after the
      symbol before it checks again: break out KSYM_SYMBOL_LEN earlier than before.
      
      Latencytop and ftrace are using KSYM_NAME_LEN buffers where they
      need KSYM_SYMBOL_LEN buffers, and vmallocinfo a 2*KSYM_NAME_LEN buffer
      where it wants a KSYM_SYMBOL_LEN buffer: fix those before anyone copies
      them.
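      
      The rule being enforced, in sketch form: a bare symbol name fits in
      KSYM_NAME_LEN, but sprint_symbol() also appends offset/size/module
      decoration and therefore needs the larger buffer:
      
          char buf[KSYM_SYMBOL_LEN];    /* not KSYM_NAME_LEN */
          
          sprint_symbol(buf, address);  /* e.g. "name+0x4/0x10 [module]" */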
      
      [akpm@linux-foundation.org: ftrace.h needs module.h]
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Miles Lane <miles.lane@gmail.com>
      Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
      Acked-by: Steven Rostedt <srostedt@redhat.com>
      Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • revert "percpu_counter: new function percpu_counter_sum_and_set" · 02d21168
      Andrew Morton authored
      Revert
      
          commit e8ced39d
          Author: Mingming Cao <cmm@us.ibm.com>
          Date:   Fri Jul 11 19:27:31 2008 -0400
      
              percpu_counter: new function percpu_counter_sum_and_set
      
      As described in
      
      	revert "percpu counter: clean up percpu_counter_sum_and_set()"
      
      the new percpu_counter_sum_and_set() is racy against updates to the
      cpu-local accumulators on other CPUs.  Revert that change.
      
      This means that ext4 will be slow again.  But correct.
      Reported-by: Eric Dumazet <dada1@cosmosbay.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mingming Cao <cmm@us.ibm.com>
      Cc: <linux-ext4@vger.kernel.org>
      Cc: <stable@kernel.org>		[2.6.27.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • revert "percpu counter: clean up percpu_counter_sum_and_set()" · 71c5576f
      Andrew Morton authored
      Revert
      
          commit 1f7c14c6
          Author: Mingming Cao <cmm@us.ibm.com>
          Date:   Thu Oct 9 12:50:59 2008 -0400
      
              percpu counter: clean up percpu_counter_sum_and_set()
      
      Before this patch we had the following:
      
      percpu_counter_sum(): return the percpu_counter's value
      
      percpu_counter_sum_and_set(): return the percpu_counter's value, copying
      that value into the central value and zeroing the per-cpu counters before
      returning.
      
      After this patch, percpu_counter_sum_and_set() has gone, and
      percpu_counter_sum() gets the old percpu_counter_sum_and_set()
      functionality.
      
      Problem is, as Eric points out, the old percpu_counter_sum_and_set()
      functionality was racy and wrong.  It zeroes out counters on "other" cpus,
      without holding any locks which would prevent races against updates from
      those other CPUs.
      
      This patch reverts 1f7c14c6.  This means
      that percpu_counter_sum_and_set() still has the race, but
      percpu_counter_sum() does not.
      
      Note that this is not a simple revert - ext4 has since started using
      percpu_counter_sum() for its dirty_blocks counter as well.
      
      Note that this revert patch changes percpu_counter_sum() semantics.
      
      Before the patch, a call to percpu_counter_sum() will bring the counter's
      central counter mostly up-to-date, so a following percpu_counter_read()
      will return a close value.
      
      After this patch, a call to percpu_counter_sum() will leave the counter's
      central accumulator unaltered, so a subsequent call to
      percpu_counter_read() can now return a significantly inaccurate result.
      
      If there is any code in the tree which was introduced after
      e8ced39d was merged, and which depends
      upon the new percpu_counter_sum() semantics, that code will break.
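      
      The post-revert contract, sketched with the real API:
      
          #include <linux/percpu_counter.h>
          
          static void semantics(struct percpu_counter *fbc)
          {
                  /* accurate: folds the per-cpu deltas in, alters nothing */
                  s64 exact = percpu_counter_sum(fbc);
          
                  /* fast approximation: central value only, may now drift */
                  s64 approx = percpu_counter_read(fbc);
          
                  (void)exact;
                  (void)approx;
          }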
      Reported-by: Eric Dumazet <dada1@cosmosbay.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mingming Cao <cmm@us.ibm.com>
      Cc: <linux-ext4@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 10 Dec 2008, 1 commit
    • netpoll: fix race on poll_list resulting in garbage entry · 7b363e44
      Neil Horman authored
      A few months back a race was discussed between the netpoll napi service
      path and the fast path through net_rx_action:
      http://kerneltrap.org/mailarchive/linux-netdev/2007/10/16/345470
      
      A patch was submitted for that bug, but I think we missed a case.
      
      Consider the following scenario:
      
      INITIAL STATE
      CPU0 has one napi_struct A on its poll_list
      CPU1 is calling netpoll_send_skb and needs to call poll_napi on the same
      napi_struct A that CPU0 has on its list
      
      
      
      CPU0						CPU1
      net_rx_action					poll_napi
      !list_empty (returns true)			locks poll_lock for A
      						 poll_one_napi
      						  napi->poll
      						   netif_rx_complete
      						    __napi_complete
      						    (removes A from poll_list)
      list_entry(list->next)
      
      
      In the above scenario, net_rx_action assumes that the per-cpu poll_list is
      exclusive to that cpu.  netpoll of course violates that, and because the netpoll
      path can dequeue from the poll list, it's possible for CPU0 to detect a non-empty
      list at the top of the while loop in net_rx_action, but have it become empty by
      the time it calls list_entry.  Since the poll_list isn't surrounded by any other
      structure, the returned data from that list_entry call in this situation is
      garbage, and any number of crashes can result based on what exactly that garbage
      is.
      
      Given that it's not feasible for performance reasons to place exclusive locks
      around each cpu's poll list to provide that mutual exclusion, I think the best
      solution is to modify the netpoll path in such a way that we continue to
      guarantee that the poll_list for a cpu is in fact exclusive to that cpu.  To do
      this I've implemented the patch below.  It adds an additional bit to the state
      field in the napi_struct.  When executing napi->poll from the netpoll path, this
      bit will be set.  When a driver calls netif_rx_complete, if that bit is set, it
      will not remove the napi_struct from the poll_list.  That work will be saved
      for the next iteration of net_rx_action.
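      
      In sketch form the mechanism reads like this (bit and helper names
      assumed; consult the actual patch for the exact symbols):
      
          /* netpoll side: mark that ->poll runs outside net_rx_action */
          set_bit(NAPI_STATE_NPSVC, &napi->state);
          napi->poll(napi, budget);
          clear_bit(NAPI_STATE_NPSVC, &napi->state);
          
          /* netif_rx_complete side: skip the poll_list removal when set */
          if (!test_bit(NAPI_STATE_NPSVC, &napi->state))
                  __napi_complete(napi);   /* removes napi from poll_list */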
      
      I've tested this and it seems to work well.  About the biggest drawback I can
      see to it is the fact that it might result in an extra loop through
      net_rx_action in the event that the device is actually contended for (i.e. the
      netpoll path actually performs all the needed work on the device, and the call
      to net_rx_action winds up doing nothing, except removing the napi_struct from
      the poll_list).  However I think this is probably a small price to pay, given
      that the alternative is a crash.
      Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 09 Dec 2008, 2 commits
  8. 08 Dec 2008, 4 commits
    • dccp ccid-2: Phase out the use of boolean Ack Vector sysctl · 6fdd34d4
      Gerrit Renker authored
      This removes the use of the sysctl and the minisock variable for the Send Ack
      Vector feature, as it now is handled fully dynamically via feature negotiation
      (i.e. when CCID-2 is enabled, Ack Vectors are automatically enabled as per
       RFC 4341, 4.).
      
      Using a sysctl in parallel to this implementation would open the door to
      crashes, since much of the code relies on tests of the boolean minisock /
      sysctl variable. Thus, this patch replaces all tests of type
      
      	if (dccp_msk(sk)->dccpms_send_ack_vector)
      		/* ... */
      with
      	if (dp->dccps_hc_rx_ackvec != NULL)
      		/* ... */
      
      The dccps_hc_rx_ackvec is allocated by dccp_hdlr_ackvec() when feature
      negotiation concludes that Ack Vectors are to be used on the half-connection.
      Otherwise, it is NULL (due to dccp_init_sock/dccp_create_openreq_child),
      so that the test is a valid one.
      
      The activation handler for Ack Vectors is called as soon as the feature
      negotiation has concluded at the
       * server when the Ack marking the transition RESPOND => OPEN arrives;
       * client after it has sent its ACK, marking the transition REQUEST => PARTOPEN.
      
      Adding the sequence number of the Response packet to the Ack Vector has been
      removed, since
       (a) connection establishment implies that the Response has been received;
       (b) the CCIDs only look at packets received in the (PART)OPEN state, i.e.
           this entry will always be ignored;
       (c) it can not be used for anything useful - to detect loss for instance, only
           packets received after the loss can serve as pseudo-dupacks.
      
      There was a FIXME to change the error code when dccp_ackvec_add() fails.
      I removed this after finding out that:
       * the check whether ackno < ISN is already made earlier,
       * this Response is likely the 1st packet with an Ackno that the client gets,
       * so when dccp_ackvec_add() fails, the reason is likely not a packet error.
      Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
      Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • dccp: Remove manual influence on NDP Count feature · 4098dce5
      Gerrit Renker authored
      Updating the NDP count feature is handled automatically now:
       * for CCID-2 it is disabled, since the code does not use NDP counts;
       * for CCID-3 it is enabled, as NDP counts are used to determine loss lengths.
      
      Allowing the user to change NDP values leads to unpredictable and failing
      behaviour, since it is then possible to disable NDP counts even when they
      are needed (e.g. in CCID-3).
      
      This means that only those user settings are sensible that agree with the
      values for Send NDP Count implied by the choice of CCID. But those settings
      are already activated by the feature negotiation (CCID dependency tracking),
      hence this form of support is redundant.
      
      At startup the initialisation of the NDP count feature uses the default
      value of 0, which is done implicitly by the zeroing-out of the socket when
      it is allocated. If the choice of CCID or feature negotiation enables NDP
      count, this will then be updated via the NDP activation handler.
      Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
      Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • dccp: Remove obsolete parts of the old CCID interface · 0049bab5
      Gerrit Renker authored
      The TX/RX CCIDs of the minisock are now redundant: similar to the Ack Vector
      case, their value equals initially that of the sysctl, but at the end of
      feature negotiation may be something different.
      
      The old interface removed by this patch thus has been replaced by the newer
      interface to dynamically query the currently loaded CCIDs.
      
      Also removed are the constructors for the TX CCID and the RX CCID, since the
      switch "rx <-> non-rx" is done by the handler in minisocks.c (and the handler
      is the only place in the code where CCIDs are loaded).
      Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
      Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netdevice: Kill netdev->priv · b74ca3a8
      Wang Chen authored
      This is the last shot of this series.
      After removing all direct references to netdev->priv, this kills
      "priv" in "struct net_device" and fixes the related comments/docs.
      
      No one is allowed to reference netdev->priv directly any more.
      If you want to reference the memory of private data, use netdev_priv()
      instead.
      If the private data is not allocated by alloc_netdev(), use
      netdev->ml_priv to point to that memory after you create that private
      data.
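      
      The sanctioned pattern, as a sketch (mydrv_* names invented):
      
          #include <linux/etherdevice.h>
          #include <linux/netdevice.h>
          
          struct mydrv_priv {
                  int irq;
                  /* ... driver state ... */
          };
          
          static struct net_device *mydrv_alloc(void)
          {
                  /* private area allocated together with the net_device ... */
                  struct net_device *dev = alloc_etherdev(sizeof(struct mydrv_priv));
          
                  if (dev) {
                          /* ... and reached via netdev_priv(), never dev->priv */
                          struct mydrv_priv *priv = netdev_priv(dev);
          
                          priv->irq = -1;
                  }
                  return dev;
          }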
      Signed-off-by: Wang Chen <wangchen@cn.fujitsu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. 06 Dec 2008, 1 commit
  10. 05 Dec 2008, 2 commits
  11. 04 Dec 2008, 4 commits
    • [PATCH 2/2] document FMODE_ constants · fc9161e5
      Christoph Hellwig authored
      Make sure all FMODE_ constants are documented, and ensure a coherent
      style for the already existing comments.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • [PATCH 1/2] kill FMODE_NDELAY_NOW · fd4ce1ac
      Christoph Hellwig authored
      Update FMODE_NDELAY before each ioctl call so that we can kill the
      magic FMODE_NDELAY_NOW.  It would be even better to do this directly
      in setfl(), but for that we'd need to have FMODE_NDELAY for all files,
      not just block special files.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • atm: 32-bit ioctl compatibility · 8865c418
      David Woodhouse authored
      We lack compat ioctl support through most of the ATM code. This patch
      deals with most of it, and I can now at least use BR2684 and PPPoATM
      with 32-bit userspace.
      
      I haven't added a .compat_ioctl method to struct atm_ioctl, because
      AFAICT none of the current users need any conversion -- so we can just
      call the ->ioctl() method in every case. I looked at br2684, clip, lec,
      mpc, pppoatm and atmtcp.
      
      In svc_compat_ioctl() the only mangling which is needed is to change
      COMPAT_ATM_ADDPARTY to ATM_ADDPARTY. Although it's defined as
      	_IOW('a', ATMIOC_SPECIAL+4,struct atm_iobuf)
      it doesn't actually _take_ a struct atm_iobuf as an argument -- it takes
      a struct sockaddr_atmsvc, which _is_ the same between 32-bit and 64-bit
      code, so doesn't need conversion.
      
      Almost all of vcc_ioctl() would have been identical, so I converted that
      into a core do_vcc_ioctl() function with an 'int compat' argument.
      
      I've done the same with atm_dev_ioctl(), where there _are_ a few
      differences, but still it's relatively contained and there would
      otherwise have been a lot of duplication.
      
      I haven't done any of the actual device-specific ioctls, although I've
      added a compat_ioctl method to struct atmdev_ops.
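      
      The shape of that refactoring, sketched with the function names
      mentioned above (bodies elided):
      
          static int do_vcc_ioctl(struct socket *sock, unsigned int cmd,
                                  unsigned long arg, int compat)
          {
                  /* shared ioctl body; 'compat' selects 32-bit handling */
                  return 0;
          }
          
          int vcc_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg)
          {
                  return do_vcc_ioctl(sock, cmd, arg, 0);
          }
          
          #ifdef CONFIG_COMPAT
          int vcc_compat_ioctl(struct socket *sock, unsigned int cmd,
                               unsigned long arg)
          {
                  return do_vcc_ioctl(sock, cmd, arg, 1);
          }
          #endif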
      Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • can: Fix CAN_(EFF|RTR)_FLAG handling in can_filter · d253eee2
      Oliver Hartkopp authored
      Due to a wrong safety check in af_can.c it was not possible to filter
      for SFF frames with a specific CAN identifier without getting the
      same selected CAN identifier from a received EFF frame also.
      
      This fix has a minimum (but user visible) impact on the CAN filter
      API and therefore the CAN version is set to a new date.
      
      Indeed the 'old' API is still working as-is. But when now setting
      CAN_(EFF|RTR)_FLAG in can_filter.can_mask you might get less traffic
      than before - but still the stuff that you expected to get for your
      defined filter ...
      
      Thanks to Kurt Van Dijck for pointing at this issue and for the review.
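      
      For reference, an SFF-only filter relying on the fixed semantics might
      look like this (illustrative userspace snippet):
      
          #include <linux/can.h>
          #include <linux/can/raw.h>
          #include <sys/socket.h>
          
          /* Match one 11-bit identifier; CAN_EFF_FLAG/CAN_RTR_FLAG in the
           * mask now reliably exclude EFF and RTR frames with the same id. */
          static int filter_single_sff_id(int sock, canid_t id)
          {
                  struct can_filter f = {
                          .can_id   = id,
                          .can_mask = CAN_SFF_MASK | CAN_EFF_FLAG | CAN_RTR_FLAG,
                  };
          
                  return setsockopt(sock, SOL_CAN_RAW, CAN_RAW_FILTER,
                                    &f, sizeof(f));
          }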
      Signed-off-by: Oliver Hartkopp <oliver@hartkopp.net>
      Acked-by: Kurt Van Dijck <kurt.van.dijck@eia.be>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  12. 03 Dec 2008, 4 commits
    • block: fix setting of max_segment_size and seg_boundary mask · 0e435ac2
      Milan Broz authored
      Fix setting of max_segment_size and seg_boundary mask for stacked md/dm
      devices.
      
      When stacking devices (LVM over MD over SCSI) some of the request queue
      parameters are not set up correctly in some cases by default, namely
      max_segment_size and seg_boundary mask.
      
      If you create an MD device over SCSI, these attributes are zeroed.
      
      The problem arises when another device-mapper mapping sits on top of
      this one - queue attributes are set in DM this way:
      
      request_queue   max_segment_size  seg_boundary_mask
      SCSI                65536             0xffffffff
      MD RAID1                0                      0
      LVM                 65536                 -1 (64bit)
      
      Unfortunately bio_add_page (resp. bio_phys_segments) calculates the number
      of physical segments according to these parameters.
      
      During generic_make_request() the segment count is recalculated and can
      increase bio->bi_phys_segments over the allowed limit (after bio_clone()
      in a stacking operation).
      
      This is especially a problem in the CCISS driver, where it produces an
      OOPS here
      
          BUG_ON(creq->nr_phys_segments > MAXSGENTRIES);
      
      (MAXSGENTRIES is 31 by default.)
      
      Sometimes even this command is enough to cause the oops:
      
        dd iflag=direct if=/dev/<vg>/<lv> of=/dev/null bs=128000 count=10
      
      This command generates bios with 250 sectors, allocated in 32 4k-pages
      (the last page uses only 1024 bytes).
      
      For the LVM layer, it allocates a bio with 31 segments (still OK for CCISS),
      but unfortunately on the lower layer it is recalculated to 32 segments, which
      violates the CCISS restriction and triggers the BUG_ON().
      
      The patch tries to fix it by:
      
       * initializing attributes above in queue request constructor
         blk_queue_make_request()
      
       * make sure that blk_queue_stack_limits() inherits the setting
      
       (DM uses its own function to set the limits because
       blk_queue_stack_limits() was introduced later.  It should probably
       switch to use the generic stack limit function too.)
      
       * sets the default seg_boundary value in one place (blkdev.h)
      
       * use this mask as default in DM (instead of -1, which differs in 64bit)
      
      Bugs related to this:
      https://bugzilla.redhat.com/show_bug.cgi?id=471639
      http://bugzilla.kernel.org/show_bug.cgi?id=8672
      Signed-off-by: Milan Broz <mbroz@redhat.com>
      Reviewed-by: Alasdair G Kergon <agk@redhat.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
      Cc: Tejun Heo <htejun@gmail.com>
      Cc: Mike Miller <mike.miller@hp.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • block: internal dequeue shouldn't start timer · 53a08807
      Tejun Heo authored
      blkdev_dequeue_request() and elv_dequeue_request() are equivalent and
      both start the timeout timer.  The barrier code dequeues the original
      barrier request but doesn't pass the request itself to the lower level
      driver, only broken-down proxy requests; however, the original barrier
      request goes through the same dequeue path and the timeout timer is
      started on it.  If the barrier sequence takes long enough, this timer
      expires but the low level driver has no idea about this request and an
      oops follows.
      
      The timeout timer shouldn't have been started on the original barrier
      request as it never goes through actual IO.  This patch unexports
      elv_dequeue_request(), which has no external user anyway, makes it
      operate on the elevator proper w/o adding the timer, and makes
      blkdev_dequeue_request() call elv_dequeue_request() and add the timer.
      Internal users which don't pass the request to the driver - the barrier
      code and end_that_request_last() - are converted to use
      elv_dequeue_request().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Mike Anderson <andmike@linux.vnet.ibm.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • nfsd: fix vm overcommit crash fix #2 · 1b79cd04
      Junjiro R. Okajima authored
      The previous patch from Alan Cox ("nfsd: fix vm overcommit crash",
      commit 731572d3) fixed the problem where
      knfsd crashes on exported shmemfs objects and strict overcommit is set.
      
      But the patch forgot to support the case when CONFIG_SECURITY is
      disabled.
      
      This patch copies a part of his fix which is mainly for detecting a bug
      earlier.
      Acked-by: James Morris <jmorris@namei.org>
      Signed-off-by: Alan Cox <alan@redhat.com>
      Signed-off-by: Junjiro R. Okajima <hooanon05@yahoo.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • amd74xx: workaround unreliable AltStatus register for nVidia controllers · 6636487e
      Bartlomiej Zolnierkiewicz authored
      It seems that on some nVidia controllers using the AltStatus register
      can be unreliable, so default to the Status register if the PCI device
      is in Compatibility Mode.  In order to achieve this:
      
      * Add an ide_pci_is_in_compatibility_mode() inline helper to <linux/ide.h>
        (sketched after this list).
      
      * Add IDE_HFLAG_BROKEN_ALTSTATUS host flag and set it in amd74xx host
        driver for nVidia controllers in Compatibility Mode.
      
      * Teach actual_try_to_identify() and drive_is_ready() about the new flag.
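      
      A sketch of that helper (the PCI class test is the assumed
      implementation detail here):
      
          /* Programming-interface bits 0/2 select native mode per channel;
           * if either channel is not native, treat the device as being in
           * compatibility mode. */
          static inline int ide_pci_is_in_compatibility_mode(struct pci_dev *dev)
          {
                  if ((dev->class >> 8) == PCI_CLASS_STORAGE_IDE &&
                      (dev->class & 5) != 5)
                          return 1;
                  return 0;
          }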
      
      This fixes the regression caused by removal of CONFIG_IDEPCI_SHARE_IRQ
      config option in 2.6.25 and using AltStatus register unconditionally when
      available (kernel.org bugs #11659 and #10216).
      
      [ Moreover for CONFIG_IDEPCI_SHARE_IRQ=y (which is what most people
        and distributions use) it never worked correctly. ]
      
      Thanks to Remy LABENE and Lars Winterfeld for help with debugging the problem.
      
      More info at:
      http://bugzilla.kernel.org/show_bug.cgi?id=11659
      http://bugzilla.kernel.org/show_bug.cgi?id=10216
      Reported-by: Remy LABENE <remy.labene@free.fr>
      Tested-by: Remy LABENE <remy.labene@free.fr>
      Tested-by: Lars Winterfeld <lars.winterfeld@tu-ilmenau.de>
      Acked-by: Borislav Petkov <petkovbb@gmail.com>
      Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
  13. 02 Dec 2008, 2 commits
    • lib/idr.c: fix rcu related race with idr_find · 6ff2d39b
      Manfred Spraul authored
      2nd part of the fixes needed for
      http://bugzilla.kernel.org/show_bug.cgi?id=11796.
      
      When the idr tree is either grown or shrunk, the updates to the number
      of layers and the top pointer were not atomic.  This race caused crashes.
      
      The attached patch fixes that by replicating the layers counter in each
      layer, thus idr_find doesn't need idp->layers anymore.
      Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
      Cc: Clement Calmels <cboulte@gmail.com>
      Cc: Nadia Derbey <Nadia.Derbey@bull.net>
      Cc: Pierre Peiffer <peifferp@gmail.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • epoll: introduce resource usage limits · 7ef9964e
      Davide Libenzi authored
      It has been thought that the per-user file descriptor limit would also
      limit the resources that a normal user can request via the epoll
      interface.  Vegard Nossum reported a very simple program (a modified
      version attached) that can make a normal user request a pretty large
      amount of kernel memory, well within its maximum number of fds.  To
      solve this problem, default limits are now imposed, and /proc based
      configuration has been introduced.  A new directory has been created,
      named /proc/sys/fs/epoll/, and inside it there are two configuration
      points:
      
        max_user_instances = Maximum number of devices - per user
      
        max_user_watches   = Maximum number of "watched" fds - per user
      
      The current default for "max_user_watches" limits the memory used by epoll
      to store "watches", to 1/32 of the amount of the low RAM.  As example, a
      256MB 32bit machine, will have "max_user_watches" set to roughly 90000.
      That should be enough to not break existing heavy epoll users.  The
      default value for "max_user_instances" is set to 128, that should be
      enough too.
      
      This also changes what userspace sees, because a new error code can now
      come out of EPOLL_CTL_ADD (-ENOSPC).  The EMFILE from epoll_create() was
      already listed, so that should be ok.
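      
      Callers therefore need to be prepared for the new failure mode; a
      minimal sketch:
      
          #include <errno.h>
          #include <stdio.h>
          #include <sys/epoll.h>
          
          /* EPOLL_CTL_ADD can now fail with ENOSPC once the per-user
           * /proc/sys/fs/epoll/max_user_watches limit is reached. */
          static int watch_fd(int epfd, int fd)
          {
                  struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
          
                  if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0) {
                          if (errno == ENOSPC)
                                  fprintf(stderr, "epoll watch limit reached\n");
                          return -1;
                  }
                  return 0;
          }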
      
      [akpm@linux-foundation.org: use get_current_user()]
      Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: <stable@kernel.org>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Reported-by: Vegard Nossum <vegardno@ifi.uio.no>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>