1. 10 Jan, 2011 7 commits
  2. 07 Jan, 2011 16 commits
    • dccp: make upper bound for seq_window consistent on 32/64 bit · bfbb2346
      Gerrit Renker committed
      The 'seq_window' sysctl sets the initial value for the DCCP Sequence Window,
      which may range from 32..2^46-1 (RFC 4340, 7.5.2). The patch sets the upper
      bound consistently to 2^32-1 on both 32 and 64 bit systems, which should be
      sufficient - with an RTT of 1 sec and 1-byte packets, a seq_window of 2^32-1
      corresponds to a link speed of 34 Gbps.
      Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
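The 34 Gbps figure quoted in this commit message can be checked with a few lines of arithmetic. The helper below is a minimal sketch, not kernel code: the function name and parameters are made up for illustration, and it assumes one full window of packets is sent per round trip.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not part of the kernel): throughput in bits per
 * second when one full window of `window` packets, each `pkt_bytes` bytes
 * long, is sent per round trip of `rtt_sec` seconds. */
static double window_throughput_bps(uint64_t window, uint32_t pkt_bytes,
                                    double rtt_sec)
{
    assert(rtt_sec > 0.0);
    return (double)window * pkt_bytes * 8.0 / rtt_sec;
}
```

With window = 2^32-1, 1-byte packets and an RTT of 1 second this gives about 34.36e9 bits per second, which matches the figure in the message.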
    • dccp: fix bug in updating the GSR · 763dadd4
      Samuel Jero committed
      Currently dccp_check_seqno allows any valid packet to update the Greatest
      Sequence Number Received, even if that packet's sequence number is less than
      the current GSR. This patch adds a check to make sure that the new packet's
      sequence number is greater than GSR.
      Signed-off-by: Samuel Jero <sj323707@ohio.edu>
      Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
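The check described in this commit can be sketched in userspace C. This is an illustration only: `struct rx_state`, `seq48_after` and `update_gsr` are hypothetical names, not the kernel's, and the comparison assumes the usual two's-complement conversion when casting to `int64_t`. DCCP sequence numbers are 48 bits wide and wrap around, so "greater than" has to be evaluated in circular sequence space.

```c
#include <stdint.h>

/* DCCP sequence numbers are 48 bits wide and wrap around (RFC 4340). */
#define SEQ48_MASK ((UINT64_C(1) << 48) - 1)

/* Returns nonzero if a comes after b in circular 48-bit sequence space.
 * Both values are shifted to the top of a 64-bit word so that the signed
 * difference wraps at 2^48. */
static int seq48_after(uint64_t a, uint64_t b)
{
    return (int64_t)((b << 16) - (a << 16)) < 0;
}

/* Hypothetical stand-in for the receiver state.
 * gsr = Greatest Sequence Number Received. */
struct rx_state {
    uint64_t gsr;
};

/* Advance GSR only when seqno is newer than the current value. */
static void update_gsr(struct rx_state *rx, uint64_t seqno)
{
    seqno &= SEQ48_MASK;
    if (seq48_after(seqno, rx->gsr))
        rx->gsr = seqno;
}
```

A packet with an older sequence number leaves `gsr` untouched, while a packet just past the wrap point (for example 1 after 2^48-1) still advances it.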
    • dccp: fix return value for sequence-invalid packets · 2cf5be93
      Samuel Jero committed
      Currently dccp_check_seqno returns 0 (indicating a valid packet) if the
      acknowledgment number is out of bounds and the sync that RFC 4340 mandates at
      this point is currently being rate-limited. This function should return -1,
      indicating an invalid packet.
      Signed-off-by: Samuel Jero <sj323707@ohio.edu>
      Acked-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
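The return-value rule in this commit message can be sketched as follows. This is a simplified model rather than the kernel function: `struct sync_state` and `check_ackno` are hypothetical, sending a Sync is replaced by a counter, and the bounds check ignores sequence-number wraparound for brevity.

```c
#include <stdint.h>

/* Hypothetical state for rate-limiting Sync packets. */
struct sync_state {
    uint64_t last_sync;     /* time the last Sync was sent */
    uint64_t min_interval;  /* minimum spacing between Syncs */
    unsigned syncs_sent;    /* counter standing in for "send a Sync" */
};

/* Returns 0 if ackno lies within [lo, hi], -1 otherwise. */
static int check_ackno(struct sync_state *st, uint64_t ackno,
                       uint64_t lo, uint64_t hi, uint64_t now)
{
    if (ackno >= lo && ackno <= hi)
        return 0;

    /* Out of bounds: send a Sync unless one went out too recently. */
    if (now - st->last_sync >= st->min_interval) {
        st->last_sync = now;
        st->syncs_sent++;
    }

    /* The packet is invalid whether or not a Sync was actually sent. */
    return -1;
}
```

The point of the sketch is that rate limiting only suppresses the Sync; it never turns an invalid packet into a valid one.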
    • fs: scale mntget/mntput · b3e19d92
      Nick Piggin committed
      The problem that this patch aims to fix is vfsmount refcounting scalability.
      We need to take a reference on the vfsmount for every successful path lookup,
      which often go to the same mount point.
      
      The fundamental difficulty is that a "simple" reference count can never be made
      scalable, because any time a reference is dropped, we must check whether that
      was the last reference. To do that requires communication with all other CPUs
      that may have taken a reference count.
      
      We can make refcounts more scalable in a couple of ways, involving keeping
      distributed counters, and checking for the global-zero condition less
      frequently.
      
      - check the global sum once every interval (this will delay zero detection
        for some interval, so it's probably a showstopper for vfsmounts).
      
      - keep a local count and only taking the global sum when local reaches 0 (this
        is difficult for vfsmounts, because we can't hold preempt off for the life of
        a reference, so a counter would need to be per-thread or tied strongly to a
        particular CPU which requires more locking).
      
      - keep a local difference of increments and decrements, which allows us to sum
        the total difference and hence find the refcount when summing all CPUs. Then,
        keep a single integer "long" refcount for slow and long lasting references,
        and only take the global sum of local counters when the long refcount is 0.
      
      This last scheme is what I implemented here. Attached mounts and process root
      and working directory references are "long" references, and everything else is
      a short reference.
      
      This allows scalable vfsmount references during path walking over mounted
      subtrees and unattached (lazy umounted) mounts with processes still running
      in them.
      
      This results in one fewer atomic op in the fastpath: mntget is now just a
      per-CPU inc, rather than an atomic inc; and mntput just requires a spinlock
      and non-atomic decrement in the common case. However code is otherwise bigger
      and heavier, so single threaded performance is basically a wash.
      Signed-off-by: Nick Piggin <npiggin@kernel.dk>
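The third scheme in this commit message (per-CPU differences plus a single "long" refcount) can be modelled in a few lines. The sketch below is single-threaded and purely illustrative: `struct mount_ref`, `ref_get`, `ref_put` and the fixed `NR_CPUS_SKETCH` are assumptions, and the locking the commit message mentions for the put path is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS_SKETCH 4  /* assumed fixed CPU count for this sketch */

struct mount_ref {
    int percpu_delta[NR_CPUS_SKETCH]; /* increments minus decrements, per CPU */
    int longrefs;                     /* slow, long-lived references */
};

/* Fast path: touch only the local CPU's counter. */
static void ref_get(struct mount_ref *m, int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS_SKETCH);
    m->percpu_delta[cpu]++;
}

/* Returns true if this put dropped the last reference. */
static bool ref_put(struct mount_ref *m, int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS_SKETCH);
    m->percpu_delta[cpu]--;

    /* Common case: a long reference exists, so no global sum is needed. */
    if (m->longrefs > 0)
        return false;

    /* Rare case: sum the per-CPU differences to find the true count. */
    int sum = 0;
    for (int i = 0; i < NR_CPUS_SKETCH; i++)
        sum += m->percpu_delta[i];
    return sum == 0;
}
```

A reference may be taken on one CPU and dropped on another, so an individual per-CPU value can go negative; only the sum over all CPUs is meaningful.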
    • fs: improve scalability of pseudo filesystems · 4b936885
      Nick Piggin committed
      Regardless of how much we possibly try to scale dcache, there is likely
      always going to be some fundamental contention when adding or removing children
      under the same parent. Pseudo filesystems do not seem to need connected
      dentries because by definition they are disconnected.
      Signed-off-by: Nick Piggin <npiggin@kernel.dk>
    • fs: dcache reduce branches in lookup path · fb045adb
      Nick Piggin committed
      Reduce some branches and memory accesses in dcache lookup by adding dentry
      flags to indicate common d_ops are set, rather than having to check them.
      This saves a pointer memory access (dentry->d_op) in common path lookup
      situations, and saves another pointer load and branch in cases where we
      have d_op but not the particular operation.
      
      Patched with:
      
      git grep -E '[.>]([[:space:]])*d_op([[:space:]])*=' | xargs sed -e 's/\([^\t ]*\)->d_op = \(.*\);/d_set_d_op(\1, \2);/' -e 's/\([^\t ]*\)\.d_op = \(.*\);/d_set_d_op(\&\1, \2);/' -i
      Signed-off-by: Nick Piggin <npiggin@kernel.dk>
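The idea of caching "which d_ops are set" in dentry flags can be sketched like this. The struct layouts, flag names and values, and function names below are simplified stand-ins chosen for illustration; only the general shape (a setter that records flags, and a lookup path that tests a flag instead of chasing pointers) follows the commit message.

```c
#include <stddef.h>

/* Hypothetical flag bits recording which operations are provided. */
#define FLAG_OP_HASH    0x1u
#define FLAG_OP_COMPARE 0x2u

struct dentry_ops {
    int (*d_hash)(const char *name);
    int (*d_compare)(const char *a, const char *b);
};

struct dentry_sketch {
    unsigned flags;
    const struct dentry_ops *d_op;
};

/* Setter in the spirit of d_set_d_op(): store the ops pointer and cache
 * in the flags which callbacks are actually present. */
static void sketch_set_d_op(struct dentry_sketch *d,
                            const struct dentry_ops *op)
{
    d->d_op = op;
    d->flags &= ~(FLAG_OP_HASH | FLAG_OP_COMPARE);
    if (!op)
        return;
    if (op->d_hash)
        d->flags |= FLAG_OP_HASH;
    if (op->d_compare)
        d->flags |= FLAG_OP_COMPARE;
}

/* Lookup path: one flag test replaces loading d_op and testing the slot. */
static int needs_custom_hash(const struct dentry_sketch *d)
{
    return (d->flags & FLAG_OP_HASH) != 0;
}

/* Example callback used only to populate an ops table. */
static int sample_hash(const char *name)
{
    return name[0];
}
```

This also explains why the tree-wide sed conversion was needed: a direct assignment to `d_op` would bypass the setter and leave the cached flags stale.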
    • fs: avoid inode RCU freeing for pseudo fs · ff0c7d15
      Nick Piggin committed
      Pseudo filesystems that do not put inodes on an RCU list, and whose inodes are
      not reachable by rcu-walk dentries, do not need to RCU free their inodes.
      Signed-off-by: Nick Piggin <npiggin@kernel.dk>
    • fs: icache RCU free inodes · fa0d7e3d
      Nick Piggin committed
      RCU free the struct inode. This will allow:
      
      - Subsequent store-free path walking patch. The inode must be consulted for
        permissions when walking, so an RCU inode reference is a must.
      - sb_inode_list_lock to be moved inside i_lock because sb list walkers who want
        to take i_lock no longer need to take sb_inode_list_lock to walk the list in
        the first place. This will simplify and optimize locking.
      - Could remove some nested trylock loops in dcache code
      - Could potentially simplify things a bit in VM land. Do not need to take the
        page lock to follow page->mapping.
      
      The downside of this is the performance cost of using RCU. In a simple
      creat/unlink microbenchmark, performance drops by about 10% due to inability to
      reuse cache-hot slab objects. As iterations increase and RCU freeing starts
      kicking over, this increases to about 20%.
      
      In cases where inode lifetimes are longer (i.e. many inodes may be allocated
      during the average life span of a single inode), a lot of this cache reuse is
      not applicable, so the regression caused by this patch is smaller.
      
      The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU,
      however this adds some complexity to list walking and store-free path walking,
      so I prefer to implement this at a later date, if it is shown to be a win in
      real situations. I haven't found a regression in any non-micro benchmark so I
      doubt it will be a problem.
      Signed-off-by: Nick Piggin <npiggin@kernel.dk>
    • fs: change d_delete semantics · fe15ce44
      Nick Piggin committed
      Change d_delete from a dentry deletion notification to dentry caching
      advice, more like ->drop_inode. Require it to be constant and idempotent,
      and not take d_lock. This is how all existing filesystems use the callback
      anyway.
      
      This makes fine grained dentry locking of dput and dentry lru scanning
      much simpler.
      Signed-off-by: Nick Piggin <npiggin@kernel.dk>
    • net: bridge: check the length of skb after nf_bridge_maybe_copy_header() · f88de8de
      Changli Gao committed
      Since nf_bridge_maybe_copy_header() may change the length of skb,
      we should check the length of skb after it to handle the PPPoE skbs.
      Signed-off-by: Changli Gao <xiaosuo@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netfilter: fix export secctx error handling · cba85b53
      Pablo Neira Ayuso committed
      In 1ae4de0c, the secctx was exported
      via the /proc/net/netfilter/nf_conntrack and ctnetlink interfaces
      instead of the secmark.
      
      That patch introduced the use of security_secid_to_secctx() which may
      return a non-zero value on error.
      
      In one of my setups, I have NF_CONNTRACK_SECMARK enabled but no
      security modules. Thus, security_secid_to_secctx() returns a negative
      value that results in the breakage of the /proc and `conntrack -L'
      outputs. To fix this, we skip the inclusion of secctx if the
      aforementioned function fails.
      
      This patch also fixes the dynamic netlink message size calculation
      if security_secid_to_secctx() returns an error, since its logic is
      also wrong.
      
      This problem exists in Linux kernel >= 2.6.37.
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netfilter: fix the race when initializing nf_ct_expect_hash_rnd · f682cefa
      Changli Gao committed
      Since nf_ct_expect_dst_hash() may be called without nf_conntrack_lock
      locked, nf_ct_expect_hash_rnd should be initialized atomically.
      
      In this patch, we use nf_conntrack_hash_rnd instead of
      nf_ct_expect_hash_rnd.
      Signed-off-by: Changli Gao <xiaosuo@gmail.com>
      Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
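For readers unfamiliar with lock-free one-time initialization, here is a generic userspace sketch of what initializing a hash seed atomically can look like, using C11 atomics. It is not the code from this commit (which reuses nf_conntrack_hash_rnd instead); the variable and function names are hypothetical, and the candidate seed is passed in as a parameter rather than drawn from a random source.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Shared seed; 0 means "not yet initialized". */
static _Atomic uint32_t hash_rnd_sketch;

/* Returns the seed, installing `candidate` only if no seed is set yet.
 * Safe to call from several threads without holding a lock: exactly one
 * caller's candidate wins and every caller sees the same result. */
static uint32_t get_hash_rnd(uint32_t candidate)
{
    uint32_t cur = atomic_load(&hash_rnd_sketch);
    if (cur != 0)
        return cur;

    if (candidate == 0)
        candidate = 1;  /* 0 is reserved for "unset" */

    uint32_t expected = 0;
    if (atomic_compare_exchange_strong(&hash_rnd_sketch, &expected, candidate))
        return candidate;

    /* Another thread won the race; `expected` now holds its value. */
    return expected;
}
```

Once a seed is installed, later calls return it unchanged regardless of the candidate they pass.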
    • ipv4: IP defragmentation must be ECN aware · 6623e3b2
      Eric Dumazet committed
      RFC 3168 (The Addition of Explicit Congestion Notification to IP)
      states:
      
      5.3.  Fragmentation
      
         ECN-capable packets MAY have the DF (Don't Fragment) bit set.
         Reassembly of a fragmented packet MUST NOT lose indications of
         congestion.  In other words, if any fragment of an IP packet to be
         reassembled has the CE codepoint set, then one of two actions MUST be
         taken:
      
            * Set the CE codepoint on the reassembled packet.  However, this
              MUST NOT occur if any of the other fragments contributing to
              this reassembly carries the Not-ECT codepoint.
      
            * The packet is dropped, instead of being reassembled, for any
              other reason.
      
      This patch implements this requirement for IPv4, choosing the first
      action:
      
      If any fragment had the Not-ECT codepoint
              reassembled frame has Not-ECT
      Else if any fragment had the CE codepoint
              reassembled frame has CE
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
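The reassembly rule in this commit message can be written as a small function. This is a userspace sketch under stated assumptions, not the kernel implementation: the function name is made up, the input is an array of per-fragment TOS bytes, and when neither condition applies the first fragment's TOS byte is returned unchanged. The codepoint values (Not-ECT 00, ECT(1) 01, ECT(0) 10, CE 11) are those of RFC 3168.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ECN codepoints in the two low-order bits of the TOS byte (RFC 3168). */
enum {
    ECN_NOT_ECT = 0,
    ECN_ECT_1   = 1,
    ECN_ECT_0   = 2,
    ECN_CE      = 3,
    ECN_MASK    = 3
};

/* Hypothetical helper: compute the TOS byte of the reassembled packet
 * from the TOS bytes of its n fragments (n must be at least 1). */
static uint8_t reassembled_tos(const uint8_t *frag_tos, size_t n)
{
    int saw_not_ect = 0, saw_ce = 0;

    assert(n >= 1);
    for (size_t i = 0; i < n; i++) {
        uint8_t ecn = frag_tos[i] & ECN_MASK;
        if (ecn == ECN_NOT_ECT)
            saw_not_ect = 1;
        else if (ecn == ECN_CE)
            saw_ce = 1;
    }

    uint8_t tos = frag_tos[0];        /* start from the first fragment */
    if (saw_not_ect)
        tos &= (uint8_t)~ECN_MASK;    /* any Not-ECT: result is Not-ECT */
    else if (saw_ce)
        tos |= ECN_CE;                /* otherwise any CE: result is CE */
    return tos;
}
```

Note that Not-ECT takes precedence over CE, as the RFC text forbids setting CE when any contributing fragment carries Not-ECT.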
    • dcb: use after free in dcb_flushapp() · 2a8fe003
      Dan Carpenter committed
      The original code has a use after free bug because it's not using the
      _safe() version of the list_for_each_entry() macro.
      Signed-off-by: Dan Carpenter <error27@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
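The class of bug fixed here is easy to reproduce outside the kernel. The sketch below uses a plain singly linked list rather than the kernel's list macros, and all names in it are hypothetical; it shows the same idea as the _safe() iterator, namely reading the next pointer before the current entry may be freed.

```c
#include <stdlib.h>

/* Hypothetical list entry, loosely modelled on a per-interface app table. */
struct app_entry {
    int ifindex;
    struct app_entry *next;
};

/* Prepend a new entry; returns the new head (or the old head if
 * allocation fails). */
static struct app_entry *push_app(struct app_entry *head, int ifindex)
{
    struct app_entry *e = malloc(sizeof(*e));
    if (!e)
        return head;
    e->ifindex = ifindex;
    e->next = head;
    return e;
}

/* Remove and free every entry matching ifindex. Returns the new head and
 * adds the number of freed entries to *freed. */
static struct app_entry *flush_apps(struct app_entry *head, int ifindex,
                                    int *freed)
{
    struct app_entry **link = &head;
    struct app_entry *cur, *next;

    for (cur = head; cur; cur = next) {
        next = cur->next;           /* saved before cur may be freed */
        if (cur->ifindex == ifindex) {
            *link = next;           /* unlink */
            free(cur);
            (*freed)++;
        } else {
            link = &cur->next;
        }
    }
    return head;
}
```

Writing the loop step as `cur = cur->next` instead would dereference an entry that was just freed, which is the use after free the commit describes.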
    • dcb: unlock on error in dcbnl_ieee_get() · 70bfa2d2
      Dan Carpenter committed
      There is a "goto nla_put_failure" hidden inside the NLA_PUT() macro, but
      we're holding the dcb_lock so we need to unlock first.
      Signed-off-by: Dan Carpenter <error27@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
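The pattern behind this fix, a macro that hides a goto while a lock is held, can be illustrated with a toy example. Everything here is hypothetical: `struct fake_lock` is a flag standing in for a real lock, and `PUT_OR_FAIL` only imitates the style of NLA_PUT() by jumping to a label when the buffer is too small.

```c
/* Toy lock: a flag that records whether the lock is currently held. */
struct fake_lock {
    int held;
};

static void fake_lock_acquire(struct fake_lock *l) { l->held = 1; }
static void fake_lock_release(struct fake_lock *l) { l->held = 0; }

/* Hypothetical macro with a hidden jump, in the style of NLA_PUT():
 * consume `need` bytes from the buffer or jump to put_failure. */
#define PUT_OR_FAIL(buf_left, need)              \
    do {                                         \
        if ((need) > *(buf_left))                \
            goto put_failure;                    \
        *(buf_left) -= (need);                   \
    } while (0)

/* Returns 0 on success, -1 if the buffer is too small. The lock is
 * released on both paths. */
static int fill_reply(struct fake_lock *l, int *buf_left, int need)
{
    fake_lock_acquire(l);
    PUT_OR_FAIL(buf_left, need);
    fake_lock_release(l);
    return 0;

put_failure:
    fake_lock_release(l);   /* the error path must unlock too */
    return -1;
}
```

Because the jump is invisible at the call site, it is easy to forget that the error label can be reached with the lock still held.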
    • net: add POLLPRI to sock_def_readable() · 2c6607c6
      Eric Dumazet committed
      Leonardo Chiquitto found that poll() could block forever on TCP sockets even
      after urgent data was received, if the event flag only contains POLLPRI.
      
      He did a bisection and found commit 4938d7e0 (poll: avoid extra
      wakeups in select/poll) was the source of the problem.
      
      The problem is that TCP sockets use the standard sock_def_readable() function
      as their sk_data_ready() handler, and sock_def_readable() doesn't signal
      POLLPRI.
      
      Only TCP is affected by the problem. Adding POLLPRI to the list of flags
      might trigger unnecessary schedules, but URGENT handling is such a
      seldom used feature this seems a good compromise.
      
      Thanks a lot to Leonardo for providing the bisection result and a test
      program as well.
      
      Reference : http://www.spinics.net/lists/netdev/msg151793.html
      Reported-and-bisected-by: Leonardo Chiquitto <leonardo.lists@gmail.com>
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Tested-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
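Why a waiter interested only in POLLPRI could sleep forever can be shown with a simplified model of filtered wakeups. This is an assumption-laden sketch, not kernel code: `should_wake` is a hypothetical helper, and the wake-up key is reduced to POLLIN with or without POLLPRI for brevity.

```c
#include <poll.h>

/* Hypothetical model of a filtered wakeup: a sleeping poll() caller is
 * woken only if the waker's event mask overlaps the events it asked for. */
static int should_wake(short waiter_events, short wake_key)
{
    return (waiter_events & wake_key) != 0;
}
```

In this model a waiter with `events = POLLPRI` is skipped when the waker signals only `POLLIN`, and is woken once `POLLPRI` is added to the waker's mask, at the cost of occasionally waking waiters that did not strictly need it.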
  3. 06 Jan, 2011 5 commits
  4. 05 Jan, 2011 10 commits
  5. 04 Jan, 2011 2 commits