1. 14 Jan 2009, 11 commits
  2. 10 Jan 2009, 9 commits
  3. 09 Jan 2009, 20 commits
    •
      md: make devices disappear when they are no longer needed. · d3374825
      Committed by NeilBrown
      Currently md devices, once created, never disappear until the module
      is unloaded.  This is essentially because the gendisk holds a
      reference to the mddev, and the mddev holds a reference to the
      gendisk, thus creating a circular reference.
      
      If we drop the reference from mddev to gendisk, then we need to ensure
      that the mddev is destroyed when the gendisk is destroyed.  However it
      is not possible to hook into the gendisk destruction process to enable
      this.
      
      So we drop the reference from the gendisk to the mddev and destroy the
      gendisk when the mddev gets destroyed.  However this has a
      complication.
      Between the call
         __blkdev_get->get_gendisk->kobj_lookup->md_probe
      and the call
         __blkdev_get->md_open
      
      there is no obvious way to hold a reference on the mddev any more, so
      unless something is done, it will disappear and gendisk will be
      destroyed prematurely.
      
      Also, once we decide to destroy the mddev, there will be an unlockable
      moment before the gendisk is unlinked (blk_unregister_region) during
      which a new reference to the gendisk can be created.  We need to
      ensure that this reference can not be used.  i.e. the ->open must
      fail.
      
      So:
       1/  in md_probe we set a flag in the mddev (hold_active) which
           indicates that the array should be treated as active, even
           though there are no references, and no appearance of activity.
           This is cleared by md_release when the device is closed if it
           is no longer needed.
           This ensures that the gendisk will survive between md_probe and
           md_open.
      
       2/  In md_open we check if the mddev we expect to open matches
           the gendisk that we did open.
           If there is a mismatch we return -ERESTARTSYS and modify
           __blkdev_get to retry from the top in that case.
           In the -ERESTARTSYS case we make sure to wait until
           the old gendisk (that we succeeded in opening) is really gone so
           we loop at most once.
      
      Some udev configurations will always open an md device when it first
      appears.   If we allow an md device that was just created by an open
      to disappear on an immediate close, then this can race with such udev
      configurations and result in an infinite loop of the device being
      opened and closed, then re-opened due to the 'ADD' event from the
      first open, then closed again, and so on.
      So we make sure an md device, once created by an open, remains active
      at least until some md 'ioctl' has been made on it.  This means that
      all normal usage of md devices will allow them to disappear promptly
      when not needed, but the worst that an incorrect usage will do is
      cause an inactive md device to be left in existence (it can easily be
      removed).
      
      As an array can be stopped by writing to a sysfs attribute
        echo clear > /sys/block/mdXXX/md/array_state
      we need to use scheduled work for deleting the gendisk and other
      kobjects.  This allows us to wait for any pending gendisk deletion to
      complete by simply calling flush_scheduled_work().
      Signed-off-by: NeilBrown <neilb@suse.de>
      d3374825
    •
      dlm: change rsbtbl rwlock to spinlock · c7be761a
      Committed by David Teigland
      The rwlock is almost always taken in write mode, so there's no
      reason not to use a spinlock instead.
      Signed-off-by: David Teigland <teigland@redhat.com>
      c7be761a
    •
      dlm: fix seq_file usage in debugfs lock dump · 892c4467
      Committed by David Teigland
      The old code would leak iterators and leave reference counts on
      rsbs because it was ignoring the "stop" seq callback.  The code
      followed an example that used the seq operations differently.
      This new code is based on actually understanding how the seq
      operations work.  It also improves things by saving the hash bucket
      in the position to avoid cycling through completed buckets in start.
      Signed-off-by: David Teigland <teigland@redhat.com>
      892c4467
    •
      fix similar typos to successfull · 73ac36ea
      Committed by Coly Li
      While reviewing ocfs2 code, I found 2 typos of "successfull".  After
      grepping for "successfull " in the kernel tree, 22 typos were found in
      total -- great minds always think alike :)
      
      This patch fixes all the similar typos.  Thanks to Randy for his ack
      and comments.
      Signed-off-by: Coly Li <coyli@suse.de>
      Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
      Acked-by: Roland Dreier <rolandd@cisco.com>
      Cc: Jeremy Kerr <jk@ozlabs.org>
      Cc: Jeff Garzik <jeff@garzik.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Vlad Yasevich <vladislav.yasevich@hp.com>
      Cc: Sridhar Samudrala <sri@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      73ac36ea
    •
      generic swap(): dcache: use swap() instead of private do_switch() · 9a8d5bb4
      Committed by Wu Fengguang
      Use the new generic implementation.
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9a8d5bb4
    •
      generic swap(): ext4: remove local swap() macro · 97e133b4
      Committed by Wu Fengguang
      Use the new generic implementation.
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      97e133b4
    •
      generic swap(): ext3: remove local swap() macro · be857df1
      Committed by Wu Fengguang
      Use the new generic implementation.
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      be857df1
    •
      remove lots of double-semicolons · c19a28e1
      Committed by Fernando Carrijo
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Theodore Ts'o <tytso@mit.edu>
      Acked-by: Mark Fasheh <mfasheh@suse.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Cc: James Morris <jmorris@namei.org>
      Acked-by: Casey Schaufler <casey@schaufler-ca.com>
      Acked-by: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c19a28e1
    •
      romfs: romfs_iget() - unsigned ino >= 0 is always true · f1565962
      Committed by roel kluin
      romfs_strnlen() returns an int, so an "unsigned X >= 0" check on its
      result is always true.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: roel kluin <roel.kluin@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f1565962
    •
      vmcore: remove saved_max_pfn check · 921d58c0
      Committed by Magnus Damm
      Remove the saved_max_pfn check from the /proc/vmcore function
      read_from_oldmem().  No need to verify, we should be able to just trust
      that "elfcorehdr=" is correctly passed to the crash kernel on the kernel
      command line like we do with other parameters.
      
      The read_from_oldmem() function in fs/proc/vmcore.c is quite similar to
      read_from_oldmem() in drivers/char/mem.c, but only in the latter does it
      make sense to use saved_max_pfn.  For oldmem it is used to determine
      when to stop reading.  For vmcore we already have the ELF header info
      pointing out the physical memory regions; there is no need to pass the
      end of old memory twice.
      
      Removing the saved_max_pfn check from vmcore makes it possible for
      architectures to skip oldmem but still support crash dump through vmcore -
      without the need for the old saved_max_pfn cruft.
      
      Architectures that want to play safe can do the saved_max_pfn check in
      copy_oldmem_page().  Not sure why anyone would want to do that, but that's
      even safer than today - the saved_max_pfn check in vmcore removed by this
      patch only checks the first page.
      Signed-off-by: Magnus Damm <damm@igel.co.jp>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Acked-by: Simon Horman <horms@verge.net.au>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      921d58c0
    •
      ELF: implement AT_RANDOM for glibc PRNG seeding · f06295b4
      Committed by Kees Cook
      While discussing[1] the need for glibc to have access to random bytes
      during program load, it seems that an earlier attempt to implement
      AT_RANDOM got stalled.  This implements a random 16 byte string, available
      to every ELF program via a new auxv AT_RANDOM vector.
      
      [1] http://sourceware.org/ml/libc-alpha/2008-10/msg00006.html
      
      Ulrich said:
      
      glibc needs right after startup a bit of random data for internal
      protections (stack canary etc).  What is now in upstream glibc is that we
      always unconditionally open /dev/urandom, read some data, and use it.  For
      every process startup.  That's slow.
      
      ...
      
      The solution is to provide a limited amount of random data to the
      starting process in the aux vector.  I suggested 16 bytes and this is
      what the patch implements.  If we need only 16 bytes or less we use the
      data directly.  If we need more we'll use the 16 bytes to seed a PRNG.
      This avoids the costly /dev/urandom use and it allows the kernel to use
      the most adequate source of random data for this purpose.  It might not
      be the same pool as that for /dev/urandom.
      
      Concerns were expressed about the depletion of the randomness pool.  But
      this patch doesn't make the situation worse, it doesn't deplete entropy
      more than happens now.
      Signed-off-by: Kees Cook <kees.cook@canonical.com>
      Cc: Jakub Jelinek <jakub@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Ulrich Drepper <drepper@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f06295b4
    •
      memcg: synchronized LRU · 08e552c6
      Committed by KAMEZAWA Hiroyuki
      A big patch for changing memcg's LRU semantics.
      
      Now,
        - each page_cgroup is linked to its mem_cgroup's own LRU (per zone).
      
        - LRU of page_cgroup is not synchronous with global LRU.
      
        - page and page_cgroup are one-to-one and statically allocated.
      
        - To find which LRU a page_cgroup is on, you have to check
          pc->mem_cgroup, as in
          - lru = page_cgroup_zoneinfo(pc, nid_of_pc, zid_of_pc);
      
        - SwapCache is handled.
      
      And when we handle the LRU list of a page_cgroup, we do the following.
      
      	pc = lookup_page_cgroup(page);
      	lock_page_cgroup(pc); .....................(1)
      	mz = page_cgroup_zoneinfo(pc);
      	spin_lock(&mz->lru_lock);
      	.....add to LRU
      	spin_unlock(&mz->lru_lock);
      	unlock_page_cgroup(pc);
      
      But (1) is a spin_lock, and we have to worry about deadlock with
      zone->lru_lock, so trylock() is used at (1) for now.  Without (1), we
      can't trust that "mz" is correct.
      
      This is an attempt to remove this dirty nesting of locks.
      This patch changes mz->lru_lock to be zone->lru_lock.
      Then, the above sequence can be written as
      
              spin_lock(&zone->lru_lock); # in vmscan.c or swap.c via global LRU
      	mem_cgroup_add/remove/etc_lru() {
      		pc = lookup_page_cgroup(page);
      		mz = page_cgroup_zoneinfo(pc);
      		if (PageCgroupUsed(pc)) {
      			....add to LRU
      		}
      	}
              spin_unlock(&zone->lru_lock); # in vmscan.c or swap.c via global LRU
      
      This is much simpler.
      (*) We're safe even if we don't take lock_page_cgroup(pc), because:
          1. pc->mem_cgroup can only be modified
             - at charge.
             - at account_move().
          2. At charge,
             the PCG_USED bit is not set before pc->mem_cgroup is fixed.
          3. At account_move(),
             the page is isolated and not on any LRU.
      
      Pros.
        - easier to maintain.
        - memcg can make use of the laziness of pagevec.
        - we don't have to duplicate the LRU/Active/Unevictable bits in page_cgroup.
        - the LRU status of memcg will be synchronized with the global LRU's.
        - the number of locks is reduced.
        - account_move() is simplified very much.
      Cons.
        - may increase cost of LRU rotation.
          (no impact if memcg is not configured.)
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      08e552c6
    •
      quota: don't set grace time when user isn't above softlimit · e04a88a9
      Committed by Jan Kara
      do_set_dqblk() allowed the SETDQBLK quotactl to set a user's grace time
      even if the user was not above his softlimit.  This does not make much
      sense and, by coincidence, causes the quota code to omit the softlimit
      warning when the user really exceeds the softlimit.  This patch makes
      do_set_dqblk() reset the user's grace time if he has not exceeded the
      softlimit.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e04a88a9
    •
      coda: fix fs/coda/sysctl.c build warnings when !CONFIG_SYSCTL · 87d1fda5
      Committed by Richard A. Holden III
      Fix
      fs/coda/sysctl.c:14: warning: 'fs_table_header' defined but not used
      fs/coda/sysctl.c:44: warning: 'fs_table' defined but not used
      
      These are only used when CONFIG_SYSCTL is defined.
      Signed-off-by: Richard A. Holden III <aciddeath@gmail.com>
      Cc: Jan Harkes <jaharkes@cs.cmu.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      87d1fda5
    •
      jbd: remove excess kernel-doc notation · 1579c3a1
      Committed by Randy Dunlap
      Remove excess kernel-doc from fs/jbd/transaction.c:
      
      Warning(linux-2.6.28-git5//fs/jbd/transaction.c:764): Excess function parameter 'credits' description in 'journal_get_write_access'
      Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1579c3a1
    •
      ext3: tighten restrictions on inode flags · 04143e2f
      Committed by Duane Griffin
      At the moment there are few restrictions on which flags may be set on
      which inodes.  Specifically DIRSYNC may only be set on directories and
      IMMUTABLE and APPEND may not be set on links.  Tighten that to disallow
      TOPDIR from being set on non-directories, and allow only NODUMP and
      NOATIME to be set on non-regular-file, non-directory inodes.
      
      Introduce a flag-masking function which masks flags based on mode, and
      use it during inode creation and when flags are set via the ioctl, to
      facilitate future consistency.
      Signed-off-by: Duane Griffin <duaneg@dghda.com>
      Acked-by: Andreas Dilger <adilger@sun.com>
      Cc: <linux-ext4@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      04143e2f
    •
      ext3: don't inherit inappropriate inode flags from parent · 2e8671cb
      Committed by Duane Griffin
      At present INDEX is the only flag that new ext3 inodes do NOT inherit from
      their parent.  In addition, prevent the flags DIRTY, ECOMPR, IMAGIC and
      TOPDIR from being inherited.  List inheritable flags explicitly to prevent
      future flags from accidentally being inherited.
      
      This fixes the TOPDIR flag inheritance bug reported at
      http://bugzilla.kernel.org/show_bug.cgi?id=9866.
      Signed-off-by: Duane Griffin <duaneg@dghda.com>
      Acked-by: Andreas Dilger <adilger@sun.com>
      Cc: <linux-ext4@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2e8671cb
    •
      ext3: allocate ->s_blockgroup_lock separately · 5df096d6
      Committed by Pekka Enberg
      As spotted by kmemtrace, struct ext3_sb_info is 17152 bytes on 64-bit
      which makes it a very bad fit for SLAB allocators.  The culprit of the
      wasted memory is ->s_blockgroup_lock which can be as big as 16 KB when
      NR_CPUS >= 32.
      
      To fix that, allocate ->s_blockgroup_lock, which fits nicely in an
      order-2 page in the worst case, separately.  This shrinks struct
      ext3_sb_info enough to fit in a 1 KB slab cache, so now we allocate
      16 KB + 1 KB instead of 32 KB, saving 15 KB of memory.
      Acked-by: Andreas Dilger <adilger@sun.com>
      Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: <linux-ext4@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5df096d6
    •
      jbd: improve fsync batching · f420d4dc
      Committed by Josef Bacik
      There is a flaw with the way jbd handles fsync batching.  If we fsync() a
      file and we were not the last person to run fsync() on this fs then we
      automatically sleep for 1 jiffie in order to wait for new writers to join
      into the transaction before forcing the commit.  The problem with this is
      that with really fast storage (i.e. a Clariion) the time it takes to commit
      a transaction to disk is way faster than 1 jiffie in most cases, so
      sleeping means waiting longer with nothing to do than if we just committed
      the transaction and kept going.  Ric Wheeler noticed this when using
      fs_mark with more than 1 thread, the throughput would plummet as he added
      more threads.
      
      This patch attempts to fix this problem by recording the average time in
      nanoseconds that it takes to commit a transaction to disk, and what time
      we started the transaction.  If we run an fsync() and we have been running
      for less time than it takes to commit the transaction to disk, we sleep
      for the delta amount of time and then commit to disk.  We achieve
      sub-jiffie sleeping using schedule_hrtimeout.  This means that the wait
      time is auto-tuned to the speed of the underlying disk, instead of having
      this static timeout.  I weighted the average according to somebody's
      comments (Andreas Dilger I think) in order to help normalize random
      outliers where we take way longer or way less time to commit than the
      average.  I also have a min() check in there to make sure we don't sleep
      longer than a jiffie in case our storage is super slow, this was requested
      by Andrew.
      
      I unfortunately do not have access to a Clariion, so I had to use a
      ramdisk to represent a super fast array.  I tested with a SATA drive with
      barrier=1 to make sure there was no regression with local disks, I tested
      with a 4 way multipathed Apple Xserve RAID array and of course the
      ramdisk.  I ran the following command
      
      fs_mark -d /mnt/ext3-test -s 4096 -n 2000 -D 64 -t $i
      
      where $i was 2, 4, 8, 16 and 32.  I mkfs'ed the fs each time.  Here are my
      results
      
      type	threads		with patch	without patch
      sata	2		24.6		26.3
      sata	4		49.2		48.1
      sata	8		70.1		67.0
      sata	16		104.0		94.1
      sata	32		153.6		142.7
      
      xserve	2		246.4		222.0
      xserve	4		480.0		440.8
      xserve	8		829.5		730.8
      xserve	16		1172.7		1026.9
      xserve	32		1816.3		1650.5
      
      ramdisk	2		2538.3		1745.6
      ramdisk	4		2942.3		661.9
      ramdisk	8		2882.5		999.8
      ramdisk	16		2738.7		1801.9
      ramdisk	32		2541.9		2394.0
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
      Cc: Andreas Dilger <adilger@sun.com>
      Cc: Arjan van de Ven <arjan@infradead.org>
      Cc: Ric Wheeler <rwheeler@redhat.com>
      Cc: <linux-ext4@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f420d4dc
    •
      ext2: tighten restrictions on inode flags · ef8b6461
      Committed by Duane Griffin
      At the moment there are few restrictions on which flags may be set on
      which inodes.  Specifically DIRSYNC may only be set on directories and
      IMMUTABLE and APPEND may not be set on links.  Tighten that to disallow
      TOPDIR from being set on non-directories, and allow only NODUMP and
      NOATIME to be set on non-regular-file, non-directory inodes.
      
      Introduce a flag-masking function which masks flags based on mode, and
      use it during inode creation and when flags are set via the ioctl, to
      facilitate future consistency.
      Signed-off-by: Duane Griffin <duaneg@dghda.com>
      Acked-by: Andreas Dilger <adilger@sun.com>
      Cc: <linux-ext4@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ef8b6461