1. 28 Oct, 2010 3 commits
    • ext4: add support for lazy inode table initialization · bfff6873
      Committed by Lukas Czerner
      When the lazy_itable_init extended option is passed to mke2fs, it
      considerably speeds up filesystem creation because inode tables are
      not zeroed out.  The fact that parts of the inode table are
      uninitialized is not a problem so long as the block group descriptors,
      which contain information regarding how much of the inode table has
      been initialized, have not been corrupted.  However, if the block group
      checksums are not valid, e2fsck must scan the entire inode table, and
      the old, uninitialized data could potentially cause e2fsck to
      report false problems.
      
      Hence, it is important for the inode tables to be initialized as soon
      as possible.  This commit adds this feature so that mke2fs can safely
      use the lazy inode table initialization feature to speed up formatting
      file systems.
      
      This is done via a new kernel thread called ext4lazyinit, which is
      created on demand and destroyed when it is no longer needed.  There
      is only one thread for all ext4 filesystems in the system.  When the
      first filesystem with the init_itable mount option is mounted, the
      ext4lazyinit thread is created, and the filesystem can then register
      its request in the request list.
      
      This thread then walks through the list of requests, picking up
      scheduled requests and invoking ext4_init_inode_table().  The next
      schedule time for a request is computed by multiplying the time it
      took to zero out the last inode table by a wait multiplier, which can
      be set with the init_itable=n mount option (default is 10).  We do
      this so that zeroing does not take the whole I/O bandwidth.  When the
      thread is no longer necessary (the request list is empty), it frees
      the appropriate structures and exits (and can be created again later
      by another filesystem).
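
      The scheduling rule above can be sketched in a few lines of C.  This
      is an illustration under stated assumptions, not the kernel code: the
      helper name lazyinit_next_delay_ms and the millisecond unit are
      hypothetical, and only the default multiplier of 10 comes from the
      text.

```c
#include <stdint.h>

/* Default wait multiplier named in the commit message (init_itable=n). */
#define LAZYINIT_DEFAULT_MULT 10u

/*
 * Hypothetical helper: given how long the last inode table took to zero
 * (in milliseconds, an assumed unit) and the wait multiplier, return how
 * long to wait before zeroing the next table.
 */
static uint64_t lazyinit_next_delay_ms(uint64_t last_zero_ms, unsigned int mult)
{
    return last_zero_ms * mult;
}
```

      With the default multiplier, a table that took 20 ms to zero would be
      followed by a 200 ms wait, so the thread would spend roughly one
      eleventh of its time issuing I/O.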
      
      We do not disturb regular inode allocations in any way; they simply do
      not care whether the inode table is zeroed or not.  But when zeroing, we
      have to skip used inodes, obviously.  We should also prevent new inode
      allocations from the group while zeroing is under way.  For that we
      take the alloc_sem lock for writing in ext4_init_inode_table() and for
      reading in ext4_claim_inode(), so when we are unlucky and the allocator
      hits the group which is currently being zeroed, it just has to wait.
      
      This can be suppressed using the mount option no_init_itable.
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • /proc/pid/pagemap: document in Documentation/filesystems/proc.txt · 03f890f8
      Committed by Nikanth Karthikesan
      Document /proc/pid/pagemap in Documentation/filesystems/proc.txt
      Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
      Cc: Richard Guenther <rguenther@suse.de>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: Matt Mackall <mpm@selenic.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • /proc/pid/smaps: export amount of anonymous memory in a mapping · b40d4f84
      Committed by Nikanth Karthikesan
      Export the number of anonymous pages in a mapping via smaps.
      
      Even the private pages in a file-backed mapping are marked as
      anonymous once they are modified.  Export this information to
      user-space via smaps.
      
      Exporting this count will help gdb to make a better decision on which
      areas need to be dumped in its coredump; and should be useful to others
      studying the memory usage of a process.
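
      As a rough illustration of how a tool might consume the new count,
      the sketch below sums the anonymous memory over all mappings in
      smaps-formatted text.  It assumes the field is labelled "Anonymous:"
      and reported in kB like the other smaps fields; the function name
      sum_anonymous_kb is hypothetical.

```c
#include <stdio.h>
#include <string.h>

/*
 * Sum every "Anonymous:" value (in kB) found in smaps-formatted text.
 * Lines are expected to look like "Anonymous:            12 kB".
 */
static unsigned long sum_anonymous_kb(const char *smaps_text)
{
    unsigned long total = 0;
    const char *line = smaps_text;

    while (line && *line) {
        unsigned long kb;

        /* Matches only lines that begin with the literal "Anonymous:". */
        if (sscanf(line, "Anonymous: %lu kB", &kb) == 1)
            total += kb;

        /* Advance to the start of the next line, if there is one. */
        line = strchr(line, '\n');
        if (line)
            line++;
    }
    return total;
}
```

      A caller would read /proc/<pid>/smaps into a buffer and pass it to
      this function; other fields such as Size or Rss are ignored.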
      Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
      Acked-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Matt Mackall <mpm@selenic.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 27 Oct, 2010 2 commits
  3. 26 Oct, 2010 2 commits
  4. 25 Oct, 2010 1 commit
  5. 23 Oct, 2010 2 commits
    • Revert "tty: Add a new file /proc/tty/consoles" · 6c2754c2
      Committed by Linus Torvalds
      This reverts commit f4a3e0bc.  Jiri
      Slaby points out that the tty structure we're using may already be
      gone, and Al Viro doesn't hold back in complaining about the random
      loading of 'filp->private_data' which doesn't have to be a pointer at
      all, nor does checking the magic field for TTY_MAGIC prove anything.
      
      Belated review by Al:
      
       "a) global variable depending on stdin of the last opener? Affecting
           output of read(2)? Really?
      
        b) iterator is broken; list should be locked in ->start(), unlocked in
           ->stop() and *NOT* unlocked/relocked in ->next()
      
        c) ->show() ought to do nothing in case of ->device == NULL, instead
           of skipping those in ->next()/->start()
      
        d) regardless of the merits of the bright idea about asterisk at that
           line in output *and* regardless of (a), the implementation is not
           only atrociously ugly, it's actually very likely to be a roothole.
           Verifying that Cthulhu knows what number happens to be address of a
           tty_struct by blindly dereferencing memory at that address...
           Ouch.
      
        Please revert that crap."
      
      And Christoph pipes in and NAK's the approach of walking fd tables etc
      too.  So it's pretty unanimous.
      Noticed-by: Jiri Slaby <jslaby@suse.cz>
      Requested-by: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Greg Kroah-Hartman <gregkh@suse.de>
      Cc: Werner Fink <werner@suse.de>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tty: Add a new file /proc/tty/consoles · f4a3e0bc
      Committed by Dr. Werner Fink
      Add a new file /proc/tty/consoles to be able to determine the registered
      system console lines.  If the reading process holds /dev/console open at
      the regular standard input stream the active device will be marked by an
      asterisk.  Show possible operations and also decode the used flags of
      the listed console lines.
      Signed-off-by: Werner Fink <werner@suse.de>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
  6. 12 Oct, 2010 1 commit
    • ocfs2: Add a mount option "coherency=*" to handle cluster coherency for O_DIRECT writes. · 7bdb0d18
      Committed by Tristan Ye
      Currently, the default behavior of O_DIRECT writes allows
      concurrent writing among nodes to the same file, with no cluster
      coherency guaranteed (no EX lock held).  This can leave stale data in
      the cache for buffered reads on other nodes.
      
      The new mount option offers a choice between two different
      behaviors for O_DIRECT writes:
      
          * coherency=full, as the default value, will disallow
                            concurrent O_DIRECT writes by taking
                            EX locks.
      
          * coherency=buffered, allows concurrent O_DIRECT writes
                                without EX lock among nodes, which
                                gains high performance at risk of
                                getting stale data on other nodes.
      Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
  7. 08 Oct, 2010 1 commit
    • NFS: new idmapper · 955a857e
      Committed by Bryan Schumaker
      This patch creates a new idmapper system that uses the request-key function to
      place a call into userspace to map user and group ids to names.  The old
      idmapper was single threaded, which prevented more than one request from running
      at a single time.  This means that a user would have to wait for an upcall to
      finish before accessing a cached result.
      
      The upcall result is stored on a keyring of type id_resolver.  See the file
      Documentation/filesystems/nfs/idmapper.txt for instructions.
      Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
      [Trond: fix up the return value of nfs_idmap_lookup_name and clean up code]
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
  8. 17 Sep, 2010 1 commit
  9. 14 Aug, 2010 1 commit
  10. 10 Aug, 2010 4 commits
    • oom: deprecate oom_adj tunable · 51b1bd2a
      Committed by David Rientjes
      /proc/pid/oom_adj is now deprecated so that it may eventually be
      removed.  The target date for removal is August 2012.
      
      A warning will be printed to the kernel log if a task attempts to use this
      interface.  Further warnings will be suppressed until the kernel is rebooted
      to prevent spamming the kernel log.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • oom: badness heuristic rewrite · a63d83f4
      Committed by David Rientjes
      This is a complete rewrite of the oom killer's badness() heuristic which is
      used to determine which task to kill in oom conditions.  The goal is to
      make it as simple and predictable as possible so the results are better
      understood and we end up killing the task which will lead to the most
      memory freeing while still respecting the fine-tuning from userspace.
      
      Instead of basing the heuristic on mm->total_vm for each task, the task's
      rss and swap space are used.  This is a better indication of the
      amount of memory that will be freeable if the oom killed task is chosen
      and subsequently exits.  This helps specifically in cases where KDE or
      GNOME is chosen for oom kill on desktop systems instead of a memory
      hogging task.
      
      The baseline for the heuristic is a proportion of memory that each task is
      currently using in memory plus swap compared to the amount of "allowable"
      memory.  "Allowable," in this sense, means the system-wide resources for
      unconstrained oom conditions, the set of mempolicy nodes, the mems
      attached to current's cpuset, or a memory controller's limit.  The
      proportion is given on a scale of 0 (never kill) to 1000 (always kill),
      so a badness() score of 500 roughly means that the task consumes
      approximately 50% of allowable memory resident in RAM or in swap
      space.
      
      The proportion is always relative to the amount of "allowable" memory and
      not the total amount of RAM systemwide so that mempolicies and cpusets may
      operate in isolation; they shall not need to know the true size of the
      machine on which they are running if they are bound to a specific set of
      nodes or mems, respectively.
      
      Root tasks are given 3% extra memory just like __vm_enough_memory()
      provides in LSMs.  In the event of two tasks consuming similar amounts of
      memory, it is generally better to save root's task.
      
      Because of the change in the badness() heuristic's baseline, it is also
      necessary to introduce a new user interface to tune it.  It's not possible
      to redefine the meaning of /proc/pid/oom_adj with a new scale since the
      ABI cannot be changed for backward compatibility.  Instead, a new tunable,
      /proc/pid/oom_score_adj, is added that ranges from -1000 to +1000.  It may
      be used to polarize the heuristic such that certain tasks are never
      considered for oom kill while others may always be considered.  The value
      is added directly into the badness() score so a value of -500, for
      example, means to discount 50% of its memory consumption in comparison to
      other tasks either on the system, bound to the mempolicy, in the cpuset,
      or sharing the same memory controller.
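
      The scoring described so far can be summarised as a small function.
      This is a sketch of the arithmetic in the text, not the kernel's
      badness() itself: the name badness_sketch and its parameters are
      hypothetical, and treating the 3% root allowance as a 30-point
      discount is an interpretation of the paragraph above.

```c
/*
 * Illustrative scoring, following the commit message:
 *   used_pages      - the task's rss plus swap, in pages
 *   allowable_pages - memory the task may use, in pages (must be > 0)
 *   is_root         - nonzero for root tasks (3% discount, assumed to
 *                     mean 30 points on the 0..1000 scale)
 *   oom_score_adj   - userspace tuning in the range -1000..+1000
 * Returns a score clamped to 0 (never kill) .. 1000 (always kill).
 */
static int badness_sketch(unsigned long long used_pages,
                          unsigned long long allowable_pages,
                          int is_root, int oom_score_adj)
{
    long long points = (long long)(used_pages * 1000ULL / allowable_pages);

    if (is_root)
        points -= 30;

    /* The tunable is added directly into the score. */
    points += oom_score_adj;

    if (points < 0)
        return 0;
    if (points > 1000)
        return 1000;
    return (int)points;
}
```

      For a task using half of the allowable memory the baseline is 500;
      an oom_score_adj of -500 brings that down to 0, matching the "discount
      50%" example in the text.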
      
      /proc/pid/oom_adj is changed so that its meaning is rescaled into the
      units used by /proc/pid/oom_score_adj, and vice versa.  Changing one of
      these per-task tunables will rescale the value of the other to an
      equivalent meaning.  Although /proc/pid/oom_adj was originally defined as
      a bitshift on the badness score, it now shares the same linear growth as
      /proc/pid/oom_score_adj but with different granularity.  This is required
      so the ABI is not broken with userspace applications and allows oom_adj to
      be deprecated for future removal.
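
      One way to picture the rescaling between the two tunables is a pair
      of linear conversions.  The sketch below assumes the historical
      oom_adj range of -17 (disable) to +15, which this message does not
      state, and the -1000 to +1000 range given above; the function and
      macro names are hypothetical and the exact rounding is an assumption.

```c
/* Assumed limits; only the +/-1000 range is given in the text. */
#define SKETCH_OOM_DISABLE        (-17)
#define SKETCH_OOM_ADJUST_MAX     15
#define SKETCH_OOM_SCORE_ADJ_MAX  1000

/* Rescale an oom_adj value into oom_score_adj units. */
static int oom_adj_to_score_adj(int oom_adj)
{
    /* Pin the top of the old range to the top of the new one. */
    if (oom_adj == SKETCH_OOM_ADJUST_MAX)
        return SKETCH_OOM_SCORE_ADJ_MAX;
    return oom_adj * SKETCH_OOM_SCORE_ADJ_MAX / -SKETCH_OOM_DISABLE;
}

/* Rescale an oom_score_adj value back into oom_adj units. */
static int score_adj_to_oom_adj(int score_adj)
{
    if (score_adj == SKETCH_OOM_SCORE_ADJ_MAX)
        return SKETCH_OOM_ADJUST_MAX;
    return score_adj * -SKETCH_OOM_DISABLE / SKETCH_OOM_SCORE_ADJ_MAX;
}
```

      Because oom_adj has far fewer steps than oom_score_adj, a round trip
      through oom_adj loses precision, which is the "different granularity"
      mentioned above.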
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • update VFS documentation for method changes. · 336fb3b9
      Committed by Al Viro
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • C
      1e231735
  11. 06 Aug, 2010 3 commits
  12. 04 Aug, 2010 1 commit
    • Documentation: update broken web addresses. · 0ea6e611
      Committed by Justin P. Mattock
      This is an updated version of the original series, bunching all patches into one big patch
      that updates broken web addresses located in Documentation/*.
      Some of the addresses date as far back as 1995, so searching became a bit difficult;
      the best way to deal with outdated addresses is to use web.archive.org to locate them.
      Some addresses point to .spec files; some of these were located, but others (after searching
      on the company's site) were still nowhere to be found.  In those cases I just changed the address
      to the company's site so that users can contact the company, which can locate the files for them.
      Signed-off-by: Justin P. Mattock <justinmattock@gmail.com>
      Signed-off-by: Thomas Weber <weber@corscience.de>
      Signed-off-by: Mike Frysinger <vapier.adi@gmail.com>
      Cc: Paulo Marques <pmarques@grupopie.com>
      Cc: Randy Dunlap <rdunlap@xenotime.net>
      Cc: Michael Neuling <mikey@neuling.org>
      Signed-off-by: Jiri Kosina <jkosina@suse.cz>
  13. 31 Jul, 2010 1 commit
  14. 27 Jul, 2010 1 commit
  15. 23 Jul, 2010 3 commits
    • nilfs2: add nodiscard mount option · 802d3177
      Committed by Ryusuke Konishi
      Nilfs has a "discard" mount option which issues discard/TRIM commands to
      the underlying block device, but it lacks a complementary option and has
      no way to disable the feature through remount.
      
      This adds a "nodiscard" option to resolve this imbalance.
      Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
    • nilfs2: add barrier mount option · 773bc4f3
      Committed by Ryusuke Konishi
      Nilfs enables write barriers by default and has a "nobarrier" mount
      option to disable this feature.  But it lacks the complementary option
      and has no way to re-enable the feature on remount.
      
      This adds a "barrier" option to resolve this imbalance.
      Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
    • fscache: convert object to use workqueue instead of slow-work · 8b8edefa
      Committed by Tejun Heo
      Make fscache object state transition callbacks use workqueue instead
      of slow-work.  New dedicated unbound CPU workqueue fscache_object_wq
      is created.  get/put callbacks are renamed and modified to take
      @object and called directly from the enqueue wrapper and the work
      function.  While at it, make all open coded instances of get/put to
      use fscache_get/put_object().
      
      * Unbound workqueue is used.
      
      * work_busy() output is printed instead of slow-work flags in object
        debugging outputs.  They mean basically the same thing bit-for-bit.
      
      * sysctl fscache.object_max_active added to control concurrency.  The
        default value is nr_cpus clamped between 4 and
        WQ_UNBOUND_MAX_ACTIVE.
      
      * slow_work_sleep_till_thread_needed() is replaced with fscache
        private implementation fscache_object_sleep_till_congested() which
        waits on fscache_object_wq congestion.
      
      * debugfs support is dropped for now.  Tracing API based debug
        facility is planned to be added.
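
      The default for fscache.object_max_active mentioned in the list above
      is a simple clamp.  A minimal sketch, assuming a hypothetical helper
      name and taking the upper bound as a parameter because the value of
      WQ_UNBOUND_MAX_ACTIVE is not given here:

```c
/*
 * Clamp the CPU count to the range [4, wq_unbound_max_active].
 * wq_unbound_max_active stands in for WQ_UNBOUND_MAX_ACTIVE and is
 * assumed to be at least 4.
 */
static unsigned int default_object_max_active(unsigned int nr_cpus,
                                              unsigned int wq_unbound_max_active)
{
    if (nr_cpus < 4)
        return 4;
    if (nr_cpus > wq_unbound_max_active)
        return wq_unbound_max_active;
    return nr_cpus;
}
```

      A two-CPU machine would therefore still allow four concurrently
      active objects, while very large machines are capped at the
      workqueue limit.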
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: David Howells <dhowells@redhat.com>
  16. 03 Jun, 2010 2 commits
  17. 28 May, 2010 3 commits
    • fs: introduce new truncate sequence · 7bb46a67
      Committed by npiggin@suse.de
      Introduce a new truncate calling sequence into fs/mm subsystems. Rather than
      setattr > vmtruncate > truncate, have filesystems call their truncate sequence
      from ->setattr if filesystem specific operations are required. vmtruncate is
      deprecated, and truncate_pagecache and inode_newsize_ok helpers introduced
      previously should be used.
      
      simple_setattr is introduced for simple in-ram filesystems to implement
      the new truncate sequence. Eventually all filesystems should be converted
      to implement a setattr, and the default code in notify_change should go
      away.
      
      simple_setsize is also introduced to perform just the ATTR_SIZE portion
      of simple_setattr (ie. changing i_size and trimming pagecache).
      
      To implement the new truncate sequence:
      - filesystem specific manipulations (eg freeing blocks) must be done in
        the setattr method rather than ->truncate.
      - vmtruncate can not be used by core code to trim blocks past i_size in
        the event of write failure after allocation, so this must be performed
        in the fs code.
      - convert usage of helpers block_write_begin, nobh_write_begin,
        cont_write_begin, and *blockdev_direct_IO* to use _newtrunc postfixed
        variants. These avoid calling vmtruncate to trim blocks (see previous).
      - inode_setattr should not be used. generic_setattr is a new function
        to be used to copy simple attributes into the generic inode.
      - make use of the better opportunity to handle errors with the new sequence.
      
      Big problem with the previous calling sequence: the filesystem is not called
      until i_size has already changed.  This means it is not allowed to fail the
      call, and also it does not know what the previous i_size was. Also, generic
      code calling vmtruncate to truncate allocated blocks in case of error had
      no good way to return a meaningful error (or, for example, atomically handle
      block deallocation).
      
      Cc: Christoph Hellwig <hch@lst.de>
      Acked-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • drop unused dentry argument to ->fsync · 7ea80859
      Committed by Christoph Hellwig
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • Documentation/filesystems/Locking: update documentation on llseek() wrt BKL · 866707fc
      Committed by Jan Blunck
      The inode's i_size is not protected by the big kernel lock.  Therefore it
      does not make sense to recommend taking the BKL in filesystems' llseek
      operations.  Instead they should take the inode's mutex or just use
      i_size_read().  Add a note that this is not protecting
      file->f_pos.
      Signed-off-by: Jan Blunck <jblunck@suse.de>
      Acked-by: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: John Kacur <jkacur@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  18. 26 May, 2010 1 commit
  19. 25 May, 2010 1 commit
  20. 24 May, 2010 1 commit
  21. 22 May, 2010 2 commits
    • ext3: make barrier options consistent with ext4 · 0636c73e
      Committed by Eric Sandeen
      ext4 was updated to accept barrier/nobarrier mount options
      in addition to the older barrier=0/1.  The barrier story
      is complex enough, we should help people by making the options
      the same at least, even if the defaults are different.
      
      This patch allows the barrier/nobarrier mount options for ext3,
      while keeping nobarrier the default.
      
      It also unconditionally displays barrier status in show_options,
      and prints a message at mount time if barriers are not enabled,
      just as ext4 does.
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
    • sysfs-namespaces: add a high-level Documentation file · b9d8b45e
      Committed by Serge E. Hallyn
      The first three paragraphs are almost verbatim taken from Eric's
      commit message on the patch introducing network ns tags.  The next
      two paragraphs I wrote to be a brief high level overview.  The last
      section is taken from the commit message on "Implement sysfs tagged
      directory support", but updated.  Hopefully correctly.
      Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
  22. 14 May, 2010 1 commit
  23. 12 May, 2010 1 commit
    • revert "procfs: provide stack information for threads" and its fixup commits · 34441427
      Committed by Robin Holt
      Originally, commit d899bf7b ("procfs: provide stack information for
      threads") attempted to introduce a new feature for showing where the
      thread stack was located and how many pages are being utilized by the
      stack.
      
      Commit c44972f1 ("procfs: disable per-task stack usage on NOMMU") was
      applied to fix the NO_MMU case.
      
      Commit 89240ba0 ("x86, fs: Fix x86 procfs stack information for threads on
      64-bit") was applied to fix a bug in ia32 executables being loaded.
      
      Commit 9ebd4eba ("procfs: fix /proc/<pid>/stat stack pointer for kernel
      threads") was applied to fix a bug which had kernel threads printing a
      userland stack address.
      
      Commit 1306d603 ('proc: partially revert "procfs: provide stack
      information for threads"') was then applied to revert the stack pages
      being used to solve a significant performance regression.
      
      This patch nearly undoes the effect of all these patches.
      
      The reason for reverting these is it provides an unusable value in
      field 28.  For x86_64, a fork will result in the task->stack_start
      value being updated to the current user top of stack and not the stack
      start address.  This unpredictability of the stack_start value makes
      it worthless.  That includes the intended use of showing how much stack
      space a thread has.
      
      Other architectures will get different values.  As an example, ia64
      gets 0.  The do_fork() and copy_process() functions appear to treat the
      stack_start and stack_size parameters as architecture specific.
      
      I only partially reverted c44972f1 ("procfs: disable per-task stack usage
      on NOMMU").  If I had completely reverted it, I would have had to change
      mm/Makefile to only build pagewalk.o when CONFIG_PROC_PAGE_MONITOR is
      configured.  Since I could not test the builds without significant effort,
      I decided to not change mm/Makefile.
      
      I only partially reverted 89240ba0 ("x86, fs: Fix x86 procfs stack
      information for threads on 64-bit").  I left the KSTK_ESP() change in
      place as that seemed worthwhile.
      Signed-off-by: Robin Holt <holt@sgi.com>
      Cc: Stefani Seibold <stefani@seibold.net>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  24. 11 May, 2010 1 commit