1. 11 Jan, 2012 10 commits
    • procfs: introduce the /proc/<pid>/map_files/ directory · 640708a2
      Pavel Emelyanov committed
      This directory behaves like /proc/<pid>/fd/: it contains one symlink
      for each file-backed mapping.  The name of a symlink is
      "vma->vm_start-vma->vm_end" and its target is the mapped file.  Opening
      a symlink yields a file that points to exactly the same inode as the
      vma's file.
      
      For example, ls -l of some arbitrary /proc/<pid>/map_files/ shows:
      
       | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80403000-7f8f80404000 -> /lib64/libc-2.5.so
       | lr-x------ 1 root root 64 Aug 26 06:40 7f8f8061e000-7f8f80620000 -> /lib64/libselinux.so.1
       | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80826000-7f8f80827000 -> /lib64/libacl.so.1.1.0
       | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a2f000-7f8f80a30000 -> /lib64/librt-2.5.so
       | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a30000-7f8f80a4c000 -> /lib64/ld-2.5.so
      
      This *helps* the checkpointing process in three ways:

      1. When dumping a task's mappings, we know exactly which file is mapped
         by a particular region.  We find out by opening the
         /proc/$pid/map_files/$address symlink, the same way we do with file
         descriptors.

      2. This also helps in determining which anonymous shared mappings are
         shared with each other, by comparing their inodes.

      3. When restoring a set of processes where two of them share a mapping,
         we map the memory in the first task, then open its
         /proc/$pid/map_files/$address file and map it in the second task.
      
      Using /proc/$pid/maps for this is quite inconvenient, since it requires
      repeatedly re-reading and re-parsing that text file, which slows down
      the restore procedure significantly.  Also, as pointed out in (3), it is
      much easier to use a top-level shared mapping in children via
      /proc/$pid/map_files/$address when needed.
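
      As a rough illustration of how a userspace tool might consume this
      directory, the Python sketch below parses the "start-end" symlink names
      and resolves each target.  The helper names parse_range() and
      list_map_files() are hypothetical and not part of the patch; reading
      the directory may need elevated privileges and a kernel built with this
      feature, so the sketch returns an empty list when it is unavailable.

```python
import os

def parse_range(name):
    """Split a map_files entry name 'start-end' (hex) into two integers."""
    start, end = name.split("-")
    return int(start, 16), int(end, 16)

def list_map_files(pid="self"):
    """Return sorted [(start, end, target)] entries, or [] if unavailable."""
    path = f"/proc/{pid}/map_files"
    entries = []
    try:
        names = os.listdir(path)
    except OSError:
        return entries  # directory missing or permission denied
    for name in names:
        try:
            target = os.readlink(os.path.join(path, name))
        except OSError:
            continue  # mapping vanished or link not readable
        start, end = parse_range(name)
        entries.append((start, end, target))
    return sorted(entries)

if __name__ == "__main__":
    for start, end, target in list_map_files():
        print(f"{start:x}-{end:x} -> {target}")
```

      Compared with parsing /proc/$pid/maps, each lookup here is a single
      readlink() on a name that already encodes the address range.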
      
      [akpm@linux-foundation.org: coding-style fixes]
      [gorcunov@openvz.org: make map_files depend on CHECKPOINT_RESTORE]
      Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Reviewed-by: Vasiliy Kulikov <segoon@openwall.com>
      Reviewed-by: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Pavel Machek <pavel@ucw.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • procfs: make proc_get_link to use dentry instead of inode · 7773fbc5
      Cyrill Gorcunov committed
      Prepare the ground for the next "map_files" patch, which needs the name
      of a link file to analyse.
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vasiliy Kulikov <segoon@openwall.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • reiserfs: don't lock root inode searching · 9b467e6e
      Frederic Weisbecker committed
      Nothing requires that we lock the filesystem until the root inode is
      provided.
      
      Also iget5_locked() triggers a warning because we are holding the
      filesystem lock while allocating the inode, which results in a lockdep
      suspicion that we have a lock inversion against the reclaim path:
      
      [ 1986.896979] =================================
      [ 1986.896990] [ INFO: inconsistent lock state ]
      [ 1986.896997] 3.1.1-main #8
      [ 1986.897001] ---------------------------------
      [ 1986.897007] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
      [ 1986.897016] kswapd0/16 [HC0[0]:SC0[0]:HE1:SE1] takes:
      [ 1986.897023]  (&REISERFS_SB(s)->lock){+.+.?.}, at: [<c01f8bd4>] reiserfs_write_lock+0x20/0x2a
      [ 1986.897044] {RECLAIM_FS-ON-W} state was registered at:
      [ 1986.897050]   [<c014a5b9>] mark_held_locks+0xae/0xd0
      [ 1986.897060]   [<c014aab3>] lockdep_trace_alloc+0x7d/0x91
      [ 1986.897068]   [<c0190ee0>] kmem_cache_alloc+0x1a/0x93
      [ 1986.897078]   [<c01e7728>] reiserfs_alloc_inode+0x13/0x3d
      [ 1986.897088]   [<c01a5b06>] alloc_inode+0x14/0x5f
      [ 1986.897097]   [<c01a5cb9>] iget5_locked+0x62/0x13a
      [ 1986.897106]   [<c01e99e0>] reiserfs_fill_super+0x410/0x8b9
      [ 1986.897114]   [<c01953da>] mount_bdev+0x10b/0x159
      [ 1986.897123]   [<c01e764d>] get_super_block+0x10/0x12
      [ 1986.897131]   [<c0195b38>] mount_fs+0x59/0x12d
      [ 1986.897138]   [<c01a80d1>] vfs_kern_mount+0x45/0x7a
      [ 1986.897147]   [<c01a83e3>] do_kern_mount+0x2f/0xb0
      [ 1986.897155]   [<c01a987a>] do_mount+0x5c2/0x612
      [ 1986.897163]   [<c01a9a72>] sys_mount+0x61/0x8f
      [ 1986.897170]   [<c044060c>] sysenter_do_call+0x12/0x32
      [ 1986.897181] irq event stamp: 7509691
      [ 1986.897186] hardirqs last  enabled at (7509691): [<c0190f34>] kmem_cache_alloc+0x6e/0x93
      [ 1986.897197] hardirqs last disabled at (7509690): [<c0190eea>] kmem_cache_alloc+0x24/0x93
      [ 1986.897209] softirqs last  enabled at (7508896): [<c01294bd>] __do_softirq+0xee/0xfd
      [ 1986.897222] softirqs last disabled at (7508859): [<c01030ed>] do_softirq+0x50/0x9d
      [ 1986.897234]
      [ 1986.897235] other info that might help us debug this:
      [ 1986.897242]  Possible unsafe locking scenario:
      [ 1986.897244]
      [ 1986.897250]        CPU0
      [ 1986.897254]        ----
      [ 1986.897257]   lock(&REISERFS_SB(s)->lock);
      [ 1986.897265] <Interrupt>
      [ 1986.897269]     lock(&REISERFS_SB(s)->lock);
      [ 1986.897276]
      [ 1986.897277]  *** DEADLOCK ***
      [ 1986.897278]
      [ 1986.897286] no locks held by kswapd0/16.
      [ 1986.897291]
      [ 1986.897292] stack backtrace:
      [ 1986.897299] Pid: 16, comm: kswapd0 Not tainted 3.1.1-main #8
      [ 1986.897306] Call Trace:
      [ 1986.897314]  [<c0439e76>] ? printk+0xf/0x11
      [ 1986.897324]  [<c01482d1>] print_usage_bug+0x20e/0x21a
      [ 1986.897332]  [<c01479b8>] ? print_irq_inversion_bug+0x172/0x172
      [ 1986.897341]  [<c014855c>] mark_lock+0x27f/0x483
      [ 1986.897349]  [<c0148d88>] __lock_acquire+0x628/0x1472
      [ 1986.897358]  [<c0149fae>] lock_acquire+0x47/0x5e
      [ 1986.897366]  [<c01f8bd4>] ? reiserfs_write_lock+0x20/0x2a
      [ 1986.897384]  [<c01f8bd4>] ? reiserfs_write_lock+0x20/0x2a
      [ 1986.897397]  [<c043b5ef>] mutex_lock_nested+0x35/0x26f
      [ 1986.897409]  [<c01f8bd4>] ? reiserfs_write_lock+0x20/0x2a
      [ 1986.897421]  [<c01f8bd4>] reiserfs_write_lock+0x20/0x2a
      [ 1986.897433]  [<c01e2edd>] map_block_for_writepage+0xc9/0x590
      [ 1986.897448]  [<c01b1706>] ? create_empty_buffers+0x33/0x8f
      [ 1986.897461]  [<c0121124>] ? get_parent_ip+0xb/0x31
      [ 1986.897472]  [<c043ef7f>] ? sub_preempt_count+0x81/0x8e
      [ 1986.897485]  [<c043cae0>] ? _raw_spin_unlock+0x27/0x3d
      [ 1986.897496]  [<c0121124>] ? get_parent_ip+0xb/0x31
      [ 1986.897508]  [<c01e355d>] reiserfs_writepage+0x1b9/0x3e7
      [ 1986.897521]  [<c0173b40>] ? clear_page_dirty_for_io+0xcb/0xde
      [ 1986.897533]  [<c014a6e3>] ? trace_hardirqs_on_caller+0x108/0x138
      [ 1986.897546]  [<c014a71e>] ? trace_hardirqs_on+0xb/0xd
      [ 1986.897559]  [<c0177b38>] shrink_page_list+0x34f/0x5e2
      [ 1986.897572]  [<c01780a7>] shrink_inactive_list+0x172/0x22c
      [ 1986.897585]  [<c0178464>] shrink_zone+0x303/0x3b1
      [ 1986.897597]  [<c043cae0>] ? _raw_spin_unlock+0x27/0x3d
      [ 1986.897611]  [<c01788c9>] kswapd+0x3b7/0x5f2
      
      The deadlock shouldn't happen: since we are doing that allocation in the
      mount path, the filesystem is not yet available for any reclaim.  Still,
      the warning is annoying.

      To solve this, acquire the lock later, only where we need it: right
      before calling reiserfs_read_locked_inode(), which wants the lock to
      walk the tree.
      Reported-by: Knut Petersen <Knut_Petersen@t-online.de>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Jeff Mahoney <jeffm@suse.com>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • reiserfs: don't lock journal_init() · 37c69b98
      Frederic Weisbecker committed
      journal_init() doesn't need the lock since no operation on the filesystem
      is involved there.  journal_read() and get_list_bitmap() have yet to be
      reviewed carefully, though, before removing the lock there.  Just keep
      it around these two calls for safety.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Jeff Mahoney <jeffm@suse.com>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • reiserfs: delay reiserfs lock until journal initialization · f32485be
      Frederic Weisbecker committed
      In the mount path, transactions that are made before journal
      initialization don't involve the filesystem.  We can delay the reiserfs
      lock until we play with the journal.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Jeff Mahoney <jeffm@suse.com>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • reiserfs: delete comments referring to the BKL · b18c1c6e
      Davidlohr Bueso committed
      Signed-off-by: Davidlohr Bueso <dave@gnu.org>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs: binfmt_elf: create Kconfig variable for PIE randomization · e39f5602
      David Daney committed
      Randomization of PIE load address is hard coded in binfmt_elf.c for X86
      and ARM.  Create a new Kconfig variable
      (CONFIG_ARCH_BINFMT_ELF_RANDOMIZE_PIE) for this and use it instead.  Thus
      architecture specific policy is pushed out of the generic binfmt_elf.c and
      into the architecture Kconfig files.
      
      The X86 and ARM Kconfigs are modified to select the new variable, so
      there is no change in behavior.  A follow-on patch will select it for
      MIPS too.
      Signed-off-by: David Daney <david.daney@cavium.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Acked-by: H. Peter Anvin <hpa@zytor.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tracepoint: add tracepoints for debugging oom_score_adj · 43d2b113
      KAMEZAWA Hiroyuki committed
      oom_score_adj is used for guarding processes from the OOM killer.  One
      problem is that it is inherited at fork().  When a daemon sets
      oom_score_adj and creates children, it's hard to know where the value
      was set.

      This patch adds three tracepoints useful for debugging:
        - creating a new task
        - renaming a task (exec)
        - setting oom_score_adj

      To debug, users need to enable some of these tracepoints.  Filtering may
      be useful, for example:
      
      # EVENT=/sys/kernel/debug/tracing/events/task/
      # echo "oom_score_adj != 0" > $EVENT/task_newtask/filter
      # echo "oom_score_adj != 0" > $EVENT/task_rename/filter
      # echo 1 > $EVENT/enable
      # EVENT=/sys/kernel/debug/tracing/events/oom/
      # echo 1 > $EVENT/enable
      
      The output will look like this:
      # grep oom /sys/kernel/debug/tracing/trace
      bash-7699  [007] d..3  5140.744510: oom_score_adj_update: pid=7699 comm=bash oom_score_adj=-1000
      bash-7699  [007] ...1  5151.818022: task_newtask: pid=7729 comm=bash clone_flags=1200011 oom_score_adj=-1000
      ls-7729  [003] ...2  5151.818504: task_rename: pid=7729 oldcomm=bash newcomm=ls oom_score_adj=-1000
      bash-7699  [002] ...1  5175.701468: task_newtask: pid=7730 comm=bash clone_flags=1200011 oom_score_adj=-1000
      grep-7730  [007] ...2  5175.701993: task_rename: pid=7730 oldcomm=bash newcomm=grep oom_score_adj=-1000
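
      A minimal sketch of how output like the above could be post-processed:
      the Python snippet below pulls the key=value fields out of each trace
      line so that tasks carrying a non-zero oom_score_adj can be listed.
      The regular expression and the parse_trace_line() helper are
      assumptions derived only from the sample lines shown here, not from a
      documented trace format.

```python
import re

# Assumed layout: "<comm>-<pid>  [cpu] flags  timestamp: event: key=value ..."
LINE_RE = re.compile(
    r"^\s*(?P<task>.+)-(?P<tid>\d+)\s+\[(?P<cpu>\d+)\]\s+\S+\s+"
    r"(?P<ts>[\d.]+):\s+(?P<event>\w+):\s+(?P<fields>.*)$"
)

def parse_trace_line(line):
    """Return a dict with the event name, timestamp and key=value fields,
    or None when the line does not look like a trace record."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    # Keep only well-formed key=value tokens; values stay as strings.
    fields = dict(kv.split("=", 1) for kv in m.group("fields").split() if "=" in kv)
    return {"event": m.group("event"), "timestamp": float(m.group("ts")), **fields}

if __name__ == "__main__":
    sample = ("ls-7729  [003] ...2  5151.818504: task_rename: pid=7729 "
              "oldcomm=bash newcomm=ls oom_score_adj=-1000")
    record = parse_trace_line(sample)
    if record and int(record.get("oom_score_adj", "0")) != 0:
        print(record["event"], record["pid"], record["oom_score_adj"])
```
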
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • btrfs: pass __GFP_WRITE for buffered write page allocations · e3a41a5b
      Johannes Weiner committed
      Tell the page allocator that pages allocated for a buffered write are
      expected to become dirty soon.
      Signed-off-by: Johannes Weiner <jweiner@redhat.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: account reaped page cache on inode cache pruning · 5f8aefd4
      Konstantin Khlebnikov committed
      Inode cache pruning indirectly reclaims page cache by invalidating mapping
      pages.  Let's account them in the reclaim state so that the memory
      reclaimer notices this progress.
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 10 Jan, 2012 4 commits
    • vfs: new helper - d_make_root() · adc0e91a
      Al Viro committed
      d_alloc_root() with iput() in case of allocation failure...
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • dcache: use a dispose list in select_parent · b48f03b3
      Dave Chinner committed
      select_parent currently abuses the dentry cache LRU to provide
      cleanup features for child dentries that need to be freed. It moves
      them to the tail of the LRU, then tells shrink_dcache_parent() to
      call __shrink_dcache_sb() to unconditionally move them to a dispose
      list (as DCACHE_REFERENCED is ignored). __shrink_dcache_sb() has to
      relock the dentries to move them off the LRU onto the dispose list,
      but otherwise does not touch the dentries that select_parent() moved
      to the tail of the LRU. It then passes the dispose list to
      shrink_dentry_list() which tries to free the dentries.

      IOWs, the use of __shrink_dcache_sb() is superfluous - we can build
      exactly the same list of dentries for disposal directly in
      select_parent() and call shrink_dentry_list() instead of calling
      __shrink_dcache_sb() to do that. This means that we avoid long holds
      on the lru lock walking the LRU moving dentries to the dispose list.
      We also avoid the need to relock each dentry just to move it off the
      LRU, reducing the number of times we lock each dentry to dispose of
      them in shrink_dcache_parent() from 3 to 2 times.

      Further, we remove one of the two callers of __shrink_dcache_sb().
      This also means that __shrink_dcache_sb() can be moved back into
      prune_dcache_sb() and we no longer have to handle referenced
      dentries conditionally, simplifying the code.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • ceph: d_alloc_root() may fail · 3c5184ef
      Al Viro committed
      ... and ceph_init_dentry(NULL) will oops
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • ext4: fix failure exits · 94bf608a
      Al Viro committed
      a) leaking root dentry is bad
      b) in case of failed ext4_mb_init() we don't want to do ext4_mb_release()
      c) OTOH, in the same case we *do* want ext4_ext_release()
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  3. 09 Jan, 2012 17 commits
  4. 08 Jan, 2012 1 commit
    • ore: Must support none-PAGE-aligned IO · 724577ca
      Boaz Harrosh committed
      NFS might send us offsets that are not PAGE aligned. So
      we must read in the remainder of the first/last pages, in cases
      where we need it for parity calculations.

      We only add sg segments to read the partial page. But
      we don't mark it as read=true because it is a lock-for-write
      page.
      
      TODO: In some cases (IO spans a single unit) we can just
      adjust the raid_unit offset/length, but this is left for
      later kernels.
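
      To make the arithmetic concrete, here is a hedged Python sketch of which
      byte ranges would have to be read in when an IO is not page aligned.
      PAGE_SIZE = 4096 and the helper name partial_page_ranges() are
      assumptions for illustration only; the actual ORE code works on sg
      segments and is not reproduced here.

```python
PAGE_SIZE = 4096  # assumed page size, for illustration only

def partial_page_ranges(offset, length, page_size=PAGE_SIZE):
    """Return (head, tail) byte ranges lying outside [offset, offset+length)
    that complete the first and last pages; None where already aligned."""
    end = offset + length
    first_page_start = offset - offset % page_size   # round down
    last_page_end = -(-end // page_size) * page_size  # round up
    head = (first_page_start, offset) if offset > first_page_start else None
    tail = (end, last_page_end) if last_page_end > end else None
    return head, tail

if __name__ == "__main__":
    # An IO starting 100 bytes into a page and spanning one page's worth.
    print(partial_page_ranges(100, 4096))
```

      The head and tail ranges are the parts of the first and last pages that
      the IO itself does not cover.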
      
      [Bug in 3.2.0 Kernel]
      CC: Stable Tree <stable@kernel.org>
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
  5. 07 Jan, 2012 8 commits