1. 24 Sep, 2009 1 commit
    • fs: change sys_truncate length parameter type · 4fd8da8d
      Heiko Carstens committed
      For this system call user space passes a signed long length parameter,
      while the kernel side takes an unsigned long parameter and converts it
      later to signed long again.
      
      This has led to bugs in compat wrappers; see e.g. dd90bbd5 "powerpc: Add
      compat_sys_truncate".  The s390 compat wrapper for this function is
      broken as well, since it also performs zero extension instead of sign
      extension for the length parameter.
      
      In addition if hpa comes up with an automated way of generating
      compat wrappers it would generate a wrong one here.
      
      So change the length parameter from unsigned long to long.
      
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 23 Sep, 2009 39 commits
    • ext2: fix format string compile warning (ino_t) · a4255e4c
      Heiko Carstens committed
      Unlike on most other architectures ino_t is an unsigned int on s390.  So
      add an explicit cast to avoid this compile warning:
      
      fs/ext2/namei.c: In function 'ext2_lookup':
      fs/ext2/namei.c:73: warning: format '%lu' expects type 'long unsigned int', but argument 4 has type 'ino_t'
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • V3 minixfs: add missing directory type checking · 9f6c1333
      Doug Graham committed
      There are a few places in the Minix FS code where the "inode" field of a
      minix_dir_entry is used without checking first to see if the dirent is
      really a minix3_dir_entry.  The inode number in a V1/V2 dirent is 16 bits,
      whereas that in a V3 dirent is 32 bits.
      
      Accessing it as a 16 bit field when it really should be accessed as a 32
      bit field probably kinda sorta works on a little-endian machine, but leads
      to some rather odd behaviour on big-endian machines.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Doug Graham <dgraham@nortel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ncpfs: fix wrong check in __ncp_ioctl() · 8b2feb10
      Bartlomiej Zolnierkiewicz committed
      We want to check for s_inode's existence, not inode's one (inode is always
      valid in this function).
      
      This takes care of the following entry from Dan's list:
      
      fs/ncpfs/ioctl.c +445 __ncp_ioctl(180) warning: variable derefenced before check 'inode'
      Reported-by: Dan Carpenter <error27@gmail.com>
      Cc: Julia Lawall <julia@diku.dk>
      Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
      Cc: Petr Vandrovec <vandrove@vc.cvut.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ncpfs: read buffer overflow · c5df5913
      Roel Kluin committed
      This function uses signed integers for the unix_date and local variables -
      if a negative number is supplied and the leap-year condition is not met,
      month will be 0, leading to a later read of day_n[-1].
      Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
      Cc: Petr Vandrovec <VANDROVE@vc.cvut.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ramfs: move RAMFS_MAGIC to include/linux/magic.h · a7e3108c
      maximilian attems committed
      initramfs userspace likes to use this magic number.
      
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: maximilian attems <max@stro.at>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • /proc/kcore: update stat.st_size after memory hotplug · 0d4c36a9
      KAMEZAWA Hiroyuki committed
      After memory hotplug (or other events in future), kcore size can be
      modified.
      
      To update inode->i_size, we have to know the inode/dentry, but we can't get
      it from inside /proc directly.  But considering memory hotplug, the kcore
      image is updated only when it's opened, so updating inode->i_size at open()
      is enough.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: WANG Cong <xiyou.wangcong@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • /proc/kcore: fix stat.st_size · 678ad5d8
      KAMEZAWA Hiroyuki committed
      Presently the size of /proc/kcore which can be read by 'ls -l' is 0.  But
      it's not the correct value.
      
      On x86-64, ls -l shows
       ... root root 140737486266368 2009-09-17 10:29 /proc/kcore
      That is 0x7FFFFFE02000, which comes from the vmalloc area's size.
      (*) This shows the "core" size, not the memory size.
      
      This patch shows the size by updating the "size" field in struct
      proc_dir_entry.  Later, the lookup routine will create the inode and fill
      inode->i_size based on this value.  This has a problem, though:

       - Once the inode is cached, inode->i_size will never be updated.

      So this patch is not memory-hotplug-aware.

      To update inode->i_size, we have to know the dentry or inode,
      but there is no way to look them up from inside the kernel. Hmmm....
      The next patch will try it.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: WANG Cong <xiyou.wangcong@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kcore: more fixes for init · 90396f96
      KAMEZAWA Hiroyuki committed
      proc_kcore_init() doesn't check NULL case.  fix it and remove unnecessary
      comments.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: WANG Cong <xiyou.wangcong@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kcore: register module area in generic way · 81ac3ad9
      KAMEZAWA Hiroyuki committed
      Some archs define MODULES_VADDR/MODULES_END, which is not in the VMALLOC
      area.  This is handled only on x86-64.  This patch makes it more generic.
      And we can use vread/vwrite to access the area.  Fix it.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Jiri Slaby <jirislaby@gmail.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: WANG Cong <xiyou.wangcong@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kcore: register vmemmap range · 26562c59
      KAMEZAWA Hiroyuki committed
      Benjamin Herrenschmidt <benh@kernel.crashing.org> pointed out that vmemmap
      range is not included in KCORE_RAM, KCORE_VMALLOC ....
      
      This adds KCORE_VMEMMAP if SPARSEMEM_VMEMMAP is used.  With this, vmemmap
      is readable via /proc/kcore.

      Because it's not a vmalloc area, vread/vwrite cannot be used.  But since the
      range is static with respect to the memory layout, this patch handles the
      vmemmap area with the same scheme as physical memory.
      
      This patch assumes SPARSEMEM_VMEMMAP range is not in VMALLOC range.  It's
      correct now.
      
      [akpm@linux-foundation.org: fix typo]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Jiri Slaby <jirislaby@gmail.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: WANG Cong <xiyou.wangcong@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kcore: use registerd physmem information · 3089aa1b
      KAMEZAWA Hiroyuki committed
      For /proc/kcore, each arch registers its memory ranges with kclist_add().
      Usually,

      	- range of physical memory
      	- range of vmalloc area
      	- text, etc...

      are registered, but "range of physical memory" has some problems.  It
      isn't updated at memory hotplug and it tends to include unnecessary
      memory holes.  Now, /proc/iomem (kernel/resource.c) includes the required
      physical memory range information and it's properly updated at memory
      hotplug.  So it's good to avoid duplicating that information in kcore's
      own code and to rebuild kclist for physical memory based on /proc/iomem.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: WANG Cong <xiyou.wangcong@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kcore: register text area in generic way · 9492587c
      KAMEZAWA Hiroyuki committed
      Some 64-bit archs have a special segment for mapping kernel text.  It should
      be entered into /proc/kcore in addition to the direct linear map and the
      vmalloc area.  This patch unifies the KCORE_TEXT entries scattered under x86
      and ia64.

      I'm not familiar with other archs (mips has its own even after this patch),
      but if the range [_stext ... _end) is a valid text area and it's not in the
      direct-map area, defining CONFIG_ARCH_PROC_KCORE_TEXT is the only thing
      that needs to be done.
      
      Note: I left mips as it is now.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: WANG Cong <xiyou.wangcong@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kcore: register vmalloc area in generic way · a0614da8
      KAMEZAWA Hiroyuki committed
      For /proc/kcore, vmalloc areas are registered per arch.  But all of them
      register the same range of [VMALLOC_START...VMALLOC_END).  This patch
      unifies them.  With this, archs which have no kclist_add() hooks can see
      the vmalloc area correctly.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: WANG Cong <xiyou.wangcong@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kcore: add kclist types · c30bb2a2
      KAMEZAWA Hiroyuki committed
      Presently, kclist_add() only takes a start address and a size as its
      arguments.  To make kclist dynamically reconfigurable, it's necessary to
      know which kclists are for System RAM and which are not.

      This patch adds kclist types:
        KCORE_RAM
        KCORE_VMALLOC
        KCORE_TEXT
        KCORE_OTHER
      
      This "type" is used in a patch following this for detecting KCORE_RAM.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: WANG Cong <xiyou.wangcong@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kcore: use usual list for kclist · 2ef43ec7
      KAMEZAWA Hiroyuki committed
      This patchset is for /proc/kcore.  With this,
      
       - many per-arch hooks are removed.
      
       - /proc/kcore will know really valid physical memory area.
      
       - /proc/kcore will be aware of memory hotplug.
      
       - /proc/kcore will be architecture independent i.e.
         if an arch supports CONFIG_MMU, it can use /proc/kcore.
         (if the arch uses usual memory layout.)
      
      This patch:
      
      /proc/kcore uses its own list handling code. It's better to use the
      generic list code.

      No changes in logic. Just a cleanup.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: WANG Cong <xiyou.wangcong@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • procfs: provide stack information for threads · d899bf7b
      Stefani Seibold committed
      A patch to give a better overview of the userland application stack usage,
      especially for embedded linux.
      
      Currently you are only able to dump the main process/thread stack usage,
      which is shown in /proc/pid/status by the "VmStk" value.  But you get no
      information about the consumed stack memory of the threads.

      There is an enhancement in /proc/<pid>/{task/*,}/*maps which marks the vm
      mapping where the thread stack pointer resides with "[thread stack
      xxxxxxxx]".  xxxxxxxx is the maximum size of the stack.  This is valuable
      information, because libpthread doesn't set the start of the stack to the
      top of the mapped area, depending on the pthread usage.
      
      A sample output of /proc/<pid>/task/<tid>/maps looks like:
      
      08048000-08049000 r-xp 00000000 03:00 8312       /opt/z
      08049000-0804a000 rw-p 00001000 03:00 8312       /opt/z
      0804a000-0806b000 rw-p 00000000 00:00 0          [heap]
      a7d12000-a7d13000 ---p 00000000 00:00 0
      a7d13000-a7f13000 rw-p 00000000 00:00 0          [thread stack: 001ff4b4]
      a7f13000-a7f14000 ---p 00000000 00:00 0
      a7f14000-a7f36000 rw-p 00000000 00:00 0
      a7f36000-a8069000 r-xp 00000000 03:00 4222       /lib/libc.so.6
      a8069000-a806b000 r--p 00133000 03:00 4222       /lib/libc.so.6
      a806b000-a806c000 rw-p 00135000 03:00 4222       /lib/libc.so.6
      a806c000-a806f000 rw-p 00000000 00:00 0
      a806f000-a8083000 r-xp 00000000 03:00 14462      /lib/libpthread.so.0
      a8083000-a8084000 r--p 00013000 03:00 14462      /lib/libpthread.so.0
      a8084000-a8085000 rw-p 00014000 03:00 14462      /lib/libpthread.so.0
      a8085000-a8088000 rw-p 00000000 00:00 0
      a8088000-a80a4000 r-xp 00000000 03:00 8317       /lib/ld-linux.so.2
      a80a4000-a80a5000 r--p 0001b000 03:00 8317       /lib/ld-linux.so.2
      a80a5000-a80a6000 rw-p 0001c000 03:00 8317       /lib/ld-linux.so.2
      afaf5000-afb0a000 rw-p 00000000 00:00 0          [stack]
      ffffe000-fffff000 r-xp 00000000 00:00 0          [vdso]
      
      Also there is a new entry "stack usage" in /proc/<pid>/{task/*,}/status
      which will give you the current stack usage in kB.
      
      A sample output of /proc/self/status looks like:
      
      Name:	cat
      State:	R (running)
      Tgid:	507
      Pid:	507
      .
      .
      .
      CapBnd:	fffffffffffffeff
      voluntary_ctxt_switches:	0
      nonvoluntary_ctxt_switches:	0
      Stack usage:	12 kB
      
      I also fixed stack base address in /proc/<pid>/{task/*,}/stat to the base
      address of the associated thread stack and not the one of the main
      process.  This makes more sense.
      
      [akpm@linux-foundation.org: fs/proc/array.c now needs walk_page_range()]
      Signed-off-by: Stefani Seibold <stefani@seibold.net>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs/proc/base.c: fix proc_fault_inject_write() input sanity check · cba8aafe
      Vincent Li committed
      Remove obfuscated zero-length input check and return -EINVAL instead of
      -EIO error to make the error message clear to user.  Add whitespace
      stripping.  No functionality changes.
      
      The old code:
      
      echo  1  > /proc/pid/make-it-fail (ok)
      echo 1foo > /proc/pid/make-it-fail (-bash: echo: write error: Input/output error)
      
      The new code:
      
      echo  1  > /proc/pid/make-it-fail (ok)
      echo 1foo > /proc/pid/make-it-fail (-bash: echo: write error: Invalid argument)
      
      This patch is conservative in its changes so as not to break existing
      scripts/applications.
      Signed-off-by: Vincent Li <macli@brc.ubc.ca>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs/proc/task_mmu.c v1: fix clear_refs_write() input sanity check · fb92a4b0
      Vincent Li committed
      Andrew Morton pointed out similar string hacking and an obfuscated check
      for zero-length input at the end of the function, and David Rientjes
      suggested using strict_strtol to replace simple_strtol.  This patch covers
      the above suggestions and adds removal of leading and trailing whitespace
      from user input.  It does not change the function's behaviour.
      Signed-off-by: Vincent Li <macli@brc.ubc.ca>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Amerigo Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kcore: fix /proc/kcore's stat.st_size · acef82b8
      Amerigo Wang committed
      In 9063c61f ("x86, 64-bit: Clean up user address masking") Linus
      fixed the wrong size of /proc/kcore problem.
      
      But its size still looks insane, since it never equals the size of
      physical memory.
      Signed-off-by: WANG Cong <amwang@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Tao Ma <tao.ma@oracle.com>
      Cc: <mtk.manpages@gmail.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • proc_flush_task: flush /proc/tid/task/pid when a sub-thread exits · 9b4d1cbe
      Oleg Nesterov committed
      The exiting sub-thread flushes /proc/pid only, but this doesn't buy too
      much: ps and friends mostly use /proc/tid/task/pid.
      
      Remove "if (thread_group_leader())" checks from proc_flush_task() path,
      this means we always remove /proc/tid/task/pid dentry on exit, and this
      actually matches the comment above proc_flush_task().
      
      The test-case:
      
      	static void* tfunc(void *arg)
      	{
      		char name[256];
      
      		sprintf(name, "/proc/%d/task/%ld/status", getpid(), gettid());
      		close(open(name, O_RDONLY));
      
      		return NULL;
      	}
      
      	int main(void)
      	{
      		pthread_t t;
      
      		for (;;) {
      			if (!pthread_create(&t, NULL, &tfunc, NULL))
      				pthread_join(t, NULL);
      		}
      	}
      
      slabtop shows that pid/proc_inode_cache/etc grow quickly and
      "indefinitely" until the task is killed or shrink_slab() is called, not
      good.  And the main thread needs a lot of time to exit.
      
      The same can happen if something like "ps -efL" runs continuously, while
      some application spawns short-living threads.
      Reported-by: "James M. Leddy" <jleddy@redhat.com>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Dominic Duval <dduval@redhat.com>
      Cc: Frank Hirtz <fhirtz@redhat.com>
      Cc: "Fuller, Johnray" <Johnray.Fuller@gs.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Paul Batkowski <pbatkowski@redhat.com>
      Cc: Roland McGrath <roland@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • proc: fix reported unit for RLIMIT_CPU · cff4edb5
      Kees Cook committed
      /proc/$pid/limits should show RLIMIT_CPU as seconds, which is the unit
      used in kernel/posix-cpu-timers.c:
      
              unsigned long psecs = cputime_to_secs(ptime);
              ...
              if (psecs >= sig->rlim[RLIMIT_CPU].rlim_max) {
                      ...
                      __group_send_sig_info(SIGKILL, SEND_SIG_PRIV, tsk);
      Signed-off-by: Kees Cook <kees.cook@canonical.com>
      Acked-by: WANG Cong <xiyou.wangcong@gmail.com>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • getrusage: fill ru_maxrss value · 1f10206c
      Jiri Pirko committed
      Fill the ->ru_maxrss value in struct rusage according to the rss hiwater
      mark.  This struct is filled as a parameter to the getrusage syscall.
      The ->ru_maxrss value is set in KBs, which is the way it is done in BSD
      systems.  The /usr/bin/time (gnu time) application converts ->ru_maxrss to
      KBs, which seems to be incorrect behavior.  The maintainer of this util was
      notified by me with the patch which corrects it and cc'ed.
      
      To make this happen we extend struct signal_struct by two fields.  The
      first one is ->maxrss, which we use to store the rss hiwater of the task.
      The second one is ->cmaxrss, which we use to store the highest rss hiwater
      of all the task's children.  These values are used in k_getrusage() to
      actually fill ->ru_maxrss.  k_getrusage() uses the current rss hiwater
      value directly if the mm struct exists.
      
      Note:
      exec() clears mm->hiwater_rss, but doesn't clear sig->maxrss.
      This is intentional: on *BSD, getrusage values are inherited across exec().
      
      test programs
      ========================================================
      
      getrusage.c
      ===========
       #include <stdio.h>
       #include <stdlib.h>
       #include <string.h>
       #include <sys/types.h>
       #include <sys/time.h>
       #include <sys/resource.h>
       #include <sys/types.h>
       #include <sys/wait.h>
       #include <unistd.h>
       #include <signal.h>
       #include <sys/mman.h>
      
       #include "common.h"
      
       #define err(str) perror(str), exit(1)
      
      int main(int argc, char** argv)
      {
      	int status;
      
      	printf("allocate 100MB\n");
      	consume(100);
      
      	printf("testcase1: fork inherit? \n");
      	printf("  expect: initial.self ~= child.self\n");
      	show_rusage("initial");
      	if (__fork()) {
      		wait(&status);
      	} else {
      		show_rusage("fork child");
      		_exit(0);
      	}
      	printf("\n");
      
      	printf("testcase2: fork inherit? (cont.) \n");
      	printf("  expect: initial.children ~= 100MB, but child.children = 0\n");
      	show_rusage("initial");
      	if (__fork()) {
      		wait(&status);
      	} else {
      		show_rusage("child");
      		_exit(0);
      	}
      	printf("\n");
      
      	printf("testcase3: fork + malloc \n");
      	printf("  expect: child.self ~= initial.self + 50MB\n");
      	show_rusage("initial");
      	if (__fork()) {
      		wait(&status);
      	} else {
      		printf("allocate +50MB\n");
      		consume(50);
      		show_rusage("fork child");
      		_exit(0);
      	}
      	printf("\n");
      
      	printf("testcase4: grandchild maxrss\n");
      	printf("  expect: post_wait.children ~= 300MB\n");
      	show_rusage("initial");
      	if (__fork()) {
      		wait(&status);
      		show_rusage("post_wait");
      	} else {
      		system("./child -n 0 -g 300");
      		_exit(0);
      	}
      	printf("\n");
      
      	printf("testcase5: zombie\n");
      	printf("  expect: pre_wait ~= initial, IOW the zombie process is not accounted.\n");
      	printf("          post_wait ~= 400MB, IOW wait() collect child's max_rss. \n");
      	show_rusage("initial");
      	if (__fork()) {
      		sleep(1); /* children become zombie */
      		show_rusage("pre_wait");
      		wait(&status);
      		show_rusage("post_wait");
      	} else {
      		system("./child -n 400");
      		_exit(0);
      	}
      	printf("\n");
      
      	printf("testcase6: SIG_IGN\n");
      	printf("  expect: initial ~= after_zombie (child's 500MB alloc should be ignored).\n");
      	show_rusage("initial");
      	signal(SIGCHLD, SIG_IGN);
      	if (__fork()) {
      		sleep(1); /* children become zombie */
      		show_rusage("after_zombie");
      	} else {
      		system("./child -n 500");
      		_exit(0);
      	}
      	printf("\n");
      	signal(SIGCHLD, SIG_DFL);
      
      	printf("testcase7: exec (without fork) \n");
      	printf("  expect: initial ~= exec \n");
      	show_rusage("initial");
      	execl("./child", "child", "-v", NULL);
      
      	return 0;
      }
      
      child.c
      =======
       #include <sys/types.h>
       #include <unistd.h>
       #include <sys/types.h>
       #include <sys/wait.h>
       #include <stdio.h>
       #include <stdlib.h>
       #include <string.h>
       #include <sys/types.h>
       #include <sys/time.h>
       #include <sys/resource.h>
      
       #include "common.h"
      
      int main(int argc, char** argv)
      {
      	int status;
      	int c;
      	long consume_size = 0;
      	long grandchild_consume_size = 0;
      	int show = 0;
      
      	while ((c = getopt(argc, argv, "n:g:v")) != -1) {
      		switch (c) {
      		case 'n':
      			consume_size = atol(optarg);
      			break;
      		case 'v':
      			show = 1;
      			break;
      		case 'g':
      
      			grandchild_consume_size = atol(optarg);
      			break;
      		default:
      			break;
      		}
      	}
      
      	if (show)
      		show_rusage("exec");
      
      	if (consume_size) {
      		printf("child alloc %ldMB\n", consume_size);
      		consume(consume_size);
      	}
      
      	if (grandchild_consume_size) {
      		if (fork()) {
      			wait(&status);
      		} else {
      			printf("grandchild alloc %ldMB\n", grandchild_consume_size);
      			consume(grandchild_consume_size);
      
      			exit(0);
      		}
      	}
      
      	return 0;
      }
      
      common.c
      ========
       #include <stdio.h>
       #include <stdlib.h>
       #include <string.h>
       #include <sys/types.h>
       #include <sys/time.h>
       #include <sys/resource.h>
       #include <sys/types.h>
       #include <sys/wait.h>
       #include <unistd.h>
       #include <signal.h>
       #include <sys/mman.h>
      
       #include "common.h"
       #define err(str) perror(str), exit(1)
      
      void show_rusage(char *prefix)
      {
          	int err, err2;
          	struct rusage rusage_self;
          	struct rusage rusage_children;
      
          	printf("%s: ", prefix);
          	err = getrusage(RUSAGE_SELF, &rusage_self);
          	if (!err)
          		printf("self %ld ", rusage_self.ru_maxrss);
          	err2 = getrusage(RUSAGE_CHILDREN, &rusage_children);
          	if (!err2)
          		printf("children %ld ", rusage_children.ru_maxrss);
      
          	printf("\n");
      }
      
      /* Some buggy OS need this worthless CPU waste. */
      void make_pagefault(void)
      {
      	void *addr;
      	int size = getpagesize();
      	int i;
      
      	for (i=0; i<1000; i++) {
      		addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0);
      		if (addr == MAP_FAILED)
      			err("make_pagefault");
      		memset(addr, 0, size);
      		munmap(addr, size);
      	}
      }
      
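      void consume(int mega)
      {
      	size_t sz = (size_t)mega * 1024 * 1024;
      	void *ptr;

      	/* Intentionally never freed: the goal is to raise the rss hiwater mark. */
      	ptr = malloc(sz);
      	if (!ptr)
      		err("malloc");
      	memset(ptr, 0, sz);
      	make_pagefault();
      }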
      
      pid_t __fork(void)
      {
      	pid_t pid;
      
      	pid = fork();
      	make_pagefault();
      
      	return pid;
      }
      
      common.h
      ========
      void show_rusage(char *prefix);
      void make_pagefault(void);
      void consume(int mega);
      pid_t __fork(void);
      
      FreeBSD result (expected result)
      ========================================================
      allocate 100MB
      testcase1: fork inherit?
        expect: initial.self ~= child.self
      initial: self 103492 children 0
      fork child: self 103540 children 0
      
      testcase2: fork inherit? (cont.)
        expect: initial.children ~= 100MB, but child.children = 0
      initial: self 103540 children 103540
      child: self 103564 children 0
      
      testcase3: fork + malloc
        expect: child.self ~= initial.self + 50MB
      initial: self 103564 children 103564
      allocate +50MB
      fork child: self 154860 children 0
      
      testcase4: grandchild maxrss
        expect: post_wait.children ~= 300MB
      initial: self 103564 children 154860
      grandchild alloc 300MB
      post_wait: self 103564 children 308720
      
      testcase5: zombie
        expect: pre_wait ~= initial, IOW the zombie process is not accounted.
                post_wait ~= 400MB, IOW wait() collect child's max_rss.
      initial: self 103564 children 308720
      child alloc 400MB
      pre_wait: self 103564 children 308720
      post_wait: self 103564 children 411312
      
      testcase6: SIG_IGN
        expect: initial ~= after_zombie (child's 500MB alloc should be ignored).
      initial: self 103564 children 411312
      child alloc 500MB
      after_zombie: self 103624 children 411312
      
      testcase7: exec (without fork)
        expect: initial ~= exec
      initial: self 103624 children 411312
      exec: self 103624 children 411312
      
      Linux result (actual test result)
      ========================================================
      allocate 100MB
      testcase1: fork inherit?
        expect: initial.self ~= child.self
      initial: self 102848 children 0
      fork child: self 102572 children 0
      
      testcase2: fork inherit? (cont.)
        expect: initial.children ~= 100MB, but child.children = 0
      initial: self 102876 children 102644
      child: self 102572 children 0
      
      testcase3: fork + malloc
        expect: child.self ~= initial.self + 50MB
      initial: self 102876 children 102644
      allocate +50MB
      fork child: self 153804 children 0
      
      testcase4: grandchild maxrss
        expect: post_wait.children ~= 300MB
      initial: self 102876 children 153864
      grandchild alloc 300MB
      post_wait: self 102876 children 307536
      
      testcase5: zombie
        expect: pre_wait ~= initial, IOW the zombie process is not accounted.
                post_wait ~= 400MB, IOW wait() collect child's max_rss.
      initial: self 102876 children 307536
      child alloc 400MB
      pre_wait: self 102876 children 307536
      post_wait: self 102876 children 410076
      
      testcase6: SIG_IGN
        expect: initial ~= after_zombie (child's 500MB alloc should be ignored).
      initial: self 102876 children 410076
      child alloc 500MB
      after_zombie: self 102880 children 410076
      
      testcase7: exec (without fork)
        expect: initial ~= exec
      initial: self 102880 children 410076
      exec: self 102880 children 410076
      Signed-off-by: Jiri Pirko <jpirko@redhat.com>
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1f10206c
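The figures in the test output above are ru_maxrss values as returned by getrusage(), which Linux reports in kilobytes. A minimal sketch of how such a check can be made from user space follows. The helper names maxrss_kb() and run_child_alloc() are hypothetical and are not taken from the test program that produced the output above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Peak resident set size for `who`, in kilobytes on Linux; -1 on error. */
static long maxrss_kb(int who)
{
	struct rusage ru;

	assert(who == RUSAGE_SELF || who == RUSAGE_CHILDREN);
	if (getrusage(who, &ru) != 0)
		return -1;
	return ru.ru_maxrss;
}

/*
 * Fork a child that touches `mb` MiB of heap, then reap it.
 * Returns 0 if the child exited cleanly, -1 otherwise.
 */
static int run_child_alloc(size_t mb)
{
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return -1;
	if (pid == 0) {
		size_t len = mb << 20, i;
		volatile char *p = malloc(len);

		if (!p)
			_exit(1);
		/* Touch memory at 4096-byte strides so the pages count toward RSS. */
		for (i = 0; i < len; i += 4096)
			p[i] = 1;
		_exit(0);
	}
	if (waitpid(pid, &status, 0) < 0)
		return -1;
	return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

Comparing maxrss_kb(RUSAGE_CHILDREN) before and after run_child_alloc() mirrors the initial/post_wait comparison in testcase5: the child's peak is folded into the parent's children figure only once the child has been reaped.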
    • S
      fix compat_sys_utimensat() · d7d7561c
      Suzuki Poulose committed
      Compat utimensat() returns EINVAL when tv_nsec is UTIME_OMIT or
      UTIME_NOW and tv_sec is non-zero.  According to the man page, tv_sec
      should be ignored in that case.
      
      sys_utimensat() works fine in this case.
      
      Test case:
      
      #define _GNU_SOURCE
      #define _ATFILE_SOURCE
      #include <stdio.h>
      #include <fcntl.h>
      #include <unistd.h>
      #include <sys/stat.h>
      #include <stdlib.h>
      
      int main(int argc, char *argv[])
      {
      	struct timespec ts[2];
      	struct timespec *tsp;
      
      	if (argc < 2) {
      		fprintf(stderr, "Usage : %s filename\n", argv[0]);
      		exit (-1);
      	}
      
      	ts[0].tv_nsec = ts[1].tv_nsec = UTIME_NOW;
      	ts[0].tv_sec = ts[1].tv_sec = 1;
      
      	tsp = ts;
      
      	if (utimensat(AT_FDCWD, argv[1],tsp,0) == -1)
      		perror("utimensat");
      	else
      		fprintf(stdout, "utimensat success\n");
      	return 0;
      }
      mjs22lp5:~ # cc -m64 utimensat-test.c -o utimensat_test64
      mjs22lp5:~ # cc -m32 utimensat-test.c -o utimensat_test32
      mjs22lp5:~ # ./utimensat_test32 /tmp/utimensat_test
      utimensat: Invalid argument
      mjs22lp5:~ # ./utimensat_test64 /tmp/utimensat_test
      utimensat success
      mjs22lp5:~ # uname -r
      2.6.31-rc8
      
      With the patch :
      
      mjs22lp5:~ # ./utimensat_test64 /tmp/utimensat_test
      utimensat success
      mjs22lp5:~ # ./utimensat_test32 /tmp/utimensat_test
      utimensat success
      mjs22lp5:~ # uname -r
      2.6.31-rc8utimensat
      Signed-off-by: Suzuki K P <suzuki@in.ibm.com>
      Cc: Ulrich Drepper <drepper@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d7d7561c
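The rule the commit above relies on (tv_sec is ignored whenever tv_nsec is UTIME_NOW or UTIME_OMIT) can be written as a small validity check. This is a user-space sketch of the documented behaviour, not the kernel's compat code; timespec_ok() and times_ok() are hypothetical names.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/stat.h>	/* UTIME_NOW, UTIME_OMIT */
#include <time.h>

/*
 * One timestamp is acceptable if tv_nsec is a special marker (tv_sec is
 * then ignored, whatever its value) or a normal nanosecond count.
 */
static bool timespec_ok(const struct timespec *ts)
{
	assert(ts != NULL);
	if (ts->tv_nsec == UTIME_NOW || ts->tv_nsec == UTIME_OMIT)
		return true;
	return ts->tv_nsec >= 0 && ts->tv_nsec < 1000000000L;
}

/* utimensat() takes a pair: access time first, modification time second. */
static bool times_ok(const struct timespec ts[2])
{
	return timespec_ok(&ts[0]) && timespec_ok(&ts[1]);
}
```

With this check, the pair used in the test case (tv_nsec = UTIME_NOW, tv_sec = 1) is accepted, which is the result the 64-bit system call already gave.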
    • C
      qnx4: remove write support · 945ffe54
      Christoph Hellwig committed
      qnx4 write support has never been fully implemented, has been broken
      since the dawn of time, and hasn't been actively developed since before
      git history started.
      
      Instead of letting it bitrot further and complicate API transitions (like
      the new truncate code), remove it.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Cc: Anders Larsen <al@alarsen.net>
      Cc: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      945ffe54
    • C
      ntfs: remove ntfs_file_write · 8a9f47dd
      Christoph Hellwig committed
      do_sync_write() does the right thing for turning the aio_writev method
      into a normal non-vectored synchronous write, so there is no need to
      duplicate it in ntfs.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Acked-by: Anton Altaparmakov <aia21@cantab.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8a9f47dd
    • D
      anonfd: split interface into file creation and install · 562787a5
      Davide Libenzi committed
      Split the anonfd interface into a bare file pointer creation one, and a
      file pointer creation plus install one.
      
      There are cases, like the usage of eventfds inside other kernel
      interfaces, where the file pointer created by anonfd needs to be used
      inside the initialization of other structures.
      
      As it is right now, as soon as anon_inode_getfd() returns, the kernel can
      race with userspace closing the newly installed file descriptor.
      
      This patch, while keeping the old anon_inode_getfd(), introduces a new
      anon_inode_getfile() (whose services are reused in anon_inode_getfd())
      that allows splitting the file creation phase from the fd install one.
      
      Once all the kernel structures are initialized, the code can call the
      proper fd_install().
      
      Gregory reported the need for something like this inside KVM.
      Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: James Morris <jmorris@namei.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Gregory Haskins <ghaskins@novell.com>
      Acked-by: Serge Hallyn <serue@us.ibm.com>
      Acked-by: Roland Dreier <rolandd@cisco.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      562787a5
    • H
      aio.c: move EXPORT* macros to line after function · 385773e0
      H Hartley Sweeten committed
      As mentioned in Documentation/CodingStyle, move EXPORT* macros
      to the line immediately after the closing function brace line.
      
      Also, move the __initcall() similarly.
      Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
      Cc: Zach Brown <zach.brown@oracle.com>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      385773e0
    • H
      fs/buffer.c: clean up EXPORT* macros · 1fe72eaa
      H Hartley Sweeten committed
      According to Documentation/CodingStyle the EXPORT* macro should follow
      immediately after the closing function brace line.
      
      Also, mark_buffer_async_write_endio() and do_thaw_all() are not used
      elsewhere so they should be marked as static.
      
      In addition, file_fsync() is actually in fs/sync.c so move the EXPORT* to
      that file.
      Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1fe72eaa
    • N
      fs: turn iprune_mutex into rwsem · 88e0fbc4
      Nick Piggin committed
      We have had a report of bad memory allocation latency during DVD-RAM (UDF)
      writing.  This is causing the user's desktop session to become unusable.
      
      Jan tracked the cause of this down to UDF inode reclaim blocking:
      
      gnome-screens D ffff810006d1d598     0 20686      1
       ffff810006d1d508 0000000000000082 ffff810037db6718 0000000000000800
       ffff810006d1d488 ffffffff807e4280 ffffffff807e4280 ffff810006d1a580
       ffff8100bccbc140 ffff810006d1a8c0 0000000006d1d4e8 ffff810006d1a8c0
      Call Trace:
       [<ffffffff804477f3>] io_schedule+0x63/0xa5
       [<ffffffff802c2587>] sync_buffer+0x3b/0x3f
       [<ffffffff80447d2a>] __wait_on_bit+0x47/0x79
       [<ffffffff80447dc6>] out_of_line_wait_on_bit+0x6a/0x77
       [<ffffffff802c24f6>] __wait_on_buffer+0x1f/0x21
       [<ffffffff802c442a>] __bread+0x70/0x86
       [<ffffffff88de9ec7>] :udf:udf_tread+0x38/0x3a
       [<ffffffff88de0fcf>] :udf:udf_update_inode+0x4d/0x68c
       [<ffffffff88de26e1>] :udf:udf_write_inode+0x1d/0x2b
       [<ffffffff802bcf85>] __writeback_single_inode+0x1c0/0x394
       [<ffffffff802bd205>] write_inode_now+0x7d/0xc4
       [<ffffffff88de2e76>] :udf:udf_clear_inode+0x3d/0x53
       [<ffffffff802b39ae>] clear_inode+0xc2/0x11b
       [<ffffffff802b3ab1>] dispose_list+0x5b/0x102
       [<ffffffff802b3d35>] shrink_icache_memory+0x1dd/0x213
       [<ffffffff8027ede3>] shrink_slab+0xe3/0x158
       [<ffffffff8027fbab>] try_to_free_pages+0x177/0x232
       [<ffffffff8027a578>] __alloc_pages+0x1fa/0x392
       [<ffffffff802951fa>] alloc_page_vma+0x176/0x189
       [<ffffffff802822d8>] __do_fault+0x10c/0x417
       [<ffffffff80284232>] handle_mm_fault+0x466/0x940
       [<ffffffff8044b922>] do_page_fault+0x676/0xabf
      
      This blocks with iprune_mutex held, which then blocks other reclaimers:
      
      X             D ffff81009d47c400     0 17285  14831
       ffff8100844f3728 0000000000000086 0000000000000000 ffff81000000e288
       ffff81000000da00 ffffffff807e4280 ffffffff807e4280 ffff81009d47c400
       ffffffff805ff890 ffff81009d47c740 00000000844f3808 ffff81009d47c740
      Call Trace:
       [<ffffffff80447f8c>] __mutex_lock_slowpath+0x72/0xa9
       [<ffffffff80447e1a>] mutex_lock+0x1e/0x22
       [<ffffffff802b3ba1>] shrink_icache_memory+0x49/0x213
       [<ffffffff8027ede3>] shrink_slab+0xe3/0x158
       [<ffffffff8027fbab>] try_to_free_pages+0x177/0x232
       [<ffffffff8027a578>] __alloc_pages+0x1fa/0x392
       [<ffffffff8029507f>] alloc_pages_current+0xd1/0xd6
       [<ffffffff80279ac0>] __get_free_pages+0xe/0x4d
       [<ffffffff802ae1b7>] __pollwait+0x5e/0xdf
       [<ffffffff8860f2b4>] :nvidia:nv_kern_poll+0x2e/0x73
       [<ffffffff802ad949>] do_select+0x308/0x506
       [<ffffffff802adced>] core_sys_select+0x1a6/0x254
       [<ffffffff802ae0b7>] sys_select+0xb5/0x157
      
      Now I think the main problem is having the filesystem block (and do IO) in
      inode reclaim.  The problem is that this doesn't get accounted well and
      penalizes a random allocator with a big latency spike caused by work
      generated from elsewhere.
      
      I think the best idea would be to avoid this.  By design if possible, or
      by deferring the hard work to an asynchronous context.  If the latter,
      then the fs would probably want to throttle creation of new work with
      queue size of the deferred work, but let's not get into those details.
      
      Anyway, the other obvious thing we looked at is the iprune_mutex which is
      causing the cascading blocking.  We could turn this into an rwsem to
      improve concurrency.  It is unreasonable to totally ban all potentially
      slow or blocking operations in inode reclaim, so I think this is a cheap
      way to get a small improvement.
      
      This doesn't solve the whole problem of course.  The process doing inode
      reclaim will still take the latency hit, and concurrent processes may end
      up contending on filesystem locks.  So fs developers should keep these
      problems in mind.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Cc: Jan Kara <jack@ucw.cz>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      88e0fbc4
    • J
      seq_file: constify seq_operations · 88e9d34c
      James Morris committed
      Make all seq_operations structs const, to help mitigate against
      revectoring user-triggerable function pointers.
      
      This is derived from the grsecurity patch, although generated from scratch
      because it's simpler than extracting the changes from there.
      Signed-off-by: James Morris <jmorris@namei.org>
      Acked-by: Serge Hallyn <serue@us.ibm.com>
      Acked-by: Casey Schaufler <casey@schaufler-ca.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      88e9d34c
    • N
      Move magic numbers into magic.h · 1fd7317d
      Nick Black committed
      Move various magic-number definitions into magic.h.
      Signed-off-by: Nick Black <dank@qemfd.net>
      Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Casey Schaufler <casey@schaufler-ca.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1fd7317d
    • G
      poll/select: avoid arithmetic overflow in __estimate_accuracy() · 5ae87e79
      Guillaume Knispel committed
      __estimate_accuracy() was prone to integer overflow, for example if *tv ==
      {2147, 483648000} on a 32-bit computer (or even for delays as small as
      {429, 500000000} if the task is niced).
      
      Because the result was already forced between 0 and 100ms, the effect of
      the overflow was not too problematic, but the use of the hrtimer range
      feature was not optimal in overflow cases.
      
      This patch ensures that there cannot be an integer overflow in this
      function.
      Signed-off-by: Guillaume Knispel <gknispel@proformatique.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Arjan van de Ven <arjan@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5ae87e79
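The two overflow examples in the commit above can be checked by hand. The sketch below assumes the slack is computed as tv_nsec / divfactor + tv_sec * (NSEC_PER_SEC / divfactor), with divfactor 1000 normally and 200 for a niced task. That formula and both divisors are assumptions made for illustration, not quoted from the patch; only the 100ms cap comes from the commit message. The guard shown is one way to avoid the overflow and is not necessarily the one the patch uses.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NSEC_PER_SEC INT64_C(1000000000)
#define SKETCH_MAX_SLACK    INT32_C(100000000)	/* 100 ms in nanoseconds */

/* The assumed formula, evaluated in 64 bits so the true value is visible. */
static int64_t slack_exact(int64_t sec, int64_t nsec, int divfactor)
{
	assert(divfactor > 0);
	return nsec / divfactor + sec * (SKETCH_NSEC_PER_SEC / divfactor);
}

/* Would that value overflow a 32-bit signed long? */
static int overflows_32bit(int64_t sec, int64_t nsec, int divfactor)
{
	return slack_exact(sec, nsec, divfactor) > INT32_MAX;
}

/* Guarded variant: compare seconds against the cap before multiplying,
 * so every intermediate value stays within 32 bits. */
static int32_t slack_guarded(int32_t sec, int32_t nsec, int divfactor)
{
	int32_t per_sec, slack;

	assert(divfactor > 0);
	per_sec = (int32_t)(SKETCH_NSEC_PER_SEC / divfactor);
	if (sec < 0)
		return 0;
	if (sec > SKETCH_MAX_SLACK / per_sec)
		return SKETCH_MAX_SLACK;
	slack = nsec / divfactor + sec * per_sec;
	return slack > SKETCH_MAX_SLACK ? SKETCH_MAX_SLACK : slack;
}
```

Under these assumptions {2147, 483648000} with divfactor 1000 gives exactly 2^31, one past the largest 32-bit signed value, and {429, 500000000} with divfactor 200 gives 2147500000, which also exceeds it.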
    • R
      smbfs: read buffer overflow · ca976c53
      Roel Kluin committed
      This function uses signed integers for unix_date and its local variables.
      If a negative number is supplied and the leap-year condition is not met,
      month will be 0, leading to a read of day_n[-1].
      Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ca976c53
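The out-of-bounds read described in the commit above follows from a common lookup pattern: scan a cumulative day table for the first entry greater than the day number, then index the previous entry. The sketch below assumes that shape and a non-leap-year table. The function names and the explicit guard are illustrative and are not taken from the smbfs source or from the patch.

```c
#include <assert.h>

/* Days elapsed before the start of each month in a non-leap year. */
static const int day_n[12] = {
	0, 31, 59, 90, 120, 151, 181, 212, 243, 273, 304, 334
};

/* Index of the first table entry greater than nl_day.
 * A negative nl_day matches day_n[0] immediately, so the result is 0. */
static int month_for_day(int nl_day)
{
	int month;

	for (month = 0; month < 12; month++)
		if (day_n[month] > nl_day)
			break;
	return month;
}

/* 1-based day of the month for a zero-based day of the year below 365,
 * or -1 where the unguarded code would read day_n[-1]. */
static int day_of_month(int nl_day)
{
	int month = month_for_day(nl_day);

	if (month == 0)		/* negative input: day_n[month - 1] is out of bounds */
		return -1;
	assert(month >= 1 && month <= 12);
	return nl_day - day_n[month - 1] + 1;
}
```

For any non-negative day the loop returns at least 1, so day_n[month - 1] is a valid element. Only a negative day yields month 0, which is the case the commit message describes.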
    • T
      ocfs2: Use buffer IO if we are appending a file. · b80474b4
      Tao Ma committed
      In ocfs2_file_aio_write, we prevent direct I/O if we find that we
      are appending (changing i_size) and call
      generic_file_aio_write_nolock. But the O_DIRECT flag is still
      set, so this function eventually calls generic_file_direct_write,
      which updates i_size and leaves di->i_size alone. The bug is
      http://oss.oracle.com/bugzilla/show_bug.cgi?id=1173.
      
      So this patch lets ocfs2_direct_IO return 0 directly if we
      are appending, so that the buffered write is called and
      di->i_size gets updated successfully. This is also
      what we want in ocfs2_file_aio_write.
      Signed-off-by: Tao Ma <tao.ma@oracle.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
      b80474b4
    • W
      ocfs2: add spinlock protection when dealing with lockres->purge. · 83e32d90
      Wengang Wang committed
      When we check or modify lockres->purge, we should hold lockres->spinlock.
      In dlm_purge_lockres(), the check and modification are done without that protection.
      This patch fixes it.
      Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
      83e32d90
    • C
      dlmglue.c: add missed mlog lines · d92bc512
      Coly Li committed
      This patch adds the missing mlog_exit() and mlog_exit_void() calls where
      routines return.
      Signed-off-by: Coly Li <coly.li@suse.de>
      Acked-by: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
      d92bc512
    • S
      ocfs2: __ocfs2_abort() should not enable panic for local mounts · a2f2ddbf
      Sunil Mushran committed
      In a clustered setup, we have to panic the box on journal abort. This is
      because we don't have the facility to go hard readonly. With hard ro, another
      node would detect node failure and initiate recovery.
      
      Having said that, we shouldn't force a panic if the volume is mounted locally.
      This patch defers the handling to the "errors" mount option.
      Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
      a2f2ddbf
    • T
      ocfs2: Add ioctl for reflink. · bd50873d
      Tao Ma committed
      The ioctl takes 3 parameters (old_path, new_path and
      preserve) and calls vfs_reflink. It is useful when backporting
      reflink features to old kernels.
      Signed-off-by: Tao Ma <tao.ma@oracle.com>
      bd50873d
    • T
      ocfs2: Enable refcount tree support. · 64871b8d
      Tao Ma committed
      Signed-off-by: Tao Ma <tao.ma@oracle.com>
      64871b8d