1. 02 Apr, 2014 4 commits
  2. 31 Mar, 2014 1 commit
  3. 10 Mar, 2014 1 commit
    • vfs: atomic f_pos accesses as per POSIX · 9c225f26
      Linus Torvalds committed
      Our write() system call has always been atomic in the sense that you get
      the expected thread-safe contiguous write, but we haven't actually
      guaranteed that concurrent writes are serialized wrt f_pos accesses, so
      threads (or processes) that share a file descriptor and use "write()"
      concurrently would quite likely overwrite each other's data.
      
      This violates POSIX.1-2008/SUSv4 Section XSI 2.9.7 that says:
      
       "2.9.7 Thread Interactions with Regular File Operations
      
        All of the following functions shall be atomic with respect to each
        other in the effects specified in POSIX.1-2008 when they operate on
        regular files or symbolic links: [...]"
      
      and one of the effects is the file position update.
      
      This unprotected file position behavior is not new behavior, and nobody
      has ever cared.  Until now.  Yongzhi Pan reported unexpected behavior to
      Michael Kerrisk that was due to this.
      
      This resolves the issue with a f_pos-specific lock that is taken by
      read/write/lseek on file descriptors that may be shared across threads
      or processes.
      Reported-by: Yongzhi Pan <panyongzhi@gmail.com>
      Reported-by: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      9c225f26
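The scenario in this commit message can be exercised from userspace. The following is a minimal sketch, not code from the commit: the helper name `shared_fd_write_size`, the 16-byte record, and the scratch path passed by the caller are all assumptions made for illustration. A parent and a forked child share one open file description and both call plain write() on it. On a kernel that serializes f_pos as described above, one would expect no record to be overwritten, so the final size should equal the total number of bytes written.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not from the commit): parent and child inherit the
 * same open file description across fork(), so they share one f_pos.
 * Both write records_per_writer records with plain write().
 * Returns the final file size, or -1 on any error. */
static long shared_fd_write_size(const char *path, int records_per_writer)
{
    static const char record[] = "0123456789abcdef"; /* 16 payload bytes */
    const size_t len = sizeof(record) - 1;
    int failed = 0;

    int fd = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0600);
    if (fd < 0)
        return -1;

    pid_t pid = fork(); /* the child shares fd's file position */
    if (pid < 0) {
        close(fd);
        unlink(path);
        return -1;
    }

    /* Both processes run this loop concurrently on the shared descriptor. */
    for (int i = 0; i < records_per_writer && !failed; i++) {
        if (write(fd, record, len) != (ssize_t)len)
            failed = 1;
    }

    if (pid == 0)
        _exit(failed); /* child: report status, skip atexit handlers */

    int status = 0;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status) ||
        WEXITSTATUS(status) != 0)
        failed = 1;

    struct stat st;
    if (fstat(fd, &st) != 0)
        failed = 1;
    close(fd);
    unlink(path);
    return failed ? -1 : (long)st.st_size;
}
```

Calling `shared_fd_write_size("/tmp/demo.tmp", 2000)` and comparing the result with `2 * 2000 * 16` gives a quick check. A smaller size on an older kernel would indicate that the two writers started from the same offset at least once.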
  4. 09 Nov, 2013 1 commit
  5. 25 Oct, 2013 1 commit
  6. 20 Oct, 2013 1 commit
    • nfsd regression since delayed fput() · c7314d74
      Al Viro committed
      Background: nfsd v[23] had throughput regression since delayed fput
      went in; every read or write ends up doing fput() and we get a pair
      of extra context switches out of that (plus quite a bit of work
      in queue_work itself, apparently).  Use of schedule_delayed_work()
      gives it a chance to accumulate a bit before we do __fput() on all
      of them.  I'm not too happy about that solution, but... on at least
      one real-world setup it reverts about 10% throughput loss we got from
      switch to delayed fput.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      c7314d74
  7. 12 Sep, 2013 1 commit
  8. 04 Sep, 2013 1 commit
  9. 13 Jul, 2013 2 commits
  10. 29 Jun, 2013 1 commit
  11. 15 Jun, 2013 1 commit
    • fput: task_work_add() can fail if the caller has passed exit_task_work() · e7b2c406
      Oleg Nesterov committed
      fput() assumes that it can't be called after exit_task_work() but
      this is not true, for example free_ipc_ns()->shm_destroy() can do
      this. In this case fput() silently leaks the file.
      
      Change it to fall back to delayed_fput_work if task_work_add() fails.
      The patch looks complicated but it is not, it changes the code from
      
      	if (PF_KTHREAD) {
      		schedule_work(...);
      		return;
      	}
      	task_work_add(...)
      
      to
      	if (!PF_KTHREAD) {
      		if (!task_work_add(...))
      			return;
      		/* fallback */
      	}
      	schedule_work(...);
      
      As for shm_destroy() in particular, we could make another fix but I
      think this change makes sense anyway. There could be another similar
      user, it is not safe to assume that task_work_add() can't fail.
      Reported-by: Andrey Vagin <avagin@openvz.org>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      e7b2c406
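The control-flow change in the pseudocode above can be modelled in plain C. This is a userspace sketch only: the enum, the function name `fput_model`, and the boolean parameters are hypothetical stand-ins. In the kernel snippet, `!task_work_add(...)` being true means the call succeeded; here that outcome is passed in as `task_work_add_ok`.

```c
#include <stdbool.h>

/* Which deferral mechanism ends up running __fput() in this model. */
enum fput_path {
    FPUT_VIA_TASK_WORK,    /* runs before the task returns to userland */
    FPUT_VIA_DELAYED_WORK  /* runs later from the shared work queue */
};

/* Hypothetical model of the restructured branch. is_kthread stands in for
 * the PF_KTHREAD check, task_work_add_ok for a successful task_work_add(). */
static enum fput_path fput_model(bool is_kthread, bool task_work_add_ok)
{
    if (!is_kthread) {
        if (task_work_add_ok)
            return FPUT_VIA_TASK_WORK;
        /* fallback: the task is already past exit_task_work() */
    }
    return FPUT_VIA_DELAYED_WORK;
}
```

The point of the restructuring is visible in the third case: a non-kthread caller whose task_work_add() fails now reaches the work-queue path instead of leaking the file.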
  12. 02 Mar, 2013 1 commit
  13. 23 Feb, 2013 3 commits
  14. 21 Dec, 2012 1 commit
    • fs: Fix imbalance in freeze protection in mark_files_ro() · 72651cac
      Jan Kara committed
      File descriptors (even those for writing) do not hold freeze protection.
      Thus mark_files_ro() must call __mnt_drop_write() to only drop protection
      against remount read-only. Calling mnt_drop_write_file() as we do now
      results in:
      
      [ BUG: bad unlock balance detected! ]
      3.7.0-rc6-00028-g88e75b6 #101 Not tainted
      -------------------------------------
      kworker/1:2/79 is trying to release lock (sb_writers) at:
      [<ffffffff811b33b4>] mnt_drop_write+0x24/0x30
      but there are no more locks to release!
      Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
      CC: stable@vger.kernel.org
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      72651cac
  15. 10 Oct, 2012 1 commit
  16. 27 Sep, 2012 1 commit
  17. 08 Sep, 2012 1 commit
  18. 31 Jul, 2012 1 commit
  19. 30 Jul, 2012 1 commit
  20. 23 Jul, 2012 1 commit
    • switch fput to task_work_add · 4a9d4b02
      Al Viro committed
      ... and schedule_work() for interrupt/kernel_thread callers
      (and yes, now it *is* OK to call from interrupt).
      
      We are guaranteed that __fput() will be done before we return
      to userland (or exit).  Note that for fput() from a kernel
      thread we get an async behaviour; it's almost always OK, but
      sometimes you might need to have __fput() completed before
      you do anything else.  There are two mechanisms for that -
      a general barrier (flush_delayed_fput()) and explicit
      __fput_sync().  Both should be used with care (as was the
      case for fput() from kernel threads all along).  See comments
      in fs/file_table.c for details.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      4a9d4b02
  21. 14 Jul, 2012 1 commit
  22. 30 May, 2012 2 commits
    • brlocks/lglocks: API cleanups · 962830df
      Andi Kleen committed
      lglocks and brlocks are currently generated with some complicated macros
      in lglock.h.  But there's no reason to not just use common utility
      functions and put all the data into a common data structure.
      
      In preparation, this patch changes the API to look more like normal
      function calls with pointers, not magic macros.
      
      The patch is rather large because I move over all users in one go to keep
      it bisectable.  This impacts the VFS somewhat in terms of lines changed.
      But no actual behaviour change.
      
      [akpm@linux-foundation.org: checkpatch fixes]
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      962830df
    • brlocks/lglocks: turn into functions · eea62f83
      Andi Kleen committed
      lglocks and brlocks are currently generated with some complicated macros
      in lglock.h.  But there's no reason to not just use common utility
      functions and put all the data into a common data structure.
      
      Since there are at least two users it makes sense to share this code in a
      library.  This is also easier to maintain than a macro forest.
      
      This will also make it later possible to dynamically allocate lglocks and
      also use them in modules (this would both still need some additional, but
      now straightforward, code)
      
      [akpm@linux-foundation.org: checkpatch fixes]
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      eea62f83
  23. 21 Mar, 2012 1 commit
  24. 07 Jan, 2012 1 commit
  25. 27 Jul, 2011 1 commit
  26. 17 Mar, 2011 1 commit
  27. 15 Mar, 2011 2 commits
    • Allow passing O_PATH descriptors via SCM_RIGHTS datagrams · 326be7b4
      Al Viro committed
      Just need to make sure that AF_UNIX garbage collector won't
      confuse O_PATHed socket on filesystem for real AF_UNIX opened
      socket.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      326be7b4
    • New kind of open files - "location only". · 1abf0c71
      Al Viro committed
      New flag for open(2) - O_PATH.  Semantics:
      	* pathname is resolved, but the file itself is _NOT_ opened
      as far as filesystem is concerned.
      	* almost all operations on the resulting descriptors shall
      fail with -EBADF.  Exceptions are:
      	1) operations on descriptors themselves (i.e.
      		close(), dup(), dup2(), dup3(), fcntl(fd, F_DUPFD),
      		fcntl(fd, F_DUPFD_CLOEXEC, ...), fcntl(fd, F_GETFD),
      		fcntl(fd, F_SETFD, ...))
      	2) fcntl(fd, F_GETFL), for a common non-destructive way to
      		check if descriptor is open
      	3) "dfd" arguments of ...at(2) syscalls, i.e. the starting
      		points of pathname resolution
      	* closing such descriptor does *NOT* affect dnotify or
      posix locks.
      	* permissions are checked as usual along the way to file;
      no permission checks are applied to the file itself.  Of course,
      giving such thing to syscall will result in permission checks (at
      the moment it means checking that starting point of ....at() is
      a directory and caller has exec permissions on it).
      
      fget() and fget_light() return NULL on such descriptors; use of
      fget_raw() and fget_raw_light() is needed to get them.  That protects
      existing code from dealing with those things.
      
      There are two things still missing (they come in the next commits):
      one is handling of symlinks (right now we refuse to open them that
      way; see the next commit for semantics related to those) and another
      is descriptor passing via SCM_RIGHTS datagrams.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      1abf0c71
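The semantics listed in this commit message can be probed from userspace. The sketch below is illustrative only: the function name `check_o_path_semantics` and its step numbering are hypothetical, and the fallback value for `O_PATH` is an assumption that matches x86 and most other architectures but not all of them. It checks three of the listed points: read() fails with EBADF, fcntl(fd, F_GETFL) succeeds, and the descriptor works as the "dfd" argument of openat().

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* exposes O_PATH in <fcntl.h> on glibc */
#endif
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

#ifndef O_PATH
#define O_PATH 010000000 /* assumption: generic Linux value, arch-dependent */
#endif

/* Hypothetical checker: returns 0 if an O_PATH descriptor on directory
 * `dir` behaves as described, or the number of the first failing step. */
static int check_o_path_semantics(const char *dir)
{
    int fd = open(dir, O_PATH | O_DIRECTORY);
    if (fd < 0)
        return 1;

    /* Ordinary I/O on the descriptor is expected to fail with EBADF. */
    char c;
    errno = 0;
    if (read(fd, &c, 1) != -1 || errno != EBADF) {
        close(fd);
        return 2;
    }

    /* F_GETFL is one of the permitted descriptor-level operations. */
    if (fcntl(fd, F_GETFL) == -1) {
        close(fd);
        return 3;
    }

    /* The descriptor can serve as the starting point of a lookup. */
    int sub = openat(fd, ".", O_RDONLY | O_DIRECTORY);
    if (sub < 0) {
        close(fd);
        return 4;
    }
    close(sub);
    close(fd);
    return 0;
}
```

A call such as `check_o_path_semantics("/")` should return 0 on a kernel that supports O_PATH; a nonzero result names the step whose behaviour differed.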
  28. 10 Feb, 2011 1 commit
  29. 05 Feb, 2011 1 commit
  30. 17 Jan, 2011 1 commit
    • fs: Remove unlikely() from fget_light() · 3bc0ba43
      Steven Rostedt committed
      There's an unlikely() in fget_light() that assumes the file ref count
      will be 1. Running the annotate branch profiler on a desktop that is
      performing daily tasks (running firefox, evolution, xchat and is also part
      of a distcc farm), it shows that the ref count is not 1 that often.
      
          correct   incorrect   %  Function    File          Line
       ----------  ----------  --  ----------  ------------  ----
       1035099358  6209599193  85  fget_light  file_table.c   315
      
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      3bc0ba43
  31. 27 Oct, 2010 1 commit
    • fs: allow for more than 2^31 files · 518de9b3
      Eric Dumazet committed
      Robin Holt tried to boot a 16TB system and found af_unix was overflowing
      a 32bit value :
      
      <quote>
      
      We were seeing a failure which prevented boot.  The kernel was incapable
      of creating either a named pipe or unix domain socket.  This comes down
      to a common kernel function called unix_create1() which does:
      
              atomic_inc(&unix_nr_socks);
              if (atomic_read(&unix_nr_socks) > 2 * get_max_files())
                      goto out;
      
      The function get_max_files() is a simple return of files_stat.max_files.
      files_stat.max_files is a signed integer and is computed in
      fs/file_table.c's files_init().
      
              n = (mempages * (PAGE_SIZE / 1024)) / 10;
              files_stat.max_files = n;
      
      In our case, mempages (total_ram_pages) is approx 3,758,096,384
      (0xe0000000).  That leaves max_files at approximately 1,503,238,553.
      This causes 2 * get_max_files() to integer overflow.
      
      </quote>
      
      Fix is to let /proc/sys/fs/file-nr & /proc/sys/fs/file-max use long
      integers, and change af_unix to use an atomic_long_t instead of atomic_t.
      
      get_max_files() is changed to return an unsigned long.  get_nr_files() is
      changed to return a long.
      
      unix_nr_socks is changed from atomic_t to atomic_long_t, while not
      strictly needed to address Robin's problem.
      
      Before patch (on a 64bit kernel) :
      # echo 2147483648 >/proc/sys/fs/file-max
      # cat /proc/sys/fs/file-max
      -18446744071562067968
      
      After patch:
      # echo 2147483648 >/proc/sys/fs/file-max
      # cat /proc/sys/fs/file-max
      2147483648
      # cat /proc/sys/fs/file-nr
      704     0       2147483648
      Reported-by: Robin Holt <holt@sgi.com>
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Acked-by: David Miller <davem@davemloft.net>
      Reviewed-by: Robin Holt <holt@sgi.com>
      Tested-by: Robin Holt <holt@sgi.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      518de9b3
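The arithmetic quoted in this commit message can be reproduced in a few lines of C. This is a worked sketch, not kernel code: the function names are hypothetical, and a 4096-byte page size is an assumption (it is the value that reproduces the quoted max_files figure). The 32-bit result is computed with explicit modulo-2^32 arithmetic so the sketch itself does not rely on signed overflow.

```c
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096ULL /* assumption: 4 KiB pages */

/* Same formula as the quoted files_init() snippet, in 64-bit arithmetic. */
static uint64_t model_max_files(uint64_t mempages)
{
    return (mempages * (MODEL_PAGE_SIZE / 1024)) / 10;
}

/* 2 * max_files as a 32-bit signed int would represent it after wrapping
 * modulo 2^32 (two's complement). */
static int32_t doubled_as_int32(uint64_t max_files)
{
    uint32_t wrapped = (uint32_t)(2 * max_files);
    if (wrapped >= 0x80000000u)
        return (int32_t)((int64_t)wrapped - 4294967296LL);
    return (int32_t)wrapped;
}

/* The same product held in a 64-bit signed value, as after the fix. */
static int64_t doubled_as_long(uint64_t max_files)
{
    return (int64_t)(2 * max_files);
}
```

With mempages = 0xe0000000, `model_max_files()` gives 1,503,238,553, matching the quoted report. Doubling that exceeds 2^31 - 1, so the 32-bit view goes negative and any positive socket count compares greater than the limit, which is why unix_create1() bailed out. The 64-bit view stays positive.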
  32. 26 Oct, 2010 1 commit
    • fs: allow for more than 2^31 files · 7e360c38
      Eric Dumazet committed
      Andrew,
      
      Could you please review this patch, you probably are the right guy to
      take it, because it crosses fs and net trees.
      
      Note: /proc/sys/fs/file-nr is a read-only file, so this patch doesn't
      depend on previous patch (sysctl: fix min/max handling in
      __do_proc_doulongvec_minmax())
      
      Thanks !
      
      [PATCH V4] fs: allow for more than 2^31 files
      
      Robin Holt tried to boot a 16TB system and found af_unix was overflowing
      a 32bit value :
      
      <quote>
      
      We were seeing a failure which prevented boot.  The kernel was incapable
      of creating either a named pipe or unix domain socket.  This comes down
      to a common kernel function called unix_create1() which does:
      
              atomic_inc(&unix_nr_socks);
              if (atomic_read(&unix_nr_socks) > 2 * get_max_files())
                      goto out;
      
      The function get_max_files() is a simple return of files_stat.max_files.
      files_stat.max_files is a signed integer and is computed in
      fs/file_table.c's files_init().
      
              n = (mempages * (PAGE_SIZE / 1024)) / 10;
              files_stat.max_files = n;
      
      In our case, mempages (total_ram_pages) is approx 3,758,096,384
      (0xe0000000).  That leaves max_files at approximately 1,503,238,553.
      This causes 2 * get_max_files() to integer overflow.
      
      </quote>
      
      Fix is to let /proc/sys/fs/file-nr & /proc/sys/fs/file-max use long
      integers, and change af_unix to use an atomic_long_t instead of
      atomic_t.
      
      get_max_files() is changed to return an unsigned long.
      get_nr_files() is changed to return a long.
      
      unix_nr_socks is changed from atomic_t to atomic_long_t, while not
      strictly needed to address Robin's problem.
      
      Before patch (on a 64bit kernel) :
      # echo 2147483648 >/proc/sys/fs/file-max
      # cat /proc/sys/fs/file-max
      -18446744071562067968
      
      After patch:
      # echo 2147483648 >/proc/sys/fs/file-max
      # cat /proc/sys/fs/file-max
      2147483648
      # cat /proc/sys/fs/file-nr
      704     0       2147483648
      Reported-by: Robin Holt <holt@sgi.com>
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Acked-by: David Miller <davem@davemloft.net>
      Reviewed-by: Robin Holt <holt@sgi.com>
      Tested-by: Robin Holt <holt@sgi.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      7e360c38