1. 15 Jun 2013 (1 commit)
    • fput: task_work_add() can fail if the caller has passed exit_task_work() · e7b2c406
      Oleg Nesterov authored
      fput() assumes that it can't be called after exit_task_work() but
      this is not true, for example free_ipc_ns()->shm_destroy() can do
      this. In this case fput() silently leaks the file.
      
      Change it to fallback to delayed_fput_work if task_work_add() fails.
      The patch looks complicated, but it is not; it changes the code from
      
      	if (PF_KTHREAD) {
      		schedule_work(...);
      		return;
      	}
      	task_work_add(...)
      
      to
      	if (!PF_KTHREAD) {
      		if (!task_work_add(...))
      			return;
      		/* fallback */
      	}
      	schedule_work(...);
      
      As for shm_destroy() in particular, we could make a separate fix, but I
      think this change makes sense anyway. There could be other similar
      users; it is not safe to assume that task_work_add() can't fail.
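      A fleshed-out sketch of the pseudocode above, roughly as the resulting fput()
      tail looks; the helper and field names (____fput, fu_rcuhead, delayed_fput_work)
      are assumptions based on fs/file_table.c of that era, not a verbatim copy of
      the patch:
      
      	void fput(struct file *file)
      	{
      		if (atomic_long_dec_and_test(&file->f_count)) {
      			struct task_struct *task = current;
      
      			if (likely(!(task->flags & PF_KTHREAD))) {
      				init_task_work(&file->f_u.fu_rcuhead, ____fput);
      				if (!task_work_add(task, &file->f_u.fu_rcuhead, true))
      					return;	/* __fput() runs before return to userland */
      				/* task_work_add() failed: caller is past exit_task_work() */
      			}
      			/* kernel threads and the fallback path: queue the file for
      			 * delayed_fput() and kick the worker (list handling omitted) */
      			schedule_work(&delayed_fput_work);
      		}
      	}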
      Reported-by: Andrey Vagin <avagin@openvz.org>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  2. 02 Mar 2013 (1 commit)
  3. 23 Feb 2013 (3 commits)
  4. 21 Dec 2012 (1 commit)
    • fs: Fix imbalance in freeze protection in mark_files_ro() · 72651cac
      Jan Kara authored
      File descriptors (even those for writing) do not hold freeze protection.
      Thus mark_files_ro() must call __mnt_drop_write() to only drop protection
      against remount read-only. Calling mnt_drop_write_file() as we do now
      results in:
      
      [ BUG: bad unlock balance detected! ]
      3.7.0-rc6-00028-g88e75b6 #101 Not tainted
      -------------------------------------
      kworker/1:2/79 is trying to release lock (sb_writers) at:
      [<ffffffff811b33b4>] mnt_drop_write+0x24/0x30
      but there are no more locks to release!
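      The fix itself amounts to a one-line substitution in mark_files_ro(); roughly
      (illustrative, not the full diff):
      
      	/* before: also dropped freeze protection the fd never took */
      	mnt_drop_write_file(file);
      
      	/* after: only drop protection against remount read-only */
      	__mnt_drop_write(file->f_path.mnt);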
      Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
      CC: stable@vger.kernel.org
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  5. 10 Oct 2012 (1 commit)
  6. 27 Sep 2012 (1 commit)
  7. 08 Sep 2012 (1 commit)
  8. 31 Jul 2012 (1 commit)
  9. 30 Jul 2012 (1 commit)
  10. 23 Jul 2012 (1 commit)
    • switch fput to task_work_add · 4a9d4b02
      Al Viro authored
      ... and schedule_work() for interrupt/kernel_thread callers
      (and yes, now it *is* OK to call from interrupt).
      
      We are guaranteed that __fput() will be done before we return
      to userland (or exit).  Note that for fput() from a kernel
      thread we get an async behaviour; it's almost always OK, but
      sometimes you might need to have __fput() completed before
      you do anything else.  There are two mechanisms for that -
      a general barrier (flush_delayed_fput()) and explicit
      __fput_sync().  Both should be used with care (as was the
      case for fput() from kernel threads all along).  See comments
      in fs/file_table.c for details.
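      For callers that do need synchronous behaviour, the two mechanisms mentioned
      above look roughly like this (sketch; see fs/file_table.c and
      include/linux/file.h for the real rules on when each may be used):
      
      	/* per-file: drop our reference and run the final __fput() right here */
      	__fput_sync(file);
      
      	/* or: drop it as usual, then drain all pending delayed fputs */
      	fput(file);
      	flush_delayed_fput();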
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  11. 14 Jul 2012 (1 commit)
  12. 30 May 2012 (2 commits)
    • brlocks/lglocks: API cleanups · 962830df
      Andi Kleen authored
      lglocks and brlocks are currently generated with some complicated macros
      in lglock.h.  But there's no reason to not just use common utility
      functions and put all the data into a common data structure.
      
      In preparation, this patch changes the API to look more like normal
      function calls with pointers, not magic macros.
      
      The patch is rather large because I move over all users in one go to keep
      it bisectable.  This impacts the VFS somewhat in terms of lines changed,
      but there is no actual behaviour change.
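      For fs/file_table.c this is essentially a mechanical change at the call sites;
      roughly (illustrative sketch, using files_lglock as the example lock):
      
      	/* before: magic macros keyed on the lock's name */
      	lg_local_lock(files_lglock);
      	/* ... */
      	lg_local_unlock(files_lglock);
      
      	/* after: ordinary functions taking a struct lglock pointer */
      	lg_local_lock(&files_lglock);
      	/* ... */
      	lg_local_unlock(&files_lglock);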
      
      [akpm@linux-foundation.org: checkpatch fixes]
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • brlocks/lglocks: turn into functions · eea62f83
      Andi Kleen authored
      lglocks and brlocks are currently generated with some complicated macros
      in lglock.h.  But there's no reason to not just use common utility
      functions and put all the data into a common data structure.
      
      Since there are at least two users it makes sense to share this code in a
      library.  This is also easier to maintain than a macro forest.
      
      This will also make it possible later to dynamically allocate lglocks and
      to use them in modules (both would still need some additional, but now
      straightforward, code).
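      The shared data structure and library interface end up looking roughly like
      this (simplified sketch of include/linux/lglock.h after the change; lockdep
      fields and a few helpers omitted):
      
      	struct lglock {
      		arch_spinlock_t __percpu *lock;
      	};
      
      	#define DEFINE_LGLOCK(name)						\
      		static DEFINE_PER_CPU(arch_spinlock_t, name ## _lock)		\
      			= __ARCH_SPIN_LOCK_UNLOCKED;				\
      		struct lglock name = { .lock = &name ## _lock }
      
      	void lg_local_lock(struct lglock *lg);		/* this CPU's spinlock */
      	void lg_local_unlock(struct lglock *lg);
      	void lg_global_lock(struct lglock *lg);		/* every CPU's spinlock */
      	void lg_global_unlock(struct lglock *lg);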
      
      [akpm@linux-foundation.org: checkpatch fixes]
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  13. 21 Mar 2012 (1 commit)
  14. 07 Jan 2012 (1 commit)
  15. 27 Jul 2011 (1 commit)
  16. 17 Mar 2011 (1 commit)
  17. 15 Mar 2011 (2 commits)
    • Allow passing O_PATH descriptors via SCM_RIGHTS datagrams · 326be7b4
      Al Viro authored
      Just need to make sure that AF_UNIX garbage collector won't
      confuse O_PATHed socket on filesystem for real AF_UNIX opened
      socket.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • New kind of open files - "location only". · 1abf0c71
      Al Viro authored
      New flag for open(2) - O_PATH.  Semantics:
      	* pathname is resolved, but the file itself is _NOT_ opened
      as far as filesystem is concerned.
      	* almost all operations on the resulting descriptors shall
      fail with -EBADF.  Exceptions are:
      	1) operations on descriptors themselves (i.e.
      		close(), dup(), dup2(), dup3(), fcntl(fd, F_DUPFD),
      		fcntl(fd, F_DUPFD_CLOEXEC, ...), fcntl(fd, F_GETFD),
      		fcntl(fd, F_SETFD, ...))
      	2) fcntl(fd, F_GETFL), for a common non-destructive way to
      		check if descriptor is open
      	3) "dfd" arguments of ...at(2) syscalls, i.e. the starting
      		points of pathname resolution
      	* closing such descriptor does *NOT* affect dnotify or
      posix locks.
      	* permissions are checked as usual along the way to the file;
      no permission checks are applied to the file itself.  Of course,
      passing such a descriptor to a syscall will result in permission checks (at
      the moment that means checking that the starting point of ...at() is
      a directory and that the caller has exec permission on it).
      
      fget() and fget_light() return NULL on such descriptors; use of
      fget_raw() and fget_raw_light() is needed to get them.  That protects
      existing code from dealing with those things.
      
      There are two things still missing (they come in the next commits):
      one is handling of symlinks (right now we refuse to open them that
      way; see the next commit for semantics related to those) and another
      is descriptor passing via SCM_RIGHTS datagrams.
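      A small userspace illustration of the intended use (assumes a libc new enough
      to expose O_PATH):
      
      	#define _GNU_SOURCE
      	#include <fcntl.h>
      	#include <unistd.h>
      
      	int main(void)
      	{
      		/* location only: no read/write access to /etc is taken */
      		int dfd = open("/etc", O_PATH | O_DIRECTORY);
      		if (dfd < 0)
      			return 1;
      
      		/* read(dfd, ...) would fail with EBADF, but using dfd as the
      		 * starting point of an ...at() call is allowed: */
      		int fd = openat(dfd, "hostname", O_RDONLY);
      		if (fd >= 0)
      			close(fd);
      		close(dfd);
      		return 0;
      	}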
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  18. 10 Feb 2011 (1 commit)
  19. 05 Feb 2011 (1 commit)
  20. 17 Jan 2011 (1 commit)
    • fs: Remove unlikely() from fget_light() · 3bc0ba43
      Steven Rostedt authored
      There's an unlikely() in fget_light() that assumes the file ref count
      will be 1.  Running the annotated branch profiler on a desktop performing
      daily tasks (running firefox, evolution, xchat, and acting as part
      of a distcc farm) shows that the ref count is not 1 that often.
      
       correct incorrect      %    Function                  File              Line
       ------- ---------      -    --------                  ----              ----
      1035099358 6209599193  85    fget_light              file_table.c         315
      
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  21. 27 Oct 2010 (1 commit)
    • fs: allow for more than 2^31 files · 518de9b3
      Eric Dumazet authored
      Robin Holt tried to boot a 16TB system and found af_unix was overflowing
      a 32bit value:
      
      <quote>
      
      We were seeing a failure which prevented boot.  The kernel was incapable
      of creating either a named pipe or unix domain socket.  This comes down
      to a common kernel function called unix_create1() which does:
      
              atomic_inc(&unix_nr_socks);
              if (atomic_read(&unix_nr_socks) > 2 * get_max_files())
                      goto out;
      
      The function get_max_files() is a simple return of files_stat.max_files.
      files_stat.max_files is a signed integer and is computed in
      fs/file_table.c's files_init().
      
              n = (mempages * (PAGE_SIZE / 1024)) / 10;
              files_stat.max_files = n;
      
      In our case, mempages (total_ram_pages) is approx 3,758,096,384
      (0xe0000000).  That leaves max_files at approximately 1,503,238,553.
      This causes 2 * get_max_files() to integer overflow.
      
      </quote>
      
      Fix is to let /proc/sys/fs/file-nr & /proc/sys/fs/file-max use long
      integers, and change af_unix to use an atomic_long_t instead of atomic_t.
      
      get_max_files() is changed to return an unsigned long.  get_nr_files() is
      changed to return a long.
      
      unix_nr_socks is changed from atomic_t to atomic_long_t, although that is
      not strictly needed to address Robin's problem.
      
      Before patch (on a 64bit kernel) :
      # echo 2147483648 >/proc/sys/fs/file-max
      # cat /proc/sys/fs/file-max
      -18446744071562067968
      
      After patch:
      # echo 2147483648 >/proc/sys/fs/file-max
      # cat /proc/sys/fs/file-max
      2147483648
      # cat /proc/sys/fs/file-nr
      704     0       2147483648
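      The arithmetic failure is easy to reproduce outside the kernel; a standalone
      illustration of why a signed int cannot hold 2 * max_files on such a machine:
      
      	#include <stdio.h>
      
      	int main(void)
      	{
      		int max_files = 1503238553;	/* roughly the value computed on the 16TB box */
      		long as_long = max_files;
      
      		/* 2 * max_files exceeds INT_MAX: signed overflow (undefined in C,
      		 * wraps to a negative value in practice) */
      		printf("2 * int  : %d\n", 2 * max_files);
      		printf("2 * long : %ld\n", 2 * as_long);	/* 3006477106, as intended */
      		return 0;
      	}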
      Reported-by: Robin Holt <holt@sgi.com>
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Acked-by: David Miller <davem@davemloft.net>
      Reviewed-by: Robin Holt <holt@sgi.com>
      Tested-by: Robin Holt <holt@sgi.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  22. 26 Oct 2010 (1 commit)
    • fs: allow for more than 2^31 files · 7e360c38
      Eric Dumazet authored
      Andrew,
      
      Could you please review this patch, you probably are the right guy to
      take it, because it crosses fs and net trees.
      
      Note: /proc/sys/fs/file-nr is a read-only file, so this patch doesn't
      depend on the previous patch (sysctl: fix min/max handling in
      __do_proc_doulongvec_minmax())
      
      Thanks !
      
      [PATCH V4] fs: allow for more than 2^31 files
      
      Robin Holt tried to boot a 16TB system and found af_unix was overflowing
      a 32bit value:
      
      <quote>
      
      We were seeing a failure which prevented boot.  The kernel was incapable
      of creating either a named pipe or unix domain socket.  This comes down
      to a common kernel function called unix_create1() which does:
      
              atomic_inc(&unix_nr_socks);
              if (atomic_read(&unix_nr_socks) > 2 * get_max_files())
                      goto out;
      
      The function get_max_files() is a simple return of files_stat.max_files.
      files_stat.max_files is a signed integer and is computed in
      fs/file_table.c's files_init().
      
              n = (mempages * (PAGE_SIZE / 1024)) / 10;
              files_stat.max_files = n;
      
      In our case, mempages (total_ram_pages) is approx 3,758,096,384
      (0xe0000000).  That leaves max_files at approximately 1,503,238,553.
      This causes 2 * get_max_files() to integer overflow.
      
      </quote>
      
      Fix is to let /proc/sys/fs/file-nr & /proc/sys/fs/file-max use long
      integers, and change af_unix to use an atomic_long_t instead of
      atomic_t.
      
      get_max_files() is changed to return an unsigned long.
      get_nr_files() is changed to return a long.
      
      unix_nr_socks is changed from atomic_t to atomic_long_t, although that is
      not strictly needed to address Robin's problem.
      
      Before patch (on a 64bit kernel) :
      # echo 2147483648 >/proc/sys/fs/file-max
      # cat /proc/sys/fs/file-max
      -18446744071562067968
      
      After patch:
      # echo 2147483648 >/proc/sys/fs/file-max
      # cat /proc/sys/fs/file-max
      2147483648
      # cat /proc/sys/fs/file-nr
      704     0       2147483648
      Reported-by: Robin Holt <holt@sgi.com>
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Acked-by: David Miller <davem@davemloft.net>
      Reviewed-by: Robin Holt <holt@sgi.com>
      Tested-by: Robin Holt <holt@sgi.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  23. 18 Aug 2010 (2 commits)
    • fs: scale files_lock · 6416ccb7
      Nick Piggin authored
      fs: scale files_lock
      
      Improve scalability of files_lock by adding per-cpu, per-sb files lists,
      protected with an lglock. The lglock provides fast access to the per-cpu lists
      to add and remove files. It also provides a snapshot of all the per-cpu lists
      (although this is very slow).
      
      One difficulty with this approach is that a file can be removed from the list
      by another CPU. We must track which per-cpu list the file is on with a new
      variable in the file struct (packed into a hole on 64-bit archs). Scalability
      could suffer if files are frequently removed from a different CPU's list.
      
      However loads with frequent removal of files imply short interval between
      adding and removing the files, and the scheduler attempts to avoid moving
      processes too far away. Also, even in the case of cross-CPU removal, the
      hardware has much more opportunity to parallelise cacheline transfers with N
      cachelines than with 1.
      
      A worst-case test of 1 CPU allocating files that are subsequently freed by N CPUs
      degenerates to contending on a single lock, which is no worse than before. When
      more than one CPU is allocating files, even if they are always freed by
      different CPUs, there will be more parallelism than the single-lock case.
      
      Testing results:
      
      On a 2 socket, 8 core opteron, I measure the number of times the lock is taken
      to remove the file, the number of times it is removed by the same CPU that
      added it, and the number of times it is removed by the same node that added it.
      
      Booting:    locks=  25049 cpu-hits=  23174 (92.5%) node-hits=  23945 (95.6%)
      kbuild -j16 locks=2281913 cpu-hits=2208126 (96.8%) node-hits=2252674 (98.7%)
      dbench 64   locks=4306582 cpu-hits=4287247 (99.6%) node-hits=4299527 (99.8%)
      
      So a file is removed from the same CPU it was added by over 90% of the time.
      It remains within the same node 95% of the time.
      
      Tim Chen ran some numbers for a 64 thread Nehalem system performing a compile.
      
                      throughput
      2.6.34-rc2      24.5
      +patch          24.9
      
                      us      sys     idle    IO wait (in %)
      2.6.34-rc2      51.25   28.25   17.25   3.25
      +patch          53.75   18.5    19      8.75
      
      So significantly less CPU time spent in kernel code, higher idle time and
      slightly higher throughput.
      
      Single threaded performance difference was within the noise of microbenchmarks.
      That is not to say no penalty exists: the code is larger and more memory
      accesses are required, so it will be slightly slower.
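      A simplified sketch of the add/remove helpers this introduces (helper and field
      names follow the patch, shown here with the later pointer-style lglock calls;
      the !CONFIG_SMP fallback is omitted):
      
      	static void file_sb_list_add(struct file *file, struct super_block *sb)
      	{
      		lg_local_lock(&files_lglock);		/* this CPU's spinlock only */
      		file->f_sb_list_cpu = smp_processor_id();
      		list_add(&file->f_u.fu_list,
      			 per_cpu_ptr(sb->s_files, smp_processor_id()));
      		lg_local_unlock(&files_lglock);
      	}
      
      	static void file_sb_list_del(struct file *file)
      	{
      		if (!list_empty(&file->f_u.fu_list)) {
      			/* may run on another CPU: lock the list the file was added on */
      			lg_local_lock_cpu(&files_lglock, file->f_sb_list_cpu);
      			list_del_init(&file->f_u.fu_list);
      			lg_local_unlock_cpu(&files_lglock, file->f_sb_list_cpu);
      		}
      	}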
      
      Cc: linux-kernel@vger.kernel.org
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Nick Piggin <npiggin@kernel.dk>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • fs: cleanup files_lock locking · ee2ffa0d
      Nick Piggin authored
      fs: cleanup files_lock locking
      
      Lock tty_files with a new spinlock, tty_files_lock; provide helpers to
      manipulate the per-sb files list; unexport the files_lock spinlock.
      
      Cc: linux-kernel@vger.kernel.org
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Acked-by: Andi Kleen <ak@linux.intel.com>
      Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
      Signed-off-by: Nick Piggin <npiggin@kernel.dk>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  24. 13 Aug 2010 (1 commit)
  25. 11 Aug 2010 (1 commit)
  26. 28 Jul 2010 (1 commit)
    • vfs/fsnotify: fsnotify_close can delay the final work in fput · c1e5c954
      Eric Paris authored
      fanotify almost works like so:
      
      user context calls fsnotify_* function with a struct file.
         fsnotify takes a reference on the struct path
      user context goes about its business
      
      at some later point in time the fsnotify listener gets the struct path
         fanotify listener calls dentry_open() to create a file which userspace can deal with
            listener drops the reference on the struct path
      at some later point the listener calls close() on its new file
      
      With the switch from struct path to struct file this presents a problem for
      fput() and fsnotify_close().  fsnotify_close() is called when the file's
      refcount (f_count) has already reached 0 and __fput() wants to do its cleanup.
      
      The solution presented here is a bit odd.  If an event is created from a
      struct file we take a reference on the file.  We check however if the f_count
      was already 0 and if so we take an EXTRA reference EVEN THOUGH IT WAS ZERO.
      In __fput() (where we know the f_count hit 0 once) we check if the f_count is
      non-zero and if so we drop that 'extra' ref and return without destroying the
      file.
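      In __fput() the check described above boils down to something like this
      (sketch, simplified from the patch):
      
      	fsnotify_close(file);
      	/*
      	 * A listener may have taken references while f_count was already 0,
      	 * including the "extra" one described above.  If so, hand that extra
      	 * reference back through a normal fput() and bail out instead of
      	 * destroying the file now.
      	 */
      	if (unlikely(atomic_long_read(&file->f_count))) {
      		fput(file);
      		return;
      	}
      	/* ... normal teardown of the file continues ... */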
      Signed-off-by: Eric Paris <eparis@redhat.com>
  27. 28 May 2010 (1 commit)
    • get rid of the magic around f_count in aio · d7065da0
      Al Viro authored
      __aio_put_req() plays sick games with file refcount.  What
      it wants is fput() from atomic context; it's almost always
      done with f_count > 1, so they only have to deal with delayed
      work in rare cases when their reference happens to be the
      last one.  Current code decrements f_count and if it hasn't
      hit 0, everything is fine.  Otherwise it keeps a pointer
      to struct file (with zero f_count!) around and has delayed
      work do __fput() on it.
      
      Better way to do it: use atomic_long_add_unless( , -1, 1)
      instead of !atomic_long_dec_and_test().  IOW, decrement it
      only if it's not the last reference, leave refcount alone
      if it was.  And use normal fput() in delayed work.
      
      I've made that atomic_long_add_unless call a new helper -
      fput_atomic().  It drops a reference to a file if it's safe to
      do so in atomic context (i.e. if that's not the last one), and reports
      whether it was able to do that.  aio.c is converted to it and the
      __fput() use is gone.  req->ki_filp *always* contributes to refcount
      now.  And __fput() became static.
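      The helper and its aio use come down to roughly this (sketch; the deferred-work
      plumbing in fs/aio.c is more involved than shown):
      
      	/* include/linux/file.h: drop a reference only if it is not the last
      	 * one; returns non-zero if the reference was dropped */
      	#define fput_atomic(x)	atomic_long_add_unless(&(x)->f_count, -1, 1)
      
      	/* in __aio_put_req(), simplified: */
      	if (req->ki_filp && !fput_atomic(req->ki_filp)) {
      		/* ours was the last reference: keep ki_filp and defer the real
      		 * fput() to process context (req is queued for the worker) */
      		schedule_work(&fput_work);
      	}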
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  28. 07 Mar 2010 (1 commit)
  29. 07 Feb 2010 (1 commit)
  30. 23 Dec 2009 (1 commit)
  31. 17 Dec 2009 (5 commits)