1. 08 8月, 2010 12 次提交
    • A
      writeback: harmonize writeback threads naming · 6f904ff0
      Artem Bityutskiy 提交于
      The write-back code mixes words "thread" and "task" for the same things. This
      is not a big deal, but still an inconsistency.
      
      hch: a convention I tend to use and I've seen in various places
      is to always use _task for the storage of the task_struct pointer,
      and thread everywhere else.  This especially helps with having
      foo_thread for the actual thread and foo_task for a global
      variable keeping the task_struct pointer
      
      This patch renames:
      * 'bdi_add_default_flusher_task()' -> 'bdi_add_default_flusher_thread()'
      * 'bdi_forker_task()'              -> 'bdi_forker_thread()'
      
      because bdi threads are 'bdi_writeback_thread()', so these names are more
      consistent.
      
      This patch also amends commentaries and makes them refer the forker and bdi
      threads as "thread", not "task".
      
      Also, while on it, make 'bdi_add_default_flusher_thread()' declaration use
      'static void' instead of 'void static' and make checkpatch.pl happy.
      Signed-off-by: NArtem Bityutskiy <Artem.Bityutskiy@nokia.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      6f904ff0
    • J
      coda: fixup clash with block layer REQ_* defines · 4aeefdc6
      Jens Axboe 提交于
      CODA should not be using defines in the global name space of
      that nature, prefix them with CODA_.
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      4aeefdc6
    • M
      writeback: remove wb in get_next_work_item · 08852b6d
      Minchan Kim 提交于
      83ba7b07 cleans up the writeback.
      So we don't use wb any more in get_next_work_item.
      Let's remove unnecessary argument.
      
      CC: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NMinchan Kim <minchan.kim@gmail.com>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      08852b6d
    • M
      splice: fix misuse of SPLICE_F_NONBLOCK · 6965031d
      Miklos Szeredi 提交于
      SPLICE_F_NONBLOCK is clearly documented to only affect blocking on the
      pipe.  In __generic_file_splice_read(), however, it causes an EAGAIN
      if the page is currently being read.
      
      This makes it impossible to write an application that only wants
      failure if the pipe is full.  For example if the same process is
      handling both ends of a pipe and isn't otherwise able to determine
      whether a splice to the pipe will fill it or not.
      
      We could make the read non-blocking on O_NONBLOCK or some other splice
      flag, but for now this is the simplest fix.
      Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
      CC: stable@kernel.org
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      6965031d
    • A
      block: push down BKL into .open and .release · 6e9624b8
      Arnd Bergmann 提交于
      The open and release block_device_operations are currently
      called with the BKL held. In order to change that, we must
      first make sure that all drivers that currently rely
      on this have no regressions.
      
      This blindly pushes the BKL into all .open and .release
      operations for all block drivers to prepare for the
      next step. The drivers can subsequently replace the BKL
      with their own locks or remove it completely when it can
      be shown that it is not needed.
      
      The functions blkdev_get and blkdev_put are the only
      remaining users of the big kernel lock in the block
      layer, besides a few uses in the ioctl code, none
      of which need to serialize with blkdev_{get,put}.
      
      Most of these two functions is also under the protection
      of bdev->bd_mutex, including the actual calls to
      ->open and ->release, and the common code does not
      access any global data structures that need the BKL.
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Acked-by: NChristoph Hellwig <hch@infradead.org>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      6e9624b8
    • D
      writeback: Add tracing to balance_dirty_pages · 028c2dd1
      Dave Chinner 提交于
      Tracing high level background writeback events is good, but it doesn't
      give the entire picture. Add visibility into write throttling to catch IO
      dispatched by foreground throttling of processing dirtying lots of pages.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      028c2dd1
    • D
      writeback: Initial tracing support · 455b2864
      Dave Chinner 提交于
      Trace queue/sched/exec parts of the writeback loop. This provides
      insight into when and why flusher threads are scheduled to run. e.g
      a sync invocation leaves traces like:
      
           sync-[...]: writeback_queue: bdi 8:0: sb_dev 8:1 nr_pages=7712 sync_mode=0 kupdate=0 range_cyclic=0 background=0
      flush-8:0-[...]: writeback_exec: bdi 8:0: sb_dev 8:1 nr_pages=7712 sync_mode=0 kupdate=0 range_cyclic=0 background=0
      
      This also lays the foundation for adding more writeback tracing to
      provide deeper insight into the whole writeback path.
      
      The original tracing code is from Jens Axboe, though this version is
      a rewrite as a result of the code being traced changing
      significantly.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      455b2864
    • A
      gcc-4.6: fs: fix unused but set warnings · 1676effc
      Andi Kleen 提交于
      No real bugs I believe, just some dead code, and some
      shut up code.
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      Cc: Eric Paris <eparis@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      1676effc
    • C
      writeback: merge bdi_writeback_task and bdi_start_fn · 08243900
      Christoph Hellwig 提交于
      Move all code for the writeback thread into fs/fs-writeback.c instead of
      splitting it over two functions in two files.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      08243900
    • C
      writeback: remove wb_list · c1955ce3
      Christoph Hellwig 提交于
      The wb_list member of struct backing_device_info always has exactly one
      element.  Just use the direct bdi->wb pointer instead and simplify some
      code.
      
      Also remove bdi_task_init which is now trivial to prepare for the next
      patch.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      c1955ce3
    • C
      block: unify flags for struct bio and struct request · 7b6d91da
      Christoph Hellwig 提交于
      Remove the current bio flags and reuse the request flags for the bio, too.
      This allows to more easily trace the type of I/O from the filesystem
      down to the block driver.  There were two flags in the bio that were
      missing in the requests:  BIO_RW_UNPLUG and BIO_RW_AHEAD.  Also I've
      renamed two request flags that had a superflous RW in them.
      
      Note that the flags are in bio.h despite having the REQ_ name - as
      blkdev.h includes bio.h that is the only way to go for now.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      7b6d91da
    • C
      block: BARRIER request should imply SYNC · 41f2df62
      Christoph Hellwig 提交于
      A barrier request should by defintion have priority in get_request
      and let the queue be unplugged immediately as it's blocking all forward
      progress due to the queue draining.
      
      Most filesystems already get this implicitly by the way how submit_bh
      treats the buffer_ordered flag, and gfs2 sets it explicitly.  But btrfs
      and XFS are still forgetting to set the flag, as is blkdev_issue_flush
      and some places in DM/MD.
      
      For XFS on metadata heavy workloads this gives a consistent speedup
      in the 2-3% range.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      41f2df62
  2. 02 8月, 2010 1 次提交
  3. 31 7月, 2010 4 次提交
  4. 30 7月, 2010 1 次提交
    • D
      CRED: Fix get_task_cred() and task_state() to not resurrect dead credentials · de09a977
      David Howells 提交于
      It's possible for get_task_cred() as it currently stands to 'corrupt' a set of
      credentials by incrementing their usage count after their replacement by the
      task being accessed.
      
      What happens is that get_task_cred() can race with commit_creds():
      
      	TASK_1			TASK_2			RCU_CLEANER
      	-->get_task_cred(TASK_2)
      	rcu_read_lock()
      	__cred = __task_cred(TASK_2)
      				-->commit_creds()
      				old_cred = TASK_2->real_cred
      				TASK_2->real_cred = ...
      				put_cred(old_cred)
      				  call_rcu(old_cred)
      		[__cred->usage == 0]
      	get_cred(__cred)
      		[__cred->usage == 1]
      	rcu_read_unlock()
      							-->put_cred_rcu()
      							[__cred->usage == 1]
      							panic()
      
      However, since a tasks credentials are generally not changed very often, we can
      reasonably make use of a loop involving reading the creds pointer and using
      atomic_inc_not_zero() to attempt to increment it if it hasn't already hit zero.
      
      If successful, we can safely return the credentials in the knowledge that, even
      if the task we're accessing has released them, they haven't gone to the RCU
      cleanup code.
      
      We then change task_state() in procfs to use get_task_cred() rather than
      calling get_cred() on the result of __task_cred(), as that suffers from the
      same problem.
      
      Without this change, a BUG_ON in __put_cred() or in put_cred_rcu() can be
      tripped when it is noticed that the usage count is not zero as it ought to be,
      for example:
      
      kernel BUG at kernel/cred.c:168!
      invalid opcode: 0000 [#1] SMP
      last sysfs file: /sys/kernel/mm/ksm/run
      CPU 0
      Pid: 2436, comm: master Not tainted 2.6.33.3-85.fc13.x86_64 #1 0HR330/OptiPlex
      745
      RIP: 0010:[<ffffffff81069881>]  [<ffffffff81069881>] __put_cred+0xc/0x45
      RSP: 0018:ffff88019e7e9eb8  EFLAGS: 00010202
      RAX: 0000000000000001 RBX: ffff880161514480 RCX: 00000000ffffffff
      RDX: 00000000ffffffff RSI: ffff880140c690c0 RDI: ffff880140c690c0
      RBP: ffff88019e7e9eb8 R08: 00000000000000d0 R09: 0000000000000000
      R10: 0000000000000001 R11: 0000000000000040 R12: ffff880140c690c0
      R13: ffff88019e77aea0 R14: 00007fff336b0a5c R15: 0000000000000001
      FS:  00007f12f50d97c0(0000) GS:ffff880007400000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f8f461bc000 CR3: 00000001b26ce000 CR4: 00000000000006f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Process master (pid: 2436, threadinfo ffff88019e7e8000, task ffff88019e77aea0)
      Stack:
       ffff88019e7e9ec8 ffffffff810698cd ffff88019e7e9ef8 ffffffff81069b45
      <0> ffff880161514180 ffff880161514480 ffff880161514180 0000000000000000
      <0> ffff88019e7e9f28 ffffffff8106aace 0000000000000001 0000000000000246
      Call Trace:
       [<ffffffff810698cd>] put_cred+0x13/0x15
       [<ffffffff81069b45>] commit_creds+0x16b/0x175
       [<ffffffff8106aace>] set_current_groups+0x47/0x4e
       [<ffffffff8106ac89>] sys_setgroups+0xf6/0x105
       [<ffffffff81009b02>] system_call_fastpath+0x16/0x1b
      Code: 48 8d 71 ff e8 7e 4e 15 00 85 c0 78 0b 8b 75 ec 48 89 df e8 ef 4a 15 00
      48 83 c4 18 5b c9 c3 55 8b 07 8b 07 48 89 e5 85 c0 74 04 <0f> 0b eb fe 65 48 8b
      04 25 00 cc 00 00 48 3b b8 58 04 00 00 75
      RIP  [<ffffffff81069881>] __put_cred+0xc/0x45
       RSP <ffff88019e7e9eb8>
      ---[ end trace df391256a100ebdd ]---
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Acked-by: NJiri Olsa <jolsa@redhat.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      de09a977
  5. 29 7月, 2010 2 次提交
  6. 28 7月, 2010 2 次提交
  7. 27 7月, 2010 3 次提交
  8. 25 7月, 2010 1 次提交
  9. 24 7月, 2010 4 次提交
  10. 23 7月, 2010 2 次提交
    • S
      ceph: avoid dcache readdir for snapdir · a0dff78d
      Sage Weil 提交于
      We should always go to the MDS for readdir on the hidden snapdir.  The
      set of snapshots can change at any time; the client can't trust its cache
      for that.
      Signed-off-by: NSage Weil <sage@newdream.net>
      a0dff78d
    • D
      CIFS: Fix a malicious redirect problem in the DNS lookup code · 4c0c03ca
      David Howells 提交于
      Fix the security problem in the CIFS filesystem DNS lookup code in which a
      malicious redirect could be installed by a random user by simply adding a
      result record into one of their keyrings with add_key() and then invoking a
      CIFS CFS lookup [CVE-2010-2524].
      
      This is done by creating an internal keyring specifically for the caching of
      DNS lookups.  To enforce the use of this keyring, the module init routine
      creates a set of override credentials with the keyring installed as the thread
      keyring and instructs request_key() to only install lookup result keys in that
      keyring.
      
      The override is then applied around the call to request_key().
      
      This has some additional benefits when a kernel service uses this module to
      request a key:
      
       (1) The result keys are owned by root, not the user that caused the lookup.
      
       (2) The result keys don't pop up in the user's keyrings.
      
       (3) The result keys don't come out of the quota of the user that caused the
           lookup.
      
      The keyring can be viewed as root by doing cat /proc/keys:
      
      2a0ca6c3 I-----     1 perm 1f030000     0     0 keyring   .dns_resolver: 1/4
      
      It can then be listed with 'keyctl list' by root.
      
      	# keyctl list 0x2a0ca6c3
      	1 key in keyring:
      	726766307: --alswrv     0     0 dns_resolver: foo.bar.com
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Reviewed-and-Tested-by: NJeff Layton <jlayton@redhat.com>
      Acked-by: NSteve French <smfrench@gmail.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4c0c03ca
  11. 22 7月, 2010 1 次提交
  12. 20 7月, 2010 5 次提交
    • D
      xfs: track AGs with reclaimable inodes in per-ag radix tree · 16fd5367
      Dave Chinner 提交于
      https://bugzilla.kernel.org/show_bug.cgi?id=16348
      
      When the filesystem grows to a large number of allocation groups,
      the summing of recalimable inodes gets expensive. In many cases,
      most AGs won't have any reclaimable inodes and so we are wasting CPU
      time aggregating over these AGs. This is particularly important for
      the inode shrinker that gets called frequently under memory
      pressure.
      
      To avoid the overhead, track AGs with reclaimable inodes in the
      per-ag radix tree so that we can find all the AGs with reclaimable
      inodes via a simple gang tag lookup. This involves setting the tag
      when the first reclaimable inode is tracked in the AG, and removing
      the tag when the last reclaimable inode is removed from the tree.
      Then the summation process becomes a loop walking the radix tree
      summing AGs with the reclaim tag set.
      
      This significantly reduces the overhead of scanning - a 6400 AG
      filesystea now only uses about 25% of a cpu in kswapd while slab
      reclaim progresses instead of being permanently stuck at 100% CPU
      and making little progress. Clean filesystems filesystems will see
      no overhead and the overhead only increases linearly with the number
      of dirty AGs.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      16fd5367
    • D
      xfs: convert inode shrinker to per-filesystem contexts · 70e60ce7
      Dave Chinner 提交于
      Now the shrinker passes us a context, wire up a shrinker context per
      filesystem. This allows us to remove the global mount list and the
      locking problems that introduced. It also means that a shrinker call
      does not need to traverse clean filesystems before finding a
      filesystem with reclaimable inodes.  This significantly reduces
      scanning overhead when lots of filesystems are present.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      70e60ce7
    • D
      Btrfs: fix checks in BTRFS_IOC_CLONE_RANGE · 2ebc3464
      Dan Rosenberg 提交于
      1.  The BTRFS_IOC_CLONE and BTRFS_IOC_CLONE_RANGE ioctls should check
      whether the donor file is append-only before writing to it.
      
      2.  The BTRFS_IOC_CLONE_RANGE ioctl appears to have an integer
      overflow that allows a user to specify an out-of-bounds range to copy
      from the source file (if off + len wraps around).  I haven't been able
      to successfully exploit this, but I'd imagine that a clever attacker
      could use this to read things he shouldn't.  Even if it's not
      exploitable, it couldn't hurt to be safe.
      Signed-off-by: NDan Rosenberg <dan.j.rosenberg@gmail.com>
      cc: stable@kernel.org
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      2ebc3464
    • S
      Btrfs: fix CLONE ioctl destination file size expansion to block boundary · b5384d48
      Sage Weil 提交于
      The CLONE and CLONE_RANGE ioctls round up the range of extents being
      cloned to the block size when the range to clone extends to the end of file
      (this is always the case with CLONE).  It was then using that offset when
      extending the destination file's i_size.  Fix this by not setting i_size
      beyond the originally requested ending offset.
      
      This bug was introduced by a22285a6 (2.6.35-rc1).
      Signed-off-by: NSage Weil <sage@newdream.net>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      b5384d48
    • C
      Btrfs: fix split_leaf double split corner case · 99d8f83c
      Chris Mason 提交于
      split_leaf was not properly balancing leaves when it was forced to
      split a leaf twice.  This commit adds an extra push left and right
      before forcing the double split in hopes of getting the slot where
      we want to insert at either the start or end of the leaf.
      
      If the extra pushes do work, then we are able to avoid splitting twice
      and we keep the tree properly balanced.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      99d8f83c
  13. 19 7月, 2010 2 次提交
    • P
      [S390] dasd: use correct label location for diag fba disks · cffab6bc
      Peter Oberparleiter 提交于
      Partition boundary calculation fails for DASD FBA disks under the
      following conditions:
      - disk is formatted with CMS FORMAT with a blocksize of more than
        512 bytes
      - all of the disk is reserved to a single CMS file using CMS RESERVE
      - the disk is accessed using the DIAG mode of the DASD driver
      
      Under these circumstances, the partition detection code tries to
      read the CMS label block containing partition-relevant information
      from logical block offset 1, while it is in fact located at physical
      block offset 1.
      
      Fix this problem by using the correct CMS label block location
      depending on the device type as determined by the DASD SENSE ID
      information.
      Signed-off-by: NPeter Oberparleiter <peter.oberparleiter@de.ibm.com>
      Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
      cffab6bc
    • D
      mm: add context argument to shrinker callback · 7f8275d0
      Dave Chinner 提交于
      The current shrinker implementation requires the registered callback
      to have global state to work from. This makes it difficult to shrink
      caches that are not global (e.g. per-filesystem caches). Pass the shrinker
      structure to the callback so that users can embed the shrinker structure
      in the context the shrinker needs to operate on and get back to it in the
      callback via container_of().
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      7f8275d0