1. 15 9月, 2009 2 次提交
  2. 11 9月, 2009 8 次提交
    • J
      writeback: check for registered bdi in flusher add and inode dirty · 500b067c
      Jens Axboe 提交于
      Also a debugging aid. We want to catch dirty inodes being added to
      backing devices that don't do writeback.
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      500b067c
    • J
      writeback: add name to backing_dev_info · d993831f
      Jens Axboe 提交于
      This enables us to track who does what and print info. Its main use
      is catching dirty inodes on the default_backing_dev_info, so we can
      fix that up.
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      d993831f
    • J
      writeback: get rid of pdflush completely · d0bceac7
      Jens Axboe 提交于
      It is now unused, so kill it off.
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      d0bceac7
    • J
      writeback: switch to per-bdi threads for flushing data · 03ba3782
      Jens Axboe 提交于
      This gets rid of pdflush for bdi writeout and kupdated style cleaning.
      pdflush writeout suffers from lack of locality and also requires more
      threads to handle the same workload, since it has to work in a
      non-blocking fashion against each queue. This also introduces lumpy
      behaviour and potential request starvation, since pdflush can be starved
      for queue access if others are accessing it. A sample ffsb workload that
      does random writes to files is about 8% faster here on a simple SATA drive
      during the benchmark phase. File layout also seems a LOT more smooth in
      vmstat:
      
       r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
       0  1      0 608848   2652 375372    0    0     0 71024  604    24  1 10 48 42
       0  1      0 549644   2712 433736    0    0     0 60692  505    27  1  8 48 44
       1  0      0 476928   2784 505192    0    0     4 29540  553    24  0  9 53 37
       0  1      0 457972   2808 524008    0    0     0 54876  331    16  0  4 38 58
       0  1      0 366128   2928 614284    0    0     4 92168  710    58  0 13 53 34
       0  1      0 295092   3000 684140    0    0     0 62924  572    23  0  9 53 37
       0  1      0 236592   3064 741704    0    0     4 58256  523    17  0  8 48 44
       0  1      0 165608   3132 811464    0    0     0 57460  560    21  0  8 54 38
       0  1      0 102952   3200 873164    0    0     4 74748  540    29  1 10 48 41
       0  1      0  48604   3252 926472    0    0     0 53248  469    29  0  7 47 45
      
      where vanilla tends to fluctuate a lot in the creation phase:
      
       r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
       1  1      0 678716   5792 303380    0    0     0 74064  565    50  1 11 52 36
       1  0      0 662488   5864 319396    0    0     4   352  302   329  0  2 47 51
       0  1      0 599312   5924 381468    0    0     0 78164  516    55  0  9 51 40
       0  1      0 519952   6008 459516    0    0     4 78156  622    56  1 11 52 37
       1  1      0 436640   6092 541632    0    0     0 82244  622    54  0 11 48 41
       0  1      0 436640   6092 541660    0    0     0     8  152    39  0  0 51 49
       0  1      0 332224   6200 644252    0    0     4 102800  728    46  1 13 49 36
       1  0      0 274492   6260 701056    0    0     4 12328  459    49  0  7 50 43
       0  1      0 211220   6324 763356    0    0     0 106940  515    37  1 10 51 39
       1  0      0 160412   6376 813468    0    0     0  8224  415    43  0  6 49 45
       1  1      0  85980   6452 886556    0    0     4 113516  575    39  1 11 54 34
       0  2      0  85968   6452 886620    0    0     0  1640  158   211  0  0 46 54
      
      A 10 disk test with btrfs performs 26% faster with per-bdi flushing. A
      SSD based writeback test on XFS performs over 20% better as well, with
      the throughput being very stable around 1GB/sec, where pdflush only
      manages 750MB/sec and fluctuates wildly while doing so. Random buffered
      writes to many files behave a lot better as well, as does random mmap'ed
      writes.
      
      A separate thread is added to sync the super blocks. In the long term,
      adding sync_supers_bdi() functionality could get rid of this thread again.
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      03ba3782
    • J
      writeback: move dirty inodes from super_block to backing_dev_info · 66f3b8e2
      Jens Axboe 提交于
      This is a first step at introducing per-bdi flusher threads. We should
      have no change in behaviour, although sb_has_dirty_inodes() is now
      ridiculously expensive, as there's no easy way to answer that question.
      Not a huge problem, since it'll be deleted in subsequent patches.
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      66f3b8e2
    • J
      writeback: get rid of generic_sync_sb_inodes() export · d8a8559c
      Jens Axboe 提交于
      This adds two new exported functions:
      
      - writeback_inodes_sb(), which only attempts to writeback dirty inodes on
        this super_block, for WB_SYNC_NONE writeout.
      - sync_inodes_sb(), which writes out all dirty inodes on this super_block
        and also waits for the IO to complete.
      Acked-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      d8a8559c
    • O
    • S
      ahci: Add AMD SB900 SATA/IDE controller device IDs · e2dd90b1
      Shane Huang 提交于
      Add AMD SB900 SATA/IDE controller device IDs.
      Signed-off-by: NShane Huang <shane.huang@amd.com>
      Signed-off-by: NJeff Garzik <jgarzik@redhat.com>
      e2dd90b1
  3. 10 9月, 2009 2 次提交
    • D
      LSM/SELinux: inode_{get,set,notify}secctx hooks to access LSM security context information. · 1ee65e37
      David P. Quigley 提交于
      This patch introduces three new hooks. The inode_getsecctx hook is used to get
      all relevant information from an LSM about an inode. The inode_setsecctx is
      used to set both the in-core and on-disk state for the inode based on a context
      derived from inode_getsecctx.The final hook inode_notifysecctx will notify the
      LSM of a change for the in-core state of the inode in question. These hooks are
      for use in the labeled NFS code and addresses concerns of how to set security
      on an inode in a multi-xattr LSM. For historical reasons Stephen Smalley's
      explanation of the reason for these hooks is pasted below.
      
      Quote Stephen Smalley
      
      inode_setsecctx:  Change the security context of an inode.  Updates the
      in core security context managed by the security module and invokes the
      fs code as needed (via __vfs_setxattr_noperm) to update any backing
      xattrs that represent the context.  Example usage:  NFS server invokes
      this hook to change the security context in its incore inode and on the
      backing file system to a value provided by the client on a SETATTR
      operation.
      
      inode_notifysecctx:  Notify the security module of what the security
      context of an inode should be.  Initializes the incore security context
      managed by the security module for this inode.  Example usage:  NFS
      client invokes this hook to initialize the security context in its
      incore inode to the value provided by the server for the file when the
      server returned the file's attributes to the client.
      Signed-off-by: NDavid P. Quigley <dpquigl@tycho.nsa.gov>
      Acked-by: NSerge Hallyn <serue@us.ibm.com>
      Signed-off-by: NJames Morris <jmorris@namei.org>
      1ee65e37
    • D
      VFS: Factor out part of vfs_setxattr so it can be called from the SELinux hook for inode_setsecctx. · b1ab7e4b
      David P. Quigley 提交于
      This factors out the part of the vfs_setxattr function that performs the
      setting of the xattr and its notification. This is needed so the SELinux
      implementation of inode_setsecctx can handle the setting of the xattr while
      maintaining the proper separation of layers.
      Signed-off-by: NDavid P. Quigley <dpquigl@tycho.nsa.gov>
      Acked-by: NSerge Hallyn <serue@us.ibm.com>
      Signed-off-by: NJames Morris <jmorris@namei.org>
      b1ab7e4b
  4. 09 9月, 2009 5 次提交
  5. 08 9月, 2009 1 次提交
  6. 07 9月, 2009 2 次提交
  7. 06 9月, 2009 2 次提交
    • O
      exec: do not sleep in TASK_TRACED under ->cred_guard_mutex · a2a8474c
      Oleg Nesterov 提交于
      Tom Horsley reports that his debugger hangs when it tries to read
      /proc/pid_of_tracee/maps, this happens since
      
      	"mm_for_maps: take ->cred_guard_mutex to fix the race with exec"
      	04b836cbf19e885f8366bccb2e4b0474346c02d
      
      commit in 2.6.31.
      
      But the root of the problem lies in the fact that do_execve() path calls
      tracehook_report_exec() which can stop if the tracer sets PT_TRACE_EXEC.
      
      The tracee must not sleep in TASK_TRACED holding this mutex.  Even if we
      remove ->cred_guard_mutex from mm_for_maps() and proc_pid_attr_write(),
      another task doing PTRACE_ATTACH should not hang until it is killed or the
      tracee resumes.
      
      With this patch do_execve() does not use ->cred_guard_mutex directly and
      we do not hold it throughout, instead:
      
      	- introduce prepare_bprm_creds() helper, it locks the mutex
      	  and calls prepare_exec_creds() to initialize bprm->cred.
      
      	- install_exec_creds() drops the mutex after commit_creds(),
      	  and thus before tracehook_report_exec()->ptrace_stop().
      
      	  or, if exec fails,
      
      	  free_bprm() drops this mutex when bprm->cred != NULL which
      	  indicates install_exec_creds() was not called.
      Reported-by: NTom Horsley <tom.horsley@att.net>
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Acked-by: NDavid Howells <dhowells@redhat.com>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: James Morris <jmorris@namei.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a2a8474c
    • O
      workqueues: introduce __cancel_delayed_work() · 4e49627b
      Oleg Nesterov 提交于
      cancel_delayed_work() has to use del_timer_sync() to guarantee the timer
      function is not running after return.  But most users doesn't actually
      need this, and del_timer_sync() has problems: it is not useable from
      interrupt, and it depends on every lock which could be taken from irq.
      
      Introduce __cancel_delayed_work() which calls del_timer() instead.
      
      The immediate reason for this patch is
      http://bugzilla.kernel.org/show_bug.cgi?id=13757
      but hopefully this helper makes sense anyway.
      
      As for 13757 bug, actually we need requeue_delayed_work(), but its
      semantics are not yet clear.
      
      Merge this patch early to resolves cross-tree interdependencies between
      input and infiniband.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
      Cc: Roland Dreier <rdreier@cisco.com>
      Cc: Stefan Richter <stefanr@s5r6.in-berlin.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4e49627b
  8. 05 9月, 2009 4 次提交
    • S
      ring-buffer: only enable ring_buffer_swap_cpu when needed · 85bac32c
      Steven Rostedt 提交于
      Since the ability to swap the cpu buffers adds a small overhead to
      the recording of a trace, we only want to add it when needed.
      
      Only the irqsoff and preemptoff tracers use this feature, and both are
      not recommended for production kernels. This patch disables its use
      when neither irqsoff nor preemptoff is configured.
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      85bac32c
    • S
      tracing: pass around ring buffer instead of tracer · e77405ad
      Steven Rostedt 提交于
      The latency tracers (irqsoff and wakeup) can swap trace buffers
      on the fly. If an event is happening and has reserved data on one of
      the buffers, and the latency tracer swaps the global buffer with the
      max buffer, the result is that the event may commit the data to the
      wrong buffer.
      
      This patch changes the API to the trace recording to be recieve the
      buffer that was used to reserve a commit. Then this buffer can be passed
      in to the commit.
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      e77405ad
    • J
      dm log: userspace add luid to distinguish between concurrent log instances · 7ec23d50
      Jonathan Brassow 提交于
      Device-mapper userspace logs (like the clustered log) are
      identified by a universally unique identifier (UUID).  This
      identifier is used to associate requests from the kernel to
      a specific log in userspace.  The UUID must be unique everywhere,
      since multiple machines may use this identifier when communicating
      about a particular log, as is the case for cluster logs.
      
      Sometimes, device-mapper/LVM may re-use a UUID.  This is the
      case during pvmoves, when moving from one segment of an LV
      to another, or when resizing a mirror, etc.  In these cases,
      a new log is created with the same UUID and loaded in the
      "inactive" slot.  When a device-mapper "resume" is issued,
      the "live" table is deactivated and the new "inactive" table
      becomes "live".  (The "inactive" table can also be removed
      via a device-mapper 'clear' command.)
      
      The above two issues were colliding.  More than one log was being
      created with the same UUID, and there was no way to distinguish
      between them.  So, sometimes the wrong log would be swapped
      out during the exchange.
      
      The solution is to create a locally unique identifier,
      'luid', to go along with the UUID.  This new identifier is used
      to determine exactly which log is being referenced by the kernel
      when the log exchange is made.  The identifier is not
      universally safe, but it does not need to be, since
      create/destroy/suspend/resume operations are bound to a specific
      machine; and these are the operations that make up the exchange.
      Signed-off-by: NJonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      7ec23d50
    • M
      dm stripe: expose correct io hints · 40bea431
      Mike Snitzer 提交于
      Set sensible I/O hints for striped DM devices in the topology
      infrastructure added for 2.6.31 for userspace tools to
      obtain via sysfs.
      
      Add .io_hints to 'struct target_type' to allow the I/O hints portion
      (io_min and io_opt) of the 'struct queue_limits' to be set by each
      target and implement this for dm-stripe.
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      40bea431
  9. 04 9月, 2009 9 次提交
  10. 02 9月, 2009 5 次提交
    • D
      KEYS: Add a keyctl to install a process's session keyring on its parent [try #6] · ee18d64c
      David Howells 提交于
      Add a keyctl to install a process's session keyring onto its parent.  This
      replaces the parent's session keyring.  Because the COW credential code does
      not permit one process to change another process's credentials directly, the
      change is deferred until userspace next starts executing again.  Normally this
      will be after a wait*() syscall.
      
      To support this, three new security hooks have been provided:
      cred_alloc_blank() to allocate unset security creds, cred_transfer() to fill in
      the blank security creds and key_session_to_parent() - which asks the LSM if
      the process may replace its parent's session keyring.
      
      The replacement may only happen if the process has the same ownership details
      as its parent, and the process has LINK permission on the session keyring, and
      the session keyring is owned by the process, and the LSM permits it.
      
      Note that this requires alteration to each architecture's notify_resume path.
      This has been done for all arches barring blackfin, m68k* and xtensa, all of
      which need assembly alteration to support TIF_NOTIFY_RESUME.  This allows the
      replacement to be performed at the point the parent process resumes userspace
      execution.
      
      This allows the userspace AFS pioctl emulation to fully emulate newpag() and
      the VIOCSETTOK and VIOCSETTOK2 pioctls, all of which require the ability to
      alter the parent process's PAG membership.  However, since kAFS doesn't use
      PAGs per se, but rather dumps the keys into the session keyring, the session
      keyring of the parent must be replaced if, for example, VIOCSETTOK is passed
      the newpag flag.
      
      This can be tested with the following program:
      
      	#include <stdio.h>
      	#include <stdlib.h>
      	#include <keyutils.h>
      
      	#define KEYCTL_SESSION_TO_PARENT	18
      
      	#define OSERROR(X, S) do { if ((long)(X) == -1) { perror(S); exit(1); } } while(0)
      
      	int main(int argc, char **argv)
      	{
      		key_serial_t keyring, key;
      		long ret;
      
      		keyring = keyctl_join_session_keyring(argv[1]);
      		OSERROR(keyring, "keyctl_join_session_keyring");
      
      		key = add_key("user", "a", "b", 1, keyring);
      		OSERROR(key, "add_key");
      
      		ret = keyctl(KEYCTL_SESSION_TO_PARENT);
      		OSERROR(ret, "KEYCTL_SESSION_TO_PARENT");
      
      		return 0;
      	}
      
      Compiled and linked with -lkeyutils, you should see something like:
      
      	[dhowells@andromeda ~]$ keyctl show
      	Session Keyring
      	       -3 --alswrv   4043  4043  keyring: _ses
      	355907932 --alswrv   4043    -1   \_ keyring: _uid.4043
      	[dhowells@andromeda ~]$ /tmp/newpag
      	[dhowells@andromeda ~]$ keyctl show
      	Session Keyring
      	       -3 --alswrv   4043  4043  keyring: _ses
      	1055658746 --alswrv   4043  4043   \_ user: a
      	[dhowells@andromeda ~]$ /tmp/newpag hello
      	[dhowells@andromeda ~]$ keyctl show
      	Session Keyring
      	       -3 --alswrv   4043  4043  keyring: hello
      	340417692 --alswrv   4043  4043   \_ user: a
      
      Where the test program creates a new session keyring, sticks a user key named
      'a' into it and then installs it on its parent.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NJames Morris <jmorris@namei.org>
      ee18d64c
    • D
      KEYS: Add garbage collection for dead, revoked and expired keys. [try #6] · 5d135440
      David Howells 提交于
      Add garbage collection for dead, revoked and expired keys.  This involved
      erasing all links to such keys from keyrings that point to them.  At that
      point, the key will be deleted in the normal manner.
      
      Keyrings from which garbage collection occurs are shrunk and their quota
      consumption reduced as appropriate.
      
      Dead keys (for which the key type has been removed) will be garbage collected
      immediately.
      
      Revoked and expired keys will hang around for a number of seconds, as set in
      /proc/sys/kernel/keys/gc_delay before being automatically removed.  The default
      is 5 minutes.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NJames Morris <jmorris@namei.org>
      5d135440
    • D
      CRED: Add some configurable debugging [try #6] · e0e81739
      David Howells 提交于
      Add a config option (CONFIG_DEBUG_CREDENTIALS) to turn on some debug checking
      for credential management.  The additional code keeps track of the number of
      pointers from task_structs to any given cred struct, and checks to see that
      this number never exceeds the usage count of the cred struct (which includes
      all references, not just those from task_structs).
      
      Furthermore, if SELinux is enabled, the code also checks that the security
      pointer in the cred struct is never seen to be invalid.
      
      This attempts to catch the bug whereby inode_has_perm() faults in an nfsd
      kernel thread on seeing cred->security be a NULL pointer (it appears that the
      credential struct has been previously released):
      
      	http://www.kerneloops.org/oops.php?number=252883Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NJames Morris <jmorris@namei.org>
      e0e81739
    • A
      sched: Provide iowait counters · 8f0dfc34
      Arjan van de Ven 提交于
      For counting how long an application has been waiting for
      (disk) IO, there currently is only the HZ sample driven
      information available, while for all other counters in this
      class, a high resolution version is available via
      CONFIG_SCHEDSTATS.
      
      In order to make an improved bootchart tool possible, we also
      need a higher resolution version of the iowait time.
      
      This patch below adds this scheduler statistic to the kernel.
      Signed-off-by: NArjan van de Ven <arjan@linux.intel.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <4A64B813.1080506@linux.intel.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      8f0dfc34
    • T
      libata: remove spindown skipping and warning · 051d9fbd
      Tejun Heo 提交于
      This was a hack to give userland shutdown tools time to drop manual
      spindown.  All popular distros updated quite some time ago and the due
      is well passed.  Drop it.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Jaswinder Singh Rajput <jaswinder@kernel.org>
      Signed-off-by: NJeff Garzik <jgarzik@redhat.com>
      051d9fbd