1. 14 Sep 2009 (5 commits)
  2. 11 Sep 2009 (8 commits)
    • writeback: check for registered bdi in flusher add and inode dirty · 500b067c
      Jens Axboe committed
      Also a debugging aid. We want to catch dirty inodes being added to
      backing devices that don't do writeback.
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • writeback: add name to backing_dev_info · d993831f
      Jens Axboe committed
      This enables us to track who does what and print info. Its main use
      is catching dirty inodes on the default_backing_dev_info, so we can
      fix that up.
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • writeback: get rid of pdflush completely · d0bceac7
      Jens Axboe committed
      It is now unused, so kill it off.
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • writeback: switch to per-bdi threads for flushing data · 03ba3782
      Jens Axboe committed
      This gets rid of pdflush for bdi writeout and kupdated style cleaning.
      pdflush writeout suffers from lack of locality and also requires more
      threads to handle the same workload, since it has to work in a
      non-blocking fashion against each queue. This also introduces lumpy
      behaviour and potential request starvation, since pdflush can be starved
      for queue access if others are accessing it. A sample ffsb workload that
      does random writes to files is about 8% faster here on a simple SATA drive
      during the benchmark phase. File layout also looks a LOT smoother in
      vmstat:
      
       r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
       0  1      0 608848   2652 375372    0    0     0 71024  604    24  1 10 48 42
       0  1      0 549644   2712 433736    0    0     0 60692  505    27  1  8 48 44
       1  0      0 476928   2784 505192    0    0     4 29540  553    24  0  9 53 37
       0  1      0 457972   2808 524008    0    0     0 54876  331    16  0  4 38 58
       0  1      0 366128   2928 614284    0    0     4 92168  710    58  0 13 53 34
       0  1      0 295092   3000 684140    0    0     0 62924  572    23  0  9 53 37
       0  1      0 236592   3064 741704    0    0     4 58256  523    17  0  8 48 44
       0  1      0 165608   3132 811464    0    0     0 57460  560    21  0  8 54 38
       0  1      0 102952   3200 873164    0    0     4 74748  540    29  1 10 48 41
       0  1      0  48604   3252 926472    0    0     0 53248  469    29  0  7 47 45
      
      where vanilla tends to fluctuate a lot in the creation phase:
      
       r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
       1  1      0 678716   5792 303380    0    0     0 74064  565    50  1 11 52 36
       1  0      0 662488   5864 319396    0    0     4   352  302   329  0  2 47 51
       0  1      0 599312   5924 381468    0    0     0 78164  516    55  0  9 51 40
       0  1      0 519952   6008 459516    0    0     4 78156  622    56  1 11 52 37
       1  1      0 436640   6092 541632    0    0     0 82244  622    54  0 11 48 41
       0  1      0 436640   6092 541660    0    0     0     8  152    39  0  0 51 49
       0  1      0 332224   6200 644252    0    0     4 102800  728    46  1 13 49 36
       1  0      0 274492   6260 701056    0    0     4 12328  459    49  0  7 50 43
       0  1      0 211220   6324 763356    0    0     0 106940  515    37  1 10 51 39
       1  0      0 160412   6376 813468    0    0     0  8224  415    43  0  6 49 45
       1  1      0  85980   6452 886556    0    0     4 113516  575    39  1 11 54 34
       0  2      0  85968   6452 886620    0    0     0  1640  158   211  0  0 46 54
      
      A 10 disk test with btrfs performs 26% faster with per-bdi flushing. An
      SSD based writeback test on XFS performs over 20% better as well, with
      the throughput being very stable around 1GB/sec, where pdflush only
      manages 750MB/sec and fluctuates wildly while doing so. Random buffered
      writes to many files behave a lot better as well, as do random mmap'ed
      writes.
      
      A separate thread is added to sync the super blocks. In the long term,
      adding sync_supers_bdi() functionality could get rid of this thread again.
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
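
      For orientation, a simplified sketch of the per-bdi model this patch
      introduces; the struct fields are condensed and writeback_bdi_inodes()
      is a hypothetical helper, not the exact in-tree API:

        #include <linux/backing-dev.h>
        #include <linux/kthread.h>
        #include <linux/list.h>
        #include <linux/writeback.h>

        /* Each backing device owns a flusher thread and its own dirty-inode
         * lists, replacing the shared pdflush pool. */
        struct bdi_writeback_sketch {
                struct backing_dev_info *bdi;  /* the device this thread serves */
                struct task_struct *task;      /* dedicated flusher kthread */
                struct list_head b_dirty;      /* inodes waiting for writeback */
                struct list_head b_io;         /* inodes under writeback */
        };

        static void writeback_bdi_inodes(struct bdi_writeback_sketch *wb); /* hypothetical */

        /* Flusher main loop: write back this bdi's dirty inodes, then sleep
         * until the periodic dirty timer fires again. */
        static int bdi_flusher_thread(void *data)
        {
                struct bdi_writeback_sketch *wb = data;

                while (!kthread_should_stop()) {
                        writeback_bdi_inodes(wb);
                        /* dirty_writeback_interval is in centisecs */
                        schedule_timeout_interruptible(
                                msecs_to_jiffies(dirty_writeback_interval * 10));
                }
                return 0;
        }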
    • writeback: move dirty inodes from super_block to backing_dev_info · 66f3b8e2
      Jens Axboe committed
      This is a first step toward introducing per-bdi flusher threads. We should
      have no change in behaviour, although sb_has_dirty_inodes() is now
      ridiculously expensive, as there's no easy way to answer that question.
      Not a huge problem, since it'll be deleted in subsequent patches.
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • writeback: get rid of generic_sync_sb_inodes() export · d8a8559c
      Jens Axboe committed
      This adds two new exported functions:
      
      - writeback_inodes_sb(), which only attempts to write back dirty inodes on
        this super_block, for WB_SYNC_NONE writeout.
      - sync_inodes_sb(), which writes out all dirty inodes on this super_block
        and also waits for the IO to complete.
      Acked-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
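
      A minimal usage sketch for the two exports, assuming the signatures at
      this point in the series (both take just the super_block):

        #include <linux/fs.h>
        #include <linux/writeback.h>

        static void example_sync_fs(struct super_block *sb, int wait)
        {
                if (!wait)
                        /* WB_SYNC_NONE: start writeout, do not wait on it */
                        writeback_inodes_sb(sb);
                else
                        /* write out all dirty inodes and wait for the IO */
                        sync_inodes_sb(sb);
        }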
    • ahci: Add AMD SB900 SATA/IDE controller device IDs · e2dd90b1
      Shane Huang committed
      Add AMD SB900 SATA/IDE controller device IDs.
      Signed-off-by: Shane Huang <shane.huang@amd.com>
      Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
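
      For context, new controller support like this usually amounts to new
      rows in the driver's PCI ID table. An illustrative excerpt (the device
      ID below is a placeholder, not necessarily an actual SB900 ID):

        static const struct pci_device_id ahci_pci_tbl_excerpt[] = {
                /* AMD/ATI SBx00-family SATA controller (placeholder ID) */
                { PCI_VDEVICE(ATI, 0x4394), board_ahci_sb700 },
                { }     /* terminate list */
        };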
  3. 10 Sep 2009 (2 commits)
    • LSM/SELinux: inode_{get,set,notify}secctx hooks to access LSM security context information. · 1ee65e37
      David P. Quigley committed
      This patch introduces three new hooks. The inode_getsecctx hook is used to get
      all relevant information from an LSM about an inode. The inode_setsecctx hook
      is used to set both the in-core and on-disk state for the inode, based on a
      context derived from inode_getsecctx. The final hook, inode_notifysecctx, will
      notify the LSM of a change to the in-core state of the inode in question. These
      hooks are for use in the labeled NFS code and address concerns about how to set
      security on an inode in a multi-xattr LSM. For historical reasons, Stephen
      Smalley's explanation of the rationale for these hooks is pasted below.
      
      Quoting Stephen Smalley:
      
      inode_setsecctx:  Change the security context of an inode.  Updates the
      in core security context managed by the security module and invokes the
      fs code as needed (via __vfs_setxattr_noperm) to update any backing
      xattrs that represent the context.  Example usage:  NFS server invokes
      this hook to change the security context in its incore inode and on the
      backing file system to a value provided by the client on a SETATTR
      operation.
      
      inode_notifysecctx:  Notify the security module of what the security
      context of an inode should be.  Initializes the incore security context
      managed by the security module for this inode.  Example usage:  NFS
      client invokes this hook to initialize the security context in its
      incore inode to the value provided by the server for the file when the
      server returned the file's attributes to the client.
      Signed-off-by: David P. Quigley <dpquigl@tycho.nsa.gov>
      Acked-by: Serge Hallyn <serue@us.ibm.com>
      Signed-off-by: James Morris <jmorris@namei.org>
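
      A sketch of the three hook signatures consistent with the description
      above (treat the exact prototypes as approximate):

        #include <linux/security.h>

        /* Change an inode's security context in-core and on disk (the
         * on-disk side goes through __vfs_setxattr_noperm). */
        int security_inode_setsecctx(struct dentry *dentry, void *ctx, u32 ctxlen);

        /* Tell the LSM what the in-core context of this inode should be,
         * e.g. from attributes an NFS client received from the server. */
        int security_inode_notifysecctx(struct inode *inode, void *ctx, u32 ctxlen);

        /* Retrieve the full security context of an inode from the LSM. */
        int security_inode_getsecctx(struct inode *inode, void **ctx, u32 *ctxlen);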
    • VFS: Factor out part of vfs_setxattr so it can be called from the SELinux hook for inode_setsecctx. · b1ab7e4b
      David P. Quigley committed
      This factors out the part of the vfs_setxattr function that performs the
      setting of the xattr and its notification. This is needed so the SELinux
      implementation of inode_setsecctx can handle the setting of the xattr while
      maintaining the proper separation of layers.
      Signed-off-by: David P. Quigley <dpquigl@tycho.nsa.gov>
      Acked-by: Serge Hallyn <serue@us.ibm.com>
      Signed-off-by: James Morris <jmorris@namei.org>
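
      The shape of the split, as a sketch (signature believed accurate for
      this series, but hedge accordingly): vfs_setxattr() keeps the
      permission checks and locking, then calls the factored-out helper,
      which inode_setsecctx can also call directly on an already-authorized
      update:

        /* Set the xattr and fire the fsnotify event; no permission checks.
         * Caller must hold i_mutex and have done its own authorization. */
        int __vfs_setxattr_noperm(struct dentry *dentry, const char *name,
                                  const void *value, size_t size, int flags);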
  4. 09 Sep 2009 (5 commits)
  5. 08 Sep 2009 (3 commits)
  6. 07 Sep 2009 (5 commits)
  7. 06 Sep 2009 (2 commits)
    • exec: do not sleep in TASK_TRACED under ->cred_guard_mutex · a2a8474c
      Oleg Nesterov committed
      Tom Horsley reports that his debugger hangs when it tries to read
      /proc/pid_of_tracee/maps; this happens since
      
      	"mm_for_maps: take ->cred_guard_mutex to fix the race with exec"
      	04b836cbf19e885f8366bccb2e4b0474346c02d
      
      commit in 2.6.31.
      
      But the root of the problem lies in the fact that the do_execve() path
      calls tracehook_report_exec(), which can stop if the tracer sets
      PT_TRACE_EXEC.
      
      The tracee must not sleep in TASK_TRACED holding this mutex.  Even if we
      remove ->cred_guard_mutex from mm_for_maps() and proc_pid_attr_write(),
      another task doing PTRACE_ATTACH should not hang until it is killed or the
      tracee resumes.
      
      With this patch do_execve() does not use ->cred_guard_mutex directly and
      we do not hold it throughout, instead:
      
      	- introduce prepare_bprm_creds() helper, it locks the mutex
      	  and calls prepare_exec_creds() to initialize bprm->cred.
      
      	- install_exec_creds() drops the mutex after commit_creds(),
      	  and thus before tracehook_report_exec()->ptrace_stop().
      
      	  or, if exec fails,
      
      	  free_bprm() drops this mutex when bprm->cred != NULL which
      	  indicates install_exec_creds() was not called.
      Reported-by: Tom Horsley <tom.horsley@att.net>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: David Howells <dhowells@redhat.com>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: James Morris <jmorris@namei.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
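
      A sketch of the prepare_bprm_creds() helper described above, closely
      following the commit's own outline (details recalled, so treat as
      approximate):

        #include <linux/binfmts.h>
        #include <linux/cred.h>

        static int prepare_bprm_creds(struct linux_binprm *bprm)
        {
                /* Take the mutex here; install_exec_creds() or free_bprm()
                 * drops it, so we never sleep in TASK_TRACED under it. */
                if (mutex_lock_interruptible(&current->cred_guard_mutex))
                        return -ERESTARTNOINTR;

                bprm->cred = prepare_exec_creds();
                if (likely(bprm->cred))
                        return 0;

                mutex_unlock(&current->cred_guard_mutex);
                return -ENOMEM;
        }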
    • workqueues: introduce __cancel_delayed_work() · 4e49627b
      Oleg Nesterov committed
      cancel_delayed_work() has to use del_timer_sync() to guarantee the timer
      function is not running after return.  But most users don't actually
      need this, and del_timer_sync() has problems: it is not usable from
      interrupt context, and it depends on every lock which could be taken from irq.
      
      Introduce __cancel_delayed_work() which calls del_timer() instead.
      
      The immediate reason for this patch is
      http://bugzilla.kernel.org/show_bug.cgi?id=13757
      but hopefully this helper makes sense anyway.
      
      As for the 13757 bug, what we actually need is requeue_delayed_work(),
      but its semantics are not yet clear.
      
      Merge this patch early to resolve cross-tree interdependencies between
      input and infiniband.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
      Cc: Roland Dreier <rdreier@cisco.com>
      Cc: Stefan Richter <stefanr@s5r6.in-berlin.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
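
      A sketch of the helper consistent with the description: del_timer()
      instead of del_timer_sync(), so it is irq-safe, but the timer callback
      may still be running on return:

        #include <linux/workqueue.h>

        static inline int __cancel_delayed_work(struct delayed_work *work)
        {
                int ret;

                /* del_timer() never sleeps and is safe from interrupt
                 * context; it does not wait for a running callback. */
                ret = del_timer(&work->timer);
                if (ret)
                        work_clear_pending(&work->work);
                return ret;
        }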
  8. 05 Sep 2009 (4 commits)
    • ring-buffer: only enable ring_buffer_swap_cpu when needed · 85bac32c
      Steven Rostedt committed
      Since the ability to swap the cpu buffers adds a small overhead to
      the recording of a trace, we only want to add it when needed.
      
      Only the irqsoff and preemptoff tracers use this feature, and neither is
      recommended for production kernels. This patch disables its use
      when neither irqsoff nor preemptoff is configured.
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
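
      A sketch of the compile-time gating (the config symbol name is my
      assumption of how the gate is spelled):

        /* In the ring buffer header: only latency tracers select the swap
         * capability; everyone else gets a stub. */
        #ifdef CONFIG_RING_BUFFER_ALLOW_SWAP
        int ring_buffer_swap_cpu(struct ring_buffer *buffer_a,
                                 struct ring_buffer *buffer_b, int cpu);
        #else
        static inline int
        ring_buffer_swap_cpu(struct ring_buffer *buffer_a,
                             struct ring_buffer *buffer_b, int cpu)
        {
                return -ENODEV;
        }
        #endif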
    • tracing: pass around ring buffer instead of tracer · e77405ad
      Steven Rostedt committed
      The latency tracers (irqsoff and wakeup) can swap trace buffers
      on the fly. If an event is happening and has reserved data on one of
      the buffers, and the latency tracer swaps the global buffer with the
      max buffer, the result is that the event may commit the data to the
      wrong buffer.
      
      This patch changes the trace recording API to receive the buffer that
      was used to reserve the event. This buffer can then be passed in to
      the commit.
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
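
      The resulting pattern, sketched with approximate function names: the
      buffer pointer is sampled once and carried from reserve to commit, so
      a concurrent buffer swap cannot misdirect the commit.

        /* Excerpt-style sketch from within kernel/trace context. */
        static void record_event_sketch(struct trace_array *tr,
                                        unsigned long flags, int pc)
        {
                /* Sample the buffer pointer once... */
                struct ring_buffer *buffer = tr->buffer;
                struct ring_buffer_event *event;

                event = trace_buffer_lock_reserve(buffer, TRACE_FN,
                                                  sizeof(struct ftrace_entry),
                                                  flags, pc);
                if (!event)
                        return;
                /* ...fill in the reserved entry here... */

                /* ...and commit to the same buffer that was reserved on. */
                trace_buffer_unlock_commit(buffer, event, flags, pc);
        }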
    • dm log: userspace add luid to distinguish between concurrent log instances · 7ec23d50
      Jonathan Brassow committed
      Device-mapper userspace logs (like the clustered log) are
      identified by a universally unique identifier (UUID).  This
      identifier is used to associate requests from the kernel to
      a specific log in userspace.  The UUID must be unique everywhere,
      since multiple machines may use this identifier when communicating
      about a particular log, as is the case for cluster logs.
      
      Sometimes, device-mapper/LVM may re-use a UUID.  This is the
      case during pvmoves, when moving from one segment of an LV
      to another, or when resizing a mirror, etc.  In these cases,
      a new log is created with the same UUID and loaded in the
      "inactive" slot.  When a device-mapper "resume" is issued,
      the "live" table is deactivated and the new "inactive" table
      becomes "live".  (The "inactive" table can also be removed
      via a device-mapper 'clear' command.)
      
      The above two issues were colliding.  More than one log was being
      created with the same UUID, and there was no way to distinguish
      between them.  So, sometimes the wrong log would be swapped
      out during the exchange.
      
      The solution is to create a locally unique identifier,
      'luid', to go along with the UUID.  This new identifier is used
      to determine exactly which log is being referenced by the kernel
      when the log exchange is made.  The identifier is not
      universally safe, but it does not need to be, since
      create/destroy/suspend/resume operations are bound to a specific
      machine; and these are the operations that make up the exchange.
      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
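
      Sketched shape of the identifier pair in the userspace log request
      (field names and sizes approximate; remaining fields omitted):

        #include <stdint.h>

        struct dm_ulog_request_sketch {
                uint64_t luid;          /* locally unique instance id (new) */
                char uuid[129];         /* universally unique id; may be
                                         * reused across pvmove/resize */
                /* request type, sequence number, payload, etc. omitted */
        };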
    • dm stripe: expose correct io hints · 40bea431
      Mike Snitzer committed
      Set sensible I/O hints for striped DM devices in the topology
      infrastructure added for 2.6.31 for userspace tools to
      obtain via sysfs.
      
      Add .io_hints to 'struct target_type' to allow the I/O hints portion
      (io_min and io_opt) of the 'struct queue_limits' to be set by each
      target and implement this for dm-stripe.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
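
      A plausible .io_hints implementation for dm-stripe under the above
      description (the stripe_c field names are assumed from context): io_min
      is one chunk, io_opt a full stripe width.

        #include <linux/blkdev.h>
        #include <linux/device-mapper.h>

        static void stripe_io_hints(struct dm_target *ti,
                                    struct queue_limits *limits)
        {
                struct stripe_c *sc = ti->private;
                unsigned int chunk_size = (sc->chunk_mask + 1) << 9;

                blk_limits_io_min(limits, chunk_size);
                blk_limits_io_opt(limits, chunk_size * sc->stripes);
        }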
  9. 04 Sep 2009 (6 commits)
    • ring-buffer: remove ring_buffer_event_discard · dc892f73
      Steven Rostedt committed
      The function ring_buffer_event_discard can be used on any item in the
      ring buffer, even after the item was committed. This function provides
      no safety nets and is very race prone.
      
      An item may be safely removed from the ring buffer before it is
      committed, using ring_buffer_discard_commit().

      Since there are currently no users of this function, and because this
      function is racy and error prone, this patch removes it altogether.
      
      Note, removing this function also allows the counters to ignore
      all discarded events (patches will follow).
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
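
      The surviving, safe pattern sketched below; the entry type and the
      filter predicate are hypothetical:

        #include <linux/ring_buffer.h>

        static void reserve_or_discard(struct ring_buffer *buffer)
        {
                struct ring_buffer_event *event;
                struct my_entry *entry;        /* hypothetical payload type */

                event = ring_buffer_lock_reserve(buffer, sizeof(*entry));
                if (!event)
                        return;
                entry = ring_buffer_event_data(event);
                /* fill in *entry ... */
                if (filter_says_drop(entry))   /* hypothetical predicate */
                        /* discarding before commit is the safe case */
                        ring_buffer_discard_commit(buffer, event);
                else
                        ring_buffer_unlock_commit(buffer, event);
        }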
    • kmemleak: Don't scan uninitialized memory when kmemcheck is enabled · 8e019366
      Pekka Enberg committed
      Ingo Molnar reported the following kmemcheck warning when running with
      both kmemleak and kmemcheck enabled:
      
        PM: Adding info for No Bus:vcsa7
        WARNING: kmemcheck: Caught 32-bit read from uninitialized memory
        (f6f6e1a4)
        d873f9f600000000c42ae4c1005c87f70000000070665f666978656400000000
         i i i i u u u u i i i i i i i i i i i i i i i i i i i i i u u u
                 ^
      
        Pid: 3091, comm: kmemleak Not tainted (2.6.31-rc7-tip #1303) P4DC6
        EIP: 0060:[<c110301f>] EFLAGS: 00010006 CPU: 0
        EIP is at scan_block+0x3f/0xe0
        EAX: f40bd700 EBX: f40bd780 ECX: f16b46c0 EDX: 00000001
        ESI: f6f6e1a4 EDI: 00000000 EBP: f10f3f4c ESP: c2605fcc
         DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
        CR0: 8005003b CR2: e89a4844 CR3: 30ff1000 CR4: 000006f0
        DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
        DR6: ffff4ff0 DR7: 00000400
         [<c110313c>] scan_object+0x7c/0xf0
         [<c1103389>] kmemleak_scan+0x1d9/0x400
         [<c1103a3c>] kmemleak_scan_thread+0x4c/0xb0
         [<c10819d4>] kthread+0x74/0x80
         [<c10257db>] kernel_thread_helper+0x7/0x3c
         [<ffffffff>] 0xffffffff
        kmemleak: 515 new suspected memory leaks (see
        /sys/kernel/debug/kmemleak)
        kmemleak: 42 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
      
      The problem here is that kmemleak will scan partially initialized
      objects, which makes kmemcheck complain. Fix that up by skipping
      uninitialized memory regions when kmemcheck is enabled.
      Reported-by: Ingo Molnar <mingo@elte.hu>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
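
      The fix, sketched (the kmemcheck query helper's name is recalled from
      this series; treat it as approximate): kmemleak's pointer scan skips
      words that kmemcheck still tracks as uninitialized.

        /* Excerpt from kmemleak's scan loop over a candidate block: */
        for (ptr = start; ptr < end; ptr++) {
                if (!kmemcheck_is_obj_initialized((unsigned long)ptr,
                                                  BYTES_PER_POINTER))
                        continue;       /* don't read uninitialized words */
                /* ... normal pointer-candidate handling ... */
        }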
    • sched: Turn on SD_BALANCE_NEWIDLE · 840a0653
      Ingo Molnar committed
      Start the re-tuning of the balancer by turning on newidle.
      
      It improves hackbench performance and parallelism on a 4x4 box.
      The "perf stat --repeat 10" measurements give us:
      
        domain0             domain1
        .......................................
       -SD_BALANCE_NEWIDLE -SD_BALANCE_NEWIDLE:
         2041.273208  task-clock-msecs         #      9.354 CPUs    ( +-   0.363% )
      
       +SD_BALANCE_NEWIDLE -SD_BALANCE_NEWIDLE:
         2086.326925  task-clock-msecs         #     11.934 CPUs    ( +-   0.301% )
      
       +SD_BALANCE_NEWIDLE +SD_BALANCE_NEWIDLE:
         2115.289791  task-clock-msecs         #     12.158 CPUs    ( +-   0.263% )
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
      Cc: Gautham R Shenoy <ego@in.ibm.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: Clean up topology.h · 47734f89
      Ingo Molnar committed
      Re-organize the flag settings so that it's visible at a glance
      which sched-domains flags are set and which are not.

      With the new balancer code we'll need to re-tune these details
      anyway, so make it cleaner so that we make fewer mistakes down the
      road ;-)
      
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
      Cc: Gautham R Shenoy <ego@in.ibm.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
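
      The cleanup's style, as an illustrative excerpt (the exact flag set
      and values here are placeholders): every flag appears exactly once,
      prefixed with 1 (set) or 0 (clear), so the configuration reads off at
      a glance.

        /* Excerpt shape of a sched-domain initializer after the cleanup: */
        .flags = 1*SD_LOAD_BALANCE
               | 1*SD_BALANCE_NEWIDLE
               | 1*SD_BALANCE_EXEC
               | 1*SD_BALANCE_FORK
               | 0*SD_WAKE_IDLE
               | 1*SD_WAKE_AFFINE
               | 0*SD_SHARE_CPUPOWER
               | 0*SD_SERIALIZE
               ,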
    • sched: Remove reciprocal for cpu_power · 18a3885f
      Peter Zijlstra committed
      It's a source of failure; also, now that cpu_power is dynamic,
      it's a waste of time.
      
      before:
      <idle>-0   [000]   132.877936: find_busiest_group: avg_load: 0 group_load: 8241 power: 1
      
      after:
      bash-1689  [001]   137.862151: find_busiest_group: avg_load: 10636288 group_load: 10387 power: 1
      
      [ v2: build fix from Andreas Herrmann ]
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Tested-by: Andreas Herrmann <andreas.herrmann3@amd.com>
      Acked-by: Andreas Herrmann <andreas.herrmann3@amd.com>
      Acked-by: Gautham R Shenoy <ego@in.ibm.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      LKML-Reference: <20090901083826.425896304@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: Scale down cpu_power due to RT tasks · e9e9250b
      Peter Zijlstra committed
      Keep an average of the amount of time spent on RT tasks and use
      that fraction to scale down the cpu_power for regular tasks.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Tested-by: Andreas Herrmann <andreas.herrmann3@amd.com>
      Acked-by: Andreas Herrmann <andreas.herrmann3@amd.com>
      Acked-by: Gautham R Shenoy <ego@in.ibm.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      LKML-Reference: <20090901083826.287778431@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
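
      A sketch of the scaling idea (function and field names recalled from
      this series, so treat them as approximate): the fraction of recent
      time eaten by RT tasks is removed from the power available to regular
      tasks.

        /* Excerpt-style sketch from within kernel/sched.c context. */
        static unsigned long scale_rt_power(int cpu)
        {
                struct rq *rq = cpu_rq(cpu);
                u64 total, available;

                /* Window = averaging period plus time since last decay. */
                total = sched_avg_period() + (rq->clock - rq->age_stamp);
                /* Time in that window not consumed by RT tasks. */
                available = total - rq->rt_avg;

                if (unlikely((s64)total < SCHED_LOAD_SCALE))
                        total = SCHED_LOAD_SCALE;

                total >>= SCHED_LOAD_SHIFT;

                /* Fraction of non-RT time, scaled by SCHED_LOAD_SCALE. */
                return div_u64(available, total);
        }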