1. 14 Dec, 2009 1 commit
    • tracing: Extract duplicate ftrace_raw_init_event_foo() · 87d9b4e1
      Li Zefan committed
      Use a generic trace_event_raw_init() function for the raw_init
      callbacks of all events (except kprobes) instead of defining an
      identical version for each of them.
      This shrinks the kernel code:
      
         text    data     bss      dec     hex filename
      5355293 1961928 7103260 14420481  dc0a01 vmlinux.o.old
      5346802 1961864 7103260 14411926  dbe896 vmlinux.o
      
      raw_init can't be removed, because ftrace events and kprobe events
      use different raw_init callbacks. Though it's possible to totally
      remove raw_init, I choose to leave it as it is for now.
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      Acked-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      LKML-Reference: <4B1DC48C.7080603@cn.fujitsu.com>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
  2. 10 Dec, 2009 2 commits
    • tracing: Add full state to trace_seq · d184b31c
      Johannes Berg committed
      The trace_seq buffer might fill up, and right now one needs to check the
      return value of each printf into the buffer to check for that.
      
      Instead, have the buffer keep track of whether it is full or not, and
      reject more input if it is full or would have overflowed with an input
      that wasn't added.
      
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
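The "full state" idea described in this commit can be illustrated with a small user-space sketch. This is not the kernel's trace_seq code: the names `seq_sketch`, `seq_init` and `seq_printf_sketch` and the tiny buffer size are assumptions made for illustration. It shows a buffer that remembers it has overflowed and rejects all later input, so callers do not have to check every individual write.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

#define SEQ_SIZE 32 /* hypothetical size, tiny so overflow is easy to trigger */

struct seq_sketch {
	char buf[SEQ_SIZE];
	size_t len;
	int full; /* sticky: once set, all further input is rejected */
};

static void seq_init(struct seq_sketch *s)
{
	s->len = 0;
	s->full = 0;
	s->buf[0] = '\0';
}

/* Returns 1 if the text was appended, 0 if it was rejected. */
static int seq_printf_sketch(struct seq_sketch *s, const char *fmt, ...)
{
	va_list ap;
	size_t room;
	int n;

	if (s->full)
		return 0;

	room = SEQ_SIZE - 1 - s->len; /* one byte is reserved for the NUL */
	va_start(ap, fmt);
	n = vsnprintf(s->buf + s->len, room + 1, fmt, ap);
	va_end(ap);

	if (n < 0 || (size_t)n > room) {
		s->buf[s->len] = '\0'; /* discard the partial output */
		s->full = 1;
		return 0;
	}
	s->len += (size_t)n;
	return 1;
}
```

With this shape, a caller can issue a series of writes and check `full` once at the end; a write that would have fit is still rejected after an earlier overflow, so the output never has a hole in the middle.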
    • tracing: Buffer the output of seq_file in case of filled buffer · a63ce5b3
      Steven Rostedt committed
      If the seq_read fills the buffer it will call s_start again on the next
      iteration with the same position. This causes a problem with the
      function_graph tracer because it consumes the iteration in order to
      determine leaf functions.
      
      What happens is that the iterator stores the entry, and the function
      graph plugin will look at the next entry. If that next entry is a return
      of the same function and task, then the function is a leaf and the
      function_graph plugin calls ring_buffer_read which moves the ring buffer
      iterator forward (the trace iterator still points to the function start
      entry).
      
      The copying of the trace_seq to the seq_file buffer will fail if the
      seq_file buffer is full. The seq_read will not show this entry.
      The next read by userspace will cause seq_read to again call s_start
      which will reuse the trace iterator entry (the function start entry).
      But the function return entry was already consumed. The function graph
      plugin will think that this entry is a nested function and not a leaf.
      
      To solve this, the trace code now checks the return status of the
      seq_printf (trace_print_seq). If the writing to the seq_file buffer
      fails, we set a flag in the iterator (leftover) and we do not reset
      the trace_seq buffer. On the next call to s_start, we check the leftover
      flag, and if it is set, we just reuse the trace_seq buffer and do not
      call into the plugin print functions.
      
      Before this patch:
      
       2)               |      fput() {
       2)               |        __fput() {
       2)   0.550 us    |          inotify_inode_queue_event();
       2)               |          __fsnotify_parent() {
       2)   0.540 us    |          inotify_dentry_parent_queue_event();
      
      After the patch:
      
       2)               |      fput() {
       2)               |        __fput() {
       2)   0.550 us    |          inotify_inode_queue_event();
       2)   0.548 us    |          __fsnotify_parent();
       2)   0.540 us    |          inotify_dentry_parent_queue_event();
      
      [
        Updated the patch to fix a missing return 0 from the trace_print_seq()
        stub when CONFIG_TRACING is disabled.
        Reported-by: Ingo Molnar <mingo@elte.hu>
      ]
      Reported-by: Jiri Olsa <jolsa@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
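The leftover handling described in this commit can be sketched in plain C. Everything here is a simplified stand-in rather than the kernel code: `out_buf` plays the role of the seq_file buffer, `iter_sketch` the trace iterator with its trace_seq, and `entries_formatted` counts how often the (hypothetical) plugin print path runs. The point is that a failed copy keeps the already formatted text and sets the flag, so the next read reuses it instead of formatting the entry a second time.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define OUT_SIZE 16 /* hypothetical, small enough to fill quickly */

struct out_buf { /* stands in for the seq_file buffer */
	char data[OUT_SIZE];
	size_t len;
};

struct iter_sketch { /* stands in for the trace iterator */
	char seq[OUT_SIZE];    /* formatted text, like the trace_seq */
	int leftover;          /* set when seq could not be copied out */
	int entries_formatted; /* how many times the "plugin" ran */
};

/* Copy s into the output buffer; fail without copying if it does not fit. */
static int out_write(struct out_buf *o, const char *s)
{
	size_t n = strlen(s);

	if (o->len + n > sizeof(o->data))
		return -1;
	memcpy(o->data + o->len, s, n);
	o->len += n;
	return 0;
}

/* Show one entry; returns 0 on success, -1 if the output buffer was full. */
static int show_entry(struct iter_sketch *it, struct out_buf *o,
		      const char *text)
{
	if (!it->leftover) {
		/* In the real tracer this step may consume iterator state. */
		snprintf(it->seq, sizeof(it->seq), "%s", text);
		it->entries_formatted++;
	}
	if (out_write(o, it->seq) < 0) {
		it->leftover = 1; /* keep it->seq for the next read */
		return -1;
	}
	it->leftover = 0;
	it->seq[0] = '\0';
	return 0;
}
```

On the retry the `text` argument is ignored on purpose: the saved buffer is the only copy of the entry that is guaranteed to match what was consumed from the ring buffer.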
  3. 07 Dec, 2009 2 commits
  4. 06 Dec, 2009 2 commits
  5. 04 Dec, 2009 12 commits
  6. 03 Dec, 2009 21 commits
    • writeback: introduce wbc.for_background · b17621fe
      Wu Fengguang committed
      It will lower the flush priority for NFS, and maybe more in future.
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • GFS2: Tag all metadata with jid · 0ab7d13f
      Steven Whitehouse committed
      There are two spare fields in the header common to all GFS2
      metadata. One is just the right size to fit a journal id
      in it, and this patch updates the journal code so that each
      time a metadata block is modified, we tag it with the journal
      id of the node which is performing the modification.
      
      The reason for this is that it should make it much easier to
      debug issues which arise if we can tell which node was the
      last to modify a particular metadata block.
      
      Since the field is updated before the block is written into
      the journal, each journal should only contain metadata which
      is tagged with its own journal id. The one exception to this
      is the journal header block, which might have a different node's
      id in it, if that journal was recovered by another node in the
      cluster.
      
      Thus each journal will contain a record of which nodes recovered
      it, via the journal header.
      
      The other field in the metadata header could potentially be
      used to hold information about what kind of operation was
      performed, but for the time being we just zero it on each
      transaction so that if we use it for that in future, we'll
      know that the information (where it exists) is reliable.
      
      I did consider using the other field to hold the journal
      sequence number, however since in GFS2's journaling we write
      the modified data into the journal and not the original
      data, this gives no information as to what action caused the
      modification, so I think we can probably come up with a better
      use for those 64 bits in the future.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
    • VFS: Export dquot_send_warning · 86e931a3
      Steven Whitehouse committed
      Sending a message to userspace in a generic format to warn
      of events (e.g. quota exceeded) in the quota subsystem is
      a generically useful feature. This patch makes some minor
      changes to the send_message function from dquot.c renaming
      it quota_send_message, moving it to quota.c and exporting it
      for use by filesystems which do not use the dquot code.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
    • VFS: Add forget_all_cached_acls() · 796bd952
      Steven Whitehouse committed
      This is required for cluster filesystems which want to use
      cached ACLs so that they can invalidate the cache when
      required.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Alexander Viro <aviro@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
    • block: Allow devices to indicate whether discarded blocks are zeroed · 98262f27
      Martin K. Petersen committed
      The discard ioctl is used by mkfs utilities to clear a block device
      prior to putting metadata down.  However, not all devices return zeroed
      blocks after a discard.  Some drives return stale data, potentially
      containing old superblocks.  It is therefore important to know whether
      discarded blocks are properly zeroed.
      
      Both ATA and SCSI drives have configuration bits that indicate whether
      zeroes are returned after a discard operation.  Implement a block level
      interface that allows this information to be bubbled up the stack and
      queried via a new block device ioctl.
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • libata: add translation for SCSI WRITE SAME (aka TRIM support) · 18f0f978
      Christoph Hellwig committed
      Add support for the ATA TRIM command in libata.  We translate a WRITE SAME 16
      command with the unmap bit set into an ATA TRIM command and export enough
      information in READ CAPACITY 16 and the block limits EVPD page so that the new
      SCSI layer discard support will drive this for us.
      
      Note that I hardcode the WRITE_SAME_16 opcode for now, as the patch to introduce
      the symbolic name is not in 2.6.32 yet but only in the SCSI tree. As soon as it
      is merged we can fix this up to properly use the symbolic name.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
    • libata: retry failed FLUSH if device didn't fail it · 6013efd8
      Tejun Heo committed
      If an ATA device fails a FLUSH, it means that the device failed to write
      out some amount of data and the error needs to be reported to upper
      layers. As retries can't recover the lost data, FLUSH failures need to
      be reported immediately in general.
      
      However, if FLUSH fails due to transmission errors, the FLUSH needs to
      be retried; otherwise, filesystems may switch to RO mode and/or a RAID
      array may drop a drive because of a random transmission glitch.
      
      This condition is fairly easy to reproduce with ext4 on certain ahci
      controllers which go through a PHY event after a powersave mode
      switch.  The powersave mode switch is often closely followed by a
      flush from the filesystem, and the FLUSH then fails with an ATA bus
      error, which makes the filesystem code believe that data is lost and
      drop to RO mode.  This was reported in the following bugzilla bug.
      
        http://bugzilla.kernel.org/show_bug.cgi?id=14543
      
      This patch makes libata EH retry FLUSH if it wasn't failed by the
      device.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Andrey Vihrov <andrey.vihrov@gmail.com>
      Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
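The retry rule in this commit message boils down to one decision: a FLUSH that the device itself failed is reported, while a FLUSH that failed for another reason (such as a transmission error) is retried. A minimal sketch of that decision follows. The error bits, the `flush_action` enum and the `tries_left` bound are hypothetical and are not libata's actual identifiers; the retry limit in particular is an assumption, since the message does not say how many retries are attempted.

```c
#include <assert.h>

/* Hypothetical error bits, for illustration only. */
enum {
	ERR_DEV     = 1 << 0, /* the device itself reported the failure */
	ERR_BUS     = 1 << 1, /* transmission (bus) error */
	ERR_TIMEOUT = 1 << 2
};

enum flush_action {
	FLUSH_OK,
	FLUSH_RETRY,
	FLUSH_REPORT_FAILURE
};

/* Decide what error handling should do with a completed FLUSH. */
static enum flush_action flush_eh_decide(unsigned int err_mask, int tries_left)
{
	if (!err_mask)
		return FLUSH_OK;
	if (err_mask & ERR_DEV)
		return FLUSH_REPORT_FAILURE; /* data is lost, retrying cannot help */
	if (tries_left > 0)
		return FLUSH_RETRY;          /* a glitch on the wire, try again */
	return FLUSH_REPORT_FAILURE;         /* assumed limit: give up eventually */
}
```

A device error wins over any other bit that is set at the same time, which matches the message's point that lost data must be reported immediately.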
    • KVM: s390: Make psw available on all exits, not just a subset · d7b0b5eb
      Carsten Otte committed
      This patch moves s390 processor status word into the base kvm_run
      struct and keeps it up to date on all userspace exits.
      
      The userspace ABI is broken by this, however there are no applications
      in the wild using this.  A capability check is provided so users can
      verify the updated API exists.
      
      Cc: stable@kernel.org
      Signed-off-by: Carsten Otte <cotte@de.ibm.com>
      Signed-off-by: Avi Kivity <avi@redhat.com>
    • KVM: x86: Add KVM_GET/SET_VCPU_EVENTS · 3cfc3092
      Jan Kiszka committed
      This new IOCTL exports all state related to exceptions, interrupts,
      and NMIs that was not yet visible to user space. Together with
      appropriate user space changes, this fixes sporadic problems of
      vmsave/restore, live migration and system reset.
      
      [avi: future-proof abi by adding a flags field]
      Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
      Signed-off-by: Avi Kivity <avi@redhat.com>
    • KVM: VMX: Report unexpected simultaneous exceptions as internal errors · 65ac7264
      Avi Kivity committed
      These happen when we trap an exception when another exception is being
      delivered; we only expect these with MCEs and page faults.  If something
      unexpected happens, things probably went south and we're better off reporting
      an internal error and freezing.
      Signed-off-by: Avi Kivity <avi@redhat.com>
    • KVM: Allow internal errors reported to userspace to carry extra data · a9c7399d
      Avi Kivity committed
      Usually userspace will freeze the guest so we can inspect it, but some
      internal state is not available.  Add extra data to internal error
      reporting so we can expose it to the debugger.  Extra data is specific
      to the suberror.
      Signed-off-by: Avi Kivity <avi@redhat.com>
    • KVM: Reorder IOCTLs in main kvm.h · c54d2aba
      Jan Kiszka committed
      Obviously, people tend to extend this header at the bottom - more or
      less blindly. Ensure that deprecated stuff gets its own corner again by
      moving things to the top. Also add some comments and reindent IOCTLs to
      make them more readable and reduce the risk of number collisions.
      Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
      Signed-off-by: Avi Kivity <avi@redhat.com>
    • KVM: allow userspace to adjust kvmclock offset · afbcf7ab
      Glauber Costa committed
      When we migrate a kvm guest that uses pvclock between two hosts, we may
      suffer a large skew. This is because there can be significant differences
      between the monotonic clock of the hosts involved. When a new host with
      a much larger monotonic time starts running the guest, the view of time
      will be significantly impacted.
      
      Situation is much worse when we do the opposite, and migrate to a host with
      a smaller monotonic clock.
      
      This proposed ioctl allows userspace to tell us the monotonic clock
      value on the source host, so that we can keep the time skew small and,
      more importantly, make sure the clock never goes backwards. Userspace
      may also need to retrieve the current value, since from the first
      migration onwards it is no longer reflected by a simple call to
      clock_gettime().
      
      [marcelo: future-proof abi with a flags field]
      [jan: fix KVM_GET_CLOCK by clearing flags field instead of checking it]
      Signed-off-by: Glauber Costa <glommer@redhat.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Avi Kivity <avi@redhat.com>
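The arithmetic behind the clock adjustment can be shown with a small sketch. It assumes a simplified model in which the guest-visible clock is the host monotonic clock plus a per-VM offset; the names `vm_clock`, `vm_get_clock` and `vm_set_clock` are hypothetical and do not correspond to the ioctl interface itself. Setting the clock on the destination host recomputes the offset so that the guest continues from the value saved on the source host.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified model: guest clock = host monotonic clock + per-VM offset. */
struct vm_clock {
	int64_t offset_ns;
};

/* Guest-visible time for a given host monotonic reading. */
static int64_t vm_get_clock(const struct vm_clock *c, int64_t host_mono_ns)
{
	return host_mono_ns + c->offset_ns;
}

/* Make the guest clock read guest_ns at the moment the host reads host_mono_ns. */
static void vm_set_clock(struct vm_clock *c, int64_t host_mono_ns,
			 int64_t guest_ns)
{
	c->offset_ns = guest_ns - host_mono_ns;
}
```

For example, if the guest clock read 1000 s on the source and the destination host has only been up for 50 s, the offset becomes 950 s, so the guest sees 1000 s right after migration and 1010 s ten seconds later, rather than jumping back to 50 s.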
    • KVM: Xen PV-on-HVM guest support · ffde22ac
      Ed Swierk committed
      Support for Xen PV-on-HVM guests can be implemented almost entirely in
      userspace, except for handling one annoying MSR that maps a Xen
      hypercall blob into guest address space.
      
      A generic mechanism to delegate MSR writes to userspace seems overkill
      and risks encouraging similar MSR abuse in the future.  Thus this patch
      adds special support for the Xen HVM MSR.
      
      I implemented a new ioctl, KVM_XEN_HVM_CONFIG, that lets userspace tell
      KVM which MSR the guest will write to, as well as the starting address
      and size of the hypercall blobs (one each for 32-bit and 64-bit) that
      userspace has loaded from files.  When the guest writes to the MSR, KVM
      copies one page of the blob from userspace to the guest.
      
      I've tested this patch with a hacked-up version of Gerd's userspace
      code, booting a number of guests (CentOS 5.3 i386 and x86_64, and
      FreeBSD 8.0-RC1 amd64) and exercising PV network and block devices.
      
      [jan: fix i386 build warning]
      [avi: future proof abi with a flags field]
      Signed-off-by: Ed Swierk <eswierk@aristanetworks.com>
      Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Avi Kivity <avi@redhat.com>
    • KVM: introduce kvm_vcpu_on_spin · d255f4f2
      Zhai, Edwin committed
      Introduce kvm_vcpu_on_spin, to be used by VMX/SVM to yield processing
      once the cpu detects pause-based looping.
      Signed-off-by: "Zhai, Edwin" <edwin.zhai@intel.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
    • KVM: Activate Virtualization On Demand · 10474ae8
      Alexander Graf committed
      X86 CPUs need to have some magic happening to enable the virtualization
      extensions on them. This magic can result in unpleasant results for
      users, like blocking other VMMs from working (vmx) or using invalid TLB
      entries (svm).
      
      Currently KVM activates virtualization when the respective kernel module
      is loaded. This blocks us from autoloading KVM modules without breaking
      other VMMs.
      
      To circumvent this problem at least a bit, this patch introduces
      on-demand activation of virtualization. This means that virtualization
      is instead enabled on creation of the first virtual machine and
      disabled on destruction of the last one.
      
      So using this, KVM can be easily autoloaded, while keeping other
      hypervisors usable.
      Signed-off-by: Alexander Graf <agraf@suse.de>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Avi Kivity <avi@redhat.com>
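The on-demand scheme described in this commit is essentially a usage counter: the hardware is switched on when the count goes from 0 to 1 and switched off when it returns to 0. The sketch below shows only that counting logic. The names are hypothetical, `hw_enabled` is a plain flag standing in for the real per-CPU enable step, and the locking that real code would need around the counter is left out.

```c
#include <assert.h>

static int usage_count; /* number of live virtual machines */
static int hw_enabled;  /* stands in for the real hardware state */

static void hw_enable(void)
{
	hw_enabled = 1;
}

static void hw_disable(void)
{
	hw_enabled = 0;
}

/* Called when a VM is created: the first one turns virtualization on. */
static void vm_created(void)
{
	if (++usage_count == 1)
		hw_enable();
}

/* Called when a VM is destroyed: the last one turns virtualization off. */
static void vm_destroyed(void)
{
	assert(usage_count > 0);
	if (--usage_count == 0)
		hw_disable();
}
```

Loading the module touches neither the counter nor the hardware, which is what leaves other hypervisors usable until a VM is actually created.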
    • KVM: Move assigned device code to own file · bfd99ff5
      Avi Kivity committed
      Signed-off-by: Avi Kivity <avi@redhat.com>
    • KVM: Move irq ack notifier list to arch independent code · 136bdfee
      Gleb Natapov committed
      Mask irq notifier list is already there.
      Signed-off-by: Gleb Natapov <gleb@redhat.com>
      Signed-off-by: Avi Kivity <avi@redhat.com>
    • KVM: Maintain back mapping from irqchip/pin to gsi · 3e71f88b
      Gleb Natapov committed
      Maintain a back mapping from irqchip/pin to gsi to speed up
      interrupt acknowledgment notifications.
      
      [avi: build fix on non-x86/ia64]
      Signed-off-by: Gleb Natapov <gleb@redhat.com>
      Signed-off-by: Avi Kivity <avi@redhat.com>
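A back mapping of this kind can be pictured as a two-dimensional table indexed by chip and pin, so that an acknowledgment on a given pin finds its gsi with one array access. The sketch below is an assumption about the general shape only: the sizes `NR_CHIPS` and `NR_PINS`, the struct and the function names are hypothetical, and -1 is used here to mark pins with no gsi.

```c
#include <assert.h>

#define NR_CHIPS 3  /* hypothetical number of interrupt chips */
#define NR_PINS  24 /* hypothetical number of pins per chip */

struct irq_backmap {
	int gsi[NR_CHIPS][NR_PINS]; /* -1 means "no gsi routed to this pin" */
};

static void backmap_init(struct irq_backmap *m)
{
	unsigned int c, p;

	for (c = 0; c < NR_CHIPS; c++)
		for (p = 0; p < NR_PINS; p++)
			m->gsi[c][p] = -1;
}

/* Record that gsi is routed to (chip, pin); returns -1 if out of range. */
static int backmap_set(struct irq_backmap *m, unsigned int chip,
		       unsigned int pin, int gsi)
{
	if (chip >= NR_CHIPS || pin >= NR_PINS)
		return -1;
	m->gsi[chip][pin] = gsi;
	return 0;
}

/* Constant-time lookup used on acknowledgment; -1 if nothing is mapped. */
static int backmap_gsi(const struct irq_backmap *m, unsigned int chip,
		       unsigned int pin)
{
	if (chip >= NR_CHIPS || pin >= NR_PINS)
		return -1;
	return m->gsi[chip][pin];
}
```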
    • KVM: Change irq routing table to use gsi indexed array · 46e624b9
      Gleb Natapov committed
      Use gsi indexed array instead of scanning all entries on each interrupt
      injection.
      Signed-off-by: Gleb Natapov <gleb@redhat.com>
      Signed-off-by: Avi Kivity <avi@redhat.com>
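The difference between scanning all routing entries and indexing by gsi can be sketched as follows. This is an illustrative data structure, not the one the patch introduces: `MAX_GSI`, `MAX_PER_GSI`, `route` and `route_table` are hypothetical names, and a small fixed per-gsi array is used where real code might use lists.

```c
#include <assert.h>
#include <string.h>

#define MAX_GSI     8 /* hypothetical table size */
#define MAX_PER_GSI 2 /* hypothetical limit of routes per gsi */

struct route {
	int chip;
	int pin;
};

struct route_table {
	struct route map[MAX_GSI][MAX_PER_GSI]; /* routes grouped by gsi */
	int count[MAX_GSI];                     /* routes in use per gsi */
};

static void route_table_init(struct route_table *t)
{
	memset(t, 0, sizeof(*t));
}

/* Add a route for gsi; returns -1 if gsi is out of range or its slot is full. */
static int route_add(struct route_table *t, unsigned int gsi, int chip, int pin)
{
	if (gsi >= MAX_GSI || t->count[gsi] >= MAX_PER_GSI)
		return -1;
	t->map[gsi][t->count[gsi]].chip = chip;
	t->map[gsi][t->count[gsi]].pin = pin;
	t->count[gsi]++;
	return 0;
}

/* Injection path: one index operation instead of a scan over all entries. */
static int route_lookup(const struct route_table *t, unsigned int gsi,
			const struct route **out)
{
	if (gsi >= MAX_GSI)
		return 0;
	*out = t->map[gsi];
	return t->count[gsi];
}
```

The lookup returns the number of routes for the gsi and a pointer to the first one, so the caller only iterates over the entries that actually belong to the interrupt being injected.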
    • KVM: Move irq sharing information to irqchip level · 1a6e4a8c
      Gleb Natapov committed
      This removes the assumption that the maximum number of GSIs is smaller
      than the number of pins. Sharing is tracked at the pin level, not at
      the GSI level.
      
      [avi: no PIC on ia64]
      Signed-off-by: Gleb Natapov <gleb@redhat.com>
      Signed-off-by: Avi Kivity <avi@redhat.com>