1. 12 6月, 2009 5 次提交
    • E
      fsnotify: generic notification queue and waitq · a2d8bc6c
      Eric Paris 提交于
      inotify needs to do asyc notification in which event information is stored
      on a queue until the listener is ready to receive it.  This patch
      implements a generic notification queue for inotify (and later fanotify) to
      store events to be sent at a later time.
      Signed-off-by: NEric Paris <eparis@redhat.com>
      Acked-by: NAl Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      a2d8bc6c
    • E
      dnotify: reimplement dnotify using fsnotify · 3c5119c0
      Eric Paris 提交于
      Reimplement dnotify using fsnotify.
      Signed-off-by: NEric Paris <eparis@redhat.com>
      Acked-by: NAl Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      3c5119c0
    • E
      fsnotify: parent event notification · c28f7e56
      Eric Paris 提交于
      inotify and dnotify both use a similar parent notification mechanism.  We
      add a generic parent notification mechanism to fsnotify for both of these
      to use.  This new machanism also adds the dentry flag optimization which
      exists for inotify to dnotify.
      Signed-off-by: NEric Paris <eparis@redhat.com>
      Acked-by: NAl Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      c28f7e56
    • E
      fsnotify: add marks to inodes so groups can interpret how to handle those inodes · 3be25f49
      Eric Paris 提交于
      This patch creates a way for fsnotify groups to attach marks to inodes.
      These marks have little meaning to the generic fsnotify infrastructure
      and thus their meaning should be interpreted by the group that attached
      them to the inode's list.
      
      dnotify and inotify  will make use of these markings to indicate which
      inodes are of interest to their respective groups.  But this implementation
      has the useful property that in the future other listeners could actually
      use the marks for the exact opposite reason, aka to indicate which inodes
      it had NO interest in.
      Signed-off-by: NEric Paris <eparis@redhat.com>
      Acked-by: NAl Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      3be25f49
    • E
      fsnotify: unified filesystem notification backend · 90586523
      Eric Paris 提交于
      fsnotify is a backend for filesystem notification.  fsnotify does
      not provide any userspace interface but does provide the basis
      needed for other notification schemes such as dnotify.  fsnotify
      can be extended to be the backend for inotify or the upcoming
      fanotify.  fsnotify provides a mechanism for "groups" to register for
      some set of filesystem events and to then deliver those events to
      those groups for processing.
      
      fsnotify has a number of benefits, the first being actually shrinking the size
      of an inode.  Before fsnotify to support both dnotify and inotify an inode had
      
              unsigned long           i_dnotify_mask; /* Directory notify events */
              struct dnotify_struct   *i_dnotify; /* for directory notifications */
              struct list_head        inotify_watches; /* watches on this inode */
              struct mutex            inotify_mutex;  /* protects the watches list
      
      But with fsnotify this same functionallity (and more) is done with just
      
              __u32                   i_fsnotify_mask; /* all events for this inode */
              struct hlist_head       i_fsnotify_mark_entries; /* marks on this inode */
      
      That's right, inotify, dnotify, and fanotify all in 64 bits.  We used that
      much space just in inotify_watches alone, before this patch set.
      
      fsnotify object lifetime and locking is MUCH better than what we have today.
      inotify locking is incredibly complex.  See 8f7b0ba1 as an example of
      what's been busted since inception.  inotify needs to know internal semantics
      of superblock destruction and unmounting to function.  The inode pinning and
      vfs contortions are horrible.
      
      no fsnotify implementers do allocation under locks.  This means things like
      f04b30de which (due to an overabundance of caution) changes GFP_KERNEL to
      GFP_NOFS can be reverted.  There are no longer any allocation rules when using
      or implementing your own fsnotify listener.
      
      fsnotify paves the way for fanotify.  In brief fanotify is a notification
      mechanism that delivers the lisener both an 'event' and an open file descriptor
      to the object in question.  This means that fanotify is pathname agnostic.
      Some on lkml may not care for the original companies or users that pushed for
      TALPA, but fanotify was designed with flexibility and input for other users in
      mind.  The readahead group expressed interest in fanotify as it could be used
      to profile disk access on boot without breaking the audit system.  The desktop
      search groups have also expressed interest in fanotify as it solves a number
      of the race conditions and problems present with managing inotify when more
      than a limited number of specific files are of interest.  fanotify can provide
      for a userspace access control system which makes it a clean interface for AV
      vendors to hook without trying to do binary patching on the syscall table,
      LSM, and everywhere else they do their things today.  With this patch series
      fanotify can be implemented in less than 1200 lines of easy to review code.
      Almost all of which is the socket based user interface.
      
      This patch series builds fsnotify to the point that it can implement
      dnotify and inotify_user.  Patches exist and will be sent soon after
      acceptance to finish the in kernel inotify conversion (audit) and implement
      fanotify.
      Signed-off-by: NEric Paris <eparis@redhat.com>
      Acked-by: NAl Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      90586523
  2. 11 6月, 2009 16 次提交
  3. 10 6月, 2009 19 次提交
    • L
      tracing/events: convert block trace points to TRACE_EVENT(), fix !CONFIG_BLOCK · f1db457c
      Li Zefan 提交于
      Fix building failures when CONFIG_BLOCK == n.
      Signed-off-by: NLi Zefan <lizf@cn.fujitsu.com>
      LKML-Reference: <4A2F1520.8020003@cn.fujitsu.com>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      f1db457c
    • B
      spinlock: Add missing __raw_spin_lock_flags() stub for UP · 04dce7d9
      Benjamin Herrenschmidt 提交于
      This was only defined with CONFIG_DEBUG_SPINLOCK set, but some
      obscure arch/powerpc code wants it always.
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      04dce7d9
    • M
      KVM: protect assigned dev workqueue, int handler and irq acker · 547de29e
      Marcelo Tosatti 提交于
      kvm_assigned_dev_ack_irq is vulnerable to a race condition with the
      interrupt handler function. It does:
      
              if (dev->host_irq_disabled) {
                      enable_irq(dev->host_irq);
                      dev->host_irq_disabled = false;
              }
      
      If an interrupt triggers before the host->dev_irq_disabled assignment,
      it will disable the interrupt and set dev->host_irq_disabled to true.
      
      On return to kvm_assigned_dev_ack_irq, dev->host_irq_disabled is set to
      false, and the next kvm_assigned_dev_ack_irq call will fail to reenable
      it.
      
      Other than that, having the interrupt handler and work handlers run in
      parallel sounds like asking for trouble (could not spot any obvious
      problem, but better not have to, its fragile).
      
      CC: sheng.yang@intel.com
      Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      547de29e
    • M
      KVM: use smp_send_reschedule in kvm_vcpu_kick · 32f88400
      Marcelo Tosatti 提交于
      KVM uses a function call IPI to cause the exit of a guest running on a
      physical cpu. For virtual interrupt notification there is no need to
      wait on IPI receival, or to execute any function.
      
      This is exactly what the reschedule IPI does, without the overhead
      of function IPI. So use it instead of smp_call_function_single in
      kvm_vcpu_kick.
      
      Also change the "guest_mode" variable to a bit in vcpu->requests, and
      use that to collapse multiple IPI's that would be issued between the
      first one and zeroing of guest mode.
      
      This allows kvm_vcpu_kick to called with interrupts disabled.
      Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      32f88400
    • S
      KVM: Enable snooping control for supported hardware · 522c68c4
      Sheng Yang 提交于
      Memory aliases with different memory type is a problem for guest. For the guest
      without assigned device, the memory type of guest memory would always been the
      same as host(WB); but for the assigned device, some part of memory may be used
      as DMA and then set to uncacheable memory type(UC/WC), which would be a conflict of
      host memory type then be a potential issue.
      
      Snooping control can guarantee the cache correctness of memory go through the
      DMA engine of VT-d.
      
      [avi: fix build on ia64]
      Signed-off-by: NSheng Yang <sheng@linux.intel.com>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      522c68c4
    • N
      KVM: Make kvm header C++ friendly · 2f8b9ee1
      nathan binkert 提交于
      Two things needed fixing: 1) g++ does not allow a named structure type
      within an anonymous union and 2) Avoid name clash between two padding
      fields within the same struct by giving them different names as is
      done elsewhere in the header.
      Signed-off-by: NNathan Binkert <nate@binkert.org>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      2f8b9ee1
    • G
      KVM: Fix interrupt unhalting a vcpu when it shouldn't · 78646121
      Gleb Natapov 提交于
      kvm_vcpu_block() unhalts vpu on an interrupt/timer without checking
      if interrupt window is actually opened.
      Signed-off-by: NGleb Natapov <gleb@redhat.com>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      78646121
    • S
      KVM: Device assignment framework rework · e56d532f
      Sheng Yang 提交于
      After discussion with Marcelo, we decided to rework device assignment framework
      together. The old problems are kernel logic is unnecessary complex. So Marcelo
      suggest to split it into a more elegant way:
      
      1. Split host IRQ assign and guest IRQ assign. And userspace determine the
      combination. Also discard msi2intx parameter, userspace can specific
      KVM_DEV_IRQ_HOST_MSI | KVM_DEV_IRQ_GUEST_INTX in assigned_irq->flags to
      enable MSI to INTx convertion.
      
      2. Split assign IRQ and deassign IRQ. Import two new ioctls:
      KVM_ASSIGN_DEV_IRQ and KVM_DEASSIGN_DEV_IRQ.
      
      This patch also fixed the reversed _IOR vs _IOW in definition(by deprecated the
      old interface).
      
      [avi: replace homemade bitcount() by hweight_long()]
      Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: NSheng Yang <sheng@linux.intel.com>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      e56d532f
    • G
      KVM: APIC: get rid of deliver_bitmask · 58c2dde1
      Gleb Natapov 提交于
      Deliver interrupt during destination matching loop.
      Signed-off-by: NGleb Natapov <gleb@redhat.com>
      Acked-by: NXiantao Zhang <xiantao.zhang@intel.com>
      Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
      58c2dde1
    • G
      KVM: consolidate ioapic/ipi interrupt delivery logic · 343f94fe
      Gleb Natapov 提交于
      Use kvm_apic_match_dest() in kvm_get_intr_delivery_bitmask() instead
      of duplicating the same code. Use kvm_get_intr_delivery_bitmask() in
      apic_send_ipi() to figure out ipi destination instead of reimplementing
      the logic.
      Signed-off-by: NGleb Natapov <gleb@redhat.com>
      Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
      343f94fe
    • G
      KVM: ioapic/msi interrupt delivery consolidation · a53c17d2
      Gleb Natapov 提交于
      ioapic_deliver() and kvm_set_msi() have code duplication. Move
      the code into ioapic_deliver_entry() function and call it from
      both places.
      Signed-off-by: NGleb Natapov <gleb@redhat.com>
      Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
      a53c17d2
    • C
      KVM: declare ioapic functions only on affected hardware · b95b51d5
      Christian Borntraeger 提交于
      Since "KVM: Unify the delivery of IOAPIC and MSI interrupts"
      I get the following warnings:
      
        CC [M]  arch/s390/kvm/kvm-s390.o
      In file included from arch/s390/kvm/kvm-s390.c:22:
      include/linux/kvm_host.h:357: warning: 'struct kvm_ioapic' declared inside parameter list
      include/linux/kvm_host.h:357: warning: its scope is only this definition or declaration, which is probably not what you want
      
      This patch limits IOAPIC functions for architectures that have one.
      Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      b95b51d5
    • S
      KVM: Enable MSI-X for KVM assigned device · d510d6cc
      Sheng Yang 提交于
      This patch finally enable MSI-X.
      
      What we need for MSI-X:
      1. Intercept one page in MMIO region of device. So that we can get guest desired
      MSI-X table and set up the real one. Now this have been done by guest, and
      transfer to kernel using ioctl KVM_SET_MSIX_NR and KVM_SET_MSIX_ENTRY.
      
      2. Information for incoming interrupt. Now one device can have more than one
      interrupt, and they are all handled by one workqueue structure. So we need to
      identify them. The previous patch enable gsi_msg_pending_bitmap get this done.
      
      3. Mapping from host IRQ to guest gsi as well as guest gsi to real MSI/MSI-X
      message address/data. We used same entry number for the host and guest here, so
      that it's easy to find the correlated guest gsi.
      
      What we lack for now:
      1. The PCI spec said nothing can existed with MSI-X table in the same page of
      MMIO region, except pending bits. The patch ignore pending bits as the first
      step (so they are always 0 - no pending).
      
      2. The PCI spec allowed to change MSI-X table dynamically. That means, the OS
      can enable MSI-X, then mask one MSI-X entry, modify it, and unmask it. The patch
      didn't support this, and Linux also don't work in this way.
      
      3. The patch didn't implement MSI-X mask all and mask single entry. I would
      implement the former in driver/pci/msi.c later. And for single entry, userspace
      should have reposibility to handle it.
      Signed-off-by: NSheng Yang <sheng@linux.intel.com>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      d510d6cc
    • S
      KVM: Add MSI-X interrupt injection logic · 2350bd1f
      Sheng Yang 提交于
      We have to handle more than one interrupt with one handler for MSI-X. Avi
      suggested to use a flag to indicate the pending. So here is it.
      Signed-off-by: NSheng Yang <sheng@linux.intel.com>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      2350bd1f
    • S
      KVM: Ioctls for init MSI-X entry · c1e01514
      Sheng Yang 提交于
      Introduce KVM_SET_MSIX_NR and KVM_SET_MSIX_ENTRY two ioctls.
      
      This two ioctls are used by userspace to specific guest device MSI-X entry
      number and correlate MSI-X entry with GSI during the initialization stage.
      
      MSI-X should be well initialzed before enabling.
      
      Don't support change MSI-X entry number for now.
      Signed-off-by: NSheng Yang <sheng@linux.intel.com>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      c1e01514
    • S
      KVM: Unify the delivery of IOAPIC and MSI interrupts · 116191b6
      Sheng Yang 提交于
      Signed-off-by: NSheng Yang <sheng@linux.intel.com>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      116191b6
    • S
      KVM: Split IOAPIC structure · cf9e4e15
      Sheng Yang 提交于
      Prepared for reuse ioapic_redir_entry for MSI.
      Signed-off-by: NSheng Yang <sheng@linux.intel.com>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      cf9e4e15
    • S
      tracing: add trace_seq_vprint interface · 725c624a
      Steven Rostedt 提交于
      The code to update the print formats for events requires a vprintf
      format in the trace_seq. This patch adds that interface.
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      725c624a
    • L
      tracing/events: convert block trace points to TRACE_EVENT() · 55782138
      Li Zefan 提交于
      TRACE_EVENT is a more generic way to define tracepoints. Doing so adds
      these new capabilities to this tracepoint:
      
        - zero-copy and per-cpu splice() tracing
        - binary tracing without printf overhead
        - structured logging records exposed under /debug/tracing/events
        - trace events embedded in function tracer output and other plugins
        - user-defined, per tracepoint filter expressions
        ...
      
      Cons:
      
        - no dev_t info for the output of plug, unplug_timer and unplug_io events.
          no dev_t info for getrq and sleeprq events if bio == NULL.
          no dev_t info for rq_abort,...,rq_requeue events if rq->rq_disk == NULL.
      
          This is mainly because we can't get the deivce from a request queue.
          But this may change in the future.
      
        - A packet command is converted to a string in TP_assign, not TP_print.
          While blktrace do the convertion just before output.
      
          Since pc requests should be rather rare, this is not a big issue.
      
        - In blktrace, an event can have 2 different print formats, but a TRACE_EVENT
          has a unique format, which means we have some unused data in a trace entry.
      
          The overhead is minimized by using __dynamic_array() instead of __array().
      
      I've benchmarked the ioctl blktrace vs the splice based TRACE_EVENT tracing:
      
            dd                   dd + ioctl blktrace       dd + TRACE_EVENT (splice)
      1     7.36s, 42.7 MB/s     7.50s, 42.0 MB/s          7.41s, 42.5 MB/s
      2     7.43s, 42.3 MB/s     7.48s, 42.1 MB/s          7.43s, 42.4 MB/s
      3     7.38s, 42.6 MB/s     7.45s, 42.2 MB/s          7.41s, 42.5 MB/s
      
      So the overhead of tracing is very small, and no regression when using
      those trace events vs blktrace.
      
      And the binary output of TRACE_EVENT is much smaller than blktrace:
      
       # ls -l -h
       -rw-r--r-- 1 root root 8.8M 06-09 13:24 sda.blktrace.0
       -rw-r--r-- 1 root root 195K 06-09 13:24 sda.blktrace.1
       -rw-r--r-- 1 root root 2.7M 06-09 13:25 trace_splice.out
      
      Following are some comparisons between TRACE_EVENT and blktrace:
      
      plug:
        kjournald-480   [000]   303.084981: block_plug: [kjournald]
        kjournald-480   [000]   303.084981:   8,0    P   N [kjournald]
      
      unplug_io:
        kblockd/0-118   [000]   300.052973: block_unplug_io: [kblockd/0] 1
        kblockd/0-118   [000]   300.052974:   8,0    U   N [kblockd/0] 1
      
      remap:
        kjournald-480   [000]   303.085042: block_remap: 8,0 W 102736992 + 8 <- (8,8) 33384
        kjournald-480   [000]   303.085043:   8,0    A   W 102736992 + 8 <- (8,8) 33384
      
      bio_backmerge:
        kjournald-480   [000]   303.085086: block_bio_backmerge: 8,0 W 102737032 + 8 [kjournald]
        kjournald-480   [000]   303.085086:   8,0    M   W 102737032 + 8 [kjournald]
      
      getrq:
        kjournald-480   [000]   303.084974: block_getrq: 8,0 W 102736984 + 8 [kjournald]
        kjournald-480   [000]   303.084975:   8,0    G   W 102736984 + 8 [kjournald]
      
        bash-2066  [001]  1072.953770:   8,0    G   N [bash]
        bash-2066  [001]  1072.953773: block_getrq: 0,0 N 0 + 0 [bash]
      
      rq_complete:
        konsole-2065  [001]   300.053184: block_rq_complete: 8,0 W () 103669040 + 16 [0]
        konsole-2065  [001]   300.053191:   8,0    C   W 103669040 + 16 [0]
      
        ksoftirqd/1-7   [001]  1072.953811:   8,0    C   N (5a 00 08 00 00 00 00 00 24 00) [0]
        ksoftirqd/1-7   [001]  1072.953813: block_rq_complete: 0,0 N (5a 00 08 00 00 00 00 00 24 00) 0 + 0 [0]
      
      rq_insert:
        kjournald-480   [000]   303.084985: block_rq_insert: 8,0 W 0 () 102736984 + 8 [kjournald]
        kjournald-480   [000]   303.084986:   8,0    I   W 102736984 + 8 [kjournald]
      
      Changelog from v2 -> v3:
      
      - use the newly introduced __dynamic_array().
      
      Changelog from v1 -> v2:
      
      - use __string() instead of __array() to minimize the memory required
        to store hex dump of rq->cmd().
      
      - support large pc requests.
      
      - add missing blk_fill_rwbs_rq() in block_rq_requeue TRACE_EVENT.
      
      - some cleanups.
      Signed-off-by: NLi Zefan <lizf@cn.fujitsu.com>
      LKML-Reference: <4A2DF669.5070905@cn.fujitsu.com>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      55782138