1. 06 12月, 2012 10 次提交
    • P
      KVM: PPC: Book3S PR: Emulate PURR, SPURR and DSCR registers · b0a94d4e
      Paul Mackerras 提交于
      This adds basic emulation of the PURR and SPURR registers.  We assume
      we are emulating a single-threaded core, so these advance at the same
      rate as the timebase.  A Linux kernel running on a POWER7 expects to
      be able to access these registers and is not prepared to handle a
      program interrupt on accessing them.
      
      This also adds a very minimal emulation of the DSCR (data stream
      control register).  Writes are ignored and reads return zero.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      b0a94d4e
    • P
      KVM: PPC: Book3S HV: Don't give the guest RW access to RO pages · 1cc8ed0b
      Paul Mackerras 提交于
      Currently, if the guest does an H_PROTECT hcall requesting that the
      permissions on a HPT entry be changed to allow writing, we make the
      requested change even if the page is marked read-only in the host
      Linux page tables.  This is a problem since it would for instance
      allow a guest to modify a page that KSM has decided can be shared
      between multiple guests.
      
      To fix this, if the new permissions for the page allow writing, we need
      to look up the memslot for the page, work out the host virtual address,
      and look up the Linux page tables to get the PTE for the page.  If that
      PTE is read-only, we reduce the HPTE permissions to read-only.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      1cc8ed0b
    • P
      KVM: PPC: Book3S HV: Report correct HPT entry index when reading HPT · 05dd85f7
      Paul Mackerras 提交于
      This fixes a bug in the code which allows userspace to read out the
      contents of the guest's hashed page table (HPT).  On the second and
      subsequent passes through the HPT, when we are reporting only those
      entries that have changed, we were incorrectly initializing the index
      field of the header with the index of the first entry we skipped
      rather than the first changed entry.  This fixes it.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      05dd85f7
    • P
      KVM: PPC: Book3S HV: Reset reverse-map chains when resetting the HPT · a64fd707
      Paul Mackerras 提交于
      With HV-style KVM, we maintain reverse-mapping lists that enable us to
      find all the HPT (hashed page table) entries that reference each guest
      physical page, with the heads of the lists in the memslot->arch.rmap
      arrays.  When we reset the HPT (i.e. when we reboot the VM), we clear
      out all the HPT entries but we were not clearing out the reverse
      mapping lists.  The result is that as we create new HPT entries, the
      lists get corrupted, which can easily lead to loops, resulting in the
      host kernel hanging when it tries to traverse those lists.
      
      This fixes the problem by zeroing out all the reverse mapping lists
      when we zero out the HPT.  This incidentally means that we are also
      zeroing our record of the referenced and changed bits (not the bits
      in the Linux PTEs, used by the Linux MM subsystem, but the bits used
      by the KVM_GET_DIRTY_LOG ioctl, and those used by kvm_age_hva() and
      kvm_test_age_hva()).
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      a64fd707
    • P
      KVM: PPC: Book3S HV: Provide a method for userspace to read and write the HPT · a2932923
      Paul Mackerras 提交于
      A new ioctl, KVM_PPC_GET_HTAB_FD, returns a file descriptor.  Reads on
      this fd return the contents of the HPT (hashed page table), writes
      create and/or remove entries in the HPT.  There is a new capability,
      KVM_CAP_PPC_HTAB_FD, to indicate the presence of the ioctl.  The ioctl
      takes an argument structure with the index of the first HPT entry to
      read out and a set of flags.  The flags indicate whether the user is
      intending to read or write the HPT, and whether to return all entries
      or only the "bolted" entries (those with the bolted bit, 0x10, set in
      the first doubleword).
      
      This is intended for use in implementing qemu's savevm/loadvm and for
      live migration.  Therefore, on reads, the first pass returns information
      about all HPTEs (or all bolted HPTEs).  When the first pass reaches the
      end of the HPT, it returns from the read.  Subsequent reads only return
      information about HPTEs that have changed since they were last read.
      A read that finds no changed HPTEs in the HPT following where the last
      read finished will return 0 bytes.
      
      The format of the data provides a simple run-length compression of the
      invalid entries.  Each block of data starts with a header that indicates
      the index (position in the HPT, which is just an array), the number of
      valid entries starting at that index (may be zero), and the number of
      invalid entries following those valid entries.  The valid entries, 16
      bytes each, follow the header.  The invalid entries are not explicitly
      represented.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      [agraf: fix documentation]
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      a2932923
    • P
      KVM: PPC: Book3S HV: Make a HPTE removal function available · 6b445ad4
      Paul Mackerras 提交于
      This makes a HPTE removal function, kvmppc_do_h_remove(), available
      outside book3s_hv_rm_mmu.c.  This will be used by the HPT writing
      code.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      6b445ad4
    • P
      KVM: PPC: Book3S HV: Add a mechanism for recording modified HPTEs · 44e5f6be
      Paul Mackerras 提交于
      This uses a bit in our record of the guest view of the HPTE to record
      when the HPTE gets modified.  We use a reserved bit for this, and ensure
      that this bit is always cleared in HPTE values returned to the guest.
      
      The recording of modified HPTEs is only done if other code indicates
      its interest by setting kvm->arch.hpte_mod_interest to a non-zero value.
      The reason for this is that when later commits add facilities for
      userspace to read the HPT, the first pass of reading the HPT will be
      quicker if there are no (or very few) HPTEs marked as modified,
      rather than having most HPTEs marked as modified.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      44e5f6be
    • P
      KVM: PPC: Book3S HV: Fix bug causing loss of page dirty state · 4879f241
      Paul Mackerras 提交于
      This fixes a bug where adding a new guest HPT entry via the H_ENTER
      hcall would lose the "changed" bit in the reverse map information
      for the guest physical page being mapped.  The result was that the
      KVM_GET_DIRTY_LOG could return a zero bit for the page even though
      the page had been modified by the guest.
      
      This fixes it by only modifying the index and present bits in the
      reverse map entry, thus preserving the reference and change bits.
      We were also unnecessarily setting the reference bit, and this
      fixes that too.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      4879f241
    • P
      KVM: PPC: Book3S HV: Restructure HPT entry creation code · 7ed661bf
      Paul Mackerras 提交于
      This restructures the code that creates HPT (hashed page table)
      entries so that it can be called in situations where we don't have a
      struct vcpu pointer, only a struct kvm pointer.  It also fixes a bug
      where kvmppc_map_vrma() would corrupt the guest R4 value.
      
      Most of the work of kvmppc_virtmode_h_enter is now done by a new
      function, kvmppc_virtmode_do_h_enter, which itself calls another new
      function, kvmppc_do_h_enter, which contains most of the old
      kvmppc_h_enter.  The new kvmppc_do_h_enter takes explicit arguments
      for the place to return the HPTE index, the Linux page tables to use,
      and whether it is being called in real mode, thus removing the need
      for it to have the vcpu as an argument.
      
      Currently kvmppc_map_vrma creates the VRMA (virtual real mode area)
      HPTEs by calling kvmppc_virtmode_h_enter, which is designed primarily
      to handle H_ENTER hcalls from the guest that need to pin a page of
      memory.  Since H_ENTER returns the index of the created HPTE in R4,
      kvmppc_virtmode_h_enter updates the guest R4, corrupting the guest R4
      in the case when it gets called from kvmppc_map_vrma on the first
      VCPU_RUN ioctl.  With this, kvmppc_map_vrma instead calls
      kvmppc_virtmode_do_h_enter with the address of a dummy word as the
      place to store the HPTE index, thus avoiding corrupting the guest R4.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      7ed661bf
    • A
      KVM: PPC: Support eventfd · 0e673fb6
      Alexander Graf 提交于
      In order to support the generic eventfd infrastructure on PPC, we need
      to call into the generic KVM in-kernel device mmio code.
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      0e673fb6
  2. 28 11月, 2012 1 次提交
  3. 31 10月, 2012 1 次提交
  4. 30 10月, 2012 12 次提交
    • P
      KVM: PPC: Book3S HV: Fix thinko in try_lock_hpte() · 8b5869ad
      Paul Mackerras 提交于
      This fixes an error in the inline asm in try_lock_hpte() where we
      were erroneously using a register number as an immediate operand.
      The bug only affects an error path, and in fact the code will still
      work as long as the compiler chooses some register other than r0
      for the "bits" variable.  Nevertheless it should still be fixed.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      8b5869ad
    • P
      KVM: PPC: Book3S HV: Allow DTL to be set to address 0, length 0 · 9f8c8c78
      Paul Mackerras 提交于
      Commit 55b665b0 ("KVM: PPC: Book3S HV: Provide a way for userspace
      to get/set per-vCPU areas") includes a check on the length of the
      dispatch trace log (DTL) to make sure the buffer is at least one entry
      long.  This is appropriate when registering a buffer, but the
      interface also allows for any existing buffer to be unregistered by
      specifying a zero address.  In this case the length check is not
      appropriate.  This makes the check conditional on the address being
      non-zero.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      9f8c8c78
    • P
      KVM: PPC: Book3S HV: Fix accounting of stolen time · c7b67670
      Paul Mackerras 提交于
      Currently the code that accounts stolen time tends to overestimate the
      stolen time, and will sometimes report more stolen time in a DTL
      (dispatch trace log) entry than has elapsed since the last DTL entry.
      This can cause guests to underflow the user or system time measured
      for some tasks, leading to ridiculous CPU percentages and total runtimes
      being reported by top and other utilities.
      
      In addition, the current code was designed for the previous policy where
      a vcore would only run when all the vcpus in it were runnable, and so
      only counted stolen time on a per-vcore basis.  Now that a vcore can
      run while some of the vcpus in it are doing other things in the kernel
      (e.g. handling a page fault), we need to count the time when a vcpu task
      is preempted while it is not running as part of a vcore as stolen also.
      
      To do this, we bring back the BUSY_IN_HOST vcpu state and extend the
      vcpu_load/put functions to count preemption time while the vcpu is
      in that state.  Handling the transitions between the RUNNING and
      BUSY_IN_HOST states requires checking and updating two variables
      (accumulated time stolen and time last preempted), so we add a new
      spinlock, vcpu->arch.tbacct_lock.  This protects both the per-vcpu
      stolen/preempt-time variables, and the per-vcore variables while this
      vcpu is running the vcore.
      
      Finally, we now don't count time spent in userspace as stolen time.
      The task could be executing in userspace on behalf of the vcpu, or
      it could be preempted, or the vcpu could be genuinely stopped.  Since
      we have no way of dividing up the time between these cases, we don't
      count any of it as stolen.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      c7b67670
    • P
      KVM: PPC: Book3S HV: Run virtual core whenever any vcpus in it can run · 8455d79e
      Paul Mackerras 提交于
      Currently the Book3S HV code implements a policy on multi-threaded
      processors (i.e. POWER7) that requires all of the active vcpus in a
      virtual core to be ready to run before we run the virtual core.
      However, that causes problems on reset, because reset stops all vcpus
      except vcpu 0, and can also reduce throughput since all four threads
      in a virtual core have to wait whenever any one of them hits a
      hypervisor page fault.
      
      This relaxes the policy, allowing the virtual core to run as soon as
      any vcpu in it is runnable.  With this, the KVMPPC_VCPU_STOPPED state
      and the KVMPPC_VCPU_BUSY_IN_HOST state have been combined into a single
      KVMPPC_VCPU_NOTREADY state, since we no longer need to distinguish
      between them.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      8455d79e
    • P
      KVM: PPC: Book3S HV: Fixes for late-joining threads · 2f12f034
      Paul Mackerras 提交于
      If a thread in a virtual core becomes runnable while other threads
      in the same virtual core are already running in the guest, it is
      possible for the latecomer to join the others on the core without
      first pulling them all out of the guest.  Currently this only happens
      rarely, when a vcpu is first started.  This fixes some bugs and
      omissions in the code in this case.
      
      First, we need to check for VPA updates for the latecomer and make
      a DTL entry for it.  Secondly, if it comes along while the master
      vcpu is doing a VPA update, we don't need to do anything since the
      master will pick it up in kvmppc_run_core.  To handle this correctly
      we introduce a new vcore state, VCORE_STARTING.  Thirdly, there is
      a race because we currently clear the hardware thread's hwthread_req
      before waiting to see it get to nap.  A latecomer thread could have
      its hwthread_req cleared before it gets to test it, and therefore
      never increment the nap_count, leading to messages about wait_for_nap
      timeouts.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      2f12f034
    • P
      KVM: PPC: Book3s HV: Don't access runnable threads list without vcore lock · 913d3ff9
      Paul Mackerras 提交于
      There were a few places where we were traversing the list of runnable
      threads in a virtual core, i.e. vc->runnable_threads, without holding
      the vcore spinlock.  This extends the places where we hold the vcore
      spinlock to cover everywhere that we traverse that list.
      
      Since we possibly need to sleep inside kvmppc_book3s_hv_page_fault,
      this moves the call of it from kvmppc_handle_exit out to
      kvmppc_vcpu_run, where we don't hold the vcore lock.
      
      In kvmppc_vcore_blocked, we don't actually need to check whether
      all vcpus are ceded and don't have any pending exceptions, since the
      caller has already done that.  The caller (kvmppc_run_vcpu) wasn't
      actually checking for pending exceptions, so we add that.
      
      The change of if to while in kvmppc_run_vcpu is to make sure that we
      never call kvmppc_remove_runnable() when the vcore state is RUNNING or
      EXITING.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      913d3ff9
    • P
      KVM: PPC: Book3S HV: Fix some races in starting secondary threads · 7b444c67
      Paul Mackerras 提交于
      Subsequent patches implementing in-kernel XICS emulation will make it
      possible for IPIs to arrive at secondary threads at arbitrary times.
      This fixes some races in how we start the secondary threads, which
      if not fixed could lead to occasional crashes of the host kernel.
      
      This makes sure that (a) we have grabbed all the secondary threads,
      and verified that they are no longer in the kernel, before we start
      any thread, (b) that the secondary thread loads its vcpu pointer
      after clearing the IPI that woke it up (so we don't miss a wakeup),
      and (c) that the secondary thread clears its vcpu pointer before
      incrementing the nap count.  It also removes unnecessary setting
      of the vcpu and vcore pointers in the paca in kvmppc_core_vcpu_load.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      7b444c67
    • P
      KVM: PPC: Book3S HV: Allow KVM guests to stop secondary threads coming online · 512691d4
      Paul Mackerras 提交于
      When a Book3S HV KVM guest is running, we need the host to be in
      single-thread mode, that is, all of the cores (or at least all of
      the cores where the KVM guest could run) to be running only one
      active hardware thread.  This is because of the hardware restriction
      in POWER processors that all of the hardware threads in the core
      must be in the same logical partition.  Complying with this restriction
      is much easier if, from the host kernel's point of view, only one
      hardware thread is active.
      
      This adds two hooks in the SMP hotplug code to allow the KVM code to
      make sure that secondary threads (i.e. hardware threads other than
      thread 0) cannot come online while any KVM guest exists.  The KVM
      code still has to check that any core where it runs a guest has the
      secondary threads offline, but having done that check it can now be
      sure that they will not come online while the guest is running.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Acked-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      512691d4
    • A
      PPC: ePAPR: Convert header to uapi · c99ec973
      Alexander Graf 提交于
      The new uapi framework splits kernel internal and user space exported
      bits of header files more cleanly. Adjust the ePAPR header accordingly.
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      c99ec973
    • A
      KVM: PPC: Move mtspr/mfspr emulation into own functions · 388cf9ee
      Alexander Graf 提交于
      The mtspr/mfspr emulation code became quite big over time. Move it
      into its own function so things stay more readable.
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      388cf9ee
    • A
      KVM: PPC: 44x: fix DCR read/write · e43a0287
      Alexander Graf 提交于
      When remembering the direction of a DCR transaction, we should write
      to the same variable that we interpret on later when doing vcpu_run
      again.
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      Cc: stable@vger.kernel.org
      e43a0287
    • X
      KVM: do not treat noslot pfn as a error pfn · 81c52c56
      Xiao Guangrong 提交于
      This patch filters noslot pfn out from error pfns based on Marcelo comment:
      noslot pfn is not a error pfn
      
      After this patch,
      - is_noslot_pfn indicates that the gfn is not in slot
      - is_error_pfn indicates that the gfn is in slot but the error is occurred
        when translate the gfn to pfn
      - is_error_noslot_pfn indicates that the pfn either it is error pfns or it
        is noslot pfn
      And is_invalid_pfn can be removed, it makes the code more clean
      Signed-off-by: NXiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
      81c52c56
  5. 23 10月, 2012 1 次提交
  6. 18 10月, 2012 5 次提交
    • D
      cpuidle/powerpc: Fix snooze state problem in the cpuidle design on pseries. · 83dac594
      Deepthi Dharwar 提交于
      Earlier without cpuidle framework on pseries, the native arch
      idle routine comprised of both snooze and nap
      states.  smt_snooze_delay variable was used to delay
      the idle process entry to deeper idle state like  nap.
      With the coming of cpuidle, this arch specific idle was replaced
      by two different idle routines, one for supporting snooze and other
      for nap. This enabled addition of more
      low level idle states on pseries in the future.
      
      On adopting the generic cpuidle framework for POWER systems,
      the decision of which idle state to choose from,  given a predicted
      idle time is taken by the menu governor based on
      target_residency and  exit_latency of the idle states.
      target_residency is the minimum time to be resident in that idle state.
      Exit_latency is time taken to exit out of idle state.
      Deeper the idle state, both the target residency and exit latency
      would be higher.
      
      In the current design, smt_snooze_delay is used as target_residency
      for the  snooze state which is incorrect, as it is not the
      minimum but the maximum duration to be in snooze state.
      This would  result in the governor in taking bad decision,
      as presently target_residency of nap < target_residency of snooze
      inspite of nap being deeper idle state.
      
      This patch aims to fix this problem by replacing the smt_snooze_delay loop
      in snooze state, with the need_resched()  as the governor is aware of
      entry and exit of various idle transitions based on which
      next idle time prediction.
      
      The governor is intelligent enough to determine the idle state the needs to
      be transitioned to and maintains a whole of heuristics including
      io load, previous idle states predictions etc for the same, based on
      which idle state entry decision is taken.
      
      With this fix, of setting target_residency of snooze to 0
      					     nap to smt_snooze_delay
      if the predicted idle time is less
      than smt_snooze_delay (target_residency of nap)
      value governor would pick snooze state, else nap. This adhers to the
      previous native idle design.
      Signed-off-by: NDeepthi Dharwar <deepthi@linux.vnet.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      83dac594
    • D
      cpuidle/powerpc: Fix smt_snooze_delay functionality. · 8ea959a1
      Deepthi Dharwar 提交于
      smt_snooze_delay was designed to  delay idle loop's nap entry
      in the native idle code before it got  ported over to use as part of
      the cpuidle framework.
      
      A -ve value  assigned to smt_snooze_delay should result in
      busy looping, in other words disabling the entry to nap state.
      
      	- https://lists.ozlabs.org/pipermail/linuxppc-dev/2010-May/082450.html
      
      This particular functionality can be achieved currently by
      echo 1 > /sys/devices/system/cpu/cpu*/state1/disable
      but it is broken when one assigns -ve value to  the smt_snooze_delay
      variable either via sysfs entry or ppc64_cpu util.
      
      This patch aims to fix this, by disabling nap state when smt_snooze_delay
      variable is set to -ve value.
      Signed-off-by: NDeepthi Dharwar <deepthi@linux.vnet.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      8ea959a1
    • D
      cpuidle/powerpc: Fix target residency initialisation in pseries cpuidle · 817deb05
      Deepthi Dharwar 提交于
      Remove the redundant target residency initialisation in pseries_cpuidle_driver_init().
      This is currently over-writing the residency time updated as part of the static
      table, resulting in  all the idle states having the same target
      residency of 100us which is incorrect. This may result in the menu governor making
      wrong state decisions.
      Signed-off-by: NDeepthi Dharwar <deepthi@linux.vnet.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      817deb05
    • A
      powerpc: Build fix for powerpc KVM · ce236ab5
      Aneesh Kumar K.V 提交于
      Fix build failure for powerpc KVM by adding missing VPN_SHIFT definition
      and the ';'
      
      arch/powerpc/kvm/book3s_32_mmu_host.c: In function 'kvmppc_mmu_map_page':
      arch/powerpc/kvm/book3s_32_mmu_host.c:176: error: 'VPN_SHIFT' undeclared (first use in this function)
      arch/powerpc/kvm/book3s_32_mmu_host.c:176: error: (Each undeclared identifier is reported only once
      arch/powerpc/kvm/book3s_32_mmu_host.c:176: error: for each function it appears in.)
      arch/powerpc/kvm/book3s_32_mmu_host.c:178: error: expected ';' before 'next_pteg'
      arch/powerpc/kvm/book3s_32_mmu_host.c:190: error: label 'next_pteg' used but not defined
      make[1]: *** [arch/powerpc/kvm/book3s_32_mmu_host.o] Error 1
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      ce236ab5
    • B
      Revert "powerpc/perf: Use pmc_overflow() to detect rolled back events" · 72523d80
      Benjamin Herrenschmidt 提交于
      This reverts commit 81331211.
      
      This revert was requested by the author of the patch as it seems
      to cause system hangs with some low frequency events
      72523d80
  7. 12 10月, 2012 1 次提交
  8. 11 10月, 2012 2 次提交
  9. 09 10月, 2012 7 次提交
    • D
    • Y
      memory-hotplug: suppress "Trying to free nonexistent resource... · d760afd4
      Yasuaki Ishimatsu 提交于
      memory-hotplug: suppress "Trying to free nonexistent resource <XXXXXXXXXXXXXXXX-YYYYYYYYYYYYYYYY>" warning
      
      When our x86 box calls __remove_pages(), release_mem_region() shows many
      warnings.  And x86 box cannot unregister iomem_resource.
      
        "Trying to free nonexistent resource <XXXXXXXXXXXXXXXX-YYYYYYYYYYYYYYYY>"
      
      release_mem_region() has been changed to be called in each
      PAGES_PER_SECTION by commit de7f0cba ("memory hotplug: release
      memory regions in PAGES_PER_SECTION chunks").  Because powerpc registers
      iomem_resource in each PAGES_PER_SECTION chunk.  But when I hot add
      memory on x86 box, iomem_resource is register in each _CRS not
      PAGES_PER_SECTION chunk.  So x86 box unregisters iomem_resource.
      
      The patch fixes the problem.
      Signed-off-by: NYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jiang Liu <liuj97@gmail.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Nathan Fontenot <nfont@austin.ibm.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d760afd4
    • S
      readahead: fault retry breaks mmap file read random detection · 45cac65b
      Shaohua Li 提交于
      .fault now can retry.  The retry can break state machine of .fault.  In
      filemap_fault, if page is miss, ra->mmap_miss is increased.  In the second
      try, since the page is in page cache now, ra->mmap_miss is decreased.  And
      these are done in one fault, so we can't detect random mmap file access.
      
      Add a new flag to indicate .fault is tried once.  In the second try, skip
      ra->mmap_miss decreasing.  The filemap_fault state machine is ok with it.
      
      I only tested x86, didn't test other archs, but looks the change for other
      archs is obvious, but who knows :)
      Signed-off-by: NShaohua Li <shaohua.li@fusionio.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      45cac65b
    • S
      atomic: implement generic atomic_dec_if_positive() · e79bee24
      Shaohua Li 提交于
      The x86 implementation of atomic_dec_if_positive is quite generic, so make
      it available to all architectures.
      
      This is needed for "swap: add a simple detector for inappropriate swapin
      readahead".
      
      [akpm@linux-foundation.org: do the "#define foo foo" trick in the conventional manner]
      Signed-off-by: NShaohua Li <shli@fusionio.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Michal Simek <monstr@monstr.eu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e79bee24
    • W
      mm: hugetlb: add arch hook for clearing page flags before entering pool · 5d3a551c
      Will Deacon 提交于
      The core page allocator ensures that page flags are zeroed when freeing
      pages via free_pages_check.  A number of architectures (ARM, PPC, MIPS)
      rely on this property to treat new pages as dirty with respect to the data
      cache and perform the appropriate flushing before mapping the pages into
      userspace.
      
      This can lead to cache synchronisation problems when using hugepages,
      since the allocator keeps its own pool of pages above the usual page
      allocator and does not reset the page flags when freeing a page into the
      pool.
      
      This patch adds a new architecture hook, arch_clear_hugepage_flags, so
      that architectures which rely on the page flags being in a particular
      state for fresh allocations can adjust the flags accordingly when a page
      is freed into the pool.
      Signed-off-by: NWill Deacon <will.deacon@arm.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5d3a551c
    • K
      mm: kill vma flag VM_RESERVED and mm->reserved_vm counter · 314e51b9
      Konstantin Khlebnikov 提交于
      A long time ago, in v2.4, VM_RESERVED kept swapout process off VMA,
      currently it lost original meaning but still has some effects:
      
       | effect                 | alternative flags
      -+------------------------+---------------------------------------------
      1| account as reserved_vm | VM_IO
      2| skip in core dump      | VM_IO, VM_DONTDUMP
      3| do not merge or expand | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP
      4| do not mlock           | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP
      
      This patch removes reserved_vm counter from mm_struct.  Seems like nobody
      cares about it, it does not exported into userspace directly, it only
      reduces total_vm showed in proc.
      
      Thus VM_RESERVED can be replaced with VM_IO or pair VM_DONTEXPAND | VM_DONTDUMP.
      
      remap_pfn_range() and io_remap_pfn_range() set VM_IO|VM_DONTEXPAND|VM_DONTDUMP.
      remap_vmalloc_range() set VM_DONTEXPAND | VM_DONTDUMP.
      
      [akpm@linux-foundation.org: drivers/vfio/pci/vfio_pci.c fixup]
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      314e51b9
    • K
      mm: use mm->exe_file instead of first VM_EXECUTABLE vma->vm_file · 2dd8ad81
      Konstantin Khlebnikov 提交于
      Some security modules and oprofile still uses VM_EXECUTABLE for retrieving
      a task's executable file.  After this patch they will use mm->exe_file
      directly.  mm->exe_file is protected with mm->mmap_sem, so locking stays
      the same.
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org>
      Acked-by: Chris Metcalf <cmetcalf@tilera.com>			[arch/tile]
      Acked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>	[tomoyo]
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Acked-by: NJames Morris <james.l.morris@oracle.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2dd8ad81