1. 07 Feb 2014, 2 commits
    • ACPI / hotplug / PCI: Rework the handling of eject requests · dd2151be
      Committed by Rafael J. Wysocki
      To avoid the need to install a hotplug notify handler for each ACPI
      namespace node representing a device and having a matching scan
      handler, move the check whether or not the ejection of the given
      device is enabled through its scan handler from acpi_hotplug_notify_cb()
      to acpi_generic_hotplug_event().  Also, move the evaluation of _OST
      with the ACPI_OST_SC_EJECT_IN_PROGRESS status code to
      acpi_generic_hotplug_event(), because in acpi_hotplug_notify_cb() or
      in acpi_eject_store() we don't really know whether or not the eject
      is actually going to be in progress (for example,
      acpi_hotplug_execute() may still fail without queuing up the work
      item).
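
      A minimal sketch of the consolidated flow: the scan-handler check and
      the _OST evaluation both live in one place now. Helpers marked
      "hypothetical" are illustrative names, not necessarily the exact
      kernel API:

      static int acpi_generic_hotplug_event(struct acpi_device *adev, u32 type)
      {
              switch (type) {
              case ACPI_NOTIFY_BUS_CHECK:
                      return acpi_scan_bus_check(adev);       /* hypothetical */
              case ACPI_NOTIFY_DEVICE_CHECK:
                      return acpi_scan_device_check(adev);    /* hypothetical */
              case ACPI_NOTIFY_EJECT_REQUEST:
                      /* Check moved here from acpi_hotplug_notify_cb(). */
                      if (adev->handler && !adev->handler->hotplug.enabled)
                              return -EPERM;
                      /*
                       * _OST is evaluated only once we know the eject will
                       * really be attempted; helper name is an assumption.
                       */
                      acpi_evaluate_hotplug_ost(adev->handle, type,
                                                ACPI_OST_SC_EJECT_IN_PROGRESS,
                                                NULL);
                      return acpi_scan_hot_remove(adev);      /* hypothetical */
              }
              return -EINVAL;
      }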
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Tested-by: Mika Westerberg <mika.westerberg@linux.intel.com>
    • ACPI / hotplug / PCI: Consolidate ACPIPHP with ACPI core hotplug · 3c2cc7ff
      Committed by Rafael J. Wysocki
      The ACPI-based PCI hotplug (ACPIPHP) code currently attaches its
      hotplug context objects directly to ACPI namespace nodes representing
      hotplug devices.  However, after recent changes causing struct
      acpi_device to be created for every namespace node representing a
      device (regardless of its status), that is not necessary any more.
      Moreover, it's vulnerable to the theoretical issue that the ACPI
      handle passed in the context between handle_hotplug_event() and
      hotplug_event_work() may become invalid in the meantime (as a
      result of a concurrent table unload).
      
      In principle, this issue might be addressed by adding a non-empty
      release handler for ACPIPHP hotplug context objects analogous to
      acpi_scan_drop_device(), but that would duplicate the code in that
      function and in acpi_device_del_work_fn().  For this reason, it's
      better to modify ACPIPHP to attach its device hotplug contexts to
      struct device objects representing hotplug devices and make it
      use acpi_hotplug_notify_cb() as its notify handler.  At the same
      time, acpi_device_hotplug() can be modified to dispatch the new
      .hp.event() callback pointing to acpiphp_hotplug_event() from ACPI
      device objects associated with PCI devices or use the generic
      ACPI device hotplug code for device objects with matching scan
      handlers.
      
      This allows the existing code duplication between ACPIPHP and the
      ACPI core to be reduced too and makes further ACPI-based device
      hotplug consolidation possible.
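
      A sketch of the context shape this implies; the struct and field
      names are assumptions for illustration:

      /* Attached to struct acpi_device instead of a bare ACPI handle. */
      struct acpi_hotplug_context {
              struct acpi_device *self;
              /* Points to acpiphp_hotplug_event() for ACPI device objects
               * associated with PCI devices. */
              int (*event)(struct acpi_device *adev, u32 type);
      };

      /* acpi_device_hotplug() can then dispatch along these lines: */
      if (adev->hp)
              error = adev->hp->event(adev, type);
      else if (adev->handler)
              error = acpi_generic_hotplug_event(adev, type);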
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
  2. 06 Feb 2014, 14 commits
  3. 04 Feb 2014, 5 commits
    • ACPI / hotplug / PCI: Fix bridge removal race vs dock events · af9d8adc
      Committed by Rafael J. Wysocki
      If a PCI bridge with an ACPIPHP context attached is removed via
      sysfs, the code path executed as a result is the following:
      
      pci_stop_and_remove_bus_device_locked
       pci_remove_bus
        pcibios_remove_bus
         acpi_pci_remove_bus
          acpiphp_remove_slots
           cleanup_bridge
            unregister_hotplug_dock_device (drops dock references to the bridge)
           put_bridge
            free_bridge
             acpiphp_put_context (for each child, under context lock)
              kfree (context)
      
      Now, if a dock event affecting one of the bridge's child devices
      occurs (roughly at the same time), it will lead to the following code
      path:
      
      acpi_dock_deferred_cb
       dock_notify
        handle_eject_request
         hot_remove_dock_devices
          dock_hotplug_event
           hotplug_event (dereferences context)
      
      That may lead to a kernel crash in hotplug_event() if it is executed
      after the last kfree() in the bridge removal code path.
      
      To prevent that from happening, add a wrapper around hotplug_event()
      called dock_event() and point the .handler pointer in acpiphp_dock_ops
      to it.  Make that wrapper retrieve the device's ACPIPHP context using
      acpiphp_get_context() (instead of taking it from the data argument)
      under acpiphp_context_lock and check if the parent bridge's
      is_going_away flag is set.  If that flag is set, it will return
      immediately and if it is not set it will grab a reference to the
      device's parent bridge before executing hotplug_event().
      
      Then, in the above scenario, the reference to the parent bridge
      held by dock_event() will prevent free_bridge() from being executed
      for it until hotplug_event() returns.
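
      A sketch of the wrapper described above (simplified; details may
      differ from the final code):

      static void dock_event(acpi_handle handle, u32 type, void *data)
      {
              struct acpiphp_context *context;

              mutex_lock(&acpiphp_context_lock);
              context = acpiphp_get_context(handle);
              if (!context || context->func.parent->is_going_away) {
                      mutex_unlock(&acpiphp_context_lock);
                      return;
              }
              /* Pin the parent bridge so free_bridge() cannot run under us. */
              get_bridge(context->func.parent);
              acpiphp_put_context(context);
              mutex_unlock(&acpiphp_context_lock);

              hotplug_event(handle, type, data);

              put_bridge(context->func.parent);
      }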
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
    • ACPI / hotplug / PCI: Fix bridge removal race in handle_hotplug_event() · 1b360f44
      Committed by Rafael J. Wysocki
      If a PCI bridge with an ACPIPHP context attached is removed via
      sysfs, the code path executed as a result is the following:
      
      pci_stop_and_remove_bus_device_locked
       pci_remove_bus
        pcibios_remove_bus
         acpi_pci_remove_bus
          acpiphp_remove_slots
           cleanup_bridge
           put_bridge
            free_bridge
             acpiphp_put_context (for each child, under context lock)
              kfree (child context)
      
      Now, if a hotplug notify is dispatched for one of the bridge's
      children and the timing is such that handle_hotplug_event() for
      that notify is executed while free_bridge() above is running,
      the get_bridge(context->func.parent) in handle_hotplug_event()
      will not really help, because it is too late to prevent the bridge
      from going away and the child's context may be freed before
      hotplug_event_work() scheduled from handle_hotplug_event()
      dereferences the pointer to it passed via the data argument.
      That will cause a kernel crash to happen in hotplug_event_work().
      
      To prevent that from happening, make handle_hotplug_event()
      check the is_going_away flag of the function's parent bridge
      (under acpiphp_context_lock) and bail out if it's set.  Also,
      make cleanup_bridge() set the bridge's is_going_away flag under
      acpiphp_context_lock so that it cannot be changed between the
      check and the subsequent get_bridge(context->func.parent) in
      handle_hotplug_event().
      
      Then, in the above scenario, handle_hotplug_event() will notice
      that context->func.parent->is_going_away is already set and it
      will exit immediately preventing the crash from happening.
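
      A sketch of the two sides of the fix (simplified):

      /* cleanup_bridge(): mark the bridge under the context lock. */
      mutex_lock(&acpiphp_context_lock);
      bridge->is_going_away = true;
      mutex_unlock(&acpiphp_context_lock);

      /* handle_hotplug_event(): check the flag under the same lock. */
      static void handle_hotplug_event(acpi_handle handle, u32 type, void *data)
      {
              struct acpiphp_context *context = data;

              mutex_lock(&acpiphp_context_lock);
              if (context->func.parent->is_going_away) {
                      mutex_unlock(&acpiphp_context_lock);
                      return;         /* too late; bail out */
              }
              get_bridge(context->func.parent);
              mutex_unlock(&acpiphp_context_lock);
              /* ... queue hotplug_event_work() as before ... */
      }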
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
    • ACPI / hotplug / PCI: Scan root bus under the PCI rescan-remove lock · d42f5da2
      Committed by Rafael J. Wysocki
      Since acpiphp_check_bridge() called by acpiphp_check_host_bridge()
      does things that require PCI rescan-remove locking around it,
      make acpiphp_check_host_bridge() use that locking.
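
      A sketch of the change; acpiphp_handle_to_bridge() is taken from the
      ACPIPHP code, but treat the exact shape as an assumption:

      void acpiphp_check_host_bridge(acpi_handle handle)
      {
              struct acpiphp_bridge *bridge = acpiphp_handle_to_bridge(handle);

              if (bridge) {
                      /* Serialize against sysfs-driven rescan/remove. */
                      pci_lock_rescan_remove();
                      acpiphp_check_bridge(bridge);
                      pci_unlock_rescan_remove();
                      put_bridge(bridge);
              }
      }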
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Tested-by: Mika Westerberg <mika.westerberg@linux.intel.com>
    • ACPI / hotplug / PCI: Move PCI rescan-remove locking to hotplug_event() · f41b3261
      Committed by Rafael J. Wysocki
      Commit 9217a984 (ACPI / hotplug / PCI: Use global PCI rescan-remove
      locking) modified ACPIPHP to protect its PCI device removal and addition
      code paths from races against sysfs-driven rescan and remove operations
      with the help of PCI rescan-remove locking.  However, it overlooked
      the fact that hotplug_event_work() is not the only caller of
      hotplug_event(), which may also be called from dock_hotplug_event(),
      and that code path is missing the PCI rescan-remove locking.  This
      means that, although
      the PCI rescan-remove lock is held as appropriate during the handling
      of events originating from handle_hotplug_event(), the ACPIPHP's
      operations resulting from dock events may still suffer the race
      conditions that commit 9217a984 was supposed to eliminate.
      
      To address that problem, move the PCI rescan-remove locking from
      hotplug_event_work() to hotplug_event() so that it is used regardless
      of the way that function is invoked.
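
      A sketch of the move (simplified; only the lock placement matters
      here):

      static void hotplug_event(acpi_handle handle, u32 type, void *data)
      {
              /* ... */
              pci_lock_rescan_remove();   /* was taken in hotplug_event_work() */

              /* handle bus check / device check / eject request */

              pci_unlock_rescan_remove();
              /* ... */
      }

      Both hotplug_event_work() and dock_hotplug_event() now reach the lock
      through this single call site.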
      
      Revamps: 9217a984 (ACPI / hotplug / PCI: Use global PCI rescan-remove locking)
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Tested-by: Mika Westerberg <mika.westerberg@linux.intel.com>
    • ACPI / hotplug / PCI: Remove entries from bus->devices in reverse order · 2d7c1b77
      Committed by Rafael J. Wysocki
      According to the changelog of commit 29ed1f29 (PCI: pciehp: Fix null
      pointer deref when hot-removing SR-IOV device) it is unsafe to walk the
      bus->devices list of a PCI bus and remove devices from it in direct order,
      because that may lead to NULL pointer dereferences related to virtual
      functions.
      
      For this reason, change all of the bus->devices list walks in
      acpiphp_glue.c during which devices may be removed to be carried out in
      reverse order.
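
      Illustrative shape of such a walk, using the list helpers from
      <linux/list.h> (locals assumed):

      struct pci_dev *dev, *tmp;

      /*
       * Walk bus->devices backwards so virtual functions are removed
       * before the physical function they belong to.
       */
      list_for_each_entry_safe_reverse(dev, tmp, &bus->devices, bus_list)
              pci_stop_and_remove_bus_device(dev);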
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Tested-by: Mika Westerberg <mika.westerberg@linux.intel.com>
  4. 03 Feb 2014, 3 commits
  5. 02 Feb 2014, 1 commit
    • Revert "PCI: Remove from bus_list and release resources in pci_release_dev()" · 04480094
      Committed by Rafael J. Wysocki
      Revert commit ef83b078 "PCI: Remove from bus_list and release
      resources in pci_release_dev()" that made some nasty race conditions
      become possible.  For example, if a Thunderbolt link is unplugged
      and then replugged immediately, the pci_release_dev() resulting from
      the hot-remove code path may be racing with the hot-add code path
      which after that commit causes various kinds of breakage to happen
      (up to and including a hard crash of the whole system).
      
      Moreover, the problem that commit ef83b078 attempted to address
      cannot happen any more after commit 8a4c5c32 "PCI: Check parent
      kobject in pci_destroy_dev()", because pci_destroy_dev() will now
      return immediately if it has already been executed for the given
      device.
      
      Note, however, that the invocation of msi_remove_pci_irq_vectors()
      that commit ef83b078 removed from pci_free_resources() is not added
      back, because subsequent code changes depend on that modification.
      
      Fixes: ef83b078 (PCI: Remove from bus_list and release resources in pci_release_dev())
      Reported-by: Mika Westerberg <mika.westerberg@linux.intel.com>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 01 Feb 2014, 2 commits
  7. 31 Jan 2014, 13 commits
    • drivers: xen: deaggressive selfballoon driver · bc1b0df5
      Committed by Bob Liu
      The current xen-selfballoon driver is too aggressive, which may cause
      OOM to be triggered more often, e.g. this bug reported by James:
      https://lkml.org/lkml/2013/11/21/158
      
      There are two main reasons:
      1) The original goal_page didn't account for some pages used by kernel
      space, like slab pages and pages used by device drivers.
      
      2) The balloon driver may not give memory back to the guest OS fast
      enough when the workload suddenly acquires a lot of physical memory.
      
      In both cases, the guest OS will suffer from memory pressure and OOM
      may be triggered.
      
      The fix is to make the xen-selfballoon driver less aggressive by
      adding an extra 10% of total RAM pages to goal_page.
      It's more valuable to keep the guest system reliable and responsive
      than to balloon those 10% of pages out to Xen.
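
      A sketch of the adjustment, assuming the driver accumulates its
      target in a local named goal_pages (name assumed) and uses the
      kernel's totalram_pages:

      /* Leave slack instead of ballooning all reclaimable memory away:
       * add 10% of total RAM pages to the self-balloon target. */
      goal_pages += totalram_pages / 10;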
      Signed-off-by: Bob Liu <bob.liu@oracle.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
    • xen/grant-table: Avoid m2p_override during mapping · 08ece5bb
      Committed by Zoltan Kiss
      The grant mapping API does m2p_override unnecessarily: only gntdev
      needs it; for blkback and the future netback patches it just causes
      lock contention, as those pages never go to userspace. Therefore this
      series does the following:
      - the original functions were renamed to __gnttab_[un]map_refs, with a new
        parameter m2p_override
      - based on m2p_override either they follow the original behaviour, or just set
        the private flag and call set_phys_to_machine
      - gnttab_[un]map_refs are now wrappers that call __gnttab_[un]map_refs
        with m2p_override false
      - a new function, gnttab_[un]map_refs_userspace, provides the old
        behaviour (see the sketch after the version notes below)
      
      It also removes a stray space from page.h and changes ret to 0 in the
      XENFEAT_auto_translated_physmap case, as that is the only possible
      return value there.
      
      v2:
      - move the storing of the old mfn in page->index to gnttab_map_refs
      - move the function header update to a separate patch
      
      v3:
      - a new approach to retain the old behaviour where it is needed
      - squash the patches into one
      
      v4:
      - move out the common bits from m2p* functions, and pass pfn/mfn as parameter
      - clear page->private before doing anything with the page, so m2p_find_override
        won't race with this
      
      v5:
      - change return value handling in __gnttab_[un]map_refs
      - remove a stray space in page.h
      - add detail why ret = 0 now at some places
      
      v6:
      - don't pass pfn to m2p* functions, just get it locally
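
      A sketch of the wrapper arrangement described above; the signatures
      are simplified and the placement of the m2p_override parameter is an
      assumption:

      int gnttab_map_refs(struct gnttab_map_grant_ref *map_ops,
                          struct gnttab_map_grant_ref *kmap_ops,
                          struct page **pages, unsigned int count)
      {
              /* Kernel-only users (blkback, netback): skip m2p_override. */
              return __gnttab_map_refs(map_ops, kmap_ops, pages, count, false);
      }

      int gnttab_map_refs_userspace(struct gnttab_map_grant_ref *map_ops,
                                    struct gnttab_map_grant_ref *kmap_ops,
                                    struct page **pages, unsigned int count)
      {
              /* gntdev: pages do reach userspace, keep the old behaviour. */
              return __gnttab_map_refs(map_ops, kmap_ops, pages, count, true);
      }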
      Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com>
      Suggested-by: David Vrabel <david.vrabel@citrix.com>
      Acked-by: David Vrabel <david.vrabel@citrix.com>
      Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
    • zram: remove zram->lock in read path and change it with mutex · e46e3315
      Committed by Minchan Kim
      Finally, we have separated the zram->lock dependency from the 32-bit
      stat/table handling, so there is no longer any reason to hold an
      rw_semaphore across the read and write paths. This patch removes the
      lock from the read path entirely and replaces the rw_semaphore with a
      mutex, so that:
      
      old:
      
        read-read: OK
        read-write: NO
        write-write: NO
      
      Now:
      
        read-read: OK
        read-write: OK
        write-write: NO
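
      A sketch of the resulting scheme (function names are from the zram
      driver; treat the exact shape as an assumption):

        static int zram_bvec_rw(struct zram *zram, struct bio_vec *bvec,
                                u32 index, int offset, struct bio *bio, int rw)
        {
                int ret;

                if (rw == READ) {
                        /* No zram->lock; meta->tb_lock protects the table. */
                        ret = zram_bvec_read(zram, bvec, index, offset, bio);
                } else {
                        mutex_lock(&zram->lock);        /* was down_write() */
                        ret = zram_bvec_write(zram, bvec, index, offset);
                        mutex_unlock(&zram->lock);
                }
                return ret;
        }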
      
      The data below shows the mixed workload performing about 11 times
      better. There is also an improvement on the write-write path, because
      the current rw_semaphore doesn't support SPIN_ON_OWNER; that's a side
      effect, but a good one for us.
      
      Write-related tests perform better (by 61% to 1058%), while the read
      path varies between -2.22% and 1.45%, all marginal within stddev.
      
        CPU 12
        iozone -t -T -l 12 -u 12 -r 16K -s 60M -I +Z -V 0
      
        ==Initial write                ==Initial write
        records: 10                    records: 10
        avg:  516189.16                avg:  839907.96
        std:   22486.53 (4.36%)        std:   47902.17 (5.70%)
        max:  546970.60                max:  909910.35
        min:  481131.54                min:  751148.38
        ==Rewrite                      ==Rewrite
        records: 10                    records: 10
        avg:  509527.98                avg: 1050156.37
        std:   45799.94 (8.99%)        std:   40695.44 (3.88%)
        max:  611574.27                max: 1111929.26
        min:  443679.95                min:  980409.62
        ==Read                         ==Read
        records: 10                    records: 10
        avg: 4408624.17                avg: 4472546.76
        std:  281152.61 (6.38%)        std:  163662.78 (3.66%)
        max: 4867888.66                max: 4727351.03
        min: 4058347.69                min: 4126520.88
        ==Re-read                      ==Re-read
        records: 10                    records: 10
        avg: 4462147.53                avg: 4363257.75
        std:  283546.11 (6.35%)        std:  247292.63 (5.67%)
        max: 4912894.44                max: 4677241.75
        min: 4131386.50                min: 4035235.84
        ==Reverse Read                 ==Reverse Read
        records: 10                    records: 10
        avg: 4565865.97                avg: 4485818.08
        std:  313395.63 (6.86%)        std:  248470.10 (5.54%)
        max: 5232749.16                max: 4789749.94
        min: 4185809.62                min: 3963081.34
        ==Stride read                  ==Stride read
        records: 10                    records: 10
        avg: 4515981.80                avg: 4418806.01
        std:  211192.32 (4.68%)        std:  212837.97 (4.82%)
        max: 4889287.28                max: 4686967.22
        min: 4210362.00                min: 4083041.84
        ==Random read                  ==Random read
        records: 10                    records: 10
        avg: 4410525.23                avg: 4387093.18
        std:  236693.22 (5.37%)        std:  235285.23 (5.36%)
        max: 4713698.47                max: 4669760.62
        min: 4057163.62                min: 3952002.16
        ==Mixed workload               ==Mixed workload
        records: 10                    records: 10
        avg:  243234.25                avg: 2818677.27
        std:   28505.07 (11.72%)       std:  195569.70 (6.94%)
        max:  288905.23                max: 3126478.11
        min:  212473.16                min: 2484150.69
        ==Random write                 ==Random write
        records: 10                    records: 10
        avg:  555887.07                avg: 1053057.79
        std:   70841.98 (12.74%)       std:   35195.36 (3.34%)
        max:  683188.28                max: 1096125.73
        min:  437299.57                min:  992481.93
        ==Pwrite                       ==Pwrite
        records: 10                    records: 10
        avg:  501745.93                avg:  810363.09
        std:   16373.54 (3.26%)        std:   19245.01 (2.37%)
        max:  518724.52                max:  833359.70
        min:  464208.73                min:  765501.87
        ==Pread                        ==Pread
        records: 10                    records: 10
        avg: 4539894.60                avg: 4457680.58
        std:  197094.66 (4.34%)        std:  188965.60 (4.24%)
        max: 4877170.38                max: 4689905.53
        min: 4226326.03                min: 4095739.72
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zram: remove workqueue for freeing removed pending slot · f614a9f4
      Committed by Minchan Kim
      Commit a0c516cb ("zram: don't grab mutex in zram_slot_free_noity")
      introduced free request pending code to avoid scheduling by mutex under
      spinlock and it was a mess which made code lenghty and increased
      overhead.
      
      Now, we don't need zram->lock any more to free slot so this patch
      reverts it and then, tb_lock should protect it.
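
      With tb_lock in place, the notify hook can free the slot directly; a
      sketch (stat accounting omitted):

      static void zram_slot_free_notify(struct request_queue *queue,
                                        unsigned long index)
      {
              struct zram *zram = queue->queuedata;
              struct zram_meta *meta = zram->meta;

              /* No mutex and no deferred work item: take the table lock
               * and free the slot on the spot. */
              write_lock(&meta->tb_lock);
              zram_free_page(zram, index);
              write_unlock(&meta->tb_lock);
      }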
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zram: introduce zram->tb_lock · 92967471
      Committed by Minchan Kim
      Currently, the zram table is protected by zram->lock, but that is a
      rather coarse-grained lock and it hurts scalability.
      
      Let's use our own rwlock instead of depending on zram->lock. This
      patch adds new locking, so it will obviously slow things down, but it
      is just preparation for removing the coarse-grained rw_semaphore
      (i.e., zram->lock), which is the hurdle for zram scalability.
      
      The final patch in this series will remove the lock from the read path
      and replace the rw_semaphore with a mutex in the write path. As a
      bonus, we can drop the pending-slot-free mess in the next patch.
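
      A sketch of the new fine-grained lock (struct abridged, locals shown
      for illustration):

      struct zram_meta {
              rwlock_t tb_lock;       /* protects table[] */
              struct table *table;    /* other fields elided */
      };

      /* Read side, e.g. while looking up a stored object: */
      unsigned long handle;
      u16 size;

      read_lock(&meta->tb_lock);
      handle = meta->table[index].handle;
      size = meta->table[index].size;
      read_unlock(&meta->tb_lock);

      /* Write side, e.g. when freeing or replacing an entry: */
      write_lock(&meta->tb_lock);
      zram_free_page(zram, index);
      write_unlock(&meta->tb_lock);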
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zram: use atomic operation for stat · deb0bdeb
      Committed by Minchan Kim
      Some of the fields in zram->stats are protected by zram->lock, which
      is rather coarse-grained, so let's use atomic operations instead of
      explicit locking.
      
      This patch prepares for removing the read path's dependency on
      zram->lock, a very coarse-grained rw_semaphore. Of course, the new
      atomic operations have a cost, so they might slow things down, but my
      12-CPU test couldn't spot any regression. All gains/losses are
      marginal, within stddev.
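
      A sketch of the conversion (the stat field is chosen for
      illustration):

      /* Before: the counter was only safe under zram->lock. */
      zram->stats.failed_writes++;

      /* After: the field becomes atomic64_t and needs no lock. */
      atomic64_inc(&zram->stats.failed_writes);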
      
        iozone -t -T -l 12 -u 12 -r 16K -s 60M -I +Z -V 0
      
        ==Initial write                ==Initial write
        records: 50                    records: 50
        avg:  412875.17                avg:  415638.23
        std:   38543.12 (9.34%)        std:   36601.11 (8.81%)
        max:  521262.03                max:  502976.72
        min:  343263.13                min:  351389.12
        ==Rewrite                      ==Rewrite
        records: 50                    records: 50
        avg:  416640.34                avg:  397914.33
        std:   60798.92 (14.59%)       std:   46150.42 (11.60%)
        max:  543057.07                max:  522669.17
        min:  304071.67                min:  316588.77
        ==Read                         ==Read
        records: 50                    records: 50
        avg: 4147338.63                avg: 4070736.51
        std:  179333.25 (4.32%)        std:  223499.89 (5.49%)
        max: 4459295.28                max: 4539514.44
        min: 3753057.53                min: 3444686.31
        ==Re-read                      ==Re-read
        records: 50                    records: 50
        avg: 4096706.71                avg: 4117218.57
        std:  229735.04 (5.61%)        std:  171676.25 (4.17%)
        max: 4430012.09                max: 4459263.94
        min: 2987217.80                min: 3666904.28
        ==Reverse Read                 ==Reverse Read
        records: 50                    records: 50
        avg: 4062763.83                avg: 4078508.32
        std:  186208.46 (4.58%)        std:  172684.34 (4.23%)
        max: 4401358.78                max: 4424757.22
        min: 3381625.00                min: 3679359.94
        ==Stride read                  ==Stride read
        records: 50                    records: 50
        avg: 4094933.49                avg: 4082170.22
        std:  185710.52 (4.54%)        std:  196346.68 (4.81%)
        max: 4478241.25                max: 4460060.97
        min: 3732593.23                min: 3584125.78
        ==Random read                  ==Random read
        records: 50                    records: 50
        avg: 4031070.04                avg: 4074847.49
        std:  192065.51 (4.76%)        std:  206911.33 (5.08%)
        max: 4356931.16                max: 4399442.56
        min: 3481619.62                min: 3548372.44
        ==Mixed workload               ==Mixed workload
        records: 50                    records: 50
        avg:  149925.73                avg:  149675.54
        std:    7701.26 (5.14%)        std:    6902.09 (4.61%)
        max:  191301.56                max:  175162.05
        min:  133566.28                min:  137762.87
        ==Random write                 ==Random write
        records: 50                    records: 50
        avg:  404050.11                avg:  393021.47
        std:   58887.57 (14.57%)       std:   42813.70 (10.89%)
        max:  601798.09                max:  524533.43
        min:  325176.99                min:  313255.34
        ==Pwrite                       ==Pwrite
        records: 50                    records: 50
        avg:  411217.70                avg:  411237.96
        std:   43114.99 (10.48%)       std:   33136.29 (8.06%)
        max:  530766.79                max:  471899.76
        min:  320786.84                min:  317906.94
        ==Pread                        ==Pread
        records: 50                    records: 50
        avg: 4154908.65                avg: 4087121.92
        std:  151272.08 (3.64%)        std:  219505.04 (5.37%)
        max: 4459478.12                max: 4435857.38
        min: 3730512.41                min: 3101101.67
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zram: remove unnecessary free · 874e3cdd
      Committed by Minchan Kim
      Commit a0c516cb ("zram: don't grab mutex in zram_slot_free_noity")
      introduced pending zram slot free in zram's write path in case of
      missing slot free by memory allocation failure in zram_slot_free_notify
      but it is not necessary because we have already freed the slot right
      before overwriting.
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zram: delay pending free request in read path · 9b353db1
      Committed by Minchan Kim
      Sergey reported that we don't need to handle pending free requests on
      every I/O, so this patch removes that handling from the read path
      while keeping it in the write path.
      
      Consider the example below.
      
      The swap subsystem asks zram to free block "A" via
      swap_slot_free_notify, but zram queues the request without actually
      freeing the block. The swap subsystem then allocates block "A" for new
      data; the long-pending free request is finally handled, and zram
      blindly frees the new data on block "A".  :(
      
      That's why we cannot remove the pending-free handling right before a
      zram write.
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reported-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zram: fix race between reset and flushing pending work · da4a0412
      Committed by Minchan Kim
      Dan and Sergey reported a race between reset and the flushing of
      pending work: reset can free zram->meta while zram_slot_free may still
      access zram->meta if a new request is added during the race window,
      leading to an oops.
      
      This patch moves the flush to after init_lock is taken, which prevents
      new requests and thus closes the race.
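
      A sketch of the fix, assuming the pending-free work item is
      zram->free_work (name assumed):

      static void zram_reset_device(struct zram *zram)
      {
              down_write(&zram->init_lock);
              /*
               * Flush pending slot-free work only after init_lock is held,
               * so no new request can queue work that would touch
               * zram->meta while it is being freed below.
               */
              flush_work(&zram->free_work);
              /* ... free zram->meta and reset stats ... */
              up_write(&zram->init_lock);
      }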
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zram: add copyright · 7bfb3de8
      Committed by Minchan Kim
      Add my copyright to the zram source code which I maintain.
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zram: remove old private project comment · 49061236
      Committed by Minchan Kim
      Remove the old private compcache project address; upcoming patches
      should be sent to LKML, since the Linux kernel community will take
      care of the code from now on.
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zram: promote zram from staging · cd67e10a
      Committed by Minchan Kim
      Zram has lived in staging for a LONG LONG time and has been fixed and
      improved by many contributors, so the code is clean and stable now.
      Of course, there are lots of products using zram in real practice.
      
      Major TV companies have used zram as swap for two years now; recently
      our production team released an Android smartphone that uses zram as
      swap, and Android KitKat has started using zram for small-memory
      smartphones. Google reportedly ships zram in Chrome OS, CyanogenMod
      has used it for a long time, and I heard some distros have used the
      zram block device for tmpfs. I have seen reports from many other
      people as well; for example, Lubuntu has started to use it.
      
      The benefit of zram is very clear. In my experience, one benefit was
      removing jitter from a video application under background memory
      pressure. Part of that comes from the efficient memory usage that
      compression brings, but the bigger issue is whether swap is present in
      the system at all. Recent mobile platforms use Java, so there are many
      anonymous pages, but embedded systems are normally reluctant to use
      eMMC or SD cards as swap because of wear-leveling and latency issues.
      If we do not use swap, we can't reclaim anonymous pages, and in the
      end we can encounter OOM kill.  :(
      
      Even when real storage is available as swap, it is a problem too,
      because slow swap storage can end up making the system very
      unresponsive.
      
      Quote from Luigi at Google:
       "Since Chrome OS was mentioned: the main reason why we don't use swap
        to a disk (rotating or SSD) is because it doesn't degrade gracefully
        and leads to a bad interactive experience.  Generally we prefer to
        manage RAM at a higher level, by transparently killing and restarting
        processes.  But we noticed that zram is fast enough to be competitive
        with the latter, and it lets us make more efficient use of the
        available RAM."
      
      He announced it here:
      http://www.spinics.net/lists/linux-mm/msg57717.html
      
      Another use case is zram as a plain block device. Since zram is a
      block device, anyone can format and mount it, and some people on the
      internet already run /var/tmp on zram:
      http://forums.gentoo.org/viewtopic-t-838198-start-0.html
      
      Let's promote zram and enhance/maintain it instead of removing it.
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Acked-by: Nitin Gupta <ngupta@vflare.org>
      Acked-by: Pekka Enberg <penberg@kernel.org>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zsmalloc: move it under mm · bcf1647d
      Committed by Minchan Kim
      This patch moves zsmalloc under the mm directory.
      
      Before that, this description explains why we needed a custom
      allocator.
      
      Zsmalloc is a new slab-based memory allocator for storing compressed
      pages.  It is designed for low fragmentation and a high allocation
      success rate for large objects, up to PAGE_SIZE in size.
      
      zsmalloc differs from the kernel slab allocator in two primary ways to
      achieve these design goals.
      
      zsmalloc never requires high order page allocations to back slabs, or
      "size classes" in zsmalloc terms.  Instead it allows multiple
      single-order pages to be stitched together into a "zspage" which backs
      the slab.  This allows for higher allocation success rate under memory
      pressure.
      
      Also, zsmalloc allows objects to span page boundaries within the
      zspage.  This allows for lower fragmentation than could be had with
      the kernel slab allocator for objects between PAGE_SIZE/2 and
      PAGE_SIZE.  With the kernel slab allocator, if a page compresses to
      60% of its original size, the memory savings gained through
      compression are lost in fragmentation, because another object of the
      same size can't be stored in the leftover space.
      
      This ability to span pages means that zsmalloc allocations are not
      directly addressable by the user.  The user is given a
      non-dereferenceable handle in response to an allocation request.  That
      handle must be mapped, using zs_map_object(), which returns a pointer
      to the mapped region that can be used.  The mapping is necessary since
      the object data may reside in two different noncontiguous pages.
      
      zsmalloc fulfills zram's allocation needs perfectly.
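
      A minimal sketch of the map/unmap flow described above (kernel-side
      usage; exact signatures may differ between versions):

      #include <linux/string.h>
      #include <linux/zsmalloc.h>

      static int zsmalloc_demo(const void *src, size_t size)
      {
              struct zs_pool *pool;
              unsigned long handle;
              void *dst;

              pool = zs_create_pool(GFP_KERNEL);
              if (!pool)
                      return -ENOMEM;

              handle = zs_malloc(pool, size); /* opaque handle, not a pointer */
              if (!handle) {
                      zs_destroy_pool(pool);
                      return -ENOMEM;
              }

              /* The object may span two noncontiguous pages, so it must be
               * mapped before it can be touched. */
              dst = zs_map_object(pool, handle, ZS_MM_WO);
              memcpy(dst, src, size);
              zs_unmap_object(pool, handle);

              zs_free(pool, handle);
              zs_destroy_pool(pool);
              return 0;
      }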
      
      [sjenning@linux.vnet.ibm.com: borrow Seth's quote]
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Nitin Gupta <ngupta@vflare.org>
      Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>