1. 30 3月, 2014 1 次提交
    • A
      rbd: drop an unsafe assertion · 638c323c
      Alex Elder 提交于
      Olivier Bonvalet reported having repeated crashes due to a failed
      assertion he was hitting in rbd_img_obj_callback():
      
          Assertion failure in rbd_img_obj_callback() at line 2165:
      	rbd_assert(which >= img_request->next_completion);
      
      With a lot of help from Olivier with reproducing the problem
      we were able to determine the object and image requests had
      already been completed (and often freed) at the point the
      assertion failed.
      
      There was a great deal of discussion on the ceph-devel mailing list
      about this.  The problem only arose when there were two (or more)
      object requests in an image request, and the problem was always
      seen when the second request was being completed.
      
      The problem is due to a race in the window between setting the
      "done" flag on an object request and checking the image request's
      next completion value.  When the first object request completes, it
      checks to see if its successor request is marked "done", and if
      so, that request is also completed.  In the process, the image
      request's next_completion value is updated to reflect that both
      the first and second requests are completed.  By the time the
      second request is able to check the next_completion value, it
      has been set to a value *greater* than its own "which" value,
      which caused an assertion to fail.
      
      Fix this problem by skipping over any completion processing
      unless the completing object request is the next one expected.
      Test only for inequality (not >=), and eliminate the bad
      assertion.
      Tested-by: NOlivier Bonvalet <ob@daevel.fr>
      Signed-off-by: NAlex Elder <elder@linaro.org>
      Reviewed-by: NSage Weil <sage@inktank.com>
      Reviewed-by: NIlya Dryomov <ilya.dryomov@inktank.com>
      638c323c
  2. 11 3月, 2014 1 次提交
    • J
      mtip32xx: fix bad use of smp_processor_id() · 7f328908
      Jens Axboe 提交于
      mtip_pci_probe() dumps the current CPU when loaded, but it does
      so in a preemptible context. Hence smp_processor_id() correctly
      warns:
      
      BUG: using smp_processor_id() in preemptible [00000000] code: systemd-udevd/155
      caller is mtip_pci_probe+0x53/0x880 [mtip32xx]
      
      Switch to raw_smp_processor_id(), since it's just informational
      and persistent accuracy isn't important.
      Signed-off-by: NJens Axboe <axboe@fb.com>
      7f328908
  3. 04 3月, 2014 2 次提交
    • M
      zram: avoid null access when fail to alloc meta · db5d711e
      Minchan Kim 提交于
      zram_meta_alloc could fail so caller should check it.  Otherwise, your
      system will hang.
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NJerome Marchand <jmarchan@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      db5d711e
    • D
      mm: close PageTail race · 668f9abb
      David Rientjes 提交于
      Commit bf6bddf1 ("mm: introduce compaction and migration for
      ballooned pages") introduces page_count(page) into memory compaction
      which dereferences page->first_page if PageTail(page).
      
      This results in a very rare NULL pointer dereference on the
      aforementioned page_count(page).  Indeed, anything that does
      compound_head(), including page_count() is susceptible to racing with
      prep_compound_page() and seeing a NULL or dangling page->first_page
      pointer.
      
      This patch uses Andrea's implementation of compound_trans_head() that
      deals with such a race and makes it the default compound_head()
      implementation.  This includes a read memory barrier that ensures that
      if PageTail(head) is true that we return a head page that is neither
      NULL nor dangling.  The patch then adds a store memory barrier to
      prep_compound_page() to ensure page->first_page is set.
      
      This is the safest way to ensure we see the head page that we are
      expecting, PageTail(page) is already in the unlikely() path and the
      memory barriers are unfortunately required.
      
      Hugetlbfs is the exception, we don't enforce a store memory barrier
      during init since no race is possible.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: Holger Kiehl <Holger.Kiehl@dwd.de>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      668f9abb
  4. 19 2月, 2014 1 次提交
  5. 12 2月, 2014 1 次提交
  6. 11 2月, 2014 2 次提交
  7. 08 2月, 2014 6 次提交
  8. 03 2月, 2014 2 次提交
  9. 31 1月, 2014 11 次提交
    • Z
      xen/grant-table: Avoid m2p_override during mapping · 08ece5bb
      Zoltan Kiss 提交于
      The grant mapping API does m2p_override unnecessarily: only gntdev needs it,
      for blkback and future netback patches it just cause a lock contention, as
      those pages never go to userspace. Therefore this series does the following:
      - the original functions were renamed to __gnttab_[un]map_refs, with a new
        parameter m2p_override
      - based on m2p_override either they follow the original behaviour, or just set
        the private flag and call set_phys_to_machine
      - gnttab_[un]map_refs are now a wrapper to call __gnttab_[un]map_refs with
        m2p_override false
      - a new function gnttab_[un]map_refs_userspace provides the old behaviour
      
      It also removes a stray space from page.h and change ret to 0 if
      XENFEAT_auto_translated_physmap, as that is the only possible return value
      there.
      
      v2:
      - move the storing of the old mfn in page->index to gnttab_map_refs
      - move the function header update to a separate patch
      
      v3:
      - a new approach to retain old behaviour where it needed
      - squash the patches into one
      
      v4:
      - move out the common bits from m2p* functions, and pass pfn/mfn as parameter
      - clear page->private before doing anything with the page, so m2p_find_override
        won't race with this
      
      v5:
      - change return value handling in __gnttab_[un]map_refs
      - remove a stray space in page.h
      - add detail why ret = 0 now at some places
      
      v6:
      - don't pass pfn to m2p* functions, just get it locally
      Signed-off-by: NZoltan Kiss <zoltan.kiss@citrix.com>
      Suggested-by: NDavid Vrabel <david.vrabel@citrix.com>
      Acked-by: NDavid Vrabel <david.vrabel@citrix.com>
      Acked-by: NStefano Stabellini <stefano.stabellini@eu.citrix.com>
      Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      08ece5bb
    • M
      zram: remove zram->lock in read path and change it with mutex · e46e3315
      Minchan Kim 提交于
      Finally, we separated zram->lock dependency from 32bit stat/ table
      handling so there is no reason to use rw_semaphore between read and
      write path so this patch removes the lock from read path totally and
      changes rw_semaphore with mutex.  So, we could do
      
      old:
      
        read-read: OK
        read-write: NO
        write-write: NO
      
      Now:
      
        read-read: OK
        read-write: OK
        write-write: NO
      
      The below data proves mixed workload performs well 11 times and there is
      also enhance on write-write path because current rw-semaphore doesn't
      support SPIN_ON_OWNER.  It's side effect but anyway good thing for us.
      
      Write-related tests perform better (from 61% to 1058%) but read path has
      good/bad(from -2.22% to 1.45%) but they are all marginal within stddev.
      
        CPU 12
        iozone -t -T -l 12 -u 12 -r 16K -s 60M -I +Z -V 0
      
        ==Initial write                ==Initial write
        records: 10                    records: 10
        avg:  516189.16                avg:  839907.96
        std:   22486.53 (4.36%)        std:   47902.17 (5.70%)
        max:  546970.60                max:  909910.35
        min:  481131.54                min:  751148.38
        ==Rewrite                      ==Rewrite
        records: 10                    records: 10
        avg:  509527.98                avg: 1050156.37
        std:   45799.94 (8.99%)        std:   40695.44 (3.88%)
        max:  611574.27                max: 1111929.26
        min:  443679.95                min:  980409.62
        ==Read                         ==Read
        records: 10                    records: 10
        avg: 4408624.17                avg: 4472546.76
        std:  281152.61 (6.38%)        std:  163662.78 (3.66%)
        max: 4867888.66                max: 4727351.03
        min: 4058347.69                min: 4126520.88
        ==Re-read                      ==Re-read
        records: 10                    records: 10
        avg: 4462147.53                avg: 4363257.75
        std:  283546.11 (6.35%)        std:  247292.63 (5.67%)
        max: 4912894.44                max: 4677241.75
        min: 4131386.50                min: 4035235.84
        ==Reverse Read                 ==Reverse Read
        records: 10                    records: 10
        avg: 4565865.97                avg: 4485818.08
        std:  313395.63 (6.86%)        std:  248470.10 (5.54%)
        max: 5232749.16                max: 4789749.94
        min: 4185809.62                min: 3963081.34
        ==Stride read                  ==Stride read
        records: 10                    records: 10
        avg: 4515981.80                avg: 4418806.01
        std:  211192.32 (4.68%)        std:  212837.97 (4.82%)
        max: 4889287.28                max: 4686967.22
        min: 4210362.00                min: 4083041.84
        ==Random read                  ==Random read
        records: 10                    records: 10
        avg: 4410525.23                avg: 4387093.18
        std:  236693.22 (5.37%)        std:  235285.23 (5.36%)
        max: 4713698.47                max: 4669760.62
        min: 4057163.62                min: 3952002.16
        ==Mixed workload               ==Mixed workload
        records: 10                    records: 10
        avg:  243234.25                avg: 2818677.27
        std:   28505.07 (11.72%)       std:  195569.70 (6.94%)
        max:  288905.23                max: 3126478.11
        min:  212473.16                min: 2484150.69
        ==Random write                 ==Random write
        records: 10                    records: 10
        avg:  555887.07                avg: 1053057.79
        std:   70841.98 (12.74%)       std:   35195.36 (3.34%)
        max:  683188.28                max: 1096125.73
        min:  437299.57                min:  992481.93
        ==Pwrite                       ==Pwrite
        records: 10                    records: 10
        avg:  501745.93                avg:  810363.09
        std:   16373.54 (3.26%)        std:   19245.01 (2.37%)
        max:  518724.52                max:  833359.70
        min:  464208.73                min:  765501.87
        ==Pread                        ==Pread
        records: 10                    records: 10
        avg: 4539894.60                avg: 4457680.58
        std:  197094.66 (4.34%)        std:  188965.60 (4.24%)
        max: 4877170.38                max: 4689905.53
        min: 4226326.03                min: 4095739.72
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Tested-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e46e3315
    • M
      zram: remove workqueue for freeing removed pending slot · f614a9f4
      Minchan Kim 提交于
      Commit a0c516cb ("zram: don't grab mutex in zram_slot_free_noity")
      introduced free request pending code to avoid scheduling by mutex under
      spinlock and it was a mess which made code lenghty and increased
      overhead.
      
      Now, we don't need zram->lock any more to free slot so this patch
      reverts it and then, tb_lock should protect it.
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Tested-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f614a9f4
    • M
      zram: introduce zram->tb_lock · 92967471
      Minchan Kim 提交于
      Currently, the zram table is protected by zram->lock but it's rather
      coarse-grained lock and it makes hard for scalibility.
      
      Let's use own rwlock instead of depending on zram->lock.  This patch
      adds new locking so obviously, it would make slow but this patch is just
      prepartion for removing coarse-grained rw_semaphore(ie, zram->lock)
      which is hurdle about zram scalability.
      
      Final patch in this patchset series will remove the lock from read-path
      and change rw_semaphore with mutex in write path.  With bonus, we could
      drop pending slot free mess in next patch.
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Tested-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      92967471
    • M
      zram: use atomic operation for stat · deb0bdeb
      Minchan Kim 提交于
      Some of fields in zram->stats are protected by zram->lock which is
      rather coarse-grained so let's use atomic operation without explict
      locking.
      
      This patch is ready for removing dependency of zram->lock in read path
      which is very coarse-grained rw_semaphore.  Of course, this patch adds
      new atomic operation so it might make slow but my 12CPU test couldn't
      spot any regression.  All gain/lose is marginal within stddev.
      
        iozone -t -T -l 12 -u 12 -r 16K -s 60M -I +Z -V 0
      
        ==Initial write                ==Initial write
        records: 50                    records: 50
        avg:  412875.17                avg:  415638.23
        std:   38543.12 (9.34%)        std:   36601.11 (8.81%)
        max:  521262.03                max:  502976.72
        min:  343263.13                min:  351389.12
        ==Rewrite                      ==Rewrite
        records: 50                    records: 50
        avg:  416640.34                avg:  397914.33
        std:   60798.92 (14.59%)       std:   46150.42 (11.60%)
        max:  543057.07                max:  522669.17
        min:  304071.67                min:  316588.77
        ==Read                         ==Read
        records: 50                    records: 50
        avg: 4147338.63                avg: 4070736.51
        std:  179333.25 (4.32%)        std:  223499.89 (5.49%)
        max: 4459295.28                max: 4539514.44
        min: 3753057.53                min: 3444686.31
        ==Re-read                      ==Re-read
        records: 50                    records: 50
        avg: 4096706.71                avg: 4117218.57
        std:  229735.04 (5.61%)        std:  171676.25 (4.17%)
        max: 4430012.09                max: 4459263.94
        min: 2987217.80                min: 3666904.28
        ==Reverse Read                 ==Reverse Read
        records: 50                    records: 50
        avg: 4062763.83                avg: 4078508.32
        std:  186208.46 (4.58%)        std:  172684.34 (4.23%)
        max: 4401358.78                max: 4424757.22
        min: 3381625.00                min: 3679359.94
        ==Stride read                  ==Stride read
        records: 50                    records: 50
        avg: 4094933.49                avg: 4082170.22
        std:  185710.52 (4.54%)        std:  196346.68 (4.81%)
        max: 4478241.25                max: 4460060.97
        min: 3732593.23                min: 3584125.78
        ==Random read                  ==Random read
        records: 50                    records: 50
        avg: 4031070.04                avg: 4074847.49
        std:  192065.51 (4.76%)        std:  206911.33 (5.08%)
        max: 4356931.16                max: 4399442.56
        min: 3481619.62                min: 3548372.44
        ==Mixed workload               ==Mixed workload
        records: 50                    records: 50
        avg:  149925.73                avg:  149675.54
        std:    7701.26 (5.14%)        std:    6902.09 (4.61%)
        max:  191301.56                max:  175162.05
        min:  133566.28                min:  137762.87
        ==Random write                 ==Random write
        records: 50                    records: 50
        avg:  404050.11                avg:  393021.47
        std:   58887.57 (14.57%)       std:   42813.70 (10.89%)
        max:  601798.09                max:  524533.43
        min:  325176.99                min:  313255.34
        ==Pwrite                       ==Pwrite
        records: 50                    records: 50
        avg:  411217.70                avg:  411237.96
        std:   43114.99 (10.48%)       std:   33136.29 (8.06%)
        max:  530766.79                max:  471899.76
        min:  320786.84                min:  317906.94
        ==Pread                        ==Pread
        records: 50                    records: 50
        avg: 4154908.65                avg: 4087121.92
        std:  151272.08 (3.64%)        std:  219505.04 (5.37%)
        max: 4459478.12                max: 4435857.38
        min: 3730512.41                min: 3101101.67
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Tested-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      deb0bdeb
    • M
      zram: remove unnecessary free · 874e3cdd
      Minchan Kim 提交于
      Commit a0c516cb ("zram: don't grab mutex in zram_slot_free_noity")
      introduced pending zram slot free in zram's write path in case of
      missing slot free by memory allocation failure in zram_slot_free_notify
      but it is not necessary because we have already freed the slot right
      before overwriting.
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Tested-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      874e3cdd
    • M
      zram: delay pending free request in read path · 9b353db1
      Minchan Kim 提交于
      Sergey reported we don't need to handle pending free request every I/O
      so that this patch removes it in read path while we remain it in write
      path.
      
      Let's consider below example.
      
      Swap subsystem ask to zram "A" block free by swap_slot_free_notify but
      zram had been pended it without real freeing.  Swap subsystem allocates
      "A" block for new data but request pended for a long time just handled
      and zram blindly free new data on the "A" block.  :(
      
      That's why we couldn't remove handle pending free request right before
      zram-write.
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Reported-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Tested-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9b353db1
    • M
      zram: fix race between reset and flushing pending work · da4a0412
      Minchan Kim 提交于
      Dan and Sergey reported that there is a racy between reset and flushing
      of pending work so that it could make oops by freeing zram->meta in
      reset while zram_slot_free can access zram->meta if new request is
      adding during the race window.
      
      This patch moves flush after taking init_lock so it prevents new request
      so that it closes the race.
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Reported-by: NDan Carpenter <dan.carpenter@oracle.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Tested-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      da4a0412
    • M
      zram: add copyright · 7bfb3de8
      Minchan Kim 提交于
      Add my copyright to the zram source code which I maintain.
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7bfb3de8
    • M
      zram: remove old private project comment · 49061236
      Minchan Kim 提交于
      Remove the old private compcache project address so upcoming patches
      should be sent to LKML because we Linux kernel community will take care.
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      49061236
    • M
      zram: promote zram from staging · cd67e10a
      Minchan Kim 提交于
      Zram has lived in staging for a LONG LONG time and have been
      fixed/improved by many contributors so code is clean and stable now.  Of
      course, there are lots of product using zram in real practice.
      
      The major TV companys have used zram as swap since two years ago and
      recently our production team released android smart phone with zram
      which is used as swap, too and recently Android Kitkat start to use zram
      for small memory smart phone.  And there was a report Google released
      their ChromeOS with zram, too and cyanogenmod have been used zram long
      time ago.  And I heard some disto have used zram block device for tmpfs.
      In addition, I saw many report from many other peoples.  For example,
      Lubuntu start to use it.
      
      The benefit of zram is very clear.  With my experience, one of the
      benefit was to remove jitter of video application with backgroud memory
      pressure.  It would be effect of efficient memory usage by compression
      but more issue is whether swap is there or not in the system.  Recent
      mobile platforms have used JAVA so there are many anonymous pages.  But
      embedded system normally are reluctant to use eMMC or SDCard as swap
      because there is wear-leveling and latency issues so if we do not use
      swap, it means we can't reclaim anoymous pages and at last, we could
      encounter OOM kill.  :(
      
      Although we have real storage as swap, it was a problem, too.  Because
      it sometime ends up making system very unresponsible caused by slow swap
      storage performance.
      
      Quote from Luigi on Google
       "Since Chrome OS was mentioned: the main reason why we don't use swap
        to a disk (rotating or SSD) is because it doesn't degrade gracefully
        and leads to a bad interactive experience.  Generally we prefer to
        manage RAM at a higher level, by transparently killing and restarting
        processes.  But we noticed that zram is fast enough to be competitive
        with the latter, and it lets us make more efficient use of the
        available RAM.  " and he announced.
      http://www.spinics.net/lists/linux-mm/msg57717.html
      
      Other uses case is to use zram for block device.  Zram is block device
      so anyone can format the block device and mount on it so some guys on
      the internet start zram as /var/tmp.
      http://forums.gentoo.org/viewtopic-t-838198-start-0.html
      
      Let's promote zram and enhance/maintain it instead of removing.
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Reviewed-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Acked-by: NNitin Gupta <ngupta@vflare.org>
      Acked-by: NPekka Enberg <penberg@kernel.org>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cd67e10a
  10. 30 1月, 2014 1 次提交
  11. 28 1月, 2014 12 次提交