1. 02 Oct, 2013: 2 commits
    • SUNRPC: Enable the keepalive option for TCP sockets · 7f260e85
      Committed by Trond Myklebust
      For NFSv4 we want to avoid retransmitting RPC calls unless the TCP
      connection breaks. However we still want to detect TCP connection
      breakage as soon as possible. Do this by setting the keepalive option
      with the idle timeout and count set to the 'timeo' and 'retrans' mount
      options.
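A minimal userspace sketch of the same idea, assuming Linux's SO_KEEPALIVE, TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT socket options. The kernel patch configures the RPC transport's socket internally; the helper names, the probe interval being equal to the idle time, and the floor of one probe are assumptions of this sketch, not details taken from the commit. 'timeo' is read in tenths of a second, as nfs(5) documents it.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Keepalive settings derived from NFS-style mount options (hypothetical type). */
struct keepalive_params {
    int idle_secs;     /* idle time before the first probe (TCP_KEEPIDLE) */
    int interval_secs; /* time between probes (TCP_KEEPINTVL); assumed equal to idle */
    int probe_count;   /* unanswered probes before the peer is declared dead */
};

/* timeo is in tenths of a second, as in nfs(5); retrans is a plain count. */
struct keepalive_params keepalive_from_mount_opts(int timeo, int retrans)
{
    struct keepalive_params p;

    p.idle_secs = timeo / 10;
    if (p.idle_secs < 1)
        p.idle_secs = 1;                       /* Linux rejects TCP_KEEPIDLE < 1 */
    p.interval_secs = p.idle_secs;             /* sketch assumption */
    p.probe_count = retrans > 0 ? retrans : 1; /* sketch assumption: at least one probe */
    return p;
}

/* Turn keepalive on for a TCP socket. Returns 0, or -1 with errno set. */
int apply_keepalive(int fd, const struct keepalive_params *p)
{
    int on = 1;

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &p->idle_secs, sizeof(p->idle_secs)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &p->interval_secs, sizeof(p->interval_secs)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &p->probe_count, sizeof(p->probe_count)) < 0)
        return -1;
    return 0;
}
```

With timeo=600 and retrans=2, for example, an idle connection to a dead peer would be torn down after roughly idle + count x interval, about three minutes, without any RPC having to be retransmitted.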
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
    • NFSv4: Fix a use-after-free situation in _nfs4_proc_getlk() · a6f951dd
      Committed by Trond Myklebust
      In nfs4_proc_getlk(), when some error causes a retry of the call to
      _nfs4_proc_getlk(), we can end up with Oopses of the form
      
       BUG: unable to handle kernel NULL pointer dereference at 0000000000000134
       IP: [<ffffffff8165270e>] _raw_spin_lock+0xe/0x30
      <snip>
       Call Trace:
        [<ffffffff812f287d>] _atomic_dec_and_lock+0x4d/0x70
        [<ffffffffa053c4f2>] nfs4_put_lock_state+0x32/0xb0 [nfsv4]
        [<ffffffffa053c585>] nfs4_fl_release_lock+0x15/0x20 [nfsv4]
        [<ffffffffa0522c06>] _nfs4_proc_getlk.isra.40+0x146/0x170 [nfsv4]
        [<ffffffffa052ad99>] nfs4_proc_lock+0x399/0x5a0 [nfsv4]
      
      The problem is that we don't clear the request->fl_ops after the first
      try and so when we retry, nfs4_set_lock_state() exits early without
      setting the lock stateid.
      Regression introduced by commit 70cc6487
      (locks: make ->lock release private data before returning in GETLK case)
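A heavily simplified, hypothetical model of the failure, not the kernel's struct file_lock or nfs4_set_lock_state(): ops stands in for request->fl_ops, state for the lock stateid, and set_lock_state() copies the early exit described above. Without clearing ops after the release, a retry sees no state, which is the point where the real code ends up dereferencing NULL.

```c
#include <stdbool.h>
#include <stddef.h>

struct lock_request;

struct lock_ops {
    void (*release)(struct lock_request *req);
};

/* Hypothetical lock request: ops ~ request->fl_ops, state ~ the lock stateid. */
struct lock_request {
    const struct lock_ops *ops;
    int *state;
};

static int shared_state = 42;

/* Private data is released once GETLK returns. */
static void release_state(struct lock_request *req)
{
    req->state = NULL;
}

static const struct lock_ops model_ops = { release_state };

/* Early exit as described in the commit: ops already set means "state is set". */
static void set_lock_state(struct lock_request *req)
{
    if (req->ops != NULL)
        return;
    req->state = &shared_state;
    req->ops = &model_ops;
}

/* One GETLK attempt; returns the state it saw (NULL models the Oops). */
int *getlk_attempt(struct lock_request *req, bool clear_ops_after_release)
{
    set_lock_state(req);
    int *seen = req->state;

    if (req->ops != NULL && req->ops->release != NULL) {
        req->ops->release(req);
        if (clear_ops_after_release)
            req->ops = NULL; /* the fix: a retry starts from a clean request */
    }
    return seen;
}
```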
      Reported-by: Weston Andros Adamson <dros@netapp.com>
      Reported-by: Jorge Mora <mora@netapp.com>
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: <stable@vger.kernel.org> #2.6.22+
  2. 01 Oct, 2013: 30 commits
    • Merge tag 'nfs-for-3.12-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs · f9273188
      Committed by Linus Torvalds
      Pull NFS client bugfixes from Trond Myklebust:
       - Stable fix for Oopses in the pNFS files layout driver
       - Fix a regression when doing a non-exclusive file create on NFSv4.x
       - NFSv4.1 security negotiation fixes when looking up the root
         filesystem
       - Fix a memory ordering issue in the pNFS files layout driver
      
      * tag 'nfs-for-3.12-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
        NFS: Give "flavor" an initial value to fix a compile warning
        NFSv4.1: try SECINFO_NO_NAME flavs until one works
        NFSv4.1: Ensure memory ordering between nfs4_ds_connect and nfs4_fl_prepare_ds
        NFSv4.1: nfs4_fl_prepare_ds - fix bugs when the connect attempt fails
        NFSv4: Honour the 'opened' parameter in the atomic_open() filesystem method
    • Merge branch 'akpm' (fixes from Andrew Morton) · 522d6d38
      Committed by Linus Torvalds
      Merge misc fixes from Andrew Morton.
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (22 commits)
        pidns: fix free_pid() to handle the first fork failure
        ipc,msg: prevent race with rmid in msgsnd,msgrcv
        ipc/sem.c: update sem_otime for all operations
        mm/hwpoison: fix the lack of one reference count against poisoned page
        mm/hwpoison: fix false report on 2nd attempt at page recovery
        mm/hwpoison: fix test for a transparent huge page
        mm/hwpoison: fix traversal of hugetlbfs pages to avoid printk flood
        block: change config option name for cmdline partition parsing
        mm/mlock.c: prevent walking off the end of a pagetable in no-pmd configuration
        mm: avoid reinserting isolated balloon pages into LRU lists
        arch/parisc/mm/fault.c: fix uninitialized variable usage
        include/asm-generic/vtime.h: avoid zero-length file
        nilfs2: fix issue with race condition of competition between segments for dirty blocks
        Documentation/kernel-parameters.txt: replace kernelcore with Movable
        mm/bounce.c: fix a regression where MS_SNAP_STABLE (stable pages snapshotting) was ignored
        kernel/kmod.c: check for NULL in call_usermodehelper_exec()
        ipc/sem.c: synchronize the proc interface
        ipc/sem.c: optimize sem_lock()
        ipc/sem.c: fix race in sem_lock()
        mm/compaction.c: periodically schedule when freeing pages
        ...
    • pidns: fix free_pid() to handle the first fork failure · 314a8ad0
      Committed by Oleg Nesterov
      "case 0" in free_pid() assumes that disable_pid_allocation() should
      clear PIDNS_HASH_ADDING before the last pid goes away.
      
      However this doesn't happen if the first fork() fails to create the
      child reaper which should call disable_pid_allocation().
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: "Serge E. Hallyn" <serge@hallyn.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ipc,msg: prevent race with rmid in msgsnd,msgrcv · 4271b05a
      Committed by Davidlohr Bueso
      This fixes a race in both msgrcv() and msgsnd() between finding the
      message queue and actually operating on it: another thread can delete
      the queue underneath us if we are preempted before acquiring the
      kern_ipc_perm.lock.
      
      Manfred illustrates this nicely:
      
      Assume a preemptible kernel that is preempted just after
      
          msq = msq_obtain_object_check(ns, msqid)
      
      in do_msgrcv().  The only lock that is held is rcu_read_lock().
      
      Now the other thread processes IPC_RMID.  When the first task is
      resumed, then it will happily wait for messages on a deleted queue.
      
      Fix this by checking whether the queue has been deleted after taking the
      lock.
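The "re-check after locking" pattern can be sketched in userspace as below. It is a model under stated assumptions: a C11 atomic_flag spinlock stands in for kern_ipc_perm.lock, a deleted flag stands in for the queue's removed state, the names are hypothetical, and returning -EIDRM for a vanished queue is this sketch's choice. The lockless lookup that precedes the lock is not modelled.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a System V message queue. */
struct msq_model {
    atomic_flag lock; /* plays the role of kern_ipc_perm.lock */
    bool deleted;     /* set by the IPC_RMID path while holding the lock */
    int qnum;         /* number of queued messages */
};

static void model_lock(atomic_flag *l)
{
    while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
        ; /* spin until the holder releases the lock */
}

static void model_unlock(atomic_flag *l)
{
    atomic_flag_clear_explicit(l, memory_order_release);
}

/* IPC_RMID: mark the queue removed under the lock. */
void msq_rmid(struct msq_model *q)
{
    model_lock(&q->lock);
    q->deleted = true;
    model_unlock(&q->lock);
}

/* msgsnd()-like path. q was found earlier without the lock held, so it must
 * be re-validated once the lock is taken. Returns 0 or -EIDRM. */
int msq_send(struct msq_model *q)
{
    model_lock(&q->lock);
    if (q->deleted) {
        model_unlock(&q->lock);
        return -EIDRM;
    }
    q->qnum++;
    model_unlock(&q->lock);
    return 0;
}
```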
      Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
      Reported-by: Manfred Spraul <manfred@colorfullife.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: <stable@vger.kernel.org> 	[3.11]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ipc/sem.c: update sem_otime for all operations · 0e8c6656
      Committed by Manfred Spraul
      In commit 0a2b9d4c ("ipc/sem.c: move wake_up_process out of the
      spinlock section"), the update of the semaphore's sem_otime (last semop
      time) was moved to one central position (do_smart_update).
      
      But since do_smart_update() is only called for operations that modify
      the array, wait-for-zero semops no longer update sem_otime.
      
      The fix is simple:
      Non-altering operations must update sem_otime too.
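A minimal model of the corrected rule, with hypothetical names (sem_model, semop_model) and a single semaphore: every operation that completes records the time, whether it changed semval or only waited for zero. Blocking, undo and multi-semaphore arrays are left out of this sketch.

```c
#include <stdbool.h>
#include <time.h>

/* One semaphore with its last-operation time (hypothetical model). */
struct sem_model {
    int semval;
    time_t sem_otime;
};

/* Try one semop-style operation; sem_op == 0 means wait-for-zero.
 * Returns true if the operation completed, false if it would block. */
bool semop_model(struct sem_model *s, int sem_op, time_t now)
{
    if (sem_op == 0) {
        if (s->semval != 0)
            return false; /* would block until semval reaches 0 */
    } else {
        if (s->semval + sem_op < 0)
            return false; /* would block: cannot decrement below 0 */
        s->semval += sem_op;
    }
    /* The rule restored by the fix: every completed operation updates
     * sem_otime, not only those that altered semval. */
    s->sem_otime = now;
    return true;
}
```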
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
      Reported-by: Jia He <jiakernel@gmail.com>
      Tested-by: Jia He <jiakernel@gmail.com>
      Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hwpoison: fix the lack of one reference count against poisoned page · fb31ba30
      Committed by Wanpeng Li
      When hwpoison_filter is not enabled, hwpoison_inject takes one reference
      count too few on the poisoned page, so the hwpoison handler reports that
      -1 users still reference the page, while the number should be 0 apart
      from the one reference the poison handler holds after a successful
      unmap.  This patch fixes it by holding one reference count on the
      poisoned page for hwpoison_inject both with and without hwpoison_filter
      enabled.
      
      Before patch:
      
      [   71.902112] Injecting memory failure at pfn 224706
      [   71.902137] MCE 0x224706: dirty LRU page recovery: Failed
      [   71.902138] MCE 0x224706: dirty LRU page still referenced by -1 users
      
      After patch:
      
      [   94.710860] Injecting memory failure at pfn 215b68
      [   94.710885] MCE 0x215b68: dirty LRU page recovery: Recovered
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hwpoison: fix false report on 2nd attempt at page recovery · 2d421acd
      Committed by Wanpeng Li
      If the page is poisoned by software injection with the MF_COUNT_INCREASED
      flag, a misleading "2nd try" is reported during free buddy page
      recovery.
      
      This patch fixes it by reporting the attempt as the first free buddy
      page recovery attempt if MF_COUNT_INCREASED is set.
      
      Before patch:
      
      [  346.332041] Injecting memory failure at pfn 200010
      [  346.332189] MCE 0x200010: free buddy, 2nd try page recovery: Delayed
      
      After patch:
      
      [  297.742600] Injecting memory failure at pfn 200010
      [  297.742941] MCE 0x200010: free buddy page recovery: Delayed
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hwpoison: fix test for a transparent huge page · e76d30e2
      Committed by Wanpeng Li
      PageTransHuge() can't guarantee that the page is a transparent huge page,
      since it returns true for both transparent huge pages and hugetlbfs pages.
      
      This patch fixes it by also checking that the page is not a hugetlbfs page.
      
      Before patch:
      
      [  121.571128] Injecting memory failure at pfn 23a200
      [  121.571141] MCE 0x23a200: huge page recovery: Delayed
      [  140.355100] MCE: Memory failure is now running on 0x23a200
      
      After patch:
      
      [   94.290793] Injecting memory failure at pfn 23a000
      [   94.290800] MCE 0x23a000: huge page recovery: Delayed
      [  105.722303] MCE: Software-unpoisoned page 0x23a000
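The corrected predicate can be sketched with a hypothetical two-field page descriptor (page_model is not a kernel type). In the kernel, the hugetlbfs test is usually expressed with PageHuge(); here a plain bool stands in for it, and trans_huge_like() mimics the "true for any huge page" behaviour described above.

```c
#include <stdbool.h>

/* Hypothetical page descriptor with only the two properties that matter. */
struct page_model {
    bool compound;  /* head of a huge page, transparent or hugetlbfs */
    bool hugetlbfs; /* page belongs to hugetlbfs */
};

/* Behaves like PageTransHuge() as described: true for any huge page head. */
static bool trans_huge_like(const struct page_model *p)
{
    return p->compound;
}

/* The fixed test: a transparent huge page is huge but not hugetlbfs. */
bool is_transparent_huge(const struct page_model *p)
{
    return trans_huge_like(p) && !p->hugetlbfs;
}
```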
      Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hwpoison: fix traversal of hugetlbfs pages to avoid printk flood · 20cb6cab
      Committed by Wanpeng Li
      madvise_hwpoison doesn't check whether the page is a small page or a huge
      page and unconditionally traverses the range at small-page granularity,
      which results in a printk flood of "MCE xxx: already hardware poisoned"
      if the page is a huge page.
      
      This patch fixes it by using compound_order(compound_head(page)) to size
      each step of the iteration over huge pages.
      
      Testcase:
      
      #define _GNU_SOURCE
      #include <stdlib.h>
      #include <stdio.h>
      #include <sys/mman.h>
      #include <unistd.h>
      #include <fcntl.h>
      #include <sys/types.h>
      #include <errno.h>
      
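      #define PAGES_TO_TEST 3
      #define PAGE_SIZE	(4096 * 512)	/* one 2 MiB huge page */
      
      int main(void)
      {
      	char *mem;
      
      	/* Needs reserved hugetlb pages; MADV_HWPOISON also needs
      	 * CAP_SYS_ADMIN and a kernel built with CONFIG_MEMORY_FAILURE. */
      	mem = mmap(NULL, PAGES_TO_TEST * PAGE_SIZE,
      			PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
      	if (mem == MAP_FAILED) {
      		perror("mmap");
      		return -1;
      	}
      
      	if (madvise(mem, PAGES_TO_TEST * PAGE_SIZE, MADV_HWPOISON) == -1)
      		return -1;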
      
      	munmap(mem, PAGES_TO_TEST * PAGE_SIZE);
      
      	return 0;
      }
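The changed iteration can be sketched as a loop that advances by the size of the compound page it just handled, in the spirit of compound_order(compound_head(page)). The order_of callback is a hypothetical stand-in for looking up the page at an address; the pre-fix behaviour corresponds to always stepping by order 0. For the testcase's three 2 MiB pages (order 9 on 4 KiB base pages) the loop makes 3 poison calls instead of 1536.

```c
#include <stddef.h>

#define BASE_PAGE_SIZE 4096UL

/* Hypothetical lookup: compound order of the page at addr (0 = small page). */
typedef unsigned int (*order_fn)(unsigned long addr);

/* Count the poison calls a madvise_hwpoison-like walk makes over [start, end). */
size_t count_poison_calls(unsigned long start, unsigned long end, order_fn order_of)
{
    size_t calls = 0;
    unsigned long addr = start;

    while (addr < end) {
        calls++; /* one poison call per page visited */
        /* Advance past the whole (possibly compound) page, not just 4 KiB. */
        addr += BASE_PAGE_SIZE << order_of(addr);
    }
    return calls;
}

/* Example lookups: every page small, or every page a 2 MiB huge page. */
unsigned int all_small(unsigned long addr) { (void)addr; return 0; }
unsigned int all_2mib(unsigned long addr) { (void)addr; return 9; }
```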
      Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • block: change config option name for cmdline partition parsing · 080506ad
      Committed by Paul Gortmaker
      Recently commit bab55417 ("block: support embedded device command
      line partition") introduced CONFIG_CMDLINE_PARSER.  However, that name
      is too generic and sounds like it enables/disables generic kernel boot
      arg processing, when it really is block specific.
      
      Before this option becomes a part of a full/final release, add the BLK_
      prefix to it so that it is clear in absence of any other context that it
      is block specific.
      
      In addition, fix up the following less critical items:
       - help text was not really at all helpful.
       - index file for Documentation was not updated
       - add the new arg to Documentation/kernel-parameters.txt
       - clarify wording in source comments
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Cai Zhiyong <caizhiyong@huawei.com>
      Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/mlock.c: prevent walking off the end of a pagetable in no-pmd configuration · eadb41ae
      Committed by Vlastimil Babka
      The function __munlock_pagevec_fill() introduced in commit 7a8010cd
      ("mm: munlock: manual pte walk in fast path instead of
      follow_page_mask()") uses pmd_addr_end() for restricting its operation
      within current page table.
      
      This is insufficient on architectures/configurations where pmd is folded
      and pmd_addr_end() just returns the end of the full range to be walked.
      In this case, it allows pte++ to walk off the end of a page table
      resulting in unpredictable behaviour.
      
      This patch fixes the function by using pgd_addr_end() and pud_addr_end()
      before pmd_addr_end(), which will yield correct page table boundary on
      all configurations.  This is similar to what existing page walkers do
      when walking each level of the page table.
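The boundary arithmetic can be sketched in userspace. level_addr_end() follows the shape of the kernel's *_addr_end() helpers (next size-aligned boundary, clamped to end); everything else is a hypothetical model of a 2-level layout such as non-PAE i386, assumed here to cover 4 MiB per page table at the pgd level, with the pmd (and pud) folded.

```c
#include <stdbool.h>

/* Next size-aligned boundary after addr, or end if that comes first.
 * size must be a power of two; the "- 1" comparison also copes with
 * end == 0 meaning the top of the address space. */
unsigned long level_addr_end(unsigned long addr, unsigned long end, unsigned long size)
{
    unsigned long boundary = (addr + size) & ~(size - 1);

    return (boundary - 1 < end - 1) ? boundary : end;
}

/* With the pmd folded, pmd_addr_end() simply returns end. */
static unsigned long pmd_addr_end_model(unsigned long addr, unsigned long end,
                                        bool pmd_folded, unsigned long pmd_size)
{
    return pmd_folded ? end : level_addr_end(addr, end, pmd_size);
}

/* Before the fix: only the pmd level limits the pte walk. */
unsigned long pte_walk_end_buggy(unsigned long addr, unsigned long end,
                                 bool pmd_folded, unsigned long pmd_size)
{
    return pmd_addr_end_model(addr, end, pmd_folded, pmd_size);
}

/* After the fix: clamp at the pgd level first (pud assumed folded here). */
unsigned long pte_walk_end_fixed(unsigned long addr, unsigned long end,
                                 unsigned long pgdir_size,
                                 bool pmd_folded, unsigned long pmd_size)
{
    end = level_addr_end(addr, end, pgdir_size);
    return pmd_addr_end_model(addr, end, pmd_folded, pmd_size);
}
```

For a walk from 0x3ff000 to 0x800000 with 4 MiB pgd entries, the buggy version lets pte++ run to 0x800000, while the fixed one stops at the table boundary 0x400000.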
      
      Additionally, the patch clarifies a comment for the get_locked_pte() call
      in the function.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reported-by: Fengguang Wu <fengguang.wu@intel.com>
      Reviewed-by: Bob Liu <bob.liu@oracle.com>
      Cc: Jörn Engel <joern@logfs.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: avoid reinserting isolated balloon pages into LRU lists · 117aad1e
      Committed by Rafael Aquini
      Isolated balloon pages can wrongly end up in LRU lists when
      migrate_pages() finishes its round without draining all the isolated
      page list.
      
      The same issue can happen when reclaim_clean_pages_from_list() tries to
      reclaim pages from an isolated page list, before migration, in the CMA
      path.  Such a balloon page leak opens a race window against the LRU list
      shrinkers that leads to the following kernel panic:
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
        IP: [<ffffffff810c2625>] shrink_page_list+0x24e/0x897
        PGD 3cda2067 PUD 3d713067 PMD 0
        Oops: 0000 [#1] SMP
        CPU: 0 PID: 340 Comm: kswapd0 Not tainted 3.12.0-rc1-22626-g4367597 #87
        Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
        RIP: shrink_page_list+0x24e/0x897
        RSP: 0000:ffff88003da499b8  EFLAGS: 00010286
        RAX: 0000000000000000 RBX: ffff88003e82bd60 RCX: 00000000000657d5
        RDX: 0000000000000000 RSI: 000000000000031f RDI: ffff88003e82bd40
        RBP: ffff88003da49ab0 R08: 0000000000000001 R09: 0000000081121a45
        R10: ffffffff81121a45 R11: ffff88003c4a9a28 R12: ffff88003e82bd40
        R13: ffff88003da0e800 R14: 0000000000000001 R15: ffff88003da49d58
        FS:  0000000000000000(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00000000067d9000 CR3: 000000003ace5000 CR4: 00000000000407b0
        Call Trace:
          shrink_inactive_list+0x240/0x3de
          shrink_lruvec+0x3e0/0x566
          __shrink_zone+0x94/0x178
          shrink_zone+0x3a/0x82
          balance_pgdat+0x32a/0x4c2
          kswapd+0x2f0/0x372
          kthread+0xa2/0xaa
          ret_from_fork+0x7c/0xb0
        Code: 80 7d 8f 01 48 83 95 68 ff ff ff 00 4c 89 e7 e8 5a 7b 00 00 48 85 c0 49 89 c5 75 08 80 7d 8f 00 74 3e eb 31 48 8b 80 18 01 00 00 <48> 8b 74 0d 48 8b 78 30 be 02 00 00 00 ff d2 eb
        RIP  [<ffffffff810c2625>] shrink_page_list+0x24e/0x897
         RSP <ffff88003da499b8>
        CR2: 0000000000000028
        ---[ end trace 703d2451af6ffbfd ]---
        Kernel panic - not syncing: Fatal exception
      
      This patch fixes the issue by ensuring the proper tests are made in
      putback_movable_pages() and reclaim_clean_pages_from_list() so that
      isolated balloon pages are not wrongly reinserted into LRU lists.
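The routing rule the fix enforces can be modelled as below. Everything here (page_model, putback_model, reclaim_candidates_model) is hypothetical and only illustrates the two decisions: a leftover isolated balloon page goes back to the balloon rather than onto an LRU list, and it is never treated as a clean page to reclaim.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical page: only the property that matters for this fix. */
struct page_model {
    bool isolated_balloon; /* isolated from a balloon driver's page list */
};

struct putback_result {
    size_t to_lru;     /* pages returned to the LRU lists */
    size_t to_balloon; /* pages handed back to the balloon */
};

/* putback_movable_pages()-like: route each leftover isolated page. */
struct putback_result putback_model(const struct page_model *pages, size_t n)
{
    struct putback_result r = { 0, 0 };

    for (size_t i = 0; i < n; i++) {
        if (pages[i].isolated_balloon)
            r.to_balloon++;
        else
            r.to_lru++;
    }
    return r;
}

/* reclaim_clean_pages_from_list()-like: balloon pages are never candidates. */
size_t reclaim_candidates_model(const struct page_model *pages, size_t n)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++)
        if (!pages[i].isolated_balloon)
            count++;
    return count;
}
```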
      
      [akpm@linux-foundation.org: clarify awkward comment text]
      Signed-off-by: Rafael Aquini <aquini@redhat.com>
      Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
      Tested-by: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • arch/parisc/mm/fault.c: fix uninitialized variable usage · 0772dac1
      Committed by Felipe Pena
      The FAULT_FLAG_WRITE flag was being set based on an uninitialized variable.
      
      Fixes a regression added by commit 759496ba ("arch: mm: pass
      userspace fault flag to generic fault handler")
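The ordering issue can be sketched with hypothetical names and flag values (MODEL_VM_WRITE, MODEL_FAULT_FLAG_WRITE, access_type_from_code() are inventions for this sketch, as is treating bit 0 of the trap code as "write"). The point is only that the access type must be computed before it is tested when building the fault flags.

```c
/* Hypothetical flag values for this sketch only. */
#define MODEL_VM_WRITE         0x2u
#define MODEL_FAULT_FLAG_WRITE 0x1u

/* Hypothetical decoding: treat bit 0 of the trap code as a write access. */
static unsigned int access_type_from_code(unsigned long code)
{
    return (code & 1UL) ? MODEL_VM_WRITE : 0u;
}

/* Build the fault flags only after acc_type has been assigned; testing
 * acc_type before this point is the uninitialized use the commit fixes. */
unsigned int fault_flags_model(unsigned long code)
{
    unsigned int flags = 0;
    unsigned int acc_type = access_type_from_code(code);

    if (acc_type & MODEL_VM_WRITE)
        flags |= MODEL_FAULT_FLAG_WRITE;
    return flags;
}
```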
      Signed-off-by: Felipe Pena <felipensp@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Helge Deller <deller@gmx.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • include/asm-generic/vtime.h: avoid zero-length file · 2a156a6b
      Committed by Andrew Morton
      patch(1) can't handle zero-length files - it appears to simply not create
      the file, so my powerpc build fails.
      
      Put something in here to make life easier.
      
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • nilfs2: fix issue with race condition of competition between segments for dirty blocks · 7f42ec39
      Committed by Vyacheslav Dubeyko
      Many NILFS2 users have reported strange file system corruption, for
      example:
      
         NILFS: bad btree node (blocknr=185027): level = 0, flags = 0x0, nchildren = 768
         NILFS error (device sda4): nilfs_bmap_last_key: broken bmap (inode number=11540)
      
      However, such error messages are a consequence of a file system issue
      that takes place earlier.  Fortunately, Jerome Poulin <jeromepoulin@gmail.com>
      and Anton Eliasson <devel@antoneliasson.se> had reported another issue.
      These reports describe a crash of the segctor thread:
      
        BUG: unable to handle kernel paging request at 0000000000004c83
        IP: nilfs_end_page_io+0x12/0xd0 [nilfs2]
      
        Call Trace:
         nilfs_segctor_do_construct+0xf25/0x1b20 [nilfs2]
         nilfs_segctor_construct+0x17b/0x290 [nilfs2]
         nilfs_segctor_thread+0x122/0x3b0 [nilfs2]
         kthread+0xc0/0xd0
         ret_from_fork+0x7c/0xb0
      
      These two issues have the same root cause, which can also lead to a
      third issue: the segctor thread hangs while consuming 100% CPU.
      
      REPRODUCING PATH:
      
      One possible way to reproduce the issue was described by
      Jerome Poulin <jeromepoulin@gmail.com>:
      
      1. init S to get to single user mode.
      2. sysrq+E to make sure only my shell is running
      3. start network-manager to get my wifi connection up
      4. log in as root and launch "screen"
      5. cd /boot/log/nilfs which is an ext3 mount point and can log when NILFS dies.
      6. lscp | xz -9e > lscp.txt.xz
      7. mount my snapshot using mount -o cp=3360839,ro /dev/vgUbuntu/root /mnt/nilfs
      8. start a screen to dump /proc/kmsg to a text file since rsyslog is killed
      9. start a screen and launch strace -f -o find-cat.log -t find /mnt/nilfs -type f -exec cat {} > /dev/null \;
      10. start a screen and launch strace -f -o apt-get.log -t apt-get update
      11. launch the last command again as it did not crash the first time
      12. apt-get crashes
      13. ps aux > ps-aux-crashed.log
      14. sysrq+W
      15. sysrq+E  wait for everything to terminate
      16. sysrq+SUSB
      
      A simpler way to reproduce the issue is to run a kernel compilation and
      "apt-get update" in parallel.
      
      REPRODUCIBILITY:
      
      The issue does not reproduce reliably (60% - 80% of attempts).  The
      right environment is essential.  The critical conditions for
      reproducing it are:
      
      (1) A big file must be modified through mmap().
      
      (2) From time to time during processing, this file's dirty blocks must
          add up to more than several segments (for example, two or three).
      
      (3) There must be intensive background file modification in another
          thread.
      
      INVESTIGATION:
      
      First of all, the crash is caused by an invalid page address:
      
        NILFS [nilfs_segctor_complete_write]:2100 bh->b_count 0, bh->b_blocknr 13895680, bh->b_size 13897727, bh->b_page 0000000000001a82
        NILFS [nilfs_segctor_complete_write]:2101 segbuf->sb_segnum 6783
      
      Moreover, the value of b_page (0x1a82) is 6786 in decimal, which looks
      like a segment number, while the b_blocknr and b_size values look like
      block numbers.  So the buffer_head pointer points to an invalid address.
      
      A detailed investigation revealed the following picture:
      
        [-----------------------------SEGMENT 6783-------------------------------]
        NILFS [nilfs_segctor_do_construct]:2310 nilfs_segctor_begin_construction
        NILFS [nilfs_segctor_do_construct]:2321 nilfs_segctor_collect
        NILFS [nilfs_segctor_do_construct]:2336 nilfs_segctor_assign
        NILFS [nilfs_segctor_do_construct]:2367 nilfs_segctor_update_segusage
        NILFS [nilfs_segctor_do_construct]:2371 nilfs_segctor_prepare_write
        NILFS [nilfs_segctor_do_construct]:2376 nilfs_add_checksums_on_logs
        NILFS [nilfs_segctor_do_construct]:2381 nilfs_segctor_write
        NILFS [nilfs_segbuf_submit_bio]:464 bio->bi_sector 111149024, segbuf->sb_segnum 6783
      
        [-----------------------------SEGMENT 6784-------------------------------]
        NILFS [nilfs_segctor_do_construct]:2310 nilfs_segctor_begin_construction
        NILFS [nilfs_segctor_do_construct]:2321 nilfs_segctor_collect
        NILFS [nilfs_lookup_dirty_data_buffers]:782 bh->b_count 1, bh->b_page ffffea000709b000, page->index 0, i_ino 1033103, i_size 25165824
        NILFS [nilfs_lookup_dirty_data_buffers]:783 bh->b_assoc_buffers.next ffff8802174a6798, bh->b_assoc_buffers.prev ffff880221cffee8
        NILFS [nilfs_segctor_do_construct]:2336 nilfs_segctor_assign
        NILFS [nilfs_segctor_do_construct]:2367 nilfs_segctor_update_segusage
        NILFS [nilfs_segctor_do_construct]:2371 nilfs_segctor_prepare_write
        NILFS [nilfs_segctor_do_construct]:2376 nilfs_add_checksums_on_logs
        NILFS [nilfs_segctor_do_construct]:2381 nilfs_segctor_write
        NILFS [nilfs_segbuf_submit_bh]:575 bh->b_count 1, bh->b_page ffffea000709b000, page->index 0, i_ino 1033103, i_size 25165824
        NILFS [nilfs_segbuf_submit_bh]:576 segbuf->sb_segnum 6784
        NILFS [nilfs_segbuf_submit_bh]:577 bh->b_assoc_buffers.next ffff880218a0d5f8, bh->b_assoc_buffers.prev ffff880218bcdf50
        NILFS [nilfs_segbuf_submit_bio]:464 bio->bi_sector 111150080, segbuf->sb_segnum 6784, segbuf->sb_nbio 0
        [----------] ditto
        NILFS [nilfs_segbuf_submit_bio]:464 bio->bi_sector 111164416, segbuf->sb_segnum 6784, segbuf->sb_nbio 15
      
        [-----------------------------SEGMENT 6785-------------------------------]
        NILFS [nilfs_segctor_do_construct]:2310 nilfs_segctor_begin_construction
        NILFS [nilfs_segctor_do_construct]:2321 nilfs_segctor_collect
        NILFS [nilfs_lookup_dirty_data_buffers]:782 bh->b_count 2, bh->b_page ffffea000709b000, page->index 0, i_ino 1033103, i_size 25165824
        NILFS [nilfs_lookup_dirty_data_buffers]:783 bh->b_assoc_buffers.next ffff880219277e80, bh->b_assoc_buffers.prev ffff880221cffc88
        NILFS [nilfs_segctor_do_construct]:2367 nilfs_segctor_update_segusage
        NILFS [nilfs_segctor_do_construct]:2371 nilfs_segctor_prepare_write
        NILFS [nilfs_segctor_do_construct]:2376 nilfs_add_checksums_on_logs
        NILFS [nilfs_segctor_do_construct]:2381 nilfs_segctor_write
        NILFS [nilfs_segbuf_submit_bh]:575 bh->b_count 2, bh->b_page ffffea000709b000, page->index 0, i_ino 1033103, i_size 25165824
        NILFS [nilfs_segbuf_submit_bh]:576 segbuf->sb_segnum 6785
        NILFS [nilfs_segbuf_submit_bh]:577 bh->b_assoc_buffers.next ffff880218a0d5f8, bh->b_assoc_buffers.prev ffff880222cc7ee8
        NILFS [nilfs_segbuf_submit_bio]:464 bio->bi_sector 111165440, segbuf->sb_segnum 6785, segbuf->sb_nbio 0
        [----------] ditto
        NILFS [nilfs_segbuf_submit_bio]:464 bio->bi_sector 111177728, segbuf->sb_segnum 6785, segbuf->sb_nbio 12
      
        NILFS [nilfs_segctor_do_construct]:2399 nilfs_segctor_wait
        NILFS [nilfs_segbuf_wait]:676 segbuf->sb_segnum 6783
        NILFS [nilfs_segbuf_wait]:676 segbuf->sb_segnum 6784
        NILFS [nilfs_segbuf_wait]:676 segbuf->sb_segnum 6785
      
        NILFS [nilfs_segctor_complete_write]:2100 bh->b_count 0, bh->b_blocknr 13895680, bh->b_size 13897727, bh->b_page 0000000000001a82
      
        BUG: unable to handle kernel paging request at 0000000000001a82
        IP: [<ffffffffa024d0f2>] nilfs_end_page_io+0x12/0xd0 [nilfs2]
      
      Normally, for every segment we collect the dirty files in a list.  Then
      the dirty blocks of every dirty file are gathered, prepared for write
      and submitted by calls to nilfs_segbuf_submit_bh().  Finally, the
      complete-write phase takes place after nilfs_end_bio_write() has been
      called by the block layer.  Buffers/pages are marked clean in this final
      phase and the processed files are removed from the list of dirty files.
      
      Here we had three prepare_write and submit_bio phases before the
      segbuf_wait and complete_write phases.  Moreover, the segments compete
      with each other for dirty blocks, because on every iteration of segment
      processing the dirty buffer_heads are added to several payload_buffers
      lists:
      
        [SEGMENT 6784]: bh->b_assoc_buffers.next ffff880218a0d5f8, bh->b_assoc_buffers.prev ffff880218bcdf50
        [SEGMENT 6785]: bh->b_assoc_buffers.next ffff880218a0d5f8, bh->b_assoc_buffers.prev ffff880222cc7ee8
      
      The next pointer is the same but the prev pointer has changed.  This
      means that the buffer_head has its next pointer from one list but its
      prev pointer from another.  Such a modification can happen several
      times and, finally, can result in various issues: (1) segctor hanging,
      (2) segctor crashing, (3) file system metadata corruption.
      
      FIX:
      This patch adds:
      
      (1) setting of BH_Async_Write flag in nilfs_segctor_prepare_write()
          for every processed dirty block;
      
      (2) checking of BH_Async_Write flag in
          nilfs_lookup_dirty_data_buffers() and
          nilfs_lookup_dirty_node_buffers();
      
      (3) clearing of BH_Async_Write flag in nilfs_segctor_complete_write(),
          nilfs_abort_logs(), nilfs_forget_buffer(), nilfs_clear_dirty_page().
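The protocol established by the three changes can be modelled as follows. bh_model and the three functions are hypothetical stand-ins for buffer_head, nilfs_lookup_dirty_*_buffers(), nilfs_segctor_prepare_write() and nilfs_segctor_complete_write(), with a bool playing the role of BH_Async_Write. A buffer already claimed by one in-flight segment is skipped when the next segment collects dirty blocks, so it can no longer end up on two payload lists at once.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical buffer: dirty state plus a stand-in for BH_Async_Write. */
struct bh_model {
    bool dirty;
    bool async_write; /* set while the buffer belongs to an in-flight segment */
};

/* Change (2): collect dirty buffers, skipping ones already being written. */
size_t lookup_dirty_buffers(struct bh_model *bhs, size_t n, struct bh_model **out)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++)
        if (bhs[i].dirty && !bhs[i].async_write)
            out[count++] = &bhs[i];
    return count;
}

/* Change (1): mark every collected buffer as in flight. */
void prepare_write(struct bh_model **list, size_t count)
{
    for (size_t i = 0; i < count; i++)
        list[i]->async_write = true;
}

/* Change (3): once the write completes, the buffer is clean and free again. */
void complete_write(struct bh_model **list, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        list[i]->dirty = false;
        list[i]->async_write = false;
    }
}
```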
      Reported-by: Jerome Poulin <jeromepoulin@gmail.com>
      Reported-by: Anton Eliasson <devel@antoneliasson.se>
      Cc: Paul Fertser <fercerpav@gmail.com>
      Cc: ARAI Shun-ichi <hermes@ceres.dti.ne.jp>
      Cc: Piotr Szymaniak <szarpaj@grubelek.pl>
      Cc: Juan Barry Manuel Canham <Linux@riotingpacifist.net>
      Cc: Zahid Chowdhury <zahid.chowdhury@starsolutions.com>
      Cc: Elmer Zhang <freeboy6716@gmail.com>
      Cc: Kenneth Langga <klangga@gmail.com>
      Signed-off-by: Vyacheslav Dubeyko <slava@dubeyko.com>
      Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Documentation/kernel-parameters.txt: replace kernelcore with Movable · 675217fd
      Committed by Weiping Pan
      Han Pingtian found a typo in the "kernelcore=" entry of
      Documentation/kernel-parameters.txt: "kernelcore" should be replaced
      with "Movable" there.
      Signed-off-by: Weiping Pan <wpan@redhat.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/bounce.c: fix a regression where MS_SNAP_STABLE (stable pages snapshotting) was ignored · 83b2944f
      Committed by Darrick J. Wong
      The "force" parameter in __blk_queue_bounce was being ignored, which
      means that stable page snapshots are not always happening (on ext3).
      This of course leads to DIF disks reporting checksum errors, so fix this
      regression.
      
      The regression was introduced in commit 6bc454d1 ("bounce: Refactor
      __blk_queue_bounce to not use bi_io_vec")
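The decision the fix restores can be sketched as "bounce a page if it lies above the queue's bounce limit, or if the caller forces it". pages_to_bounce() and the above_limit array are hypothetical; the model only counts decisions and does not allocate or copy pages.

```c
#include <stdbool.h>
#include <stddef.h>

/* Count how many pages of a request get bounced. above_limit[i] says whether
 * page i lies above the queue's bounce limit (hypothetical model). */
size_t pages_to_bounce(const bool *above_limit, size_t n, bool force)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        /* With force set (stable-page snapshotting) every page is copied;
         * the regression behaved as if force were always false. */
        if (above_limit[i] || force)
            count++;
    }
    return count;
}
```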
      Reported-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: <stable@vger.kernel.org>	[3.10+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kernel/kmod.c: check for NULL in call_usermodehelper_exec() · 4c1c7be9
      Committed by Tetsuo Handa
      If /proc/sys/kernel/core_pattern contains only "|", a NULL pointer
      dereference happens upon core dump because argv_split("") returns
      argv[0] == NULL.
      
      This bug was once fixed by commit 264b83c0 ("usermodehelper: check
      subprocess_info->path != NULL") but was mistakenly reintroduced by commit
      7f57cfa4 ("usermodehelper: kill the sub_info->path[0] check").
      
      This bug seems to have existed since 2.6.19 (the version in which core
      dump to pipe was added).  Depending on kernel version and config, other
      side effects might follow immediately after this oops (e.g. a kernel
      panic with 2.6.32-358.18.1.el6).
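      A userspace sketch of the failure mode, assuming that "|" alone in core_pattern leaves an empty helper command. pipe_helper_path() and call_helper_model() are hypothetical names, not kernel functions; the second one shows the kind of NULL check the fix restores, returning an error (-EINVAL in this sketch) instead of dereferencing a missing argv[0].

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Return the first word after the leading '|' of a core_pattern, copied
 * into buf, or NULL when there is none (the argv[0] == NULL case) or it
 * does not fit in buf. Hypothetical helper, not kernel code. */
static const char *pipe_helper_path(const char *core_pattern, char *buf,
                                    size_t len)
{
    assert(core_pattern != NULL && core_pattern[0] == '|');

    const char *p = core_pattern + 1;
    p += strspn(p, " \t");          /* skip leading blanks */
    size_t n = strcspn(p, " \t");   /* length of the first word */

    if (n == 0 || n >= len)
        return NULL;
    memcpy(buf, p, n);
    buf[n] = '\0';
    return buf;
}

/* The restored check in miniature: refuse a helper with no path
 * instead of dereferencing NULL later on. */
static int call_helper_model(const char *path)
{
    if (path == NULL)
        return -EINVAL;
    return 0;   /* the real code would go on to spawn the helper here */
}
```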
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      ipc/sem.c: synchronize the proc interface · d8c63376
      Manfred Spraul committed
      The proc interface is not aware of sem_lock(); it instead calls
      ipc_lock_object() directly.  This means that simple semop() operations
      can run in parallel with the proc interface.  Right now, this is not
      critical, because the implementation doesn't do anything that requires
      proper synchronization.
      
      But it is dangerous and therefore should be fixed.
      Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
      Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      ipc/sem.c: optimize sem_lock() · 6d07b68c
      Manfred Spraul committed
      Operations that need access to the whole array must guarantee that there
      are no simple operations ongoing.  Right now this is achieved by
      spin_unlock_wait(sem->lock) on all semaphores.
      
      If complex_count is nonzero, then this spin_unlock_wait() is not
      necessary, because it was already performed in the past by the thread
      that increased complex_count and even though sem_perm.lock was dropped
      in between, no simple operation could have started, because simple
      operations cannot start when complex_count is non-zero.
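      The shortcut can be sketched as a single-threaded userspace model. struct sem_array_model and wait_for_simple_ops() are hypothetical; each per-semaphore lock is reduced to an atomic "held" flag, and the caller is assumed to already hold the global lock. It only shows where the per-semaphore waits are skipped, not the real sem_lock().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NSEMS_MAX 8

/* Hypothetical model: each per-semaphore lock is just a "held" flag;
 * the global sem_perm.lock is assumed to be held by the caller. */
struct sem_array_model {
    atomic_bool sem_lock_held[NSEMS_MAX];
    int nsems;
    int complex_count;      /* sleeping complex operations */
};

/* Userspace stand-in for spin_unlock_wait(): spin until the lock is free. */
static void unlock_wait(atomic_bool *held)
{
    while (atomic_load(held))
        ;
}

/* Whole-array path, entered with the global lock held. Returns how many
 * per-semaphore locks had to be waited on. */
static int wait_for_simple_ops(struct sem_array_model *sma)
{
    assert(sma->nsems >= 0 && sma->nsems <= NSEMS_MAX);

    /* Whoever raised complex_count already waited for every simple op,
     * and none can have started since, so there is nothing to wait for. */
    if (sma->complex_count)
        return 0;

    for (int i = 0; i < sma->nsems; i++)
        unlock_wait(&sma->sem_lock_held[i]);
    return sma->nsems;
}
```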
      Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
      Cc: Mike Galbraith <bitbucket@online.de>
      Cc: Rik van Riel <riel@redhat.com>
      Reviewed-by: Davidlohr Bueso <davidlohr@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      ipc/sem.c: fix race in sem_lock() · 5e9d5275
      Manfred Spraul committed
      The exclusion of complex operations in sem_lock() is insufficient: after
      acquiring the per-semaphore lock, a simple op must first check that
      sem_perm.lock is not locked and only after that test check
      complex_count.  The current code does it the other way around - and that
      creates a race.  Details are below.
      
      The patch is a complete rewrite of sem_lock(), based in part on the code
      from Mike Galbraith.  It removes all gotos and all loops and thus the
      risk of livelocks.
      
      I have tested the patch (together with the next one) on my i3 laptop and
      it didn't cause any problems.
      
      The bug is probably also present in 3.10 and 3.11, but for these kernels
      it might be simpler just to move the test of sma->complex_count after
      the spin_is_locked() test.
      
      Details of the bug:
      
      Assume:
       - sma->complex_count = 0.
       - Thread 1: semtimedop(complex op that must sleep)
       - Thread 2: semtimedop(simple op).
      
      Pseudo-Trace:
      
      Thread 1: sem_lock(): acquire sem_perm.lock
      Thread 1: sem_lock(): check for ongoing simple ops
      			Nothing ongoing, thread 2 is still before sem_lock().
      Thread 1: try_atomic_semop()
      	<<< preempted.
      
      Thread 2: sem_lock():
              static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
                                            int nsops)
              {
                      int locknum;
               again:
                      if (nsops == 1 && !sma->complex_count) {
                              struct sem *sem = sma->sem_base + sops->sem_num;
      
                              /* Lock just the semaphore we are interested in. */
                              spin_lock(&sem->lock);
      
                              /*
                               * If sma->complex_count was set while we were spinning,
                               * we may need to look at things we did not lock here.
                               */
                              if (unlikely(sma->complex_count)) {
                                      spin_unlock(&sem->lock);
                                      goto lock_array;
                              }
              <<<<<<<<<
      	<<< complex_count is still 0.
      	<<<
              <<< Here it is preempted
              <<<<<<<<<
      
      Thread 1: try_atomic_semop() returns, notices that it must sleep.
      Thread 1: increases sma->complex_count.
      Thread 1: drops sem_perm.lock
      Thread 2:
                      /*
                       * Another process is holding the global lock on the
                       * sem_array; we cannot enter our critical section,
                       * but have to wait for the global lock to be released.
                       */
                      if (unlikely(spin_is_locked(&sma->sem_perm.lock))) {
                              spin_unlock(&sem->lock);
                              spin_unlock_wait(&sma->sem_perm.lock);
                              goto again;
                      }
      	<<< sem_perm.lock already dropped, thus no "goto again;"
      
                      locknum = sops->sem_num;
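      The ordering the patch requires for the simple-op fast path (take the per-semaphore lock, then check that sem_perm.lock is free, and only after that read complex_count) can be sketched in userspace C11. This is a hypothetical single-file model built on atomic_bool spinlocks, not the rewritten sem_lock(); the default sequentially consistent C11 atomics are assumed to give the ordering that kernel code would have to request with explicit barriers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NSEMS 4

/* Minimal userspace spinlock built on a C11 atomic_bool. */
typedef struct {
    atomic_bool held;
} spinlock_model;

static void spin_lock_m(spinlock_model *l)
{
    bool expected = false;

    /* Retry until we flip held from false to true. */
    while (!atomic_compare_exchange_weak(&l->held, &expected, true))
        expected = false;
}

static void spin_unlock_m(spinlock_model *l)
{
    atomic_store(&l->held, false);
}

static bool spin_is_locked_m(spinlock_model *l)
{
    return atomic_load(&l->held);
}

/* Hypothetical stand-in for the relevant parts of struct sem_array. */
struct sem_array_model {
    spinlock_model perm_lock;           /* plays the role of sem_perm.lock */
    spinlock_model sem_lock[NSEMS];     /* per-semaphore locks */
    atomic_int complex_count;
};

/* Simple-op fast path in the order the fix requires. Returns true with
 * sem_lock[num] held; returns false (lock dropped) when the caller has
 * to fall back to the global lock. */
static bool try_simple_lock(struct sem_array_model *sma, int num)
{
    assert(num >= 0 && num < NSEMS);
    spin_lock_m(&sma->sem_lock[num]);

    /* First: nobody may hold the global lock. Only then: complex_count. */
    if (!spin_is_locked_m(&sma->perm_lock) &&
        atomic_load(&sma->complex_count) == 0)
        return true;

    spin_unlock_m(&sma->sem_lock[num]);
    return false;
}
```

      In the trace above, Thread 2 checked complex_count first and sem_perm.lock second; with the reversed order, a complex op that bumps complex_count while holding the global lock is always noticed by one of the two checks.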
      Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
      Cc: Mike Galbraith <bitbucket@online.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
      Cc: <stable@vger.kernel.org>	[3.10+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • D
      mm/compaction.c: periodically schedule when freeing pages · f6ea3adb
      David Rientjes committed
      We've been getting warnings about an excessive amount of time spent
      allocating pages for migration during memory compaction without
      scheduling.  isolate_freepages_block() already periodically checks for
      contended locks or the need to schedule, but isolate_freepages() never
      does.
      
      When a zone is massively long and no suitable targets can be found, this
      iteration can be quite expensive without ever doing cond_resched().
      
      Check periodically for the need to reschedule while the compaction free
      scanner iterates.
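      A userspace sketch of the idea, not the patch itself: a long scan loop that periodically offers up the CPU. RESCHED_EVERY, scan_pageblock() and free_scanner_model() are hypothetical, the interval is an arbitrary choice for the sketch, and sched_yield() is only a rough stand-in for cond_resched(), since it always yields rather than only when a reschedule is pending.

```c
#include <assert.h>
#include <sched.h>

#define RESCHED_EVERY 32UL   /* assumed interval, in pageblocks */

/* Hypothetical per-pageblock work (a no-op here), standing in for the
 * free scanner examining one pageblock and finding nothing suitable. */
static void scan_pageblock(unsigned long block)
{
    (void)block;
}

/* Walk nr_blocks pageblocks, giving up the CPU every RESCHED_EVERY
 * blocks. Returns how many times the CPU was offered up. */
static unsigned long free_scanner_model(unsigned long nr_blocks)
{
    unsigned long yields = 0;

    for (unsigned long block = 0; block < nr_blocks; block++) {
        if (block != 0 && block % RESCHED_EVERY == 0) {
            sched_yield();
            yields++;
        }
        scan_pageblock(block);
    }
    return yields;
}
```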
      Signed-off-by: David Rientjes <rientjes@google.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • D
      fs/binfmt_elf.c: prevent a coredump with a large vm_map_count from Oopsing · 72023656
      Dan Aloni committed
      A high setting of max_map_count and a process core-dumping with a large
      enough vm_map_count could result in an NT_FILE note not being written,
      and the kernel crashing immediately afterwards because it assumed
      otherwise.
      
      Reproduction of the oops-causing bug described here:
      
          https://lkml.org/lkml/2013/8/30/50
      
      The issue originated in commit 2aa362c4 ("coredump: extend core dump
      note section to contain file names of mapped file") from Oct 4, 2012.
      
      This patch makes that section optional in that case: fill_files_note()
      now signals the error, and the info struct in elf_core_dump() is
      zero-initialized so that we can check whether the optional note was
      written.
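      The "optional note" pattern can be sketched in userspace C: the note-info struct starts zeroed, the builder reports an error and leaves the note empty, and the writer checks the data pointer before emitting anything. All names here (struct note_model, fill_files_note_model(), nr_notes_to_write(), FILES_NOTE_MAX_ENTRIES) are hypothetical, and -EINVAL is simply the error this sketch chooses.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

#define FILES_NOTE_MAX_ENTRIES 4   /* hypothetical cap on the note's size */

/* Hypothetical, cut-down note descriptor: data == NULL means "not built". */
struct note_model {
    void *data;
    size_t datasz;
};

struct note_info_model {
    struct note_model files;    /* the optional NT_FILE note */
};

/* Build the file-mapping note, or report an error and leave it empty. */
static int fill_files_note_model(struct note_model *note, size_t nr_mappings)
{
    assert(note != NULL && nr_mappings > 0);
    if (nr_mappings > FILES_NOTE_MAX_ENTRIES)
        return -EINVAL;
    note->data = calloc(nr_mappings, sizeof(unsigned long));
    if (note->data == NULL)
        return -ENOMEM;
    note->datasz = nr_mappings * sizeof(unsigned long);
    return 0;
}

/* Because the info struct starts zeroed, a NULL data pointer reliably
 * means the note was skipped, so the writer simply leaves it out. */
static int nr_notes_to_write(const struct note_info_model *info)
{
    return info->files.data != NULL ? 1 : 0;
}
```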
      
      [akpm@linux-foundation.org: avoid abusing E2BIG, remove a couple of not-really-needed local variables]
      [akpm@linux-foundation.org: fix sparse warning]
      Signed-off-by: Dan Aloni <alonid@stratoscale.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Denys Vlasenko <vda.linux@googlemail.com>
      Reported-by: Martin MOKREJS <mmokrejs@gmail.com>
      Tested-by: Martin MOKREJS <mmokrejs@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • J
      revert "mm/memory-hotplug: fix lowmem count overflow when offline pages" · 7393dc45
      Joonyoung Shim committed
      This reverts commit cea27eb2 ("mm/memory-hotplug: fix lowmem count
      overflow when offline pages").
      
      The bug fixed by commit cea27eb2 was fixed in another way by commit
      3dcc0571 ("mm: correctly update zone->managed_pages").  That commit
      enhances memory_hotplug.c to adjust totalhigh_pages when hot-removing
      memory; for details please refer to:
      
        http://marc.info/?l=linux-mm&m=136957578620221&w=2
      
      As a result, commit cea27eb2 now causes totalhigh_pages to be
      decremented twice, hence the revert.
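      The double accounting can be shown with a tiny counter model. struct highmem_model and offline_highmem_model() are hypothetical; the first decrement stands for the adjustment the commit message attributes to 3dcc0571, and the with_cea27eb2 flag adds the second, duplicate decrement that the revert removes.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical counter standing in for the global totalhigh_pages. */
struct highmem_model {
    unsigned long totalhigh_pages;
};

/* Offline nr highmem pages. The first decrement models the adjustment
 * from commit 3dcc0571; with_cea27eb2 adds the duplicate decrement. */
static void offline_highmem_model(struct highmem_model *m, unsigned long nr,
                                  bool with_cea27eb2)
{
    assert(m->totalhigh_pages >= nr);
    m->totalhigh_pages -= nr;

    if (with_cea27eb2) {
        assert(m->totalhigh_pages >= nr);
        m->totalhigh_pages -= nr;
    }
}
```

      Offlining 100 of 1000 highmem pages should leave 900; with both decrements applied, the counter drops to 800.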
      Signed-off-by: Joonyoung Shim <jy0922.shim@samsung.com>
      Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Jiang Liu <liuj97@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • L
      Merge tag 'regulator-v3.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator · df532d54
      Linus Torvalds committed
      Pull regulator fixes from Mark Brown:
       "Quite a few fixes here, mostly small driver specific ones.
      
        The stand out thing is a fix for errors generating the documentation
        from Randy Dunlap, otherwise unless you're using the driver in
        question there should be no impact"
      
      * tag 'regulator-v3.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
        regulator: ti-abb: Fix bias voltage glitch in transition to bypass mode
        regulator: wm831x-ldo: Fix max_uV for gp_ldo and aldo linear range settings
        regulator: wm8350: correct the max_uV of LDO
        regulator: fix fatal kernel-doc error
        regulator: palmas: Remove wrong comment for the equation calculating num_voltages
        regulator: da9063: Fix PTR_ERR/ERR_PTR mismatch
        regulator: palmas: configure enable time for LDOs
        regulator: palmas: fix the n_voltages for smps to 122
    • L
      Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security · b822cb18
      Linus Torvalds committed
      Pull apparmor fixes from James Morris:
       "Bugfixes for the Apparmor code for regressions introduced in the 3.12
        pull request"
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
        apparmor: fix suspicious RCU usage warning in policy.c/policy.h
        apparmor: Use shash crypto API interface for profile hashes
    • L
      Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs · cbb16bec
      Linus Torvalds committed
      Pull assorted vfs fixes from Al Viro:
       "A couple of bug fixes + removal of dead code in afs ->d_revalidate()"
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
        afs: dget_parent() can't return a negative dentry
        ocfs2: needs ->d_lock to poke in ->d_parent->d_inode from ->d_revalidate()
        sysv: Add forgotten superblock lock init for v7 fs
    • L
      Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/egtvedt/linux-avr32 · 5c282e85
      Linus Torvalds committed
      Pull AVR32 fixes from Hans-Christian Egtvedt.
      
      Fix build warnings and use the Kbuild infrastructure for generic headers
      rather than doing it by hand.
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/egtvedt/linux-avr32:
        avr32: cast syscall_return to silence compiler warning
        avr32: fix clockevents kernel warning
        avr32: use Kbuild infrastructure to handle the asm-generic headers
    • L
      Merge tag 'for-linus-20130929' of git://github.com/sctscore/official-linux · 8945546d
      Linus Torvalds committed
      Pull S+core fixes from Lennox Wu:
       "These updates refresh the maintainer information, fix some trivial
        errors, and add a function needed for IPv6 support"
      
      * tag 'for-linus-20130929' of git://github.com/sctscore/official-linux:
        Score: Update the information of Score maintainers
        Score: Modify the Makefile of Score, remove -mlong-calls for compiling
        Score: Implement the function csum_ipv6_magic
        Score: The commit is for compiling successfully
    • L
      Merge tag 'arc-fixes-for-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc · 815a4bb1
      Linus Torvalds committed
      Pull ARC Fixes from Vineet Gupta:
       - Handle unaligned access in zero delay loops
       - spinlock livelock fix for SMP systemC model
       - fix 32bit overflow in access_ok
       - better setup of clockevents
      
      * tag 'arc-fixes-for-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc:
        ARC: Use clockevents_config_and_register over clockevents_register_device
        ARC: Workaround spinlock livelock in SMP SystemC simulation
        ARC: Fix 32-bit wrap around in access_ok()
        ARC: Handle zero-overhead-loop in unaligned access handler
  3. 30 Sep, 2013: 8 commits