1. 02 10月, 2013 2 次提交
    • J
      xfs: fix the wrong new_size/rnew_size at xfs_iext_realloc_direct() · 17ec81c1
      Jie Liu 提交于
      At xfs_iext_realloc_direct(), the new_size is changed by adding
      if_bytes if originally the extent records are stored at the inline
      extent buffer, and we have to switch from it to a direct extent
      list for those new allocated extents, this is wrong. e.g,
      
      Create a file with three extents which was showing as following,
      
      xfs_io -f -c "truncate 100m" /xfs/testme
      
      for i in $(seq 0 5 10); do
      	offset=$(($i * $((1 << 20))))
      	xfs_io -c "pwrite $offset 1m" /xfs/testme
      done
      
      Inline
      ------
      irec:	if_bytes	bytes_diff	new_size
      1st	0		16		16
      2nd	16		16		32
      
      Switching
      ---------						rnew_size
      3rd	32		16		48 + 32 = 80	roundup=128
      
      In this case, the desired value of new_size should be 48, and then
      it will be roundup to 64 and be assigned to rnew_size.
      
      However, this issue has been covered by resetting the if_bytes to
      the new_size which is calculated at the begnning of xfs_iext_add()
      before leaving out this function, and in turn make the rnew_size
      correctly again. Hence, this can not be detected via xfstestes.
      
      This patch fix above problem and revise the new_size comments at
      xfs_iext_realloc_direct() to make it more readable.  Also, fix the
      comments while switching from the inline extent buffer to a direct
      extent list to reflect this change.
      Signed-off-by: NJie Liu <jeff.liu@oracle.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      17ec81c1
    • J
      xfs: get rid of count from xfs_iomap_write_allocate() · 0799a3e8
      Jie Liu 提交于
      Get rid of function variable count from xfs_iomap_write_allocate() as
      it is unused.
      
      Additionally, checkpatch warn me of the following for this change:
      WARNING: extern prototypes should be avoided in .h files
      +extern int xfs_iomap_write_allocate(struct xfs_inode *, xfs_off_t,
      
      So this patch also remove all extern function prototypes at xfs_iomap.h
      to suppress it to make this code style in consistent manner in this file.
      Signed-off-by: NJie Liu <jeff.liu@oracle.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      0799a3e8
  2. 01 10月, 2013 4 次提交
    • T
      xfs: Use kmem_free() instead of free() · aaaae980
      Thierry Reding 提交于
      This fixes a build failure caused by calling the free() function which
      does not exist in the Linux kernel.
      Signed-off-by: NThierry Reding <treding@nvidia.com>
      Reviewed-by: NMark Tinguely <tinguely@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      aaaae980
    • T
      xfs: fix memory leak in xlog_recover_add_to_trans · 519ccb81
      tinguely@sgi.com 提交于
      Free the memory in error path of xlog_recover_add_to_trans().
      Normally this memory is freed in recovery pass2, but is leaked
      in the error path.
      Signed-off-by: NMark Tinguely <tinguely@sgi.com>
      Reviewed-by: NEric Sandeen <sandeen@redhat.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      519ccb81
    • D
      xfs: dirent dtype presence is dependent on directory magic numbers · 367993e7
      Dave Chinner 提交于
      The determination of whether a directory entry contains a dtype
      field originally was dependent on the filesystem having CRCs
      enabled. This meant that the format for dtype beign enabled could be
      determined by checking the directory block magic number rather than
      doing a feature bit check. This was useful in that it meant that we
      didn't need to pass a struct xfs_mount around to functions that
      were already supplied with a directory block header.
      
      Unfortunately, the introduction of dtype fields into the v4
      structure via a feature bit meant this "use the directory block
      magic number" method of discriminating the dirent entry sizes is
      broken. Hence we need to convert the places that use magic number
      checks to use feature bit checks so that they work correctly and not
      by chance.
      
      The current code works on v4 filesystems only because the dirent
      size roundup covers the extra byte needed by the dtype field in the
      places where this problem occurs.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBen Myers <bpm@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      367993e7
    • D
      xfs: lockdep needs to know about 3 dquot-deep nesting · f112a049
      Dave Chinner 提交于
      Michael Semon reported that xfs/299 generated this lockdep warning:
      
      =============================================
      [ INFO: possible recursive locking detected ]
      3.12.0-rc2+ #2 Not tainted
      ---------------------------------------------
      touch/21072 is trying to acquire lock:
       (&xfs_dquot_other_class){+.+...}, at: [<c12902fb>] xfs_trans_dqlockedjoin+0x57/0x64
      
      but task is already holding lock:
       (&xfs_dquot_other_class){+.+...}, at: [<c12902fb>] xfs_trans_dqlockedjoin+0x57/0x64
      
      other info that might help us debug this:
       Possible unsafe locking scenario:
      
             CPU0
             ----
        lock(&xfs_dquot_other_class);
        lock(&xfs_dquot_other_class);
      
       *** DEADLOCK ***
      
       May be due to missing lock nesting notation
      
      7 locks held by touch/21072:
       #0:  (sb_writers#10){++++.+}, at: [<c11185b6>] mnt_want_write+0x1e/0x3e
       #1:  (&type->i_mutex_dir_key#4){+.+.+.}, at: [<c11078ee>] do_last+0x245/0xe40
       #2:  (sb_internal#2){++++.+}, at: [<c122c9e0>] xfs_trans_alloc+0x1f/0x35
       #3:  (&(&ip->i_lock)->mr_lock/1){+.+...}, at: [<c126cd1b>] xfs_ilock+0x100/0x1f1
       #4:  (&(&ip->i_lock)->mr_lock){++++-.}, at: [<c126cf52>] xfs_ilock_nowait+0x105/0x22f
       #5:  (&dqp->q_qlock){+.+...}, at: [<c12902fb>] xfs_trans_dqlockedjoin+0x57/0x64
       #6:  (&xfs_dquot_other_class){+.+...}, at: [<c12902fb>] xfs_trans_dqlockedjoin+0x57/0x64
      
      The lockdep annotation for dquot lock nesting only understands
      locking for user and "other" dquots, not user, group and quota
      dquots. Fix the annotations to match the locking heirarchy we now
      have.
      Reported-by: NMichael L. Semon <mlsemon35@gmail.com>
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBen Myers <bpm@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      f112a049
  3. 26 9月, 2013 1 次提交
    • M
      xfs: fix node forward in xfs_node_toosmall · 997def25
      Mark Tinguely 提交于
      Commit f5ea1100 cleans up the disk to host conversions for
      node directory entries, but because a variable is reused in
      xfs_node_toosmall() the next node is not correctly found.
      If the original node is small enough (<= 3/8 of the node size),
      this change may incorrectly cause a node collapse when it should
      not. That will cause an assert in xfstest generic/319:
      
         Assertion failed: first <= last && last < BBTOB(bp->b_length),
         file: /root/newest/xfs/fs/xfs/xfs_trans_buf.c, line: 569
      
      Keep the original node header to get the correct forward node.
      
      (When a node is considered for a merge with a sibling, it overwrites the
       sibling pointers of the original incore nodehdr with the sibling's
       pointers.  This leads to loop considering the original node as a merge
       candidate with itself in the second pass, and so it incorrectly
       determines a merge should occur.)
      Signed-off-by: NMark Tinguely <tinguely@sgi.com>
      Reviewed-by: NBen Myers <bpm@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      
      [v3: added Dave Chinner's (slightly modified) suggestion to the commit header,
      	cleaned up whitespace.  -bpm]
      997def25
  4. 25 9月, 2013 4 次提交
    • D
      xfs: log recovery lsn ordering needs uuid check · 566055d3
      Dave Chinner 提交于
      After a fair number of xfstests runs, xfs/182 started to fail
      regularly with a corrupted directory - a directory read verifier was
      failing after recovery because it found a block with a XARM magic
      number (remote attribute block) rather than a directory data block.
      
      The first time I saw this repeated failure I did /something/ and the
      problem went away, so I was never able to find the underlying
      problem. Test xfs/182 failed again today, and I found the root
      cause before I did /something else/ that made it go away.
      
      Tracing indicated that the block in question was being correctly
      logged, the log was being flushed by sync, but the buffer was not
      being written back before the shutdown occurred. Tracing also
      indicated that log recovery was also reading the block, but then
      never writing it before log recovery invalidated the cache,
      indicating that it was not modified by log recovery.
      
      More detailed analysis of the corpse indicated that the filesystem
      had a uuid of "a4131074-1872-4cac-9323-2229adbcb886" but the XARM
      block had a uuid of "8f32f043-c3c9-e7f8-f947-4e7f989c05d3", which
      indicated it was a block from an older filesystem. The reason that
      log recovery didn't replay it was that the LSN in the XARM block was
      larger than the LSN of the transaction being replayed, and so the
      block was not overwritten by log recovery.
      
      Hence, log recovery cant blindly trust the magic number and LSN in
      the block - it must verify that it belongs to the filesystem being
      recovered before using the LSN. i.e. if the UUIDs don't match, we
      need to unconditionally recovery the change held in the log.
      
      This patch was first tested on a block device that was repeatedly
      causing xfs/182 to fail with the same failure on the same block with
      the same directory read corruption signature (i.e. XARM block). It
      did not fail, and hasn't failed since.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBen Myers <bpm@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      566055d3
    • D
      xfs: fix XFS_IOC_FREE_EOFBLOCKS definition · b771af2f
      Dave Chinner 提交于
      It uses a kernel internal structure in it's definition rather than
      the user visible structure that is passed to the ioctl.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NMark Tinguely <tinguely@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      b771af2f
    • D
      xfs: asserting lock not held during freeing not valid · b313a5f1
      Dave Chinner 提交于
      When we free an inode, we do so via RCU. As an RCU lookup can occur
      at any time before we free an inode, and that lookup takes the inode
      flags lock, we cannot safely assert that the flags lock is not held
      just before marking it dead and running call_rcu() to free the
      inode.
      
      We check on allocation of a new inode structre that the lock is not
      held, so we still have protection against locks being leaked and
      hence not correctly initialised when allocated out of the slab.
      Hence just remove the assert...
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NMark Tinguely <tinguely@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      b313a5f1
    • D
      xfs: lock the AIL before removing the buffer item · 48852358
      Dave Chinner 提交于
      Regression introduced by commit 46f9d2eb ("xfs: aborted buf items can
      be in the AIL") which fails to lock the AIL before removing the
      item. Spinlock debugging throws a warning about this.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NMark Tinguely <tinguely@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      48852358
  5. 15 9月, 2013 1 次提交
  6. 14 9月, 2013 3 次提交
  7. 13 9月, 2013 8 次提交
  8. 12 9月, 2013 17 次提交
    • M
      xfs: remove dead code from xlog_recover_inode_pass2 · 08474ed6
      Mark Tinguely 提交于
      Additional code in the error handler of xlog_recover_inode_pass2()
      results in the following error:
      
      static checker warning: "fs/xfs/xfs_log_recover.c:2999
      xlog_recover_inode_pass2()
      	 info: ignoring unreachable code."
      Reported-by: NDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: NMark Tinguely <tinguely@sgi.com>
      Reviewed-by: Ben Myers <bpm@sgi.com
      Signed-off-by: NBen Myers <bpm@sgi.com>
      08474ed6
    • D
      xfs: = vs == typo in ASSERT() · aa9e1040
      Dan Carpenter 提交于
      There is a '=' vs '==' typo so the ASSERT()s are always true.
      Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NEric Sandeen <sandeen@redhat.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      aa9e1040
    • R
      initmpfs: move rootfs code from fs/ramfs/ to init/ · 57f150a5
      Rob Landley 提交于
      When the rootfs code was a wrapper around ramfs, having them in the same
      file made sense.  Now that it can wrap another filesystem type, move it in
      with the init code instead.
      
      This also allows a subsequent patch to access rootfstype= command line
      arg.
      Signed-off-by: NRob Landley <rob@landley.net>
      Cc: Jeff Layton <jlayton@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Stephen Warren <swarren@nvidia.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Jim Cromie <jim.cromie@gmail.com>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      57f150a5
    • R
      initmpfs: move bdi setup from init_rootfs to init_ramfs · 4bbee76b
      Rob Landley 提交于
      Even though ramfs hasn't got a backing device, commit e0bf68dd ("mm:
      bdi init hooks") added one anyway, and put the initialization in
      init_rootfs() since that's the first user, leaving it out of init_ramfs()
      to avoid duplication.
      
      But initmpfs uses init_tmpfs() instead, so move the init into the
      filesystem's init function, add a "once" guard to prevent duplicate
      initialization, and call the filesystem init from rootfs init.
      
      This goes part of the way to allowing ramfs to be built as a module.
      
      [akpm@linux-foundation.org; using bit 1 was odd]
      Signed-off-by: NRob Landley <rob@landley.net>
      Cc: Jeff Layton <jlayton@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Stephen Warren <swarren@nvidia.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Jim Cromie <jim.cromie@gmail.com>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4bbee76b
    • R
      initmpfs: replace MS_NOUSER in initramfs · 137fdcc1
      Rob Landley 提交于
      Mounting MS_NOUSER prevents --bind mounts from rootfs.  Prevent new rootfs
      mounts with a different mechanism that doesn't affect bind mounts.
      Signed-off-by: NRob Landley <rob@landley.net>
      Cc: Jeff Layton <jlayton@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Stephen Warren <swarren@nvidia.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Jim Cromie <jim.cromie@gmail.com>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      137fdcc1
    • J
      lib/radix-tree.c: make radix_tree_node_alloc() work correctly within interrupt · 5e4c0d97
      Jan Kara 提交于
      With users of radix_tree_preload() run from interrupt (block/blk-ioc.c is
      one such possible user), the following race can happen:
      
      radix_tree_preload()
      ...
      radix_tree_insert()
        radix_tree_node_alloc()
          if (rtp->nr) {
            ret = rtp->nodes[rtp->nr - 1];
      <interrupt>
      ...
      radix_tree_preload()
      ...
      radix_tree_insert()
        radix_tree_node_alloc()
          if (rtp->nr) {
            ret = rtp->nodes[rtp->nr - 1];
      
      And we give out one radix tree node twice.  That clearly results in radix
      tree corruption with different results (usually OOPS) depending on which
      two users of radix tree race.
      
      We fix the problem by making radix_tree_node_alloc() always allocate fresh
      radix tree nodes when in interrupt.  Using preloading when in interrupt
      doesn't make sense since all the allocations have to be atomic anyway and
      we cannot steal nodes from process-context users because some users rely
      on radix_tree_insert() succeeding after radix_tree_preload().
      in_interrupt() check is somewhat ugly but we cannot simply key off passed
      gfp_mask as that is acquired from root_gfp_mask() and thus the same for
      all preload users.
      
      Another part of the fix is to avoid node preallocation in
      radix_tree_preload() when passed gfp_mask doesn't allow waiting.  Again,
      preallocation in such case doesn't make sense and when preallocation would
      happen in interrupt we could possibly leak some allocated nodes.  However,
      some users of radix_tree_preload() require following radix_tree_insert()
      to succeed.  To avoid unexpected effects for these users,
      radix_tree_preload() only warns if passed gfp mask doesn't allow waiting
      and we provide a new function radix_tree_maybe_preload() for those users
      which get different gfp mask from different call sites and which are
      prepared to handle radix_tree_insert() failure.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Cc: Jens Axboe <jaxboe@fusionio.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5e4c0d97
    • D
      affs: use loff_t in affs_truncate() · 63259326
      Dan Carpenter 提交于
      It seems pretty unlikely that AFFS supports files over 4GB but we may as
      well leave use loff_t just for cleanness sake instead of truncating it to
      32 bits.
      Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
      Cc: Marco Stornelli <marco.stornelli@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      63259326
    • M
      vmcore: enable /proc/vmcore mmap for s390 · 11e376a3
      Michael Holzheu 提交于
      The patch "s390/vmcore: Implement remap_oldmem_pfn_range for s390" allows
      now to use mmap also on s390.
      
      So enable mmap for s390 again.
      Signed-off-by: NMichael Holzheu <holzheu@linux.vnet.ibm.com>
      Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
      Cc: Jan Willeke <willeke@de.ibm.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      11e376a3
    • M
      vmcore: introduce remap_oldmem_pfn_range() · 9cb21813
      Michael Holzheu 提交于
      For zfcpdump we can't map the HSA storage because it is only available via
      a read interface.  Therefore, for the new vmcore mmap feature we have
      introduce a new mechanism to create mappings on demand.
      
      This patch introduces a new architecture function remap_oldmem_pfn_range()
      that should be used to create mappings with remap_pfn_range() for oldmem
      areas that can be directly mapped.  For zfcpdump this is everything
      besides of the HSA memory.  For the areas that are not mapped by
      remap_oldmem_pfn_range() a generic vmcore a new generic vmcore fault
      handler mmap_vmcore_fault() is called.
      
      This handler works as follows:
      
      * Get already available or new page from page cache (find_or_create_page)
      * Check if /proc/vmcore page is filled with data (PageUptodate)
      * If yes:
        Return that page
      * If no:
        Fill page using __vmcore_read(), set PageUptodate, and return page
      Signed-off-by: NMichael Holzheu <holzheu@linux.vnet.ibm.com>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
      Cc: Jan Willeke <willeke@de.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9cb21813
    • M
      vmcore: introduce ELF header in new memory feature · be8a8d06
      Michael Holzheu 提交于
      For s390 we want to use /proc/vmcore for our SCSI stand-alone dump
      (zfcpdump).  We have support where the first HSA_SIZE bytes are saved into
      a hypervisor owned memory area (HSA) before the kdump kernel is booted.
      When the kdump kernel starts, it is restricted to use only HSA_SIZE bytes.
      
      The advantages of this mechanism are:
      
       * No crashkernel memory has to be defined in the old kernel.
       * Early boot problems (before kexec_load has been done) can be dumped
       * Non-Linux systems can be dumped.
      
      We modify the s390 copy_oldmem_page() function to read from the HSA memory
      if memory below HSA_SIZE bytes is requested.
      
      Since we cannot use the kexec tool to load the kernel in this scenario,
      we have to build the ELF header in the 2nd (kdump/new) kernel.
      
      So with the following patch set we would like to introduce the new
      function that the ELF header for /proc/vmcore can be created in the 2nd
      kernel memory.
      
      The following steps are done during zfcpdump execution:
      
      1.  Production system crashes
      2.  User boots a SCSI disk that has been prepared with the zfcpdump tool
      3.  Hypervisor saves CPU state of boot CPU and HSA_SIZE bytes of memory into HSA
      4.  Boot loader loads kernel into low memory area
      5.  Kernel boots and uses only HSA_SIZE bytes of memory
      6.  Kernel saves registers of non-boot CPUs
      7.  Kernel does memory detection for dump memory map
      8.  Kernel creates ELF header for /proc/vmcore
      9.  /proc/vmcore uses this header for initialization
      10. The zfcpdump user space reads /proc/vmcore to write dump to SCSI disk
          - copy_oldmem_page() copies from HSA for memory below HSA_SIZE
          - copy_oldmem_page() copies from real memory for memory above HSA_SIZE
      
      Currently for s390 we create the ELF core header in the 2nd kernel with a
      small trick.  We relocate the addresses in the ELF header in a way that
      for the /proc/vmcore code it seems to be in the 1st kernel (old) memory
      and the read_from_oldmem() returns the correct data.  This allows the
      /proc/vmcore code to use the ELF header in the 2nd kernel.
      
      This patch:
      
      Exchange the old mechanism with the new and much cleaner function call
      override feature that now offcially allows to create the ELF core header
      in the 2nd kernel.
      
      To use the new feature the following function have to be defined
      by the architecture backend code to read from new memory:
      
       * elfcorehdr_alloc: Allocate ELF header
       * elfcorehdr_free: Free the memory of the ELF header
       * elfcorehdr_read: Read from ELF header
       * elfcorehdr_read_notes: Read from ELF notes
      Signed-off-by: NMichael Holzheu <holzheu@linux.vnet.ibm.com>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
      Cc: Jan Willeke <willeke@de.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      be8a8d06
    • O
      exec: cleanup the error handling in search_binary_handler() · 6b3c538f
      Oleg Nesterov 提交于
      The error hanling and ret-from-loop look confusing and inconsistent.
      
      - "retval >= 0" simply returns
      
      - "!bprm->file" returns too but with read_unlock() because
         binfmt_lock was already re-acquired
      
      - "retval != -ENOEXEC || bprm->mm == NULL" does "break" and
        relies on the same check after the main loop
      
      Consolidate these checks into a single if/return statement.
      
      need_retry still checks "retval == -ENOEXEC", but this and -ENOENT before
      the main loop are not needed.  This is only for pathological and
      impossible list_empty(&formats) case.
      
      It is not clear why do we check "bprm->mm == NULL", probably this
      should be removed.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Acked-by: NKees Cook <keescook@chromium.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Evgeniy Polyakov <zbr@ioremap.net>
      Cc: Zach Levis <zml@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6b3c538f
    • O
      exec: don't retry if request_module() fails · 4e0621a0
      Oleg Nesterov 提交于
      A separate one-liner for better documentation.
      
      It doesn't make sense to retry if request_module() fails to exec
      /sbin/modprobe, add the additional "request_module() < 0" check.
      
      However, this logic still doesn't look exactly right:
      
      1. It would be better to check "request_module() != 0", the user
         space modprobe process should report the correct exit code.
         But I didn't dare to add the user-visible change.
      
      2. The whole ENOEXEC logic looks suboptimal. Suppose that we try
         to exec a "#!path-to-unsupported-binary" script. In this case
         request_module() + "retry" will be done twice: first by the
         "depth == 1" code, and then again by the "depth == 0" caller
         which doesn't make sense.
      
      3. And note that in the case above bprm->buf was already changed
         by load_script()->prepare_binprm(), so this looks even more
         ugly.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Acked-by: NKees Cook <keescook@chromium.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Evgeniy Polyakov <zbr@ioremap.net>
      Cc: Zach Levis <zml@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4e0621a0
    • O
      exec: cleanup the CONFIG_MODULES logic · cb7b6b1c
      Oleg Nesterov 提交于
      search_binary_handler() uses "for (try=0; try<2; try++)" to avoid "goto"
      but the code looks too complicated and horrible imho.  We still need to
      check "try == 0" before request_module() and add the additional "break"
      for !CONFIG_MODULES case.
      
      Kill this loop and use a simple "bool need_retry" + "goto retry".  The
      code looks much simpler and we do not even need ifdef's, gcc can optimize
      out the "if (need_retry)" block if !IS_ENABLED().
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Acked-by: NKees Cook <keescook@chromium.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Evgeniy Polyakov <zbr@ioremap.net>
      Cc: Zach Levis <zml@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cb7b6b1c
    • O
      exec: kill ->load_binary != NULL check in search_binary_handler() · 92eaa565
      Oleg Nesterov 提交于
      search_binary_handler() checks ->load_binary != NULL for no reason, this
      method should be always defined.  Turn this check into WARN_ON() and move
      it into __register_binfmt().
      
      Also, kill the function pointer.  The current code looks confusing, as if
      ->load_binary can go away after read_unlock(&binfmt_lock).  But we rely on
      module_get(fmt->module), this fmt can't be changed or unregistered,
      otherwise this code is buggy anyway.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Acked-by: NKees Cook <keescook@chromium.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Evgeniy Polyakov <zbr@ioremap.net>
      Cc: Zach Levis <zml@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      92eaa565
    • O
      exec: move allow_write_access/fput to exec_binprm() · 52f14282
      Oleg Nesterov 提交于
      When search_binary_handler() succeeds it does allow_write_access() and
      fput(), then it clears bprm->file to ensure the caller will not do the
      same.
      
      We can simply move this code to exec_binprm() which is called only once.
      In fact we could move this to free_bprm() and remove the same code in
      do_execve_common's error path.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Acked-by: NKees Cook <keescook@chromium.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Evgeniy Polyakov <zbr@ioremap.net>
      Cc: Zach Levis <zml@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      52f14282
    • O
      exec: proc_exec_connector() should be called only once · 9beb266f
      Oleg Nesterov 提交于
      A separate one-liner with the minor fix.
      
      PROC_EVENT_EXEC reports the "exec" event, but this message is sent at
      least twice if search_binary_handler() is called by ->load_binary()
      recursively, say, load_script().
      
      Move it to exec_binprm(), this is "depth == 0" code too.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Acked-by: NKees Cook <keescook@chromium.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Evgeniy Polyakov <zbr@ioremap.net>
      Cc: Zach Levis <zml@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9beb266f
    • O
      exec: kill "int depth" in search_binary_handler() · 131b2f9f
      Oleg Nesterov 提交于
      Nobody except search_binary_handler() should touch ->recursion_depth, "int
      depth" buys nothing but complicates the code, kill it.
      
      Probably we should also kill "fn" and the !NULL check, ->load_binary
      should be always defined.  And it can not go away after read_unlock() or
      this code is buggy anyway.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Acked-by: NKees Cook <keescook@chromium.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Evgeniy Polyakov <zbr@ioremap.net>
      Cc: Zach Levis <zml@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      131b2f9f