1. 05 Sep, 2015 22 commits
    • ocfs2: acknowledge return value of ocfs2_error() · 17a5b9ab
      Goldwyn Rodrigues committed
      Caveat: This may return -EROFS for a read case, which seems wrong.  This
      is happening even without this patch series though.  Should we convert
      EROFS to EIO?
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: add errors=continue · 7d0fb914
      Goldwyn Rodrigues committed
      OCFS2 is often used in high-availability systems.  However, ocfs2
      converts the filesystem to read-only at the drop of a hat.  This may
      not be necessary, since turning the filesystem read-only would affect
      other running processes as well, decreasing availability.

      This patch attempts to add errors=continue, which would return EIO to
      the calling process and terminate further processing so that the
      filesystem is not corrupted further.  However, the filesystem is not
      converted to read-only.
      
      As a future plan, I intend to create a small utility or extend
      fsck.ocfs2 to fix small errors such as in the inode.  The input to the
      utility such as the inode can come from the kernel logs so we don't have
      to schedule a downtime for fixing small-enough errors.
      
      The patch changes the ocfs2_error to return an error.  The error
      returned depends on the mount option set.  If none is set, the default
      is to turn the filesystem read-only.
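The policy described above can be sketched in userspace C. This is only an illustration under stated assumptions, not the kernel code: `enum err_policy`, `struct fs_state`, and `handle_fs_error()` are hypothetical names, and the use of -EROFS for the default case is inferred from the caveat in the preceding commit.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the mount option and the superblock state. */
enum err_policy { ERRORS_DEFAULT_RO, ERRORS_CONTINUE };

struct fs_state {
	enum err_policy policy;
	bool read_only;
};

/*
 * Report an on-disk error and return the code the caller should
 * propagate.  With errors=continue the filesystem stays writable and
 * the caller gets -EIO; by default the filesystem is turned read-only
 * and the caller gets -EROFS (an assumption of this sketch).
 */
static int handle_fs_error(struct fs_state *fs)
{
	assert(fs != NULL);

	if (fs->policy == ERRORS_CONTINUE)
		return -EIO;

	fs->read_only = true;
	return -EROFS;
}
```

The point of returning a value is that every caller has to acknowledge it and stop working on the damaged structure, whichever policy is mounted.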
      
      Perhaps errors=continue is not the best option name.  Historically it is
      used for making an attempt to progress in the current process itself.
      Should we call it errors=eio? or errors=killproc? Suggestions/Comments
      welcome.
      
      Sources are available at:
        https://github.com/goldwynr/linux/tree/error-cont
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: flush inode data to disk and free inode when i_count becomes zero · 513e2dae
      Xue jiufei committed
      Disk inode deletion may be heavily delayed when one node unlinks a file
      after the same dentry has been freed on another node (say N1) because of
      memory shrinking while the inode is left in memory.  This inode can only
      be freed when N1 does the orphan scan work.

      However, N1 may skip the orphan scan several times because other nodes
      may do the work earlier.  In our tests, it may take 1 hour on a 4-node
      cluster, which hurts the user experience.  So we think the inode should
      be freed after the data is flushed to disk when i_count becomes zero, to
      avoid such circumstances.
      Signed-off-by: Joyce.xue <xuejiufei@huawei.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Reviewed-by: Mark Fasheh <mfasheh@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: trusted xattr missing CAP_SYS_ADMIN check · 0f5e7b41
      Sanidhya Kashyap committed
      The trusted extended attributes are only visible to processes which
      have the CAP_SYS_ADMIN capability, but the check is missing in the ocfs2
      xattr_handler trusted list.  The check is important because it will be
      used for implementing mechanisms in userspace to which other
      ordinary processes should not have access.
      Signed-off-by: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
      Reviewed-by: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Taesoo kim <taesoo@gatech.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: set filesystem read-only when ocfs2_delete_entry failed. · 807a7907
      jiangyiwen committed
      In ocfs2_rename, an inode ends up with two entries (old and new) if
      ocfs2_delete_entry(old) fails.  Thus, the filesystem will be inconsistent.
      
      The case is described below:
      
      ocfs2_rename
          -> ocfs2_start_trans
          -> ocfs2_add_entry(new)
          -> ocfs2_delete_entry(old)
              -> __ocfs2_journal_access *failed* because of -ENOMEM
          -> ocfs2_commit_trans
      
      So the filesystem should be set to read-only at that point.
      Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
      Cc: Joseph Qi <joseph.qi@huawei.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Reviewed-by: Mark Fasheh <mfasheh@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2/dlm: use list_for_each_entry instead of list_for_each · f83c7b5e
      Joseph Qi committed
      Use list_for_each_entry instead of list_for_each to simplify code.
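To illustrate why the entry-based iterator is simpler, here is a userspace sketch. The list macros below are simplified copies written for this example (the kernel's own live in include/linux/list.h), and `struct item`, `sum_old()`, and `sum_new()` are hypothetical names. `__typeof__` is a GCC/Clang extension.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal userspace versions of the kernel list helpers (simplified). */
struct list_head { struct list_head *next, *prev; };

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))
#define list_for_each(pos, head) \
	for (pos = (head)->next; pos != (head); pos = pos->next)
#define list_for_each_entry(pos, head, member) \
	for (pos = container_of((head)->next, __typeof__(*pos), member); \
	     &pos->member != (head); \
	     pos = container_of(pos->member.next, __typeof__(*pos), member))

static void list_init(struct list_head *h) { h->next = h->prev = h; }

static void list_add_tail(struct list_head *n, struct list_head *head)
{
	assert(head->prev != NULL);	/* head must have been initialized */
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* Hypothetical payload type, used only for this illustration. */
struct item { int value; struct list_head link; };

/* Old style: iterate over list_head, then convert by hand. */
static int sum_old(struct list_head *head)
{
	struct list_head *iter;
	int sum = 0;

	list_for_each(iter, head) {
		struct item *it = container_of(iter, struct item, link);
		sum += it->value;
	}
	return sum;
}

/* New style: the macro yields the containing struct directly. */
static int sum_new(struct list_head *head)
{
	struct item *it;
	int sum = 0;

	list_for_each_entry(it, head, link)
		sum += it->value;
	return sum;
}
```

Both functions visit the same nodes; the second drops the temporary `list_head` cursor and the explicit conversion.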
      Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: remove unneeded code in dlm_register_domain_handlers · 0e3d9eaf
      Joseph Qi committed
      The last goto statement is unneeded, so remove it.
      Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: fix BUG when o2hb_register_callback fails · cdd09f49
      Joseph Qi committed
      In dlm_register_domain_handlers, if o2hb_register_callback fails, it
      will call dlm_unregister_domain_handlers to unregister.  This will
      trigger the BUG_ON in o2hb_unregister_callback because hc_magic is 0.
      So we should call o2hb_setup_callback to initialize hc first.
      Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: remove unneeded code in ocfs2_dlm_init · 914a9b74
      Joseph Qi committed
      status is already initialized and it will only be 0 or negative in the
      code flow.  So remove the unneeded assignment after the label 'local'.
      Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: adjust code to match locking/unlocking order · 3cb2ec43
      Joseph Qi committed
      The unlocking order in ocfs2_unlink and ocfs2_rename does not match the
      corresponding locking order.  Although this won't cause issues, adjust
      the code so that it looks more reasonable.
      Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: clean up unused local variables in ocfs2_file_write_iter · bf59e662
      Joseph Qi committed
      Since commit 86b9c6f3 ("ocfs2: remove filesize checks for sync I/O
      journal commit") removes filesize checks for sync I/O journal commit,
      variables old_size and old_clusters are not actually used any more.  So
      clean them up.
      Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: do not log twice error messages · 372a447c
      Christophe JAILLET committed
      'o2hb_map_slot_data' and 'o2hb_populate_slot_data' are called from only
      one place, in 'o2hb_region_dev_write'.  Return value is checked and
      'mlog_errno' is called to log a message if it is not 0.
      
      So there is no need to call 'mlog_errno' directly within these functions.
      This would result in logging the message twice.
      Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: do not BUG if buffer not uptodate in __ocfs2_journal_access · acf8fdbe
      Joseph Qi committed
      When the storage network is unstable, it may trigger the BUG in
      __ocfs2_journal_access because the buffer is not uptodate.  We can retry
      the write in this case or return an error instead of BUG.
      Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
      Reported-by: Zhangguanghui <zhang.guanghui@h3c.com>
      Tested-by: Zhangguanghui <zhang.guanghui@h3c.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: fix several issues of append dio · faaebf18
      Joseph Qi committed
      1) Take rw EX lock in case of append dio.
      2) Explicitly treat the error code -EIOCBQUEUED as normal.
      3) Set di_bh to NULL after brelse if it may be used again later.
      Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
      Cc: Yiwen Jiang <jiangyiwen@huawei.com>
      Cc: Weiwei Wang <wangww631@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: fix race between dio and recover orphan · 512f62ac
      Joseph Qi committed
      During direct io the inode will be added to orphan first and then
      deleted from orphan.  There is a race window that the orphan entry will
      be deleted twice and thus trigger the BUG when validating
      OCFS2_DIO_ORPHANED_FL in ocfs2_del_inode_from_orphan.
      
      ocfs2_direct_IO_write
          ...
          ocfs2_add_inode_to_orphan
          >>>>>>>> race window.
                   1) another node may rm the file and then go down; this node
                   then takes care of orphan recovery and clears the flag
                   OCFS2_DIO_ORPHANED_FL.
                   2) since rw lock is unlocked, it may race with another
                   orphan recovery and append dio.
          ocfs2_del_inode_from_orphan
      
      So take inode mutex lock when recovering orphans and make rw unlock at the
      end of aio write in case of append dio.
      Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
      Reported-by: Yiwen Jiang <jiangyiwen@huawei.com>
      Cc: Weiwei Wang <wangww631@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ntfs: delete unnecessary checks before calling iput() · 917520e1
      SF Markus Elfring committed
      iput() tests whether its argument is NULL and then returns immediately.
      Thus the test around the call is not needed.
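The pattern can be shown with a small userspace sketch. `struct obj`, `obj_put()`, and `cleanup()` are hypothetical stand-ins invented for this example; only the NULL-tolerant behaviour mirrors what the message says about iput().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical refcounted object standing in for an inode. */
struct obj { int refcount; };

/* Like iput() as described above: a NULL argument is a no-op. */
static void obj_put(struct obj *o)
{
	if (!o)
		return;
	assert(o->refcount > 0);
	o->refcount--;
}

/* Callers can therefore drop their own "if (o)" guard. */
static void cleanup(struct obj *a, struct obj *b)
{
	obj_put(a);
	obj_put(b);
}
```

Because the callee already handles NULL, a guard at each call site only duplicates the test.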
      
      This issue was detected by using the Coccinelle software.
      Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
      Cc: Julia Lawall <julia.lawall@lip6.fr>
      Reviewed-by: Anton Altaparmakov <anton@tuxera.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fsnotify: get rid of fsnotify_destroy_mark_locked() · 4712e722
      Jan Kara committed
      fsnotify_destroy_mark_locked() is subtle to use because it temporarily
      releases group->mark_mutex.  To avoid future problems with this
      function, split it into two.
      
      fsnotify_detach_mark() is the part that needs group->mark_mutex and
      fsnotify_free_mark() is the part that must be called outside of
      group->mark_mutex.  This way it's much clearer what's going on and we
      also avoid some pointless acquisitions of group->mark_mutex.
      Signed-off-by: Jan Kara <jack@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fsnotify: remove mark->free_list · 925d1132
      Jan Kara committed
      Free list is used when all marks on given inode / mount should be
      destroyed when inode / mount is going away.  However we can free all of
      the marks without using a special list with some care.
      Signed-off-by: Jan Kara <jack@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fsnotify: fix check in inotify fdinfo printing · 3c53e514
      Jan Kara committed
      A check in inotify_fdinfo() checking whether mark is valid was always
      true due to a bug.  Luckily we can never get to invalidated marks since
      we hold mark_mutex and invalidated marks get removed from the group list
      when they are invalidated under that mutex.
      
      Anyway fix the check to make code more future proof.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs/notify: optimize inotify/fsnotify code for unwatched files · 7c49b861
      Dave Hansen committed
      I have a _tiny_ microbenchmark that sits in a loop and writes single
      bytes to a file.  Writing one byte to a tmpfs file is around 2x slower
      than reading one byte from a file, which is a _bit_ more than I expected.
      This is a dumb benchmark, but I think it's hard to deny that write() is
      a hot path and we should avoid unnecessary overhead there.
      
      I did a 'perf record' of 30-second samples of read and write.  The top
      item in a diffprofile is srcu_read_lock() from fsnotify().  There are
      active inotify fd's from systemd, but nothing is actually listening to
      the file or its part of the filesystem.
      
      I *think* we can avoid taking the srcu_read_lock() for the common case
      where there are no actual marks on the file.  This means that there will
      both be nothing to notify for *and* implies that there is no need for
      clearing the ignore mask.
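The fast path can be sketched in userspace as follows. Everything here is a hypothetical stand-in: `struct watched`, `read_lock()`/`read_unlock()` (playing the role of the SRCU read-side lock), and the counter exist only to make the idea testable, and the real kernel check may look different.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical watched object: NULL means "no marks attached". */
struct watched {
	const void *inode_marks;
	const void *mount_marks;
};

/* Counts how often the (stand-in) read lock is taken. */
static int lock_acquisitions;

static void read_lock(void)   { lock_acquisitions++; }
static void read_unlock(void) { }

/* Returns true if the slow path ran and listeners may have been notified. */
static bool notify(const struct watched *w)
{
	assert(w != NULL);

	/* Fast path: no marks at all, so skip the lock entirely. */
	if (!w->inode_marks && !w->mount_marks)
		return false;

	read_lock();
	/* ... walk the marks and deliver events here ... */
	read_unlock();
	return true;
}
```

For a hot path such as write(), the unwatched case then costs two pointer tests instead of a lock round trip.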
      
      This patch gave a 13.1% speedup in writes/second on my test, which is an
      improvement from the 10.8% that I saw with the last version.
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Jan Kara <jack@suse.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: John McCutchan <john@johnmccutchan.com>
      Cc: Robert Love <rlove@rlove.org>
      Cc: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • capabilities: ambient capabilities · 58319057
      Andy Lutomirski committed
      Credit where credit is due: this idea comes from Christoph Lameter with
      a lot of valuable input from Serge Hallyn.  This patch is heavily based
      on Christoph's patch.
      
      ===== The status quo =====
      
      On Linux, there are a number of capabilities defined by the kernel.  To
      perform various privileged tasks, processes can wield capabilities that
      they hold.
      
      Each task has four capability masks: effective (pE), permitted (pP),
      inheritable (pI), and a bounding set (X).  When the kernel checks for a
      capability, it checks pE.  The other capability masks serve to modify
      what capabilities can be in pE.
      
      Any task can remove capabilities from pE, pP, or pI at any time.  If a
      task has a capability in pP, it can add that capability to pE and/or pI.
      If a task has CAP_SETPCAP, then it can add any capability to pI, and it
      can remove capabilities from X.
      
      Tasks are not the only things that can have capabilities; files can also
      have capabilities.  A file can have no capability information at all [1].
      If a file has capability information, then it has a permitted mask (fP)
      and an inheritable mask (fI) as well as a single effective bit (fE) [2].
      File capabilities modify the capabilities of tasks that execve(2) them.
      
      A task that successfully calls execve has its capabilities modified for
      the file ultimately being executed (i.e. the binary itself if that
      binary is ELF, or the interpreter if the binary is a script) [3].  In
      the capability evolution rules, for each mask Z, pZ represents the old
      value and pZ' represents the new value.  The rules are:
      
        pP' = (X & fP) | (pI & fI)
        pI' = pI
        pE' = (fE ? pP' : 0)
        X is unchanged
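As a worked sketch of these rules, the function below applies them to plain bitmasks. `capmask_t`, `struct task_caps`, `struct file_caps`, and `exec_old()` are names assumed for this example and are not kernel types.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t capmask_t;	/* hypothetical: one bit per capability */

struct task_caps { capmask_t pE, pP, pI, X; };
struct file_caps { capmask_t fP, fI; bool fE; };

/* Apply the execve rules quoted above and return the new task masks. */
static struct task_caps exec_old(struct task_caps t, struct file_caps f)
{
	struct task_caps n;

	n.pP = (t.X & f.fP) | (t.pI & f.fI);
	n.pI = t.pI;
	n.pE = f.fE ? n.pP : 0;
	n.X  = t.X;

	/* pE' is either pP' or empty, so it never exceeds pP'. */
	assert((n.pE & ~n.pP) == 0);
	return n;
}
```

With fP and fI both empty, as for a nonroot user running an ordinary binary, pP' comes out empty no matter what the task held in pI.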
      
      For setuid binaries, fP, fI, and fE are modified by a moderately
      complicated set of rules that emulate POSIX behavior.  Similarly, if
      euid == 0 or ruid == 0, then fP, fI, and fE are modified differently
      (primarily, fP and fI usually end up being the full set).  For nonroot
      users executing binaries with neither setuid nor file caps, fI and fP
      are empty and fE is false.
      
      As an extra complication, if you execute a process as nonroot and fE is
      set, then the "secure exec" rules are in effect: AT_SECURE gets set,
      LD_PRELOAD doesn't work, etc.
      
      This is rather messy.  We've learned that making any changes is
      dangerous, though: if a new kernel version allows an unprivileged
      program to change its security state in a way that persists across
      execution of a setuid program or a program with file caps, this
      persistent state is surprisingly likely to allow setuid or file-capped
      programs to be exploited for privilege escalation.
      
      ===== The problem =====
      
      Capability inheritance is basically useless.
      
      If you aren't root and you execute an ordinary binary, fI is zero, so
      your capabilities have no effect whatsoever on pP'.  This means that you
      can't usefully execute a helper process or a shell command with elevated
      capabilities if you aren't root.
      
      On current kernels, you can sort of work around this by setting fI to
      the full set for most or all non-setuid executable files.  This causes
      pP' = pI for nonroot, and inheritance works.  No one does this because
      it's a PITA and it isn't even supported on most filesystems.
      
      If you try this, you'll discover that every nonroot program ends up with
      secure exec rules, breaking many things.
      
      This is a problem that has bitten many people who have tried to use
      capabilities for anything useful.
      
      ===== The proposed change =====
      
      This patch adds a fifth capability mask called the ambient mask (pA).
      pA does what most people expect pI to do.
      
      pA obeys the invariant that no bit can ever be set in pA if it is not
      set in both pP and pI.  Dropping a bit from pP or pI drops that bit from
      pA.  This ensures that existing programs that try to drop capabilities
      still do so, with a complication.  Because capability inheritance is so
      broken, setting KEEPCAPS, using setresuid to switch to nonroot uids, and
      then calling execve effectively drops capabilities.  Therefore,
      setresuid from root to nonroot conditionally clears pA unless
      SECBIT_NO_SETUID_FIXUP is set.  Processes that don't like this can
      re-add bits to pA afterwards.
      
      The capability evolution rules are changed:
      
        pA' = (file caps or setuid or setgid ? 0 : pA)
        pP' = (X & fP) | (pI & fI) | pA'
        pI' = pI
        pE' = (fE ? pP' : pA')
        X is unchanged
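The changed rules can be sketched the same way. All type and function names below are assumptions made for this example; `file_is_privileged` stands for the "file caps or setuid or setgid" condition in the first rule.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t capmask_t;	/* hypothetical: one bit per capability */

struct task_caps { capmask_t pE, pP, pI, pA, X; };
struct file_caps { capmask_t fP, fI; bool fE; };

/* Apply the post-patch execve rules quoted above. */
static struct task_caps exec_new(struct task_caps t, struct file_caps f,
				 bool file_is_privileged)
{
	struct task_caps n;

	n.pA = file_is_privileged ? 0 : t.pA;
	n.pP = (t.X & f.fP) | (t.pI & f.fI) | n.pA;
	n.pI = t.pI;
	n.pE = f.fE ? n.pP : n.pA;
	n.X  = t.X;

	/* pA' is OR-ed into pP', so it is always a subset of pP'. */
	assert((n.pA & ~n.pP) == 0);
	return n;
}
```

For an ordinary binary the ambient bits flow into pA', pP', and pE'. For a setuid, setgid, or file-capped binary pA' is cleared and the result matches the old rules.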
      
      If you are nonroot but you have a capability, you can add it to pA.  If
      you do so, your children get that capability in pA, pP, and pE.  For
      example, you can set pA = CAP_NET_BIND_SERVICE, and your children can
      automatically bind low-numbered ports.  Hallelujah!
      
      Unprivileged users can create user namespaces, map themselves to a
      nonzero uid, and create both privileged (relative to their namespace)
      and unprivileged process trees.  This is currently more or less
      impossible.  Hallelujah!
      
      You cannot use pA to try to subvert a setuid, setgid, or file-capped
      program: if you execute any such program, pA gets cleared and the
      resulting evolution rules are unchanged by this patch.
      
      Users with nonzero pA are unlikely to unintentionally leak that
      capability.  If they run programs that try to drop privileges, dropping
      privileges will still work.
      
      It's worth noting that the degree of paranoia in this patch could
      possibly be reduced without causing serious problems.  Specifically, if
      we allowed pA to persist across executing non-pA-aware setuid binaries
      and across setresuid, then, naively, the only capabilities that could
      leak as a result would be the capabilities in pA, and any attacker
      *already* has those capabilities.  This would make me nervous, though --
      setuid binaries that tried to privilege-separate might fail to do so,
      and putting CAP_DAC_READ_SEARCH or CAP_DAC_OVERRIDE into pA could have
      unexpected side effects.  (Whether these unexpected side effects would
      be exploitable is an open question.) I've therefore taken the more
      paranoid route.  We can revisit this later.
      
      An alternative would be to require PR_SET_NO_NEW_PRIVS before setting
      ambient capabilities.  I think that this would be annoying and would
      make granting otherwise unprivileged users minor ambient capabilities
      (CAP_NET_BIND_SERVICE or CAP_NET_RAW for example) much less useful than
      it is with this patch.
      
      ===== Footnotes =====
      
      [1] Files that are missing the "security.capability" xattr or that have
      unrecognized values for that xattr end up with has_cap set to false.
      The code that does that appears to be complicated for no good reason.
      
      [2] The libcap capability mask parsers and formatters are dangerously
      misleading and the documentation is flat-out wrong.  fE is *not* a mask;
      it's a single bit.  This has probably confused every single person who
      has tried to use file capabilities.
      
      [3] Linux very confusingly processes both the script and the interpreter
      if applicable, for reasons that elude me.  The results from thinking
      about a script's file capabilities and/or setuid bits are mostly
      discarded.
      
      Preliminary userspace code is here, but it needs updating:
      https://git.kernel.org/cgit/linux/kernel/git/luto/util-linux-playground.git/commit/?h=cap_ambient&id=7f5afbd175d2
      
      Here is a test program that can be used to verify the functionality
      (from Christoph):
      
      /*
       * Test program for the ambient capabilities. This program spawns a shell
       * that allows running processes with a defined set of capabilities.
       *
       * (C) 2015 Christoph Lameter <cl@linux.com>
       * Released under: GPL v3 or later.
       *
       *
       * Compile using:
       *
       *	gcc -o ambient_test ambient_test.c -lcap-ng
       *
       * This program must have the following capabilities to run properly:
       * Permissions for CAP_NET_RAW, CAP_NET_ADMIN, CAP_SYS_NICE
       *
       * A command to equip the binary with the right caps is:
       *
       *	setcap cap_net_raw,cap_net_admin,cap_sys_nice+p ambient_test
       *
       *
       * To get a shell with additional caps that can be inherited by other processes:
       *
       *	./ambient_test /bin/bash
       *
       *
       * Verifying that it works:
       *
       * From the bash spawned by ambient_test, run
       *
       *	cat /proc/$$/status
       *
       * and have a look at the capabilities.
       */
      
      #include <stdlib.h>
      #include <stdio.h>
      #include <errno.h>
      #include <unistd.h>
      #include <cap-ng.h>
      #include <sys/prctl.h>
      #include <linux/capability.h>
      
      /*
       * Definitions from the kernel header files. These are going to be removed
       * when the /usr/include files have these defined.
       */
      #define PR_CAP_AMBIENT 47
      #define PR_CAP_AMBIENT_IS_SET 1
      #define PR_CAP_AMBIENT_RAISE 2
      #define PR_CAP_AMBIENT_LOWER 3
      #define PR_CAP_AMBIENT_CLEAR_ALL 4
      
      static void set_ambient_cap(int cap)
      {
      	int rc;
      
      	capng_get_caps_process();
      	rc = capng_update(CAPNG_ADD, CAPNG_INHERITABLE, cap);
      	if (rc) {
      		printf("Cannot add inheritable cap\n");
      		exit(2);
      	}
      	capng_apply(CAPNG_SELECT_CAPS);
      
      	/* Note the two 0s at the end. Kernel checks for these */
      	if (prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_RAISE, cap, 0, 0)) {
      		perror("Cannot set cap");
      		exit(1);
      	}
      }
      
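      int main(int argc, char **argv)
      {
      	/* A program to execute must be given on the command line. */
      	if (argc < 2) {
      		fprintf(stderr, "Usage: %s <program> [args...]\n", argv[0]);
      		exit(1);
      	}

      	set_ambient_cap(CAP_NET_RAW);
      	set_ambient_cap(CAP_NET_ADMIN);
      	set_ambient_cap(CAP_SYS_NICE);

      	printf("Ambient_test forking shell\n");
      	if (execv(argv[1], argv + 1))
      		perror("Cannot exec");

      	/* execv() only returns on failure. */
      	return 1;
      }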
      
      Signed-off-by: Christoph Lameter <cl@linux.com> # Original author
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Aaron Jones <aaronmdjones@gmail.com>
      Cc: Ted Ts'o <tytso@mit.edu>
      Cc: Andrew G. Morgan <morgan@kernel.org>
      Cc: Mimi Zohar <zohar@linux.vnet.ibm.com>
      Cc: Austin S Hemmelgarn <ahferroin7@gmail.com>
      Cc: Markku Savela <msa@moth.iki.fi>
      Cc: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: direct write will call ocfs2_rw_unlock() twice when doing aio+dio · aa1057b3
      Ryan Ding committed
      ocfs2_file_write_iter() is using the wrong return value ('written').  This
      will cause ocfs2_rw_unlock() to be called both in write_iter & end_io,
      triggering a BUG_ON.
      
      This issue was introduced by commit 7da839c4 ("ocfs2: use
      __generic_file_write_iter()").
      
      Orabug: 21612107
      Fixes: 7da839c4 ("ocfs2: use __generic_file_write_iter()")
      Signed-off-by: Ryan Ding <ryan.ding@oracle.com>
      Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 04 Sep, 2015 2 commits
  3. 02 Sep, 2015 1 commit
  4. 29 Aug, 2015 1 commit
    • f2fs: avoid accessing NULL pointer in f2fs_drop_largest_extent · 54d71856
      Chao Yu committed
      If the extent cache is disabled, we will encounter an oops when triggering
      direct IO, as below:
      
      BUG: unable to handle kernel NULL pointer dereference at 0000000c
      IP: [<f0b9c61e>] f2fs_drop_largest_extent+0xe/0x30 [f2fs]
      *pdpt = 000000002bb9a001 *pde = 0000000000000000
      Oops: 0000 [#1] SMP
      Modules linked in: f2fs(O) fuse bnep rfcomm bluetooth nfsd dm_crypt nfs_acl auth_rpcgss oid_registry nfs binfmt_misc fscache lockd
      sunrpc grace snd_intel8x0 snd_ac97_codec ac97_bus snd_pcm snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer
      snd_seq_device snd soundcore joydev psmouse hid_generic i2c_piix4 serio_raw ppdev mac_hid parport_pc lp parport ext4 jbd2 mbcache
      usbhid hid e1000
      CPU: 3 PID: 3608 Comm: dd Tainted: G           O    4.2.0-rc4 #12
      Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
      task: ef161600 ti: ebd5e000 task.ti: ebd5e000
      EIP: 0060:[<f0b9c61e>] EFLAGS: 00010202 CPU: 3
      EIP is at f2fs_drop_largest_extent+0xe/0x30 [f2fs]
      EAX: 00000000 EBX: ddebc000 ECX: 00000000 EDX: 00000000
      ESI: ebd5fdf8 EDI: 00000000 EBP: ebd5fd58 ESP: ebd5fd58
       DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
      CR0: 80050033 CR2: 0000000c CR3: 2c24ee40 CR4: 000006f0
      Stack:
       ebd5fda4 f0b8c005 00000000 00000001 00000000 f0b8c430 c816cd68 ddebc000
       ddebc088 00001000 00000555 00000555 ffffffff c160bb00 00055501 00000000
       00000000 00000100 00000000 ebd5fe20 f0b8c430 00000046 ef161600 00001000
      Call Trace:
       [<f0b8c005>] __allocate_data_block+0x1a5/0x260 [f2fs]
       [<f0b8c430>] ? f2fs_direct_IO+0x370/0x440 [f2fs]
       [<c160bb00>] ? down_read+0x30/0x50
       [<f0b8c430>] f2fs_direct_IO+0x370/0x440 [f2fs]
       [<c113e115>] generic_file_direct_write+0xa5/0x260
       [<c10b53f8>] ? current_fs_time+0x18/0x50
       [<c113e38b>] __generic_file_write_iter+0xbb/0x210
       [<c113e50f>] ? generic_file_write_iter+0x2f/0x320
       [<c113e63c>] generic_file_write_iter+0x15c/0x320
       [<f0b77f29>] f2fs_file_write_iter+0x39/0x80 [f2fs]
       [<c11984d9>] __vfs_write+0xa9/0xe0
       [<c1199227>] vfs_write+0x97/0x180
       [<c119955b>] SyS_write+0x5b/0xd0
       [<c160dcd0>] sysenter_do_call+0x12/0x12
      Code: 10 8b 50 1c 89 53 14 eb ca 8d 74 26 00 85 f6 74 86 eb a6 0f 0b 90 8d b4 26 00 00 00 00 55 89 e5 3e 8d 74 26 00 8b 80 d4 02 00
      00 <8b> 48 0c 39 d1 77 0e 03 48 14 39 ca 73 07 c7 40 14 00 00 00 00
      EIP: [<f0b9c61e>] f2fs_drop_largest_extent+0xe/0x30 [f2fs] SS:ESP 0068:ebd5fd58
      CR2: 000000000000000c
      ---[ end trace a38c07026a1afffd ]---
      
      This is because when the extent cache is disabled, the extent_tree pointer in
      struct f2fs_inode_info is NULL, but f2fs_drop_largest_extent accesses
      this NULL pointer directly without checking the state of the extent cache,
      so the oops occurs. Let's fix it by checking the state of the extent cache
      before accessing it.
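A simplified userspace sketch of the guard follows. The structures and `drop_largest_extent()` are hypothetical reductions invented for this example, and testing the pointer for NULL is an assumption standing in for whatever "extent cache enabled" check the actual fix uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the f2fs structures. */
struct extent { unsigned int fofs, len; };
struct extent_tree { struct extent largest; };
struct inode_info {
	struct extent_tree *extent_tree;	/* NULL when the cache is disabled */
};

/* Invalidate the cached largest extent if it covers file offset fofs. */
static bool drop_largest_extent(struct inode_info *fi, unsigned int fofs)
{
	struct extent *largest;

	assert(fi != NULL);

	/* Extent cache disabled: there is nothing to drop. */
	if (!fi->extent_tree)
		return false;

	largest = &fi->extent_tree->largest;
	if (fofs >= largest->fofs && fofs < largest->fofs + largest->len) {
		largest->len = 0;
		return true;
	}
	return false;
}
```

Without the early return, the disabled-cache case would read through the NULL pointer, which is the oops shown in the trace.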
      Signed-off-by: Chao Yu <chao2.yu@samsung.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
  5. 27 Aug, 2015 2 commits
    • dlm: print error from kernel_sendpage · b3a5bbfd
      Bob Peterson committed
      Print a dlm-specific error when a socket error occurs
      when sending a dlm message.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Signed-off-by: David Teigland <teigland@redhat.com>
      b3a5bbfd
    • C
      f2fs: update extent tree in batches · 19b2c30d
      Chao Yu committed
      This patch introduces a new helper, f2fs_update_extent_tree_range, which
      can update extent mappings over a specified range.
      
      The main idea is:
      1) punch all mapping info in extent node(s) which are at a specified range;
      2) try to merge new extent mapping with adjacent node, or failing that,
         insert the mapping into extent tree as a new node.
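
The two steps above can be sketched on a sorted array of extents. This is a simplified userland model, not the f2fs implementation (which keeps its nodes in an rbtree): struct extent, struct ext_cache, update_range and the fixed MAX_EXT capacity are all hypothetical names chosen for illustration, and a block address of 0 is assumed to mean "only drop the mapping".

```c
#include <assert.h>
#include <string.h>

#define MAX_EXT 64  /* hypothetical fixed capacity for this sketch */

struct extent { unsigned fofs, blk, len; };  /* file offset, block addr, length */
struct ext_cache { struct extent e[MAX_EXT]; int n; };  /* sorted by fofs */

static void insert_at(struct ext_cache *c, int i, struct extent x)
{
	assert(c->n < MAX_EXT);  /* the sketch assumes there is room */
	memmove(&c->e[i + 1], &c->e[i], (c->n - i) * sizeof(x));
	c->e[i] = x;
	c->n++;
}

static void remove_at(struct ext_cache *c, int i)
{
	memmove(&c->e[i], &c->e[i + 1], (c->n - i - 1) * sizeof(c->e[0]));
	c->n--;
}

/* Map [fofs, fofs + len) to blk; blk == 0 only drops the old mapping. */
static void update_range(struct ext_cache *c, unsigned fofs, unsigned blk,
			 unsigned len)
{
	unsigned end = fofs + len;
	int i = 0, merged = 0;

	/* 1) punch: trim or split every extent that overlaps the range */
	while (i < c->n) {
		struct extent old = c->e[i];
		unsigned old_end = old.fofs + old.len;

		if (old_end <= fofs) { i++; continue; }
		if (old.fofs >= end)
			break;
		remove_at(c, i);
		if (old.fofs < fofs)  /* keep the part left of the range */
			insert_at(c, i++, (struct extent){ old.fofs, old.blk,
							   fofs - old.fofs });
		if (old_end > end)    /* keep the part right of the range */
			insert_at(c, i, (struct extent){ end,
					old.blk + (end - old.fofs), old_end - end });
	}
	if (!blk)
		return;

	/* 2) merge with the left and/or right neighbor, else insert a new node */
	if (i > 0 && c->e[i - 1].fofs + c->e[i - 1].len == fofs &&
	    c->e[i - 1].blk + c->e[i - 1].len == blk) {
		c->e[i - 1].len += len;
		merged = 1;
	}
	if (i < c->n && c->e[i].fofs == end && c->e[i].blk == blk + len) {
		if (merged) {  /* new mapping bridges both neighbors */
			c->e[i - 1].len += c->e[i].len;
			remove_at(c, i);
		} else {
			c->e[i].fofs = fofs;
			c->e[i].blk = blk;
			c->e[i].len += len;
			merged = 1;
		}
	}
	if (!merged)
		insert_at(c, i, (struct extent){ fofs, blk, len });
}
```

Starting from a single extent covering file offsets 0..1023, remapping ten blocks in the middle splits it into three nodes, and restoring the original contiguous block addresses merges them back into one.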
      
      To measure the benefit, I added a function that reads the time stamp
      counter, as below:
      
      uint64_t rdtsc(void)
      {
      	uint32_t lo, hi;
      	__asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
      	return (uint64_t)hi << 32 | lo;
      }
      
      My test environment is: Ubuntu, Intel i7-3770, 16 GB memory, 256 GB Micron SSD.
      
      truncation path:	update extent cache from truncate_data_blocks_range
      non-truncation path:	update extent cache from other paths
      total:			all update paths
      
      a) Removing a 128MB file which has one extent node mapping the whole range
      of the file:
      1. dd if=/dev/zero of=/mnt/f2fs/128M bs=1M count=128
      2. sync
      3. rm /mnt/f2fs/128M
      
      Before:
      		total		count		average
      truncation:	7651022		32768		233.49
      
      Patched:
      		total		count		average
      truncation:	3321		33		100.64
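
The total, count and average columns are related by average = total / count. The sketch below shows one way such counters could be accumulated around each update; struct tsc_stat and its helpers are hypothetical names, and the start/end values are assumed to come from a cycle counter such as the rdtsc() helper quoted earlier in this message.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical accumulator for one update path (e.g. truncation). */
struct tsc_stat {
	uint64_t total;  /* sum of elapsed tsc over all updates */
	uint64_t count;  /* number of updates measured */
};

/* Record one update that started at tsc value start and finished at end. */
static void tsc_stat_add(struct tsc_stat *s, uint64_t start, uint64_t end)
{
	s->total += end - start;
	s->count++;
}

/* Average tsc per update; 0 when nothing has been recorded. */
static double tsc_stat_average(const struct tsc_stat *s)
{
	return s->count ? (double)s->total / (double)s->count : 0.0;
}
```

Plugging in the figures from test a) gives 7651022 / 32768 ≈ 233.49 before the patch and 3321 / 33 ≈ 100.64 after it.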
      
      b) fsstress:
      fsstress -d /mnt/f2fs -l 5 -n 100 -p 20
      Test times:		5 times.
      
      Before:
      		total		count		average
      truncation:	5812480.6	20911.6		277.95
      non-truncation:	7783845.6	13440.8		579.12
      total:		13596326.2	34352.4		395.79
      
      Patched:
      		total		count		average
      truncation:	1281283.0	3041.6		421.25
      non-truncation:	7355844.4	13662.8		538.38
      total:		8637127.4	16704.4		517.06
      
      1) For the updates in truncation path:
       - updating in batches clearly reduces both the total tsc and the update
         count;
       - however, a single batched update punches multiple extent nodes in a
         loop and so executes more operations, which is why the average tsc
         per update increases noticeably.
      2) For the updates in non-truncation path:
       - there is a small improvement, because when only the head or tail of
         an extent node needs updating, the new interface updates the info in
         the extent node directly, rather than removing the original extent
         node, updating it, and inserting the updated one into the cache as a
         new node.
      Signed-off-by: Chao Yu <chao2.yu@samsung.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
      19b2c30d
  6. 26 Aug, 2015 2 commits
    • T
      writeback: sync_inodes_sb() must write out I_DIRTY_TIME inodes and always call wait_sb_inodes() · 006a0973
      Tejun Heo committed
      e7972912 ("writeback: don't issue wb_writeback_work if clean")
      updated writeback path to avoid kicking writeback work items if there
      are no inodes to be written out; unfortunately, the avoidance logic
      was too aggressive and broke sync_inodes_sb().
      
      * sync_inodes_sb() must write out I_DIRTY_TIME inodes but I_DIRTY_TIME
        inodes don't contribute to bdi/wb_has_dirty_io() tests and were
        being skipped over.
      
      * inodes are taken off wb->b_dirty/io/more_io lists after writeback
        starts on them.  sync_inodes_sb() skipping wait_sb_inodes() when
        bdi_has_dirty_io() is false breaks it by making it return while
        writebacks are in-flight.
      
      This patch fixes the breakages by
      
      * Removing bdi_has_dirty_io() shortcut from bdi_split_work_to_wbs().
        The callers are already testing the condition.
      
      * Removing bdi_has_dirty_io() shortcut from sync_inodes_sb() so that
        it always calls into bdi_split_work_to_wbs() and wait_sb_inodes().
      
      * Making bdi_split_work_to_wbs() consider the b_dirty_time list for
        WB_SYNC_ALL writebacks.
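
The effect of the two breakages and of the fix can be illustrated with a toy model. This is not kernel code: struct toy_wb and the toy_* functions are hypothetical, inodes are reduced to counters, and "waiting" is modelled as the in-flight count dropping to zero.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, heavily simplified model of one writeback context. */
struct toy_wb {
	int nr_dirty;       /* inodes on the dirty lists */
	int nr_dirty_time;  /* inodes with only timestamps dirty */
	int nr_in_flight;   /* inodes whose writeback already started */
};

/* Mirrors the message: timestamp-only inodes do not count as dirty IO. */
static bool toy_has_dirty_io(const struct toy_wb *wb)
{
	return wb->nr_dirty > 0;
}

/* Start writeback; a full sync also takes the timestamp-only inodes. */
static void toy_queue_work(struct toy_wb *wb, bool sync_all)
{
	wb->nr_in_flight += wb->nr_dirty;
	wb->nr_dirty = 0;
	if (sync_all) {
		wb->nr_in_flight += wb->nr_dirty_time;
		wb->nr_dirty_time = 0;
	}
}

/* Wait until nothing is in flight any more. */
static void toy_wait(struct toy_wb *wb)
{
	wb->nr_in_flight = 0;
}

/* Behaviour described as broken: shortcut when there is no dirty IO. */
static void toy_sync_before(struct toy_wb *wb)
{
	if (!toy_has_dirty_io(wb))
		return;  /* skips timestamp-only inodes and the wait */
	toy_queue_work(wb, false);
	toy_wait(wb);
}

/* Behaviour after the fix: always queue (including dirty_time) and wait. */
static void toy_sync_after(struct toy_wb *wb)
{
	toy_queue_work(wb, true);
	toy_wait(wb);
}
```

With no inodes on the dirty lists but some timestamp-only and in-flight inodes, toy_sync_before() returns without touching either, while toy_sync_after() leaves all three counters at zero.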
      
      Kudos to Eryu, Dave and Jan for tracking down the issue.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Fixes: e7972912 ("writeback: don't issue wb_writeback_work if clean")
      Link: http://lkml.kernel.org/g/20150812101204.GE17933@dhcp-13-216.nay.redhat.com
      Reported-and-bisected-by: Eryu Guan <eguan@redhat.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: Ted Ts'o <tytso@google.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      006a0973
    • D
      dlm: fix lvb copy for user locks · b96f4650
      David Teigland committed
      For a userland lock request, the previous and current
      lock modes are used to decide when the lvb should be
      copied back to the user.  The wrong previous value was
      used, so that it always matched the current value.
      This caused the lvb to be copied back to the user in
      the wrong cases.
      Signed-off-by: David Teigland <teigland@redhat.com>
      b96f4650
  7. 25 Aug, 2015 6 commits
  8. 22 Aug, 2015 4 commits
    • C
      f2fs: lookup neighbor extent nodes for merging later · dac2ddef
      Chao Yu committed
      In __lookup_extent_tree_ret, we do not try to find neighbor nodes if we
      find the target node, so we lose the chance to merge the new mapping
      with an existing extent node later.
      
      As a result, the extent cache of an inode becomes fragmented after an
      existing file is overwritten; the number of extent nodes increases
      sharply in the following test case:
      
      dd if=/dev/zero of=/mnt/f2fs/4m bs=4K count=1024
      
      Extent Cache:
        - Hit Count: L1-1:0 L1-2:0 L2:0
        - Hit Ratio: 0% (0 / 3072)
        - Inner Struct Count: tree: 1, node: 1
      
      dd if=/dev/zero of=/mnt/f2fs/4m bs=4K count=1024 conv=notrunc
      
      Extent Cache:
        - Hit Count: L1-1:2048 L1-2:0 L2:0
        - Hit Ratio: 33% (2048 / 6144)
        - Inner Struct Count: tree: 1, node: 961
      
      This patch fixes this by looking up the neighbors of the target node so
      that they can be merged later.
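
A lookup that always reports the neighbors, whether or not the target is found, might look like the sketch below. It uses a sorted array of non-overlapping extents rather than the rbtree f2fs uses; struct extent and lookup_with_neighbors are hypothetical names for illustration only.

```c
#include <assert.h>

struct extent { unsigned fofs, len; };  /* covers [fofs, fofs + len) */

/*
 * Look up fofs in a sorted, non-overlapping array of extents.
 * Returns the index of the extent containing fofs, or -1 if none does.
 * *prev and *next always receive the indices of the nearest extents
 * entirely before and after fofs (-1 if there is none), even on a hit,
 * so the caller can still try to merge with them.
 */
static int lookup_with_neighbors(const struct extent *e, int n, unsigned fofs,
				 int *prev, int *next)
{
	int hit = -1;

	*prev = -1;
	*next = -1;
	for (int i = 0; i < n; i++) {
		if (e[i].fofs + e[i].len <= fofs) {
			*prev = i;   /* ends at or before fofs */
		} else if (e[i].fofs > fofs) {
			*next = i;   /* first extent starting after fofs */
			break;
		} else {
			hit = i;     /* contains fofs */
		}
	}
	return hit;
}
```

For extents {0,4}, {4,4}, {8,4}, a lookup of offset 5 hits index 1 and still returns indices 0 and 2 as neighbors, which is the information the later merge step needs.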
      Signed-off-by: Chao Yu <chao2.yu@samsung.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
      dac2ddef
    • C
      f2fs: split __insert_extent_tree_ret for readability · ef05e221
      Chao Yu committed
      This patch splits __insert_extent_tree_ret into __try_merge_extent_node &
      __insert_extent_tree for code readability.
      Signed-off-by: Chao Yu <chao2.yu@samsung.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
      ef05e221
    • C
      f2fs: kill dead code in __insert_extent_tree · a6f78345
      Chao Yu committed
      After commit 0f825ee6 ("f2fs: add new interfaces for extent tree"),
      f2fs_init_extent_tree becomes the only caller of __insert_extent_tree, and
      in f2fs_init_extent_tree, we will only insert extent node in an empty tree,
      so __try_{back,front}_merge in __insert_extent_tree will never be called.
      
      This patch removes this dead code and renames __insert_extent_tree to
      __init_extent_tree for readability.
      Signed-off-by: Chao Yu <chao2.yu@samsung.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
      a6f78345
    • C
      f2fs: adjust showing of extent cache stat · 029e13cc
      Chao Yu committed
      This patch replaces the total hit stat with an rbtree hit stat and
      adjusts how the extent cache stats are shown:
      
      Hit Count:
      L1-1: hit count for the largest node;
      L1-2: hit count for the last cached node;
      L2: hit count for extent nodes found by a lookup in the rbtree.
      
      Hit Ratio:
      ratio (hit count / total lookup count)
      
      Inner Struct Count:
      tree count, node count.
      
      Before:
      Extent Hit Ratio: 0 / 2
      
      Extent Tree Count: 3
      
      Extent Node Count: 2
      
      Patched:
      Extent Cache:
        - Hit Count: L1-1:4871 L1-2:2074 L2:208
        - Hit Ratio: 1% (7153 / 550751)
        - Inner Struct Count: tree: 26560, node: 11824
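
The ratio shown in the sample output is consistent with summing the three hit counters and dividing by the total lookup count. The helper below is a hypothetical sketch of that arithmetic, not the f2fs code; it assumes the percentage is truncated to an integer, which matches the 1% shown above.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Integer hit ratio in percent from the three hit counters and the total
 * lookup count. Returns 0 when nothing has been looked up yet.
 * Truncating (rather than rounding) is an assumption of this sketch.
 */
static unsigned hit_ratio_percent(uint64_t l1_largest, uint64_t l1_cached,
				  uint64_t l2_rbtree, uint64_t total_lookups)
{
	uint64_t hits = l1_largest + l1_cached + l2_rbtree;

	if (total_lookups == 0)
		return 0;
	return (unsigned)(hits * 100 / total_lookups);
}
```

For the patched output above, 4871 + 2074 + 208 = 7153 hits out of 550751 lookups, which truncates to 1%.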
      Signed-off-by: Chao Yu <chao2.yu@samsung.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
      029e13cc