1. 26 5月, 2016 12 次提交
    • I
      libceph: switch to calc_target(), part 1 · a66dd383
      Ilya Dryomov 提交于
      Replace __calc_request_pg() and most of __map_request() with
      calc_target() and start using req->r_t.
      
      ceph_osdc_build_request() however still encodes base_oid, because it's
      called before calc_target() is and target_oid is empty at that point in
      time; a printf in osdc_show() also shows base_oid.  This is fixed in
      "libceph: switch to calc_target(), part 2".
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      a66dd383
    • I
      libceph: introduce ceph_osd_request_target, calc_target() · 63244fa1
      Ilya Dryomov 提交于
      Introduce ceph_osd_request_target, containing all mapping-related
      fields of ceph_osd_request and calc_target() for calculating mappings
      and populating it.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      63244fa1
    • I
      libceph: pi->min_size, pi->last_force_request_resend · 04812acf
      Ilya Dryomov 提交于
      Add and decode pi->min_size and pi->last_force_request_resend.  These
      are going to be used by calc_target().
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      04812acf
    • I
      libceph: make pgid_cmp() global · f984cb76
      Ilya Dryomov 提交于
      calc_target() code is going to need to know how to compare PGs.  Take
      lhs and rhs pgid by const * while at it.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      f984cb76
    • I
      libceph: rename ceph_calc_pg_primary() · f81f1633
      Ilya Dryomov 提交于
      Rename ceph_calc_pg_primary() to ceph_pg_to_acting_primary() to
      emphasise that it returns acting primary.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      f81f1633
    • I
      libceph: ceph_osds, ceph_pg_to_up_acting_osds() · 6f3bfd45
      Ilya Dryomov 提交于
      Knowning just acting set isn't enough, we need to be able to record up
      set as well to detect interval changes.  This means returning (up[],
      up_len, up_primary, acting[], acting_len, acting_primary) and passing
      it around.  Introduce and switch to ceph_osds to help with that.
      
      Rename ceph_calc_pg_acting() to ceph_pg_to_up_acting_osds() and return
      both up and acting sets from it.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      6f3bfd45
    • I
      libceph: rename ceph_oloc_oid_to_pg() · d9591f5e
      Ilya Dryomov 提交于
      Rename ceph_oloc_oid_to_pg() to ceph_object_locator_to_pg().  Emphasise
      that returned is raw PG and return -ENOENT instead of -EIO if the pool
      doesn't exist.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      d9591f5e
    • I
      libceph: fix ceph_eversion encoding · 985c1673
      Ilya Dryomov 提交于
      eversion_t is version+epoch in userspace and is encoded in that order.
      ceph_eversion is defined as epoch+version in rados.h, yet we memcpy it
      in __send_request().  Reoder ceph_eversion fields.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      985c1673
    • I
      libceph: DEFINE_RB_FUNCS macro · fcd00b68
      Ilya Dryomov 提交于
      Given
      
          struct foo {
              u64 id;
              struct rb_node bar_node;
          };
      
      generate insert_bar(), erase_bar() and lookup_bar() functions with
      
          DEFINE_RB_FUNCS(bar, struct foo, id, bar_node)
      
      The key is assumed to be an integer (u64, int, etc), compared with
      < and >.  nodefld has to be initialized with RB_CLEAR_NODE().
      
      Start using it for MDS, MON and OSD requests and OSD sessions.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      fcd00b68
    • I
      libceph: nuke unused fields and functions · 0c0a8de1
      Ilya Dryomov 提交于
      Either unused or useless:
      
          osdmap->mkfs_epoch
          osd->o_marked_for_keepalive
          monc->num_generic_requests
          osdc->map_waiters
          osdc->last_requested_map
          osdc->timeout_tid
      
          osd_req_op_cls_response_data()
      
          osdmap_apply_incremental() @msgr arg
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      0c0a8de1
    • I
      libceph: variable-sized ceph_object_id · d30291b9
      Ilya Dryomov 提交于
      Currently ceph_object_id can hold object names of up to 100
      (CEPH_MAX_OID_NAME_LEN) characters.  This is enough for all use cases,
      expect one - long rbd image names:
      
      - a format 1 header is named "<imgname>.rbd"
      - an object that points to a format 2 header is named "rbd_id.<imgname>"
      
      We operate on these potentially long-named objects during rbd map, and,
      for format 1 images, during header refresh.  (A format 2 header name is
      a small system-generated string.)
      
      Lift this 100 character limit by making ceph_object_id be able to point
      to an externally-allocated string.  Apart from being able to work with
      almost arbitrarily-long named objects, this allows us to reduce the
      size of ceph_object_id from >100 bytes to 64 bytes.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      d30291b9
    • I
      libceph: move message allocation out of ceph_osdc_alloc_request() · 13d1ad16
      Ilya Dryomov 提交于
      The size of ->r_request and ->r_reply messages depends on the size of
      the object name (ceph_object_id), while the size of ceph_osd_request is
      fixed.  Move message allocation into a separate function that would
      have to be called after ceph_object_id and ceph_object_locator (which
      is also going to become variable in size with RADOS namespaces) have
      been filled in:
      
          req = ceph_osdc_alloc_request(...);
          <fill in req->r_base_oid>
          <fill in req->r_base_oloc>
          ceph_osdc_alloc_messages(req);
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      13d1ad16
  2. 13 5月, 2016 1 次提交
    • A
      mm: thp: calculate the mapcount correctly for THP pages during WP faults · 6d0a07ed
      Andrea Arcangeli 提交于
      This will provide fully accuracy to the mapcount calculation in the
      write protect faults, so page pinning will not get broken by false
      positive copy-on-writes.
      
      total_mapcount() isn't the right calculation needed in
      reuse_swap_page(), so this introduces a page_trans_huge_mapcount()
      that is effectively the full accurate return value for page_mapcount()
      if dealing with Transparent Hugepages, however we only use the
      page_trans_huge_mapcount() during COW faults where it strictly needed,
      due to its higher runtime cost.
      
      This also provide at practical zero cost the total_mapcount
      information which is needed to know if we can still relocate the page
      anon_vma to the local vma. If page_trans_huge_mapcount() returns 1 we
      can reuse the page no matter if it's a pte or a pmd_trans_huge
      triggering the fault, but we can only relocate the page anon_vma to
      the local vma->anon_vma if we're sure it's only this "vma" mapping the
      whole THP physical range.
      
      Kirill A. Shutemov discovered the problem with moving the page
      anon_vma to the local vma->anon_vma in a previous version of this
      patch and another problem in the way page_move_anon_rmap() was called.
      
      Andrew Morton discovered that CONFIG_SWAP=n wouldn't build in a
      previous version, because reuse_swap_page must be a macro to call
      page_trans_huge_mapcount from swap.h, so this uses a macro again
      instead of an inline function. With this change at least it's a less
      dangerous usage than it was before, because "page" is used only once
      now, while with the previous code reuse_swap_page(page++) would have
      called page_mapcount on page+1 and it would have increased page twice
      instead of just once.
      
      Dean Luick noticed an uninitialized variable that could result in a
      rmap inefficiency for the non-THP case in a previous version.
      
      Mike Marciniszyn said:
      
      : Our RDMA tests are seeing an issue with memory locking that bisects to
      : commit 61f5d698 ("mm: re-enable THP")
      :
      : The test program registers two rather large MRs (512M) and RDMA
      : writes data to a passive peer using the first and RDMA reads it back
      : into the second MR and compares that data.  The sizes are chosen randomly
      : between 0 and 1024 bytes.
      :
      : The test will get through a few (<= 4 iterations) and then gets a
      : compare error.
      :
      : Tracing indicates the kernel logical addresses associated with the individual
      : pages at registration ARE correct , the data in the "RDMA read response only"
      : packets ARE correct.
      :
      : The "corruption" occurs when the packet crosse two pages that are not physically
      : contiguous.   The second page reads back as zero in the program.
      :
      : It looks like the user VA at the point of the compare error no longer points to
      : the same physical address as was registered.
      :
      : This patch totally resolves the issue!
      
      Link: http://lkml.kernel.org/r/1462547040-1737-2-git-send-email-aarcange@redhat.comSigned-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: N"Kirill A. Shutemov" <kirill@shutemov.name>
      Reviewed-by: NDean Luick <dean.luick@intel.com>
      Tested-by: NAlex Williamson <alex.williamson@redhat.com>
      Tested-by: NMike Marciniszyn <mike.marciniszyn@intel.com>
      Tested-by: NJosh Collier <josh.d.collier@intel.com>
      Cc: Marc Haber <mh+linux-kernel@zugschlus.de>
      Cc: <stable@vger.kernel.org>	[4.5]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6d0a07ed
  3. 11 5月, 2016 2 次提交
  4. 10 5月, 2016 4 次提交
    • J
      export tc ife uapi header · d99079e2
      Jamal Hadi Salim 提交于
      Signed-off-by: NJamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d99079e2
    • M
      uapi glibc compat: fix compile errors when glibc net/if.h included before linux/if.h · 4a91cb61
      Mikko Rapeli 提交于
      glibc's net/if.h contains copies of definitions from linux/if.h and these
      conflict and cause build failures if both files are included by application
      source code. Changes in uapi headers, which fixed header file dependencies to
      include linux/if.h when it was needed, e.g. commit 1ffad83d, made the
      net/if.h and linux/if.h incompatibilities visible as build failures for
      userspace applications like iproute2 and xtables-addons.
      
      This patch fixes compile errors when glibc net/if.h is included before
      linux/if.h:
      
      ./linux/if.h:99:21: error: redeclaration of enumerator ‘IFF_NOARP’
      ./linux/if.h:98:23: error: redeclaration of enumerator ‘IFF_RUNNING’
      ./linux/if.h:97:26: error: redeclaration of enumerator ‘IFF_NOTRAILERS’
      ./linux/if.h:96:27: error: redeclaration of enumerator ‘IFF_POINTOPOINT’
      ./linux/if.h:95:24: error: redeclaration of enumerator ‘IFF_LOOPBACK’
      ./linux/if.h:94:21: error: redeclaration of enumerator ‘IFF_DEBUG’
      ./linux/if.h:93:25: error: redeclaration of enumerator ‘IFF_BROADCAST’
      ./linux/if.h:92:19: error: redeclaration of enumerator ‘IFF_UP’
      ./linux/if.h:252:8: error: redefinition of ‘struct ifconf’
      ./linux/if.h:203:8: error: redefinition of ‘struct ifreq’
      ./linux/if.h:169:8: error: redefinition of ‘struct ifmap’
      ./linux/if.h:107:23: error: redeclaration of enumerator ‘IFF_DYNAMIC’
      ./linux/if.h:106:25: error: redeclaration of enumerator ‘IFF_AUTOMEDIA’
      ./linux/if.h:105:23: error: redeclaration of enumerator ‘IFF_PORTSEL’
      ./linux/if.h:104:25: error: redeclaration of enumerator ‘IFF_MULTICAST’
      ./linux/if.h:103:21: error: redeclaration of enumerator ‘IFF_SLAVE’
      ./linux/if.h:102:22: error: redeclaration of enumerator ‘IFF_MASTER’
      ./linux/if.h:101:24: error: redeclaration of enumerator ‘IFF_ALLMULTI’
      ./linux/if.h:100:23: error: redeclaration of enumerator ‘IFF_PROMISC’
      
      The cases where linux/if.h is included before net/if.h need a similar fix in
      the glibc side, or the order of include files can be changed userspace
      code as a workaround.
      
      This change was tested in x86 userspace on Debian unstable with
      scripts/headers_compile_test.sh:
      
      $ make headers_install && \
        cd usr/include && ../../scripts/headers_compile_test.sh -l -k
      ...
      cc -Wall -c -nostdinc -I /usr/lib/gcc/i586-linux-gnu/5/include -I /usr/lib/gcc/i586-linux-gnu/5/include-fixed -I . -I /home/mcfrisk/src/linux-2.6/usr/headers_compile_test_include.2uX2zH -I /home/mcfrisk/src/linux-2.6/usr/headers_compile_test_include.2uX2zH/i586-linux-gnu -o /dev/null ./linux/if.h_libc_before_kernel.h
      PASSED libc before kernel test: ./linux/if.h
      Reported-by: NJan Engelhardt <jengelh@inai.de>
      Reported-by: NJosh Boyer <jwboyer@fedoraproject.org>
      Reported-by: NStephen Hemminger <shemming@brocade.com>
      Reported-by: NWaldemar Brodkorb <mail@waldemar-brodkorb.de>
      Cc: Gabriel Laskar <gabriel@lse.epita.fr>
      Signed-off-by: NMikko Rapeli <mikko.rapeli@iki.fi>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4a91cb61
    • J
      compiler-gcc: require gcc 4.8 for powerpc __builtin_bswap16() · 8634de6d
      Josh Poimboeuf 提交于
      gcc support for __builtin_bswap16() was supposedly added for powerpc in
      gcc 4.6, and was then later added for other architectures in gcc 4.8.
      
      However, Stephen Rothwell reported that attempting to use it on powerpc
      in gcc 4.6 fails with:
      
        lib/vsprintf.c:160:2: error: initializer element is not constant
        lib/vsprintf.c:160:2: error: (near initialization for 'decpair[0]')
        lib/vsprintf.c:160:2: error: initializer element is not constant
        lib/vsprintf.c:160:2: error: (near initialization for 'decpair[1]')
        ...
      
      I'm not entirely sure what those errors mean, but I don't see them on
      gcc 4.8.  So let's consider gcc 4.8 to be the official starting point
      for __builtin_bswap16().
      
      Arnd Bergmann adds:
       "I found the commit in gcc-4.8 that replaced the powerpc-specific
        implementation of __builtin_bswap16 with an architecture-independent
        one.  Apparently the powerpc version (gcc-4.6 and 4.7) just mapped to
        the lhbrx/sthbrx instructions, so it ended up not being a constant,
        though the intent of the patch was mainly to add support for the
        builtin to x86:
      
          https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52624
      
        has the patch that went into gcc-4.8 and more information."
      
      Fixes: 7322dd75 ("byteswap: try to avoid __builtin_constant_p gcc bug")
      Reported-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Tested-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Acked-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NJosh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8634de6d
    • S
      cgroup, kernfs: make mountinfo show properly scoped path for cgroup namespaces · 4f41fc59
      Serge E. Hallyn 提交于
      Patch summary:
      
      When showing a cgroupfs entry in mountinfo, show the path of the mount
      root dentry relative to the reader's cgroup namespace root.
      
      Short explanation (courtesy of mkerrisk):
      
      If we create a new cgroup namespace, then we want both /proc/self/cgroup
      and /proc/self/mountinfo to show cgroup paths that are correctly
      virtualized with respect to the cgroup mount point.  Previous to this
      patch, /proc/self/cgroup shows the right info, but /proc/self/mountinfo
      does not.
      
      Long version:
      
      When a uid 0 task which is in freezer cgroup /a/b, unshares a new cgroup
      namespace, and then mounts a new instance of the freezer cgroup, the new
      mount will be rooted at /a/b.  The root dentry field of the mountinfo
      entry will show '/a/b'.
      
       cat > /tmp/do1 << EOF
       mount -t cgroup -o freezer freezer /mnt
       grep freezer /proc/self/mountinfo
       EOF
      
       unshare -Gm  bash /tmp/do1
       > 330 160 0:34 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,freezer
       > 355 133 0:34 /a/b /mnt rw,relatime - cgroup freezer rw,freezer
      
      The task's freezer cgroup entry in /proc/self/cgroup will simply show
      '/':
      
       grep freezer /proc/self/cgroup
       9:freezer:/
      
      If instead the same task simply bind mounts the /a/b cgroup directory,
      the resulting mountinfo entry will again show /a/b for the dentry root.
      However in this case the task will find its own cgroup at /mnt/a/b,
      not at /mnt:
      
       mount --bind /sys/fs/cgroup/freezer/a/b /mnt
       130 25 0:34 /a/b /mnt rw,nosuid,nodev,noexec,relatime shared:21 - cgroup cgroup rw,freezer
      
      In other words, there is no way for the task to know, based on what is
      in mountinfo, which cgroup directory is its own.
      
      Example (by mkerrisk):
      
      First, a little script to save some typing and verbiage:
      
      echo -e "\t/proc/self/cgroup:\t$(cat /proc/self/cgroup | grep freezer)"
      cat /proc/self/mountinfo | grep freezer |
              awk '{print "\tmountinfo:\t\t" $4 "\t" $5}'
      
      Create cgroup, place this shell into the cgroup, and look at the state
      of the /proc files:
      
      2653
      2653                         # Our shell
      14254                        # cat(1)
              /proc/self/cgroup:      10:freezer:/a/b
              mountinfo:              /       /sys/fs/cgroup/freezer
      
      Create a shell in new cgroup and mount namespaces. The act of creating
      a new cgroup namespace causes the process's current cgroups directories
      to become its cgroup root directories. (Here, I'm using my own version
      of the "unshare" utility, which takes the same options as the util-linux
      version):
      
      Look at the state of the /proc files:
      
              /proc/self/cgroup:      10:freezer:/
              mountinfo:              /       /sys/fs/cgroup/freezer
      
      The third entry in /proc/self/cgroup (the pathname of the cgroup inside
      the hierarchy) is correctly virtualized w.r.t. the cgroup namespace, which
      is rooted at /a/b in the outer namespace.
      
      However, the info in /proc/self/mountinfo is not for this cgroup
      namespace, since we are seeing a duplicate of the mount from the
      old mount namespace, and the info there does not correspond to the
      new cgroup namespace. However, trying to create a new mount still
      doesn't show us the right information in mountinfo:
      
                                            # propagating to other mountns
              /proc/self/cgroup:      7:freezer:/
              mountinfo:              /a/b    /mnt/freezer
      
      The act of creating a new cgroup namespace caused the process's
      current freezer directory, "/a/b", to become its cgroup freezer root
      directory. In other words, the pathname directory of the directory
      within the newly mounted cgroup filesystem should be "/",
      but mountinfo wrongly shows us "/a/b". The consequence of this is
      that the process in the cgroup namespace cannot correctly construct
      the pathname of its cgroup root directory from the information in
      /proc/PID/mountinfo.
      
      With this patch, the dentry root field in mountinfo is shown relative
      to the reader's cgroup namespace.  So the same steps as above:
      
              /proc/self/cgroup:      10:freezer:/a/b
              mountinfo:              /       /sys/fs/cgroup/freezer
              /proc/self/cgroup:      10:freezer:/
              mountinfo:              /../..  /sys/fs/cgroup/freezer
              /proc/self/cgroup:      10:freezer:/
              mountinfo:              /       /mnt/freezer
      
      cgroup.clone_children  freezer.parent_freezing  freezer.state      tasks
      cgroup.procs           freezer.self_freezing    notify_on_release
      3164
      2653                   # First shell that placed in this cgroup
      3164                   # Shell started by 'unshare'
      14197                  # cat(1)
      Signed-off-by: NSerge Hallyn <serge.hallyn@ubuntu.com>
      Tested-by: NMichael Kerrisk <mtk.manpages@gmail.com>
      Acked-by: NMichael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      4f41fc59
  5. 09 5月, 2016 1 次提交
  6. 07 5月, 2016 2 次提交
  7. 06 5月, 2016 4 次提交
    • A
      byteswap: try to avoid __builtin_constant_p gcc bug · 7322dd75
      Arnd Bergmann 提交于
      This is another attempt to avoid a regression in wwn_to_u64() after that
      started using get_unaligned_be64(), which in turn ran into a bug on
      gcc-4.9 through 6.1.
      
      The regression got introduced due to the combination of two separate
      workarounds (commits e3bde956: "include/linux/unaligned: force
      inlining of byteswap operations" and ef3fb242: "scsi: fc: use
      get/put_unaligned64 for wwn access") that each try to sidestep distinct
      problems with gcc behavior (code growth and increased stack usage).
      
      Unfortunately after both have been applied, a more serious gcc bug has
      been uncovered, leading to incorrect object code that discards part of a
      function and causes undefined behavior.
      
      As part of this problem is how __builtin_constant_p gets evaluated on an
      argument passed by reference into an inline function, this avoids the
      use of __builtin_constant_p() for all architectures that set
      CONFIG_ARCH_USE_BUILTIN_BSWAP.  Most architectures do not set
      ARCH_SUPPORTS_OPTIMIZED_INLINING, which means they probably do not
      suffer from the problem in the qla2xxx driver, but they might still run
      into it elsewhere.
      
      Both of the original workarounds were only merged in the 4.6 kernel, and
      the bug that is fixed by this patch should only appear if both are
      there, so we probably don't need to backport the fix.  On the other
      hand, it works by simplifying the code path and should not have any
      negative effects.
      
      [arnd@arndb.de: fix older gcc warnings]
        (http://lkml.kernel.org/r/12243652.bxSxEgjgfk@wuerfel)
      Link: https://lkml.org/lkml/headers/2016/4/12/1103
      Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66122
      Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70232
      Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70646
      Fixes: e3bde956 ("include/linux/unaligned: force inlining of byteswap operations")
      Fixes: ef3fb242 ("scsi: fc: use get/put_unaligned64 for wwn access")
      Link: http://lkml.kernel.org/r/1780465.XdtPJpi8Tt@wuerfelSigned-off-by: NArnd Bergmann <arnd@arndb.de>
      Reviewed-by: NJosh Poimboeuf <jpoimboe@redhat.com>
      Tested-by: Josh Poimboeuf <jpoimboe@redhat.com> # on gcc-5.3
      Tested-by: NQuinn Tran <quinn.tran@qlogic.com>
      Cc: Martin Jambor <mjambor@suse.cz>
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
      Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Thomas Graf <tgraf@suug.ch>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Himanshu Madhani <himanshu.madhani@qlogic.com>
      Cc: Jan Hubicka <hubicka@ucw.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7322dd75
    • A
      mm: thp: kvm: fix memory corruption in KVM with THP enabled · 127393fb
      Andrea Arcangeli 提交于
      After the THP refcounting change, obtaining a compound pages from
      get_user_pages() no longer allows us to assume the entire compound page
      is immediately mappable from a secondary MMU.
      
      A secondary MMU doesn't want to call get_user_pages() more than once for
      each compound page, in order to know if it can map the whole compound
      page.  So a secondary MMU needs to know from a single get_user_pages()
      invocation when it can map immediately the entire compound page to avoid
      a flood of unnecessary secondary MMU faults and spurious
      atomic_inc()/atomic_dec() (pages don't have to be pinned by MMU notifier
      users).
      
      Ideally instead of the page->_mapcount < 1 check, get_user_pages()
      should return the granularity of the "page" mapping in the "mm" passed
      to get_user_pages().  However it's non trivial change to pass the "pmd"
      status belonging to the "mm" walked by get_user_pages up the stack (up
      to the caller of get_user_pages).  So the fix just checks if there is
      not a single pte mapping on the page returned by get_user_pages, and in
      turn if the caller can assume that the whole compound page is mapped in
      the current "mm" (in a pmd_trans_huge()).  In such case the entire
      compound page is safe to map into the secondary MMU without additional
      get_user_pages() calls on the surrounding tail/head pages.  In addition
      of being faster, not having to run other get_user_pages() calls also
      reduces the memory footprint of the secondary MMU fault in case the pmd
      split happened as result of memory pressure.
      
      Without this fix after a MADV_DONTNEED (like invoked by QEMU during
      postcopy live migration or balloning) or after generic swapping (with a
      failure in split_huge_page() that would only result in pmd splitting and
      not a physical page split), KVM would map the whole compound page into
      the shadow pagetables, despite regular faults or userfaults (like
      UFFDIO_COPY) may map regular pages into the primary MMU as result of the
      pte faults, leading to the guest mode and userland mode going out of
      sync and not working on the same memory at all times.
      
      Any other secondary MMU notifier manager (KVM is just one of the many
      MMU notifier users) will need the same information if it doesn't want to
      run a flood of get_user_pages_fast and it can support multiple
      granularity in the secondary MMU mappings, so I think it is justified to
      be exposed not just to KVM.
      
      The other option would be to move transparent_hugepage_adjust to
      mm/huge_memory.c but that currently has all kind of KVM data structures
      in it, so it's definitely not a cut-and-paste work, so I couldn't do a
      fix as cleaner as this one for 4.6.
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: "Li, Liang Z" <liang.z.li@intel.com>
      Cc: Amit Shah <amit.shah@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      127393fb
    • A
      rapidio/mport_cdev: fix uapi type definitions · 4e1016da
      Alexandre Bounine 提交于
      Fix problems in uapi definitions reported by Gabriel Laskar: (see
      https://lkml.org/lkml/2016/4/5/205 for details)
      
       - move public header file rio_mport_cdev.h to include/uapi/linux directory
       - change types in data structures passed as IOCTL parameters
       - improve parameter checking in some IOCTL service routines
      Signed-off-by: NAlexandre Bounine <alexandre.bounine@idt.com>
      Reported-by: NGabriel Laskar <gabriel@lse.epita.fr>
      Tested-by: NBarry Wood <barry.wood@idt.com>
      Cc: Gabriel Laskar <gabriel@lse.epita.fr>
      Cc: Matt Porter <mporter@kernel.crashing.org>
      Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
      Cc: Andre van Herk <andre.van.herk@prodrive-technologies.com>
      Cc: Barry Wood <barry.wood@idt.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4e1016da
    • J
      mm: memcontrol: let v2 cgroups follow changes in system swappiness · 4550c4e1
      Johannes Weiner 提交于
      Cgroup2 currently doesn't have a per-cgroup swappiness setting.  We
      might want to add one later - that's a different discussion - but until
      we do, the cgroups should always follow the system setting.  Otherwise
      it will be unchangeably set to whatever the ancestor inherited from the
      system setting at the time of cgroup creation.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NVladimir Davydov <vdavydov@virtuozzo.com>
      Cc: <stable@vger.kernel.org>	[4.5]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4550c4e1
  8. 05 5月, 2016 1 次提交
  9. 04 5月, 2016 1 次提交
  10. 03 5月, 2016 1 次提交
    • L
      Minimal fix-up of bad hashing behavior of hash_64() · 689de1d6
      Linus Torvalds 提交于
      This is a fairly minimal fixup to the horribly bad behavior of hash_64()
      with certain input patterns.
      
      In particular, because the multiplicative value used for the 64-bit hash
      was intentionally bit-sparse (so that the multiply could be done with
      shifts and adds on architectures without hardware multipliers), some
      bits did not get spread out very much.  In particular, certain fairly
      common bit ranges in the input (roughly bits 12-20: commonly with the
      most information in them when you hash things like byte offsets in files
      or memory that have block factors that mean that the low bits are often
      zero) would not necessarily show up much in the result.
      
      There's a bigger patch-series brewing to fix up things more completely,
      but this is the fairly minimal fix for the 64-bit hashing problem.  It
      simply picks a much better constant multiplier, spreading the bits out a
      lot better.
      
      NOTE! For 32-bit architectures, the bad old hash_64() remains the same
      for now, since 64-bit multiplies are expensive.  The bigger hashing
      cleanup will replace the 32-bit case with something better.
      
      The new constants were picked by George Spelvin who wrote that bigger
      cleanup series.  I just picked out the constants and part of the comment
      from that series.
      
      Cc: stable@vger.kernel.org
      Cc: George Spelvin <linux@horizon.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      689de1d6
  11. 02 5月, 2016 1 次提交
    • T
      net: Implement net_dbg_ratelimited() for CONFIG_DYNAMIC_DEBUG case · 2c94b537
      Tim Bingham 提交于
      Prior to commit d92cff89 ("net_dbg_ratelimited: turn into no-op
      when !DEBUG") the implementation of net_dbg_ratelimited() was buggy
      for both the DEBUG and CONFIG_DYNAMIC_DEBUG cases.
      
      The bug was that net_ratelimit() was being called and, despite
      returning true, nothing was being printed to the console. This
      resulted in messages like the following -
      
      "net_ratelimit: %d callbacks suppressed"
      
      with no other output nearby.
      
      After commit d92cff89 ("net_dbg_ratelimited: turn into no-op when
      !DEBUG") the bug is fixed for the DEBUG case. However, there's no
      output at all for CONFIG_DYNAMIC_DEBUG case.
      
      This patch restores debug output (if enabled) for the
      CONFIG_DYNAMIC_DEBUG case.
      
      Add a definition of net_dbg_ratelimited() for the CONFIG_DYNAMIC_DEBUG
      case. The implementation takes care to check that dynamic debugging is
      enabled before calling net_ratelimit().
      
      Fixes: d92cff89 ("net_dbg_ratelimited: turn into no-op when !DEBUG")
      Signed-off-by: NTim Bingham <tbingham@akamai.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2c94b537
  12. 29 4月, 2016 6 次提交
    • G
      numa: fix /proc/<pid>/numa_maps for THP · 28093f9f
      Gerald Schaefer 提交于
      In gather_pte_stats() a THP pmd is cast into a pte, which is wrong
      because the layouts may differ depending on the architecture.  On s390
      this will lead to inaccurate numa_maps accounting in /proc because of
      misguided pte_present() and pte_dirty() checks on the fake pte.
      
      On other architectures pte_present() and pte_dirty() may work by chance,
      but there may be an issue with direct-access (dax) mappings w/o
      underlying struct pages when HAVE_PTE_SPECIAL is set and THP is
      available.  In vm_normal_page() the fake pte will be checked with
      pte_special() and because there is no "special" bit in a pmd, this will
      always return false and the VM_PFNMAP | VM_MIXEDMAP checking will be
      skipped.  On dax mappings w/o struct pages, an invalid struct page
      pointer would then be returned that can crash the kernel.
      
      This patch fixes the numa_maps THP handling by introducing new "_pmd"
      variants of the can_gather_numa_stats() and vm_normal_page() functions.
      Signed-off-by: NGerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com>
      Cc: <stable@vger.kernel.org>	[4.3+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      28093f9f
    • K
      thp: keep huge zero page pinned until tlb flush · aa88b68c
      Kirill A. Shutemov 提交于
      Andrea has found[1] a race condition on MMU-gather based TLB flush vs
      split_huge_page() or shrinker which frees huge zero under us (patch 1/2
      and 2/2 respectively).
      
      With new THP refcounting, we don't need patch 1/2: mmu_gather keeps the
      page pinned until flush is complete and the pin prevents the page from
      being split under us.
      
      We still need patch 2/2.  This is simplified version of Andrea's patch.
      We don't need fancy encoding.
      
      [1] http://lkml.kernel.org/r/1447938052-22165-1-git-send-email-aarcange@redhat.comSigned-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: NAndrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: NAndrea Arcangeli <aarcange@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      aa88b68c
    • S
      mm: exclude HugeTLB pages from THP page_mapped() logic · 66ee95d1
      Steve Capper 提交于
      HugeTLB pages cannot be split, so we use the compound_mapcount to track
      rmaps.
      
      Currently page_mapped() will check the compound_mapcount, but will also
      go through the constituent pages of a THP compound page and query the
      individual _mapcount's too.
      
      Unfortunately, page_mapped() does not distinguish between HugeTLB and
      THP compound pages and assumes that a compound page always needs to have
      HPAGE_PMD_NR pages querying.
      
      For most cases when dealing with HugeTLB this is just inefficient, but
      for scenarios where the HugeTLB page size is less than the pmd block
      size (e.g.  when using contiguous bit on ARM) this can lead to crashes.
      
      This patch adjusts the page_mapped function such that we skip the
      unnecessary THP reference checks for HugeTLB pages.
      
      Fixes: e1534ae9 ("mm: differentiate page_mapped() from page_mapcount() for compound pages")
      Signed-off-by: NSteve Capper <steve.capper@arm.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      66ee95d1
    • A
      bpf: fix refcnt overflow · 92117d84
      Alexei Starovoitov 提交于
      On a system with >32Gbyte of phyiscal memory and infinite RLIMIT_MEMLOCK,
      the malicious application may overflow 32-bit bpf program refcnt.
      It's also possible to overflow map refcnt on 1Tb system.
      Impose 32k hard limit which means that the same bpf program or
      map cannot be shared by more than 32k processes.
      
      Fixes: 1be7f75d ("bpf: enable non-root eBPF programs")
      Reported-by: NJann Horn <jannh@google.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      92117d84
    • M
      net: fix net_gso_ok for new GSO types. · 7b748340
      Marcelo Ricardo Leitner 提交于
      Fix casting in net_gso_ok. Otherwise the shift on
      gso_type << NETIF_F_GSO_SHIFT may hit the 32th bit and make it look like
      a INT_MIN, which is then promoted from signed to uint64 which is
      0xffffffff80000000, resulting in wrong behavior when it is and'ed with
      the feature itself, as in:
      
      This test app:
      #include <stdio.h>
      #include <stdint.h>
      
      int main(int argc, char **argv)
      {
      	uint64_t feature1;
      	uint64_t feature2;
      	int gso_type = 1 << 15;
      
      	feature1 = gso_type << 16;
      	feature2 = (uint64_t)gso_type << 16;
      	printf("%lx %lx\n", feature1, feature2);
      
      	return 0;
      }
      
      Gives:
      ffffffff80000000 80000000
      
      So that this:
         return (features & feature) == feature;
      Actually works on more bits than expected and invalid ones.
      
      Fix is to promote it earlier.
      
      Issue noted while rebasing SCTP GSO patch but posting separetely as
      someone else may experience this meanwhile.
      Signed-off-by: NMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7b748340
    • J
      IB/security: Restrict use of the write() interface · e6bd18f5
      Jason Gunthorpe 提交于
      The drivers/infiniband stack uses write() as a replacement for
      bi-directional ioctl().  This is not safe. There are ways to
      trigger write calls that result in the return structure that
      is normally written to user space being shunted off to user
      specified kernel memory instead.
      
      For the immediate repair, detect and deny suspicious accesses to
      the write API.
      
      For long term, update the user space libraries and the kernel API
      to something that doesn't present the same security vulnerabilities
      (likely a structured ioctl() interface).
      
      The impacted uAPI interfaces are generally only available if
      hardware from drivers/infiniband is installed in the system.
      Reported-by: NJann Horn <jann@thejh.net>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NJason Gunthorpe <jgunthorpe@obsidianresearch.com>
      [ Expanded check to all known write() entry points ]
      Cc: stable@vger.kernel.org
      Signed-off-by: NDoug Ledford <dledford@redhat.com>
      e6bd18f5
  13. 28 4月, 2016 2 次提交
  14. 27 4月, 2016 1 次提交
    • L
      devpts: more pty driver interface cleanups · 8ead9dd5
      Linus Torvalds 提交于
      This is more prep-work for the upcoming pty changes.  Still just code
      cleanup with no actual semantic changes.
      
      This removes a bunch pointless complexity by just having the slave pty
      side remember the dentry associated with the devpts slave rather than
      the inode.  That allows us to remove all the "look up the dentry" code
      for when we want to remove it again.
      
      Together with moving the tty pointer from "inode->i_private" to
      "dentry->d_fsdata" and getting rid of pointless inode locking, this
      removes about 30 lines of code.  Not only is the end result smaller,
      it's simpler and easier to understand.
      
      The old code, for example, depended on the d_find_alias() to not just
      find the dentry, but also to check that it is still hashed, which in
      turn validated the tty pointer in the inode.
      
      That is a _very_ roundabout way to say "invalidate the cached tty
      pointer when the dentry is removed".
      
      The new code just does
      
      	dentry->d_fsdata = NULL;
      
      in devpts_pty_kill() instead, invalidating the tty pointer rather more
      directly and obviously.  Don't do something complex and subtle when the
      obvious straightforward approach will do.
      
      The rest of the patch (ie apart from code deletion and the above tty
      pointer clearing) is just switching the calling convention to pass the
      dentry or file pointer around instead of the inode.
      
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Peter Anvin <hpa@zytor.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Peter Hurley <peter@hurleysoftware.com>
      Cc: Serge Hallyn <serge.hallyn@ubuntu.com>
      Cc: Willy Tarreau <w@1wt.eu>
      Cc: Aurelien Jarno <aurelien@aurel32.net>
      Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
      Cc: Jann Horn <jann@thejh.net>
      Cc: Greg KH <greg@kroah.com>
      Cc: Jiri Slaby <jslaby@suse.com>
      Cc: Florian Weimer <fw@deneb.enyo.de>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8ead9dd5
  15. 26 4月, 2016 1 次提交