1. 14 9月, 2011 1 次提交
  2. 13 9月, 2011 1 次提交
  3. 08 9月, 2011 9 次提交
    • P
      posix-cpu-timers: Cure SMP accounting oddities · e8abccb7
      Peter Zijlstra 提交于
      David reported:
      
        Attached below is a watered-down version of rt/tst-cpuclock2.c from
        GLIBC.  Just build it with "gcc -o test test.c -lpthread -lrt" or
        similar.
      
        Run it several times, and you will see cases where the main thread
        will measure a process clock difference before and after the nanosleep
        which is smaller than the cpu-burner thread's individual thread clock
        difference.  This doesn't make any sense since the cpu-burner thread
        is part of the top-level process's thread group.
      
        I've reproduced this on both x86-64 and sparc64 (using both 32-bit and
        64-bit binaries).
      
        For example:
      
        [davem@boricha build-x86_64-linux]$ ./test
        process: before(0.001221967) after(0.498624371) diff(497402404)
        thread:  before(0.000081692) after(0.498316431) diff(498234739)
        self:    before(0.001223521) after(0.001240219) diff(16698)
        [davem@boricha build-x86_64-linux]$
      
        The diff of 'process' should always be >= the diff of 'thread'.
      
        I make sure to wrap the 'thread' clock measurements the most tightly
        around the nanosleep() call, and that the 'process' clock measurements
        are the outer-most ones.
      
        ---
        #include <unistd.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <time.h>
        #include <fcntl.h>
        #include <string.h>
        #include <errno.h>
        #include <pthread.h>
      
        static pthread_barrier_t barrier;
      
        static void *chew_cpu(void *arg)
        {
      	  pthread_barrier_wait(&barrier);
      	  while (1)
      		  __asm__ __volatile__("" : : : "memory");
      	  return NULL;
        }
      
        int main(void)
        {
      	  clockid_t process_clock, my_thread_clock, th_clock;
      	  struct timespec process_before, process_after;
      	  struct timespec me_before, me_after;
      	  struct timespec th_before, th_after;
      	  struct timespec sleeptime;
      	  unsigned long diff;
      	  pthread_t th;
      	  int err;
      
      	  err = clock_getcpuclockid(0, &process_clock);
      	  if (err)
      		  return 1;
      
      	  err = pthread_getcpuclockid(pthread_self(), &my_thread_clock);
      	  if (err)
      		  return 1;
      
      	  pthread_barrier_init(&barrier, NULL, 2);
      	  err = pthread_create(&th, NULL, chew_cpu, NULL);
      	  if (err)
      		  return 1;
      
      	  err = pthread_getcpuclockid(th, &th_clock);
      	  if (err)
      		  return 1;
      
      	  pthread_barrier_wait(&barrier);
      
      	  err = clock_gettime(process_clock, &process_before);
      	  if (err)
      		  return 1;
      
      	  err = clock_gettime(my_thread_clock, &me_before);
      	  if (err)
      		  return 1;
      
      	  err = clock_gettime(th_clock, &th_before);
      	  if (err)
      		  return 1;
      
      	  sleeptime.tv_sec = 0;
      	  sleeptime.tv_nsec = 500000000;
      	  nanosleep(&sleeptime, NULL);
      
      	  err = clock_gettime(th_clock, &th_after);
      	  if (err)
      		  return 1;
      
      	  err = clock_gettime(my_thread_clock, &me_after);
      	  if (err)
      		  return 1;
      
      	  err = clock_gettime(process_clock, &process_after);
      	  if (err)
      		  return 1;
      
      	  diff = process_after.tv_nsec - process_before.tv_nsec;
      	  printf("process: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
      		 process_before.tv_sec, process_before.tv_nsec,
      		 process_after.tv_sec, process_after.tv_nsec, diff);
      	  diff = th_after.tv_nsec - th_before.tv_nsec;
      	  printf("thread:  before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
      		 th_before.tv_sec, th_before.tv_nsec,
      		 th_after.tv_sec, th_after.tv_nsec, diff);
      	  diff = me_after.tv_nsec - me_before.tv_nsec;
      	  printf("self:    before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
      		 me_before.tv_sec, me_before.tv_nsec,
      		 me_after.tv_sec, me_after.tv_nsec, diff);
      
      	  return 0;
        }
      
      This is due to us using p->se.sum_exec_runtime in
      thread_group_cputime() where we iterate the thread group and sum all
      data. This does not take time since the last schedule operation (tick
      or otherwise) into account. We can cure this by using
      task_sched_runtime() at the cost of having to take locks.
      
      This also means we can (and must) do away with
      thread_group_sched_runtime() since the modified thread_group_cputime()
      is now more accurate and would deadlock when called from
      thread_group_sched_runtime().
      Reported-by: NDavid Miller <davem@davemloft.net>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1314874459.7945.22.camel@twins
      Cc: stable@kernel.org
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      e8abccb7
    • M
      s390: Use direct ktime path for s390 clockevent device · 4f37a68c
      Martin Schwidefsky 提交于
      The clock comparator on s390 uses the same format as the TOD clock.
      If the value in the clock comparator is smaller than the current TOD
      value an interrupt is pending. Use the CLOCK_EVT_FEAT_KTIME feature
      to get the unmodified ktime of the next clockevent expiration and
      use it to program the clock comparator without querying the TOD clock.
      Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: john stultz <johnstul@us.ibm.com>
      Link: http://lkml.kernel.org/r/20110823133143.153017933@de.ibm.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      4f37a68c
    • M
      clockevents: Add direct ktime programming function · 65516f8a
      Martin Schwidefsky 提交于
      There is at least one architecture (s390) with a sane clockevent device
      that can be programmed with the equivalent of a ktime. No need to create
      a delta against the current time, the ktime can be used directly.
      
      A new clock device function 'set_next_ktime' is introduced that is called
      with the unmodified ktime for the timer if the clock event device has the 
      CLOCK_EVT_FEAT_KTIME bit set.
      Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: john stultz <johnstul@us.ibm.com>
      Link: http://lkml.kernel.org/r/20110823133142.815350967@de.ibm.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      65516f8a
    • M
      clockevents: Make minimum delay adjustments configurable · d1748302
      Martin Schwidefsky 提交于
      The automatic increase of the min_delta_ns of a clockevents device
      should be done in the clockevents code as the minimum delay is an
      attribute of the clockevents device.
      
      In addition not all architectures want the automatic adjustment, on a
      massively virtualized system it can happen that the programming of a
      clock event fails several times in a row because the virtual cpu has
      been rescheduled quickly enough. In that case the minimum delay will
      erroneously be increased with no way back. The new config symbol
      GENERIC_CLOCKEVENTS_MIN_ADJUST is used to enable the automatic
      adjustment. The config option is selected only for x86.
      Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: john stultz <johnstul@us.ibm.com>
      Link: http://lkml.kernel.org/r/20110823133142.494157493@de.ibm.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      d1748302
    • H
      nohz: Remove "Switched to NOHz mode" debugging messages · 29c158e8
      Heiko Carstens 提交于
      When performing cpu hotplug tests the kernel printk log buffer gets flooded
      with pointless "Switched to NOHz mode..." messages. Especially when afterwards
      analyzing a dump this might have removed more interesting stuff out of the
      buffer.
      Assuming that switching to NOHz mode simply works just remove the printk.
      Signed-off-by: NHeiko Carstens <heiko.carstens@de.ibm.com>
      Link: http://lkml.kernel.org/r/20110823112046.GB2540@osiris.boeblingen.de.ibm.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      29c158e8
    • M
      proc: Consider NO_HZ when printing idle and iowait times · a25cac51
      Michal Hocko 提交于
      show_stat handler of the /proc/stat file relies on kstat_cpu(cpu)
      statistics when priting information about idle and iowait times.
      This is OK if we are not using tickless kernel (CONFIG_NO_HZ) because
      counters are updated periodically.
      With NO_HZ things got more tricky because we are not doing idle/iowait
      accounting while we are tickless so the value might get outdated.
      Users of /proc/stat will notice that by unchanged idle/iowait values
      which is then interpreted as 0% idle/iowait time. From the user space
      POV this is an unexpected behavior and a change of the interface.
      
      Let's fix this by using get_cpu_{idle,iowait}_time_us which accounts the
      total idle/iowait time since boot and it doesn't rely on sampling or any
      other periodic activity. Fall back to the previous behavior if NO_HZ is
      disabled or not configured.
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Link: http://lkml.kernel.org/r/39181366adac1b39cb6aa3cd53ff0f7c78d32676.1314172057.git.mhocko@suse.czSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      a25cac51
    • M
      nohz: Make idle/iowait counter update conditional · 09a1d34f
      Michal Hocko 提交于
      get_cpu_{idle,iowait}_time_us update idle/iowait counters
      unconditionally if the given CPU is in the idle loop.
      
      This doesn't work well outside of CPU governors which are singletons
      so nobody (except for IRQ) can race with them.
      
      We will need to use both functions from /proc/stat handler to properly
      handle nohz idle/iowait times.
      
      Make the update depend on a non NULL last_update_time argument.
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Link: http://lkml.kernel.org/r/11f23179472635ce52e78921d47a20216b872f23.1314172057.git.mhocko@suse.czSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      09a1d34f
    • M
      nohz: Fix update_ts_time_stat idle accounting · 6beea0cd
      Michal Hocko 提交于
      update_ts_time_stat currently updates idle time even if we are in
      iowait loop at the moment. The only real users of the idle counter
      (via get_cpu_idle_time_us) are CPU governors and they expect to get
      cumulative time for both idle and iowait times.
      The value (idle_sleeptime) is also printed to userspace by print_cpu
      but it prints both idle and iowait times so the idle part is misleading.
      
      Let's clean this up and fix update_ts_time_stat to account both counters
      properly and update consumers of idle to consider iowait time as well.
      If we do this we might use get_cpu_{idle,iowait}_time_us from other
      contexts as well and we will get expected values.
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Link: http://lkml.kernel.org/r/e9c909c221a8da402c4da07e4cd968c3218f8eb1.1314172057.git.mhocko@suse.czSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      6beea0cd
    • M
      cputime: Clean up cputime_to_usecs and usecs_to_cputime macros · ef0e0f5e
      Michal Hocko 提交于
      Get rid of semicolon so that those expressions can be used also
      somewhere else than just in an assignment.
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NArnd Bergmann <arnd@arndb.de>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Link: http://lkml.kernel.org/r/7565417ce30d7e6b1ddc169843af0777dbf66e75.1314172057.git.mhocko@suse.czSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      ef0e0f5e
  4. 11 8月, 2011 9 次提交
  5. 10 8月, 2011 2 次提交
  6. 08 8月, 2011 7 次提交
  7. 07 8月, 2011 11 次提交
    • A
      Fix POSIX ACL permission check · 206b1d09
      Ari Savolainen 提交于
      After commit 3567866b: "RCUify freeing acls, let check_acl() go ahead in
      RCU mode if acl is cached" posix_acl_permission is being called with an
      unsupported flag and the permission check fails. This patch fixes the issue.
      Signed-off-by: NAri Savolainen <ari.m.savolainen@gmail.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      206b1d09
    • L
      Merge branch 'for-linus' of git://git.open-osd.org/linux-open-osd · c2f340a6
      Linus Torvalds 提交于
      * 'for-linus' of git://git.open-osd.org/linux-open-osd:
        ore: Make ore its own module
        exofs: Rename raid engine from exofs/ios.c => ore
        exofs: ios: Move to a per inode components & device-table
        exofs: Move exofs specific osd operations out of ios.c
        exofs: Add offset/length to exofs_get_io_state
        exofs: Fix truncate for the raid-groups case
        exofs: Small cleanup of exofs_fill_super
        exofs: BUG: Avoid sbi realloc
        exofs: Remove pnfs-osd private definitions
        nfs_xdr: Move nfs4_string definition out of #ifdef CONFIG_NFS_V4
      c2f340a6
    • L
      vfs: optimize inode cache access patterns · 3ddcd056
      Linus Torvalds 提交于
      The inode structure layout is largely random, and some of the vfs paths
      really do care.  The path lookup in particular is already quite D$
      intensive, and profiles show that accessing the 'inode->i_op->xyz'
      fields is quite costly.
      
      We already optimized the dcache to not unnecessarily load the d_op
      structure for members that are often NULL using the DCACHE_OP_xyz bits
      in dentry->d_flags, and this does something very similar for the inode
      ops that are used during pathname lookup.
      
      It also re-orders the fields so that the fields accessed by 'stat' are
      together at the beginning of the inode structure, and roughly in the
      order accessed.
      
      The effect of this seems to be in the 1-2% range for an empty kernel
      "make -j" run (which is fairly kernel-intensive, mostly in filename
      lookup), so it's visible.  The numbers are fairly noisy, though, and
      likely depend a lot on exact microarchitecture.  So there's more tuning
      to be done.
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3ddcd056
    • L
      vfs: renumber DCACHE_xyz flags, remove some stale ones · 830c0f0e
      Linus Torvalds 提交于
      Gcc tends to generate better code with small integers, including the
      DCACHE_xyz flag tests - so move the common ones to be first in the list.
      Also just remove the unused DCACHE_INOTIFY_PARENT_WATCHED and
      DCACHE_AUTOFS_PENDING values, their users no longer exists in the source
      tree.
      
      And add a "unlikely()" to the DCACHE_OP_COMPARE test, since we want the
      common case to be a nice straight-line fall-through.
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      830c0f0e
    • L
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 7cd4767e
      Linus Torvalds 提交于
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
        net: Compute protocol sequence numbers and fragment IDs using MD5.
        crypto: Move md5_transform to lib/md5.c
      7cd4767e
    • B
      ore: Make ore its own module · cf283ade
      Boaz Harrosh 提交于
      Export everything from ore need exporting. Change Kbuild and Kconfig
      to build ore.ko as an independent module. Import ore from exofs
      Signed-off-by: NBoaz Harrosh <bharrosh@panasas.com>
      cf283ade
    • B
      exofs: Rename raid engine from exofs/ios.c => ore · 8ff660ab
      Boaz Harrosh 提交于
      ORE stands for "Objects Raid Engine"
      
      This patch is a mechanical rename of everything that was in ios.c
      and its API declaration to an ore.c and an osd_ore.h header. The ore
      engine will later be used by the pnfs objects layout driver.
      
      * File ios.c => ore.c
      
      * Declaration of types and API are moved from exofs.h to a new
        osd_ore.h
      
      * All used types are prefixed by ore_ from their exofs_ name.
      
      * Shift includes from exofs.h to osd_ore.h so osd_ore.h is
        independent, include it from exofs.h.
      
      Other than a pure rename there are no other changes. Next patch
      will move the ore into it's own module and will export the API
      to be used by exofs and later the layout driver
      Signed-off-by: NBoaz Harrosh <bharrosh@panasas.com>
      8ff660ab
    • B
      exofs: ios: Move to a per inode components & device-table · 9e9db456
      Boaz Harrosh 提交于
      Exofs raid engine was saving on memory space by having a single layout-info,
      single pid, and a single device-table, global to the filesystem. Then passing
      a credential and object_id info at the io_state level, private for each
      inode. It would also devise this contraption of rotating the device table
      view for each inode->ino to spread out the device usage.
      
      This is not compatible with the pnfs-objects standard, demanding that
      each inode can have it's own layout-info, device-table, and each object
      component it's own pid, oid and creds.
      
      So: Bring exofs raid engine to be usable for generic pnfs-objects use by:
      
      * Define an exofs_comp structure that holds obj_id and credential info.
      
      * Break up exofs_layout struct to an exofs_components structure that holds a
        possible array of exofs_comp and the array of devices + the size of the
        arrays.
      
      * Add a "comps" parameter to get_io_state() that specifies the ids creds
        and device array to use for each IO.
      
        This enables to keep the layout global, but the device-table view, creds
        and IDs at the inode level. It only adds two 64bit to each inode, since
        some of these members already existed in another form.
      
      * ios raid engine now access layout-info and comps-info through the passed
        pointers. Everything is pre-prepared by caller for generic access of
        these structures and arrays.
      
      At the exofs Level:
      
      * Super block holds an exofs_components struct that holds the device
        array, previously in layout. The devices there are in device-table
        order. The device-array is twice bigger and repeats the device-table
        twice so now each inode's device array can point to a random device
        and have a round-robin view of the table, making it compatible to
        previous exofs versions.
      
      * Each inode has an exofs_components struct that is initialized at
        load time, with it's own view of the device table IDs and creds.
        When doing IO this gets passed to the io_state together with the
        layout.
      
      While preforming this change. Bugs where found where credentials with the
      wrong IDs where used to access the different SB objects (super.c). As well
      as some dead code. It was never noticed because the target we use does not
      check the credentials.
      Signed-off-by: NBoaz Harrosh <bharrosh@panasas.com>
      9e9db456
    • B
      exofs: Move exofs specific osd operations out of ios.c · 85e44df4
      Boaz Harrosh 提交于
      ios.c will be moving to an external library, for use by the
      objects-layout-driver. Remove from it some exofs specific functions.
      
      Also g_attr_logical_length is used both by inode.c and ios.c
      move definition to the later, to keep it independent
      Signed-off-by: NBoaz Harrosh <bharrosh@panasas.com>
      85e44df4
    • B
      exofs: Add offset/length to exofs_get_io_state · e1042ba0
      Boaz Harrosh 提交于
      In future raid code we will need to know the IO offset/length
      and if it's a read or write to determine some of the array
      sizes we'll need.
      
      So add a new exofs_get_rw_state() API for use when
      writeing/reading. All other simple cases are left using the
      old way.
      
      The major change to this is that now we need to call
      exofs_get_io_state later at inode.c::read_exec and
      inode.c::write_exec when we actually know these things. So this
      patch is kept separate so I can test things apart from other
      changes.
      Signed-off-by: NBoaz Harrosh <bharrosh@panasas.com>
      e1042ba0
    • D
      net: Compute protocol sequence numbers and fragment IDs using MD5. · 6e5714ea
      David S. Miller 提交于
      Computers have become a lot faster since we compromised on the
      partial MD4 hash which we use currently for performance reasons.
      
      MD5 is a much safer choice, and is inline with both RFC1948 and
      other ISS generators (OpenBSD, Solaris, etc.)
      
      Furthermore, only having 24-bits of the sequence number be truly
      unpredictable is a very serious limitation.  So the periodic
      regeneration and 8-bit counter have been removed.  We compute and
      use a full 32-bit sequence number.
      
      For ipv6, DCCP was found to use a 32-bit truncated initial sequence
      number (it needs 43-bits) and that is fixed here as well.
      Reported-by: NDan Kaminsky <dan@doxpara.com>
      Tested-by: NWilly Tarreau <w@1wt.eu>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6e5714ea