Commit 96256460, authored by Greg Kroah-Hartman

staging: zcache: delete it

zcache is obsolete and not used anymore, Bob Liu has rewritten it and is
submitting it for inclusion through the main -mm tree, as it should have
been done in the first place...

Cc: Bob Liu <lliubbo@gmail.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Kyungmin Park <kmpark@infradead.org>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Parent f21c5394
@@ -130,8 +130,6 @@ source "drivers/staging/sb105x/Kconfig"
source "drivers/staging/fwserial/Kconfig"
source "drivers/staging/zcache/Kconfig"
source "drivers/staging/goldfish/Kconfig"
source "drivers/staging/netlogic/Kconfig"
......
@@ -58,7 +58,6 @@ obj-$(CONFIG_DRM_IMX) += imx-drm/
obj-$(CONFIG_DGRP) += dgrp/
obj-$(CONFIG_SB105X) += sb105x/
obj-$(CONFIG_FIREWIRE_SERIAL) += fwserial/
obj-$(CONFIG_ZCACHE) += zcache/
obj-$(CONFIG_GOLDFISH) += goldfish/
obj-$(CONFIG_USB_DWC2) += dwc2/
obj-$(CONFIG_LUSTRE_FS) += lustre/
......
config ZCACHE
tristate "Dynamic compression of swap pages and clean pagecache pages"
depends on CRYPTO=y && SWAP=y && CLEANCACHE && FRONTSWAP
select CRYPTO_LZO
default n
help
Zcache doubles RAM efficiency while providing a significant
performance boost on many workloads. Zcache uses
compression and an in-kernel implementation of transcendent
memory to store clean page cache pages and swap in RAM,
providing a noticeable reduction in disk I/O.
config ZCACHE_DEBUG
bool "Enable debug statistics"
depends on DEBUG_FS && ZCACHE
default n
help
This is used to provide a debugfs directory with counters of
how zcache is doing. You probably want to set this to 'N'.
config RAMSTER
tristate "Cross-machine RAM capacity sharing, aka peer-to-peer tmem"
depends on CONFIGFS_FS=y && SYSFS=y && !HIGHMEM && ZCACHE
depends on NET
# must ensure struct page is 8-byte aligned
select HAVE_ALIGNED_STRUCT_PAGE if !64BIT
default n
help
RAMster allows RAM on other machines in a cluster to be utilized
dynamically and symmetrically instead of swapping to a local swap
disk, thus improving performance on memory-constrained workloads
while minimizing total RAM across the cluster. RAMster, like
zcache2, compresses swap pages into local RAM, but then remotifies
the compressed pages to another node in the RAMster cluster.
config RAMSTER_DEBUG
bool "Enable ramster debug statistics"
depends on DEBUG_FS && RAMSTER
default n
help
This is used to provide a debugfs directory with counters of
how ramster is doing. You probably want to set this to 'N'.
# Depends on not-yet-upstreamed mm patches to export end_swap_bio_write and
# __add_to_swap_cache, and implement __swap_writepage (which is swap_writepage
# without the frontswap call). When these are in-tree, the dependency on
# BROKEN can be removed.
config ZCACHE_WRITEBACK
bool "Allow compressed swap pages to be written back to swap disk"
depends on ZCACHE=y && BROKEN
default n
help
Zcache caches compressed swap pages (and other data) in RAM which
often improves performance by avoiding I/O's due to swapping.
In some workloads with very long-lived large processes, it can
instead reduce performance. Writeback decompresses zcache-compressed
pages (in LRU order) when under memory pressure and writes them to
the backing swap disk to ameliorate this problem. Policy driving
writeback is still under development.
zcache-y := zcache-main.o tmem.o zbud.o
zcache-$(CONFIG_ZCACHE_DEBUG) += debug.o
zcache-$(CONFIG_RAMSTER_DEBUG) += ramster/debug.o
zcache-$(CONFIG_RAMSTER) += ramster/ramster.o ramster/r2net.o
zcache-$(CONFIG_RAMSTER) += ramster/nodemanager.o ramster/tcp.o
zcache-$(CONFIG_RAMSTER) += ramster/heartbeat.o ramster/masklog.o
obj-$(CONFIG_ZCACHE) += zcache.o
** ZCACHE PLAN FOR PROMOTION FROM STAGING **
Last updated: Feb 13, 2013
PLAN STEPS
1. merge zcache and ramster to eliminate horrible code duplication
2. converge on a predictable, writeback-capable allocator
3. use debugfs instead of sysfs (per akpm feedback in 2011)
4. zcache side of cleancache/mm WasActive patch
5. zcache side of frontswap exclusive gets
6. zcache must be able to writeback to physical swap disk
(per Andrea Arcangeli feedback in 2011)
7. implement adequate policy for writeback
8. frontswap/cleancache work to allow zcache to be loaded
as a module
9. get core mm developer to review
10. incorporate feedback from review
11. get review/acks from 1-2 additional mm developers
12. incorporate any feedback from additional mm reviews
13. propose location/file-naming in mm tree
14. repeat 9-13 as necessary until akpm is happy and merges
STATUS/OWNERSHIP
1. DONE as part of "new" zcache; in staging/zcache for 3.9
2. DONE as part of "new" zcache (cf zbud.[ch]); in staging/zcache for 3.9
(this was the core of the zcache1 vs zcache2 flail)
3. DONE as part of "new" zcache; in staging/zcache for 3.9
4. DONE (w/caveats) as part of "new" zcache; per cleancache performance
feedback see https://lkml.org/lkml/2011/8/17/351, in
staging/zcache for 3.9; dependent on proposed mm patch, see
https://lkml.org/lkml/2012/1/25/300
5. DONE as part of "new" zcache; performance tuning only,
in staging/zcache for 3.9; dependent on frontswap patch
merged in 3.7 (33c2a174)
6. DONE (w/caveats), prototyped as part of "new" zcache, had
bad memory leak; reimplemented to use sjennings clever tricks
and proposed mm patches with new version in staging/zcache
for 3.9, see https://lkml.org/lkml/2013/2/6/437;
7. PROTOTYPED as part of "new" zcache; in staging/zcache for 3.9;
needs more review (plan to discuss at LSF/MM 2013)
9. IN PROGRESS; owned by Konrad Wilk; Mel Gorman provided
great feedback in August 2012 (unfortunately of "old"
zcache)
11. NOT DONE; owned by Konrad Wilk and Bob Liu
12. TBD (depends on quantity of feedback)
13. PROPOSED; one suggestion proposed by Dan; needs more ideas/feedback
14. TBD (depends on feedback)
WHO NEEDS TO AGREE
Not sure. Seth Jennings is now pursuing a separate but semi-parallel
track. Akpm clearly has to approve for any mm merge to happen. Minchan
Kim has interest but may be happy if/when zram is merged into mm. Konrad
Wilk may be maintainer if akpm decides compression is maintainable
separately from the rest of mm. (More LSF/MM 2013 discussion.)
ZCACHE FUTURE NEW FUNCTIONALITY
A. Support zsmalloc as an alternative high-density allocator
(See https://lkml.org/lkml/2013/1/23/511)
B. Possibly support three zbuds per pageframe when space allows
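Item B can be pictured with a small userspace sketch. It assumes that zbud currently stores at most two compressed pages per pageframe (which is what item B implies), a 4096-byte pageframe, and a hypothetical 64-byte per-pageframe header. The name `demo_zbuds_fit` and both constants are illustrative and do not come from zbud.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_PAGE_SIZE 4096u /* assumed pageframe size */
#define DEMO_HDR_SIZE    64u /* hypothetical per-pageframe bookkeeping */

/* Return true when n compressed pages of the given sizes share one pageframe. */
static bool demo_zbuds_fit(const size_t *zsizes, size_t n)
{
	size_t used = DEMO_HDR_SIZE;

	for (size_t i = 0; i < n; i++) {
		/* a compressed page is never empty and never larger than a page */
		assert(zsizes[i] > 0 && zsizes[i] <= DEMO_PAGE_SIZE);
		used += zsizes[i];
		if (used > DEMO_PAGE_SIZE)
			return false;
	}
	return true;
}
```

Under these assumptions three 1300-byte zpages fit in one pageframe, while two 2000-byte zpages leave no room for a third of 200 bytes. That is the "when space allows" condition in item B.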
#include <linux/atomic.h>
#include "debug.h"
#ifdef CONFIG_ZCACHE_DEBUG
#include <linux/debugfs.h>
ssize_t zcache_obj_count;
ssize_t zcache_obj_count_max;
ssize_t zcache_objnode_count;
ssize_t zcache_objnode_count_max;
u64 zcache_eph_zbytes;
u64 zcache_eph_zbytes_max;
u64 zcache_pers_zbytes_max;
ssize_t zcache_eph_pageframes_max;
ssize_t zcache_pers_pageframes_max;
ssize_t zcache_pageframes_alloced;
ssize_t zcache_pageframes_freed;
ssize_t zcache_eph_zpages;
ssize_t zcache_eph_zpages_max;
ssize_t zcache_pers_zpages_max;
ssize_t zcache_flush_total;
ssize_t zcache_flush_found;
ssize_t zcache_flobj_total;
ssize_t zcache_flobj_found;
ssize_t zcache_failed_eph_puts;
ssize_t zcache_failed_pers_puts;
ssize_t zcache_failed_getfreepages;
ssize_t zcache_failed_alloc;
ssize_t zcache_put_to_flush;
ssize_t zcache_compress_poor;
ssize_t zcache_mean_compress_poor;
ssize_t zcache_eph_ate_tail;
ssize_t zcache_eph_ate_tail_failed;
ssize_t zcache_pers_ate_eph;
ssize_t zcache_pers_ate_eph_failed;
ssize_t zcache_evicted_eph_zpages;
ssize_t zcache_evicted_eph_pageframes;
ssize_t zcache_zero_filled_pages;
ssize_t zcache_zero_filled_pages_max;
#define ATTR(x) { .name = #x, .val = &zcache_##x, }
static struct debug_entry {
const char *name;
ssize_t *val;
} attrs[] = {
ATTR(obj_count), ATTR(obj_count_max),
ATTR(objnode_count), ATTR(objnode_count_max),
ATTR(flush_total), ATTR(flush_found),
ATTR(flobj_total), ATTR(flobj_found),
ATTR(failed_eph_puts), ATTR(failed_pers_puts),
ATTR(failed_getfreepages), ATTR(failed_alloc),
ATTR(put_to_flush),
ATTR(compress_poor), ATTR(mean_compress_poor),
ATTR(eph_ate_tail), ATTR(eph_ate_tail_failed),
ATTR(pers_ate_eph), ATTR(pers_ate_eph_failed),
ATTR(evicted_eph_zpages), ATTR(evicted_eph_pageframes),
ATTR(eph_pageframes), ATTR(eph_pageframes_max),
ATTR(pers_pageframes), ATTR(pers_pageframes_max),
ATTR(eph_zpages), ATTR(eph_zpages_max),
ATTR(pers_zpages), ATTR(pers_zpages_max),
ATTR(last_active_file_pageframes),
ATTR(last_inactive_file_pageframes),
ATTR(last_active_anon_pageframes),
ATTR(last_inactive_anon_pageframes),
ATTR(eph_nonactive_puts_ignored),
ATTR(pers_nonactive_puts_ignored),
ATTR(zero_filled_pages),
#ifdef CONFIG_ZCACHE_WRITEBACK
ATTR(outstanding_writeback_pages),
ATTR(writtenback_pages),
#endif
};
#undef ATTR
int zcache_debugfs_init(void)
{
unsigned int i;
struct dentry *root = debugfs_create_dir("zcache", NULL);
if (root == NULL)
return -ENXIO;
for (i = 0; i < ARRAY_SIZE(attrs); i++)
if (!debugfs_create_size_t(attrs[i].name, S_IRUGO, root, attrs[i].val))
goto out;
debugfs_create_u64("eph_zbytes", S_IRUGO, root, &zcache_eph_zbytes);
debugfs_create_u64("eph_zbytes_max", S_IRUGO, root, &zcache_eph_zbytes_max);
debugfs_create_u64("pers_zbytes", S_IRUGO, root, &zcache_pers_zbytes);
debugfs_create_u64("pers_zbytes_max", S_IRUGO, root, &zcache_pers_zbytes_max);
return 0;
out:
return -ENODEV;
}
/* developers can call this in case of ooms, e.g. to find memory leaks */
void zcache_dump(void)
{
unsigned int i;
for (i = 0; i < ARRAY_SIZE(attrs); i++)
pr_debug("zcache: %s=%zu\n", attrs[i].name, *attrs[i].val);
pr_debug("zcache: eph_zbytes=%llu\n", (unsigned long long)zcache_eph_zbytes);
pr_debug("zcache: eph_zbytes_max=%llu\n", (unsigned long long)zcache_eph_zbytes_max);
pr_debug("zcache: pers_zbytes=%llu\n", (unsigned long long)zcache_pers_zbytes);
pr_debug("zcache: pers_zbytes_max=%llu\n", (unsigned long long)zcache_pers_zbytes_max);
}
#endif
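The `ATTR(x)` table above relies on two preprocessor operators: `#x` turns the argument into a string and `zcache_##x` pastes it onto a prefix. The following userspace sketch shows the same pattern in isolation. The `demo_` counters and `find_counter` are hypothetical stand-ins, and plain `long` replaces `ssize_t` to keep the example portable.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical counters standing in for the zcache_* statistics. */
static long demo_flush_total;
static long demo_flush_found;

/* #x stringifies the argument; demo_##x pastes it onto the prefix. */
#define ATTR(x) { .name = #x, .val = &demo_##x, }
static struct debug_entry {
	const char *name;
	long *val;
} attrs[] = {
	ATTR(flush_total),
	ATTR(flush_found),
};
#undef ATTR

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Look up a counter by its short name; returns NULL when nothing matches. */
static long *find_counter(const char *name)
{
	assert(name != NULL);
	for (size_t i = 0; i < ARRAY_SIZE(attrs); i++)
		if (strcmp(attrs[i].name, name) == 0)
			return attrs[i].val;
	return NULL;
}
```

Each table row is written once, so the exported name and the variable it points to cannot drift apart. This is presumably why the removed file built its debugfs entries this way.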
#include <linux/bug.h>
#ifdef CONFIG_ZCACHE_DEBUG
/* we try to keep these statistics SMP-consistent */
extern ssize_t zcache_obj_count;
static atomic_t zcache_obj_atomic = ATOMIC_INIT(0);
extern ssize_t zcache_obj_count_max;
static inline void inc_zcache_obj_count(void)
{
zcache_obj_count = atomic_inc_return(&zcache_obj_atomic);
if (zcache_obj_count > zcache_obj_count_max)
zcache_obj_count_max = zcache_obj_count;
}
static inline void dec_zcache_obj_count(void)
{
zcache_obj_count = atomic_dec_return(&zcache_obj_atomic);
BUG_ON(zcache_obj_count < 0);
};
extern ssize_t zcache_objnode_count;
static atomic_t zcache_objnode_atomic = ATOMIC_INIT(0);
extern ssize_t zcache_objnode_count_max;
static inline void inc_zcache_objnode_count(void)
{
zcache_objnode_count = atomic_inc_return(&zcache_objnode_atomic);
if (zcache_objnode_count > zcache_objnode_count_max)
zcache_objnode_count_max = zcache_objnode_count;
};
static inline void dec_zcache_objnode_count(void)
{
zcache_objnode_count = atomic_dec_return(&zcache_objnode_atomic);
BUG_ON(zcache_objnode_count < 0);
};
extern u64 zcache_eph_zbytes;
static atomic_long_t zcache_eph_zbytes_atomic = ATOMIC_LONG_INIT(0);
extern u64 zcache_eph_zbytes_max;
static inline void inc_zcache_eph_zbytes(unsigned clen)
{
zcache_eph_zbytes = atomic_long_add_return(clen, &zcache_eph_zbytes_atomic);
if (zcache_eph_zbytes > zcache_eph_zbytes_max)
zcache_eph_zbytes_max = zcache_eph_zbytes;
};
static inline void dec_zcache_eph_zbytes(unsigned zsize)
{
zcache_eph_zbytes = atomic_long_sub_return(zsize, &zcache_eph_zbytes_atomic);
};
extern u64 zcache_pers_zbytes;
static atomic_long_t zcache_pers_zbytes_atomic = ATOMIC_LONG_INIT(0);
extern u64 zcache_pers_zbytes_max;
static inline void inc_zcache_pers_zbytes(unsigned clen)
{
zcache_pers_zbytes = atomic_long_add_return(clen, &zcache_pers_zbytes_atomic);
if (zcache_pers_zbytes > zcache_pers_zbytes_max)
zcache_pers_zbytes_max = zcache_pers_zbytes;
}
static inline void dec_zcache_pers_zbytes(unsigned zsize)
{
zcache_pers_zbytes = atomic_long_sub_return(zsize, &zcache_pers_zbytes_atomic);
}
extern ssize_t zcache_eph_pageframes;
static atomic_t zcache_eph_pageframes_atomic = ATOMIC_INIT(0);
extern ssize_t zcache_eph_pageframes_max;
static inline void inc_zcache_eph_pageframes(void)
{
zcache_eph_pageframes = atomic_inc_return(&zcache_eph_pageframes_atomic);
if (zcache_eph_pageframes > zcache_eph_pageframes_max)
zcache_eph_pageframes_max = zcache_eph_pageframes;
};
static inline void dec_zcache_eph_pageframes(void)
{
zcache_eph_pageframes = atomic_dec_return(&zcache_eph_pageframes_atomic);
};
extern ssize_t zcache_pers_pageframes;
static atomic_t zcache_pers_pageframes_atomic = ATOMIC_INIT(0);
extern ssize_t zcache_pers_pageframes_max;
static inline void inc_zcache_pers_pageframes(void)
{
zcache_pers_pageframes = atomic_inc_return(&zcache_pers_pageframes_atomic);
if (zcache_pers_pageframes > zcache_pers_pageframes_max)
zcache_pers_pageframes_max = zcache_pers_pageframes;
}
static inline void dec_zcache_pers_pageframes(void)
{
zcache_pers_pageframes = atomic_dec_return(&zcache_pers_pageframes_atomic);
}
extern ssize_t zcache_pageframes_alloced;
static atomic_t zcache_pageframes_alloced_atomic = ATOMIC_INIT(0);
static inline void inc_zcache_pageframes_alloced(void)
{
zcache_pageframes_alloced = atomic_inc_return(&zcache_pageframes_alloced_atomic);
};
extern ssize_t zcache_pageframes_freed;
static atomic_t zcache_pageframes_freed_atomic = ATOMIC_INIT(0);
static inline void inc_zcache_pageframes_freed(void)
{
zcache_pageframes_freed = atomic_inc_return(&zcache_pageframes_freed_atomic);
}
extern ssize_t zcache_eph_zpages;
static atomic_t zcache_eph_zpages_atomic = ATOMIC_INIT(0);
extern ssize_t zcache_eph_zpages_max;
static inline void inc_zcache_eph_zpages(void)
{
zcache_eph_zpages = atomic_inc_return(&zcache_eph_zpages_atomic);
if (zcache_eph_zpages > zcache_eph_zpages_max)
zcache_eph_zpages_max = zcache_eph_zpages;
}
static inline void dec_zcache_eph_zpages(unsigned zpages)
{
zcache_eph_zpages = atomic_sub_return(zpages, &zcache_eph_zpages_atomic);
}
extern ssize_t zcache_pers_zpages;
static atomic_t zcache_pers_zpages_atomic = ATOMIC_INIT(0);
extern ssize_t zcache_pers_zpages_max;
static inline void inc_zcache_pers_zpages(void)
{
zcache_pers_zpages = atomic_inc_return(&zcache_pers_zpages_atomic);
if (zcache_pers_zpages > zcache_pers_zpages_max)
zcache_pers_zpages_max = zcache_pers_zpages;
}
static inline void dec_zcache_pers_zpages(unsigned zpages)
{
zcache_pers_zpages = atomic_sub_return(zpages, &zcache_pers_zpages_atomic);
}
extern ssize_t zcache_zero_filled_pages;
static atomic_t zcache_zero_filled_pages_atomic = ATOMIC_INIT(0);
extern ssize_t zcache_zero_filled_pages_max;
static inline void inc_zcache_zero_filled_pages(void)
{
zcache_zero_filled_pages = atomic_inc_return(
&zcache_zero_filled_pages_atomic);
if (zcache_zero_filled_pages > zcache_zero_filled_pages_max)
zcache_zero_filled_pages_max = zcache_zero_filled_pages;
}
static inline void dec_zcache_zero_filled_pages(void)
{
zcache_zero_filled_pages = atomic_dec_return(
&zcache_zero_filled_pages_atomic);
}
static inline unsigned long curr_pageframes_count(void)
{
return zcache_pageframes_alloced -
atomic_read(&zcache_pageframes_freed_atomic) -
atomic_read(&zcache_eph_pageframes_atomic) -
atomic_read(&zcache_pers_pageframes_atomic);
};
/* but for the rest of these, counting races are ok */
extern ssize_t zcache_flush_total;
extern ssize_t zcache_flush_found;
extern ssize_t zcache_flobj_total;
extern ssize_t zcache_flobj_found;
extern ssize_t zcache_failed_eph_puts;
extern ssize_t zcache_failed_pers_puts;
extern ssize_t zcache_failed_getfreepages;
extern ssize_t zcache_failed_alloc;
extern ssize_t zcache_put_to_flush;
extern ssize_t zcache_compress_poor;
extern ssize_t zcache_mean_compress_poor;
extern ssize_t zcache_eph_ate_tail;
extern ssize_t zcache_eph_ate_tail_failed;
extern ssize_t zcache_pers_ate_eph;
extern ssize_t zcache_pers_ate_eph_failed;
extern ssize_t zcache_evicted_eph_zpages;
extern ssize_t zcache_evicted_eph_pageframes;
extern ssize_t zcache_last_active_file_pageframes;
extern ssize_t zcache_last_inactive_file_pageframes;
extern ssize_t zcache_last_active_anon_pageframes;
extern ssize_t zcache_last_inactive_anon_pageframes;
static ssize_t zcache_eph_nonactive_puts_ignored;
static ssize_t zcache_pers_nonactive_puts_ignored;
#ifdef CONFIG_ZCACHE_WRITEBACK
extern ssize_t zcache_writtenback_pages;
extern ssize_t zcache_outstanding_writeback_pages;
#endif
static inline void inc_zcache_flush_total(void)
{
zcache_flush_total++;
};
static inline void inc_zcache_flush_found(void)
{
zcache_flush_found++;
};
static inline void inc_zcache_flobj_total(void)
{
zcache_flobj_total++;
};
static inline void inc_zcache_flobj_found(void)
{
zcache_flobj_found++;
};
static inline void inc_zcache_failed_eph_puts(void)
{
zcache_failed_eph_puts++;
};
static inline void inc_zcache_failed_pers_puts(void)
{
zcache_failed_pers_puts++;
};
static inline void inc_zcache_failed_getfreepages(void)
{
zcache_failed_getfreepages++;
};
static inline void inc_zcache_failed_alloc(void)
{
zcache_failed_alloc++;
};
static inline void inc_zcache_put_to_flush(void)
{
zcache_put_to_flush++;
};
static inline void inc_zcache_compress_poor(void)
{
zcache_compress_poor++;
};
static inline void inc_zcache_mean_compress_poor(void)
{
zcache_mean_compress_poor++;
};
static inline void inc_zcache_eph_ate_tail(void)
{
zcache_eph_ate_tail++;
};
static inline void inc_zcache_eph_ate_tail_failed(void)
{
zcache_eph_ate_tail_failed++;
};
static inline void inc_zcache_pers_ate_eph(void)
{
zcache_pers_ate_eph++;
};
static inline void inc_zcache_pers_ate_eph_failed(void)
{
zcache_pers_ate_eph_failed++;
};
static inline void inc_zcache_evicted_eph_zpages(unsigned zpages)
{
zcache_evicted_eph_zpages += zpages;
};
static inline void inc_zcache_evicted_eph_pageframes(void)
{
zcache_evicted_eph_pageframes++;
};
static inline void inc_zcache_eph_nonactive_puts_ignored(void)
{
zcache_eph_nonactive_puts_ignored++;
};
static inline void inc_zcache_pers_nonactive_puts_ignored(void)
{
zcache_pers_nonactive_puts_ignored++;
};
int zcache_debugfs_init(void);
#else
static inline void inc_zcache_obj_count(void) { };
static inline void dec_zcache_obj_count(void) { };
static inline void inc_zcache_objnode_count(void) { };
static inline void dec_zcache_objnode_count(void) { };
static inline void inc_zcache_eph_zbytes(unsigned clen) { };
static inline void dec_zcache_eph_zbytes(unsigned zsize) { };
static inline void inc_zcache_pers_zbytes(unsigned clen) { };
static inline void dec_zcache_pers_zbytes(unsigned zsize) { };
static inline void inc_zcache_eph_pageframes(void) { };
static inline void dec_zcache_eph_pageframes(void) { };
static inline void inc_zcache_pers_pageframes(void) { };
static inline void dec_zcache_pers_pageframes(void) { };
static inline void inc_zcache_pageframes_alloced(void) { };
static inline void inc_zcache_pageframes_freed(void) { };
static inline void inc_zcache_eph_zpages(void) { };
static inline void dec_zcache_eph_zpages(unsigned zpages) { };
static inline void inc_zcache_pers_zpages(void) { };
static inline void dec_zcache_pers_zpages(unsigned zpages) { };
static inline void inc_zcache_zero_filled_pages(void) { };
static inline void dec_zcache_zero_filled_pages(void) { };
static inline unsigned long curr_pageframes_count(void)
{
return 0;
};
static inline int zcache_debugfs_init(void)
{
return 0;
};
static inline void inc_zcache_flush_total(void) { };
static inline void inc_zcache_flush_found(void) { };
static inline void inc_zcache_flobj_total(void) { };
static inline void inc_zcache_flobj_found(void) { };
static inline void inc_zcache_failed_eph_puts(void) { };
static inline void inc_zcache_failed_pers_puts(void) { };
static inline void inc_zcache_failed_getfreepages(void) { };
static inline void inc_zcache_failed_alloc(void) { };
static inline void inc_zcache_put_to_flush(void) { };
static inline void inc_zcache_compress_poor(void) { };
static inline void inc_zcache_mean_compress_poor(void) { };
static inline void inc_zcache_eph_ate_tail(void) { };
static inline void inc_zcache_eph_ate_tail_failed(void) { };
static inline void inc_zcache_pers_ate_eph(void) { };
static inline void inc_zcache_pers_ate_eph_failed(void) { };
static inline void inc_zcache_evicted_eph_zpages(unsigned zpages) { };
static inline void inc_zcache_evicted_eph_pageframes(void) { };
static inline void inc_zcache_eph_nonactive_puts_ignored(void) { };
static inline void inc_zcache_pers_nonactive_puts_ignored(void) { };
#endif
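The "SMP-consistent" counters in this header share one shape: an atomic holds the exact value, a plain variable mirrors it for debugfs, and a second plain variable tracks the high-water mark. Below is a minimal userspace sketch of that shape using C11 `<stdatomic.h>` instead of the kernel's `atomic_t`. All `demo_` names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_long demo_obj_atomic; /* exact value; zero-initialised (static) */
static long demo_obj_count;         /* mirror read by the reporting code */
static long demo_obj_count_max;     /* high-water mark */

static void inc_demo_obj_count(void)
{
	/* atomic_fetch_add returns the old value, so add 1 for the new one */
	demo_obj_count = atomic_fetch_add(&demo_obj_atomic, 1) + 1;
	if (demo_obj_count > demo_obj_count_max)
		demo_obj_count_max = demo_obj_count;
}

static void dec_demo_obj_count(void)
{
	demo_obj_count = atomic_fetch_sub(&demo_obj_atomic, 1) - 1;
	assert(demo_obj_count >= 0); /* userspace stand-in for BUG_ON() */
}
```

In this sketch only the atomic is exact under concurrency. The mirror and the maximum are updated without a lock and may lag briefly, which matches the header's wording that it only tries to keep the statistics consistent.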
/*
* zcache/ramster.h
*
* Placeholder to resolve ramster references when !CONFIG_RAMSTER
* Real ramster.h lives in ramster subdirectory.
*
* Copyright (c) 2009-2012, Dan Magenheimer, Oracle Corp.
*/
#ifndef _ZCACHE_RAMSTER_H_
#define _ZCACHE_RAMSTER_H_
#ifdef CONFIG_RAMSTER
#include "ramster/ramster.h"
#else
static inline void ramster_init(bool x, bool y, bool z, bool w)
{
}
static inline void ramster_register_pamops(struct tmem_pamops *p)
{
}
static inline int ramster_remotify_pageframe(bool b)
{
return 0;
}
static inline void *ramster_pampd_free(void *v, struct tmem_pool *p,
struct tmem_oid *o, uint32_t u, bool b)
{
return NULL;
}
static inline int ramster_do_preload_flnode(struct tmem_pool *p)
{
return -1;
}
static inline bool pampd_is_remote(void *v)
{
return false;
}
static inline void ramster_count_foreign_pages(bool b, int i)
{
}
static inline void ramster_cpu_up(int cpu)
{
}
static inline void ramster_cpu_down(int cpu)
{
}
#endif
#endif /* _ZCACHE_RAMSTER_H_ */
#include <linux/atomic.h>
#include "debug.h"
ssize_t ramster_foreign_eph_pages;
ssize_t ramster_foreign_pers_pages;
#ifdef CONFIG_DEBUG_FS
#include <linux/debugfs.h>
ssize_t ramster_eph_pages_remoted;
ssize_t ramster_pers_pages_remoted;
ssize_t ramster_eph_pages_remote_failed;
ssize_t ramster_pers_pages_remote_failed;
ssize_t ramster_remote_eph_pages_succ_get;
ssize_t ramster_remote_pers_pages_succ_get;
ssize_t ramster_remote_eph_pages_unsucc_get;
ssize_t ramster_remote_pers_pages_unsucc_get;
ssize_t ramster_pers_pages_remote_nomem;
ssize_t ramster_remote_objects_flushed;
ssize_t ramster_remote_object_flushes_failed;
ssize_t ramster_remote_pages_flushed;
ssize_t ramster_remote_page_flushes_failed;
#define ATTR(x) { .name = #x, .val = &ramster_##x, }
static struct debug_entry {
const char *name;
ssize_t *val;
} attrs[] = {
ATTR(eph_pages_remoted),
ATTR(pers_pages_remoted),
ATTR(eph_pages_remote_failed),
ATTR(pers_pages_remote_failed),
ATTR(remote_eph_pages_succ_get),
ATTR(remote_pers_pages_succ_get),
ATTR(remote_eph_pages_unsucc_get),
ATTR(remote_pers_pages_unsucc_get),
ATTR(pers_pages_remote_nomem),
ATTR(remote_objects_flushed),
ATTR(remote_pages_flushed),
ATTR(remote_object_flushes_failed),
ATTR(remote_page_flushes_failed),
ATTR(foreign_eph_pages),
ATTR(foreign_eph_pages_max),
ATTR(foreign_pers_pages),
ATTR(foreign_pers_pages_max),
};
#undef ATTR
int ramster_debugfs_init(void)
{
int i;
struct dentry *root = debugfs_create_dir("ramster", NULL);
if (root == NULL)
return -ENXIO;
for (i = 0; i < ARRAY_SIZE(attrs); i++)
if (!debugfs_create_size_t(attrs[i].name,
S_IRUGO, root, attrs[i].val))
goto out;
return 0;
out:
return -ENODEV;
}
#else
static inline int ramster_debugfs_init(void)
{
return 0;
}
#endif
#include <linux/bug.h>
#ifdef CONFIG_RAMSTER_DEBUG
extern long ramster_flnodes;
static atomic_t ramster_flnodes_atomic = ATOMIC_INIT(0);
static unsigned long ramster_flnodes_max;
static inline void inc_ramster_flnodes(void)
{
ramster_flnodes = atomic_inc_return(&ramster_flnodes_atomic);
if (ramster_flnodes > ramster_flnodes_max)
ramster_flnodes_max = ramster_flnodes;
}
static inline void dec_ramster_flnodes(void)
{
ramster_flnodes = atomic_dec_return(&ramster_flnodes_atomic);
}
extern ssize_t ramster_foreign_eph_pages;
static atomic_t ramster_foreign_eph_pages_atomic = ATOMIC_INIT(0);
static ssize_t ramster_foreign_eph_pages_max;
static inline void inc_ramster_foreign_eph_pages(void)
{
ramster_foreign_eph_pages = atomic_inc_return(
&ramster_foreign_eph_pages_atomic);
if (ramster_foreign_eph_pages > ramster_foreign_eph_pages_max)
ramster_foreign_eph_pages_max = ramster_foreign_eph_pages;
}
static inline void dec_ramster_foreign_eph_pages(void)
{
ramster_foreign_eph_pages = atomic_dec_return(
&ramster_foreign_eph_pages_atomic);
}
extern ssize_t ramster_foreign_pers_pages;
static atomic_t ramster_foreign_pers_pages_atomic = ATOMIC_INIT(0);
static ssize_t ramster_foreign_pers_pages_max;
static inline void inc_ramster_foreign_pers_pages(void)
{
ramster_foreign_pers_pages = atomic_inc_return(
&ramster_foreign_pers_pages_atomic);
if (ramster_foreign_pers_pages > ramster_foreign_pers_pages_max)
ramster_foreign_pers_pages_max = ramster_foreign_pers_pages;
}
static inline void dec_ramster_foreign_pers_pages(void)
{
ramster_foreign_pers_pages = atomic_dec_return(
&ramster_foreign_pers_pages_atomic);
}
extern ssize_t ramster_eph_pages_remoted;
extern ssize_t ramster_pers_pages_remoted;
extern ssize_t ramster_eph_pages_remote_failed;
extern ssize_t ramster_pers_pages_remote_failed;
extern ssize_t ramster_remote_eph_pages_succ_get;
extern ssize_t ramster_remote_pers_pages_succ_get;
extern ssize_t ramster_remote_eph_pages_unsucc_get;
extern ssize_t ramster_remote_pers_pages_unsucc_get;
extern ssize_t ramster_pers_pages_remote_nomem;
extern ssize_t ramster_remote_objects_flushed;
extern ssize_t ramster_remote_object_flushes_failed;
extern ssize_t ramster_remote_pages_flushed;
extern ssize_t ramster_remote_page_flushes_failed;
int ramster_debugfs_init(void);
static inline void inc_ramster_eph_pages_remoted(void)
{
ramster_eph_pages_remoted++;
};
static inline void inc_ramster_pers_pages_remoted(void)
{
ramster_pers_pages_remoted++;
};
static inline void inc_ramster_eph_pages_remote_failed(void)
{
ramster_eph_pages_remote_failed++;
};
static inline void inc_ramster_pers_pages_remote_failed(void)
{
ramster_pers_pages_remote_failed++;
};
static inline void inc_ramster_remote_eph_pages_succ_get(void)
{
ramster_remote_eph_pages_succ_get++;
};
static inline void inc_ramster_remote_pers_pages_succ_get(void)
{
ramster_remote_pers_pages_succ_get++;
};
static inline void inc_ramster_remote_eph_pages_unsucc_get(void)
{
ramster_remote_eph_pages_unsucc_get++;
};
static inline void inc_ramster_remote_pers_pages_unsucc_get(void)
{
ramster_remote_pers_pages_unsucc_get++;
};
static inline void inc_ramster_pers_pages_remote_nomem(void)
{
ramster_pers_pages_remote_nomem++;
};
static inline void inc_ramster_remote_objects_flushed(void)
{
ramster_remote_objects_flushed++;
};
static inline void inc_ramster_remote_object_flushes_failed(void)
{
ramster_remote_object_flushes_failed++;
};
static inline void inc_ramster_remote_pages_flushed(void)
{
ramster_remote_pages_flushed++;
};
static inline void inc_ramster_remote_page_flushes_failed(void)
{
ramster_remote_page_flushes_failed++;
};
#else
static inline void inc_ramster_flnodes(void) { };
static inline void dec_ramster_flnodes(void) { };
static inline void inc_ramster_foreign_eph_pages(void) { };
static inline void dec_ramster_foreign_eph_pages(void) { };
static inline void inc_ramster_foreign_pers_pages(void) { };
static inline void dec_ramster_foreign_pers_pages(void) { };
static inline void inc_ramster_eph_pages_remoted(void) { };
static inline void inc_ramster_pers_pages_remoted(void) { };
static inline void inc_ramster_eph_pages_remote_failed(void) { };
static inline void inc_ramster_pers_pages_remote_failed(void) { };
static inline void inc_ramster_remote_eph_pages_succ_get(void) { };
static inline void inc_ramster_remote_pers_pages_succ_get(void) { };
static inline void inc_ramster_remote_eph_pages_unsucc_get(void) { };
static inline void inc_ramster_remote_pers_pages_unsucc_get(void) { };
static inline void inc_ramster_pers_pages_remote_nomem(void) { };
static inline void inc_ramster_remote_objects_flushed(void) { };
static inline void inc_ramster_remote_object_flushes_failed(void) { };
static inline void inc_ramster_remote_pages_flushed(void) { };
static inline void inc_ramster_remote_page_flushes_failed(void) { };
static inline int ramster_debugfs_init(void)
{
return 0;
}
#endif
/* -*- mode: c; c-basic-offset: 8; -*-
* vim: noexpandtab sw=8 ts=8 sts=0:
*
* Copyright (C) 2004, 2005, 2012 Oracle. All rights reserved.
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public
* License as published by the Free Software Foundation; either
* version 2 of the License, or (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
* General Public License for more details.
*
* You should have received a copy of the GNU General Public
* License along with this program; if not, write to the
* Free Software Foundation, Inc., 59 Temple Place - Suite 330,
* Boston, MA 02111-1307, USA.
*/
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/configfs.h>
#include "heartbeat.h"
#include "tcp.h"
#include "nodemanager.h"
#include "masklog.h"
/*
* The first heartbeat pass had one global thread that would serialize all hb
* callback calls. This global serializing sem should only be removed once
* we've made sure that all callees can deal with being called concurrently
* from multiple hb region threads.
*/
static DECLARE_RWSEM(r2hb_callback_sem);
/*
* multiple hb threads are watching multiple regions. A node is live
* whenever any of the threads sees activity from the node in its region.
*/
static DEFINE_SPINLOCK(r2hb_live_lock);
static unsigned long r2hb_live_node_bitmap[BITS_TO_LONGS(R2NM_MAX_NODES)];
static struct r2hb_callback {
struct list_head list;
} r2hb_callbacks[R2HB_NUM_CB];
enum r2hb_heartbeat_modes {
R2HB_HEARTBEAT_LOCAL = 0,
R2HB_HEARTBEAT_GLOBAL,
R2HB_HEARTBEAT_NUM_MODES,
};
char *r2hb_heartbeat_mode_desc[R2HB_HEARTBEAT_NUM_MODES] = {
"local", /* R2HB_HEARTBEAT_LOCAL */
"global", /* R2HB_HEARTBEAT_GLOBAL */
};
unsigned int r2hb_dead_threshold = R2HB_DEFAULT_DEAD_THRESHOLD;
unsigned int r2hb_heartbeat_mode = R2HB_HEARTBEAT_LOCAL;
/* Only sets a new threshold if there are no active regions.
*
* No locking or otherwise interesting code is required for reading
* r2hb_dead_threshold as it can't change once regions are active and
* it's not interesting to anyone until then anyway. */
static void r2hb_dead_threshold_set(unsigned int threshold)
{
if (threshold > R2HB_MIN_DEAD_THRESHOLD) {
spin_lock(&r2hb_live_lock);
r2hb_dead_threshold = threshold;
spin_unlock(&r2hb_live_lock);
}
}
static int r2hb_global_hearbeat_mode_set(unsigned int hb_mode)
{
int ret = -1;
if (hb_mode < R2HB_HEARTBEAT_NUM_MODES) {
spin_lock(&r2hb_live_lock);
r2hb_heartbeat_mode = hb_mode;
ret = 0;
spin_unlock(&r2hb_live_lock);
}
return ret;
}
void r2hb_exit(void)
{
}
int r2hb_init(void)
{
int i;
for (i = 0; i < ARRAY_SIZE(r2hb_callbacks); i++)
INIT_LIST_HEAD(&r2hb_callbacks[i].list);
memset(r2hb_live_node_bitmap, 0, sizeof(r2hb_live_node_bitmap));
return 0;
}
/* if we're already in a callback then we're already serialized by the sem */
static void r2hb_fill_node_map_from_callback(unsigned long *map,
unsigned bytes)
{
BUG_ON(bytes < (BITS_TO_LONGS(R2NM_MAX_NODES) * sizeof(unsigned long)));
memcpy(map, &r2hb_live_node_bitmap, bytes);
}
/*
* get a map of all nodes that are heartbeating in any regions
*/
void r2hb_fill_node_map(unsigned long *map, unsigned bytes)
{
/* callers want to serialize this map and callbacks so that they
* can trust that they don't miss nodes coming to the party */
down_read(&r2hb_callback_sem);
spin_lock(&r2hb_live_lock);
r2hb_fill_node_map_from_callback(map, bytes);
spin_unlock(&r2hb_live_lock);
up_read(&r2hb_callback_sem);
}
EXPORT_SYMBOL_GPL(r2hb_fill_node_map);
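The idea behind `r2hb_fill_node_map()` is to hand the caller a private snapshot of the live-node bitmap, copied while a lock is held. The userspace sketch below illustrates this under some assumptions: a C11 `atomic_flag` spinlock replaces the kernel spinlock, the read/write semaphore is left out, and `DEMO_MAX_NODES` and every `demo_` name are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define DEMO_MAX_NODES 255 /* hypothetical cluster size */
#define BITS_PER_LONG (CHAR_BIT * sizeof(unsigned long))
#define BITS_TO_LONGS(n) (((n) + BITS_PER_LONG - 1) / BITS_PER_LONG)

static atomic_flag demo_live_lock = ATOMIC_FLAG_INIT;
static unsigned long demo_live_bitmap[BITS_TO_LONGS(DEMO_MAX_NODES)];

static void demo_lock(void)
{
	while (atomic_flag_test_and_set(&demo_live_lock))
		; /* spin until the flag was previously clear */
}

static void demo_unlock(void)
{
	atomic_flag_clear(&demo_live_lock);
}

/* Mark a node as live, as a heartbeat thread would. */
static void demo_set_node_live(unsigned node)
{
	assert(node < DEMO_MAX_NODES);
	demo_lock();
	demo_live_bitmap[node / BITS_PER_LONG] |= 1UL << (node % BITS_PER_LONG);
	demo_unlock();
}

/* Copy the whole bitmap into the caller's buffer while holding the lock. */
static void demo_fill_node_map(unsigned long *map, size_t bytes)
{
	assert(bytes >= sizeof(demo_live_bitmap));
	demo_lock();
	memcpy(map, demo_live_bitmap, sizeof(demo_live_bitmap));
	demo_unlock();
}

static int demo_test_bit(const unsigned long *map, unsigned node)
{
	return (map[node / BITS_PER_LONG] >> (node % BITS_PER_LONG)) & 1UL;
}
```

One deliberate difference: the sketch copies exactly `sizeof(demo_live_bitmap)` bytes, whereas the removed code passes the caller's `bytes` to `memcpy()` after checking only that it is not too small.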
/*
* heartbeat configfs bits. The heartbeat set is a default set under
* the cluster set in nodemanager.c.
*/
/* heartbeat set */
struct r2hb_hb_group {
struct config_group hs_group;
/* some stuff? */
};
static struct r2hb_hb_group *to_r2hb_hb_group(struct config_group *group)
{
return group ?
container_of(group, struct r2hb_hb_group, hs_group)
: NULL;
}
static struct config_item r2hb_config_item;
static struct config_item *r2hb_hb_group_make_item(struct config_group *group,
const char *name)
{
int ret;
if (strlen(name) > R2HB_MAX_REGION_NAME_LEN) {
ret = -ENAMETOOLONG;
goto free;
}
config_item_put(&r2hb_config_item);
return &r2hb_config_item;
free:
return ERR_PTR(ret);
}
static void r2hb_hb_group_drop_item(struct config_group *group,
struct config_item *item)
{
if (r2hb_global_heartbeat_active()) {
pr_notice("ramster: Heartbeat %s on region %s (%s)\n",
"stopped/aborted", config_item_name(item),
"no region");
}
config_item_put(item);
}
struct r2hb_hb_group_attribute {
struct configfs_attribute attr;
ssize_t (*show)(struct r2hb_hb_group *, char *);
ssize_t (*store)(struct r2hb_hb_group *, const char *, size_t);
};
static ssize_t r2hb_hb_group_show(struct config_item *item,
struct configfs_attribute *attr,
char *page)
{
struct r2hb_hb_group *reg = to_r2hb_hb_group(to_config_group(item));
struct r2hb_hb_group_attribute *r2hb_hb_group_attr =
container_of(attr, struct r2hb_hb_group_attribute, attr);
ssize_t ret = 0;
if (r2hb_hb_group_attr->show)
ret = r2hb_hb_group_attr->show(reg, page);
return ret;
}
static ssize_t r2hb_hb_group_store(struct config_item *item,
struct configfs_attribute *attr,
const char *page, size_t count)
{
struct r2hb_hb_group *reg = to_r2hb_hb_group(to_config_group(item));
struct r2hb_hb_group_attribute *r2hb_hb_group_attr =
container_of(attr, struct r2hb_hb_group_attribute, attr);
ssize_t ret = -EINVAL;
if (r2hb_hb_group_attr->store)
ret = r2hb_hb_group_attr->store(reg, page, count);
return ret;
}
static ssize_t r2hb_hb_group_threshold_show(struct r2hb_hb_group *group,
char *page)
{
return sprintf(page, "%u\n", r2hb_dead_threshold);
}
static ssize_t r2hb_hb_group_threshold_store(struct r2hb_hb_group *group,
const char *page,
size_t count)
{
unsigned long tmp;
char *p = (char *)page;
int err;
err = kstrtoul(p, 10, &tmp);
if (err)
return err;
/* this will validate ranges for us. */
r2hb_dead_threshold_set((unsigned int) tmp);
return count;
}
static
ssize_t r2hb_hb_group_mode_show(struct r2hb_hb_group *group,
char *page)
{
return sprintf(page, "%s\n",
r2hb_heartbeat_mode_desc[r2hb_heartbeat_mode]);
}
static
ssize_t r2hb_hb_group_mode_store(struct r2hb_hb_group *group,
const char *page, size_t count)
{
unsigned int i;
int ret;
size_t len;
len = (page[count - 1] == '\n') ? count - 1 : count;
if (!len)
return -EINVAL;
for (i = 0; i < R2HB_HEARTBEAT_NUM_MODES; ++i) {
if (strnicmp(page, r2hb_heartbeat_mode_desc[i], len))
continue;
ret = r2hb_global_hearbeat_mode_set(i);
if (!ret)
pr_notice("ramster: Heartbeat mode set to %s\n",
r2hb_heartbeat_mode_desc[i]);
return count;
}
return -EINVAL;
}
static struct r2hb_hb_group_attribute r2hb_hb_group_attr_threshold = {
.attr = { .ca_owner = THIS_MODULE,
.ca_name = "dead_threshold",
.ca_mode = S_IRUGO | S_IWUSR },
.show = r2hb_hb_group_threshold_show,
.store = r2hb_hb_group_threshold_store,
};
static struct r2hb_hb_group_attribute r2hb_hb_group_attr_mode = {
.attr = { .ca_owner = THIS_MODULE,
.ca_name = "mode",
.ca_mode = S_IRUGO | S_IWUSR },
.show = r2hb_hb_group_mode_show,
.store = r2hb_hb_group_mode_store,
};
static struct configfs_attribute *r2hb_hb_group_attrs[] = {
&r2hb_hb_group_attr_threshold.attr,
&r2hb_hb_group_attr_mode.attr,
NULL,
};
static struct configfs_item_operations r2hb_hearbeat_group_item_ops = {
.show_attribute = r2hb_hb_group_show,
.store_attribute = r2hb_hb_group_store,
};
static struct configfs_group_operations r2hb_hb_group_group_ops = {
.make_item = r2hb_hb_group_make_item,
.drop_item = r2hb_hb_group_drop_item,
};
static struct config_item_type r2hb_hb_group_type = {
.ct_group_ops = &r2hb_hb_group_group_ops,
.ct_item_ops = &r2hb_hearbeat_group_item_ops,
.ct_attrs = r2hb_hb_group_attrs,
.ct_owner = THIS_MODULE,
};
/* this is just here to avoid touching group in heartbeat.h which the
* entire damn world #includes */
struct config_group *r2hb_alloc_hb_set(void)
{
struct r2hb_hb_group *hs = NULL;
struct config_group *ret = NULL;
hs = kzalloc(sizeof(struct r2hb_hb_group), GFP_KERNEL);
if (hs == NULL)
goto out;
config_group_init_type_name(&hs->hs_group, "heartbeat",
&r2hb_hb_group_type);
ret = &hs->hs_group;
out:
if (ret == NULL)
kfree(hs);
return ret;
}
void r2hb_free_hb_set(struct config_group *group)
{
struct r2hb_hb_group *hs = to_r2hb_hb_group(group);
kfree(hs);
}
/* hb callback registration and issuing */
static struct r2hb_callback *hbcall_from_type(enum r2hb_callback_type type)
{
if (type == R2HB_NUM_CB)
return ERR_PTR(-EINVAL);
return &r2hb_callbacks[type];
}
void r2hb_setup_callback(struct r2hb_callback_func *hc,
enum r2hb_callback_type type,
r2hb_cb_func *func,
void *data,
int priority)
{
INIT_LIST_HEAD(&hc->hc_item);
hc->hc_func = func;
hc->hc_data = data;
hc->hc_priority = priority;
hc->hc_type = type;
hc->hc_magic = R2HB_CB_MAGIC;
}
EXPORT_SYMBOL_GPL(r2hb_setup_callback);
int r2hb_register_callback(const char *region_uuid,
struct r2hb_callback_func *hc)
{
struct r2hb_callback_func *tmp;
struct list_head *iter;
struct r2hb_callback *hbcall;
int ret;
BUG_ON(hc->hc_magic != R2HB_CB_MAGIC);
BUG_ON(!list_empty(&hc->hc_item));
hbcall = hbcall_from_type(hc->hc_type);
if (IS_ERR(hbcall)) {
ret = PTR_ERR(hbcall);
goto out;
}
down_write(&r2hb_callback_sem);
list_for_each(iter, &hbcall->list) {
tmp = list_entry(iter, struct r2hb_callback_func, hc_item);
if (hc->hc_priority < tmp->hc_priority) {
list_add_tail(&hc->hc_item, iter);
break;
}
}
if (list_empty(&hc->hc_item))
list_add_tail(&hc->hc_item, &hbcall->list);
up_write(&r2hb_callback_sem);
ret = 0;
out:
mlog(ML_CLUSTER, "returning %d on behalf of %p for funcs %p\n",
ret, __builtin_return_address(0), hc);
return ret;
}
EXPORT_SYMBOL_GPL(r2hb_register_callback);
void r2hb_unregister_callback(const char *region_uuid,
struct r2hb_callback_func *hc)
{
BUG_ON(hc->hc_magic != R2HB_CB_MAGIC);
mlog(ML_CLUSTER, "on behalf of %p for funcs %p\n",
__builtin_return_address(0), hc);
/* XXX Can this happen _with_ a region reference? */
if (list_empty(&hc->hc_item))
return;
down_write(&r2hb_callback_sem);
list_del_init(&hc->hc_item);
up_write(&r2hb_callback_sem);
}
EXPORT_SYMBOL_GPL(r2hb_unregister_callback);
int r2hb_check_node_heartbeating_from_callback(u8 node_num)
{
unsigned long testing_map[BITS_TO_LONGS(R2NM_MAX_NODES)];
r2hb_fill_node_map_from_callback(testing_map, sizeof(testing_map));
if (!test_bit(node_num, testing_map)) {
mlog(ML_HEARTBEAT,
"node (%u) does not have heartbeating enabled.\n",
node_num);
return 0;
}
return 1;
}
EXPORT_SYMBOL_GPL(r2hb_check_node_heartbeating_from_callback);
void r2hb_stop_all_regions(void)
{
}
EXPORT_SYMBOL_GPL(r2hb_stop_all_regions);
/*
* this is just a hack until we get the plumbing which flips file systems
* read only and drops the hb ref instead of killing the node dead.
*/
int r2hb_global_heartbeat_active(void)
{
return (r2hb_heartbeat_mode == R2HB_HEARTBEAT_GLOBAL);
}
EXPORT_SYMBOL(r2hb_global_heartbeat_active);
/* added for RAMster */
void r2hb_manual_set_node_heartbeating(int node_num)
{
if (node_num < R2NM_MAX_NODES)
set_bit(node_num, r2hb_live_node_bitmap);
}
EXPORT_SYMBOL(r2hb_manual_set_node_heartbeating);
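Note on callback ordering: r2hb_register_callback() above places a new callback in front of the first registered entry whose priority is strictly greater, and appends it to the tail otherwise. The following is a minimal user-space sketch of that ordering rule only. The type `struct cb` and the function `cb_register` are hypothetical stand-ins using a singly linked list, not the kernel's `struct list_head` API, and no locking is modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct r2hb_callback_func; not the kernel type. */
struct cb {
	int priority;
	struct cb *next;
};

/*
 * Insert 'item' before the first entry whose priority is strictly greater,
 * or at the tail if there is none. Entries with equal priority therefore
 * keep their registration order.
 */
static void cb_register(struct cb **head, struct cb *item)
{
	struct cb **link = head;

	assert(head != NULL && item != NULL);
	/* Skip every entry that should run before (or tie with) 'item'. */
	while (*link != NULL && item->priority >= (*link)->priority)
		link = &(*link)->next;
	item->next = *link;
	*link = item;
}
```

With this rule, callbacks are visited in ascending priority order, and two callbacks registered with the same priority run in the order they were registered.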
/* -*- mode: c; c-basic-offset: 8; -*-
* vim: noexpandtab sw=8 ts=8 sts=0:
*
* heartbeat.h
*
* Function prototypes
*
* Copyright (C) 2004 Oracle. All rights reserved.
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public
* License as published by the Free Software Foundation; either
* version 2 of the License, or (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
* General Public License for more details.
*
* You should have received a copy of the GNU General Public
* License along with this program; if not, write to the
* Free Software Foundation, Inc., 59 Temple Place - Suite 330,
* Boston, MA 02111-1307, USA.
*
*/
#ifndef R2CLUSTER_HEARTBEAT_H
#define R2CLUSTER_HEARTBEAT_H
#define R2HB_REGION_TIMEOUT_MS 2000
#define R2HB_MAX_REGION_NAME_LEN 32
/* number of changes to be seen as live */
#define R2HB_LIVE_THRESHOLD 2
/* number of equal samples to be seen as dead */
extern unsigned int r2hb_dead_threshold;
#define R2HB_DEFAULT_DEAD_THRESHOLD 31
/* Otherwise MAX_WRITE_TIMEOUT will be zero... */
#define R2HB_MIN_DEAD_THRESHOLD 2
#define R2HB_MAX_WRITE_TIMEOUT_MS \
(R2HB_REGION_TIMEOUT_MS * (r2hb_dead_threshold - 1))
#define R2HB_CB_MAGIC 0x51d1e4ec
/* callback stuff */
enum r2hb_callback_type {
R2HB_NODE_DOWN_CB = 0,
R2HB_NODE_UP_CB,
R2HB_NUM_CB
};
struct r2nm_node;
typedef void (r2hb_cb_func)(struct r2nm_node *, int, void *);
struct r2hb_callback_func {
u32 hc_magic;
struct list_head hc_item;
r2hb_cb_func *hc_func;
void *hc_data;
int hc_priority;
enum r2hb_callback_type hc_type;
};
struct config_group *r2hb_alloc_hb_set(void);
void r2hb_free_hb_set(struct config_group *group);
void r2hb_setup_callback(struct r2hb_callback_func *hc,
enum r2hb_callback_type type,
r2hb_cb_func *func,
void *data,
int priority);
int r2hb_register_callback(const char *region_uuid,
struct r2hb_callback_func *hc);
void r2hb_unregister_callback(const char *region_uuid,
struct r2hb_callback_func *hc);
void r2hb_fill_node_map(unsigned long *map,
unsigned bytes);
void r2hb_exit(void);
int r2hb_init(void);
int r2hb_check_node_heartbeating_from_callback(u8 node_num);
void r2hb_stop_all_regions(void);
int r2hb_get_all_regions(char *region_uuids, u8 numregions);
int r2hb_global_heartbeat_active(void);
void r2hb_manual_set_node_heartbeating(int);
#endif /* R2CLUSTER_HEARTBEAT_H */
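Note on the timeout arithmetic: R2HB_MAX_WRITE_TIMEOUT_MS is the region timeout multiplied by one less than the dead threshold, which is why the header warns that a threshold of 1 would make it zero. A small sketch of the same formula, assuming only the constants shown in the header; the names `REGION_TIMEOUT_MS` and `max_write_timeout_ms` are illustrative and not part of the kernel code.

```c
#include <assert.h>

#define REGION_TIMEOUT_MS 2000U	/* mirrors R2HB_REGION_TIMEOUT_MS */

/* Same formula as R2HB_MAX_WRITE_TIMEOUT_MS, with the threshold as an argument. */
static unsigned int max_write_timeout_ms(unsigned int dead_threshold)
{
	assert(dead_threshold >= 1);	/* avoid unsigned wrap-around at 0 */
	return REGION_TIMEOUT_MS * (dead_threshold - 1);
}
```

With the default threshold of 31 this gives 60000 ms, and with a threshold of 2 it gives 2000 ms.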
/* -*- mode: c; c-basic-offset: 8; -*-
* vim: noexpandtab sw=8 ts=8 sts=0:
*
* Copyright (C) 2004, 2005, 2012 Oracle. All rights reserved.
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public
* License as published by the Free Software Foundation; either
* version 2 of the License, or (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
* General Public License for more details.
*
* You should have received a copy of the GNU General Public
* License along with this program; if not, write to the
* Free Software Foundation, Inc., 59 Temple Place - Suite 330,
* Boston, MA 02111-1307, USA.
*/
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/proc_fs.h>
#include <linux/seq_file.h>
#include <linux/string.h>
#include <linux/uaccess.h>
#include "masklog.h"
struct mlog_bits r2_mlog_and_bits = MLOG_BITS_RHS(MLOG_INITIAL_AND_MASK);
EXPORT_SYMBOL_GPL(r2_mlog_and_bits);
struct mlog_bits r2_mlog_not_bits = MLOG_BITS_RHS(0);
EXPORT_SYMBOL_GPL(r2_mlog_not_bits);
static ssize_t mlog_mask_show(u64 mask, char *buf)
{
char *state;
if (__mlog_test_u64(mask, r2_mlog_and_bits))
state = "allow";
else if (__mlog_test_u64(mask, r2_mlog_not_bits))
state = "deny";
else
state = "off";
return snprintf(buf, PAGE_SIZE, "%s\n", state);
}
static ssize_t mlog_mask_store(u64 mask, const char *buf, size_t count)
{
if (!strnicmp(buf, "allow", 5)) {
__mlog_set_u64(mask, r2_mlog_and_bits);
__mlog_clear_u64(mask, r2_mlog_not_bits);
} else if (!strnicmp(buf, "deny", 4)) {
__mlog_set_u64(mask, r2_mlog_not_bits);
__mlog_clear_u64(mask, r2_mlog_and_bits);
} else if (!strnicmp(buf, "off", 3)) {
__mlog_clear_u64(mask, r2_mlog_not_bits);
__mlog_clear_u64(mask, r2_mlog_and_bits);
} else
return -EINVAL;
return count;
}
struct mlog_attribute {
struct attribute attr;
u64 mask;
};
#define to_mlog_attr(_attr) container_of(_attr, struct mlog_attribute, attr)
#define define_mask(_name) { \
.attr = { \
.name = #_name, \
.mode = S_IRUGO | S_IWUSR, \
}, \
.mask = ML_##_name, \
}
static struct mlog_attribute mlog_attrs[MLOG_MAX_BITS] = {
define_mask(TCP),
define_mask(MSG),
define_mask(SOCKET),
define_mask(HEARTBEAT),
define_mask(HB_BIO),
define_mask(DLMFS),
define_mask(DLM),
define_mask(DLM_DOMAIN),
define_mask(DLM_THREAD),
define_mask(DLM_MASTER),
define_mask(DLM_RECOVERY),
define_mask(DLM_GLUE),
define_mask(VOTE),
define_mask(CONN),
define_mask(QUORUM),
define_mask(BASTS),
define_mask(CLUSTER),
define_mask(ERROR),
define_mask(NOTICE),
define_mask(KTHREAD),
};
static struct attribute *mlog_attr_ptrs[MLOG_MAX_BITS] = {NULL, };
static ssize_t mlog_show(struct kobject *obj, struct attribute *attr,
char *buf)
{
struct mlog_attribute *mlog_attr = to_mlog_attr(attr);
return mlog_mask_show(mlog_attr->mask, buf);
}
static ssize_t mlog_store(struct kobject *obj, struct attribute *attr,
const char *buf, size_t count)
{
struct mlog_attribute *mlog_attr = to_mlog_attr(attr);
return mlog_mask_store(mlog_attr->mask, buf, count);
}
static const struct sysfs_ops mlog_attr_ops = {
.show = mlog_show,
.store = mlog_store,
};
static struct kobj_type mlog_ktype = {
.default_attrs = mlog_attr_ptrs,
.sysfs_ops = &mlog_attr_ops,
};
static struct kset mlog_kset = {
.kobj = {.ktype = &mlog_ktype},
};
int r2_mlog_sys_init(struct kset *r2cb_kset)
{
int i = 0;
while (mlog_attrs[i].attr.mode) {
mlog_attr_ptrs[i] = &mlog_attrs[i].attr;
i++;
}
mlog_attr_ptrs[i] = NULL;
kobject_set_name(&mlog_kset.kobj, "logmask");
mlog_kset.kobj.kset = r2cb_kset;
return kset_register(&mlog_kset);
}
void r2_mlog_sys_shutdown(void)
{
kset_unregister(&mlog_kset);
}
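Note on the three-state model: mlog_mask_store() and mlog_mask_show() above keep each log bit in one of three states (allow, deny, off) using two bitmasks. The following user-space sketch restates that logic with plain 64-bit integers. `struct mask_state`, `mask_store` and `mask_show` are hypothetical names, and the sketch uses the case-sensitive strncmp() where the kernel code uses strnicmp().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical container for the two global masks (r2_mlog_and_bits and
 * r2_mlog_not_bits in the kernel code). */
struct mask_state {
	uint64_t and_bits;	/* bits written as "allow" */
	uint64_t not_bits;	/* bits written as "deny" */
};

/* Apply one write; returns 0 on success, -1 for an unrecognised string.
 * Note: strncmp() is case-sensitive, unlike the kernel's strnicmp(). */
static int mask_store(struct mask_state *st, uint64_t mask, const char *buf)
{
	assert(st != NULL && buf != NULL);
	if (!strncmp(buf, "allow", 5)) {
		st->and_bits |= mask;
		st->not_bits &= ~mask;
	} else if (!strncmp(buf, "deny", 4)) {
		st->not_bits |= mask;
		st->and_bits &= ~mask;
	} else if (!strncmp(buf, "off", 3)) {
		st->not_bits &= ~mask;
		st->and_bits &= ~mask;
	} else {
		return -1;
	}
	return 0;
}

/* Report the state the way a read of the corresponding file would. */
static const char *mask_show(const struct mask_state *st, uint64_t mask)
{
	assert(st != NULL);
	if (mask & st->and_bits)
		return "allow";
	if (mask & st->not_bits)
		return "deny";
	return "off";
}
```

Because each write clears the bit from the opposite mask, a bit is never in both the allow and deny sets at once.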
/* -*- mode: c; c-basic-offset: 8; -*-
* vim: noexpandtab sw=8 ts=8 sts=0:
*
* Copyright (C) 2005, 2012 Oracle. All rights reserved.
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public
* License as published by the Free Software Foundation; either
* version 2 of the License, or (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
* General Public License for more details.
*
* You should have received a copy of the GNU General Public
* License along with this program; if not, write to the
* Free Software Foundation, Inc., 59 Temple Place - Suite 330,
* Boston, MA 02111-1307, USA.
*/
#ifndef R2CLUSTER_MASKLOG_H
#define R2CLUSTER_MASKLOG_H
/*
* For now this is a trivial wrapper around printk() that gives the critical
* ability to enable sets of debugging output at run-time. In the future this
* will almost certainly be redirected to relayfs so that it can pay a
* substantially lower heisenberg tax.
*
* Callers associate the message with a bitmask and a global bitmask is
* maintained with help from sysfs. If any of the bits match, the message is
* output.
*
* We must have efficient bit tests on i386 and it seems gcc still emits crazy
* code for the 64bit compare. It emits very good code for the dual unsigned
* long tests, though, completely avoiding tests that can never pass if the
* caller gives a constant bitmask that fills one of the longs with all 0s. So
* the desire is to have almost all of the calls decided on by comparing just
* one of the longs. This leads to having infrequently given bits that are
* frequently matched in the high bits.
*
* _ERROR and _NOTICE are used for messages that always go to the console and
* have appropriate KERN_ prefixes. We wrap these in our function instead of
* just calling printk() so that this can eventually make its way through
* relayfs along with the debugging messages. Everything else gets KERN_INFO.
* The inline tests and macro dance give GCC the opportunity to quite cleverly
* only emit the appropriate printk() when the caller passes in a constant
* mask, as is almost always the case.
*
* All this bitmask nonsense is managed from the files under
* /sys/fs/r2cb/logmask/. Reading the files gives a straightforward
* indication of which bits are allowed (allow) or denied (off/deny).
* ENTRY deny
* EXIT deny
* TCP off
* MSG off
* SOCKET off
* ERROR allow
* NOTICE allow
*
* Writing changes the state of a given bit and requires a strictly formatted
* single write() call:
*
* write(fd, "allow", 5);
*
* Echoing an allow/deny/off string into the logmask files flips the bits
* as expected. For example, in bash:
*
* log_mask="/sys/fs/r2cb/logmask"
* for node in ENTRY EXIT TCP MSG SOCKET ERROR NOTICE; do
* echo allow >"$log_mask"/"$node"
* done
*
* The debugfs.ramster tool can also flip the bits with the -l option:
*
* debugfs.ramster -l TCP allow
*/
/* for task_struct */
#include <linux/sched.h>
/* bits that are frequently given and infrequently matched in the low word */
/* NOTE: If you add a flag, you need to also update masklog.c! */
#define ML_TCP 0x0000000000000001ULL /* net cluster/tcp.c */
#define ML_MSG 0x0000000000000002ULL /* net network messages */
#define ML_SOCKET 0x0000000000000004ULL /* net socket lifetime */
#define ML_HEARTBEAT 0x0000000000000008ULL /* hb all heartbeat tracking */
#define ML_HB_BIO 0x0000000000000010ULL /* hb io tracing */
#define ML_DLMFS 0x0000000000000020ULL /* dlm user dlmfs */
#define ML_DLM 0x0000000000000040ULL /* dlm general debugging */
#define ML_DLM_DOMAIN 0x0000000000000080ULL /* dlm domain debugging */
#define ML_DLM_THREAD 0x0000000000000100ULL /* dlm domain thread */
#define ML_DLM_MASTER 0x0000000000000200ULL /* dlm master functions */
#define ML_DLM_RECOVERY 0x0000000000000400ULL /* dlm recovery functions */
#define ML_DLM_GLUE 0x0000000000000800ULL /* ramster dlm glue layer */
#define ML_VOTE 0x0000000000001000ULL /* ramster node messaging */
#define ML_CONN 0x0000000000002000ULL /* net connection management */
#define ML_QUORUM 0x0000000000004000ULL /* net connection quorum */
#define ML_BASTS 0x0000000000008000ULL /* dlmglue asts and basts */
#define ML_CLUSTER 0x0000000000010000ULL /* cluster stack */
/* bits that are infrequently given and frequently matched in the high word */
#define ML_ERROR 0x1000000000000000ULL /* sent to KERN_ERR */
#define ML_NOTICE 0x2000000000000000ULL /* sent to KERN_NOTICE */
#define ML_KTHREAD 0x4000000000000000ULL /* kernel thread activity */
#define MLOG_INITIAL_AND_MASK (ML_ERROR|ML_NOTICE)
#ifndef MLOG_MASK_PREFIX
#define MLOG_MASK_PREFIX 0
#endif
/*
* When logging is disabled, force the bit test to 0 for anything other
* than errors and notices, allowing gcc to remove the code completely.
* When enabled, allow all masks.
*/
#if defined(CONFIG_RAMSTER_DEBUG_MASKLOG)
#define ML_ALLOWED_BITS (~0)
#else
#define ML_ALLOWED_BITS (ML_ERROR|ML_NOTICE)
#endif
#define MLOG_MAX_BITS 64
struct mlog_bits {
unsigned long words[MLOG_MAX_BITS / BITS_PER_LONG];
};
extern struct mlog_bits r2_mlog_and_bits, r2_mlog_not_bits;
#if BITS_PER_LONG == 32
#define __mlog_test_u64(mask, bits) \
((u32)(mask & 0xffffffff) & bits.words[0] || \
((u64)(mask) >> 32) & bits.words[1])
#define __mlog_set_u64(mask, bits) do { \
bits.words[0] |= (u32)(mask & 0xffffffff); \
bits.words[1] |= (u64)(mask) >> 32; \
} while (0)
#define __mlog_clear_u64(mask, bits) do { \
bits.words[0] &= ~((u32)(mask & 0xffffffff)); \
bits.words[1] &= ~((u64)(mask) >> 32); \
} while (0)
#define MLOG_BITS_RHS(mask) { \
{ \
[0] = (u32)(mask & 0xffffffff), \
[1] = (u64)(mask) >> 32, \
} \
}
#else /* 32bit long above, 64bit long below */
#define __mlog_test_u64(mask, bits) ((mask) & bits.words[0])
#define __mlog_set_u64(mask, bits) do { \
bits.words[0] |= (mask); \
} while (0)
#define __mlog_clear_u64(mask, bits) do { \
bits.words[0] &= ~(mask); \
} while (0)
#define MLOG_BITS_RHS(mask) { { (mask) } }
#endif
/*
* smp_processor_id() "helpfully" screams when called outside preemptible
* regions in current kernels. sles doesn't have the variants that don't
* scream. just do this instead of trying to guess which we're building
* against.. *sigh*.
*/
#define __mlog_cpu_guess ({ \
unsigned long _cpu = get_cpu(); \
put_cpu(); \
_cpu; \
})
/* In the following two macros, the whitespace after the ',' just
* before ##args is intentional. Otherwise, gcc 2.95 will eat the
* previous token if args expands to nothing.
*/
#define __mlog_printk(level, fmt, args...) \
printk(level "(%s,%u,%lu):%s:%d " fmt, current->comm, \
task_pid_nr(current), __mlog_cpu_guess, \
__PRETTY_FUNCTION__, __LINE__ , ##args)
#define mlog(mask, fmt, args...) do { \
u64 __m = MLOG_MASK_PREFIX | (mask); \
if ((__m & ML_ALLOWED_BITS) && \
__mlog_test_u64(__m, r2_mlog_and_bits) && \
!__mlog_test_u64(__m, r2_mlog_not_bits)) { \
if (__m & ML_ERROR) \
__mlog_printk(KERN_ERR, "ERROR: "fmt , ##args); \
else if (__m & ML_NOTICE) \
__mlog_printk(KERN_NOTICE, fmt , ##args); \
else \
__mlog_printk(KERN_INFO, fmt , ##args); \
} \
} while (0)
#define mlog_errno(st) do { \
int _st = (st); \
if (_st != -ERESTARTSYS && _st != -EINTR && \
_st != AOP_TRUNCATED_PAGE && _st != -ENOSPC) \
mlog(ML_ERROR, "status = %lld\n", (long long)_st); \
} while (0)
#define mlog_bug_on_msg(cond, fmt, args...) do { \
if (cond) { \
mlog(ML_ERROR, "bug expression: " #cond "\n"); \
mlog(ML_ERROR, fmt, ##args); \
BUG(); \
} \
} while (0)
#include <linux/kobject.h>
#include <linux/sysfs.h>
int r2_mlog_sys_init(struct kset *r2cb_subsys);
void r2_mlog_sys_shutdown(void);
#endif /* R2CLUSTER_MASKLOG_H */
This diff is collapsed.
/* -*- mode: c; c-basic-offset: 8; -*-
* vim: noexpandtab sw=8 ts=8 sts=0:
*
* nodemanager.h
*
* Function prototypes
*
* Copyright (C) 2004 Oracle. All rights reserved.
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public
* License as published by the Free Software Foundation; either
* version 2 of the License, or (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
* General Public License for more details.
*
* You should have received a copy of the GNU General Public
* License along with this program; if not, write to the
* Free Software Foundation, Inc., 59 Temple Place - Suite 330,
* Boston, MA 02111-1307, USA.
*
*/
#ifndef R2CLUSTER_NODEMANAGER_H
#define R2CLUSTER_NODEMANAGER_H
#include "ramster_nodemanager.h"
/* This totally doesn't belong here. */
#include <linux/configfs.h>
#include <linux/rbtree.h>
enum r2nm_fence_method {
R2NM_FENCE_RESET = 0,
R2NM_FENCE_PANIC,
R2NM_FENCE_METHODS, /* Number of fence methods */
};
struct r2nm_node {
spinlock_t nd_lock;
struct config_item nd_item;
char nd_name[R2NM_MAX_NAME_LEN+1]; /* replace? */
__u8 nd_num;
/* only one address per node, as attributes, for now. */
__be32 nd_ipv4_address;
__be16 nd_ipv4_port;
struct rb_node nd_ip_node;
/* there can be only one local node for now */
int nd_local;
unsigned long nd_set_attributes;
};
struct r2nm_cluster {
struct config_group cl_group;
unsigned cl_has_local:1;
u8 cl_local_node;
rwlock_t cl_nodes_lock;
struct r2nm_node *cl_nodes[R2NM_MAX_NODES];
struct rb_root cl_node_ip_tree;
unsigned int cl_idle_timeout_ms;
unsigned int cl_keepalive_delay_ms;
unsigned int cl_reconnect_delay_ms;
enum r2nm_fence_method cl_fence_method;
/* part of a hack for disk bitmap.. will go eventually. - zab */
unsigned long cl_nodes_bitmap[BITS_TO_LONGS(R2NM_MAX_NODES)];
};
extern struct r2nm_cluster *r2nm_single_cluster;
u8 r2nm_this_node(void);
int r2nm_configured_node_map(unsigned long *map, unsigned bytes);
struct r2nm_node *r2nm_get_node_by_num(u8 node_num);
struct r2nm_node *r2nm_get_node_by_ip(__be32 addr);
void r2nm_node_get(struct r2nm_node *node);
void r2nm_node_put(struct r2nm_node *node);
int r2nm_depend_item(struct config_item *item);
void r2nm_undepend_item(struct config_item *item);
int r2nm_depend_this_node(void);
void r2nm_undepend_this_node(void);
#endif /* R2CLUSTER_NODEMANAGER_H */
/*
* r2net.c
*
* Copyright (c) 2011-2012, Dan Magenheimer, Oracle Corp.
*
* Ramster_r2net provides an interface between zcache and r2net.
*
* FIXME: support more than two nodes
*/
#include <linux/list.h>
#include "tcp.h"
#include "nodemanager.h"
#include "../tmem.h"
#include "../zcache.h"
#include "ramster.h"
#define RAMSTER_TESTING
#define RMSTR_KEY 0x77347734
enum {
RMSTR_TMEM_PUT_EPH = 100,
RMSTR_TMEM_PUT_PERS,
RMSTR_TMEM_ASYNC_GET_REQUEST,
RMSTR_TMEM_ASYNC_GET_AND_FREE_REQUEST,
RMSTR_TMEM_ASYNC_GET_REPLY,
RMSTR_TMEM_FLUSH,
RMSTR_TMEM_FLOBJ,
RMSTR_TMEM_DESTROY_POOL,
};
#define RMSTR_R2NET_MAX_LEN \
(R2NET_MAX_PAYLOAD_BYTES - sizeof(struct tmem_xhandle))
#include "tcp_internal.h"
static struct r2nm_node *r2net_target_node;
static int r2net_target_nodenum;
int r2net_remote_target_node_set(int node_num)
{
int ret = -1;
r2net_target_node = r2nm_get_node_by_num(node_num);
if (r2net_target_node != NULL) {
r2net_target_nodenum = node_num;
r2nm_node_put(r2net_target_node);
ret = 0;
}
return ret;
}
/* FIXME following buffer should be per-cpu, protected by preempt_disable */
static char ramster_async_get_buf[R2NET_MAX_PAYLOAD_BYTES];
static int ramster_remote_async_get_request_handler(struct r2net_msg *msg,
u32 len, void *data, void **ret_data)
{
char *pdata;
struct tmem_xhandle xh;
int found;
size_t size = RMSTR_R2NET_MAX_LEN;
u16 msgtype = be16_to_cpu(msg->msg_type);
bool get_and_free = (msgtype == RMSTR_TMEM_ASYNC_GET_AND_FREE_REQUEST);
unsigned long flags;
xh = *(struct tmem_xhandle *)msg->buf;
if (xh.xh_data_size > RMSTR_R2NET_MAX_LEN)
BUG();
pdata = ramster_async_get_buf;
*(struct tmem_xhandle *)pdata = xh;
pdata += sizeof(struct tmem_xhandle);
local_irq_save(flags);
found = zcache_get_page(xh.client_id, xh.pool_id, &xh.oid, xh.index,
pdata, &size, true, get_and_free ? 1 : -1);
local_irq_restore(flags);
if (found < 0) {
/* a zero size indicates the get failed */
size = 0;
}
if (size > RMSTR_R2NET_MAX_LEN)
BUG();
*ret_data = pdata - sizeof(struct tmem_xhandle);
/* now make caller (r2net_process_message) handle specially */
r2net_force_data_magic(msg, RMSTR_TMEM_ASYNC_GET_REPLY, RMSTR_KEY);
return size + sizeof(struct tmem_xhandle);
}
static int ramster_remote_async_get_reply_handler(struct r2net_msg *msg,
u32 len, void *data, void **ret_data)
{
char *in = (char *)msg->buf;
int datalen = len - sizeof(struct r2net_msg);
int ret = -1;
struct tmem_xhandle *xh = (struct tmem_xhandle *)in;
in += sizeof(struct tmem_xhandle);
datalen -= sizeof(struct tmem_xhandle);
BUG_ON(datalen < 0 || datalen > PAGE_SIZE);
ret = ramster_localify(xh->pool_id, &xh->oid, xh->index,
in, datalen, xh->extra);
#ifdef RAMSTER_TESTING
if (ret == -EEXIST)
pr_err("TESTING ArrgREP, aborted overwrite on racy put\n");
#endif
return ret;
}
int ramster_remote_put_handler(struct r2net_msg *msg,
u32 len, void *data, void **ret_data)
{
struct tmem_xhandle *xh;
char *p = (char *)msg->buf;
int datalen = len - sizeof(struct r2net_msg) -
sizeof(struct tmem_xhandle);
u16 msgtype = be16_to_cpu(msg->msg_type);
bool ephemeral = (msgtype == RMSTR_TMEM_PUT_EPH);
unsigned long flags;
int ret;
xh = (struct tmem_xhandle *)p;
p += sizeof(struct tmem_xhandle);
zcache_autocreate_pool(xh->client_id, xh->pool_id, ephemeral);
local_irq_save(flags);
ret = zcache_put_page(xh->client_id, xh->pool_id, &xh->oid, xh->index,
p, datalen, true, ephemeral);
local_irq_restore(flags);
return ret;
}
int ramster_remote_flush_handler(struct r2net_msg *msg,
u32 len, void *data, void **ret_data)
{
struct tmem_xhandle *xh;
char *p = (char *)msg->buf;
xh = (struct tmem_xhandle *)p;
p += sizeof(struct tmem_xhandle);
(void)zcache_flush_page(xh->client_id, xh->pool_id,
&xh->oid, xh->index);
return 0;
}
int ramster_remote_flobj_handler(struct r2net_msg *msg,
u32 len, void *data, void **ret_data)
{
struct tmem_xhandle *xh;
char *p = (char *)msg->buf;
xh = (struct tmem_xhandle *)p;
p += sizeof(struct tmem_xhandle);
(void)zcache_flush_object(xh->client_id, xh->pool_id, &xh->oid);
return 0;
}
int r2net_remote_async_get(struct tmem_xhandle *xh, bool free, int remotenode,
size_t expect_size, uint8_t expect_cksum,
void *extra)
{
int nodenum, ret = -1, status;
struct r2nm_node *node = NULL;
struct kvec vec[1];
size_t veclen = 1;
u32 msg_type;
struct r2net_node *nn;
node = r2nm_get_node_by_num(remotenode);
if (node == NULL)
goto out;
xh->client_id = r2nm_this_node(); /* which node is getting */
xh->xh_data_cksum = expect_cksum;
xh->xh_data_size = expect_size;
xh->extra = extra;
vec[0].iov_len = sizeof(*xh);
vec[0].iov_base = xh;
node = r2net_target_node;
if (!node)
goto out;
nodenum = r2net_target_nodenum;
r2nm_node_get(node);
nn = r2net_nn_from_num(nodenum);
if (nn->nn_persistent_error || !nn->nn_sc_valid) {
ret = -ENOTCONN;
r2nm_node_put(node);
goto out;
}
if (free)
msg_type = RMSTR_TMEM_ASYNC_GET_AND_FREE_REQUEST;
else
msg_type = RMSTR_TMEM_ASYNC_GET_REQUEST;
ret = r2net_send_message_vec(msg_type, RMSTR_KEY,
vec, veclen, remotenode, &status);
r2nm_node_put(node);
if (ret < 0) {
if (ret == -ENOTCONN || ret == -EHOSTDOWN)
goto out;
if (ret == -EAGAIN)
goto out;
/* FIXME handle bad message possibilities here? */
pr_err("UNTESTED ret<0 in ramster_remote_async_get: ret=%d\n",
ret);
}
ret = status;
out:
return ret;
}
#ifdef RAMSTER_TESTING
/* leave me here to see if it catches a weird crash */
static void ramster_check_irq_counts(void)
{
static int last_hardirq_cnt, last_softirq_cnt, last_preempt_cnt;
int cur_hardirq_cnt, cur_softirq_cnt, cur_preempt_cnt;
cur_hardirq_cnt = hardirq_count() >> HARDIRQ_SHIFT;
if (cur_hardirq_cnt > last_hardirq_cnt) {
last_hardirq_cnt = cur_hardirq_cnt;
if (!(last_hardirq_cnt&(last_hardirq_cnt-1)))
pr_err("RAMSTER TESTING RRP hardirq_count=%d\n",
last_hardirq_cnt);
}
cur_softirq_cnt = softirq_count() >> SOFTIRQ_SHIFT;
if (cur_softirq_cnt > last_softirq_cnt) {
last_softirq_cnt = cur_softirq_cnt;
if (!(last_softirq_cnt&(last_softirq_cnt-1)))
pr_err("RAMSTER TESTING RRP softirq_count=%d\n",
last_softirq_cnt);
}
cur_preempt_cnt = preempt_count() & PREEMPT_MASK;
if (cur_preempt_cnt > last_preempt_cnt) {
last_preempt_cnt = cur_preempt_cnt;
if (!(last_preempt_cnt&(last_preempt_cnt-1)))
pr_err("RAMSTER TESTING RRP preempt_count=%d\n",
last_preempt_cnt);
}
}
#endif
int r2net_remote_put(struct tmem_xhandle *xh, char *data, size_t size,
bool ephemeral, int *remotenode)
{
int nodenum, ret = -1, status;
struct r2nm_node *node = NULL;
struct kvec vec[2];
size_t veclen = 2;
u32 msg_type;
struct r2net_node *nn;
BUG_ON(size > RMSTR_R2NET_MAX_LEN);
xh->client_id = r2nm_this_node(); /* which node is putting */
vec[0].iov_len = sizeof(*xh);
vec[0].iov_base = xh;
vec[1].iov_len = size;
vec[1].iov_base = data;
node = r2net_target_node;
if (!node)
goto out;
nodenum = r2net_target_nodenum;
r2nm_node_get(node);
nn = r2net_nn_from_num(nodenum);
if (nn->nn_persistent_error || !nn->nn_sc_valid) {
ret = -ENOTCONN;
r2nm_node_put(node);
goto out;
}
if (ephemeral)
msg_type = RMSTR_TMEM_PUT_EPH;
else
msg_type = RMSTR_TMEM_PUT_PERS;
#ifdef RAMSTER_TESTING
/* leave me here to see if it catches a weird crash */
ramster_check_irq_counts();
#endif
ret = r2net_send_message_vec(msg_type, RMSTR_KEY, vec, veclen,
nodenum, &status);
if (ret < 0)
ret = -1;
else {
ret = status;
*remotenode = nodenum;
}
r2nm_node_put(node);
out:
return ret;
}
int r2net_remote_flush(struct tmem_xhandle *xh, int remotenode)
{
	int ret = -1, status;
	struct r2nm_node *node = NULL;
	struct kvec vec[1];
	size_t veclen = 1;

	node = r2nm_get_node_by_num(remotenode);
	BUG_ON(node == NULL);
	xh->client_id = r2nm_this_node(); /* which node is flushing */
	vec[0].iov_len = sizeof(*xh);
	vec[0].iov_base = xh;
	BUG_ON(irqs_disabled());
	BUG_ON(in_softirq());
	ret = r2net_send_message_vec(RMSTR_TMEM_FLUSH, RMSTR_KEY,
					vec, veclen, remotenode, &status);
	r2nm_node_put(node);
	return ret;
}

int r2net_remote_flush_object(struct tmem_xhandle *xh, int remotenode)
{
	int ret = -1, status;
	struct r2nm_node *node = NULL;
	struct kvec vec[1];
	size_t veclen = 1;

	node = r2nm_get_node_by_num(remotenode);
	BUG_ON(node == NULL);
	xh->client_id = r2nm_this_node(); /* which node is flobjing */
	vec[0].iov_len = sizeof(*xh);
	vec[0].iov_base = xh;
	ret = r2net_send_message_vec(RMSTR_TMEM_FLOBJ, RMSTR_KEY,
					vec, veclen, remotenode, &status);
	r2nm_node_put(node);
	return ret;
}
/*
* Handler registration
*/
static LIST_HEAD(r2net_unreg_list);
static void r2net_unregister_handlers(void)
{
r2net_unregister_handler_list(&r2net_unreg_list);
}
int r2net_register_handlers(void)
{
int status;
status = r2net_register_handler(RMSTR_TMEM_PUT_EPH, RMSTR_KEY,
RMSTR_R2NET_MAX_LEN,
ramster_remote_put_handler,
NULL, NULL, &r2net_unreg_list);
if (status)
goto bail;
status = r2net_register_handler(RMSTR_TMEM_PUT_PERS, RMSTR_KEY,
RMSTR_R2NET_MAX_LEN,
ramster_remote_put_handler,
NULL, NULL, &r2net_unreg_list);
if (status)
goto bail;
status = r2net_register_handler(RMSTR_TMEM_ASYNC_GET_REQUEST, RMSTR_KEY,
RMSTR_R2NET_MAX_LEN,
ramster_remote_async_get_request_handler,
NULL, NULL,
&r2net_unreg_list);
if (status)
goto bail;
status = r2net_register_handler(RMSTR_TMEM_ASYNC_GET_AND_FREE_REQUEST,
RMSTR_KEY, RMSTR_R2NET_MAX_LEN,
ramster_remote_async_get_request_handler,
NULL, NULL,
&r2net_unreg_list);
if (status)
goto bail;
status = r2net_register_handler(RMSTR_TMEM_ASYNC_GET_REPLY, RMSTR_KEY,
RMSTR_R2NET_MAX_LEN,
ramster_remote_async_get_reply_handler,
NULL, NULL,
&r2net_unreg_list);
if (status)
goto bail;
status = r2net_register_handler(RMSTR_TMEM_FLUSH, RMSTR_KEY,
RMSTR_R2NET_MAX_LEN,
ramster_remote_flush_handler,
NULL, NULL,
&r2net_unreg_list);
if (status)
goto bail;
status = r2net_register_handler(RMSTR_TMEM_FLOBJ, RMSTR_KEY,
RMSTR_R2NET_MAX_LEN,
ramster_remote_flobj_handler,
NULL, NULL,
&r2net_unreg_list);
if (status)
goto bail;
pr_info("ramster: r2net handlers registered\n");
bail:
if (status) {
r2net_unregister_handlers();
pr_err("ramster: couldn't register r2net handlers\n");
}
return status;
}
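
The register-then-unwind pattern used by r2net_register_handlers() can be tried outside the kernel. What follows is only a userspace sketch: struct demo_registry, demo_register() and the fail_type parameter are hypothetical stand-ins invented for illustration, not r2net APIs.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_HANDLERS 16

/* Hypothetical stand-in for the role r2net_unreg_list plays in the kernel. */
struct demo_registry {
	unsigned int types[DEMO_MAX_HANDLERS];
	size_t count;
};

/*
 * Mock registration: fails with -1 when the table is full or when the
 * message type equals fail_type (used here to simulate an error).
 */
static int demo_register(struct demo_registry *reg, unsigned int type,
			 unsigned int fail_type)
{
	if (reg->count >= DEMO_MAX_HANDLERS || type == fail_type)
		return -1;
	reg->types[reg->count++] = type;
	return 0;
}

/* Drop every handler registered so far. */
static void demo_unregister_all(struct demo_registry *reg)
{
	reg->count = 0;
}

/*
 * Register each type in order. On the first failure, undo all earlier
 * registrations and return the error, mirroring the bail: path above.
 */
static int demo_register_all(struct demo_registry *reg,
			     const unsigned int *types, size_t ntypes,
			     unsigned int fail_type)
{
	size_t i;
	int status = 0;

	assert(reg != NULL && types != NULL);
	for (i = 0; i < ntypes; i++) {
		status = demo_register(reg, types[i], fail_type);
		if (status)
			break;
	}
	if (status)
		demo_unregister_all(reg);
	return status;
}
```

Calling demo_register_all() with a fail_type that matches one of the listed types leaves the registry empty, which is the property the bail: label provides in the kernel code.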
RAMSTER HOW-TO
Author: Dan Magenheimer
Ramster maintainer: Konrad Wilk <konrad.wilk@oracle.com>
This is a HOWTO document for ramster which, as of this writing, is in
the kernel as a subdirectory of zcache in drivers/staging, called ramster.
(Zcache can be built with or without ramster functionality.) If enabled
and properly configured, ramster allows memory capacity load balancing
across multiple machines in a cluster. Further, the ramster code serves
as an example of asynchronous access for zcache (as well as cleancache and
frontswap) that may prove useful for future transcendent memory
implementations, such as KVM and NVRAM. While ramster works today on
any network connection that supports kernel sockets, its features may
become more interesting on future high-speed fabrics/interconnects.
Ramster requires both kernel and userland support. The userland support,
called ramster-tools, is known to work with EL6-based distros, but is a
set of poorly-hacked slightly-modified cluster tools based on ocfs2, which
includes an init file, a config file, and a userland binary that interfaces
to the kernel. This state of userland support reflects the abysmal userland
skills of this suitably-embarrassed author; any help/patches to turn
ramster-tools into more distributable rpms/debs useful for a wider range
of distros would be appreciated. The source RPM that can be used as a
starting point is available at:
http://oss.oracle.com/projects/tmem/files/RAMster/
As a result of this author's ignorance, userland setup described in this
HOWTO assumes an EL6 distro and is described in EL6 syntax. Apologies
if this offends anyone!
Kernel support has only been tested on x86_64. Systems with an active
ocfs2 filesystem should work, but since ramster leverages a lot of
code from ocfs2, there may be latent issues. A kernel configuration that
includes CONFIG_OCFS2_FS should build OK, and should certainly run OK
if no ocfs2 filesystem is mounted.
This HOWTO demonstrates memory capacity load balancing for a two-node
cluster, where one node called the "local" node becomes overcommitted
and the other node called the "remote" node provides additional RAM
capacity for use by the local node. Ramster is capable of more complex
topologies; see the last section titled "ADVANCED RAMSTER TOPOLOGIES".
If you find any terms in this HOWTO unfamiliar or don't understand the
motivation for ramster, the following LWN reading is recommended:
-- Transcendent Memory in a Nutshell (lwn.net/Articles/454795)
-- The future calculus of memory management (lwn.net/Articles/475681)
And since ramster is built on top of zcache, this article may be helpful:
-- In-kernel memory compression (lwn.net/Articles/545244)
Now that you've memorized the contents of those articles, let's get started!
A. PRELIMINARY
1) Install two x86_64 Linux systems that are known to work when
upgraded to a recent upstream Linux kernel version.
On each system:
2) Configure, build and install, then boot Linux, just to ensure it
can be done with an unmodified upstream kernel. Confirm you booted
the upstream kernel with "uname -a".
3) If you plan to do any performance testing, or to test anything other
than swapping, the "WasActive" patch is also highly recommended.
(Search lkml.org for WasActive, apply the patch, rebuild your kernel.)
For a demo or simple testing, the patch can be ignored.
4) Install ramster-tools as root. An x86_64 rpm for EL6-based systems
can be found at:
http://oss.oracle.com/projects/tmem/files/RAMster/
(Sorry but for now, non-EL6 users must recreate ramster-tools on
their own from source. See above.)
5) Ensure that debugfs is mounted at each boot. Examples below assume it
is mounted at /sys/kernel/debug.
B. BUILDING RAMSTER INTO THE KERNEL
Do the following on each system:
1) Using the kernel configuration mechanism of your choice, change
your config to include:
CONFIG_CLEANCACHE=y
CONFIG_FRONTSWAP=y
CONFIG_STAGING=y
CONFIG_CONFIGFS_FS=y # NOTE: MUST BE y, not m
CONFIG_ZCACHE=y
CONFIG_RAMSTER=y
For a linux-3.10 or later kernel, you should also set:
CONFIG_ZCACHE_DEBUG=y
CONFIG_RAMSTER_DEBUG=y
Before building the kernel, please double-check your kernel config
file to ensure all of the settings are correct.
2) Build this kernel and change your boot file (e.g. /etc/grub.conf)
so that the new kernel will boot.
3) Add "zcache" and "ramster" as kernel boot parameters for the new kernel.
4) Reboot each system approximately simultaneously.
5) Check dmesg to ensure there are some messages from ramster, prefixed
by "ramster:"
# dmesg | grep ramster
You should also see a lot of files in:
# ls /sys/kernel/debug/zcache
# ls /sys/kernel/debug/ramster
These are mostly counters for various zcache and ramster activities.
You should also see files in:
# ls /sys/kernel/mm/ramster
These are sysfs files that control ramster as we shall see.
Ramster now will act as a single-system zcache on each system
but doesn't yet know anything about the cluster so can't yet do
anything remotely.
C. CONFIGURING THE RAMSTER CLUSTER
This part can be error prone unless you are familiar with clustering
filesystems. We need to describe the cluster in a /etc/ramster.conf
file and the init scripts that parse it are extremely picky about
the syntax.
1) Create a /etc/ramster.conf file and ensure it is identical on both
systems. This file mimics the ocfs2 format and there is a good amount
of documentation that can be searched for ocfs2.conf, but you can use:
cluster:
	name = ramster
	node_count = 2
node:
	name = system1
	cluster = ramster
	number = 0
	ip_address = my.ip.ad.r1
	ip_port = 7777
node:
	name = system2
	cluster = ramster
	number = 1
	ip_address = my.ip.ad.r2
	ip_port = 7777
You must ensure that the "name" field in the file exactly matches
the output of "hostname" on each system; if "hostname" shows a
fully-qualified hostname, ensure the name is fully qualified in
/etc/ramster.conf. Obviously, substitute my.ip.ad.rx with proper
ip addresses.
2) Enable the ramster service and configure it. If you used the
EL6 ramster-tools, this would be:
# chkconfig --add ramster
# service ramster configure
Set "load on boot" to "y", cluster to start is "ramster" (or whatever
name you chose in ramster.conf), heartbeat dead threshold as "500",
network idle timeout as "1000000". Leave the others as default.
3) Reboot both systems. After reboot, try (assuming EL6 ramster-tools):
# service ramster status
You should see "Checking RAMSTER cluster "ramster": Online". If you do
not, something is wrong and ramster will not work. Note that you
should also see that the driver for "configfs" is loaded and mounted,
the driver for ocfs2_dlmfs is not loaded, and some numbers for network
parameters. You will also see "Checking RAMSTER heartbeat: Not active".
That's all OK.
4) Now you need to start the cluster heartbeat; the cluster is not "up"
until all nodes detect a heartbeat. In a real cluster, heartbeat detection
is done via a cluster filesystem, but ramster doesn't require one. Some
hack-y kernel code in ramster can start the heartbeat for you though if
you tell it what nodes are "up". To enable the heartbeat, do:
# echo 0 > /sys/kernel/mm/ramster/manual_node_up
# echo 1 > /sys/kernel/mm/ramster/manual_node_up
This must be done on BOTH nodes and, to avoid timeouts, must be done
approximately concurrently on both nodes. On an EL6 system, it is
convenient to put these lines in /etc/rc.local. To confirm that the
cluster is now up, on both systems do:
# dmesg | grep ramster
You should see ramster "Accepted connection" messages in dmesg on both
nodes after this. Note that if you check userland status again with
# service ramster status
you will still see "Checking RAMSTER heartbeat: Not active". That's
still OK... the ramster kernel heartbeat hack doesn't communicate to
userland.
5) You now must tell each node the node to which it should "remotify" pages.
On this two node cluster, we will assume the "local" node, node 0, has
memory overcommitted and will use ramster to utilize RAM capacity on
the "remote node", node 1. To configure this, on node 0, you do:
# echo 1 > /sys/kernel/mm/ramster/remote_target_nodenum
You should see "ramster: node 1 set as remotification target" in dmesg
on node 0. Again, on EL6, /etc/rc.local is a good place to put this
on node 0 so you don't forget to do it at each boot.
6) One more step: By default, the ramster code does not "remotify" any
pages; this is primarily for testing purposes, but sometimes it is
useful. This may change in the future, but for now, on node 0, you do:
# echo 1 > /sys/kernel/mm/ramster/pers_remotify_enable
# echo 1 > /sys/kernel/mm/ramster/eph_remotify_enable
The first enables remotifying swap (persistent, aka frontswap) pages,
the second enables remotifying of page cache (ephemeral, cleancache)
pages.
On EL6, these lines can also be put in /etc/rc.local (AFTER the
node_up lines), or at the beginning of a script that runs a workload.
7) Note that most testing has been done with both/all machines booted
roughly simultaneously to avoid cluster timeouts. Ideally, you should
do this too unless you are trying to break ramster rather than just
use it. ;-)
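
The sysfs writes from steps 4 to 6 can also be issued from a small program instead of /etc/rc.local. This is a minimal sketch, assuming the files behave like ordinary writable text files; demo_write_int() and demo_configure_node0() are hypothetical helper names, and the directory is passed in so the code can be tried against a scratch directory first.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/*
 * Write a decimal integer followed by a newline to path, roughly what
 * "echo N > path" does. Returns 0 on success or a negative errno value.
 */
static int demo_write_int(const char *path, int value)
{
	FILE *f;
	int err = 0;

	assert(path != NULL);
	f = fopen(path, "w");
	if (!f)
		return -errno;
	if (fprintf(f, "%d\n", value) < 0)
		err = -EIO;
	if (fclose(f) != 0 && !err)
		err = -errno;
	return err;
}

/*
 * Apply the node 0 settings from steps 4 to 6 under dir (normally
 * /sys/kernel/mm/ramster). Stops at the first failure.
 */
static int demo_configure_node0(const char *dir)
{
	static const struct {
		const char *file;
		int value;
	} steps[] = {
		{ "manual_node_up", 0 },	/* step 4: node 0 is up */
		{ "manual_node_up", 1 },	/* step 4: node 1 is up */
		{ "remote_target_nodenum", 1 },	/* step 5 */
		{ "pers_remotify_enable", 1 },	/* step 6: frontswap pages */
		{ "eph_remotify_enable", 1 },	/* step 6: cleancache pages */
	};
	char path[512];
	size_t i;

	assert(dir != NULL);
	for (i = 0; i < sizeof(steps) / sizeof(steps[0]); i++) {
		int n = snprintf(path, sizeof(path), "%s/%s",
				 dir, steps[i].file);
		int err;

		if (n < 0 || (size_t)n >= sizeof(path))
			return -ENAMETOOLONG;
		err = demo_write_int(path, steps[i].value);
		if (err)
			return err;
	}
	return 0;
}
```

On node 0 this would be called as demo_configure_node0("/sys/kernel/mm/ramster") with root privileges. Node 1 needs only the two manual_node_up writes.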
D. TESTING RAMSTER
1) Note that ramster has no value unless pages get "remotified". For
swap/frontswap/persistent pages, this doesn't happen unless/until
the workload would cause swapping to occur, at which point pages
are put into frontswap/zcache, and the remotification thread starts
working. To get to the point where the system swaps, you either
need a workload for which the working set exceeds the RAM in the
system; or you need to somehow reduce the amount of RAM one of
the systems sees. The latter is easy when testing in a VM, but
harder on physical systems. In some cases, "mem=xxxM" on the
kernel command line restricts memory, but for some values of xxx
the kernel may fail to boot. One may also try creating a fixed
RAMdisk, doing nothing with it, but ensuring that it eats up a fixed
amount of RAM.
2) To see if ramster is working, on the "remote node", node 1, try:
# grep . /sys/kernel/debug/ramster/foreign_*
# # note, that is space-dot-space between grep and the pathname
to monitor the number (and max) ephemeral and persistent pages
that ramster has sent. If these stay at zero, ramster is not working
either because the workload on the local node (node 0) isn't creating
enough memory pressure or because "remotifying" isn't working. On the
local system, node 0, you can watch lots of useful information also.
Try:
grep . /sys/kernel/debug/zcache/*pageframes* \
/sys/kernel/debug/zcache/*zbytes* \
/sys/kernel/debug/zcache/*zpages* \
/sys/kernel/debug/ramster/*remote*
Of particular note are the remote_*_pages_succ_get counters. These
show how many disk reads and/or disk writes have been avoided on the
overcommitted local system by storing pages remotely using ramster.
At the risk of information overload, you can also grep:
/sys/kernel/debug/cleancache/* and /sys/kernel/debug/frontswap/*
These show, for example, how many disk reads and/or disk writes have
been avoided by using zcache to optimize RAM on the local system.
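
The counters shown by the grep commands above can also be sampled from a program, for example to log them while a workload runs. The sketch below assumes that each debugfs file holds a single decimal number; demo_read_counter() is a hypothetical helper, and the path is left to the caller because the exact counter names depend on the kernel build.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/*
 * Read one unsigned decimal counter from a text file.
 * Returns 0 and stores the value in *out on success; returns a negative
 * errno value if the file cannot be opened or does not start with a number.
 */
static int demo_read_counter(const char *path, unsigned long long *out)
{
	FILE *f;
	int matched;

	assert(path != NULL && out != NULL);
	f = fopen(path, "r");
	if (!f)
		return -errno;
	matched = fscanf(f, "%llu", out);
	fclose(f);
	return matched == 1 ? 0 : -EINVAL;
}
```

Sampling the same file twice and subtracting gives the number of events between the two reads.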
AUTOMATIC SWAP REPATRIATION
You may notice that while the systems are idle, the foreign persistent
page count on the remote machine slowly decreases. This is because
ramster implements "frontswap selfshrinking": When possible, swap
pages that have been remotified are slowly repatriated to the local
machine. This is so that local RAM can be used when possible and
so that, in case of remote machine crash, the probability of loss
of data is reduced.
REBOOTING / POWEROFF
If a system is shut down while some of its swap pages still reside
on a remote system, the system may lock up during the shutdown
sequence. This will occur if the network is shut down before the
swap mechanism is shut down, which is the default ordering on many
distros. To avoid this annoying problem, simply shut off the swap
subsystem before starting the shutdown sequence, e.g.:
# swapoff -a
# reboot
Ideally, this swapoff-before-ifdown ordering should be enforced permanently
using shutdown scripts.
KNOWN PROBLEMS
1) You may periodically see messages such as:
ramster_r2net, message length problem
This is harmless but indicates that a node is sending messages
containing compressed pages that exceed the maximum for zcache
(PAGE_SIZE*15/16). The sender side needs to be fixed.
2) If you see a "No longer connected to node..." message or a "No connection
established with node X after N seconds", it is possible you may
be in an unrecoverable state. If you are certain all of the
appropriate cluster configuration steps described above have been
performed, try rebooting the two servers concurrently to see if
the cluster starts.
Note that "Connection to node... shutdown, state 7" is an intermediate
connection state. As long as you later see "Accepted connection", the
intermediate states are harmless.
3) There are known issues in counting certain values. As a result
you may see periodic warnings from the kernel. Almost always you
will see "ramster: bad accounting for XXX". There are also "WARN_ONCE"
messages. If you see kernel warnings with a tombstone, please report
them. They are harmless but reflect bugs that need to be eventually fixed.
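
The size limit quoted in item 1 is simple arithmetic. As a small sketch (the helper names are hypothetical, and the page size is a parameter rather than the kernel's PAGE_SIZE macro):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Upper bound on a compressed page as stated in item 1: PAGE_SIZE*15/16. */
static size_t demo_max_zsize(size_t page_size)
{
	return page_size * 15 / 16;
}

/* True if a compressed page of zsize bytes stays within that bound. */
static bool demo_zsize_ok(size_t page_size, size_t zsize)
{
	assert(page_size > 0);
	return zsize <= demo_max_zsize(page_size);
}
```

With 4096-byte pages the bound works out to 3840 bytes, so a page that compresses to 3841 bytes or more would fall into the case the message reports.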
ADVANCED RAMSTER TOPOLOGIES
The kernel code for ramster can support up to eight nodes in a cluster,
but no testing has been done with more than three nodes.
In the example described above, the "remote" node serves as a RAM
overflow for the "local" node. This can be made symmetric by appropriate
settings of the sysfs remote_target_nodenum file. For example, by setting:
# echo 1 > /sys/kernel/mm/ramster/remote_target_nodenum
on node 0, and
# echo 0 > /sys/kernel/mm/ramster/remote_target_nodenum
on node 1, each node can serve as a RAM overflow for the other.
For more than two nodes, a "RAM server" can be configured. For a
three node system, set:
# echo 0 > /sys/kernel/mm/ramster/remote_target_nodenum
on node 1, and
# echo 0 > /sys/kernel/mm/ramster/remote_target_nodenum
on node 2. Then node 0 is a RAM server for node 1 and node 2.
In this implementation of ramster, any remote node is potentially a single
point of failure (SPOF). Though the probability of failure is reduced
by automatic swap repatriation (see above), a proposed future enhancement
to ramster improves high-availability for the cluster by sending a copy
of each page of data to two other nodes. Patches welcome!
This diff is collapsed.
/*
* ramster.h
*
* Peer-to-peer transcendent memory
*
* Copyright (c) 2009-2012, Dan Magenheimer, Oracle Corp.
*/
#ifndef _RAMSTER_RAMSTER_H_
#define _RAMSTER_RAMSTER_H_
#include "../tmem.h"
enum ramster_remotify_op {
RAMSTER_REMOTIFY_FLUSH_PAGE,
RAMSTER_REMOTIFY_FLUSH_OBJ,
};
struct ramster_remotify_hdr {
enum ramster_remotify_op op;
struct list_head list;
};
struct flushlist_node {
struct ramster_remotify_hdr rem_op;
struct tmem_xhandle xh;
};
struct ramster_preload {
struct flushlist_node *flnode;
};
union remotify_list_node {
struct ramster_remotify_hdr rem_op;
struct {
struct ramster_remotify_hdr rem_op;
struct tmem_handle th;
} zbud_hdr;
struct flushlist_node flist;
};
/*
* format of remote pampd:
* bit 0 is reserved for zbud (in-page buddy selection)
* bit 1 == intransit
* bit 2 == is_remote... if this bit is set, then
* bit 3-10 == remotenode
* bit 11-23 == size
* bit 24-31 == cksum
*/
#define FAKE_PAMPD_INTRANSIT_BITS 1
#define FAKE_PAMPD_ISREMOTE_BITS 1
#define FAKE_PAMPD_REMOTENODE_BITS 8
#define FAKE_PAMPD_REMOTESIZE_BITS 13
#define FAKE_PAMPD_CHECKSUM_BITS 8
#define FAKE_PAMPD_INTRANSIT_SHIFT 1
#define FAKE_PAMPD_ISREMOTE_SHIFT (FAKE_PAMPD_INTRANSIT_SHIFT + \
FAKE_PAMPD_INTRANSIT_BITS)
#define FAKE_PAMPD_REMOTENODE_SHIFT (FAKE_PAMPD_ISREMOTE_SHIFT + \
FAKE_PAMPD_ISREMOTE_BITS)
#define FAKE_PAMPD_REMOTESIZE_SHIFT (FAKE_PAMPD_REMOTENODE_SHIFT + \
FAKE_PAMPD_REMOTENODE_BITS)
#define FAKE_PAMPD_CHECKSUM_SHIFT (FAKE_PAMPD_REMOTESIZE_SHIFT + \
FAKE_PAMPD_REMOTESIZE_BITS)
#define FAKE_PAMPD_MASK(x) ((1UL << (x)) - 1)
static inline void *pampd_make_remote(int remotenode, size_t size,
unsigned char cksum)
{
unsigned long fake_pampd = 0;
fake_pampd |= 1UL << FAKE_PAMPD_ISREMOTE_SHIFT;
fake_pampd |= ((unsigned long)remotenode &
FAKE_PAMPD_MASK(FAKE_PAMPD_REMOTENODE_BITS)) <<
FAKE_PAMPD_REMOTENODE_SHIFT;
fake_pampd |= ((unsigned long)size &
FAKE_PAMPD_MASK(FAKE_PAMPD_REMOTESIZE_BITS)) <<
FAKE_PAMPD_REMOTESIZE_SHIFT;
fake_pampd |= ((unsigned long)cksum &
FAKE_PAMPD_MASK(FAKE_PAMPD_CHECKSUM_BITS)) <<
FAKE_PAMPD_CHECKSUM_SHIFT;
return (void *)fake_pampd;
}
static inline unsigned int pampd_remote_node(void *pampd)
{
unsigned long fake_pampd = (unsigned long)pampd;
return (fake_pampd >> FAKE_PAMPD_REMOTENODE_SHIFT) &
FAKE_PAMPD_MASK(FAKE_PAMPD_REMOTENODE_BITS);
}
static inline unsigned int pampd_remote_size(void *pampd)
{
unsigned long fake_pampd = (unsigned long)pampd;
return (fake_pampd >> FAKE_PAMPD_REMOTESIZE_SHIFT) &
FAKE_PAMPD_MASK(FAKE_PAMPD_REMOTESIZE_BITS);
}
static inline unsigned char pampd_remote_cksum(void *pampd)
{
unsigned long fake_pampd = (unsigned long)pampd;
return (fake_pampd >> FAKE_PAMPD_CHECKSUM_SHIFT) &
FAKE_PAMPD_MASK(FAKE_PAMPD_CHECKSUM_BITS);
}
static inline bool pampd_is_remote(void *pampd)
{
unsigned long fake_pampd = (unsigned long)pampd;
return (fake_pampd >> FAKE_PAMPD_ISREMOTE_SHIFT) &
FAKE_PAMPD_MASK(FAKE_PAMPD_ISREMOTE_BITS);
}
static inline bool pampd_is_intransit(void *pampd)
{
unsigned long fake_pampd = (unsigned long)pampd;
return (fake_pampd >> FAKE_PAMPD_INTRANSIT_SHIFT) &
FAKE_PAMPD_MASK(FAKE_PAMPD_INTRANSIT_BITS);
}
/* note that it is a BUG for intransit to be set without isremote also set */
static inline void *pampd_mark_intransit(void *pampd)
{
unsigned long fake_pampd = (unsigned long)pampd;
fake_pampd |= 1UL << FAKE_PAMPD_ISREMOTE_SHIFT;
fake_pampd |= 1UL << FAKE_PAMPD_INTRANSIT_SHIFT;
return (void *)fake_pampd;
}
static inline void *pampd_mask_intransit_and_remote(void *marked_pampd)
{
unsigned long pampd = (unsigned long)marked_pampd;
pampd &= ~(1UL << FAKE_PAMPD_INTRANSIT_SHIFT);
pampd &= ~(1UL << FAKE_PAMPD_ISREMOTE_SHIFT);
return (void *)pampd;
}
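
The bit layout documented above can be exercised in userspace with a round trip through encode and decode. This is only a sketch: the demo_ names are hypothetical, the constants are re-declared from the layout comment, and a plain unsigned long stands in for the void * that the kernel helpers pass around. The assert() on size is an added check; the kernel helper masks silently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Shifts and widths taken from the "format of remote pampd" comment. */
enum {
	DEMO_INTRANSIT_SHIFT  = 1,
	DEMO_ISREMOTE_SHIFT   = 2,
	DEMO_REMOTENODE_SHIFT = 3,
	DEMO_REMOTENODE_BITS  = 8,
	DEMO_REMOTESIZE_SHIFT = 11,
	DEMO_REMOTESIZE_BITS  = 13,
	DEMO_CHECKSUM_SHIFT   = 24,
	DEMO_CHECKSUM_BITS    = 8,
};

#define DEMO_MASK(bits) ((1UL << (bits)) - 1)

/* Pack node, size and checksum into one word and set the is_remote bit. */
static unsigned long demo_make_remote(unsigned int node, size_t size,
				      unsigned char cksum)
{
	unsigned long v = 1UL << DEMO_ISREMOTE_SHIFT;

	assert(size <= DEMO_MASK(DEMO_REMOTESIZE_BITS)); /* must fit in 13 bits */
	v |= ((unsigned long)node & DEMO_MASK(DEMO_REMOTENODE_BITS))
		<< DEMO_REMOTENODE_SHIFT;
	v |= ((unsigned long)size & DEMO_MASK(DEMO_REMOTESIZE_BITS))
		<< DEMO_REMOTESIZE_SHIFT;
	v |= ((unsigned long)cksum & DEMO_MASK(DEMO_CHECKSUM_BITS))
		<< DEMO_CHECKSUM_SHIFT;
	return v;
}

static unsigned int demo_remote_node(unsigned long v)
{
	return (v >> DEMO_REMOTENODE_SHIFT) & DEMO_MASK(DEMO_REMOTENODE_BITS);
}

static unsigned int demo_remote_size(unsigned long v)
{
	return (v >> DEMO_REMOTESIZE_SHIFT) & DEMO_MASK(DEMO_REMOTESIZE_BITS);
}

static unsigned char demo_remote_cksum(unsigned long v)
{
	return (v >> DEMO_CHECKSUM_SHIFT) & DEMO_MASK(DEMO_CHECKSUM_BITS);
}

static bool demo_is_remote(unsigned long v)
{
	return (v >> DEMO_ISREMOTE_SHIFT) & 1UL;
}

static bool demo_is_intransit(unsigned long v)
{
	return (v >> DEMO_INTRANSIT_SHIFT) & 1UL;
}

/* Setting intransit also sets is_remote, as in pampd_mark_intransit(). */
static unsigned long demo_mark_intransit(unsigned long v)
{
	return v | (1UL << DEMO_ISREMOTE_SHIFT) | (1UL << DEMO_INTRANSIT_SHIFT);
}
```

The fields together occupy bits 1 to 31, so the packed value fits in a 32-bit unsigned long as well as a 64-bit one. The 13-bit size field holds values up to 8191.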
extern int r2net_remote_async_get(struct tmem_xhandle *,
bool, int, size_t, uint8_t, void *extra);
extern int r2net_remote_put(struct tmem_xhandle *, char *, size_t,
bool, int *);
extern int r2net_remote_flush(struct tmem_xhandle *, int);
extern int r2net_remote_flush_object(struct tmem_xhandle *, int);
extern int r2net_register_handlers(void);
extern int r2net_remote_target_node_set(int);
extern int ramster_remotify_pageframe(bool);
extern void ramster_init(bool, bool, bool, bool);
extern void ramster_register_pamops(struct tmem_pamops *);
extern int ramster_localify(int, struct tmem_oid *oidp, uint32_t, char *,
unsigned int, void *);
extern void *ramster_pampd_free(void *, struct tmem_pool *, struct tmem_oid *,
uint32_t, bool);
extern void ramster_count_foreign_pages(bool, int);
extern int ramster_do_preload_flnode(struct tmem_pool *);
extern void ramster_cpu_up(int);
extern void ramster_cpu_down(int);
#endif /* _RAMSTER_RAMSTER_H_ */
/* -*- mode: c; c-basic-offset: 8; -*-
* vim: noexpandtab sw=8 ts=8 sts=0:
*
* ramster_nodemanager.h
*
* Header describing the interface between userspace and the kernel
* for the ramster_nodemanager module.
*
* Copyright (C) 2002, 2004, 2012 Oracle. All rights reserved.
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public
* License as published by the Free Software Foundation; either
* version 2 of the License, or (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
* General Public License for more details.
*
* You should have received a copy of the GNU General Public
* License along with this program; if not, write to the
* Free Software Foundation, Inc., 59 Temple Place - Suite 330,
* Boston, MA 02111-1307, USA.
*
*/
#ifndef _RAMSTER_NODEMANAGER_H
#define _RAMSTER_NODEMANAGER_H
#define R2NM_API_VERSION 5
#define R2NM_MAX_NODES 255
#define R2NM_INVALID_NODE_NUM 255
/* host name, group name, cluster name all 64 bytes */
#define R2NM_MAX_NAME_LEN 64 /* __NEW_UTS_LEN */
extern int r2nm_init(void);
#endif /* _RAMSTER_NODEMANAGER_H */
This diff is collapsed.
/* -*- mode: c; c-basic-offset: 8; -*-
* vim: noexpandtab sw=8 ts=8 sts=0:
*
* tcp.h
*
* Function prototypes
*
* Copyright (C) 2004 Oracle. All rights reserved.
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public
* License as published by the Free Software Foundation; either
* version 2 of the License, or (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
* General Public License for more details.
*
* You should have received a copy of the GNU General Public
* License along with this program; if not, write to the
* Free Software Foundation, Inc., 59 Temple Place - Suite 330,
* Boston, MA 02111-1307, USA.
*
*/
#ifndef R2CLUSTER_TCP_H
#define R2CLUSTER_TCP_H
#include <linux/socket.h>
#ifdef __KERNEL__
#include <net/sock.h>
#include <linux/tcp.h>
#else
#include <sys/socket.h>
#endif
#include <linux/inet.h>
#include <linux/in.h>
struct r2net_msg {
__be16 magic;
__be16 data_len;
__be16 msg_type;
__be16 pad1;
__be32 sys_status;
__be32 status;
__be32 key;
__be32 msg_num;
__u8 buf[0];
};
typedef int (r2net_msg_handler_func)(struct r2net_msg *msg, u32 len, void *data,
void **ret_data);
typedef void (r2net_post_msg_handler_func)(int status, void *data,
void *ret_data);
#define R2NET_MAX_PAYLOAD_BYTES (4096 - sizeof(struct r2net_msg))
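
To see what R2NET_MAX_PAYLOAD_BYTES evaluates to, the header can be mirrored in userspace. This is a sketch under assumptions: struct demo_r2net_msg is a hypothetical copy that uses fixed-width integer types in place of __be16 and __be32 (byte order does not change the size), and a C99 flexible array member in place of buf[0].

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace mirror of struct r2net_msg, field for field. */
struct demo_r2net_msg {
	uint16_t magic;
	uint16_t data_len;
	uint16_t msg_type;
	uint16_t pad1;
	uint32_t sys_status;
	uint32_t status;
	uint32_t key;
	uint32_t msg_num;
	uint8_t buf[];		/* payload follows the header */
};

/* Four 16-bit fields plus four 32-bit fields, with no padding expected. */
static_assert(sizeof(struct demo_r2net_msg) == 24,
	      "unexpected padding in message header");

#define DEMO_MAX_PAYLOAD_BYTES (4096 - sizeof(struct demo_r2net_msg))

/* Size of the fixed header that precedes every payload. */
static size_t demo_header_bytes(void)
{
	return sizeof(struct demo_r2net_msg);
}

/* Bytes left for payload in a 4096-byte message. */
static size_t demo_max_payload_bytes(void)
{
	return DEMO_MAX_PAYLOAD_BYTES;
}
```

With a 24-byte header, 4072 bytes remain for the payload of each 4096-byte message.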
/* same as hb delay, we're waiting for another node to recognize our hb */
#define R2NET_RECONNECT_DELAY_MS_DEFAULT 2000
#define R2NET_KEEPALIVE_DELAY_MS_DEFAULT 2000
#define R2NET_IDLE_TIMEOUT_MS_DEFAULT 30000
/* TODO: figure this out.... */
static inline int r2net_link_down(int err, struct socket *sock)
{
if (sock) {
if (sock->sk->sk_state != TCP_ESTABLISHED &&
sock->sk->sk_state != TCP_CLOSE_WAIT)
return 1;
}
if (err >= 0)
return 0;
switch (err) {
/* ????????????????????????? */
case -ERESTARTSYS:
case -EBADF:
/* When the server has died, an ICMP port unreachable
* message prompts ECONNREFUSED. */
case -ECONNREFUSED:
case -ENOTCONN:
case -ECONNRESET:
case -EPIPE:
return 1;
}
return 0;
}
enum {
R2NET_DRIVER_UNINITED,
R2NET_DRIVER_READY,
};
int r2net_send_message(u32 msg_type, u32 key, void *data, u32 len,
u8 target_node, int *status);
int r2net_send_message_vec(u32 msg_type, u32 key, struct kvec *vec,
size_t veclen, u8 target_node, int *status);
int r2net_register_handler(u32 msg_type, u32 key, u32 max_len,
r2net_msg_handler_func *func, void *data,
r2net_post_msg_handler_func *post_func,
struct list_head *unreg_list);
void r2net_unregister_handler_list(struct list_head *list);
void r2net_fill_node_map(unsigned long *map, unsigned bytes);
void r2net_force_data_magic(struct r2net_msg *, u16, u32);
void r2net_hb_node_up_manual(int);
struct r2net_node *r2net_nn_from_num(u8);
struct r2nm_node;
int r2net_register_hb_callbacks(void);
void r2net_unregister_hb_callbacks(void);
int r2net_start_listening(struct r2nm_node *node);
void r2net_stop_listening(struct r2nm_node *node);
void r2net_disconnect_node(struct r2nm_node *node);
int r2net_num_connected_peers(void);
int r2net_init(void);
void r2net_exit(void);
struct r2net_send_tracking;
struct r2net_sock_container;
#if 0
int r2net_debugfs_init(void);
void r2net_debugfs_exit(void);
void r2net_debug_add_nst(struct r2net_send_tracking *nst);
void r2net_debug_del_nst(struct r2net_send_tracking *nst);
void r2net_debug_add_sc(struct r2net_sock_container *sc);
void r2net_debug_del_sc(struct r2net_sock_container *sc);
#else
static inline int r2net_debugfs_init(void)
{
return 0;
}
static inline void r2net_debugfs_exit(void)
{
}
static inline void r2net_debug_add_nst(struct r2net_send_tracking *nst)
{
}
static inline void r2net_debug_del_nst(struct r2net_send_tracking *nst)
{
}
static inline void r2net_debug_add_sc(struct r2net_sock_container *sc)
{
}
static inline void r2net_debug_del_sc(struct r2net_sock_container *sc)
{
}
#endif /* 0: debugfs support is compiled out */
#endif /* R2CLUSTER_TCP_H */
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
/*
* zbud.h
*
* Copyright (c) 2010-2012, Dan Magenheimer, Oracle Corp.
*
*/
#ifndef _ZBUD_H_
#define _ZBUD_H_
#include "tmem.h"
struct zbudref;
extern unsigned int zbud_max_buddy_size(void);
extern struct zbudref *zbud_match_prep(struct tmem_handle *th, bool eph,
void *cdata, unsigned size);
extern struct zbudref *zbud_create_prep(struct tmem_handle *th, bool eph,
void *cdata, unsigned size,
struct page *newpage);
extern void zbud_create_finish(struct zbudref *, bool);
extern int zbud_decompress(struct page *, struct zbudref *, bool,
void (*func)(char *, unsigned int, char *));
extern int zbud_copy_from_zbud(char *, struct zbudref *, size_t *, bool);
extern int zbud_copy_to_zbud(struct zbudref *, char *, bool);
extern struct page *zbud_free_and_delist(struct zbudref *, bool eph,
unsigned int *, unsigned int *);
extern struct page *zbud_evict_pageframe_lru(unsigned int *, unsigned int *);
extern unsigned int zbud_make_zombie_lru(struct tmem_handle *, unsigned char **,
unsigned int *, bool);
extern void zbud_init(void);
#endif /* _ZBUD_H_ */
This diff is collapsed.
/*
* zcache.h
*
* Copyright (c) 2012, Dan Magenheimer, Oracle Corp.
*/
#ifndef _ZCACHE_H_
#define _ZCACHE_H_
struct zcache_preload {
struct tmem_obj *obj;
struct tmem_objnode *objnodes[OBJNODE_TREE_MAX_PATH];
};
struct tmem_pool;
#define MAX_POOLS_PER_CLIENT 16
#define MAX_CLIENTS 16
#define LOCAL_CLIENT ((uint16_t)-1)
struct zcache_client {
struct tmem_pool *tmem_pools[MAX_POOLS_PER_CLIENT];
bool allocated;
atomic_t refcount;
};
extern struct tmem_pool *zcache_get_pool_by_id(uint16_t cli_id,
uint16_t poolid);
extern void zcache_put_pool(struct tmem_pool *pool);
extern int zcache_put_page(int, int, struct tmem_oid *,
uint32_t, void *,
unsigned int, bool, int);
extern int zcache_get_page(int, int, struct tmem_oid *, uint32_t,
void *, size_t *, bool, int);
extern int zcache_flush_page(int, int, struct tmem_oid *, uint32_t);
extern int zcache_flush_object(int, int, struct tmem_oid *);
extern void zcache_decompress_to_page(char *, unsigned int, struct page *);
#if defined(CONFIG_RAMSTER) || defined(CONFIG_RAMSTER_MODULE)
extern void *zcache_pampd_create(char *, unsigned int, bool, int,
struct tmem_handle *);
int zcache_autocreate_pool(unsigned int cli_id, unsigned int pool_id, bool eph);
#endif
#define MAX_POOLS_PER_CLIENT 16
#define MAX_CLIENTS 16
#define LOCAL_CLIENT ((uint16_t)-1)
#endif /* _ZCACHE_H_ */