[UnifiedV4 00/16] The Unified slab allocator (V4)

Discussion:

[UnifiedV4 00/16] The Unified slab allocator (V4)

Christoph Lameter

2010-10-05 18:57:25 UTC

V3->V4:
- Lots of debugging
- Performance optimizations (more would be good)...
- Drop per slab locking in favor of per node locking for
partial lists (queuing implies freeing large amounts of objects
to per node lists of slab).
- Implement object expiration via reclaim VM logic.

The following is a release of an allocator based on SLAB
and SLUB that integrates the best approaches from both allocators. The
per cpu queuing is like in SLAB whereas much of the infrastructure
comes from SLUB.

After this patches SLUB will track the cpu cache contents
like SLAB attemped to. There are a number of architectural differences:

1. SLUB accurately tracks cpu caches instead of assuming that there
is only a single cpu cache per node or system.

2. SLUB object expiration is tied into the page reclaim logic. There
is no periodic cache expiration.

3. SLUB caches are dynamically configurable via the sysfs filesystem.

4. There is no per slab page metadata structure to maintain (aside
from the object bitmap that usually fits into the page struct).

5. Has all the resiliency and diagnostic features of SLUB.

The unified allocator is a merging of SLUB with some queuing concepts from
SLAB and a new way of managing objects in the slabs using bitmaps. Memory
wise this is slightly more inefficient than SLUB (due to the need to place
large bitmaps --sized a few words--in some slab pages if there are more
than BITS_PER_LONG objects in a slab) but in general does not increase space
use too much.

The SLAB scheme of not touching the object during management is adopted.
The unified allocator can efficiently free and allocate cache cold objects
without causing cache misses.

Some numbers using tcp_rr on localhost

Dell R910 128G RAM, 64 processors, 4 NUMA nodes

threads unified slub slab
64 4141798 3729037 3884939
128 4146587 3890993 4105276
192 4003063 3876570 4110971
256 3928857 3942806 4099249
320 3922623 3969042 4093283
384 3827603 4002833 4108420
448 4140345 4027251 4118534
512 4163741 4050130 4122644
576 4175666 4099934 4149355
640 4190332 4142570 4175618
704 4198779 4173177 4193657
768 4662216 4200462 4222686

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>

Christoph Lameter

2010-10-05 18:57:27 UTC

There is a lot of #ifdef/#endifs that can be avoided if functions would be in different
places. Move them around and reduce #ifdef.

Signed-off-by: Christoph Lameter <***@linux.com>

---
mm/slub.c | 297 +++++++++++++++++++++++++++++---------------------------------
1 file changed, 141 insertions(+), 156 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2010-10-04 08:17:49.000000000 -0500
+++ linux-2.6/mm/slub.c 2010-10-04 08:18:03.000000000 -0500
@@ -3476,71 +3476,6 @@ static long validate_slab_cache(struct k
kfree(map);
return count;
}
-#endif
-
-#ifdef SLUB_RESILIENCY_TEST
-static void resiliency_test(void)
-{
- u8 *p;
-
- BUILD_BUG_ON(KMALLOC_MIN_SIZE > 16 || SLUB_PAGE_SHIFT < 10);
-
- printk(KERN_ERR "SLUB resiliency testing\n");
- printk(KERN_ERR "-----------------------\n");
- printk(KERN_ERR "A. Corruption after allocation\n");
-
- p = kzalloc(16, GFP_KERNEL);
- p[16] = 0x12;
- printk(KERN_ERR "\n1. kmalloc-16: Clobber Redzone/next pointer"
- " 0x12->0x%p\n\n", p + 16);
-
- validate_slab_cache(kmalloc_caches[4]);
-
- /* Hmmm... The next two are dangerous */
- p = kzalloc(32, GFP_KERNEL);
- p[32 + sizeof(void *)] = 0x34;
- printk(KERN_ERR "\n2. kmalloc-32: Clobber next pointer/next slab"
- " 0x34 -> -0x%p\n", p);
- printk(KERN_ERR
- "If allocated object is overwritten then not detectable\n\n");
-
- validate_slab_cache(kmalloc_caches[5]);
- p = kzalloc(64, GFP_KERNEL);
- p += 64 + (get_cycles() & 0xff) * sizeof(void *);
- *p = 0x56;
- printk(KERN_ERR "\n3. kmalloc-64: corrupting random byte 0x56->0x%p\n",
- p);
- printk(KERN_ERR
- "If allocated object is overwritten then not detectable\n\n");
- validate_slab_cache(kmalloc_caches[6]);
-
- printk(KERN_ERR "\nB. Corruption after free\n");
- p = kzalloc(128, GFP_KERNEL);
- kfree(p);
- *p = 0x78;
- printk(KERN_ERR "1. kmalloc-128: Clobber first word 0x78->0x%p\n\n", p);
- validate_slab_cache(kmalloc_caches[7]);
-
- p = kzalloc(256, GFP_KERNEL);
- kfree(p);
- p[50] = 0x9a;
- printk(KERN_ERR "\n2. kmalloc-256: Clobber 50th byte 0x9a->0x%p\n\n",
- p);
- validate_slab_cache(kmalloc_caches[8]);
-
- p = kzalloc(512, GFP_KERNEL);
- kfree(p);
- p[512] = 0xab;
- printk(KERN_ERR "\n3. kmalloc-512: Clobber redzone 0xab->0x%p\n\n", p);
- validate_slab_cache(kmalloc_caches[9]);
-}
-#else
-#ifdef CONFIG_SYSFS
-static void resiliency_test(void) {};
-#endif
-#endif
-
-#ifdef CONFIG_DEBUG
/*
* Generate lists of code addresses where slabcache objects are allocated
* and freed.
@@ -3771,6 +3706,68 @@ static int list_locations(struct kmem_ca
}
#endif

+#ifdef SLUB_RESILIENCY_TEST
+static void resiliency_test(void)
+{
+ u8 *p;
+
+ BUILD_BUG_ON(KMALLOC_MIN_SIZE > 16 || SLUB_PAGE_SHIFT < 10);
+
+ printk(KERN_ERR "SLUB resiliency testing\n");
+ printk(KERN_ERR "-----------------------\n");
+ printk(KERN_ERR "A. Corruption after allocation\n");
+
+ p = kzalloc(16, GFP_KERNEL);
+ p[16] = 0x12;
+ printk(KERN_ERR "\n1. kmalloc-16: Clobber Redzone/next pointer"
+ " 0x12->0x%p\n\n", p + 16);
+
+ validate_slab_cache(kmalloc_caches[4]);
+
+ /* Hmmm... The next two are dangerous */
+ p = kzalloc(32, GFP_KERNEL);
+ p[32 + sizeof(void *)] = 0x34;
+ printk(KERN_ERR "\n2. kmalloc-32: Clobber next pointer/next slab"
+ " 0x34 -> -0x%p\n", p);
+ printk(KERN_ERR
+ "If allocated object is overwritten then not detectable\n\n");
+
+ validate_slab_cache(kmalloc_caches[5]);
+ p = kzalloc(64, GFP_KERNEL);
+ p += 64 + (get_cycles() & 0xff) * sizeof(void *);
+ *p = 0x56;
+ printk(KERN_ERR "\n3. kmalloc-64: corrupting random byte 0x56->0x%p\n",
+ p);
+ printk(KERN_ERR
+ "If allocated object is overwritten then not detectable\n\n");
+ validate_slab_cache(kmalloc_caches[6]);
+
+ printk(KERN_ERR "\nB. Corruption after free\n");
+ p = kzalloc(128, GFP_KERNEL);
+ kfree(p);
+ *p = 0x78;
+ printk(KERN_ERR "1. kmalloc-128: Clobber first word 0x78->0x%p\n\n", p);
+ validate_slab_cache(kmalloc_caches[7]);
+
+ p = kzalloc(256, GFP_KERNEL);
+ kfree(p);
+ p[50] = 0x9a;
+ printk(KERN_ERR "\n2. kmalloc-256: Clobber 50th byte 0x9a->0x%p\n\n",
+ p);
+ validate_slab_cache(kmalloc_caches[8]);
+
+ p = kzalloc(512, GFP_KERNEL);
+ kfree(p);
+ p[512] = 0xab;
+ printk(KERN_ERR "\n3. kmalloc-512: Clobber redzone 0xab->0x%p\n\n", p);
+ validate_slab_cache(kmalloc_caches[9]);
+}
+#else
+#ifdef CONFIG_SYSFS
+static void resiliency_test(void) {};
+#endif
+#endif
+
#ifdef CONFIG_SYSFS
enum slab_stat_type {
SL_ALL, /* All slabs */
@@ -3987,14 +3984,6 @@ static ssize_t aliases_show(struct kmem_
}
SLAB_ATTR_RO(aliases);

-#ifdef CONFIG_SLUB_DEBUG
-static ssize_t slabs_show(struct kmem_cache *s, char *buf)
-{
- return show_slab_objects(s, buf, SO_ALL);
-}
-SLAB_ATTR_RO(slabs);
-#endif
-
static ssize_t partial_show(struct kmem_cache *s, char *buf)
{
return show_slab_objects(s, buf, SO_PARTIAL);
@@ -4019,7 +4008,48 @@ static ssize_t objects_partial_show(stru
}
SLAB_ATTR_RO(objects_partial);

+static ssize_t reclaim_account_show(struct kmem_cache *s, char *buf)
+{
+ return sprintf(buf, "%d\n", !!(s->flags & SLAB_RECLAIM_ACCOUNT));
+}
+
+static ssize_t reclaim_account_store(struct kmem_cache *s,
+ const char *buf, size_t length)
+{
+ s->flags &= ~SLAB_RECLAIM_ACCOUNT;
+ if (buf[0] == '1')
+ s->flags |= SLAB_RECLAIM_ACCOUNT;
+ return length;
+}
+SLAB_ATTR(reclaim_account);
+
+static ssize_t hwcache_align_show(struct kmem_cache *s, char *buf)
+{
+ return sprintf(buf, "%d\n", !!(s->flags & SLAB_HWCACHE_ALIGN));
+}
+SLAB_ATTR_RO(hwcache_align);
+
+#ifdef CONFIG_ZONE_DMA
+static ssize_t cache_dma_show(struct kmem_cache *s, char *buf)
+{
+ return sprintf(buf, "%d\n", !!(s->flags & SLAB_CACHE_DMA));
+}
+SLAB_ATTR_RO(cache_dma);
+#endif
+
+static ssize_t destroy_by_rcu_show(struct kmem_cache *s, char *buf)
+{
+ return sprintf(buf, "%d\n", !!(s->flags & SLAB_DESTROY_BY_RCU));
+}
+SLAB_ATTR_RO(destroy_by_rcu);
+
#ifdef CONFIG_SLUB_DEBUG
+static ssize_t slabs_show(struct kmem_cache *s, char *buf)
+{
+ return show_slab_objects(s, buf, SO_ALL);
+}
+SLAB_ATTR_RO(slabs);
+
static ssize_t total_objects_show(struct kmem_cache *s, char *buf)
{
return show_slab_objects(s, buf, SO_ALL|SO_TOTAL);
@@ -4056,60 +4086,6 @@ static ssize_t trace_store(struct kmem_c
}
SLAB_ATTR(trace);

-#ifdef CONFIG_FAILSLAB
-static ssize_t failslab_show(struct kmem_cache *s, char *buf)
-{
- return sprintf(buf, "%d\n", !!(s->flags & SLAB_FAILSLAB));
-}
-
-static ssize_t failslab_store(struct kmem_cache *s, const char *buf,
- size_t length)
-{
- s->flags &= ~SLAB_FAILSLAB;
- if (buf[0] == '1')
- s->flags |= SLAB_FAILSLAB;
- return length;
-}
-SLAB_ATTR(failslab);
-#endif
-#endif
-
-static ssize_t reclaim_account_show(struct kmem_cache *s, char *buf)
-{
- return sprintf(buf, "%d\n", !!(s->flags & SLAB_RECLAIM_ACCOUNT));
-}
-
-static ssize_t reclaim_account_store(struct kmem_cache *s,
- const char *buf, size_t length)
-{
- s->flags &= ~SLAB_RECLAIM_ACCOUNT;
- if (buf[0] == '1')
- s->flags |= SLAB_RECLAIM_ACCOUNT;
- return length;
-}
-SLAB_ATTR(reclaim_account);
-
-static ssize_t hwcache_align_show(struct kmem_cache *s, char *buf)
-{
- return sprintf(buf, "%d\n", !!(s->flags & SLAB_HWCACHE_ALIGN));
-}
-SLAB_ATTR_RO(hwcache_align);
-
-#ifdef CONFIG_ZONE_DMA
-static ssize_t cache_dma_show(struct kmem_cache *s, char *buf)
-{
- return sprintf(buf, "%d\n", !!(s->flags & SLAB_CACHE_DMA));
-}
-SLAB_ATTR_RO(cache_dma);
-#endif
-
-static ssize_t destroy_by_rcu_show(struct kmem_cache *s, char *buf)
-{
- return sprintf(buf, "%d\n", !!(s->flags & SLAB_DESTROY_BY_RCU));
-}
-SLAB_ATTR_RO(destroy_by_rcu);
-
-#ifdef CONFIG_SLUB_DEBUG
static ssize_t red_zone_show(struct kmem_cache *s, char *buf)
{
return sprintf(buf, "%d\n", !!(s->flags & SLAB_RED_ZONE));
@@ -4185,6 +4161,39 @@ static ssize_t validate_store(struct kme
return ret;
}
SLAB_ATTR(validate);
+
+static ssize_t alloc_calls_show(struct kmem_cache *s, char *buf)
+{
+ if (!(s->flags & SLAB_STORE_USER))
+ return -ENOSYS;
+ return list_locations(s, buf, TRACK_ALLOC);
+}
+SLAB_ATTR_RO(alloc_calls);
+
+static ssize_t free_calls_show(struct kmem_cache *s, char *buf)
+{
+ if (!(s->flags & SLAB_STORE_USER))
+ return -ENOSYS;
+ return list_locations(s, buf, TRACK_FREE);
+}
+SLAB_ATTR_RO(free_calls);
+#endif /* CONFIG_SLUB_DEBUG */
+
+#ifdef CONFIG_FAILSLAB
+static ssize_t failslab_show(struct kmem_cache *s, char *buf)
+{
+ return sprintf(buf, "%d\n", !!(s->flags & SLAB_FAILSLAB));
+}
+
+static ssize_t failslab_store(struct kmem_cache *s, const char *buf,
+ size_t length)
+{
+ s->flags &= ~SLAB_FAILSLAB;
+ if (buf[0] == '1')
+ s->flags |= SLAB_FAILSLAB;
+ return length;
+}
+SLAB_ATTR(failslab);
#endif

static ssize_t shrink_show(struct kmem_cache *s, char *buf)
@@ -4206,24 +4215,6 @@ static ssize_t shrink_store(struct kmem_
}
SLAB_ATTR(shrink);

-#ifdef CONFIG_SLUB_DEBUG
-static ssize_t alloc_calls_show(struct kmem_cache *s, char *buf)
-{
- if (!(s->flags & SLAB_STORE_USER))
- return -ENOSYS;
- return list_locations(s, buf, TRACK_ALLOC);
-}
-SLAB_ATTR_RO(alloc_calls);
-
-static ssize_t free_calls_show(struct kmem_cache *s, char *buf)
-{
- if (!(s->flags & SLAB_STORE_USER))
- return -ENOSYS;
- return list_locations(s, buf, TRACK_FREE);
-}
-SLAB_ATTR_RO(free_calls);
-#endif
-
#ifdef CONFIG_NUMA
static ssize_t remote_node_defrag_ratio_show(struct kmem_cache *s, char *buf)
{
@@ -4329,30 +4320,24 @@ static struct attribute *slab_attrs[] =
&min_partial_attr.attr,
&objects_attr.attr,
&objects_partial_attr.attr,
-#ifdef CONFIG_SLUB_DEBUG
- &total_objects_attr.attr,
- &slabs_attr.attr,
-#endif
&partial_attr.attr,
&cpu_slabs_attr.attr,
&ctor_attr.attr,
&aliases_attr.attr,
&align_attr.attr,
-#ifdef CONFIG_SLUB_DEBUG
- &sanity_checks_attr.attr,
- &trace_attr.attr,
-#endif
&hwcache_align_attr.attr,
&reclaim_account_attr.attr,
&destroy_by_rcu_attr.attr,
+ &shrink_attr.attr,
#ifdef CONFIG_SLUB_DEBUG
+ &total_objects_attr.attr,
+ &slabs_attr.attr,
+ &sanity_checks_attr.attr,
+ &trace_attr.attr,
&red_zone_attr.attr,
&poison_attr.attr,
&store_user_attr.attr,
&validate_attr.attr,
-#endif
- &shrink_attr.attr,
-#ifdef CONFIG_SLUB_DEBUG
&alloc_calls_attr.attr,
&free_calls_attr.attr,
#endif

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>

Pekka Enberg

2010-10-06 14:02:17 UTC

Post by Christoph Lameter
There is a lot of #ifdef/#endifs that can be avoided if functions would be in different
places. Move them around and reduce #ifdef.

I applied this patch. Thanks!

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>

Christoph Lameter

2010-10-05 18:57:30 UTC

There is no need anymore for the "inuse" field in the page struct.
Extend the objects field to 32 bit allowing a practically unlimited
number of objects.

Signed-off-by: Christoph Lameter <***@linux-foundation.org>

---
include/linux/mm_types.h | 5 +----
mm/slub.c | 7 -------
2 files changed, 1 insertion(+), 11 deletions(-)

Index: linux-2.6/include/linux/mm_types.h
===================================================================
--- linux-2.6.orig/include/linux/mm_types.h 2010-10-04 08:14:26.000000000 -0500
+++ linux-2.6/include/linux/mm_types.h 2010-10-04 08:26:05.000000000 -0500
@@ -40,10 +40,7 @@ struct page {
* to show when page is mapped
* & limit reverse map searches.
*/
- struct { /* SLUB */
- u16 inuse;
- u16 objects;
- };
+ u32 objects; /* SLUB */
};
union {
struct {
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2010-10-04 08:26:02.000000000 -0500
+++ linux-2.6/mm/slub.c 2010-10-04 08:26:05.000000000 -0500
@@ -143,7 +143,6 @@ static inline int kmem_cache_debug(struc

#define OO_SHIFT 16
#define OO_MASK ((1 << OO_SHIFT) - 1)
-#define MAX_OBJS_PER_PAGE 65535 /* since page.objects is u16 */

/* Internal SLUB flags */
#define __OBJECT_POISON 0x80000000UL /* Poison object */
@@ -783,9 +782,6 @@ static int verify_slab(struct kmem_cache
max_objects = ((void *)page->freelist - start) / s->size;
}

- if (max_objects > MAX_OBJS_PER_PAGE)
- max_objects = MAX_OBJS_PER_PAGE;
-
if (page->objects != max_objects) {
slab_err(s, page, "Wrong number of objects. Found %d but "
"should be %d", page->objects, max_objects);
@@ -2097,9 +2093,6 @@ static inline int slab_order(int size, i
int rem;
int min_order = slub_min_order;

- if ((PAGE_SIZE << min_order) / size > MAX_OBJS_PER_PAGE)
- return get_order(size * MAX_OBJS_PER_PAGE) - 1;
-
for (order = max(min_order,
fls(min_objects * size - 1) - PAGE_SHIFT);
order <= max_order; order++) {

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>

Christoph Lameter

2010-10-05 18:57:28 UTC

This patch adds SLAB style cpu queueing and uses a new way for
managing objects in the slabs using bitmaps. It uses a percpu queue so that
free operations can be properly buffered and a bitmap for managing the
free/allocated state in the slabs. The approach uses slightly more memory
(due to the need to place large bitmaps --sized a few words--in some
slab pages) but in general does compete well in terms of space use.
The storage format using bitmaps avoids the SLAB management structure that
SLAB needs for each slab page and therefore metadata is more compact
and easily fits into a cacheline.

The SLAB scheme of not touching the object during management is adopted.
SLUB can now efficiently free and allocate cache cold objects.

The queueing scheme addresses also the issue that the free slowpath
was taken too frequently.

This patch only implements staticallly sized per cpu queues and does
not deal with NUMA queueing and shared queuing.
(A later patch introduces the infamous alien caches to SLUB.)

Signed-off-by: Christoph Lameter <***@linux-foundation.org>

---
include/linux/page-flags.h | 6
include/linux/poison.h | 1
include/linux/slub_def.h | 46 -
init/Kconfig | 14
mm/slub.c | 1165 ++++++++++++++++++++++-----------------------
5 files changed, 608 insertions(+), 624 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2010-10-04 11:00:39.000000000 -0500
+++ linux-2.6/mm/slub.c 2010-10-04 11:14:26.000000000 -0500
@@ -1,11 +1,11 @@
/*
- * SLUB: A slab allocator that limits cache line use instead of queuing
- * objects in per cpu and per node lists.
+ * SLUB: The unified slab allocator.
*
* The allocator synchronizes using per slab locks and only
* uses a centralized lock to manage a pool of partial slabs.
*
* (C) 2007 SGI, Christoph Lameter
+ * (C) 2010 Linux Foundation, Christoph Lameter
*/

#include <linux/mm.h>
@@ -83,27 +83,6 @@
* minimal so we rely on the page allocators per cpu caches for
* fast frees and allocs.
*
- * Overloading of page flags that are otherwise used for LRU management.
- *
- * PageActive The slab is frozen and exempt from list processing.
- * This means that the slab is dedicated to a purpose
- * such as satisfying allocations for a specific
- * processor. Objects may be freed in the slab while
- * it is frozen but slab_free will then skip the usual
- * list operations. It is up to the processor holding
- * the slab to integrate the slab into the slab lists
- * when the slab is no longer needed.
- *
- * One use of this flag is to mark slabs that are
- * used for allocations. Then such a slab becomes a cpu
- * slab. The cpu slab may be equipped with an additional
- * freelist that allows lockless access to
- * free objects in addition to the regular freelist
- * that requires the slab lock.
- *
- * PageError Slab requires special handling due to debug
- * options set. This moves slab handling out of
- * the fast path and disables lockless freelists.
*/

#define SLAB_DEBUG_FLAGS (SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER | \
@@ -254,38 +233,95 @@ static inline int check_valid_pointer(st
return 1;
}

-static inline void *get_freepointer(struct kmem_cache *s, void *object)
-{
- return *(void **)(object + s->offset);
-}
-
-static inline void set_freepointer(struct kmem_cache *s, void *object, void *fp)
-{
- *(void **)(object + s->offset) = fp;
-}
-
/* Loop over all objects in a slab */
#define for_each_object(__p, __s, __addr, __objects) \
for (__p = (__addr); __p < (__addr) + (__objects) * (__s)->size;\
__p += (__s)->size)

-/* Scan freelist */
-#define for_each_free_object(__p, __s, __free) \
- for (__p = (__free); __p; __p = get_freepointer((__s), __p))
-
/* Determine object index from a given position */
static inline int slab_index(void *p, struct kmem_cache *s, void *addr)
{
return (p - addr) / s->size;
}

+static inline int map_in_page_struct(struct page *page)
+{
+ return page->objects <= BITS_PER_LONG;
+}
+
+static inline unsigned long *map(struct page *page)
+{
+ if (map_in_page_struct(page))
+ return (unsigned long *)&page->freelist;
+ else
+ return page->freelist;
+}
+
+static inline int map_size(struct page *page)
+{
+ return BITS_TO_LONGS(page->objects) * sizeof(unsigned long);
+}
+
+static inline int available(struct page *page)
+{
+ return bitmap_weight(map(page), page->objects);
+}
+
+static inline int all_objects_available(struct page *page)
+{
+ return bitmap_full(map(page), page->objects);
+}
+
+static inline int all_objects_used(struct page *page)
+{
+ return bitmap_empty(map(page), page->objects);
+}
+
+static inline int inuse(struct page *page)
+{
+ return page->objects - available(page);
+}
+
+/*
+ * Basic queue functions
+ */
+
+static inline void *queue_get(struct kmem_cache_queue *q)
+{
+ return q->object[--q->objects];
+}
+
+static inline void queue_put(struct kmem_cache_queue *q, void *object)
+{
+ q->object[q->objects++] = object;
+}
+
+static inline int queue_full(struct kmem_cache_queue *q)
+{
+ return q->objects == QUEUE_SIZE;
+}
+
+static inline int queue_empty(struct kmem_cache_queue *q)
+{
+ return q->objects == 0;
+}
+
static inline struct kmem_cache_order_objects oo_make(int order,
unsigned long size)
{
- struct kmem_cache_order_objects x = {
- (order << OO_SHIFT) + (PAGE_SIZE << order) / size
- };
+ struct kmem_cache_order_objects x;
+ unsigned long objects;
+ unsigned long page_size = PAGE_SIZE << order;
+ unsigned long ws = sizeof(unsigned long);
+
+ objects = page_size / size;
+
+ if (objects > BITS_PER_LONG)
+ /* Bitmap must fit into the slab as well */
+ objects = ((page_size / ws) * BITS_PER_LONG) /
+ ((size / ws) * BITS_PER_LONG + 1);

+ x.x = (order << OO_SHIFT) + objects;
return x;
}

@@ -352,10 +388,7 @@ static struct track *get_track(struct km
{
struct track *p;

- if (s->offset)
- p = object + s->offset + sizeof(void *);
- else
- p = object + s->inuse;
+ p = object + s->inuse;

return p + alloc;
}
@@ -403,8 +436,8 @@ static void print_tracking(struct kmem_c

static void print_page_info(struct page *page)
{
- printk(KERN_ERR "INFO: Slab 0x%p objects=%u used=%u fp=0x%p flags=0x%04lx\n",
- page, page->objects, page->inuse, page->freelist, page->flags);
+ printk(KERN_ERR "INFO: Slab 0x%p objects=%u avail=%u order=%d flags=0x%04lx\n",
+ page, page->objects, available(page), compound_order(page), page->flags);

}

@@ -443,8 +476,8 @@ static void print_trailer(struct kmem_ca

print_page_info(page);

- printk(KERN_ERR "INFO: Object 0x%p @offset=%tu fp=0x%p\n\n",
- p, p - addr, get_freepointer(s, p));
+ printk(KERN_ERR "INFO: Object 0x%p @offset=%tu\n\n",
+ p, p - addr);

if (p > addr + 16)
print_section("Bytes b4", p - 16, 16);
@@ -455,10 +488,7 @@ static void print_trailer(struct kmem_ca
print_section("Redzone", p + s->objsize,
s->inuse - s->objsize);

- if (s->offset)
- off = s->offset + sizeof(void *);
- else
- off = s->inuse;
+ off = s->inuse;

if (s->flags & SLAB_STORE_USER)
off += 2 * sizeof(struct track);
@@ -495,7 +525,9 @@ static void init_object(struct kmem_cach
u8 *p = object;

if (s->flags & __OBJECT_POISON) {
- memset(p, POISON_FREE, s->objsize - 1);
+ u8 filler = (val == SLUB_RED_ACTIVE) ? POISON_INUSE : POISON_FREE;
+
+ memset(p, filler, s->objsize - 1);
p[s->objsize - 1] = POISON_END;
}

@@ -550,8 +582,6 @@ static int check_bytes_and_report(struct
*
* object address
* Bytes of the object to be managed.
- * If the freepointer may overlay the object then the free
- * pointer is the first word of the object.
*
* Poisoning uses 0x6b (POISON_FREE) and the last byte is
* 0xa5 (POISON_END)
@@ -567,9 +597,8 @@ static int check_bytes_and_report(struct
* object + s->inuse
* Meta data starts here.
*
- * A. Free pointer (if we cannot overwrite object on free)
- * B. Tracking data for SLAB_STORE_USER
- * C. Padding to reach required alignment boundary or at mininum
+ * A. Tracking data for SLAB_STORE_USER
+ * B. Padding to reach required alignment boundary or at mininum
* one word if debugging is on to be able to detect writes
* before the word boundary.
*
@@ -587,10 +616,6 @@ static int check_pad_bytes(struct kmem_c
{
unsigned long off = s->inuse; /* The end of info */

- if (s->offset)
- /* Freepointer is placed after the object. */
- off += sizeof(void *);
-
if (s->flags & SLAB_STORE_USER)
/* We also have user information there */
off += 2 * sizeof(struct track);
@@ -615,15 +640,42 @@ static int slab_pad_check(struct kmem_ca
return 1;

start = page_address(page);
- length = (PAGE_SIZE << compound_order(page));
- end = start + length;
- remainder = length % s->size;
+ end = start + (PAGE_SIZE << compound_order(page));
+
+ /* Check for special case of bitmap at the end of the page */
+ if (!map_in_page_struct(page)) {
+ if ((u8 *)page->freelist > start && (u8 *)page->freelist < end)
+ end = page->freelist;
+ else
+ slab_err(s, page, "pagemap pointer invalid =%p start=%p end=%p objects=%d",
+ page->freelist, start, end, page->objects);
+ }
+
+ length = end - start;
+ remainder = length - page->objects * s->size;
if (!remainder)
return 1;

fault = check_bytes(end - remainder, POISON_INUSE, remainder);
- if (!fault)
- return 1;
+ if (!fault) {
+ u8 *freelist_end;
+
+ if (map_in_page_struct(page))
+ return 1;
+
+ end = start + (PAGE_SIZE << compound_order(page));
+ freelist_end = page->freelist + map_size(page);
+ remainder = end - freelist_end;
+
+ if (!remainder)
+ return 1;
+
+ fault = check_bytes(freelist_end, POISON_INUSE,
+ remainder);
+ if (!fault)
+ return 1;
+ }
+
while (end > fault && end[-1] == POISON_INUSE)
end--;

@@ -663,25 +715,6 @@ static int check_object(struct kmem_cach
*/
check_pad_bytes(s, page, p);
}
-
- if (!s->offset && val == SLUB_RED_ACTIVE)
- /*
- * Object and freepointer overlap. Cannot check
- * freepointer while object is allocated.
- */
- return 1;
-
- /* Check free pointer validity */
- if (!check_valid_pointer(s, page, get_freepointer(s, p))) {
- object_err(s, page, p, "Freepointer corrupt");
- /*
- * No choice but to zap it and thus lose the remainder
- * of the free objects in this slab. May cause
- * another error because the object count is now wrong.
- */
- set_freepointer(s, p, NULL);
- return 0;
- }
return 1;
}

@@ -702,51 +735,45 @@ static int check_slab(struct kmem_cache
s->name, page->objects, maxobj);
return 0;
}
- if (page->inuse > page->objects) {
- slab_err(s, page, "inuse %u > max %u",
- s->name, page->inuse, page->objects);
- return 0;
- }
+
/* Slab_pad_check fixes things up after itself */
slab_pad_check(s, page);
return 1;
}

/*
- * Determine if a certain object on a page is on the freelist. Must hold the
- * slab lock to guarantee that the chains are in a consistent state.
+ * Determine if a certain object on a page is on the free map.
*/
-static int on_freelist(struct kmem_cache *s, struct page *page, void *search)
+static int object_marked_free(struct kmem_cache *s, struct page *page, void *search)
+{
+ return test_bit(slab_index(search, s, page_address(page)), map(page));
+}
+
+/* Verify the integrity of the metadata in a slab page */
+static int verify_slab(struct kmem_cache *s, struct page *page)
{
int nr = 0;
- void *fp = page->freelist;
- void *object = NULL;
unsigned long max_objects;
+ void *start = page_address(page);
+ unsigned long size = PAGE_SIZE << compound_order(page);

- while (fp && nr <= page->objects) {
- if (fp == search)
- return 1;
- if (!check_valid_pointer(s, page, fp)) {
- if (object) {
- object_err(s, page, object,
- "Freechain corrupt");
- set_freepointer(s, object, NULL);
- break;
- } else {
- slab_err(s, page, "Freepointer corrupt");
- page->freelist = NULL;
- page->inuse = page->objects;
- slab_fix(s, "Freelist cleared");
- return 0;
- }
- break;
- }
- object = fp;
- fp = get_freepointer(s, object);
- nr++;
+ nr = available(page);
+
+ if (map_in_page_struct(page))
+ max_objects = size / s->size;
+ else {
+ if (page->freelist <= start || page->freelist >= start + size) {
+ slab_err(s, page, "Invalid pointer to bitmap of free objects max_objects=%d!",
+ page->objects);
+ /* Switch to bitmap in page struct */
+ page->objects = max_objects = BITS_PER_LONG;
+ page->freelist = 0L;
+ slab_fix(s, "Slab sized for %d objects. ALl objects marked in use.",
+ BITS_PER_LONG);
+ } else
+ max_objects = ((void *)page->freelist - start) / s->size;
}

- max_objects = (PAGE_SIZE << compound_order(page)) / s->size;
if (max_objects > MAX_OBJS_PER_PAGE)
max_objects = MAX_OBJS_PER_PAGE;

@@ -755,24 +782,19 @@ static int on_freelist(struct kmem_cache
"should be %d", page->objects, max_objects);
page->objects = max_objects;
slab_fix(s, "Number of objects adjusted.");
+ return 0;
}
- if (page->inuse != page->objects - nr) {
- slab_err(s, page, "Wrong object count. Counter is %d but "
- "counted were %d", page->inuse, page->objects - nr);
- page->inuse = page->objects - nr;
- slab_fix(s, "Object count adjusted.");
- }
- return search == NULL;
+ return 1;
}

static void trace(struct kmem_cache *s, struct page *page, void *object,
int alloc)
{
if (s->flags & SLAB_TRACE) {
- printk(KERN_INFO "TRACE %s %s 0x%p inuse=%d fp=0x%p\n",
+ printk(KERN_INFO "TRACE %s %s 0x%p free=%d fp=0x%p\n",
s->name,
alloc ? "alloc" : "free",
- object, page->inuse,
+ object, available(page),
page->freelist);

if (!alloc)
@@ -818,22 +840,24 @@ static inline void slab_free_hook_irq(st
/*
* Tracking of fully allocated slabs for debugging purposes.
*/
-static void add_full(struct kmem_cache_node *n, struct page *page)
+static inline void add_full(struct kmem_cache *s,
+ struct kmem_cache_node *n, struct page *page)
{
+
+ if (!(s->flags & SLAB_STORE_USER))
+ return;
+
spin_lock(&n->list_lock);
list_add(&page->lru, &n->full);
spin_unlock(&n->list_lock);
}

-static void remove_full(struct kmem_cache *s, struct page *page)
+static inline void remove_full(struct kmem_cache *s,
+ struct kmem_cache_node *n, struct page *page)
{
- struct kmem_cache_node *n;
-
if (!(s->flags & SLAB_STORE_USER))
return;

- n = get_node(s, page_to_nid(page));
-
spin_lock(&n->list_lock);
list_del(&page->lru);
spin_unlock(&n->list_lock);
@@ -886,23 +910,28 @@ static void setup_object_debug(struct km
init_tracking(s, object);
}

-static noinline int alloc_debug_processing(struct kmem_cache *s, struct page *page,
- void *object, unsigned long addr)
+static noinline int alloc_debug_processing(struct kmem_cache *s,
+ void *object, unsigned long addr)
{
+ struct page *page = virt_to_head_page(object);
+
if (!check_slab(s, page))
goto bad;

- if (!on_freelist(s, page, object)) {
- object_err(s, page, object, "Object already allocated");
+ if (!check_valid_pointer(s, page, object)) {
+ object_err(s, page, object, "Pointer check fails");
goto bad;
}

- if (!check_valid_pointer(s, page, object)) {
- object_err(s, page, object, "Freelist Pointer check fails");
+ if (object_marked_free(s, page, object)) {
+ object_err(s, page, object, "Allocated object still marked free in slab");
goto bad;
}

- if (!check_object(s, page, object, SLUB_RED_INACTIVE))
+ if (!check_object(s, page, object, SLUB_RED_QUEUE))
+ goto bad;
+
+ if (!verify_slab(s, page))
goto bad;

/* Success perform special debug activities for allocs */
@@ -920,8 +949,7 @@ bad:
* as used avoids touching the remaining objects.
*/
slab_fix(s, "Marking all objects used");
- page->inuse = page->objects;
- page->freelist = NULL;
+ bitmap_zero(map(page), page->objects);
}
return 0;
}
@@ -937,7 +965,7 @@ static noinline int free_debug_processin
goto fail;
}

- if (on_freelist(s, page, object)) {
+ if (object_marked_free(s, page, object)) {
object_err(s, page, object, "Object already free");
goto fail;
}
@@ -960,13 +988,11 @@ static noinline int free_debug_processin
goto fail;
}

- /* Special debug activities for freeing objects */
- if (!PageSlubFrozen(page) && !page->freelist)
- remove_full(s, page);
if (s->flags & SLAB_STORE_USER)
set_track(s, object, TRACK_FREE, addr);
trace(s, page, object, 0);
- init_object(s, object, SLUB_RED_INACTIVE);
+ init_object(s, object, SLUB_RED_QUEUE);
+ verify_slab(s, page);
return 1;

fail:
@@ -1062,7 +1088,7 @@ static inline void setup_object_debug(st
struct page *page, void *object) {}

static inline int alloc_debug_processing(struct kmem_cache *s,
- struct page *page, void *object, unsigned long addr) { return 0; }
+ void *object, unsigned long addr) { return 0; }

static inline int free_debug_processing(struct kmem_cache *s,
struct page *page, void *object, unsigned long addr) { return 0; }
@@ -1071,7 +1097,10 @@ static inline int slab_pad_check(struct
{ return 1; }
static inline int check_object(struct kmem_cache *s, struct page *page,
void *object, u8 val) { return 1; }
-static inline void add_full(struct kmem_cache_node *n, struct page *page) {}
+static inline void add_full(struct kmem_cache *s, struct kmem_cache_node *n,
+ struct page *page) {}
+static inline void remove_full(struct kmem_cache *s,
+ struct kmem_cache_node *n, struct page *page) {}
static inline unsigned long kmem_cache_flags(unsigned long objsize,
unsigned long flags, const char *name,
void (*ctor)(void *))
@@ -1185,8 +1214,8 @@ static struct page *new_slab(struct kmem
{
struct page *page;
void *start;
- void *last;
void *p;
+ unsigned long size;

BUG_ON(flags & GFP_SLAB_BUG_MASK);

@@ -1198,23 +1227,20 @@ static struct page *new_slab(struct kmem
inc_slabs_node(s, page_to_nid(page), page->objects);
page->slab = s;
page->flags |= 1 << PG_slab;
-
start = page_address(page);
+ size = PAGE_SIZE << compound_order(page);

if (unlikely(s->flags & SLAB_POISON))
- memset(start, POISON_INUSE, PAGE_SIZE << compound_order(page));
+ memset(start, POISON_INUSE, size);

- last = start;
- for_each_object(p, s, start, page->objects) {
- setup_object(s, page, last);
- set_freepointer(s, last, p);
- last = p;
- }
- setup_object(s, page, last);
- set_freepointer(s, last, NULL);
+ if (!map_in_page_struct(page))
+ page->freelist = start + page->objects * s->size;
+
+ bitmap_fill(map(page), page->objects);
+
+ for_each_object(p, s, start, page->objects)
+ setup_object(s, page, p);

- page->freelist = start;
- page->inuse = 0;
out:
return page;
}
@@ -1242,6 +1268,7 @@ static void __free_slab(struct kmem_cach

__ClearPageSlab(page);
reset_page_mapcount(page);
+ stat(s, FREE_SLAB);
if (current->reclaim_state)
current->reclaim_state->reclaimed_slab += pages;
__free_pages(page, order);
@@ -1307,6 +1334,7 @@ static void add_partial(struct kmem_cach
list_add_tail(&page->lru, &n->partial);
else
list_add(&page->lru, &n->partial);
+ __SetPageSlubPartial(page);
spin_unlock(&n->list_lock);
}

@@ -1315,12 +1343,11 @@ static inline void __remove_partial(stru
{
list_del(&page->lru);
n->nr_partial--;
+ __ClearPageSlubPartial(page);
}

-static void remove_partial(struct kmem_cache *s, struct page *page)
+static void remove_partial(struct kmem_cache_node *n, struct page *page)
{
- struct kmem_cache_node *n = get_node(s, page_to_nid(page));
-
spin_lock(&n->list_lock);
__remove_partial(n, page);
spin_unlock(&n->list_lock);
@@ -1336,7 +1363,6 @@ static inline int lock_and_freeze_slab(s
{
if (slab_trylock(page)) {
__remove_partial(n, page);
- __SetPageSlubFrozen(page);
return 1;
}
return 0;
@@ -1439,116 +1465,163 @@ static struct page *get_partial(struct k
}

/*
- * Move a page back to the lists.
- *
- * Must be called with the slab lock held.
- *
- * On exit the slab lock will have been dropped.
+ * Move the vector of objects back to the slab pages they came from
*/
-static void unfreeze_slab(struct kmem_cache *s, struct page *page, int tail)
- __releases(bitlock)
+void drain_objects(struct kmem_cache *s, void **object, int nr)
{
- struct kmem_cache_node *n = get_node(s, page_to_nid(page));
+ int i;

- __ClearPageSlubFrozen(page);
- if (page->inuse) {
+ for (i = 0 ; i < nr; ) {

- if (page->freelist) {
- add_partial(n, page, tail);
- stat(s, tail ? DEACTIVATE_TO_TAIL : DEACTIVATE_TO_HEAD);
- } else {
- stat(s, DEACTIVATE_FULL);
- if (kmem_cache_debug(s) && (s->flags & SLAB_STORE_USER))
- add_full(n, page);
+ void *p = object[i];
+ struct page *page = virt_to_head_page(p);
+ void *addr = page_address(page);
+ unsigned long size = PAGE_SIZE << compound_order(page);
+ unsigned long *m;
+ unsigned long offset;
+ struct kmem_cache_node *n;
+
+#ifdef CONFIG_SLUB_DEBUG
+ if (kmem_cache_debug(s) && !PageSlab(page)) {
+ object_err(s, page, p, "Object from non-slab page");
+ i++;
+ continue;
}
- slab_unlock(page);
- } else {
- stat(s, DEACTIVATE_EMPTY);
- if (n->nr_partial < s->min_partial) {
+#endif
+ slab_lock(page);
+ m = map(page);
+
+ offset = p - addr;
+
+ while (i < nr) {
+
+ int bit;
+ unsigned long new_offset;
+
+ if (offset >= size)
+ break;
+
+#ifdef CONFIG_SLUB_DEBUG
+ if (kmem_cache_debug(s) && offset % s->size) {
+ object_err(s, page, object[i], "Misaligned object");
+ i++;
+ p = object[i];
+ new_offset = p - addr;
+ continue;
+ }
+#endif
+
+ bit = offset / s->size;
+
/*
- * Adding an empty slab to the partial slabs in order
- * to avoid page allocator overhead. This slab needs
- * to come after the other slabs with objects in
- * so that the others get filled first. That way the
- * size of the partial list stays small.
- *
- * kmem_cache_shrink can reclaim any empty slabs from
- * the partial list.
+ * Fast loop to fold a sequence of objects into the slab
+ * avoiding division and virt_to_head_page()
*/
- add_partial(n, page, 1);
+ do {
+#ifdef CONFIG_SLUB_DEBUG
+
+ if (kmem_cache_debug(s)) {
+ u8 *endobject = p + s->objsize;
+ int redlen = s->inuse - s->objsize;
+
+ if (s->flags & SLAB_RED_ZONE && check_bytes(endobject, SLUB_RED_QUEUE, redlen))
+ object_err(s, page, p, "Object not on queue while draining");
+ else {
+ if (unlikely(__test_and_set_bit(bit, m)))
+ object_err(s, page, p, "Double free");
+ init_object(s, p, SLUB_RED_INACTIVE);
+ }
+ } else
+#endif
+ __set_bit(bit, m);
+
+ i++;
+ p = object[i];
+ bit++;
+ offset += s->size;
+ new_offset = p - addr;
+
+ } while (new_offset == offset && i < nr && new_offset < size);
+
+ offset = new_offset;
+ }
+ n = get_node(s, page_to_nid(page));
+ if (bitmap_full(m, page->objects) && n->nr_partial > s->min_partial) {
+
+ /* All objects are available now */
+ if (PageSlubPartial(page)) {
+
+ remove_partial(n, page);
+ stat(s, FREE_REMOVE_PARTIAL);
+ } else
+ remove_full(s, n, page);
+
slab_unlock(page);
+ discard_slab(s, page);
+
} else {
+
+ /* Some object are available now */
+ if (!PageSlubPartial(page)) {
+
+ /* Slab had no free objects but has them now */
+ remove_full(s, n, page);
+ add_partial(n, page, 0);
+ stat(s, FREE_ADD_PARTIAL);
+ }
slab_unlock(page);
- stat(s, FREE_SLAB);
- discard_slab(s, page);
}
}
}

-/*
- * Remove the cpu slab
- */
-static void deactivate_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
- __releases(bitlock)
+static inline int drain_queue(struct kmem_cache *s,
+ struct kmem_cache_queue *q, int nr)
{
- struct page *page = c->page;
- int tail = 1;
+ int t = min(nr, q->objects);

- if (page->freelist)
- stat(s, DEACTIVATE_REMOTE_FREES);
- /*
- * Merge cpu freelist into slab freelist. Typically we get here
- * because both freelists are empty. So this is unlikely
- * to occur.
- */
- while (unlikely(c->freelist)) {
- void **object;
-
- tail = 0; /* Hot objects. Put the slab first */
-
- /* Retrieve object from cpu_freelist */
- object = c->freelist;
- c->freelist = get_freepointer(s, c->freelist);
+ drain_objects(s, q->object, t);

- /* And put onto the regular freelist */
- set_freepointer(s, object, page->freelist);
- page->freelist = object;
- page->inuse--;
- }
- c->page = NULL;
- unfreeze_slab(s, page, tail);
+ q->objects -= t;
+ if (q->objects)
+ memcpy(q->object, q->object + t,
+ q->objects * sizeof(void *));
+ return t;
}

-static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
+/*
+ * Drain all objects from a per cpu queue
+ */
+static void flush_cpu_objects(struct kmem_cache *s, struct kmem_cache_cpu *c)
{
- stat(s, CPUSLAB_FLUSH);
- slab_lock(c->page);
- deactivate_slab(s, c);
+ struct kmem_cache_queue *q = &c->q;
+
+ drain_queue(s, q, q->objects);
+ stat(s, QUEUE_FLUSH);
}

/*
- * Flush cpu slab.
+ * Flush cpu objects.
*
* Called from IPI handler with interrupts disabled.
*/
-static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu)
+static void __flush_cpu_objects(void *d)
{
- struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
+ struct kmem_cache *s = d;
+ struct kmem_cache_cpu *c = __this_cpu_ptr(s->cpu_slab);

- if (likely(c && c->page))
- flush_slab(s, c);
+ if (c->q.objects)
+ flush_cpu_objects(s, c);
}

-static void flush_cpu_slab(void *d)
+static void flush_all(struct kmem_cache *s)
{
- struct kmem_cache *s = d;
-
- __flush_cpu_slab(s, smp_processor_id());
+ on_each_cpu(__flush_cpu_objects, s, 1);
}

-static void flush_all(struct kmem_cache *s)
+struct kmem_cache_cpu *alloc_kmem_cache_cpu(struct kmem_cache *s, int n)
{
- on_each_cpu(flush_cpu_slab, s, 1);
+ return __alloc_percpu(sizeof(struct kmem_cache_cpu),
+ __alignof__(struct kmem_cache_cpu));
}

/*
@@ -1564,11 +1637,6 @@ static inline int node_match(struct kmem
return 1;
}

-static int count_free(struct page *page)
-{
- return page->objects - page->inuse;
-}
-
static unsigned long count_partial(struct kmem_cache_node *n,
int (*get_count)(struct page *))
{
@@ -1606,7 +1674,7 @@ slab_out_of_memory(struct kmem_cache *s,

if (oo_order(s->min) > get_order(s->objsize))
printk(KERN_WARNING " %s debugging increased min order, use "
- "slub_debug=O to disable.\n", s->name);
+ "slub_debug=O to disable.\n", s->name);

for_each_online_node(node) {
struct kmem_cache_node *n = get_node(s, node);
@@ -1617,7 +1685,7 @@ slab_out_of_memory(struct kmem_cache *s,
if (!n)
continue;

- nr_free = count_partial(n, count_free);
+ nr_free = count_partial(n, available);
nr_slabs = node_nr_slabs(n);
nr_objs = node_nr_objs(n);

@@ -1628,139 +1696,156 @@ slab_out_of_memory(struct kmem_cache *s,
}

/*
- * Slow path. The lockless freelist is empty or we need to perform
- * debugging duties.
- *
- * Interrupts are disabled.
- *
- * Processing is still very fast if new objects have been freed to the
- * regular freelist. In that case we simply take over the regular freelist
- * as the lockless freelist and zap the regular freelist.
- *
- * If that is not working then we fall back to the partial lists. We take the
- * first element of the freelist as the object to allocate now and move the
- * rest of the freelist to the lockless freelist.
- *
- * And if we were unable to get a new slab from the partial slab lists then
- * we need to allocate a new slab. This is the slowest path since it involves
- * a call to the page allocator and the setup of a new slab.
- */
-static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
- unsigned long addr, struct kmem_cache_cpu *c)
-{
- void **object;
- struct page *new;
-
- /* We handle __GFP_ZERO in the caller */
- gfpflags &= ~__GFP_ZERO;
-
- if (!c->page)
- goto new_slab;
-
- slab_lock(c->page);
- if (unlikely(!node_match(c, node)))
- goto another_slab;
-
- stat(s, ALLOC_REFILL);
-
-load_freelist:
- object = c->page->freelist;
- if (unlikely(!object))
- goto another_slab;
- if (kmem_cache_debug(s))
- goto debug;
-
- c->freelist = get_freepointer(s, object);
- c->page->inuse = c->page->objects;
- c->page->freelist = NULL;
- c->node = page_to_nid(c->page);
-unlock_out:
- slab_unlock(c->page);
- stat(s, ALLOC_SLOWPATH);
- return object;
+ * Retrieve pointers to nr objects from a slab into the object array.
+ * Slab must be locked.
+ */
+void retrieve_objects(struct kmem_cache *s, struct page *page, void **object, int nr)
+{
+ void *addr = page_address(page);
+ unsigned long *m = map(page);
+
+ while (nr > 0) {
+ int i = find_first_bit(m, page->objects);
+ void *a;

-another_slab:
- deactivate_slab(s, c);
+ __clear_bit(i, m);
+ a = addr + i * s->size;

-new_slab:
- new = get_partial(s, gfpflags, node);
- if (new) {
- c->page = new;
- stat(s, ALLOC_FROM_PARTIAL);
- goto load_freelist;
- }
-
- gfpflags &= gfp_allowed_mask;
- if (gfpflags & __GFP_WAIT)
- local_irq_enable();
-
- new = new_slab(s, gfpflags, node);
-
- if (gfpflags & __GFP_WAIT)
- local_irq_disable();
-
- if (new) {
- c = __this_cpu_ptr(s->cpu_slab);
- stat(s, ALLOC_SLAB);
- if (c->page)
- flush_slab(s, c);
- slab_lock(new);
- __SetPageSlubFrozen(new);
- c->page = new;
- goto load_freelist;
+ /*
+ * Fast loop to get a sequence of objects out of the slab
+ * without find_first_bit() and multiplication
+ */
+ do {
+ nr--;
+ object[nr] = a;
+#ifdef CONFIG_SLUB_DEBUG
+ if (kmem_cache_debug(s)) {
+ check_object(s, page, a, SLUB_RED_INACTIVE);
+ init_object(s, a, SLUB_RED_QUEUE);
+ }
+#endif
+ a += s->size;
+ i++;
+ } while (nr > 0 && i < page->objects && __test_and_clear_bit(i, m));
}
- if (!(gfpflags & __GFP_NOWARN) && printk_ratelimit())
- slab_out_of_memory(s, gfpflags, node);
- return NULL;
-debug:
- if (!alloc_debug_processing(s, c->page, object, addr))
- goto another_slab;
+}
+
+static inline void refill_queue(struct kmem_cache *s,
+ struct kmem_cache_queue *q, struct page *page, int nr)
+{
+ int d;
+ int batch = min_t(int, QUEUE_SIZE, BATCH_SIZE);

- c->page->inuse++;
- c->page->freelist = get_freepointer(s, object);
- c->node = -1;
- goto unlock_out;
+ d = min(batch - q->objects, nr);
+ retrieve_objects(s, page, q->object + q->objects, d);
+ q->objects += d;
}

-/*
- * Inlined fastpath so that allocation functions (kmalloc, kmem_cache_alloc)
- * have the fastpath folded into their functions. So no function call
- * overhead for requests that can be satisfied on the fastpath.
- *
- * The fastpath works by first checking if the lockless freelist can be used.
- * If not then __slab_alloc is called for slow processing.
- *
- * Otherwise we can simply pick the next object from the lockless free list.
- */
-static __always_inline void *slab_alloc(struct kmem_cache *s,
+void to_lists(struct kmem_cache *s, struct page *page, int tail)
+{
+ if (!all_objects_used(page))
+
+ add_partial(get_node(s, page_to_nid(page)), page, tail);
+
+ else
+ add_full(s, get_node(s, page_to_nid(page)), page);
+}
+
+/* Handling of objects from other nodes */
+
+static void slab_free_alien(struct kmem_cache *s,
+ struct kmem_cache_cpu *c, struct page *page, void *object, int node)
+{
+#ifdef CONFIG_NUMA
+ /* Direct free to the slab */
+ drain_objects(s, &object, 1);
+#endif
+}
+
+/* Generic allocation */
+
+static void *slab_alloc(struct kmem_cache *s,
gfp_t gfpflags, int node, unsigned long addr)
{
- void **object;
+ void *object;
struct kmem_cache_cpu *c;
+ struct kmem_cache_queue *q;
unsigned long flags;

if (slab_pre_alloc_hook(s, gfpflags))
return NULL;

+redo:
local_irq_save(flags);
c = __this_cpu_ptr(s->cpu_slab);
- object = c->freelist;
- if (unlikely(!object || !node_match(c, node)))
+ q = &c->q;
+ if (unlikely(queue_empty(q) || !node_match(c, node))) {

- object = __slab_alloc(s, gfpflags, node, addr, c);
+ if (unlikely(!node_match(c, node))) {
+ flush_cpu_objects(s, c);
+ c->node = node;
+ }

- else {
- c->freelist = get_freepointer(s, object);
+ while (q->objects < BATCH_SIZE) {
+ struct page *new;
+
+ new = get_partial(s, gfpflags & ~__GFP_ZERO, node);
+ if (unlikely(!new)) {
+
+ gfpflags &= gfp_allowed_mask;
+
+ if (gfpflags & __GFP_WAIT)
+ local_irq_enable();
+
+ new = new_slab(s, gfpflags, node);
+
+ if (gfpflags & __GFP_WAIT)
+ local_irq_disable();
+
+ /* process may have moved to different cpu */
+ c = __this_cpu_ptr(s->cpu_slab);
+ q = &c->q;
+
+ if (!new) {
+ if (queue_empty(q))
+ goto oom;
+ break;
+ }
+ stat(s, ALLOC_SLAB);
+ slab_lock(new);
+ } else
+ stat(s, ALLOC_FROM_PARTIAL);
+
+ refill_queue(s, q, new, available(new));
+ to_lists(s, new, 0);
+
+ slab_unlock(new);
+ }
+ stat(s, ALLOC_SLOWPATH);
+
+ } else
stat(s, ALLOC_FASTPATH);
+
+ object = queue_get(q);
+
+ if (kmem_cache_debug(s)) {
+ if (!alloc_debug_processing(s, object, addr))
+ goto redo;
}
local_irq_restore(flags);

- if (unlikely(gfpflags & __GFP_ZERO) && object)
+ if (unlikely(gfpflags & __GFP_ZERO))
memset(object, 0, s->objsize);

slab_post_alloc_hook(s, gfpflags, object);

return object;
+
+oom:
+ local_irq_restore(flags);
+ if (!(gfpflags & __GFP_NOWARN) && printk_ratelimit())
+ slab_out_of_memory(s, gfpflags, node);
+ return NULL;
}

void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
@@ -1787,7 +1872,7 @@ void *kmem_cache_alloc_node(struct kmem_
void *ret = slab_alloc(s, gfpflags, node, _RET_IP_);

trace_kmem_cache_alloc_node(_RET_IP_, ret,
- s->objsize, s->size, gfpflags, node);
+ s->objsize, s->size, gfpflags, node);

return ret;
}
@@ -1804,114 +1889,52 @@ EXPORT_SYMBOL(kmem_cache_alloc_node_notr
#endif
#endif

-/*
- * Slow patch handling. This may still be called frequently since objects
- * have a longer lifetime than the cpu slabs in most processing loads.
- *
- * So we still attempt to reduce cache line usage. Just take the slab
- * lock and free the item. If there is no additional partial page
- * handling required then we can return immediately.
- */
-static void __slab_free(struct kmem_cache *s, struct page *page,
+static void slab_free(struct kmem_cache *s, struct page *page,
void *x, unsigned long addr)
{
- void *prior;
- void **object = (void *)x;
-
- stat(s, FREE_SLOWPATH);
- slab_lock(page);
+ struct kmem_cache_cpu *c;
+ struct kmem_cache_queue *q;
+ unsigned long flags;

- if (kmem_cache_debug(s))
- goto debug;
+ slab_free_hook(s, x);

-checks_ok:
- prior = page->freelist;
- set_freepointer(s, object, prior);
- page->freelist = object;
- page->inuse--;
-
- if (unlikely(PageSlubFrozen(page))) {
- stat(s, FREE_FROZEN);
- goto out_unlock;
- }
+ local_irq_save(flags);
+ if (kmem_cache_debug(s)
+ && !free_debug_processing(s, page, x, addr))
+ goto out;

- if (unlikely(!page->inuse))
- goto slab_empty;
+ slab_free_hook_irq(s, x);

- /*
- * Objects left in the slab. If it was not on the partial list before
- * then add it.
- */
- if (unlikely(!prior)) {
- add_partial(get_node(s, page_to_nid(page)), page, 1);
- stat(s, FREE_ADD_PARTIAL);
- }
+ c = __this_cpu_ptr(s->cpu_slab);

-out_unlock:
- slab_unlock(page);
- return;
+ if (NUMA_BUILD) {
+ int node = page_to_nid(page);

-slab_empty:
- if (prior) {
- /*
- * Slab still on the partial list.
- */
- remove_partial(s, page);
- stat(s, FREE_REMOVE_PARTIAL);
+ if (unlikely(node != c->node)) {
+ slab_free_alien(s, c, page, x, node);
+ stat(s, FREE_ALIEN);
+ goto out;
+ }
}
- slab_unlock(page);
- stat(s, FREE_SLAB);
- discard_slab(s, page);
- return;
-
-debug:
- if (!free_debug_processing(s, page, x, addr))
- goto out_unlock;
- goto checks_ok;
-}

-/*
- * Fastpath with forced inlining to produce a kfree and kmem_cache_free that
- * can perform fastpath freeing without additional function calls.
- *
- * The fastpath is only possible if we are freeing to the current cpu slab
- * of this processor. This typically the case if we have just allocated
- * the item before.
- *
- * If fastpath is not possible then fall back to __slab_free where we deal
- * with all sorts of special processing.
- */
-static __always_inline void slab_free(struct kmem_cache *s,
- struct page *page, void *x, unsigned long addr)
-{
- void **object = (void *)x;
- struct kmem_cache_cpu *c;
- unsigned long flags;
+ q = &c->q;

- slab_free_hook(s, x);
+ if (unlikely(queue_full(q))) {

- local_irq_save(flags);
- c = __this_cpu_ptr(s->cpu_slab);
+ drain_queue(s, q, BATCH_SIZE);
+ stat(s, FREE_SLOWPATH);

- slab_free_hook_irq(s, x);
-
- if (likely(page == c->page && c->node >= 0)) {
- set_freepointer(s, object, c->freelist);
- c->freelist = object;
- stat(s, FREE_FASTPATH);
} else
- __slab_free(s, page, x, addr);
+ stat(s, FREE_FASTPATH);

+ queue_put(q, x);
+out:
local_irq_restore(flags);
}

void kmem_cache_free(struct kmem_cache *s, void *x)
{
- struct page *page;
-
- page = virt_to_head_page(x);
-
- slab_free(s, page, x, _RET_IP_);
+ slab_free(s, virt_to_head_page(x), x, _RET_IP_);

trace_kmem_cache_free(_RET_IP_, x);
}
@@ -1929,11 +1952,6 @@ static struct page *get_object_page(cons
}

/*
- * Object placement in a slab is made very easy because we always start at
- * offset 0. If we tune the size of the object to the alignment then we can
- * get the required alignment by putting one properly sized object after
- * another.
- *
* Notice that the allocation order determines the sizes of the per cpu
* caches. Each processor has always one slab available for allocations.
* Increasing the allocation order reduces the number of times that slabs
@@ -2028,7 +2046,7 @@ static inline int calculate_order(int si
*/
min_objects = slub_min_objects;
if (!min_objects)
- min_objects = 4 * (fls(nr_cpu_ids) + 1);
+ min_objects = min(BITS_PER_LONG, 4 * (fls(nr_cpu_ids) + 1));
max_objects = (PAGE_SIZE << slub_max_order)/size;
min_objects = min(min_objects, max_objects);

@@ -2139,10 +2157,7 @@ static void early_kmem_cache_node_alloc(
"in order to be able to continue\n");
}

- n = page->freelist;
- BUG_ON(!n);
- page->freelist = get_freepointer(kmem_cache_node, n);
- page->inuse++;
+ retrieve_objects(kmem_cache_node, page, (void **)&n, 1);
kmem_cache_node->node[node] = n;
#ifdef CONFIG_SLUB_DEBUG
init_object(kmem_cache_node, n, SLUB_RED_ACTIVE);
@@ -2216,10 +2231,11 @@ static void set_min_partial(struct kmem_
static int calculate_sizes(struct kmem_cache *s, int forced_order)
{
unsigned long flags = s->flags;
- unsigned long size = s->objsize;
+ unsigned long size;
unsigned long align = s->align;
int order;

+ size = s->objsize;
/*
* Round up object size to the next word boundary. We can only
* place the free pointer at word boundaries and this determines
@@ -2251,24 +2267,10 @@ static int calculate_sizes(struct kmem_c

/*
* With that we have determined the number of bytes in actual use
- * by the object. This is the potential offset to the free pointer.
+ * by the object.
*/
s->inuse = size;

- if (((flags & (SLAB_DESTROY_BY_RCU | SLAB_POISON)) ||
- s->ctor)) {
- /*
- * Relocate free pointer after the object if it is not
- * permitted to overwrite the first word of the object on
- * kmem_cache_free.
- *
- * This is the case if we do RCU, have a constructor or
- * destructor or are poisoning the objects.
- */
- s->offset = size;
- size += sizeof(void *);
- }
-
#ifdef CONFIG_SLUB_DEBUG
if (flags & SLAB_STORE_USER)
/*
@@ -2354,7 +2356,6 @@ static int kmem_cache_open(struct kmem_c
*/
if (get_order(s->size) > get_order(s->objsize)) {
s->flags &= ~DEBUG_METADATA_FLAGS;
- s->offset = 0;
if (!calculate_sizes(s, -1))
goto error;
}
@@ -2379,9 +2380,9 @@ static int kmem_cache_open(struct kmem_c
error:
if (flags & SLAB_PANIC)
panic("Cannot create slab %s size=%lu realsize=%u "
- "order=%u offset=%u flags=%lx\n",
+ "order=%u flags=%lx\n",
s->name, (unsigned long)size, s->size, oo_order(s->oo),
- s->offset, flags);
+ flags);
return 0;
}

@@ -2435,18 +2436,14 @@ static void list_slab_objects(struct kme
#ifdef CONFIG_SLUB_DEBUG
void *addr = page_address(page);
void *p;
- unsigned long *map = kzalloc(BITS_TO_LONGS(page->objects) *
- sizeof(long), GFP_ATOMIC);
- if (!map)
- return;
+ long *m = map(page);
+
slab_err(s, page, "%s", text);
slab_lock(page);
- for_each_free_object(p, s, page->freelist)
- set_bit(slab_index(p, s, addr), map);

for_each_object(p, s, addr, page->objects) {

- if (!test_bit(slab_index(p, s, addr), map)) {
+ if (!test_bit(slab_index(p, s, addr), m)) {
printk(KERN_ERR "INFO: Object 0x%p @offset=%tu\n",
p, p - addr);
print_tracking(s, p);
@@ -2467,7 +2464,7 @@ static void free_partial(struct kmem_cac

spin_lock_irqsave(&n->list_lock, flags);
list_for_each_entry_safe(page, h, &n->partial, lru) {
- if (!page->inuse) {
+ if (all_objects_available(page)) {
__remove_partial(n, page);
discard_slab(s, page);
} else {
@@ -2821,7 +2818,7 @@ int kmem_cache_shrink(struct kmem_cache
* list_lock. page->inuse here is the upper limit.
*/
list_for_each_entry_safe(page, t, &n->partial, lru) {
- if (!page->inuse && slab_trylock(page)) {
+ if (all_objects_available(page) && slab_trylock(page)) {
/*
* Must hold slab lock here because slab_free
* may have freed the last object and be
@@ -2832,7 +2829,7 @@ int kmem_cache_shrink(struct kmem_cache
discard_slab(s, page);
} else {
list_move(&page->lru,
- slabs_by_inuse + page->inuse);
+ slabs_by_inuse + inuse(page));
}
}

@@ -3099,12 +3096,12 @@ void __init kmem_cache_init(void)

/* Caches that are not of the two-to-the-power-of size */
if (KMALLOC_MIN_SIZE <= 32) {
- kmalloc_caches[1] = create_kmalloc_cache("kmalloc-96", 96, 0);
+ kmalloc_caches[1] = create_kmalloc_cache("kmalloc", 96, 0);
caches++;
}

if (KMALLOC_MIN_SIZE <= 64) {
- kmalloc_caches[2] = create_kmalloc_cache("kmalloc-192", 192, 0);
+ kmalloc_caches[2] = create_kmalloc_cache("kmalloc", 192, 0);
caches++;
}

@@ -3115,22 +3112,21 @@ void __init kmem_cache_init(void)

slab_state = UP;

- /* Provide the correct kmalloc names now that the caches are up */
- if (KMALLOC_MIN_SIZE <= 32) {
- kmalloc_caches[1]->name = kstrdup(kmalloc_caches[1]->name, GFP_NOWAIT);
- BUG_ON(!kmalloc_caches[1]->name);
- }
+ /*
+ * Provide the correct kmalloc names and enable the shared caches
+ * now that the kmalloc array is functional
+ */
+ for (i = 0; i < SLUB_PAGE_SHIFT; i++) {
+ struct kmem_cache *s = kmalloc_caches[i];

- if (KMALLOC_MIN_SIZE <= 64) {
- kmalloc_caches[2]->name = kstrdup(kmalloc_caches[2]->name, GFP_NOWAIT);
- BUG_ON(!kmalloc_caches[2]->name);
- }
+ if (!s)
+ continue;

- for (i = KMALLOC_SHIFT_LOW; i < SLUB_PAGE_SHIFT; i++) {
- char *s = kasprintf(GFP_NOWAIT, "kmalloc-%d", 1 << i);
+ if (strcmp(s->name, "kmalloc") == 0)
+ s->name = kasprintf(GFP_NOWAIT,
+ "kmalloc-%d", s->objsize);

- BUG_ON(!s);
- kmalloc_caches[i]->name = s;
+ BUG_ON(!s->name);
}

#ifdef CONFIG_SMP
@@ -3304,7 +3300,7 @@ static int __cpuinit slab_cpuup_callback
down_read(&slub_lock);
list_for_each_entry(s, &slab_caches, list) {
local_irq_save(flags);
- __flush_cpu_slab(s, cpu);
+ flush_cpu_objects(s, per_cpu_ptr(s->cpu_slab ,cpu));
local_irq_restore(flags);
}
up_read(&slub_lock);
@@ -3376,7 +3372,7 @@ void *__kmalloc_node_track_caller(size_t
#ifdef CONFIG_SYSFS
static int count_inuse(struct page *page)
{
- return page->inuse;
+ return inuse(page);
}

static int count_total(struct page *page)
@@ -3386,54 +3382,69 @@ static int count_total(struct page *page
#endif

#ifdef CONFIG_SLUB_DEBUG
-static int validate_slab(struct kmem_cache *s, struct page *page,
- unsigned long *map)
+static int validate_slab(struct kmem_cache *s, struct page *page)
{
void *p;
void *addr = page_address(page);
+ unsigned long *m = map(page);
+ unsigned long errors = 0;

- if (!check_slab(s, page) ||
- !on_freelist(s, page, NULL))
+ if (!check_slab(s, page) || !verify_slab(s, page))
return 0;

- /* Now we know that a valid freelist exists */
- bitmap_zero(map, page->objects);
+ for_each_object(p, s, addr, page->objects) {
+ int bit = slab_index(p, s, addr);

- for_each_free_object(p, s, page->freelist) {
- set_bit(slab_index(p, s, addr), map);
- if (!check_object(s, page, p, 0))
- return 0;
+ if (test_bit(bit, m)) {
+ /* Available */
+ if (!check_object(s, page, p, SLUB_RED_INACTIVE))
+ errors++;
+ } else {
+#ifdef CONFIG_SLUB_DEBUG
+ /*
+ * We cannot check if the object is on a queue without
+ * Redzoning and therefore also the integrity checks for
+ * objects will only work with redzoning on.
+ */
+ if (s->flags & SLAB_RED_ZONE) {
+ u8 *q = p + s->objsize;
+
+ if (*q != SLUB_RED_QUEUE)
+ if (!check_object(s, page, p, SLUB_RED_ACTIVE))
+ errors++;
+ }
+#endif
+ }
}

- for_each_object(p, s, addr, page->objects)
- if (!test_bit(slab_index(p, s, addr), map))
- if (!check_object(s, page, p, 1))
- return 0;
- return 1;
+ return errors;
}

-static void validate_slab_slab(struct kmem_cache *s, struct page *page,
- unsigned long *map)
+static unsigned long validate_slab_slab(struct kmem_cache *s, struct page *page)
{
+ unsigned long errors = 0;
+
if (slab_trylock(page)) {
- validate_slab(s, page, map);
+ errors = validate_slab(s, page);
slab_unlock(page);
} else
printk(KERN_INFO "SLUB %s: Skipped busy slab 0x%p\n",
s->name, page);
+ return errors;
}

static int validate_slab_node(struct kmem_cache *s,
- struct kmem_cache_node *n, unsigned long *map)
+ struct kmem_cache_node *n)
{
unsigned long count = 0;
struct page *page;
unsigned long flags;
+ unsigned long errors;

spin_lock_irqsave(&n->list_lock, flags);

list_for_each_entry(page, &n->partial, lru) {
- validate_slab_slab(s, page, map);
+ errors += validate_slab_slab(s, page);
count++;
}
if (count != n->nr_partial)
@@ -3444,7 +3455,7 @@ static int validate_slab_node(struct kme
goto out;

list_for_each_entry(page, &n->full, lru) {
- validate_slab_slab(s, page, map);
+ validate_slab_slab(s, page);
count++;
}
if (count != atomic_long_read(&n->nr_slabs))
@@ -3454,26 +3465,20 @@ static int validate_slab_node(struct kme

out:
spin_unlock_irqrestore(&n->list_lock, flags);
- return count;
+ return errors;
}

static long validate_slab_cache(struct kmem_cache *s)
{
int node;
unsigned long count = 0;
- unsigned long *map = kmalloc(BITS_TO_LONGS(oo_objects(s->max)) *
- sizeof(unsigned long), GFP_KERNEL);
-
- if (!map)
- return -ENOMEM;

flush_all(s);
for_each_node_state(node, N_NORMAL_MEMORY) {
struct kmem_cache_node *n = get_node(s, node);

- count += validate_slab_node(s, n, map);
+ count += validate_slab_node(s, n);
}
- kfree(map);
return count;
}
/*
@@ -3603,18 +3608,14 @@ static int add_location(struct loc_track
}

static void process_slab(struct loc_track *t, struct kmem_cache *s,
- struct page *page, enum track_item alloc,
- unsigned long *map)
+ struct page *page, enum track_item alloc)
{
void *addr = page_address(page);
+ unsigned long *m = map(page);
void *p;

- bitmap_zero(map, page->objects);
- for_each_free_object(p, s, page->freelist)
- set_bit(slab_index(p, s, addr), map);
-
for_each_object(p, s, addr, page->objects)
- if (!test_bit(slab_index(p, s, addr), map))
+ if (!test_bit(slab_index(p, s, addr), m))
add_location(t, s, get_track(s, p, alloc));
}

@@ -3625,12 +3626,9 @@ static int list_locations(struct kmem_ca
unsigned long i;
struct loc_track t = { 0, 0, NULL };
int node;
- unsigned long *map = kmalloc(BITS_TO_LONGS(oo_objects(s->max)) *
- sizeof(unsigned long), GFP_KERNEL);

- if (!map || !alloc_loc_track(&t, PAGE_SIZE / sizeof(struct location),
+ if (!alloc_loc_track(&t, PAGE_SIZE / sizeof(struct location),
GFP_TEMPORARY)) {
- kfree(map);
return sprintf(buf, "Out of memory\n");
}
/* Push back cpu slabs */
@@ -3646,9 +3644,9 @@ static int list_locations(struct kmem_ca

spin_lock_irqsave(&n->list_lock, flags);
list_for_each_entry(page, &n->partial, lru)
- process_slab(&t, s, page, alloc, map);
+ process_slab(&t, s, page, alloc);
list_for_each_entry(page, &n->full, lru)
- process_slab(&t, s, page, alloc, map);
+ process_slab(&t, s, page, alloc);
spin_unlock_irqrestore(&n->list_lock, flags);
}

@@ -3699,7 +3697,6 @@ static int list_locations(struct kmem_ca
}

free_loc_track(&t);
- kfree(map);
if (!t.count)
len += sprintf(buf, "No data\n");
return len;
@@ -3779,7 +3776,6 @@ enum slab_stat_type {

#define SO_ALL (1 << SL_ALL)
#define SO_PARTIAL (1 << SL_PARTIAL)
-#define SO_CPU (1 << SL_CPU)
#define SO_OBJECTS (1 << SL_OBJECTS)
#define SO_TOTAL (1 << SL_TOTAL)

@@ -3797,30 +3793,6 @@ static ssize_t show_slab_objects(struct
return -ENOMEM;
per_cpu = nodes + nr_node_ids;

- if (flags & SO_CPU) {
- int cpu;
-
- for_each_possible_cpu(cpu) {
- struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
-
- if (!c || c->node < 0)
- continue;
-
- if (c->page) {
- if (flags & SO_TOTAL)
- x = c->page->objects;
- else if (flags & SO_OBJECTS)
- x = c->page->inuse;
- else
- x = 1;
-
- total += x;
- nodes[c->node] += x;
- }
- per_cpu[c->node]++;
- }
- }
-
down_read(&slub_lock);
#ifdef CONFIG_SLUB_DEBUG
if (flags & SO_ALL) {
@@ -3831,7 +3803,7 @@ static ssize_t show_slab_objects(struct
x = atomic_long_read(&n->total_objects);
else if (flags & SO_OBJECTS)
x = atomic_long_read(&n->total_objects) -
- count_partial(n, count_free);
+ count_partial(n, available);

else
x = atomic_long_read(&n->nr_slabs);
@@ -3897,7 +3869,7 @@ struct slab_attribute {
static struct slab_attribute _name##_attr = __ATTR_RO(_name)

#define SLAB_ATTR(_name) \
- static struct slab_attribute _name##_attr = \
+ static struct slab_attribute _name##_attr = \
__ATTR(_name, 0644, _name##_show, _name##_store)

static ssize_t slab_size_show(struct kmem_cache *s, char *buf)
@@ -3990,11 +3962,35 @@ static ssize_t partial_show(struct kmem_
}
SLAB_ATTR_RO(partial);

-static ssize_t cpu_slabs_show(struct kmem_cache *s, char *buf)
+static ssize_t cpu_queues_show(struct kmem_cache *s, char *buf)
{
- return show_slab_objects(s, buf, SO_CPU);
+ unsigned long total = 0;
+ int x;
+ int cpu;
+ unsigned long *cpus;
+
+ cpus = kzalloc(1 * sizeof(unsigned long) * nr_cpu_ids, GFP_KERNEL);
+ if (!cpus)
+ return -ENOMEM;
+
+ for_each_online_cpu(cpu) {
+ struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
+
+ total += c->q.objects;
+ }
+
+ x = sprintf(buf, "%lu", total);
+
+ for_each_online_cpu(cpu) {
+ struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
+
+ if (c->q.objects)
+ x += sprintf(buf + x, " C%d=%u", cpu, c->q.objects);
+ }
+ kfree(cpus);
+ return x + sprintf(buf + x, "\n");
}
-SLAB_ATTR_RO(cpu_slabs);
+SLAB_ATTR_RO(cpu_queues);

static ssize_t objects_show(struct kmem_cache *s, char *buf)
{
@@ -4296,19 +4292,12 @@ STAT_ATTR(ALLOC_FASTPATH, alloc_fastpath
STAT_ATTR(ALLOC_SLOWPATH, alloc_slowpath);
STAT_ATTR(FREE_FASTPATH, free_fastpath);
STAT_ATTR(FREE_SLOWPATH, free_slowpath);
-STAT_ATTR(FREE_FROZEN, free_frozen);
STAT_ATTR(FREE_ADD_PARTIAL, free_add_partial);
STAT_ATTR(FREE_REMOVE_PARTIAL, free_remove_partial);
STAT_ATTR(ALLOC_FROM_PARTIAL, alloc_from_partial);
STAT_ATTR(ALLOC_SLAB, alloc_slab);
-STAT_ATTR(ALLOC_REFILL, alloc_refill);
STAT_ATTR(FREE_SLAB, free_slab);
-STAT_ATTR(CPUSLAB_FLUSH, cpuslab_flush);
-STAT_ATTR(DEACTIVATE_FULL, deactivate_full);
-STAT_ATTR(DEACTIVATE_EMPTY, deactivate_empty);
-STAT_ATTR(DEACTIVATE_TO_HEAD, deactivate_to_head);
-STAT_ATTR(DEACTIVATE_TO_TAIL, deactivate_to_tail);
-STAT_ATTR(DEACTIVATE_REMOTE_FREES, deactivate_remote_frees);
+STAT_ATTR(QUEUE_FLUSH, queue_flush);
STAT_ATTR(ORDER_FALLBACK, order_fallback);
#endif

@@ -4321,7 +4310,7 @@ static struct attribute *slab_attrs[] =
&objects_attr.attr,
&objects_partial_attr.attr,
&partial_attr.attr,
- &cpu_slabs_attr.attr,
+ &cpu_queues_attr.attr,
&ctor_attr.attr,
&aliases_attr.attr,
&align_attr.attr,
@@ -4352,19 +4341,12 @@ static struct attribute *slab_attrs[] =
&alloc_slowpath_attr.attr,
&free_fastpath_attr.attr,
&free_slowpath_attr.attr,
- &free_frozen_attr.attr,
&free_add_partial_attr.attr,
&free_remove_partial_attr.attr,
&alloc_from_partial_attr.attr,
&alloc_slab_attr.attr,
- &alloc_refill_attr.attr,
&free_slab_attr.attr,
- &cpuslab_flush_attr.attr,
- &deactivate_full_attr.attr,
- &deactivate_empty_attr.attr,
- &deactivate_to_head_attr.attr,
- &deactivate_to_tail_attr.attr,
- &deactivate_remote_frees_attr.attr,
+ &queue_flush_attr.attr,
&order_fallback_attr.attr,
#endif
#ifdef CONFIG_FAILSLAB
@@ -4504,6 +4486,7 @@ static int sysfs_slab_add(struct kmem_ca
*/
sysfs_remove_link(&slab_kset->kobj, s->name);
name = s->name;
+
} else {
/*
* Create a unique name for the slab as a target
@@ -4681,7 +4664,7 @@ static int s_show(struct seq_file *m, vo
nr_partials += n->nr_partial;
nr_slabs += atomic_long_read(&n->nr_slabs);
nr_objs += atomic_long_read(&n->total_objects);
- nr_free += count_partial(n, count_free);
+ nr_free += count_partial(n, available);
}

nr_inuse = nr_objs - nr_free;
Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h 2010-10-04 11:00:39.000000000 -0500
+++ linux-2.6/include/linux/page-flags.h 2010-10-04 11:00:40.000000000 -0500
@@ -125,9 +125,8 @@ enum pageflags {

/* SLOB */
PG_slob_free = PG_private,
-
/* SLUB */
- PG_slub_frozen = PG_active,
+ PG_slub_partial = PG_active,
};

#ifndef __GENERATING_BOUNDS_H
@@ -212,8 +211,7 @@ PAGEFLAG(Reserved, reserved) __CLEARPAGE
PAGEFLAG(SwapBacked, swapbacked) __CLEARPAGEFLAG(SwapBacked, swapbacked)

__PAGEFLAG(SlobFree, slob_free)
-
-__PAGEFLAG(SlubFrozen, slub_frozen)
+__PAGEFLAG(SlubPartial, slub_partial)

/*
* Private page markings that may be used by the filesystem that owns the page
Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h 2010-10-04 11:00:39.000000000 -0500
+++ linux-2.6/include/linux/slub_def.h 2010-10-04 11:14:26.000000000 -0500
@@ -2,9 +2,10 @@
#define _LINUX_SLUB_DEF_H

/*
- * SLUB : A Slab allocator without object queues.
+ * SLUB : The Unified Slab allocator.
*
- * (C) 2007 SGI, Christoph Lameter
+ * (C) 2007-2008 SGI, Christoph Lameter
+ * (C) 2008-2010 Linux Foundation, Christoph Lameter
*/
#include <linux/types.h>
#include <linux/gfp.h>
@@ -15,33 +16,35 @@
#include <trace/events/kmem.h>

enum stat_item {
- ALLOC_FASTPATH, /* Allocation from cpu slab */
- ALLOC_SLOWPATH, /* Allocation by getting a new cpu slab */
- FREE_FASTPATH, /* Free to cpu slub */
- FREE_SLOWPATH, /* Freeing not to cpu slab */
- FREE_FROZEN, /* Freeing to frozen slab */
- FREE_ADD_PARTIAL, /* Freeing moves slab to partial list */
- FREE_REMOVE_PARTIAL, /* Freeing removes last object */
- ALLOC_FROM_PARTIAL, /* Cpu slab acquired from partial list */
- ALLOC_SLAB, /* Cpu slab acquired from page allocator */
- ALLOC_REFILL, /* Refill cpu slab from slab freelist */
+ ALLOC_FASTPATH, /* Allocation from cpu queue */
+ ALLOC_SLOWPATH, /* Allocation required refilling of queue */
+ FREE_FASTPATH, /* Free to cpu queue */
+ FREE_SLOWPATH, /* Required pushing objects out of the queue */
+ FREE_ADD_PARTIAL, /* Freeing moved slab to partial list */
+ FREE_REMOVE_PARTIAL, /* Freeing removed from partial list */
+ ALLOC_FROM_PARTIAL, /* slab with objects acquired from partial */
+ ALLOC_SLAB, /* New slab acquired from page allocator */
+ FREE_ALIEN, /* Free to alien node */
FREE_SLAB, /* Slab freed to the page allocator */
- CPUSLAB_FLUSH, /* Abandoning of the cpu slab */
- DEACTIVATE_FULL, /* Cpu slab was full when deactivated */
- DEACTIVATE_EMPTY, /* Cpu slab was empty when deactivated */
- DEACTIVATE_TO_HEAD, /* Cpu slab was moved to the head of partials */
- DEACTIVATE_TO_TAIL, /* Cpu slab was moved to the tail of partials */
- DEACTIVATE_REMOTE_FREES,/* Slab contained remotely freed objects */
+ QUEUE_FLUSH, /* Flushing of the per cpu queue */
ORDER_FALLBACK, /* Number of times fallback was necessary */
NR_SLUB_STAT_ITEMS };

+#define QUEUE_SIZE 50
+#define BATCH_SIZE 25
+
+/* Queueing structure used for per cpu, l3 cache and alien queueing */
+struct kmem_cache_queue {
+ int objects; /* Available objects */
+ void *object[QUEUE_SIZE];
+};
+
struct kmem_cache_cpu {
- void **freelist; /* Pointer to first free per cpu object */
- struct page *page; /* The slab from which we are allocating */
- int node; /* The node of the page (or -1 for debug) */
#ifdef CONFIG_SLUB_STATS
unsigned stat[NR_SLUB_STAT_ITEMS];
#endif
+ int node; /* objects only from this numa node */
+ struct kmem_cache_queue q;
};

struct kmem_cache_node {
@@ -73,7 +76,6 @@ struct kmem_cache {
unsigned long flags;
int size; /* The size of an object including meta data */
int objsize; /* The size of an object without meta data */
- int offset; /* Free pointer offset. */
struct kmem_cache_order_objects oo;

/* Allocation and freeing of slabs */
Index: linux-2.6/init/Kconfig
===================================================================
--- linux-2.6.orig/init/Kconfig 2010-10-04 11:00:39.000000000 -0500
+++ linux-2.6/init/Kconfig 2010-10-04 11:00:40.000000000 -0500
@@ -1091,14 +1091,14 @@ config SLAB
per cpu and per node queues.

config SLUB
- bool "SLUB (Unqueued Allocator)"
+ bool "SLUB (Unified allocator)"
help
- SLUB is a slab allocator that minimizes cache line usage
- instead of managing queues of cached objects (SLAB approach).
- Per cpu caching is realized using slabs of objects instead
- of queues of objects. SLUB can use memory efficiently
- and has enhanced diagnostics. SLUB is the default choice for
- a slab allocator.
+ SLUB is a slab allocator that minimizes metadata and provides
+ a clean implementation that is faster than SLAB. SLUB has many
+ of the queueing characteristic of the original SLAB allocator
+ but uses a bit map to manage objects in slabs. SLUB can use
+ memory more efficiently and has enhanced diagnostic and
+ resiliency features compared with SLAB.

config SLOB
depends on EMBEDDED
Index: linux-2.6/include/linux/poison.h
===================================================================
--- linux-2.6.orig/include/linux/poison.h 2010-10-04 11:00:39.000000000 -0500
+++ linux-2.6/include/linux/poison.h 2010-10-04 11:00:40.000000000 -0500
@@ -42,6 +42,7 @@

#define SLUB_RED_INACTIVE 0xbb
#define SLUB_RED_ACTIVE 0xcc
+#define SLUB_RED_QUEUE 0xdd

/* ...and for poisoning */
#define POISON_INUSE 0x5a /* for use-uninitialised poisoning */

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>

Christoph Lameter

2010-10-05 18:57:29 UTC

Allow resizing of cpu queue and batch size. This is done in the
basic steps that are also followed by SLAB.

Careful: The ->cpu pointer is becoming volatile. References
to the ->cpu pointer either

A. Occur with interrupts disabled. This guarantees that nothing on the
processor itself interferes. This only serializes access to a single
processor specific area.

B. Occur with slub_lock taken for operations on all per cpu areas.
Taking the slub_lock guarantees that no resizing operation will occur
while accessing the percpu areas. The data in the percpu areas
is volatile even with slub_lock since the alloc and free functions
do not take slub_lock and will operate on fields of kmem_cache_cpu.

C. Are racy: Tolerable for statistics. The ->cpu pointer must always
point to a valid kmem_cache_cpu area.

Signed-off-by: Christoph Lameter <***@linux-foundation.org>

---
include/linux/slub_def.h | 11 +-
mm/slub.c | 225 +++++++++++++++++++++++++++++++++++++++++------
2 files changed, 203 insertions(+), 33 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2010-10-04 11:02:09.000000000 -0500
+++ linux-2.6/mm/slub.c 2010-10-04 11:10:48.000000000 -0500
@@ -194,10 +194,19 @@ static inline void sysfs_slab_remove(str

#endif

+/*
+ * We allow stat calls while slub_lock is taken or while interrupts
+ * are enabled for simplicities sake.
+ *
+ * This results in potential inaccuracies. If the platform does not
+ * support per cpu atomic operations vs. interrupts then the counters
+ * may be updated in a racy manner due to slab processing in
+ * interrupts.
+ */
static inline void stat(struct kmem_cache *s, enum stat_item si)
{
#ifdef CONFIG_SLUB_STATS
- __this_cpu_inc(s->cpu_slab->stat[si]);
+ __this_cpu_inc(s->cpu->stat[si]);
#endif
}

@@ -298,7 +307,7 @@ static inline void queue_put(struct kmem

static inline int queue_full(struct kmem_cache_queue *q)
{
- return q->objects == QUEUE_SIZE;
+ return q->objects == q->max;
}

static inline int queue_empty(struct kmem_cache_queue *q)
@@ -1599,6 +1608,11 @@ static void flush_cpu_objects(struct kme
stat(s, QUEUE_FLUSH);
}

+struct flush_control {
+ struct kmem_cache *s;
+ struct kmem_cache_cpu *c;
+};
+
/*
* Flush cpu objects.
*
@@ -1606,24 +1620,100 @@ static void flush_cpu_objects(struct kme
*/
static void __flush_cpu_objects(void *d)
{
- struct kmem_cache *s = d;
- struct kmem_cache_cpu *c = __this_cpu_ptr(s->cpu_slab);
+ struct flush_control *f = d;
+ struct kmem_cache_cpu *c = __this_cpu_ptr(f->c);

if (c->q.objects)
- flush_cpu_objects(s, c);
+ flush_cpu_objects(f->s, c);
}

static void flush_all(struct kmem_cache *s)
{
- on_each_cpu(__flush_cpu_objects, s, 1);
+ struct flush_control f = { s, s->cpu };
+
+ on_each_cpu(__flush_cpu_objects, &f, 1);
}

struct kmem_cache_cpu *alloc_kmem_cache_cpu(struct kmem_cache *s, int n)
{
- return __alloc_percpu(sizeof(struct kmem_cache_cpu),
- __alignof__(struct kmem_cache_cpu));
+ struct kmem_cache_cpu *k;
+ int cpu;
+ int size;
+ int max;
+
+ /* Size the queue and the allocation to cacheline sizes */
+ size = ALIGN(n * sizeof(void *) + sizeof(struct kmem_cache_cpu), cache_line_size());
+
+ k = __alloc_percpu(size, cache_line_size());
+ if (!k)
+ return NULL;
+
+ max = (size - sizeof(struct kmem_cache_cpu)) / sizeof(void *);
+
+ for_each_possible_cpu(cpu) {
+ struct kmem_cache_cpu *c = per_cpu_ptr(k, cpu);
+
+ c->q.max = max;
+ }
+
+ s->cpu_queue = max;
+ return k;
}

+
+#ifdef CONFIG_SYSFS
+static void resize_cpu_queue(struct kmem_cache *s, int queue)
+{
+ struct kmem_cache_cpu *n = alloc_kmem_cache_cpu(s, queue);
+ struct flush_control f;
+
+ /* Create the new cpu queue and then free the old one */
+ f.s = s;
+ f.c = s->cpu;
+
+ /* We can only shrink the queue here since the new
+ * queue size may be smaller and there may be concurrent
+ * slab operations. The update of the queue must be seen
+ * before the change of the location of the percpu queue.
+ *
+ * Note that the queue may contain more object than the
+ * queue size after this operation.
+ */
+ if (queue < s->queue) {
+ s->queue = queue;
+ s->batch = (s->queue + 1) / 2;
+ barrier();
+ }
+
+ /* This is critical since allocation and free runs
+ * concurrently without taking the slub_lock!
+ * We point the cpu pointer to a different per cpu
+ * segment to redirect current processing and then
+ * flush the cpu objects on the old cpu structure.
+ *
+ * The old percpu structure is no longer reachable
+ * since slab_alloc/free must have terminated in order
+ * to execute __flush_cpu_objects. Both require
+ * interrupts to be disabled.
+ */
+ s->cpu = n;
+ on_each_cpu(__flush_cpu_objects, &f, 1);
+
+ /*
+ * If the queue needs to be extended then we deferred
+ * the update until now when the larger sized queue
+ * has been allocated and is working.
+ */
+ if (queue > s->queue) {
+ s->queue = queue;
+ s->batch = (s->queue + 1) / 2;
+ }
+
+ if (slab_state > UP)
+ free_percpu(f.c);
+}
+#endif
+
/*
* Check if the objects in a per cpu structure fit numa
* locality expectations.
@@ -1734,7 +1824,7 @@ static inline void refill_queue(struct k
struct kmem_cache_queue *q, struct page *page, int nr)
{
int d;
- int batch = min_t(int, QUEUE_SIZE, BATCH_SIZE);
+ int batch = min_t(int, q->max, s->queue);

d = min(batch - q->objects, nr);
retrieve_objects(s, page, q->object + q->objects, d);
@@ -1777,7 +1867,7 @@ static void *slab_alloc(struct kmem_cach

redo:
local_irq_save(flags);
- c = __this_cpu_ptr(s->cpu_slab);
+ c = __this_cpu_ptr(s->cpu);
q = &c->q;
if (unlikely(queue_empty(q) || !node_match(c, node))) {

@@ -1786,7 +1876,7 @@ redo:
c->node = node;
}

- while (q->objects < BATCH_SIZE) {
+ while (q->objects < s->batch) {
struct page *new;

new = get_partial(s, gfpflags & ~__GFP_ZERO, node);
@@ -1803,7 +1893,7 @@ redo:
local_irq_disable();

/* process may have moved to different cpu */
- c = __this_cpu_ptr(s->cpu_slab);
+ c = __this_cpu_ptr(s->cpu);
q = &c->q;

if (!new) {
@@ -1905,7 +1995,7 @@ static void slab_free(struct kmem_cache

slab_free_hook_irq(s, x);

- c = __this_cpu_ptr(s->cpu_slab);
+ c = __this_cpu_ptr(s->cpu);

if (NUMA_BUILD) {
int node = page_to_nid(page);
@@ -1921,7 +2011,7 @@ static void slab_free(struct kmem_cache

if (unlikely(queue_full(q))) {

- drain_queue(s, q, BATCH_SIZE);
+ drain_queue(s, q, s->batch);
stat(s, FREE_SLOWPATH);

} else
@@ -2123,9 +2213,9 @@ static inline int alloc_kmem_cache_cpus(
BUILD_BUG_ON(PERCPU_DYNAMIC_EARLY_SIZE <
SLUB_PAGE_SHIFT * sizeof(struct kmem_cache_cpu));

- s->cpu_slab = alloc_percpu(struct kmem_cache_cpu);
+ s->cpu = alloc_kmem_cache_cpu(s, s->queue);

- return s->cpu_slab != NULL;
+ return s->cpu != NULL;
}

static struct kmem_cache *kmem_cache_node;
@@ -2335,6 +2425,18 @@ static int calculate_sizes(struct kmem_c

}

+static int initial_queue_size(int size)
+{
+ if (size > PAGE_SIZE)
+ return 8;
+ else if (size > 1024)
+ return 24;
+ else if (size > 256)
+ return 54;
+ else
+ return 120;
+}
+
static int kmem_cache_open(struct kmem_cache *s,
const char *name, size_t size,
size_t align, unsigned long flags,
@@ -2373,6 +2475,9 @@ static int kmem_cache_open(struct kmem_c
if (!init_kmem_cache_nodes(s))
goto error;

+ s->queue = initial_queue_size(s->size);
+ s->batch = (s->queue + 1) / 2;
+
if (alloc_kmem_cache_cpus(s))
return 1;

@@ -2482,8 +2587,9 @@ static inline int kmem_cache_close(struc
{
int node;

+ down_read(&slub_lock);
flush_all(s);
- free_percpu(s->cpu_slab);
+ free_percpu(s->cpu);
/* Attempt to free all objects */
for_each_node_state(node, N_NORMAL_MEMORY) {
struct kmem_cache_node *n = get_node(s, node);
@@ -2493,6 +2599,7 @@ static inline int kmem_cache_close(struc
return 1;
}
free_kmem_cache_nodes(s);
+ up_read(&slub_lock);
return 0;
}

@@ -3110,6 +3217,7 @@ void __init kmem_cache_init(void)
caches++;
}

+ /* Now the kmalloc array is fully functional (*not* the dma array) */
slab_state = UP;

/*
@@ -3300,7 +3408,7 @@ static int __cpuinit slab_cpuup_callback
down_read(&slub_lock);
list_for_each_entry(s, &slab_caches, list) {
local_irq_save(flags);
- flush_cpu_objects(s, per_cpu_ptr(s->cpu_slab ,cpu));
+ flush_cpu_objects(s, per_cpu_ptr(s->cpu, cpu));
local_irq_restore(flags);
}
up_read(&slub_lock);
@@ -3827,6 +3935,7 @@ static ssize_t show_slab_objects(struct
nodes[node] += x;
}
}
+
x = sprintf(buf, "%lu", total);
#ifdef CONFIG_NUMA
for_each_node_state(node, N_NORMAL_MEMORY)
@@ -3834,6 +3943,7 @@ static ssize_t show_slab_objects(struct
x += sprintf(buf + x, " N%d=%lu",
node, nodes[node]);
#endif
+ up_read(&slub_lock);
kfree(nodes);
return x + sprintf(buf + x, "\n");
}
@@ -3939,6 +4049,57 @@ static ssize_t min_partial_store(struct
}
SLAB_ATTR(min_partial);

+static ssize_t cpu_queue_size_show(struct kmem_cache *s, char *buf)
+{
+ return sprintf(buf, "%u\n", s->queue);
+}
+
+static ssize_t cpu_queue_size_store(struct kmem_cache *s,
+ const char *buf, size_t length)
+{
+ unsigned long queue;
+ int err;
+
+ err = strict_strtoul(buf, 10, &queue);
+ if (err)
+ return err;
+
+ if (queue > 10000 || queue < 4)
+ return -EINVAL;
+
+ if (s->batch > queue)
+ s->batch = queue;
+
+ down_write(&slub_lock);
+ resize_cpu_queue(s, queue);
+ up_write(&slub_lock);
+ return length;
+}
+SLAB_ATTR(cpu_queue_size);
+
+static ssize_t batch_size_show(struct kmem_cache *s, char *buf)
+{
+ return sprintf(buf, "%u\n", s->batch);
+}
+
+static ssize_t batch_size_store(struct kmem_cache *s,
+ const char *buf, size_t length)
+{
+ unsigned long batch;
+ int err;
+
+ err = strict_strtoul(buf, 10, &batch);
+ if (err)
+ return err;
+
+ if (batch < s->queue || batch < 4)
+ return -EINVAL;
+
+ s->batch = batch;
+ return length;
+}
+SLAB_ATTR(batch_size);
+
static ssize_t ctor_show(struct kmem_cache *s, char *buf)
{
if (s->ctor) {
@@ -3962,7 +4123,7 @@ static ssize_t partial_show(struct kmem_
}
SLAB_ATTR_RO(partial);

-static ssize_t cpu_queues_show(struct kmem_cache *s, char *buf)
+static ssize_t per_cpu_caches_show(struct kmem_cache *s, char *buf)
{
unsigned long total = 0;
int x;
@@ -3973,8 +4134,9 @@ static ssize_t cpu_queues_show(struct km
if (!cpus)
return -ENOMEM;

+ down_read(&slub_lock);
for_each_online_cpu(cpu) {
- struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
+ struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu, cpu);

total += c->q.objects;
}
@@ -3982,15 +4144,16 @@ static ssize_t cpu_queues_show(struct km
x = sprintf(buf, "%lu", total);

for_each_online_cpu(cpu) {
- struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
+ struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu, cpu);
+ struct kmem_cache_queue *q = &c->q;

- if (c->q.objects)
- x += sprintf(buf + x, " C%d=%u", cpu, c->q.objects);
+ x += sprintf(buf + x, " C%d=%u/%u", cpu, q->objects, q->max);
}
+ up_read(&slub_lock);
kfree(cpus);
return x + sprintf(buf + x, "\n");
}
-SLAB_ATTR_RO(cpu_queues);
+SLAB_ATTR_RO(per_cpu_caches);

static ssize_t objects_show(struct kmem_cache *s, char *buf)
{
@@ -4246,12 +4409,14 @@ static int show_stat(struct kmem_cache *
if (!data)
return -ENOMEM;

+ down_read(&slub_lock);
for_each_online_cpu(cpu) {
- unsigned x = per_cpu_ptr(s->cpu_slab, cpu)->stat[si];
+ unsigned x = per_cpu_ptr(s->cpu, cpu)->stat[si];

data[cpu] = x;
sum += x;
}
+ up_read(&slub_lock);

len = sprintf(buf, "%lu", sum);

@@ -4269,8 +4434,10 @@ static void clear_stat(struct kmem_cache
{
int cpu;

+ down_write(&slub_lock);
for_each_online_cpu(cpu)
- per_cpu_ptr(s->cpu_slab, cpu)->stat[si] = 0;
+ per_cpu_ptr(s->cpu, cpu)->stat[si] = 0;
+ up_write(&slub_lock);
}

#define STAT_ATTR(si, text) \
@@ -4307,10 +4474,12 @@ static struct attribute *slab_attrs[] =
&objs_per_slab_attr.attr,
&order_attr.attr,
&min_partial_attr.attr,
+ &batch_size_attr.attr,
&objects_attr.attr,
&objects_partial_attr.attr,
&partial_attr.attr,
- &cpu_queues_attr.attr,
+ &per_cpu_caches_attr.attr,
+ &cpu_queue_size_attr.attr,
&ctor_attr.attr,
&aliases_attr.attr,
&align_attr.attr,
@@ -4672,7 +4841,7 @@ static int s_show(struct seq_file *m, vo
seq_printf(m, "%-17s %6lu %6lu %6u %4u %4d", s->name, nr_inuse,
nr_objs, s->size, oo_objects(s->oo),
(1 << oo_order(s->oo)));
- seq_printf(m, " : tunables %4u %4u %4u", 0, 0, 0);
+ seq_printf(m, " : tunables %4u %4u %4u", s->queue, s->batch, 0);
seq_printf(m, " : slabdata %6lu %6lu %6lu", nr_slabs, nr_slabs,
0UL);
seq_putc(m, '\n');
Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h 2010-10-04 11:00:40.000000000 -0500
+++ linux-2.6/include/linux/slub_def.h 2010-10-04 11:09:44.000000000 -0500
@@ -30,13 +30,11 @@ enum stat_item {
ORDER_FALLBACK, /* Number of times fallback was necessary */
NR_SLUB_STAT_ITEMS };

-#define QUEUE_SIZE 50
-#define BATCH_SIZE 25
-
/* Queueing structure used for per cpu, l3 cache and alien queueing */
struct kmem_cache_queue {
int objects; /* Available objects */
- void *object[QUEUE_SIZE];
+ int max; /* Queue capacity */
+ void *object[];
};

struct kmem_cache_cpu {
@@ -71,12 +69,13 @@ struct kmem_cache_order_objects {
* Slab cache management.
*/
struct kmem_cache {
- struct kmem_cache_cpu __percpu *cpu_slab;
+ struct kmem_cache_cpu __percpu *cpu;
/* Used for retriving partial slabs etc */
unsigned long flags;
int size; /* The size of an object including meta data */
int objsize; /* The size of an object without meta data */
struct kmem_cache_order_objects oo;
+ int batch;

/* Allocation and freeing of slabs */
struct kmem_cache_order_objects max;
@@ -86,6 +85,8 @@ struct kmem_cache {
void (*ctor)(void *);
int inuse; /* Offset to metadata */
int align; /* Alignment */
+ int queue; /* specified queue size */
+ int cpu_queue; /* cpu queue size */
unsigned long min_partial;
const char *name; /* Name (only for display!) */
struct list_head list; /* List of slab caches */

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>

Christoph Lameter

2010-10-05 18:57:36 UTC

Add a "touched" state in preparation for the implementation of cache
expiration.

Signed-off-by: Christoph Lameter <***@linux-foundation.org>

---
include/linux/slub_def.h | 10 +++++----
mm/slub.c | 51 ++++++++++++++++++++++++++++++++++-------------
2 files changed, 43 insertions(+), 18 deletions(-)

Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h 2010-10-05 13:36:33.000000000 -0500
+++ linux-2.6/include/linux/slub_def.h 2010-10-05 13:39:59.000000000 -0500
@@ -52,11 +52,12 @@ enum stat_item {
* the per cpu queue.
*/
struct kmem_cache_queue {
- int objects; /* Available objects */
- int max; /* Queue capacity */
+ int objects; /* Available objects */
+ unsigned short max; /* Queue capacity */
+ unsigned short touched; /* Cache was touched */
union {
struct kmem_cache_queue *shared; /* cpu q -> shared q */
- spinlock_t lock; /* shared queue: lock */
+ spinlock_t lock; /* shared queue: lock */
spinlock_t alien_lock; /* alien cache lock */
};
void *object[];
@@ -72,7 +73,8 @@ struct kmem_cache_cpu {

struct kmem_cache_node {
spinlock_t lock; /* Protocts slab metadata on a node */
- unsigned long nr_partial;
+ unsigned touched;
+ unsigned nr_partial;
struct list_head partial;
#ifdef CONFIG_SLUB_DEBUG
atomic_long_t nr_slabs;
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2010-10-05 13:39:26.000000000 -0500
+++ linux-2.6/mm/slub.c 2010-10-05 13:39:59.000000000 -0500
@@ -1291,12 +1291,14 @@ static void add_partial(struct kmem_cach
else
list_add(&page->lru, &n->partial);
__SetPageSlubPartial(page);
+ n->touched = 1;
}

static inline void remove_partial(struct kmem_cache_node *n,
struct page *page)
{
n->nr_partial--;
+ n->touched = 1;
list_del(&page->lru);
__ClearPageSlubPartial(page);
}
@@ -1445,6 +1447,7 @@ static inline int drain_queue(struct kme
drain_objects(s, q->object, t);

q->objects -= t;
+ q->touched = 0;
if (q->objects)
memcpy(q->object, q->object + t,
q->objects * sizeof(void *));
@@ -1553,17 +1556,18 @@ static inline void init_shared_cache(str
{
spin_lock_init(&q->lock);
q->max = max;
- q->objects =0;
+ q->objects = 0;
+ q->touched = 0;
}

static inline void init_alien_cache(struct kmem_cache_queue *q, int max)
{
spin_lock_init(&q->alien_lock);
q->max = max;
- q->objects =0;
+ q->objects = 0;
+ q->touched = 0;
}

-
/* Determine a list of the active shared caches */
struct kmem_cache_queue **shared_caches(struct kmem_cache *s)
{
@@ -2000,6 +2004,7 @@ redo:
spin_lock(&a->lock);
if (likely(!queue_empty(a))) {
object = queue_get(a);
+ a->touched = 1;
spin_unlock(&a->lock);
stat(s, ALLOC_ALIEN);
return object;
@@ -2079,20 +2084,21 @@ static void slab_free_alien(struct kmem_
struct kmem_cache_queue *a = alien_cache(s, c, node);

if (a) {
- int slow = 0;
+ int touched = 1;

spin_lock(&a->lock);
while (unlikely(queue_full(a))) {
drain_queue(s, a, s->batch);
- slow = 1;
+ touched = 0;
}
queue_put(a, object);
spin_unlock(&a->lock);

- if (slow)
- stat(s, FREE_SLOWPATH);
- else
+ a->touched = touched;
+ if (touched)
stat(s, FREE_ALIEN);
+ else
+ stat(s, FREE_SLOWPATH);

} else {
/* Direct free to the slab */
@@ -2112,6 +2118,7 @@ static void *slab_alloc(struct kmem_cach
struct kmem_cache_queue *q;
struct kmem_cache_node *n;
struct page *page;
+ int batch;
unsigned long flags;

if (slab_pre_alloc_hook(s, gfpflags))
@@ -2136,6 +2143,7 @@ redo:

get_object:
object = queue_get(q);
+ q->touched = 1;

got_object:
if (kmem_cache_debug(s)) {
@@ -2152,6 +2160,10 @@ got_object:
return object;
}

+ batch = s->batch;
+ if (!q->touched && batch > 16)
+ batch = 16;
+
if (q->shared) {
/*
* Refill the cpu queue with the hottest objects
@@ -2161,9 +2173,10 @@ got_object:
int d = 0;

spin_lock(&l->lock);
- d = min(l->objects, s->batch);
+ d = min(l->objects, batch);

l->objects -= d;
+ l->touched = 1;
memcpy(q->object, l->object + l->objects,
d * sizeof(void *));
spin_unlock(&l->lock);
@@ -2180,11 +2193,11 @@ got_object:

/* Refill from partial lists */
spin_lock(&n->lock);
- while (q->objects < s->batch && !list_empty(&n->partial)) {
+ while (q->objects < batch && !list_empty(&n->partial)) {
page = list_entry(n->partial.next, struct page, lru);

refill_queue(s, q, page, min(available(page),
- s->batch - q->objects));
+ batch - q->objects));

if (all_objects_used(page))
partial_to_full(s, n, page);
@@ -2199,7 +2212,7 @@ got_object:

gfpflags &= gfp_allowed_mask;
/* Refill from free pages */
- while (q->objects < s->batch) {
+ while (q->objects < batch) {
int tail = 0;

if (gfpflags & __GFP_WAIT)
@@ -2226,7 +2239,7 @@ got_object:
* the partial list if so.
*/
if (q->objects < s->batch)
- refill_queue(s, q, page, min_t(int, page->objects, s->batch));
+ refill_queue(s, q, page, min_t(int, page->objects, batch));
else
tail = 1;

@@ -2304,6 +2317,7 @@ static void slab_free(struct kmem_cache
struct kmem_cache_cpu *c;
struct kmem_cache_queue *q;
unsigned long flags;
+ int touched = 1;

slab_free_hook(s, x);

@@ -2339,6 +2353,7 @@ static void slab_free(struct kmem_cache
memcpy(l->object + l->objects, q->object,
d * sizeof(void *));
l->objects += d;
+ l->touched = 1;
spin_unlock(&l->lock);

q->objects -= d;
@@ -2353,10 +2368,12 @@ static void slab_free(struct kmem_cache
drain_queue(s, q, s->batch);
stat(s, FREE_SLOWPATH);
}
+ touched = 0;
} else
stat(s, FREE_FASTPATH);

queue_put(q, x);
+ q->touched = touched;
out:
local_irq_restore(flags);
}
@@ -3924,7 +3941,7 @@ static int validate_slab_node(struct kme
}
if (count != n->nr_partial)
printk(KERN_ERR "SLUB %s: %ld partial slabs counted but "
- "counter=%ld\n", s->name, count, n->nr_partial);
+ "counter=%d\n", s->name, count, n->nr_partial);

if (!(s->flags & SLAB_STORE_USER))
goto out;
@@ -4549,6 +4566,8 @@ static ssize_t shared_caches_show(struct
x += sprintf(buf + x, "%d", cpu);
}
x += sprintf(buf +x, "=%d/%d", q->objects, q->max);
+ if (!q->touched)
+ x += sprintf(buf + x,"*");
}
up_read(&slub_lock);
kfree(caches);
@@ -4604,6 +4623,8 @@ static ssize_t per_cpu_caches_show(struc
struct kmem_cache_queue *q = &c->q;

x += sprintf(buf + x, " C%d=%u/%u", cpu, q->objects, q->max);
+ if (!q->touched)
+ x += sprintf(buf + x,"*");
}
up_read(&slub_lock);
kfree(cpus);
@@ -4772,6 +4793,8 @@ static ssize_t alien_caches_show(struct

x += sprintf(buf + x, "N%d=%d/%d",
node, a->objects, a->max);
+ if (!a->touched)
+ x += sprintf(buf + x,"*");
}
}
x += sprintf(buf + x, "]");

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>

Christoph Lameter

2010-10-05 18:57:39 UTC

There are some slab caches around that are rarely used and which are not
performance critical. Add a new SLAB_LOWMEM option to reduce the memory
requirements of such slabs. SLAB_LOWMEM caches will keep no empty slabs
around, have no shared or alien caches and will have a small per cpu
queue of 5 objects.

Signed-off-by: Christoph Lameter <***@linux.com>

---
include/linux/slab.h | 2 ++
mm/slub.c | 25 ++++++++++++++++---------
2 files changed, 18 insertions(+), 9 deletions(-)

Index: linux-2.6/include/linux/slab.h
===================================================================
--- linux-2.6.orig/include/linux/slab.h 2010-10-05 13:40:04.000000000 -0500
+++ linux-2.6/include/linux/slab.h 2010-10-05 13:40:08.000000000 -0500
@@ -17,12 +17,14 @@
* The ones marked DEBUG are only valid if CONFIG_SLAB_DEBUG is set.
*/
#define SLAB_DEBUG_FREE 0x00000100UL /* DEBUG: Perform (expensive) checks on free */
+#define SLAB_LOWMEM 0x00000200UL /* Reduce memory usage of this slab */
#define SLAB_RED_ZONE 0x00000400UL /* DEBUG: Red zone objs in a cache */
#define SLAB_POISON 0x00000800UL /* DEBUG: Poison objects */
#define SLAB_HWCACHE_ALIGN 0x00002000UL /* Align objs on cache lines */
#define SLAB_CACHE_DMA 0x00004000UL /* Use GFP_DMA memory */
#define SLAB_STORE_USER 0x00010000UL /* DEBUG: Store the last owner for bug hunting */
#define SLAB_PANIC 0x00040000UL /* Panic if kmem_cache_create() fails */
+
/*
* SLAB_DESTROY_BY_RCU - **WARNING** READ THIS!
*
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2010-10-05 13:40:04.000000000 -0500
+++ linux-2.6/mm/slub.c 2010-10-05 13:40:08.000000000 -0500
@@ -2829,12 +2829,20 @@ static int kmem_cache_open(struct kmem_c
* The larger the object size is, the more pages we want on the partial
* list to avoid pounding the page allocator excessively.
*/
- set_min_partial(s, ilog2(s->size));
+ if (flags & SLAB_LOWMEM)
+ set_min_partial(s, 0);
+ else
+ set_min_partial(s, ilog2(s->size));
+
s->refcount = 1;
if (!init_kmem_cache_nodes(s))
goto error;

- s->queue = initial_queue_size(s->size);
+ if (flags & SLAB_LOWMEM)
+ s->queue = 5;
+ else
+ s->queue = initial_queue_size(s->size);
+
s->batch = (s->queue + 1) / 2;

#ifdef CONFIG_NUMA
@@ -2879,7 +2887,9 @@ static int kmem_cache_open(struct kmem_c

if (alloc_kmem_cache_cpus(s)) {
s->shared_queue_sysfs = 0;
- if (nr_cpu_ids > 1 && s->size < PAGE_SIZE) {
+ if (!(flags & SLAB_LOWMEM) &&
+ nr_cpu_ids > 1 &&
+ s->size < PAGE_SIZE) {
s->shared_queue_sysfs = 10 * s->batch;
alloc_shared_caches(s);
}
@@ -3788,7 +3798,7 @@ void __init kmem_cache_init(void)

kmem_cache_open(kmem_cache_node, "kmem_cache_node",
sizeof(struct kmem_cache_node),
- 0, SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
+ 0, SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_LOWMEM, NULL);

hotplug_memory_notifier(slab_memory_callback, SLAB_CALLBACK_PRI);

@@ -3797,7 +3807,7 @@ void __init kmem_cache_init(void)

temp_kmem_cache = kmem_cache;
kmem_cache_open(kmem_cache, "kmem_cache", kmem_size,
- 0, SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
+ 0, SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_LOWMEM, NULL);
kmem_cache = kmem_cache_alloc(kmem_cache, GFP_NOWAIT);
memcpy(kmem_cache, temp_kmem_cache, kmem_size);

@@ -3906,12 +3916,9 @@ void __init kmem_cache_init(void)
if (s && s->size) {
char *name = kasprintf(GFP_NOWAIT,
"dma-kmalloc-%d", s->objsize);
-
BUG_ON(!name);
kmalloc_dma_caches[i] = create_kmalloc_cache(name,
- s->objsize, SLAB_CACHE_DMA);
- /* DMA caches are rarely used. Reduce memory consumption */
- kmalloc_dma_caches[i]->shared_queue_sysfs = 0;
+ s->objsize, SLAB_CACHE_DMA | SLAB_LOWMEM);
}
}
#endif

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>

Christoph Lameter

2010-10-05 18:57:34 UTC

Strictly a performance enhancement by better tracking of objects
that are likely in the lowest cpu caches of processors.

SLAB uses one shared cache per NUMA node or one globally. However, that
is not satifactory for contemporary cpus. Those may have multiple
independent cpu caches per node. SLAB in these situation treats
cache cold objects like cache hot objects.

The shared caches of slub are per physical cpu cache for all cpus using
that cache. Shared cache content will not cross physical caches.

The shared cache can be dynamically configured via
/sys/kernel/slab/<cache>/shared_queue

The current shared cache state is available via
cat /sys/kernel/slab/<cache/<shared_caches>

Shared caches are always allocated in the sizes available in the kmalloc
array. Cache sizes are rounded up to the sizes available.

F.e. on my Dell with 8 cpus in 2 packages in which each 2 cpus shared
an l2 cache I get:

christoph@:/sys/kernel/slab$ cat kmalloc-64/shared_caches
384 C0,2=66/126 C1,3=126/126 C4,6=126/126 C5,7=66/126
christoph@:/sys/kernel/slab$ cat kmalloc-64/per_cpu_caches
617 C0=54/125 C1=37/125 C2=102/125 C3=76/125 C4=81/125 C5=108/125 C6=72/125 C7=87/125

Signed-off-by: Christoph Lameter <***@linux.com>

---
include/linux/slub_def.h | 9 +
mm/slub.c | 406 +++++++++++++++++++++++++++++++++++++++++++++--
2 files changed, 398 insertions(+), 17 deletions(-)

Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h 2010-10-05 13:19:37.000000000 -0500
+++ linux-2.6/include/linux/slub_def.h 2010-10-05 13:26:32.000000000 -0500
@@ -17,9 +17,11 @@

enum stat_item {
ALLOC_FASTPATH, /* Allocation from cpu queue */
+ ALLOC_SHARED, /* Allocation caused a shared cache transaction */
ALLOC_DIRECT, /* Allocation bypassing queueing */
ALLOC_SLOWPATH, /* Allocation required refilling of queue */
FREE_FASTPATH, /* Free to cpu queue */
+ FREE_SHARED, /* Free caused a shared cache transaction */
FREE_DIRECT, /* Free bypassing queues */
FREE_SLOWPATH, /* Required pushing objects out of the queue */
FREE_ADD_PARTIAL, /* Freeing moved slab to partial list */
@@ -50,6 +52,10 @@ enum stat_item {
struct kmem_cache_queue {
int objects; /* Available objects */
int max; /* Queue capacity */
+ union {
+ struct kmem_cache_queue *shared; /* cpu q -> shared q */
+ spinlock_t lock; /* shared queue: lock */
+ };
void *object[];
};

@@ -103,12 +109,15 @@ struct kmem_cache {
int align; /* Alignment */
int queue; /* specified queue size */
int cpu_queue; /* cpu queue size */
+ int shared_queue; /* Actual shared queue size */
+ int nr_shared; /* Total # of shared caches */
unsigned long min_partial;
const char *name; /* Name (only for display!) */
struct list_head list; /* List of slab caches */
#ifdef CONFIG_SYSFS
struct kobject kobj; /* For sysfs */
#endif
+ int shared_queue_sysfs; /* Desired shared queue size */

#ifdef CONFIG_NUMA
/*
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2010-10-05 13:19:37.000000000 -0500
+++ linux-2.6/mm/slub.c 2010-10-05 13:30:50.000000000 -0500
@@ -1438,6 +1438,23 @@ static inline int drain_queue(struct kme
return t;
}

+static inline int drain_shared_cache(struct kmem_cache *s,
+ struct kmem_cache_queue *q)
+{
+ int n = 0;
+
+ if (!q)
+ return n;
+
+ if (!queue_empty(q)) {
+ spin_lock(&q->lock);
+ if (q->objects)
+ n = drain_queue(s, q, q->objects);
+ spin_unlock(&q->lock);
+ }
+ return n;
+}
+
/*
* Drain all objects from a per cpu queue
*/
@@ -1446,6 +1463,7 @@ static void flush_cpu_objects(struct kme
struct kmem_cache_queue *q = &c->q;

drain_queue(s, q, q->objects);
+ drain_shared_cache(s, q->shared);
stat(s, QUEUE_FLUSH);
}

@@ -1502,6 +1520,188 @@ struct kmem_cache_cpu *alloc_kmem_cache_
return k;
}

+static struct kmem_cache *get_slab(size_t size, gfp_t flags);
+
+static inline unsigned long shared_cache_size(int n)
+{
+ return n * sizeof(void *) + sizeof(struct kmem_cache_queue);
+}
+
+static inline unsigned long shared_cache_capacity(unsigned long size)
+{
+ return (size - sizeof(struct kmem_cache_queue)) / sizeof(void *);
+}
+
+static inline void init_shared_cache(struct kmem_cache_queue *q, int max)
+{
+ spin_lock_init(&q->lock);
+ q->max = max;
+ q->objects =0;
+}
+
+
+/* Determine a list of the active shared caches */
+struct kmem_cache_queue **shared_caches(struct kmem_cache *s)
+{
+ int cpu;
+ struct kmem_cache_queue **caches;
+ int nr;
+ int n;
+
+ if (!s->nr_shared)
+ return NULL;
+
+ caches = kzalloc(sizeof(struct kmem_cache_queue *)
+ * (s->nr_shared + 1), GFP_KERNEL);
+ if (!caches)
+ return ERR_PTR(-ENOMEM);
+
+ nr = 0;
+
+ /* Build list of shared caches */
+ for_each_online_cpu(cpu) {
+ struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu, cpu);
+ struct kmem_cache_queue *q = c->q.shared;
+
+ if (!q)
+ continue;
+
+ for (n = 0; n < nr; n++)
+ if (caches[n] == q)
+ break;
+
+ if (n < nr)
+ continue;
+
+ caches[nr++] = q;
+ }
+ caches[nr] = NULL;
+ BUG_ON(nr != s->nr_shared);
+ return caches;
+}
+
+static struct kmem_cache *get_slab(size_t size, gfp_t flags);
+
+/* Map of cpus that have no siblings or where we have broken topolocy info */
+static cpumask_t isolated_cpus;
+
+struct kmem_cache_queue *alloc_shared_cache_node(struct kmem_cache *s,
+ int node, const struct cpumask *map)
+{
+ struct kmem_cache_queue *l;
+ int max;
+ int size;
+ void *p;
+ int cpu;
+
+ /*
+ * Determine the size. Round it up to the size that a kmalloc cache
+ * supporting that size has. This will often align the size to a
+ * power of 2 especially on machines that have large kmalloc
+ * alignment requirements.
+ */
+ size = shared_cache_size(s->shared_queue_sysfs);
+ if (size < PAGE_SIZE / 2)
+ size = get_slab(size, GFP_KERNEL)->objsize;
+ else
+ size = PAGE_SIZE << get_order(size);
+
+ max = shared_cache_capacity(size);
+
+ /* Allocate shared cache */
+ p = kmalloc_node(size, GFP_KERNEL | __GFP_ZERO, node);
+ if (!p)
+ return NULL;
+ l = p;
+ init_shared_cache(l, max);
+
+ /* Link all cpus in this group to the shared cache */
+ for_each_cpu(cpu, map)
+ per_cpu_ptr(s->cpu, cpu)->q.shared = l;
+
+ s->shared_queue = max;
+ s->nr_shared++;
+
+ return l;
+}
+
+/*
+ * Allocate shared cpu caches.
+ * A shared cache is allocated for each series of cpus sharing a single cache
+ */
+static void alloc_shared_caches(struct kmem_cache *s)
+{
+ int cpu;
+ struct kmem_cache_queue *l;
+
+ if (slab_state < SYSFS || s->shared_queue_sysfs == 0)
+ return;
+
+ for_each_online_cpu(cpu) {
+ struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu, cpu);
+ const struct cpumask *map =
+ per_cpu(cpu_info.llc_shared_map, cpu);
+
+ /* Skip cpus that already have assigned shared caches */
+ if (c->q.shared || cpu_isset(cpu, isolated_cpus))
+ continue;
+
+ if (cpumask_weight(map) < 2)
+ cpu_set(cpu, isolated_cpus);
+ else {
+ l = alloc_shared_cache_node(s, c->node, map);
+ if (!l)
+ printk(KERN_WARNING "SLUB: Out of memory allocating"
+ " shared cache for %s cpu %d node %d\n",
+ s->name, cpu, c->node);
+ }
+ }
+
+ /* Put all the isolated processor into their own shared cache */
+ if (!cpumask_empty(&isolated_cpus))
+ alloc_shared_cache_node(s, NUMA_NO_NODE, &isolated_cpus);
+}
+
+/*
+ * Flush shared caches.
+ *
+ * Called from IPI handler with interrupts disabled.
+ */
+static void __remove_shared_cache(void *d)
+{
+ struct kmem_cache *s = d;
+ struct kmem_cache_cpu *c = __this_cpu_ptr(s->cpu);
+ struct kmem_cache_queue *q = c->q.shared;
+
+ c->q.shared = NULL;
+ drain_shared_cache(s, q);
+}
+
+static int remove_shared_caches(struct kmem_cache *s)
+{
+ struct kmem_cache_queue **caches;
+ int i;
+
+ caches = shared_caches(s);
+ if (!caches)
+ return 0;
+ if (IS_ERR(caches))
+ return PTR_ERR(caches);
+
+ /* Go through a transaction on each cpu removing the pointers to the shared caches */
+ on_each_cpu(__remove_shared_cache, s, 1);
+
+ for(i = 0; i < s->nr_shared; i++) {
+ void *p = caches[i];
+
+ kfree(p);
+ }
+
+ kfree(caches);
+ s->nr_shared = 0;
+ s->shared_queue = 0;
+ return 0;
+}

#ifdef CONFIG_SYSFS
static void resize_cpu_queue(struct kmem_cache *s, int queue)
@@ -1509,6 +1709,9 @@ static void resize_cpu_queue(struct kmem
struct kmem_cache_cpu *n = alloc_kmem_cache_cpu(s, queue);
struct flush_control f;

+ /* Drop the shared caches if they exist */
+ remove_shared_caches(s);
+
/* Create the new cpu queue and then free the old one */
f.s = s;
f.c = s->cpu;
@@ -1553,6 +1756,9 @@ static void resize_cpu_queue(struct kmem

if (slab_state > UP)
free_percpu(f.c);
+
+ /* Get the shared caches back */
+ alloc_shared_caches(s);
}
#endif

@@ -1775,7 +1981,6 @@ redo:
q = &c->q;

if (likely(!queue_empty(q))) {
-
stat(s, ALLOC_FASTPATH);

get_object:
@@ -1796,6 +2001,28 @@ got_object:
return object;
}

+ if (q->shared) {
+ /*
+ * Refill the cpu queue with the hottest objects
+ * from the shared cache queue
+ */
+ struct kmem_cache_queue *l = q->shared;
+ int d = 0;
+
+ spin_lock(&l->lock);
+ d = min(l->objects, s->batch);
+
+ l->objects -= d;
+ memcpy(q->object, l->object + l->objects,
+ d * sizeof(void *));
+ spin_unlock(&l->lock);
+ q->objects = d;
+ if (d) {
+ stat(s, ALLOC_SHARED);
+ goto get_object;
+ }
+ }
+
stat(s, ALLOC_SLOWPATH);

n = get_node(s, node);
@@ -1950,9 +2177,30 @@ static void slab_free(struct kmem_cache

if (unlikely(queue_full(q))) {

- drain_queue(s, q, s->batch);
- stat(s, FREE_SLOWPATH);
-
+ /* Shared queue available and has space ? */
+ if (q->shared) {
+ struct kmem_cache_queue *l = q->shared;
+ int d;
+
+ spin_lock(&l->lock);
+ d = min(s->batch, l->max - l->objects);
+ memcpy(l->object + l->objects, q->object,
+ d * sizeof(void *));
+ l->objects += d;
+ spin_unlock(&l->lock);
+
+ q->objects -= d;
+ memcpy(q->object, q->object + d,
+ q->objects * sizeof(void *));
+
+ if (d)
+ stat(s, FREE_SHARED);
+ }
+
+ if (queue_full(q)) {
+ drain_queue(s, q, s->batch);
+ stat(s, FREE_SLOWPATH);
+ }
} else
stat(s, FREE_FASTPATH);

@@ -2411,8 +2659,11 @@ static int kmem_cache_open(struct kmem_c
s->queue = initial_queue_size(s->size);
s->batch = (s->queue + 1) / 2;

- if (alloc_kmem_cache_cpus(s))
+ if (alloc_kmem_cache_cpus(s)) {
+ s->shared_queue_sysfs = s->queue;
+ alloc_shared_caches(s);
return 1;
+ }

free_kmem_cache_nodes(s);
error:
@@ -2519,6 +2770,7 @@ static inline int kmem_cache_close(struc
int node;

down_read(&slub_lock);
+ remove_shared_caches(s);
flush_all(s);
free_percpu(s->cpu);
/* Attempt to free all objects */
@@ -3182,6 +3434,8 @@ void __init kmem_cache_init(void)
BUG_ON(!name);
kmalloc_dma_caches[i] = create_kmalloc_cache(name,
s->objsize, SLAB_CACHE_DMA);
+ /* DMA caches are rarely used. Reduce memory consumption */
+ kmalloc_dma_caches[i]->shared_queue_sysfs = 0;
}
}
#endif
@@ -3968,10 +4222,40 @@ static ssize_t min_partial_store(struct
}
SLAB_ATTR(min_partial);

-static ssize_t cpu_queue_size_show(struct kmem_cache *s, char *buf)
+static ssize_t queue_size_show(struct kmem_cache *s, char *buf)
{
return sprintf(buf, "%u\n", s->queue);
}
+SLAB_ATTR_RO(queue_size);
+
+
+static ssize_t batch_size_show(struct kmem_cache *s, char *buf)
+{
+ return sprintf(buf, "%u\n", s->batch);
+}
+
+static ssize_t batch_size_store(struct kmem_cache *s,
+ const char *buf, size_t length)
+{
+ unsigned long batch;
+ int err;
+
+ err = strict_strtoul(buf, 10, &batch);
+ if (err)
+ return err;
+
+ if (batch < s->queue || batch < 4)
+ return -EINVAL;
+
+ s->batch = batch;
+ return length;
+}
+SLAB_ATTR(batch_size);
+
+static ssize_t cpu_queue_size_show(struct kmem_cache *s, char *buf)
+{
+ return sprintf(buf, "%u\n", s->cpu_queue);
+}

static ssize_t cpu_queue_size_store(struct kmem_cache *s,
const char *buf, size_t length)
@@ -3996,28 +4280,89 @@ static ssize_t cpu_queue_size_store(stru
}
SLAB_ATTR(cpu_queue_size);

-static ssize_t batch_size_show(struct kmem_cache *s, char *buf)
+static ssize_t shared_queue_size_show(struct kmem_cache *s, char *buf)
{
- return sprintf(buf, "%u\n", s->batch);
+ return sprintf(buf, "%u %u\n", s->shared_queue, s->shared_queue_sysfs);
}

-static ssize_t batch_size_store(struct kmem_cache *s,
+static ssize_t shared_queue_size_store(struct kmem_cache *s,
const char *buf, size_t length)
{
- unsigned long batch;
+ unsigned long queue;
int err;

- err = strict_strtoul(buf, 10, &batch);
+ err = strict_strtoul(buf, 10, &queue);
if (err)
return err;

- if (batch < s->queue || batch < 4)
+ if (queue > 10000 || queue < 4)
return -EINVAL;

- s->batch = batch;
- return length;
+ down_write(&slub_lock);
+ err = remove_shared_caches(s);
+ if (!err) {
+ if (s->batch > queue)
+ s->batch = queue;
+
+ s->shared_queue_sysfs = queue;
+ if (queue)
+ alloc_shared_caches(s);
+ }
+ up_write(&slub_lock);
+ return err ? err : length;
}
-SLAB_ATTR(batch_size);
+SLAB_ATTR(shared_queue_size);
+
+static ssize_t shared_caches_show(struct kmem_cache *s, char *buf)
+{
+ unsigned long total = 0;
+ int x, n;
+ int cpu;
+ struct kmem_cache_queue **caches;
+
+ down_read(&slub_lock);
+ caches = shared_caches(s);
+ if (!caches) {
+ up_read(&slub_lock);
+ return -ENOENT;
+ }
+
+ if (IS_ERR(caches)) {
+ up_read(&slub_lock);
+ return PTR_ERR(caches);
+ }
+
+ for (n = 0; n < s->nr_shared; n++)
+ total += caches[n]->objects;
+
+ x = sprintf(buf, "%lu", total);
+
+ for (n = 0; n < s->nr_shared; n++) {
+ int first = 1;
+ struct kmem_cache_queue *q = caches[n];
+
+ x += sprintf(buf + x, " C");
+
+ /* Find cpus using the shared cache */
+ for_each_online_cpu(cpu) {
+ struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu, cpu);
+
+ if (q != c->q.shared)
+ continue;
+
+ if (first)
+ first = 0;
+ else
+ x += sprintf(buf + x, ",");
+ x += sprintf(buf + x, "%d", cpu);
+ }
+ x += sprintf(buf +x, "=%d/%d", q->objects, q->max);
+ }
+ up_read(&slub_lock);
+ kfree(caches);
+ return x + sprintf(buf + x, "\n");
+}
+SLAB_ATTR_RO(shared_caches);

static ssize_t ctor_show(struct kmem_cache *s, char *buf)
{
@@ -4351,9 +4696,11 @@ static ssize_t text##_store(struct kmem_
SLAB_ATTR(text); \

STAT_ATTR(ALLOC_FASTPATH, alloc_fastpath);
+STAT_ATTR(ALLOC_SHARED, alloc_shared);
STAT_ATTR(ALLOC_DIRECT, alloc_direct);
STAT_ATTR(ALLOC_SLOWPATH, alloc_slowpath);
STAT_ATTR(FREE_FASTPATH, free_fastpath);
+STAT_ATTR(FREE_SHARED, free_shared);
STAT_ATTR(FREE_DIRECT, free_direct);
STAT_ATTR(FREE_SLOWPATH, free_slowpath);
STAT_ATTR(FREE_ADD_PARTIAL, free_add_partial);
@@ -4371,12 +4718,15 @@ static struct attribute *slab_attrs[] =
&objs_per_slab_attr.attr,
&order_attr.attr,
&min_partial_attr.attr,
+ &queue_size_attr.attr,
&batch_size_attr.attr,
+ &shared_queue_size_attr.attr,
&objects_attr.attr,
&objects_partial_attr.attr,
&partial_attr.attr,
&per_cpu_caches_attr.attr,
&cpu_queue_size_attr.attr,
+ &shared_caches_attr.attr,
&ctor_attr.attr,
&aliases_attr.attr,
&align_attr.attr,
@@ -4401,9 +4751,11 @@ static struct attribute *slab_attrs[] =
#endif
#ifdef CONFIG_SLUB_STATS
&alloc_fastpath_attr.attr,
+ &alloc_shared_attr.attr,
&alloc_direct_attr.attr,
&alloc_slowpath_attr.attr,
&free_fastpath_attr.attr,
+ &free_shared_attr.attr,
&free_direct_attr.attr,
&free_slowpath_attr.attr,
&free_add_partial_attr.attr,
@@ -4652,6 +5004,7 @@ static int __init slab_sysfs_init(void)
if (err)
printk(KERN_ERR "SLUB: Unable to add boot slab %s"
" to sysfs\n", s->name);
+ alloc_shared_caches(s);
}

while (alias_list) {
@@ -4708,6 +5061,24 @@ static void s_stop(struct seq_file *m, v
up_read(&slub_lock);
}

+static unsigned long shared_objects(struct kmem_cache *s)
+{
+ unsigned long shared = 0;
+ int n;
+ struct kmem_cache_queue **caches;
+
+ caches = shared_caches(s);
+ if (IS_ERR(caches))
+ return PTR_ERR(caches);
+
+ if (caches) {
+ for(n = 0; n < s->nr_shared; n++)
+ shared += caches[n]->objects;
+
+ kfree(caches);
+ }
+ return shared;
+}
static int s_show(struct seq_file *m, void *p)
{
unsigned long nr_partials = 0;
@@ -4737,9 +5108,10 @@ static int s_show(struct seq_file *m, vo
seq_printf(m, "%-17s %6lu %6lu %6u %4u %4d", s->name, nr_inuse,
nr_objs, s->size, oo_objects(s->oo),
(1 << oo_order(s->oo)));
- seq_printf(m, " : tunables %4u %4u %4u", s->queue, s->batch, 0);
+ seq_printf(m, " : tunables %4u %4u %4u", s->cpu_queue, s->batch, s->shared_queue);
+
seq_printf(m, " : slabdata %6lu %6lu %6lu", nr_slabs, nr_slabs,
- 0UL);
+ shared_objects(s));
seq_putc(m, '\n');
return 0;
}

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>

Christoph Lameter

2010-10-05 18:57:37 UTC

Provides a variety of functions that allow expiring objects from slabs.

kmem_cache_expire(struct kmem_cache ,int node)
Expire objects of a specific slab.

kmem_cache_expire_all(int node)
Walk through all caches and expire objects.

The functions return the number of bytes reclaimed.

Expiration works by scanning through the queues and partial
lists for untouched caches. Those are then reduced or reorganized.
Cache state is set to untouched after a expiration run.

Manual expiration may be done through the sysfs filesytem.

/sys/kernel/slab/<cache>/expire

can take a node number or -1 for global expiration.

A "cat" will display the number of bytes reclaimed for a given
expiration run.

SLAB performs a scan of all its slabs every 2 seconds.
The approach here means that the user (or the kernel) has more
control over the expiration of cached data and thereby more
control over the time when the OS can disturb the application
through extensive processing that likely severely disturbs the
per cpu caches.

Signed-off-by: Christoph Lameter <***@linux-foundation.org>

---
include/linux/slab.h | 3
include/linux/slub_def.h | 15 +
mm/slab.c | 12 +
mm/slob.c | 12 +
mm/slub.c | 419 +++++++++++++++++++++++++++++++++++++++--------
5 files changed, 395 insertions(+), 66 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2010-10-05 13:39:59.000000000 -0500
+++ linux-2.6/mm/slub.c 2010-10-05 13:40:04.000000000 -0500
@@ -1569,7 +1569,7 @@ static inline void init_alien_cache(stru
}

/* Determine a list of the active shared caches */
-struct kmem_cache_queue **shared_caches(struct kmem_cache *s)
+struct kmem_cache_queue **shared_caches(struct kmem_cache *s, int node)
{
int cpu;
struct kmem_cache_queue **caches;
@@ -1594,6 +1594,9 @@ struct kmem_cache_queue **shared_caches(
if (!q)
continue;

+ if (node != NUMA_NO_NODE && node != c->node)
+ continue;
+
for (n = 0; n < nr; n++)
if (caches[n] == q)
break;
@@ -1604,7 +1607,7 @@ struct kmem_cache_queue **shared_caches(
caches[nr++] = q;
}
caches[nr] = NULL;
- BUG_ON(nr != s->nr_shared);
+ BUG_ON(node == NUMA_NO_NODE && nr != s->nr_shared);
return caches;
}

@@ -1613,29 +1616,36 @@ struct kmem_cache_queue **shared_caches(
*/

#ifdef CONFIG_NUMA
+
+static inline struct kmem_cache_queue *__alien_cache(struct kmem_cache *s,
+ struct kmem_cache_queue *q, int node)
+{
+ void *p = q;
+
+ p -= (node << s->alien_shift);
+
+ return (struct kmem_cache_queue *)p;
+}
+
/* Given an allocation context determine the alien queue to use */
static inline struct kmem_cache_queue *alien_cache(struct kmem_cache *s,
struct kmem_cache_cpu *c, int node)
{
- void *p = c->q.shared;
-
/* If the cache does not have any alien caches return NULL */
- if (!aliens(s) || !p || node == c->node)
+ if (!aliens(s) || !c->q.shared || node == c->node)
return NULL;

/*
* Map [0..(c->node - 1)] -> [1..c->node].
*
* This effectively removes the current node (which is serviced by
- * the shared cachei) from the list and avoids hitting 0 (which would
+ * the shared cache) from the list and avoids hitting 0 (which would
* result in accessing the shared queue used for the cpu cache).
*/
if (node < c->node)
node++;

- p -= (node << s->alien_shift);
-
- return (struct kmem_cache_queue *)p;
+ return __alien_cache(s, c->q.shared, node);
}

static inline void drain_alien_caches(struct kmem_cache *s,
@@ -1776,7 +1786,7 @@ static int remove_shared_caches(struct k
struct kmem_cache_queue **caches;
int i;

- caches = shared_caches(s);
+ caches = shared_caches(s, NUMA_NO_NODE);
if (!caches)
return 0;
if (IS_ERR(caches))
@@ -3275,75 +3285,330 @@ void kfree(const void *x)
}
EXPORT_SYMBOL(kfree);

-/*
- * kmem_cache_shrink removes empty slabs from the partial lists and sorts
- * the remaining slabs by the number of items in use. The slabs with the
- * most items in use come first. New allocations will then fill those up
- * and thus they can be removed from the partial lists.
- *
- * The slabs with the least items are placed last. This results in them
- * being allocated from last increasing the chance that the last objects
- * are freed in them.
- */
-int kmem_cache_shrink(struct kmem_cache *s)
+static struct list_head *alloc_slabs_by_inuse(struct kmem_cache *s)
{
- int node;
- int i;
- struct kmem_cache_node *n;
- struct page *page;
- struct page *t;
int objects = oo_objects(s->max);
- struct list_head *slabs_by_inuse =
+ struct list_head *h =
kmalloc(sizeof(struct list_head) * objects, GFP_KERNEL);
+
+ return h;
+
+}
+
+static int shrink_partial_list(struct kmem_cache *s, int node)
+{
+ int i;
+ struct kmem_cache_node *n = get_node(s, node);
+ struct page *page;
+ struct page *t;
+ int reclaimed = 0;
unsigned long flags;
+ int objects = oo_objects(s->max);
+ struct list_head *slabs_by_inuse;
+
+ if (!n->nr_partial)
+ return 0;
+
+ slabs_by_inuse = alloc_slabs_by_inuse(s);
+
+ if (!slabs_by_inuse)
+ return -ENOMEM;
+
+ for (i = 0; i < objects; i++)
+ INIT_LIST_HEAD(slabs_by_inuse + i);
+
+ spin_lock_irqsave(&n->lock, flags);
+
+ /*
+ * Build lists indexed by the items in use in each slab.
+ *
+ * Note that concurrent frees may occur while we hold the
+ * list_lock. page->inuse here is the upper limit.
+ */
+ list_for_each_entry_safe(page, t, &n->partial, lru) {
+ if (all_objects_available(page)) {
+ remove_partial(n, page);
+ reclaimed += PAGE_SIZE << compound_order(page);
+ discard_slab(s, page);
+ } else {
+ list_move(&page->lru,
+ slabs_by_inuse + inuse(page));
+ }
+ }
+
+ /*
+ * Rebuild the partial list with the slabs filled up most
+ * first and the least used slabs at the end.
+ * This will cause the partial list to be shrunk during
+ * allocations and memory to be freed up when more objects
+ * are freed in pages at the tail.
+ */
+ for (i = objects - 1; i >= 0; i--)
+ list_splice(slabs_by_inuse + i, n->partial.prev);
+
+ n->touched = 0;
+ spin_unlock_irqrestore(&n->lock, flags);
+ kfree(slabs_by_inuse);
+ return reclaimed;
+}
+
+static int expire_cache(struct kmem_cache *s, struct kmem_cache_queue *q,
+ int lock)
+{
+ unsigned long flags = 0;
+ int n;
+
+ if (!q || queue_empty(q))
+ return 0;
+
+ if (lock)
+ spin_lock_irqsave(&q->lock, flags);
+ else
+ local_irq_save(flags);
+
+ n = drain_queue(s, q, s->batch);
+
+ if (lock)
+ spin_unlock_irqrestore(&q->lock, flags);
+ else
+ local_irq_restore(flags);
+
+ return n;
+}

- if (!slabs_by_inuse)
+static inline int node_match(int node, int n)
+{
+ return node == NUMA_NO_NODE || node == n;
+}
+
+static int expire_partials(struct kmem_cache *s, int node)
+{
+ struct kmem_cache_node *n = get_node(s, node);
+
+ if (!n->nr_partial || n->touched) {
+ n->touched = 0;
+ return 0;
+ }
+
+ /* Check error code */
+ return shrink_partial_list(s, node) *
+ PAGE_SHIFT << oo_order(s->oo);
+}
+
+static int expire_cpu_caches(struct kmem_cache *s, int node)
+{
+ cpumask_var_t saved_mask;
+ int reclaimed = 0;
+ int cpu;
+
+ if (!alloc_cpumask_var(&saved_mask, GFP_KERNEL))
return -ENOMEM;

- flush_all(s);
- for_each_node_state(node, N_NORMAL_MEMORY) {
- n = get_node(s, node);
+ cpumask_copy(saved_mask, &current->cpus_allowed);
+ for_each_online_cpu(cpu) {
+ struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu, cpu);
+ struct kmem_cache_queue *q = &c->q;

- if (!n->nr_partial)
- continue;
+ if (!q->touched && node_match(node, c->node)) {
+ /*
+ * Switch affinity to the target cpu to allow access
+ * to the cpu cache
+ */
+ set_cpus_allowed_ptr(current, &cpumask_of_cpu(cpu));
+ reclaimed += expire_cache(s, q, 0) * s->size;
+ }
+ q->touched = 0;
+ }
+ set_cpus_allowed_ptr(current, saved_mask);
+ free_cpumask_var(saved_mask);

- for (i = 0; i < objects; i++)
- INIT_LIST_HEAD(slabs_by_inuse + i);
+ return reclaimed;
+}

- spin_lock_irqsave(&n->lock, flags);
+#ifdef CONFIG_SMP
+static int expire_shared_caches(struct kmem_cache *s, int node)
+{
+ struct kmem_cache_queue **l;
+ int reclaimed = 0;
+ struct kmem_cache_queue **caches = shared_caches(s, node);

- /*
- * Build lists indexed by the items in use in each slab.
- *
- * Note that concurrent frees may occur while we hold the
- * lock. page->inuse here is the upper limit.
- */
- list_for_each_entry_safe(page, t, &n->partial, lru) {
- if (all_objects_available(page)) {
- /*
- * Must hold slab lock here because slab_free
- * may have freed the last object and be
- * waiting to release the slab.
- */
- remove_partial(n, page);
- discard_slab(s, page);
- } else {
- list_move(&page->lru,
- slabs_by_inuse + inuse(page));
+ if (!caches)
+ return -ENOMEM;
+
+ for (l = caches; *l; l++) {
+ struct kmem_cache_queue *q = *l;
+
+ if (!q->touched)
+ reclaimed += expire_cache(s, q, 1) * s->size;
+
+ q->touched = 0;
+ }
+
+ kfree(caches);
+ return reclaimed;
+}
+#endif
+
+#ifdef CONFIG_NUMA
+static int expire_alien_caches(struct kmem_cache *s, int nid)
+{
+ int reclaimed = 0;
+ struct kmem_cache_queue **caches = shared_caches(s, nid);
+
+ if (!caches)
+ return -ENOMEM;
+
+ if (aliens(s)) {
+ struct kmem_cache_queue **l;
+
+ for (l = caches; *l; l++) {
+ int node;
+
+ for (node = 0; node < nr_node_ids - 1;
+ node++) {
+ struct kmem_cache_queue *a =
+ __alien_cache(s, *l, node);
+
+ if (!a->touched)
+ reclaimed += expire_cache(s, a, 1)
+ * s->size;
+ a->touched = 0;
}
}
+ }
+ return reclaimed;
+}
+#endif

- /*
- * Rebuild the partial list with the slabs filled up most
- * first and the least used slabs at the end.
- */
- for (i = objects - 1; i >= 0; i--)
- list_splice(slabs_by_inuse + i, n->partial.prev);
+/*
+ * Cache expiration is called when the kernel is low on memory in a node
+ * or globally (specify node == NUMA_NO_NODE).
+ *
+ * Cache expiration works by reducing caching memory used by the allocator.
+ * It starts with caches that are not that important for performance.
+ * If it cannot retrieve memory in a low importance cache then it will
+ * start expiring data from more important caches.
+ * The function returns 0 when all caches have been expired and no
+ * objects are cached anymore.
+ *
+ * low impact Dropping of empty partial list slabs
+ * Drop a batch from the alien caches
+ * Drop a batch from the shared caches
+ * high impact Drop a batch from the cpu caches
+ */

- spin_unlock_irqrestore(&n->lock, flags);
+typedef int (*expire_t)(struct kmem_cache *,
+ int nid);
+
+static expire_t expire_methods[] =
+{
+ expire_partials,
+#ifdef CONFIG_SMP
+#ifdef CONFIG_NUMA
+ expire_alien_caches,
+#endif
+ expire_shared_caches,
+#endif
+ expire_cpu_caches,
+ NULL
+};
+
+long kmem_cache_expire(struct kmem_cache *s, int node)
+{
+ int reclaimed = 0;
+ int n;
+
+ for (n = 0; n < NR_EXPIRE; n++) {
+ if (node == NUMA_NO_NODE) {
+ for_each_node_state(node, N_NORMAL_MEMORY) {
+ int r = expire_methods[n](s, node);
+
+ if (r < 0) {
+ reclaimed = r;
+ goto out;
+ }
+ reclaimed += r;
+ }
+ } else
+ reclaimed = expire_methods[n](s, node);
}
+out:
+ return reclaimed;
+}
+
+static long __kmem_cache_expire_all(int node)
+{
+ struct kmem_cache *s;
+ int reclaimed = 0;
+ int n;
+
+ for (n = 0; n < NR_EXPIRE; n++) {
+ int r;
+
+ list_for_each_entry(s, &slab_caches, list) {
+
+ r = expire_methods[n](s, node);
+ if (r < 0)
+ return r;
+
+ reclaimed += r;
+ }
+ }
+ return reclaimed;
+}
+
+long kmem_cache_expire_all(int node)
+{
+ int reclaimed = 0;
+
+ /*
+ * Expiration may be done from reclaim. Therefore recursion
+ * is possible. The trylock avoids recusion issues and keeps
+ * lockdep happy.
+ *
+ * Take the write lock to ensure that only a single reclaimer
+ * is active at a time.
+ */
+ if (!down_write_trylock(&slub_lock))
+ return 0;
+
+ if (node == NUMA_NO_NODE) {
+ for_each_node_state(node, N_NORMAL_MEMORY) {
+ int r = __kmem_cache_expire_all(node);
+
+ if (r < 0) {
+ reclaimed = r;
+ goto out;
+ }
+ reclaimed += r;
+ }
+ } else
+ reclaimed = __kmem_cache_expire_all(node);
+
+out:
+ up_write(&slub_lock);
+ return reclaimed;
+}
+
+/*
+ * kmem_cache_shrink removes empty slabs from the partial lists and sorts
+ * the remaining slabs by the number of items in use. The slabs with the
+ * most items in use come first. New allocations will then fill those up
+ * and thus they can be removed from the partial lists.
+ *
+ * The slabs with the least items are placed last. This results in them
+ * being allocated from last increasing the chance that the last objects
+ * are freed in them.
+ */
+int kmem_cache_shrink(struct kmem_cache *s)
+{
+ int node;
+
+ flush_all(s);
+
+ for_each_node_state(node, N_NORMAL_MEMORY)
+ shrink_partial_list(s, node);

- kfree(slabs_by_inuse);
return 0;
}
EXPORT_SYMBOL(kmem_cache_shrink);
@@ -4530,7 +4795,7 @@ static ssize_t shared_caches_show(struct
struct kmem_cache_queue **caches;

down_read(&slub_lock);
- caches = shared_caches(s);
+ caches = shared_caches(s, NUMA_NO_NODE);
if (!caches) {
up_read(&slub_lock);
return -ENOENT;
@@ -4715,7 +4980,7 @@ static ssize_t alien_caches_show(struct
return -ENOSYS;

down_read(&slub_lock);
- caches = shared_caches(s);
+ caches = shared_caches(s, NUMA_NO_NODE);
if (!caches) {
up_read(&slub_lock);
return -ENOENT;
@@ -4994,6 +5259,29 @@ static ssize_t failslab_store(struct kme
SLAB_ATTR(failslab);
#endif

+static ssize_t expire_show(struct kmem_cache *s, char *buf)
+{
+ return sprintf(buf, "%lu\n", s->last_expired_bytes);
+}
+
+static ssize_t expire_store(struct kmem_cache *s,
+ const char *buf, size_t length)
+{
+ long node;
+ int err;
+
+ err = strict_strtol(buf, 10, &node);
+ if (err)
+ return err;
+
+ if (node > nr_node_ids || node < -1)
+ return -EINVAL;
+
+ s->last_expired_bytes = kmem_cache_expire(s, node);
+ return length;
+}
+SLAB_ATTR(expire);
+
static ssize_t shrink_show(struct kmem_cache *s, char *buf)
{
return 0;
@@ -5111,6 +5399,7 @@ static struct attribute *slab_attrs[] =
&reclaim_account_attr.attr,
&destroy_by_rcu_attr.attr,
&shrink_attr.attr,
+ &expire_attr.attr,
#ifdef CONFIG_SLUB_DEBUG
&total_objects_attr.attr,
&slabs_attr.attr,
@@ -5450,7 +5739,7 @@ static unsigned long shared_objects(stru
int n;
struct kmem_cache_queue **caches;

- caches = shared_caches(s);
+ caches = shared_caches(s, NUMA_NO_NODE);
if (IS_ERR(caches))
return PTR_ERR(caches);

Index: linux-2.6/include/linux/slab.h
===================================================================
--- linux-2.6.orig/include/linux/slab.h 2010-10-05 13:26:31.000000000 -0500
+++ linux-2.6/include/linux/slab.h 2010-10-05 13:40:04.000000000 -0500
@@ -103,12 +103,15 @@ struct kmem_cache *kmem_cache_create(con
void (*)(void *));
void kmem_cache_destroy(struct kmem_cache *);
int kmem_cache_shrink(struct kmem_cache *);
+long kmem_cache_expire(struct kmem_cache *, int);
void kmem_cache_free(struct kmem_cache *, void *);
unsigned int kmem_cache_size(struct kmem_cache *);
const char *kmem_cache_name(struct kmem_cache *);
int kern_ptr_validate(const void *ptr, unsigned long size);
int kmem_ptr_validate(struct kmem_cache *cachep, const void *ptr);

+long kmem_cache_expire_all(int node);
+
/*
* Please use this macro to create slab caches. Simply specify the
* name of the structure and maybe some flags that are listed above.
Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h 2010-10-05 13:39:59.000000000 -0500
+++ linux-2.6/include/linux/slub_def.h 2010-10-05 13:40:04.000000000 -0500
@@ -35,6 +35,18 @@ enum stat_item {
ORDER_FALLBACK, /* Number of times fallback was necessary */
NR_SLUB_STAT_ITEMS };

+enum expire_item {
+ EXPIRE_PARTIAL,
+#ifdef CONFIG_NUMA
+ EXPIRE_ALIEN_CACHES,
+#endif
+#ifdef CONFIG_SMP
+ EXPIRE_SHARED_CACHES,
+#endif
+ EXPIRE_CPU_CACHES,
+ NR_EXPIRE
+};
+
/*
* Queueing structure used for per cpu, l3 cache and alien queueing.
*
@@ -47,7 +59,7 @@ enum stat_item {
* Foreign objects will then be on the queue until memory becomes available
* again on the node. Freeing objects always occurs to the correct node.
*
- * Which means that queueing is no longer effective since
+ * Which means that queueing is then no longer so effective since
* objects are freed to the alien caches after having been dequeued from
* the per cpu queue.
*/
@@ -122,6 +134,7 @@ struct kmem_cache {
struct list_head list; /* List of slab caches */
#ifdef CONFIG_SYSFS
struct kobject kobj; /* For sysfs */
+ unsigned long last_expired_bytes;
#endif
int shared_queue_sysfs; /* Desired shared queue size */

Index: linux-2.6/mm/slab.c
===================================================================
--- linux-2.6.orig/mm/slab.c 2010-10-05 13:26:31.000000000 -0500
+++ linux-2.6/mm/slab.c 2010-10-05 13:40:04.000000000 -0500
@@ -2592,6 +2592,18 @@ int kmem_cache_shrink(struct kmem_cache
}
EXPORT_SYMBOL(kmem_cache_shrink);

+unsigned long kmem_cache_expire(struct kmem_cache *cachep, int node)
+{
+ return 0;
+}
+EXPORT_SYMBOL(kmem_cache_expire);
+
+unsigned long kmem_cache_expire_all(int node)
+{
+ return 0;
+}
+EXPORT_SYMBOL(kmem_cache_expire_all);
+
/**
* kmem_cache_destroy - delete a cache
* @cachep: the cache to destroy
Index: linux-2.6/mm/slob.c
===================================================================
--- linux-2.6.orig/mm/slob.c 2010-10-05 13:26:31.000000000 -0500
+++ linux-2.6/mm/slob.c 2010-10-05 13:40:04.000000000 -0500
@@ -678,6 +678,18 @@ int kmem_cache_shrink(struct kmem_cache
}
EXPORT_SYMBOL(kmem_cache_shrink);

+unsigned long kmem_cache_expire(struct kmem_cache *cachep, int node)
+{
+ return 0;
+}
+EXPORT_SYMBOL(kmem_cache_expire);
+
+unsigned long kmem_cache_expire_all(int node)
+{
+ return 0;
+}
+EXPORT_SYMBOL(kmem_cache_expire_all);
+
int kmem_ptr_validate(struct kmem_cache *a, const void *b)
{
return 0;

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>

Christoph Lameter

2010-10-05 18:57:41 UTC

Add counters and consistently count alien allocations that
have to go to the page allocator.

Signed-off-by: Christoph Lameter <***@linux.com>

---
include/linux/slub_def.h | 2 ++
mm/slub.c | 7 ++++++-
2 files changed, 8 insertions(+), 1 deletion(-)

Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h 2010-10-05 13:40:04.000000000 -0500
+++ linux-2.6/include/linux/slub_def.h 2010-10-05 13:40:14.000000000 -0500
@@ -19,12 +19,14 @@ enum stat_item {
ALLOC_FASTPATH, /* Allocation from cpu queue */
ALLOC_SHARED, /* Allocation caused a shared cache transaction */
ALLOC_ALIEN, /* Allocation from alien cache */
+ ALLOC_ALIEN_SLOW, /* Alien allocation from partial */
ALLOC_DIRECT, /* Allocation bypassing queueing */
ALLOC_SLOWPATH, /* Allocation required refilling of queue */
FREE_FASTPATH, /* Free to cpu queue */
FREE_SHARED, /* Free caused a shared cache transaction */
FREE_DIRECT, /* Free bypassing queues */
FREE_ALIEN, /* Free to alien node */
+ FREE_ALIEN_SLOW, /* Alien free had to drain cache */
FREE_SLOWPATH, /* Required pushing objects out of the queue */
FREE_ADD_PARTIAL, /* Freeing moved slab to partial list */
FREE_REMOVE_PARTIAL, /* Freeing removed from partial list */
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2010-10-05 13:40:11.000000000 -0500
+++ linux-2.6/mm/slub.c 2010-10-05 13:40:14.000000000 -0500
@@ -2021,6 +2021,7 @@ redo:
}
}

+ stat(s, ALLOC_ALIEN_SLOW);
spin_lock(&n->lock);
if (!list_empty(&n->partial)) {

@@ -2108,7 +2109,7 @@ static void slab_free_alien(struct kmem_
if (touched)
stat(s, FREE_ALIEN);
else
- stat(s, FREE_SLOWPATH);
+ stat(s, FREE_ALIEN_SLOW);

} else {
/* Direct free to the slab */
@@ -5430,11 +5431,13 @@ SLAB_ATTR(text); \
STAT_ATTR(ALLOC_FASTPATH, alloc_fastpath);
STAT_ATTR(ALLOC_SHARED, alloc_shared);
STAT_ATTR(ALLOC_ALIEN, alloc_alien);
+STAT_ATTR(ALLOC_ALIEN_SLOW, alloc_alien_slow);
STAT_ATTR(ALLOC_DIRECT, alloc_direct);
STAT_ATTR(ALLOC_SLOWPATH, alloc_slowpath);
STAT_ATTR(FREE_FASTPATH, free_fastpath);
STAT_ATTR(FREE_SHARED, free_shared);
STAT_ATTR(FREE_ALIEN, free_alien);
+STAT_ATTR(FREE_ALIEN_SLOW, free_alien_slow);
STAT_ATTR(FREE_DIRECT, free_direct);
STAT_ATTR(FREE_SLOWPATH, free_slowpath);
STAT_ATTR(FREE_ADD_PARTIAL, free_add_partial);
@@ -5494,9 +5497,11 @@ static struct attribute *slab_attrs[] =
&alloc_alien_attr.attr,
&alloc_direct_attr.attr,
&alloc_slowpath_attr.attr,
+ &alloc_alien_slow_attr.attr,
&free_fastpath_attr.attr,
&free_shared_attr.attr,
&free_alien_attr.attr,
+ &free_alien_slow_attr.attr,
&free_direct_attr.attr,
&free_slowpath_attr.attr,
&free_add_partial_attr.attr,

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>

Christoph Lameter

2010-10-05 18:57:38 UTC

We already do slab reclaim during page reclaim. Add a call to
object expiration in slub whenever shrink_slab() is called.
If the reclaim is zone specific then use the node of the zone
to restrict reclaim in slub.

Signed-off-by: Christoph Lameter <***@linux.com>

---
mm/vmscan.c | 4 ++++
1 file changed, 4 insertions(+)

Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c 2010-10-04 08:14:25.000000000 -0500
+++ linux-2.6/mm/vmscan.c 2010-10-04 08:26:47.000000000 -0500
@@ -1917,6 +1917,7 @@ static unsigned long do_try_to_free_page
sc->nr_reclaimed += reclaim_state->reclaimed_slab;
reclaim_state->reclaimed_slab = 0;
}
+ kmem_cache_expire_all(NUMA_NO_NODE);
}
total_scanned += sc->nr_scanned;
if (sc->nr_reclaimed >= sc->nr_to_reclaim)
@@ -2221,6 +2222,7 @@ loop_again:
reclaim_state->reclaimed_slab = 0;
nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL,
lru_pages);
+ kmem_cache_expire_all(zone_to_nid(zone));
sc.nr_reclaimed += reclaim_state->reclaimed_slab;
total_scanned += sc.nr_scanned;
if (zone->all_unreclaimable)
@@ -2722,6 +2724,8 @@ static int __zone_reclaim(struct zone *z
break;
}

+ kmem_cache_expire_all(zone_to_nid(zone));
+
/*
* Update nr_reclaimed by the number of slab pages we
* reclaimed from this zone.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>

Christoph Lameter

2010-10-05 18:57:32 UTC

Slub applies policies and cpuset restriction currently only on the page
level. The patch here changes that to apply policies to individual allocations
(like SLAB). This comes with a cost of increased complexiy in the allocator.

The allocation does not build alien queues (later patch) and is a bit
ineffective since a slab has to be taken from the partial lists (via lock
and unlock) and possibly shifted back after taking one object out of it.

Memory policies and cpuset redirection is only applied to slabs marked with
SLAB_MEM_SPREAD (also like SLAB).

Use Lee Schermerhorns new *_mem functionality to always find the nearest
node in case we are on a memoryless node.

Note that the handling of queues is significantly different from SLAB.
SLAB has pure queues that only contain objects from the respective nodes
and therefore has to undergo fallback functions if nodes are exhausted.

The approach here has queues that usually contain objects from the
corresponding NUMA nodes. If nodes are exhausted then objects from
foreign nodes may appear in queues as the page allocator falls back
to other nodes. The foreign objects will be freed back to the
correct queues though so that these conditions are temporary.

The caching effect of the queues will degrade in situations when memory
on some nodes is no longer available.

Signed-off-by: Christoph Lameter <***@linux-foundation.org>

---
include/linux/slub_def.h | 22 ++++
mm/slub.c | 208 ++++++++++++++++++-----------------------------
2 files changed, 100 insertions(+), 130 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2010-10-04 08:26:09.000000000 -0500
+++ linux-2.6/mm/slub.c 2010-10-04 08:26:27.000000000 -0500
@@ -1148,10 +1148,7 @@ static inline struct page *alloc_slab_pa

flags |= __GFP_NOTRACK;

- if (node == NUMA_NO_NODE)
- return alloc_pages(flags, order);
- else
- return alloc_pages_exact_node(node, flags, order);
+ return alloc_pages_exact_node(node, flags, order);
}

static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
@@ -1376,14 +1373,15 @@ static inline int lock_and_freeze_slab(s
/*
* Try to allocate a partial slab from a specific node.
*/
-static struct page *get_partial_node(struct kmem_cache_node *n)
+static struct page *get_partial(struct kmem_cache *s, int node)
{
struct page *page;
+ struct kmem_cache_node *n = get_node(s, node);

/*
* Racy check. If we mistakenly see no partial slabs then we
* just allocate an empty slab. If we mistakenly try to get a
- * partial slab and there is none available then get_partials()
+ * partial slab and there is none available then get_partial()
* will return NULL.
*/
if (!n || !n->nr_partial)
@@ -1400,76 +1398,6 @@ out:
}

/*
- * Get a page from somewhere. Search in increasing NUMA distances.
- */
-static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags)
-{
-#ifdef CONFIG_NUMA
- struct zonelist *zonelist;
- struct zoneref *z;
- struct zone *zone;
- enum zone_type high_zoneidx = gfp_zone(flags);
- struct page *page;
-
- /*
- * The defrag ratio allows a configuration of the tradeoffs between
- * inter node defragmentation and node local allocations. A lower
- * defrag_ratio increases the tendency to do local allocations
- * instead of attempting to obtain partial slabs from other nodes.
- *
- * If the defrag_ratio is set to 0 then kmalloc() always
- * returns node local objects. If the ratio is higher then kmalloc()
- * may return off node objects because partial slabs are obtained
- * from other nodes and filled up.
- *
- * If /sys/kernel/slab/xx/defrag_ratio is set to 100 (which makes
- * defrag_ratio = 1000) then every (well almost) allocation will
- * first attempt to defrag slab caches on other nodes. This means
- * scanning over all nodes to look for partial slabs which may be
- * expensive if we do it every time we are trying to find a slab
- * with available objects.
- */
- if (!s->remote_node_defrag_ratio ||
- get_cycles() % 1024 > s->remote_node_defrag_ratio)
- return NULL;
-
- get_mems_allowed();
- zonelist = node_zonelist(slab_node(current->mempolicy), flags);
- for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
- struct kmem_cache_node *n;
-
- n = get_node(s, zone_to_nid(zone));
-
- if (n && cpuset_zone_allowed_hardwall(zone, flags) &&
- n->nr_partial > s->min_partial) {
- page = get_partial_node(n);
- if (page) {
- put_mems_allowed();
- return page;
- }
- }
- }
- put_mems_allowed();
-#endif
- return NULL;
-}
-
-/*
- * Get a partial page, lock it and return it.
- */
-static struct page *get_partial(struct kmem_cache *s, gfp_t flags, int node)
-{
- struct page *page;
- int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
-
- page = get_partial_node(get_node(s, searchnode));
- if (page || node != -1)
- return page;
-
- return get_any_partial(s, flags);
-}
-
-/*
* Move the vector of objects back to the slab pages they came from
*/
void drain_objects(struct kmem_cache *s, void **object, int nr)
@@ -1650,6 +1578,7 @@ struct kmem_cache_cpu *alloc_kmem_cache_
struct kmem_cache_cpu *c = per_cpu_ptr(k, cpu);

c->q.max = max;
+ c->node = cpu_to_mem(cpu);
}

s->cpu_queue = max;
@@ -1710,19 +1639,6 @@ static void resize_cpu_queue(struct kmem
}
#endif

-/*
- * Check if the objects in a per cpu structure fit numa
- * locality expectations.
- */
-static inline int node_match(struct kmem_cache_cpu *c, int node)
-{
-#ifdef CONFIG_NUMA
- if (node != NUMA_NO_NODE && c->node != node)
- return 0;
-#endif
- return 1;
-}
-
static unsigned long count_partial(struct kmem_cache_node *n,
int (*get_count)(struct page *))
{
@@ -1782,6 +1698,30 @@ slab_out_of_memory(struct kmem_cache *s,
}

/*
+ * Determine the final numa node from which the allocation will
+ * be occurring. Allocations can be redirected for slabs marked
+ * with SLAB_MEM_SPREAD by memory policies and cpusets options.
+ */
+static inline int find_numa_node(struct kmem_cache *s,
+ int node, int local_node)
+{
+#ifdef CONFIG_NUMA
+ if (unlikely(s->flags & SLAB_MEM_SPREAD)) {
+ if (node == NUMA_NO_NODE && !in_interrupt()) {
+ if (cpuset_do_slab_mem_spread())
+ return cpuset_mem_spread_node();
+
+ get_mems_allowed();
+ if (current->mempolicy)
+ local_node = slab_node(current->mempolicy);
+ put_mems_allowed();
+ }
+ }
+#endif
+ return local_node;
+}
+
+/*
* Retrieve pointers to nr objects from a slab into the object array.
* Slab must be locked.
*/
@@ -1839,12 +1779,49 @@ void to_lists(struct kmem_cache *s, stru

/* Handling of objects from other nodes */

+static void *slab_alloc_node(struct kmem_cache *s, struct kmem_cache_cpu *c,
+ gfp_t gfpflags, int node)
+{
+#ifdef CONFIG_NUMA
+ struct page *page;
+ void *object;
+
+ page = get_partial(s, node);
+ if (!page) {
+ gfpflags &= gfp_allowed_mask;
+
+ if (gfpflags & __GFP_WAIT)
+ local_irq_enable();
+
+ page = new_slab(s, gfpflags, node);
+
+ if (gfpflags & __GFP_WAIT)
+ local_irq_disable();
+
+ if (!page)
+ return NULL;
+
+ slab_lock(page);
+ }
+
+ retrieve_objects(s, page, &object, 1);
+ stat(s, ALLOC_DIRECT);
+
+ to_lists(s, page, 0);
+ slab_unlock(page);
+ return object;
+#else
+ return NULL;
+#endif
+}
+
static void slab_free_alien(struct kmem_cache *s,
struct kmem_cache_cpu *c, struct page *page, void *object, int node)
{
#ifdef CONFIG_NUMA
/* Direct free to the slab */
drain_objects(s, &object, 1);
+ stat(s, FREE_DIRECT);
#endif
}

@@ -1864,18 +1841,21 @@ static void *slab_alloc(struct kmem_cach
redo:
local_irq_save(flags);
c = __this_cpu_ptr(s->cpu);
- q = &c->q;
- if (unlikely(queue_empty(q) || !node_match(c, node))) {

- if (unlikely(!node_match(c, node))) {
- flush_cpu_objects(s, c);
- c->node = node;
- }
+ node = find_numa_node(s, node, c->node);
+ if (unlikely(node != c->node)) {
+ object = slab_alloc_node(s, c, gfpflags, node);
+ if (!object)
+ goto oom;
+ goto got_it;
+ }
+ q = &c->q;
+ if (unlikely(queue_empty(q))) {

while (q->objects < s->batch) {
struct page *new;

- new = get_partial(s, gfpflags & ~__GFP_ZERO, node);
+ new = get_partial(s, node);
if (unlikely(!new)) {

gfpflags &= gfp_allowed_mask;
@@ -1914,6 +1894,7 @@ redo:

object = queue_get(q);

+got_it:
if (kmem_cache_debug(s)) {
if (!alloc_debug_processing(s, object, addr))
goto redo;
@@ -1998,7 +1979,6 @@ static void slab_free(struct kmem_cache

if (unlikely(node != c->node)) {
slab_free_alien(s, c, page, x, node);
- stat(s, FREE_ALIEN);
goto out;
}
}
@@ -2462,9 +2442,6 @@ static int kmem_cache_open(struct kmem_c
*/
set_min_partial(s, ilog2(s->size));
s->refcount = 1;
-#ifdef CONFIG_NUMA
- s->remote_node_defrag_ratio = 1000;
-#endif
if (!init_kmem_cache_nodes(s))
goto error;

@@ -4362,30 +4339,6 @@ static ssize_t shrink_store(struct kmem_
}
SLAB_ATTR(shrink);

-#ifdef CONFIG_NUMA
-static ssize_t remote_node_defrag_ratio_show(struct kmem_cache *s, char *buf)
-{
- return sprintf(buf, "%d\n", s->remote_node_defrag_ratio / 10);
-}
-
-static ssize_t remote_node_defrag_ratio_store(struct kmem_cache *s,
- const char *buf, size_t length)
-{
- unsigned long ratio;
- int err;
-
- err = strict_strtoul(buf, 10, &ratio);
- if (err)
- return err;
-
- if (ratio <= 100)
- s->remote_node_defrag_ratio = ratio * 10;
-
- return length;
-}
-SLAB_ATTR(remote_node_defrag_ratio);
-#endif
-
#ifdef CONFIG_SLUB_STATS
static int show_stat(struct kmem_cache *s, char *buf, enum stat_item si)
{
@@ -4444,8 +4397,10 @@ static ssize_t text##_store(struct kmem_
SLAB_ATTR(text); \

STAT_ATTR(ALLOC_FASTPATH, alloc_fastpath);
+STAT_ATTR(ALLOC_DIRECT, alloc_direct);
STAT_ATTR(ALLOC_SLOWPATH, alloc_slowpath);
STAT_ATTR(FREE_FASTPATH, free_fastpath);
+STAT_ATTR(FREE_DIRECT, free_direct);
STAT_ATTR(FREE_SLOWPATH, free_slowpath);
STAT_ATTR(FREE_ADD_PARTIAL, free_add_partial);
STAT_ATTR(FREE_REMOVE_PARTIAL, free_remove_partial);
@@ -4490,13 +4445,12 @@ static struct attribute *slab_attrs[] =
#ifdef CONFIG_ZONE_DMA
&cache_dma_attr.attr,
#endif
-#ifdef CONFIG_NUMA
- &remote_node_defrag_ratio_attr.attr,
-#endif
#ifdef CONFIG_SLUB_STATS
&alloc_fastpath_attr.attr,
+ &alloc_direct_attr.attr,
&alloc_slowpath_attr.attr,
&free_fastpath_attr.attr,
+ &free_direct_attr.attr,
&free_slowpath_attr.attr,
&free_add_partial_attr.attr,
&free_remove_partial_attr.attr,
Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h 2010-10-04 08:26:02.000000000 -0500
+++ linux-2.6/include/linux/slub_def.h 2010-10-04 08:26:27.000000000 -0500
@@ -17,20 +17,36 @@

enum stat_item {
ALLOC_FASTPATH, /* Allocation from cpu queue */
+ ALLOC_DIRECT, /* Allocation bypassing queueing */
ALLOC_SLOWPATH, /* Allocation required refilling of queue */
FREE_FASTPATH, /* Free to cpu queue */
+ FREE_DIRECT, /* Free bypassing queues */
FREE_SLOWPATH, /* Required pushing objects out of the queue */
FREE_ADD_PARTIAL, /* Freeing moved slab to partial list */
FREE_REMOVE_PARTIAL, /* Freeing removed from partial list */
ALLOC_FROM_PARTIAL, /* slab with objects acquired from partial */
ALLOC_SLAB, /* New slab acquired from page allocator */
- FREE_ALIEN, /* Free to alien node */
FREE_SLAB, /* Slab freed to the page allocator */
QUEUE_FLUSH, /* Flushing of the per cpu queue */
ORDER_FALLBACK, /* Number of times fallback was necessary */
NR_SLUB_STAT_ITEMS };

-/* Queueing structure used for per cpu, l3 cache and alien queueing */
+/*
+ * Queueing structure used for per cpu, l3 cache and alien queueing.
+ *
+ * Queues contain objects from a particular node.
+ * Per cpu and shared queues from kmem_cache_cpu->node
+ * alien caches from other nodes.
+ *
+ * However, this is not strictly enforced if the page allocator redirects
+ * allocation to other nodes because f.e. there is no memory on the node.
+ * Foreign objects will then be on the queue until memory becomes available
+ * again on the node. Freeing objects always occurs to the correct node.
+ *
+ * Which means that queueing is no longer effective since
+ * objects are freed to the alien caches after having been dequeued from
+ * the per cpu queue.
+ */
struct kmem_cache_queue {
int objects; /* Available objects */
int max; /* Queue capacity */
@@ -41,7 +57,7 @@ struct kmem_cache_cpu {
#ifdef CONFIG_SLUB_STATS
unsigned stat[NR_SLUB_STAT_ITEMS];
#endif
- int node; /* objects only from this numa node */
+ int node; /* The memory node local to the cpu */
struct kmem_cache_queue q;
};

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>

Christoph Lameter

2010-10-05 18:57:35 UTC

Alien caches are essential to track cachelines from a foreign node that are
present in a local cpu cache. They are therefore a form of the prior
introduced shared cache. Alien caches of the number of nodes minus one are
allocated for *each* lowest level shared cpu cache.

SLABs problem in this area is that the cpu caches are not properly tracked.
If there are multiple cpu caches on the same node then SLAB may not
properly track cache hotness of objects.

Alien caches are sizes differently than shared caches but are allocated
in the same contiguous memory area. The shared cache pointer is used
to reach the alien caches too. At positive offsets we fine shared cache
objects. At negative objects the alien caches are placed.

Alien caches can be switched off and configured on a cache by cache
basis using files in /sys/kernel/slab/<cache>/alien_queue_size.

Alien status is available in /sys/kernel/slab/<cache>/alien_caches.

/sys/kernel/slab/TCP$ cat alien_caches
9 C0,4,8,12,16,20,24,28=9[N1=3/30:N2=1/30:N3=5/30]
C1,5,9,13,17,21,25,29=5[N0=1/30:N2=3/30:N3=1/30]
C2,6,10,14,18,22,26,30=2[N0=0/30:N1=1/30:N3=1/30]
C3,7,11,15,19,23,27,31=5[N0=2/30:N1=1/30:N2=2/30]

Alien caches are displayed for a 4 node machine for each of the l3 caching
domains. For each domain we have the foreign nodes listed with the number
of objects queued for each node within the l3 caching domain.

Signed-off-by: Christoph Lameter <***@linux.com>

---
include/linux/slub_def.h | 6
mm/slub.c | 403 ++++++++++++++++++++++++++++++++++++++++++++---
2 files changed, 387 insertions(+), 22 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2010-10-05 13:36:14.000000000 -0500
+++ linux-2.6/mm/slub.c 2010-10-05 13:39:26.000000000 -0500
@@ -38,6 +38,9 @@
* The slub_lock semaphore protects against configuration modifications like
* adding new queues, reconfiguring queues and removing queues.
*
+ * Nesting:
+ * The per node lock nests inside of the alien lock.
+ *
* Interrupts are disabled during allocation and deallocation in order to
* make the slab allocator safe to use in the context of an irq. In addition
* interrupts are disabled to ensure that the processor does not change
@@ -118,6 +121,16 @@ static inline int kmem_cache_debug(struc

/* Internal SLUB flags */
#define __OBJECT_POISON 0x80000000UL /* Poison object */
+#define __ALIEN_CACHE 0x20000000UL /* Slab has alien caches */
+
+static inline int aliens(struct kmem_cache *s)
+{
+#ifdef CONFIG_NUMA
+ return (s->flags & __ALIEN_CACHE) != 0;
+#else
+ return 0;
+#endif
+}

static int kmem_size = sizeof(struct kmem_cache);

@@ -1455,6 +1468,9 @@ static inline int drain_shared_cache(str
return n;
}

+static void drain_alien_caches(struct kmem_cache *s,
+ struct kmem_cache_cpu *c);
+
/*
* Drain all objects from a per cpu queue
*/
@@ -1464,6 +1480,7 @@ static void flush_cpu_objects(struct kme

drain_queue(s, q, q->objects);
drain_shared_cache(s, q->shared);
+ drain_alien_caches(s, c);
stat(s, QUEUE_FLUSH);
}

@@ -1539,6 +1556,13 @@ static inline void init_shared_cache(str
q->objects =0;
}

+static inline void init_alien_cache(struct kmem_cache_queue *q, int max)
+{
+ spin_lock_init(&q->alien_lock);
+ q->max = max;
+ q->objects =0;
+}
+

/* Determine a list of the active shared caches */
struct kmem_cache_queue **shared_caches(struct kmem_cache *s)
@@ -1580,6 +1604,50 @@ struct kmem_cache_queue **shared_caches(
return caches;
}

+/*
+ * Alien caches which are also shared caches
+ */
+
+#ifdef CONFIG_NUMA
+/* Given an allocation context determine the alien queue to use */
+static inline struct kmem_cache_queue *alien_cache(struct kmem_cache *s,
+ struct kmem_cache_cpu *c, int node)
+{
+ void *p = c->q.shared;
+
+ /* If the cache does not have any alien caches return NULL */
+ if (!aliens(s) || !p || node == c->node)
+ return NULL;
+
+ /*
+ * Map [0..(c->node - 1)] -> [1..c->node].
+ *
+ * This effectively removes the current node (which is serviced by
+ * the shared cachei) from the list and avoids hitting 0 (which would
+ * result in accessing the shared queue used for the cpu cache).
+ */
+ if (node < c->node)
+ node++;
+
+ p -= (node << s->alien_shift);
+
+ return (struct kmem_cache_queue *)p;
+}
+
+static inline void drain_alien_caches(struct kmem_cache *s,
+ struct kmem_cache_cpu *c)
+{
+ int node;
+
+ for_each_node_state(node, N_NORMAL_MEMORY)
+ drain_shared_cache(s, alien_cache(s, c, node));
+}
+
+#else
+static inline void drain_alien_caches(struct kmem_cache *s,
+ struct kmem_cache_cpu *c) {}
+#endif
+
static struct kmem_cache *get_slab(size_t size, gfp_t flags);

/* Map of cpus that have no siblings or where we have broken topolocy info */
@@ -1593,6 +1661,13 @@ struct kmem_cache_queue *alloc_shared_ca
int size;
void *p;
int cpu;
+ int alien_max = 0;
+ int alien_size = 0;
+
+ if (aliens(s)) {
+ alien_size = (nr_node_ids - 1) << s->alien_shift;
+ alien_max = shared_cache_capacity(1 << s->alien_shift);
+ }

/*
* Determine the size. Round it up to the size that a kmalloc cache
@@ -1600,20 +1675,34 @@ struct kmem_cache_queue *alloc_shared_ca
* power of 2 especially on machines that have large kmalloc
* alignment requirements.
*/
- size = shared_cache_size(s->shared_queue_sysfs);
- if (size < PAGE_SIZE / 2)
+ size = shared_cache_size(s->shared_queue_sysfs) + alien_size;
+ if (size <= PAGE_SIZE / 2)
size = get_slab(size, GFP_KERNEL)->objsize;
else
size = PAGE_SIZE << get_order(size);

- max = shared_cache_capacity(size);
+ max = shared_cache_capacity(size - alien_size);

/* Allocate shared cache */
p = kmalloc_node(size, GFP_KERNEL | __GFP_ZERO, node);
if (!p)
return NULL;
- l = p;
+
+ l = p + alien_size;
init_shared_cache(l, max);
+#ifdef CONFIG_NUMA
+ /* And initialize the alien caches now */
+ if (aliens(s)) {
+ int node;
+
+ for (node = 0; node < nr_node_ids - 1; node++) {
+ struct kmem_cache_queue *a =
+ p + (node << s->alien_shift);
+
+ init_alien_cache(a, alien_max);
+ }
+ }
+#endif

/* Link all cpus in this group to the shared cache */
for_each_cpu(cpu, map)
@@ -1675,6 +1764,7 @@ static void __remove_shared_cache(void *

c->q.shared = NULL;
drain_shared_cache(s, q);
+ drain_alien_caches(s, c);
}

static int remove_shared_caches(struct kmem_cache *s)
@@ -1694,6 +1784,9 @@ static int remove_shared_caches(struct k
for(i = 0; i < s->nr_shared; i++) {
void *p = caches[i];

+ if (aliens(s))
+ p -= (nr_node_ids - 1) << s->alien_shift;
+
kfree(p);
}

@@ -1897,14 +1990,35 @@ static void *slab_alloc_node(struct kmem
gfp_t gfpflags, int node)
{
#ifdef CONFIG_NUMA
+ struct kmem_cache_queue *a = alien_cache(s, c, node);
struct page *page;
void *object;
struct kmem_cache_node *n = get_node(s, node);

+ if (a) {
+redo:
+ spin_lock(&a->lock);
+ if (likely(!queue_empty(a))) {
+ object = queue_get(a);
+ spin_unlock(&a->lock);
+ stat(s, ALLOC_ALIEN);
+ return object;
+ }
+ }
+
spin_lock(&n->lock);
- if (list_empty(&n->partial)) {
+ if (!list_empty(&n->partial)) {
+
+ page = list_entry(n->partial.prev, struct page, lru);
+ stat(s, ALLOC_FROM_PARTIAL);
+
+ } else {

spin_unlock(&n->lock);
+
+ if (a)
+ spin_unlock(&a->lock);
+
gfpflags &= gfp_allowed_mask;

if (gfpflags & __GFP_WAIT)
@@ -1918,13 +2032,26 @@ static void *slab_alloc_node(struct kmem
if (!page)
return NULL;

+ if (a)
+ spin_lock(&a->lock);
+
+ /* Node and alien cache may have changed ! */
+ node = page_to_nid(page);
+ n = get_node(s, node);
+
spin_lock(&n->lock);
+ stat(s, ALLOC_SLAB);
+ }

- } else
- page = list_entry(n->partial.prev, struct page, lru);
+ if (a) {

- retrieve_objects(s, page, &object, 1);
- stat(s, ALLOC_DIRECT);
+ refill_queue(s, a, page, available(page));
+ spin_unlock(&a->lock);
+
+ } else {
+ retrieve_objects(s, page, &object, 1);
+ stat(s, ALLOC_DIRECT);
+ }

if (!all_objects_used(page)) {

@@ -1935,6 +2062,10 @@ static void *slab_alloc_node(struct kmem
partial_to_full(s, n, page);

spin_unlock(&n->lock);
+
+ if (a)
+ goto redo;
+
return object;
#else
return NULL;
@@ -1945,9 +2076,29 @@ static void slab_free_alien(struct kmem_
struct kmem_cache_cpu *c, struct page *page, void *object, int node)
{
#ifdef CONFIG_NUMA
- /* Direct free to the slab */
- drain_objects(s, &object, 1);
- stat(s, FREE_DIRECT);
+ struct kmem_cache_queue *a = alien_cache(s, c, node);
+
+ if (a) {
+ int slow = 0;
+
+ spin_lock(&a->lock);
+ while (unlikely(queue_full(a))) {
+ drain_queue(s, a, s->batch);
+ slow = 1;
+ }
+ queue_put(a, object);
+ spin_unlock(&a->lock);
+
+ if (slow)
+ stat(s, FREE_SLOWPATH);
+ else
+ stat(s, FREE_ALIEN);
+
+ } else {
+ /* Direct free to the slab */
+ drain_objects(s, &object, 1);
+ stat(s, FREE_DIRECT);
+ }
#endif
}

@@ -2038,12 +2189,13 @@ got_object:
if (all_objects_used(page))
partial_to_full(s, n, page);

- stat(s, ALLOC_FROM_PARTIAL);
}
spin_unlock(&n->lock);

- if (!queue_empty(q))
+ if (!queue_empty(q)) {
+ stat(s, ALLOC_FROM_PARTIAL);
goto get_object;
+ }

gfpflags &= gfp_allowed_mask;
/* Refill from free pages */
@@ -2294,7 +2446,6 @@ static inline int slab_order(int size, i
continue;

rem = slab_size % size;
-
if (rem <= slab_size / fract_leftover)
break;

@@ -2659,9 +2810,52 @@ static int kmem_cache_open(struct kmem_c
s->queue = initial_queue_size(s->size);
s->batch = (s->queue + 1) / 2;

+#ifdef CONFIG_NUMA
+ if (nr_node_ids > 1) {
+ /*
+ * Alien cache configuration. The more NUMA nodes we have the
+ * smaller the alien caches become since the penalties in terms
+ * of space and latency increase. The user will have code for
+ * locality on these boxes anyways since a large portion of
+ * memory will be distant to the processor.
+ *
+ * A set of alien caches is allocated for each lowest level
+ * cpu cache. The alien set covers all nodes except the node
+ * that is nearest to the processor.
+ *
+ * Create large alien cache for small node configuration so
+ * that these can work like shared caches do to preserve the
+ * cpu cache hot state of objects.
+ */
+ int lines = fls(ALIGN(shared_cache_size(s->queue),
+ cache_line_size()) -1);
+ int min = fls(cache_line_size() - 1);
+
+ /* Limit the sizes of the alien caches to some sane values */
+ if (nr_node_ids <= 4)
+ /*
+ * Keep the sizes roughly the same as the shared cache
+ * unless it gets too huge.
+ */
+ s->alien_shift = min(PAGE_SHIFT - 1, lines);
+
+ else if (nr_node_ids <= 32)
+ /* Maximum of 4 cachelines */
+ s->alien_shift = min(2 + min, lines);
+ else
+ /* Clamp down to one cacheline */
+ s->alien_shift = min;
+
+ s->flags |= __ALIEN_CACHE;
+ }
+#endif
+
if (alloc_kmem_cache_cpus(s)) {
- s->shared_queue_sysfs = s->queue;
- alloc_shared_caches(s);
+ s->shared_queue_sysfs = 0;
+ if (nr_cpu_ids > 1 && s->size < PAGE_SIZE) {
+ s->shared_queue_sysfs = 10 * s->batch;
+ alloc_shared_caches(s);
+ }
return 1;
}

@@ -4295,14 +4489,12 @@ static ssize_t shared_queue_size_store(s
if (err)
return err;

- if (queue > 10000 || queue < 4)
+ if (queue && (queue > 10000 || queue < 4 || queue < s->batch))
return -EINVAL;

down_write(&slub_lock);
err = remove_shared_caches(s);
if (!err) {
- if (s->batch > queue)
- s->batch = queue;

s->shared_queue_sysfs = queue;
if (queue)
@@ -4431,6 +4623,166 @@ static ssize_t objects_partial_show(stru
}
SLAB_ATTR_RO(objects_partial);

+#ifdef CONFIG_NUMA
+static ssize_t alien_queue_size_show(struct kmem_cache *s, char *buf)
+{
+ if (aliens(s))
+ return sprintf(buf, "%tu %u\n",
+ ((1 << s->alien_shift)
+ - sizeof(struct kmem_cache_queue)) /
+ sizeof(void *), s->alien_shift);
+ else
+ return sprintf(buf, "0\n");
+}
+
+static ssize_t alien_queue_size_store(struct kmem_cache *s,
+ const char *buf, size_t length)
+{
+ unsigned long queue;
+ int err;
+ int oldshift;
+
+ if (nr_node_ids == 1)
+ return -ENOSYS;
+
+ err = strict_strtoul(buf, 10, &queue);
+ if (err)
+ return err;
+
+ if (queue < 0 || queue > 1000)
+ return -EINVAL;
+
+ down_write(&slub_lock);
+ oldshift = s->alien_shift;
+
+ err = remove_shared_caches(s);
+ if (!err) {
+ if (queue == 0) {
+ s->flags &= ~__ALIEN_CACHE;
+ s->alien_shift = 0;
+ } else {
+ unsigned long size;
+
+ s->flags |= __ALIEN_CACHE;
+
+ size = max_t(unsigned long, cache_line_size(),
+ sizeof(struct kmem_cache_queue)
+ + queue * sizeof(void *));
+ size = ALIGN(size, cache_line_size());
+ s->alien_shift = fls(size + (size -1)) - 1;
+ }
+
+ if (oldshift != s->alien_shift)
+ alloc_shared_caches(s);
+ }
+
+ up_write(&slub_lock);
+ return err ? err : length;
+}
+SLAB_ATTR(alien_queue_size);
+
+static ssize_t alien_caches_show(struct kmem_cache *s, char *buf)
+{
+ unsigned long total;
+ int x;
+ int n;
+ int i;
+ int cpu, node;
+ struct kmem_cache_queue **caches;
+
+ if (!(s->flags & __ALIEN_CACHE) || s->alien_shift == 0)
+ return -ENOSYS;
+
+ down_read(&slub_lock);
+ caches = shared_caches(s);
+ if (!caches) {
+ up_read(&slub_lock);
+ return -ENOENT;
+ }
+
+ if (IS_ERR(caches)) {
+ up_read(&slub_lock);
+ return PTR_ERR(caches);
+ }
+
+ total = 0;
+ for (i = 0; i < s->nr_shared; i++) {
+ struct kmem_cache_queue *q = caches[i];
+
+ for (n = 1; n < nr_node_ids; n++) {
+ struct kmem_cache_queue *a =
+ (void *)q - (n << s->alien_shift);
+
+ total += a->objects;
+ }
+ }
+ x = sprintf(buf, "%lu", total);
+
+ for (n = 0; n < s->nr_shared; n++) {
+ struct kmem_cache_queue *q = caches[n];
+ struct kmem_cache_queue *a;
+ struct kmem_cache_cpu *c = NULL;
+ int first;
+
+ x += sprintf(buf + x, " C");
+ first = 1;
+ /* Find cpus using the shared cache */
+ for_each_online_cpu(cpu) {
+ struct kmem_cache_cpu *z = per_cpu_ptr(s->cpu, cpu);
+
+ if (q != z->q.shared)
+ continue;
+
+ if (z)
+ c = z;
+
+ if (first)
+ first = 0;
+ else
+ x += sprintf(buf + x, ",");
+
+ x += sprintf(buf + x, "%d", cpu);
+ }
+
+ if (!c) {
+ x += sprintf(buf +x, "=<none>");
+ continue;
+ }
+
+ /* The total of objects for a particular shared cache */
+ total = 0;
+ for_each_online_node(node) {
+ struct kmem_cache_queue *a =
+ alien_cache(s, c, node);
+
+ if (a)
+ total += a->objects;
+ }
+ x += sprintf(buf +x, "=%lu[", total);
+
+ first = 1;
+ for_each_online_node(node) {
+ a = alien_cache(s, c, node);
+
+ if (a) {
+ if (first)
+ first = 0;
+ else
+ x += sprintf(buf + x, ":");
+
+ x += sprintf(buf + x, "N%d=%d/%d",
+ node, a->objects, a->max);
+ }
+ }
+ x += sprintf(buf + x, "]");
+ }
+ up_read(&slub_lock);
+ kfree(caches);
+ return x + sprintf(buf + x, "\n");
+}
+SLAB_ATTR_RO(alien_caches);
+#endif
+
static ssize_t reclaim_account_show(struct kmem_cache *s, char *buf)
{
return sprintf(buf, "%d\n", !!(s->flags & SLAB_RECLAIM_ACCOUNT));
@@ -4697,10 +5049,12 @@ SLAB_ATTR(text); \

STAT_ATTR(ALLOC_FASTPATH, alloc_fastpath);
STAT_ATTR(ALLOC_SHARED, alloc_shared);
+STAT_ATTR(ALLOC_ALIEN, alloc_alien);
STAT_ATTR(ALLOC_DIRECT, alloc_direct);
STAT_ATTR(ALLOC_SLOWPATH, alloc_slowpath);
STAT_ATTR(FREE_FASTPATH, free_fastpath);
STAT_ATTR(FREE_SHARED, free_shared);
+STAT_ATTR(FREE_ALIEN, free_alien);
STAT_ATTR(FREE_DIRECT, free_direct);
STAT_ATTR(FREE_SLOWPATH, free_slowpath);
STAT_ATTR(FREE_ADD_PARTIAL, free_add_partial);
@@ -4749,13 +5103,19 @@ static struct attribute *slab_attrs[] =
#ifdef CONFIG_ZONE_DMA
&cache_dma_attr.attr,
#endif
+#ifdef CONFIG_NUMA
+ &alien_caches_attr.attr,
+ &alien_queue_size_attr.attr,
+#endif
#ifdef CONFIG_SLUB_STATS
&alloc_fastpath_attr.attr,
&alloc_shared_attr.attr,
+ &alloc_alien_attr.attr,
&alloc_direct_attr.attr,
&alloc_slowpath_attr.attr,
&free_fastpath_attr.attr,
&free_shared_attr.attr,
+ &free_alien_attr.attr,
&free_direct_attr.attr,
&free_slowpath_attr.attr,
&free_add_partial_attr.attr,
@@ -5108,7 +5468,8 @@ static int s_show(struct seq_file *m, vo
seq_printf(m, "%-17s %6lu %6lu %6u %4u %4d", s->name, nr_inuse,
nr_objs, s->size, oo_objects(s->oo),
(1 << oo_order(s->oo)));
- seq_printf(m, " : tunables %4u %4u %4u", s->cpu_queue, s->batch, s->shared_queue);
+ seq_printf(m, " : tunables %4u %4u %4u", s->cpu_queue, s->batch,
+ (s->shared_queue + s->batch / 2 ) / s->batch);

seq_printf(m, " : slabdata %6lu %6lu %6lu", nr_slabs, nr_slabs,
shared_objects(s));
Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h 2010-10-05 13:36:14.000000000 -0500
+++ linux-2.6/include/linux/slub_def.h 2010-10-05 13:36:33.000000000 -0500
@@ -18,11 +18,13 @@
enum stat_item {
ALLOC_FASTPATH, /* Allocation from cpu queue */
ALLOC_SHARED, /* Allocation caused a shared cache transaction */
+ ALLOC_ALIEN, /* Allocation from alien cache */
ALLOC_DIRECT, /* Allocation bypassing queueing */
ALLOC_SLOWPATH, /* Allocation required refilling of queue */
FREE_FASTPATH, /* Free to cpu queue */
FREE_SHARED, /* Free caused a shared cache transaction */
FREE_DIRECT, /* Free bypassing queues */
+ FREE_ALIEN, /* Free to alien node */
FREE_SLOWPATH, /* Required pushing objects out of the queue */
FREE_ADD_PARTIAL, /* Freeing moved slab to partial list */
FREE_REMOVE_PARTIAL, /* Freeing removed from partial list */
@@ -55,6 +57,7 @@ struct kmem_cache_queue {
union {
struct kmem_cache_queue *shared; /* cpu q -> shared q */
spinlock_t lock; /* shared queue: lock */
+ spinlock_t alien_lock; /* alien cache lock */
};
void *object[];
};
@@ -97,7 +100,8 @@ struct kmem_cache {
int size; /* The size of an object including meta data */
int objsize; /* The size of an object without meta data */
struct kmem_cache_order_objects oo;
- int batch;
+ int batch; /* batch size */
+ int alien_shift; /* Shift to size alien caches */

/* Allocation and freeing of slabs */
struct kmem_cache_order_objects max;

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>

Christoph Lameter

2010-10-05 18:57:26 UTC

Currently disabling CONFIG_SLUB_DEBUG also disabled SYSFS support meaning
that the slabs cannot be tuned without DEBUG.

Make SYSFS support independent of CONFIG_SLUB_DEBUG

Signed-off-by: Christoph Lameter <***@linux.com>

---
include/linux/slub_def.h | 2 +-
lib/Kconfig.debug | 2 +-
mm/slub.c | 40 +++++++++++++++++++++++++++++++++++-----
3 files changed, 37 insertions(+), 7 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2010-10-04 08:16:36.000000000 -0500
+++ linux-2.6/mm/slub.c 2010-10-04 08:17:49.000000000 -0500
@@ -198,7 +198,7 @@ struct track {

enum track_item { TRACK_ALLOC, TRACK_FREE };

-#ifdef CONFIG_SLUB_DEBUG
+#ifdef CONFIG_SYSFS
static int sysfs_slab_add(struct kmem_cache *);
static int sysfs_slab_alias(struct kmem_cache *, const char *);
static void sysfs_slab_remove(struct kmem_cache *);
@@ -1102,7 +1102,7 @@ static inline void slab_free_hook(struct
static inline void slab_free_hook_irq(struct kmem_cache *s,
void *object) {}

-#endif
+#endif /* CONFIG_SLUB_DEBUG */

/*
* Slab allocation and freeing
@@ -3373,7 +3373,7 @@ void *__kmalloc_node_track_caller(size_t
}
#endif

-#ifdef CONFIG_SLUB_DEBUG
+#ifdef CONFIG_SYSFS
static int count_inuse(struct page *page)
{
return page->inuse;
@@ -3383,7 +3383,9 @@ static int count_total(struct page *page
{
return page->objects;
}
+#endif

+#ifdef CONFIG_SLUB_DEBUG
static int validate_slab(struct kmem_cache *s, struct page *page,
unsigned long *map)
{
@@ -3474,6 +3476,7 @@ static long validate_slab_cache(struct k
kfree(map);
return count;
}
+#endif

#ifdef SLUB_RESILIENCY_TEST
static void resiliency_test(void)
@@ -3532,9 +3535,12 @@ static void resiliency_test(void)
validate_slab_cache(kmalloc_caches[9]);
}
#else
+#ifdef CONFIG_SYSFS
static void resiliency_test(void) {};
#endif
+#endif

+#ifdef CONFIG_DEBUG
/*
* Generate lists of code addresses where slabcache objects are allocated
* and freed.
@@ -3763,7 +3769,9 @@ static int list_locations(struct kmem_ca
len += sprintf(buf, "No data\n");
return len;
}
+#endif

+#ifdef CONFIG_SYSFS
enum slab_stat_type {
SL_ALL, /* All slabs */
SL_PARTIAL, /* Only partially allocated slabs */
@@ -3816,6 +3824,8 @@ static ssize_t show_slab_objects(struct
}
}

+ down_read(&slub_lock);
+#ifdef CONFIG_SLUB_DEBUG
if (flags & SO_ALL) {
for_each_node_state(node, N_NORMAL_MEMORY) {
struct kmem_cache_node *n = get_node(s, node);
@@ -3832,7 +3842,9 @@ static ssize_t show_slab_objects(struct
nodes[node] += x;
}

- } else if (flags & SO_PARTIAL) {
+ } else
+#endif
+ if (flags & SO_PARTIAL) {
for_each_node_state(node, N_NORMAL_MEMORY) {
struct kmem_cache_node *n = get_node(s, node);

@@ -3857,6 +3869,7 @@ static ssize_t show_slab_objects(struct
return x + sprintf(buf + x, "\n");
}

+#ifdef CONFIG_SLUB_DEBUG
static int any_slab_objects(struct kmem_cache *s)
{
int node;
@@ -3872,6 +3885,7 @@ static int any_slab_objects(struct kmem_
}
return 0;
}
+#endif

#define to_slab_attr(n) container_of(n, struct slab_attribute, attr)
#define to_slab(n) container_of(n, struct kmem_cache, kobj);
@@ -3973,11 +3987,13 @@ static ssize_t aliases_show(struct kmem_
}
SLAB_ATTR_RO(aliases);

+#ifdef CONFIG_SLUB_DEBUG
static ssize_t slabs_show(struct kmem_cache *s, char *buf)
{
return show_slab_objects(s, buf, SO_ALL);
}
SLAB_ATTR_RO(slabs);
+#endif

static ssize_t partial_show(struct kmem_cache *s, char *buf)
{
@@ -4003,6 +4019,7 @@ static ssize_t objects_partial_show(stru
}
SLAB_ATTR_RO(objects_partial);

+#ifdef CONFIG_SLUB_DEBUG
static ssize_t total_objects_show(struct kmem_cache *s, char *buf)
{
return show_slab_objects(s, buf, SO_ALL|SO_TOTAL);
@@ -4055,6 +4072,7 @@ static ssize_t failslab_store(struct kme
}
SLAB_ATTR(failslab);
#endif
+#endif

static ssize_t reclaim_account_show(struct kmem_cache *s, char *buf)
{
@@ -4091,6 +4109,7 @@ static ssize_t destroy_by_rcu_show(struc
}
SLAB_ATTR_RO(destroy_by_rcu);

+#ifdef CONFIG_SLUB_DEBUG
static ssize_t red_zone_show(struct kmem_cache *s, char *buf)
{
return sprintf(buf, "%d\n", !!(s->flags & SLAB_RED_ZONE));
@@ -4166,6 +4185,7 @@ static ssize_t validate_store(struct kme
return ret;
}
SLAB_ATTR(validate);
+#endif

static ssize_t shrink_show(struct kmem_cache *s, char *buf)
{
@@ -4186,6 +4206,7 @@ static ssize_t shrink_store(struct kmem_
}
SLAB_ATTR(shrink);

+#ifdef CONFIG_SLUB_DEBUG
static ssize_t alloc_calls_show(struct kmem_cache *s, char *buf)
{
if (!(s->flags & SLAB_STORE_USER))
@@ -4201,6 +4222,7 @@ static ssize_t free_calls_show(struct km
return list_locations(s, buf, TRACK_FREE);
}
SLAB_ATTR_RO(free_calls);
+#endif

#ifdef CONFIG_NUMA
static ssize_t remote_node_defrag_ratio_show(struct kmem_cache *s, char *buf)
@@ -4307,25 +4329,33 @@ static struct attribute *slab_attrs[] =
&min_partial_attr.attr,
&objects_attr.attr,
&objects_partial_attr.attr,
+#ifdef CONFIG_SLUB_DEBUG
&total_objects_attr.attr,
&slabs_attr.attr,
+#endif
&partial_attr.attr,
&cpu_slabs_attr.attr,
&ctor_attr.attr,
&aliases_attr.attr,
&align_attr.attr,
+#ifdef CONFIG_SLUB_DEBUG
&sanity_checks_attr.attr,
&trace_attr.attr,
+#endif
&hwcache_align_attr.attr,
&reclaim_account_attr.attr,
&destroy_by_rcu_attr.attr,
+#ifdef CONFIG_SLUB_DEBUG
&red_zone_attr.attr,
&poison_attr.attr,
&store_user_attr.attr,
&validate_attr.attr,
+#endif
&shrink_attr.attr,
+#ifdef CONFIG_SLUB_DEBUG
&alloc_calls_attr.attr,
&free_calls_attr.attr,
+#endif
#ifdef CONFIG_ZONE_DMA
&cache_dma_attr.attr,
#endif
@@ -4608,7 +4638,7 @@ static int __init slab_sysfs_init(void)
}

__initcall(slab_sysfs_init);
-#endif
+#endif /* CONFIG_SYSFS */

/*
* The /proc/slabinfo ABI
Index: linux-2.6/lib/Kconfig.debug
===================================================================
--- linux-2.6.orig/lib/Kconfig.debug 2010-10-04 08:14:26.000000000 -0500
+++ linux-2.6/lib/Kconfig.debug 2010-10-04 08:17:49.000000000 -0500
@@ -353,7 +353,7 @@ config SLUB_DEBUG_ON
config SLUB_STATS
default n
bool "Enable SLUB performance statistics"
- depends on SLUB && SLUB_DEBUG && SYSFS
+ depends on SLUB && SYSFS
help
SLUB statistics are useful to debug SLUBs allocation behavior in
order find ways to optimize the allocator. This should never be
Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h 2010-10-04 08:16:36.000000000 -0500
+++ linux-2.6/include/linux/slub_def.h 2010-10-04 08:17:49.000000000 -0500
@@ -87,7 +87,7 @@ struct kmem_cache {
unsigned long min_partial;
const char *name; /* Name (only for display!) */
struct list_head list; /* List of slab caches */
-#ifdef CONFIG_SLUB_DEBUG
+#ifdef CONFIG_SYSFS
struct kobject kobj; /* For sysfs */
#endif

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>

Pekka Enberg

2010-10-06 14:02:05 UTC

Post by Christoph Lameter
Currently disabling CONFIG_SLUB_DEBUG also disabled SYSFS support meaning
that the slabs cannot be tuned without DEBUG.
Make SYSFS support independent of CONFIG_SLUB_DEBUG

I applied this patch. Thanks!

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>

Christoph Lameter

2010-10-05 18:57:33 UTC

With queueing slub will have to free and allocate lots of objects in one go.
Taking the page lock for each free or taking the per node lock for each page
can cause a lot of atomic operations. Change locking conventions so that
page strut metadata is stable under the node lock only. Then the page lock
can be dropped.

Signed-off-by: Christoph Lameter <***@linux.com>

---
include/linux/slub_def.h | 2
mm/slub.c | 377 ++++++++++++++++++++---------------------------
2 files changed, 167 insertions(+), 212 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2010-10-04 08:26:27.000000000 -0500
+++ linux-2.6/mm/slub.c 2010-10-04 08:47:21.000000000 -0500
@@ -29,41 +29,14 @@
#include <linux/fault-inject.h>

/*
- * Lock order:
- * 1. slab_lock(page)
- * 2. slab->list_lock
+ * Locking:
+ * All slab metadata (aside from queues and percpu data) is protected
+ * by a per node lock in struct kmem_cache_node.
+ * Shared and alien caches have a lock protecting their queue alone
+ * Per cpu queues are protected by only allowing access from a single cpu.
*
- * The slab_lock protects operations on the object of a particular
- * slab and its metadata in the page struct. If the slab lock
- * has been taken then no allocations nor frees can be performed
- * on the objects in the slab nor can the slab be added or removed
- * from the partial or full lists since this would mean modifying
- * the page_struct of the slab.
- *
- * The list_lock protects the partial and full list on each node and
- * the partial slab counter. If taken then no new slabs may be added or
- * removed from the lists nor make the number of partial slabs be modified.
- * (Note that the total number of slabs is an atomic value that may be
- * modified without taking the list lock).
- *
- * The list_lock is a centralized lock and thus we avoid taking it as
- * much as possible. As long as SLUB does not have to handle partial
- * slabs, operations can continue without any centralized lock. F.e.
- * allocating a long series of objects that fill up slabs does not require
- * the list lock.
- *
- * The lock order is sometimes inverted when we are trying to get a slab
- * off a list. We take the list_lock and then look for a page on the list
- * to use. While we do that objects in the slabs may be freed. We can
- * only operate on the slab if we have also taken the slab_lock. So we use
- * a slab_trylock() on the slab. If trylock was successful then no frees
- * can occur anymore and we can use the slab for allocations etc. If the
- * slab_trylock() does not succeed then frees are in progress in the slab and
- * we must stay away from it for a while since we may cause a bouncing
- * cacheline if we try to acquire the lock. So go onto the next slab.
- * If all pages are busy then we may allocate a new slab instead of reusing
- * a partial slab. A new slab has noone operating on it and thus there is
- * no danger of cacheline contention.
+ * The slub_lock semaphore protects against configuration modifications like
+ * adding new queues, reconfiguring queues and removing queues.
*
* Interrupts are disabled during allocation and deallocation in order to
* make the slab allocator safe to use in the context of an irq. In addition
@@ -82,7 +55,6 @@
* Slabs are freed when they become empty. Teardown and setup is
* minimal so we rely on the page allocators per cpu caches for
* fast frees and allocs.
- *
*/

#define SLAB_DEBUG_FLAGS (SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER | \
@@ -845,16 +817,9 @@ static inline void slab_free_hook_irq(st
/*
* Tracking of fully allocated slabs for debugging purposes.
*/
-static inline void add_full(struct kmem_cache *s,
- struct kmem_cache_node *n, struct page *page)
+static inline void add_full(struct kmem_cache_node *n, struct page *page)
{
-
- if (!(s->flags & SLAB_STORE_USER))
- return;
-
- spin_lock(&n->list_lock);
list_add(&page->lru, &n->full);
- spin_unlock(&n->list_lock);
}

static inline void remove_full(struct kmem_cache *s,
@@ -863,9 +828,7 @@ static inline void remove_full(struct km
if (!(s->flags & SLAB_STORE_USER))
return;

- spin_lock(&n->list_lock);
list_del(&page->lru);
- spin_unlock(&n->list_lock);
}

/* Tracking of the number of slabs for debugging purposes */
@@ -1102,7 +1065,7 @@ static inline int slab_pad_check(struct
{ return 1; }
static inline int check_object(struct kmem_cache *s, struct page *page,
void *object, u8 val) { return 1; }
-static inline void add_full(struct kmem_cache *s, struct kmem_cache_node *n,
+static inline void add_full(struct kmem_cache_node *n,
struct page *page) {}
static inline void remove_full(struct kmem_cache *s,
struct kmem_cache_node *n, struct page *page) {}
@@ -1304,97 +1267,37 @@ static void discard_slab(struct kmem_cac
}

/*
- * Per slab locking using the pagelock
- */
-static __always_inline void slab_lock(struct page *page)
-{
- bit_spin_lock(PG_locked, &page->flags);
-}
-
-static __always_inline void slab_unlock(struct page *page)
-{
- __bit_spin_unlock(PG_locked, &page->flags);
-}
-
-static __always_inline int slab_trylock(struct page *page)
-{
- int rc = 1;
-
- rc = bit_spin_trylock(PG_locked, &page->flags);
- return rc;
-}
-
-/*
* Management of partially allocated slabs
*/
static void add_partial(struct kmem_cache_node *n,
struct page *page, int tail)
{
- spin_lock(&n->list_lock);
n->nr_partial++;
if (tail)
list_add_tail(&page->lru, &n->partial);
else
list_add(&page->lru, &n->partial);
__SetPageSlubPartial(page);
- spin_unlock(&n->list_lock);
}

-static inline void __remove_partial(struct kmem_cache_node *n,
+static inline void remove_partial(struct kmem_cache_node *n,
struct page *page)
{
- list_del(&page->lru);
n->nr_partial--;
+ list_del(&page->lru);
__ClearPageSlubPartial(page);
}

-static void remove_partial(struct kmem_cache_node *n, struct page *page)
-{
- spin_lock(&n->list_lock);
- __remove_partial(n, page);
- spin_unlock(&n->list_lock);
-}
-
-/*
- * Lock slab and remove from the partial list.
- *
- * Must hold list_lock.
- */
-static inline int lock_and_freeze_slab(struct kmem_cache_node *n,
- struct page *page)
-{
- if (slab_trylock(page)) {
- __remove_partial(n, page);
- return 1;
- }
- return 0;
-}
-
-/*
- * Try to allocate a partial slab from a specific node.
- */
-static struct page *get_partial(struct kmem_cache *s, int node)
+static inline void partial_to_full(struct kmem_cache *s,
+ struct kmem_cache_node *n, struct page *page)
{
- struct page *page;
- struct kmem_cache_node *n = get_node(s, node);
+ if (PageSlubPartial(page))
+ remove_partial(n, page);

- /*
- * Racy check. If we mistakenly see no partial slabs then we
- * just allocate an empty slab. If we mistakenly try to get a
- * partial slab and there is none available then get_partial()
- * will return NULL.
- */
- if (!n || !n->nr_partial)
- return NULL;
-
- spin_lock(&n->list_lock);
- list_for_each_entry(page, &n->partial, lru)
- if (lock_and_freeze_slab(n, page))
- goto out;
- page = NULL;
-out:
- spin_unlock(&n->list_lock);
- return page;
+#ifdef CONFIG_SLAB_DEBUG
+ if (s->flags & SLAB_STORE_USER)
+ list_add(&page->lru, &n->full);
+#endif
}

/*
@@ -1403,16 +1306,31 @@ out:
void drain_objects(struct kmem_cache *s, void **object, int nr)
{
int i;
+ struct kmem_cache_node *n = NULL;
+ struct page *page = NULL;
+ void *addr = 0;
+ unsigned long size = 0;

for (i = 0 ; i < nr; ) {
-
void *p = object[i];
- struct page *page = virt_to_head_page(p);
- void *addr = page_address(page);
- unsigned long size = PAGE_SIZE << compound_order(page);
+ struct page *npage = virt_to_head_page(p);
unsigned long *m;
unsigned long offset;
- struct kmem_cache_node *n;
+
+ if (npage != page) {
+ struct kmem_cache_node *n2 = get_node(s, page_to_nid(npage));
+
+ page = npage;
+ addr = page_address(page);
+ size = PAGE_SIZE << compound_order(page);
+
+ if (n != n2) {
+ if (n)
+ spin_unlock(&n->lock);
+ n = n2;
+ spin_lock(&n->lock);
+ }
+ }

#ifdef CONFIG_SLUB_DEBUG
if (kmem_cache_debug(s) && !PageSlab(page)) {
@@ -1421,7 +1339,6 @@ void drain_objects(struct kmem_cache *s,
continue;
}
#endif
- slab_lock(page);
m = map(page);

offset = p - addr;
@@ -1478,7 +1395,7 @@ void drain_objects(struct kmem_cache *s,

offset = new_offset;
}
- n = get_node(s, page_to_nid(page));
+
if (bitmap_full(m, page->objects) && n->nr_partial > s->min_partial) {

/* All objects are available now */
@@ -1489,7 +1406,6 @@ void drain_objects(struct kmem_cache *s,
} else
remove_full(s, n, page);

- slab_unlock(page);
discard_slab(s, page);

} else {
@@ -1502,9 +1418,10 @@ void drain_objects(struct kmem_cache *s,
add_partial(n, page, 0);
stat(s, FREE_ADD_PARTIAL);
}
- slab_unlock(page);
}
}
+ if (n)
+ spin_unlock(&n->lock);
}

static inline int drain_queue(struct kmem_cache *s,
@@ -1646,10 +1563,10 @@ static unsigned long count_partial(struc
unsigned long x = 0;
struct page *page;

- spin_lock_irqsave(&n->list_lock, flags);
+ spin_lock_irqsave(&n->lock, flags);
list_for_each_entry(page, &n->partial, lru)
x += get_count(page);
- spin_unlock_irqrestore(&n->list_lock, flags);
+ spin_unlock_irqrestore(&n->lock, flags);
return x;
}

@@ -1734,6 +1651,7 @@ void retrieve_objects(struct kmem_cache
int i = find_first_bit(m, page->objects);
void *a;

+ VM_BUG_ON(i >= page->objects);
__clear_bit(i, m);
a = addr + i * s->size;

@@ -1767,16 +1685,6 @@ static inline void refill_queue(struct k
q->objects += d;
}

-void to_lists(struct kmem_cache *s, struct page *page, int tail)
-{
- if (!all_objects_used(page))
-
- add_partial(get_node(s, page_to_nid(page)), page, tail);
-
- else
- add_full(s, get_node(s, page_to_nid(page)), page);
-}
-
/* Handling of objects from other nodes */

static void *slab_alloc_node(struct kmem_cache *s, struct kmem_cache_cpu *c,
@@ -1785,9 +1693,12 @@ static void *slab_alloc_node(struct kmem
#ifdef CONFIG_NUMA
struct page *page;
void *object;
+ struct kmem_cache_node *n = get_node(s, node);

- page = get_partial(s, node);
- if (!page) {
+ spin_lock(&n->lock);
+ if (list_empty(&n->partial)) {
+
+ spin_unlock(&n->lock);
gfpflags &= gfp_allowed_mask;

if (gfpflags & __GFP_WAIT)
@@ -1801,14 +1712,23 @@ static void *slab_alloc_node(struct kmem
if (!page)
return NULL;

- slab_lock(page);
- }
+ spin_lock(&n->lock);
+
+ } else
+ page = list_entry(n->partial.prev, struct page, lru);

retrieve_objects(s, page, &object, 1);
stat(s, ALLOC_DIRECT);

- to_lists(s, page, 0);
- slab_unlock(page);
+ if (!all_objects_used(page)) {
+
+ if (!PageSlubPartial(page))
+ add_partial(n, page, 1);
+
+ } else
+ partial_to_full(s, n, page);
+
+ spin_unlock(&n->lock);
return object;
#else
return NULL;
@@ -1833,13 +1753,15 @@ static void *slab_alloc(struct kmem_cach
void *object;
struct kmem_cache_cpu *c;
struct kmem_cache_queue *q;
+ struct kmem_cache_node *n;
+ struct page *page;
unsigned long flags;

if (slab_pre_alloc_hook(s, gfpflags))
return NULL;

-redo:
local_irq_save(flags);
+redo:
c = __this_cpu_ptr(s->cpu);

node = find_numa_node(s, node, c->node);
@@ -1847,66 +1769,107 @@ redo:
object = slab_alloc_node(s, c, gfpflags, node);
if (!object)
goto oom;
- goto got_it;
+ goto got_object;
}
+
q = &c->q;
- if (unlikely(queue_empty(q))) {

- while (q->objects < s->batch) {
- struct page *new;
+ if (likely(!queue_empty(q))) {

- new = get_partial(s, node);
- if (unlikely(!new)) {
+ stat(s, ALLOC_FASTPATH);

- gfpflags &= gfp_allowed_mask;
+get_object:
+ object = queue_get(q);

- if (gfpflags & __GFP_WAIT)
- local_irq_enable();
+got_object:
+ if (kmem_cache_debug(s)) {
+ if (!alloc_debug_processing(s, object, addr))
+ goto redo;
+ }
+ local_irq_restore(flags);

- new = new_slab(s, gfpflags, node);
+ if (unlikely(gfpflags & __GFP_ZERO))
+ memset(object, 0, s->objsize);

- if (gfpflags & __GFP_WAIT)
- local_irq_disable();
+ slab_post_alloc_hook(s, gfpflags, object);

- /* process may have moved to different cpu */
- c = __this_cpu_ptr(s->cpu);
- q = &c->q;
+ return object;
+ }

- if (!new) {
- if (queue_empty(q))
- goto oom;
- break;
- }
- stat(s, ALLOC_SLAB);
- slab_lock(new);
- } else
- stat(s, ALLOC_FROM_PARTIAL);
+ stat(s, ALLOC_SLOWPATH);

- refill_queue(s, q, new, available(new));
- to_lists(s, new, 0);
+ n = get_node(s, node);

- slab_unlock(new);
- }
- stat(s, ALLOC_SLOWPATH);
+ /* Refill from partial lists */
+ spin_lock(&n->lock);
+ while (q->objects < s->batch && !list_empty(&n->partial)) {
+ page = list_entry(n->partial.next, struct page, lru);

- } else
- stat(s, ALLOC_FASTPATH);
+ refill_queue(s, q, page, min(available(page),
+ s->batch - q->objects));

- object = queue_get(q);
+ if (all_objects_used(page))
+ partial_to_full(s, n, page);

-got_it:
- if (kmem_cache_debug(s)) {
- if (!alloc_debug_processing(s, object, addr))
- goto redo;
+ stat(s, ALLOC_FROM_PARTIAL);
}
- local_irq_restore(flags);
+ spin_unlock(&n->lock);

- if (unlikely(gfpflags & __GFP_ZERO))
- memset(object, 0, s->objsize);
+ if (!queue_empty(q))
+ goto get_object;

- slab_post_alloc_hook(s, gfpflags, object);
+ gfpflags &= gfp_allowed_mask;
+ /* Refill from free pages */
+ while (q->objects < s->batch) {
+ int tail = 0;

- return object;
+ if (gfpflags & __GFP_WAIT)
+ local_irq_enable();
+
+ page = new_slab(s, gfpflags, node);
+
+ if (gfpflags & __GFP_WAIT)
+ local_irq_disable();
+
+ node = page_to_nid(page);
+ n = get_node(s, node);
+
+ /* process may have moved to different cpu */
+ c = __this_cpu_ptr(s->cpu);
+ q = &c->q;
+
+ if (!page)
+ goto oom;
+
+ /*
+ * Cpu may have switched and the local queue may have
+ * enough objects. Just push the unused objects into
+ * the partial list if so.
+ */
+ if (q->objects < s->batch)
+ refill_queue(s, q, page, min_t(int, page->objects, s->batch));
+ else
+ tail = 1;
+
+ stat(s, ALLOC_SLAB);
+ if (!all_objects_used(page)) {
+
+ spin_lock(&n->lock);
+ add_partial(n, page, tail);
+ spin_unlock(&n->lock);
+
+ }
+#ifdef CONFIG_SLUB_DEBUG
+ else if (s->flags & SLAB_STORE_USER) {
+
+ spin_lock(&n->lock);
+ add_full(n, page);
+ spin_unlock(&n->lock);
+
+ }
+#endif
+ }
+ goto get_object;

oom:
local_irq_restore(flags);
@@ -2172,7 +2135,7 @@ static void
init_kmem_cache_node(struct kmem_cache_node *n, struct kmem_cache *s)
{
n->nr_partial = 0;
- spin_lock_init(&n->list_lock);
+ spin_lock_init(&n->lock);
INIT_LIST_HEAD(&n->partial);
#ifdef CONFIG_SLUB_DEBUG
atomic_long_set(&n->nr_slabs, 0);
@@ -2514,7 +2477,6 @@ static void list_slab_objects(struct kme
long *m = map(page);

slab_err(s, page, "%s", text);
- slab_lock(page);

for_each_object(p, s, addr, page->objects) {

@@ -2524,7 +2486,6 @@ static void list_slab_objects(struct kme
print_tracking(s, p);
}
}
- slab_unlock(page);
kfree(map);
#endif
}
@@ -2537,17 +2498,17 @@ static void free_partial(struct kmem_cac
unsigned long flags;
struct page *page, *h;

- spin_lock_irqsave(&n->list_lock, flags);
+ spin_lock_irqsave(&n->lock, flags);
list_for_each_entry_safe(page, h, &n->partial, lru) {
if (all_objects_available(page)) {
- __remove_partial(n, page);
+ remove_partial(n, page);
discard_slab(s, page);
} else {
list_slab_objects(s, page,
"Objects remaining on kmem_cache_close()");
}
}
- spin_unlock_irqrestore(&n->list_lock, flags);
+ spin_unlock_irqrestore(&n->lock, flags);
}

/*
@@ -2886,23 +2847,22 @@ int kmem_cache_shrink(struct kmem_cache
for (i = 0; i < objects; i++)
INIT_LIST_HEAD(slabs_by_inuse + i);

- spin_lock_irqsave(&n->list_lock, flags);
+ spin_lock_irqsave(&n->lock, flags);

/*
* Build lists indexed by the items in use in each slab.
*
* Note that concurrent frees may occur while we hold the
- * list_lock. page->inuse here is the upper limit.
+ * lock. page->inuse here is the upper limit.
*/
list_for_each_entry_safe(page, t, &n->partial, lru) {
- if (all_objects_available(page) && slab_trylock(page)) {
+ if (all_objects_available(page)) {
/*
* Must hold slab lock here because slab_free
* may have freed the last object and be
* waiting to release the slab.
*/
- __remove_partial(n, page);
- slab_unlock(page);
+ remove_partial(n, page);
discard_slab(s, page);
} else {
list_move(&page->lru,
@@ -2917,7 +2877,7 @@ int kmem_cache_shrink(struct kmem_cache
for (i = objects - 1; i >= 0; i--)
list_splice(slabs_by_inuse + i, n->partial.prev);

- spin_unlock_irqrestore(&n->list_lock, flags);
+ spin_unlock_irqrestore(&n->lock, flags);
}

kfree(slabs_by_inuse);
@@ -3495,15 +3455,7 @@ static int validate_slab(struct kmem_cac

static unsigned long validate_slab_slab(struct kmem_cache *s, struct page *page)
{
- unsigned long errors = 0;
-
- if (slab_trylock(page)) {
- errors = validate_slab(s, page);
- slab_unlock(page);
- } else
- printk(KERN_INFO "SLUB %s: Skipped busy slab 0x%p\n",
- s->name, page);
- return errors;
+ return validate_slab(s, page);
}

static int validate_slab_node(struct kmem_cache *s,
@@ -3514,10 +3466,13 @@ static int validate_slab_node(struct kme
unsigned long flags;
unsigned long errors;

- spin_lock_irqsave(&n->list_lock, flags);
+ spin_lock_irqsave(&n->lock, flags);

list_for_each_entry(page, &n->partial, lru) {
- errors += validate_slab_slab(s, page);
+ if (get_node(s, page_to_nid(page)) == n)
+ errors += validate_slab_slab(s, page);
+ else
+ printk(KERN_ERR "SLUB %s: Partial list page from wrong node\n", s->name);
count++;
}
if (count != n->nr_partial)
@@ -3537,7 +3492,7 @@ static int validate_slab_node(struct kme
atomic_long_read(&n->nr_slabs));

out:
- spin_unlock_irqrestore(&n->list_lock, flags);
+ spin_unlock_irqrestore(&n->lock, flags);
return errors;
}

@@ -3715,12 +3670,12 @@ static int list_locations(struct kmem_ca
if (!atomic_long_read(&n->nr_slabs))
continue;

- spin_lock_irqsave(&n->list_lock, flags);
+ spin_lock_irqsave(&n->lock, flags);
list_for_each_entry(page, &n->partial, lru)
process_slab(&t, s, page, alloc);
list_for_each_entry(page, &n->full, lru)
process_slab(&t, s, page, alloc);
- spin_unlock_irqrestore(&n->list_lock, flags);
+ spin_unlock_irqrestore(&n->lock, flags);
}

for (i = 0; i < t.count; i++) {
Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h 2010-10-04 08:26:27.000000000 -0500
+++ linux-2.6/include/linux/slub_def.h 2010-10-04 08:41:48.000000000 -0500
@@ -62,7 +62,7 @@ struct kmem_cache_cpu {
};

struct kmem_cache_node {
- spinlock_t list_lock; /* Protect partial list and nr_partial */
+ spinlock_t lock; /* Protocts slab metadata on a node */
unsigned long nr_partial;
struct list_head partial;
#ifdef CONFIG_SLUB_DEBUG

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>

Christoph Lameter

2010-10-05 18:57:40 UTC

Provide some more detail on what is going on with various types of object
in slabs. This is mainly useful for debugging the queueing operations.

Signed-off-by: Christoph Lameter <***@linux.com>

---
mm/slub.c | 86 +++++++++++++++++++++++++++++++++++++++++++++++++++++---------
1 file changed, 74 insertions(+), 12 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2010-10-05 13:40:08.000000000 -0500
+++ linux-2.6/mm/slub.c 2010-10-05 13:40:11.000000000 -0500
@@ -4151,12 +4151,24 @@ static int count_total(struct page *page
#endif

#ifdef CONFIG_SLUB_DEBUG
-static int validate_slab(struct kmem_cache *s, struct page *page)
+
+struct validate_counters {
+ int objects;
+ int available;
+ int queue;
+ int checked;
+ int unchecked;
+ int hist[];
+};
+
+static int validate_slab(struct kmem_cache *s, struct page *page,
+ int partial, struct validate_counters *v)
{
void *p;
void *addr = page_address(page);
unsigned long *m = map(page);
unsigned long errors = 0;
+ unsigned long inuse = 0;

if (!check_slab(s, page) || !verify_slab(s, page))
return 0;
@@ -4168,7 +4180,10 @@ static int validate_slab(struct kmem_cac
/* Available */
if (!check_object(s, page, p, SLUB_RED_INACTIVE))
errors++;
+ else
+ v->available++;
} else {
+ inuse++;
#ifdef CONFIG_SLUB_DEBUG
/*
* We cannot check if the object is on a queue without
@@ -4178,24 +4193,45 @@ static int validate_slab(struct kmem_cac
if (s->flags & SLAB_RED_ZONE) {
u8 *q = p + s->objsize;

- if (*q != SLUB_RED_QUEUE)
+ if (*q != SLUB_RED_QUEUE) {
if (!check_object(s, page, p, SLUB_RED_ACTIVE))
errors++;
- }
+ else
+ v->checked++;
+ } else
+ v->queue++;
+ } else
+ /*
+ * Allocated object that cannot be verified
+ * since red zoning is diabled. The object
+ * may be free after all if its on a queue.
+ */
#endif
+ v->unchecked++;
}
}

+ v->hist[inuse]++;
+
+ if (inuse < page->objects) {
+ if (!partial)
+ slab_err(s, page, "Objects available but not on partial list");
+ } else {
+ if (partial)
+ slab_err(s, page, "On partial list but no object available");
+ }
+ v->objects += page->objects;
return errors;
}

-static unsigned long validate_slab_slab(struct kmem_cache *s, struct page *page)
+static unsigned long validate_slab_slab(struct kmem_cache *s,
+ struct page *page, int partial, struct validate_counters *v)
{
- return validate_slab(s, page);
+ return validate_slab(s, page, partial, v);
}

static int validate_slab_node(struct kmem_cache *s,
- struct kmem_cache_node *n)
+ struct kmem_cache_node *n, struct validate_counters *v)
{
unsigned long count = 0;
struct page *page;
@@ -4206,7 +4242,7 @@ static int validate_slab_node(struct kme

list_for_each_entry(page, &n->partial, lru) {
if (get_node(s, page_to_nid(page)) == n)
- errors += validate_slab_slab(s, page);
+ errors += validate_slab_slab(s, page, 1, v);
else
printk(KERN_ERR "SLUB %s: Partial list page from wrong node\n", s->name);
count++;
@@ -4219,7 +4255,7 @@ static int validate_slab_node(struct kme
goto out;

list_for_each_entry(page, &n->full, lru) {
- validate_slab_slab(s, page);
+ errors += validate_slab_slab(s, page, 0, v);
count++;
}
if (count != atomic_long_read(&n->nr_slabs))
@@ -4235,15 +4271,41 @@ out:
static long validate_slab_cache(struct kmem_cache *s)
{
int node;
- unsigned long count = 0;
+ int i;
+ struct validate_counters *v;
+ unsigned long errors = 0;
+ int maxobj = oo_objects(s->max);
+
+ v = kzalloc(GFP_KERNEL, offsetof(struct validate_counters, hist) + maxobj * sizeof(int));
+ if (!v)
+ return -ENOMEM;

- flush_all(s);
for_each_node_state(node, N_NORMAL_MEMORY) {
struct kmem_cache_node *n = get_node(s, node);

- count += validate_slab_node(s, n);
+ errors += validate_slab_node(s, n, v);
}
- return count;
+
+ printk(KERN_DEBUG "Validation of slab %s: total=%d available=%d checked=%d",
+ s->name, v->objects, v->available, v->checked);
+
+ if (v->unchecked)
+ printk(" unchecked=%d", v->unchecked);
+
+ if (v->queue)
+ printk(" onqueue=%d", v->queue);
+
+ if (errors)
+ printk(" errors=%lu", errors);
+
+ for (i = 0; i < maxobj; i++)
+ if (v->hist[i])
+ printk(" p<%d>=%d", i, v->hist[i]);
+
+ printk("\n");
+ kfree(v);
+
+ return errors;
}
/*
* Generate lists of code addresses where slabcache objects are allocated

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>

Christoph Lameter

2010-10-05 18:57:31 UTC

People get confused because the output repeats some basic hardware
configuration values. Some of the items listed no
longer have the same relevance in the queued form of SLUB.

Signed-off-by: Christoph Lameter <***@linux-foundation.org>

---
mm/slub.c | 6 ------
1 file changed, 6 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2010-10-02 18:10:45.000000000 -0500
+++ linux-2.6/mm/slub.c 2010-10-02 18:10:50.000000000 -0500
@@ -3249,12 +3249,6 @@ void __init kmem_cache_init(void)
}
}
#endif
- printk(KERN_INFO
- "SLUB: Genslabs=%d, HWalign=%d, Order=%d-%d, MinObjects=%d,"
- " CPUs=%d, Nodes=%d\n",
- caches, cache_line_size(),
- slub_min_order, slub_max_order, slub_min_objects,
- nr_cpu_ids, nr_node_ids);
}

void __init kmem_cache_init_late(void)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>

Pekka Enberg

2010-10-06 08:01:35 UTC

(Adding more people who've taken interest in slab performance in the
past to CC.)

Post by Christoph Lameter
- Lots of debugging
- Performance optimizations (more would be good)...
- Drop per slab locking in favor of per node locking for
partial lists (queuing implies freeing large amounts of objects
to per node lists of slab).
- Implement object expiration via reclaim VM logic.
The following is a release of an allocator based on SLAB
and SLUB that integrates the best approaches from both allocators. The
per cpu queuing is like in SLAB whereas much of the infrastructure
comes from SLUB.
After this patches SLUB will track the cpu cache contents
1. SLUB accurately tracks cpu caches instead of assuming that there
is only a single cpu cache per node or system.
2. SLUB object expiration is tied into the page reclaim logic. There
is no periodic cache expiration.
3. SLUB caches are dynamically configurable via the sysfs filesystem.
4. There is no per slab page metadata structure to maintain (aside
from the object bitmap that usually fits into the page struct).
5. Has all the resiliency and diagnostic features of SLUB.
The unified allocator is a merging of SLUB with some queuing concepts from
SLAB and a new way of managing objects in the slabs using bitmaps. Memory
wise this is slightly more inefficient than SLUB (due to the need to place
large bitmaps --sized a few words--in some slab pages if there are more
than BITS_PER_LONG objects in a slab) but in general does not increase space
use too much.
The SLAB scheme of not touching the object during management is adopted.
The unified allocator can efficiently free and allocate cache cold objects
without causing cache misses.
Some numbers using tcp_rr on localhost
Dell R910 128G RAM, 64 processors, 4 NUMA nodes
threads unified slub slab
64 4141798 3729037 3884939
128 4146587 3890993 4105276
192 4003063 3876570 4110971
256 3928857 3942806 4099249
320 3922623 3969042 4093283
384 3827603 4002833 4108420
448 4140345 4027251 4118534
512 4163741 4050130 4122644
576 4175666 4099934 4149355
640 4190332 4142570 4175618
704 4198779 4173177 4193657
768 4662216 4200462 4222686

Are there any stability problems left? Have you tried other benchmarks
(e.g. hackbench, sysbench)? Can we merge the series in smaller
batches? For example, if we leave out the NUMA parts in the first
stage, do we expect to see performance regressions?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>

Richard Kennedy

2010-10-06 11:03:27 UTC

Post by Pekka Enberg
(Adding more people who've taken interest in slab performance in the
past to CC.)

Post by Christoph Lameter
- Lots of debugging
- Performance optimizations (more would be good)...
- Drop per slab locking in favor of per node locking for
partial lists (queuing implies freeing large amounts of objects
to per node lists of slab).
- Implement object expiration via reclaim VM logic.
The following is a release of an allocator based on SLAB
and SLUB that integrates the best approaches from both allocators. The
per cpu queuing is like in SLAB whereas much of the infrastructure
comes from SLUB.
After this patches SLUB will track the cpu cache contents
1. SLUB accurately tracks cpu caches instead of assuming that there
is only a single cpu cache per node or system.
2. SLUB object expiration is tied into the page reclaim logic. There
is no periodic cache expiration.
3. SLUB caches are dynamically configurable via the sysfs filesystem.
4. There is no per slab page metadata structure to maintain (aside
from the object bitmap that usually fits into the page struct).
5. Has all the resiliency and diagnostic features of SLUB.
The unified allocator is a merging of SLUB with some queuing concepts from
SLAB and a new way of managing objects in the slabs using bitmaps. Memory
wise this is slightly more inefficient than SLUB (due to the need to place
large bitmaps --sized a few words--in some slab pages if there are more
than BITS_PER_LONG objects in a slab) but in general does not increase space
use too much.
The SLAB scheme of not touching the object during management is adopted.
The unified allocator can efficiently free and allocate cache cold objects
without causing cache misses.

Hi Christoph,
What tree are these patches against ? I'm getting patch failures on the
main tree.

regards
Richard

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>

Pekka Enberg

2010-10-06 11:19:13 UTC

Post by Richard Kennedy
What tree are these patches against ? I'm getting patch failures on the
main tree.

The 'slab/next' branch of slab.git:

http://git.kernel.org/?p=linux/kernel/git/penberg/slab-2.6.git;a=summary

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>

Richard Kennedy

2010-10-06 15:46:19 UTC

Post by Pekka Enberg

Post by Richard Kennedy
What tree are these patches against ? I'm getting patch failures on the
main tree.

http://git.kernel.org/?p=linux/kernel/git/penberg/slab-2.6.git;a=summary

thanks, I'll have a look at that

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>

Christoph Lameter

2010-10-06 16:21:13 UTC

Some additional patches to slabinfo

Subject: slub: Move slabinfo.c to tools/slub/slabinfo.c

We now have a tools directory for these things.

Signed-off-by: Christoph Lameter <***@linux.com>

---
Documentation/vm/slabinfo.c | 1364 --------------------------------------------
tools/slub/slabinfo.c | 1364 ++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 1364 insertions(+), 1364 deletions(-)

Index: linux-2.6/Documentation/vm/slabinfo.c
===================================================================
--- linux-2.6.orig/Documentation/vm/slabinfo.c 2010-10-05 16:26:18.000000000 -0500
+++ /dev/null 1970-01-01 00:00:00.000000000 +0000
@@ -1,1364 +0,0 @@
-/*
- * Slabinfo: Tool to get reports about slabs
- *
- * (C) 2007 sgi, Christoph Lameter
- *
- * Compile by:
- *
- * gcc -o slabinfo slabinfo.c
- */
-#include <stdio.h>
-#include <stdlib.h>
-#include <sys/types.h>
-#include <dirent.h>
-#include <strings.h>
-#include <string.h>
-#include <unistd.h>
-#include <stdarg.h>
-#include <getopt.h>
-#include <regex.h>
-#include <errno.h>
-
-#define MAX_SLABS 500
-#define MAX_ALIASES 500
-#define MAX_NODES 1024
-
-struct slabinfo {
- char *name;
- int alias;
- int refs;
- int aliases, align, cache_dma, cpu_slabs, destroy_by_rcu;
- int hwcache_align, object_size, objs_per_slab;
- int sanity_checks, slab_size, store_user, trace;
- int order, poison, reclaim_account, red_zone;
- unsigned long partial, objects, slabs, objects_partial, objects_total;
- unsigned long alloc_fastpath, alloc_slowpath;
- unsigned long free_fastpath, free_slowpath;
- unsigned long free_frozen, free_add_partial, free_remove_partial;
- unsigned long alloc_from_partial, alloc_slab, free_slab, alloc_refill;
- unsigned long cpuslab_flush, deactivate_full, deactivate_empty;
- unsigned long deactivate_to_head, deactivate_to_tail;
- unsigned long deactivate_remote_frees, order_fallback;
- int numa[MAX_NODES];
- int numa_partial[MAX_NODES];
-} slabinfo[MAX_SLABS];
-
-struct aliasinfo {
- char *name;
- char *ref;
- struct slabinfo *slab;
-} aliasinfo[MAX_ALIASES];
-
-int slabs = 0;
-int actual_slabs = 0;
-int aliases = 0;
-int alias_targets = 0;
-int highest_node = 0;
-
-char buffer[4096];
-
-int show_empty = 0;
-int show_report = 0;
-int show_alias = 0;
-int show_slab = 0;
-int skip_zero = 1;
-int show_numa = 0;
-int show_track = 0;
-int show_first_alias = 0;
-int validate = 0;
-int shrink = 0;
-int show_inverted = 0;
-int show_single_ref = 0;
-int show_totals = 0;
-int sort_size = 0;
-int sort_active = 0;
-int set_debug = 0;
-int show_ops = 0;
-int show_activity = 0;
-
-/* Debug options */
-int sanity = 0;
-int redzone = 0;
-int poison = 0;
-int tracking = 0;
-int tracing = 0;
-
-int page_size;
-
-regex_t pattern;
-
-static void fatal(const char *x, ...)
-{
- va_list ap;
-
- va_start(ap, x);
- vfprintf(stderr, x, ap);
- va_end(ap);
- exit(EXIT_FAILURE);
-}
-
-static void usage(void)
-{
- printf("slabinfo 5/7/2007. (c) 2007 sgi.\n\n"
- "slabinfo [-ahnpvtsz] [-d debugopts] [slab-regexp]\n"
- "-a|--aliases Show aliases\n"
- "-A|--activity Most active slabs first\n"
- "-d<options>|--debug=<options> Set/Clear Debug options\n"
- "-D|--display-active Switch line format to activity\n"
- "-e|--empty Show empty slabs\n"
- "-f|--first-alias Show first alias\n"
- "-h|--help Show usage information\n"
- "-i|--inverted Inverted list\n"
- "-l|--slabs Show slabs\n"
- "-n|--numa Show NUMA information\n"
- "-o|--ops Show kmem_cache_ops\n"
- "-s|--shrink Shrink slabs\n"
- "-r|--report Detailed report on single slabs\n"
- "-S|--Size Sort by size\n"
- "-t|--tracking Show alloc/free information\n"
- "-T|--Totals Show summary information\n"
- "-v|--validate Validate slabs\n"
- "-z|--zero Include empty slabs\n"
- "-1|--1ref Single reference\n"
- "\nValid debug options (FZPUT may be combined)\n"
- "a / A Switch on all debug options (=FZUP)\n"
- "- Switch off all debug options\n"
- "f / F Sanity Checks (SLAB_DEBUG_FREE)\n"
- "z / Z Redzoning\n"
- "p / P Poisoning\n"
- "u / U Tracking\n"
- "t / T Tracing\n"
- );
-}
-
-static unsigned long read_obj(const char *name)
-{
- FILE *f = fopen(name, "r");
-
- if (!f)
- buffer[0] = 0;
- else {
- if (!fgets(buffer, sizeof(buffer), f))
- buffer[0] = 0;
- fclose(f);
- if (buffer[strlen(buffer)] == '\n')
- buffer[strlen(buffer)] = 0;
- }
- return strlen(buffer);
-}
-
-
-/*
- * Get the contents of an attribute
- */
-static unsigned long get_obj(const char *name)
-{
- if (!read_obj(name))
- return 0;
-
- return atol(buffer);
-}
-
-static unsigned long get_obj_and_str(const char *name, char **x)
-{
- unsigned long result = 0;
- char *p;
-
- *x = NULL;
-
- if (!read_obj(name)) {
- x = NULL;
- return 0;
- }
- result = strtoul(buffer, &p, 10);
- while (*p == ' ')
- p++;
- if (*p)
- *x = strdup(p);
- return result;
-}
-
-static void set_obj(struct slabinfo *s, const char *name, int n)
-{
- char x[100];
- FILE *f;
-
- snprintf(x, 100, "%s/%s", s->name, name);
- f = fopen(x, "w");
- if (!f)
- fatal("Cannot write to %s\n", x);
-
- fprintf(f, "%d\n", n);
- fclose(f);
-}
-
-static unsigned long read_slab_obj(struct slabinfo *s, const char *name)
-{
- char x[100];
- FILE *f;
- size_t l;
-
- snprintf(x, 100, "%s/%s", s->name, name);
- f = fopen(x, "r");
- if (!f) {
- buffer[0] = 0;
- l = 0;
- } else {
- l = fread(buffer, 1, sizeof(buffer), f);
- buffer[l] = 0;
- fclose(f);
- }
- return l;
-}
-
-
-/*
- * Put a size string together
- */
-static int store_size(char *buffer, unsigned long value)
-{
- unsigned long divisor = 1;
- char trailer = 0;
- int n;
-
- if (value > 1000000000UL) {
- divisor = 100000000UL;
- trailer = 'G';
- } else if (value > 1000000UL) {
- divisor = 100000UL;
- trailer = 'M';
- } else if (value > 1000UL) {
- divisor = 100;
- trailer = 'K';
- }
-
- value /= divisor;
- n = sprintf(buffer, "%ld",value);
- if (trailer) {
- buffer[n] = trailer;
- n++;
- buffer[n] = 0;
- }
- if (divisor != 1) {
- memmove(buffer + n - 2, buffer + n - 3, 4);
- buffer[n-2] = '.';
- n++;
- }
- return n;
-}
-
-static void decode_numa_list(int *numa, char *t)
-{
- int node;
- int nr;
-
- memset(numa, 0, MAX_NODES * sizeof(int));
-
- if (!t)
- return;
-
- while (*t == 'N') {
- t++;
- node = strtoul(t, &t, 10);
- if (*t == '=') {
- t++;
- nr = strtoul(t, &t, 10);
- numa[node] = nr;
- if (node > highest_node)
- highest_node = node;
- }
- while (*t == ' ')
- t++;
- }
-}
-
-static void slab_validate(struct slabinfo *s)
-{
- if (strcmp(s->name, "*") == 0)
- return;
-
- set_obj(s, "validate", 1);
-}
-
-static void slab_shrink(struct slabinfo *s)
-{
- if (strcmp(s->name, "*") == 0)
- return;
-
- set_obj(s, "shrink", 1);
-}
-
-int line = 0;
-
-static void first_line(void)
-{
- if (show_activity)
- printf("Name Objects Alloc Free %%Fast Fallb O\n");
- else
- printf("Name Objects Objsize Space "
- "Slabs/Part/Cpu O/S O %%Fr %%Ef Flg\n");
-}
-
-/*
- * Find the shortest alias of a slab
- */
-static struct aliasinfo *find_one_alias(struct slabinfo *find)
-{
- struct aliasinfo *a;
- struct aliasinfo *best = NULL;
-
- for(a = aliasinfo;a < aliasinfo + aliases; a++) {
- if (a->slab == find &&
- (!best || strlen(best->name) < strlen(a->name))) {
- best = a;
- if (strncmp(a->name,"kmall", 5) == 0)
- return best;
- }
- }
- return best;
-}
-
-static unsigned long slab_size(struct slabinfo *s)
-{
- return s->slabs * (page_size << s->order);
-}
-
-static unsigned long slab_activity(struct slabinfo *s)
-{
- return s->alloc_fastpath + s->free_fastpath +
- s->alloc_slowpath + s->free_slowpath;
-}
-
-static void slab_numa(struct slabinfo *s, int mode)
-{
- int node;
-
- if (strcmp(s->name, "*") == 0)
- return;
-
- if (!highest_node) {
- printf("\n%s: No NUMA information available.\n", s->name);
- return;
- }
-
- if (skip_zero && !s->slabs)
- return;
-
- if (!line) {
- printf("\n%-21s:", mode ? "NUMA nodes" : "Slab");
- for(node = 0; node <= highest_node; node++)
- printf(" %4d", node);
- printf("\n----------------------");
- for(node = 0; node <= highest_node; node++)
- printf("-----");
- printf("\n");
- }
- printf("%-21s ", mode ? "All slabs" : s->name);
- for(node = 0; node <= highest_node; node++) {
- char b[20];
-
- store_size(b, s->numa[node]);
- printf(" %4s", b);
- }
- printf("\n");
- if (mode) {
- printf("%-21s ", "Partial slabs");
- for(node = 0; node <= highest_node; node++) {
- char b[20];
-
- store_size(b, s->numa_partial[node]);
- printf(" %4s", b);
- }
- printf("\n");
- }
- line++;
-}
-
-static void show_tracking(struct slabinfo *s)
-{
- printf("\n%s: Kernel object allocation\n", s->name);
- printf("-----------------------------------------------------------------------\n");
- if (read_slab_obj(s, "alloc_calls"))
- printf(buffer);
- else
- printf("No Data\n");
-
- printf("\n%s: Kernel object freeing\n", s->name);
- printf("------------------------------------------------------------------------\n");
- if (read_slab_obj(s, "free_calls"))
- printf(buffer);
- else
- printf("No Data\n");
-
-}
-
-static void ops(struct slabinfo *s)
-{
- if (strcmp(s->name, "*") == 0)
- return;
-
- if (read_slab_obj(s, "ops")) {
- printf("\n%s: kmem_cache operations\n", s->name);
- printf("--------------------------------------------\n");
- printf(buffer);
- } else
- printf("\n%s has no kmem_cache operations\n", s->name);
-}
-
-static const char *onoff(int x)
-{
- if (x)
- return "On ";
- return "Off";
-}
-
-static void slab_stats(struct slabinfo *s)
-{
- unsigned long total_alloc;
- unsigned long total_free;
- unsigned long total;
-
- if (!s->alloc_slab)
- return;
-
- total_alloc = s->alloc_fastpath + s->alloc_slowpath;
- total_free = s->free_fastpath + s->free_slowpath;
-
- if (!total_alloc)
- return;
-
- printf("\n");
- printf("Slab Perf Counter Alloc Free %%Al %%Fr\n");
- printf("--------------------------------------------------\n");
- printf("Fastpath %8lu %8lu %3lu %3lu\n",
- s->alloc_fastpath, s->free_fastpath,
- s->alloc_fastpath * 100 / total_alloc,
- s->free_fastpath * 100 / total_free);
- printf("Slowpath %8lu %8lu %3lu %3lu\n",
- total_alloc - s->alloc_fastpath, s->free_slowpath,
- (total_alloc - s->alloc_fastpath) * 100 / total_alloc,
- s->free_slowpath * 100 / total_free);
- printf("Page Alloc %8lu %8lu %3lu %3lu\n",
- s->alloc_slab, s->free_slab,
- s->alloc_slab * 100 / total_alloc,
- s->free_slab * 100 / total_free);
- printf("Add partial %8lu %8lu %3lu %3lu\n",
- s->deactivate_to_head + s->deactivate_to_tail,
- s->free_add_partial,
- (s->deactivate_to_head + s->deactivate_to_tail) * 100 / total_alloc,
- s->free_add_partial * 100 / total_free);
- printf("Remove partial %8lu %8lu %3lu %3lu\n",
- s->alloc_from_partial, s->free_remove_partial,
- s->alloc_from_partial * 100 / total_alloc,
- s->free_remove_partial * 100 / total_free);
-
- printf("RemoteObj/SlabFrozen %8lu %8lu %3lu %3lu\n",
- s->deactivate_remote_frees, s->free_frozen,
- s->deactivate_remote_frees * 100 / total_alloc,
- s->free_frozen * 100 / total_free);
-
- printf("Total %8lu %8lu\n\n", total_alloc, total_free);
-
- if (s->cpuslab_flush)
- printf("Flushes %8lu\n", s->cpuslab_flush);
-
- if (s->alloc_refill)
- printf("Refill %8lu\n", s->alloc_refill);
-
- total = s->deactivate_full + s->deactivate_empty +
- s->deactivate_to_head + s->deactivate_to_tail;
-
- if (total)
- printf("Deactivate Full=%lu(%lu%%) Empty=%lu(%lu%%) "
- "ToHead=%lu(%lu%%) ToTail=%lu(%lu%%)\n",
- s->deactivate_full, (s->deactivate_full * 100) / total,
- s->deactivate_empty, (s->deactivate_empty * 100) / total,
- s->deactivate_to_head, (s->deactivate_to_head * 100) / total,
- s->deactivate_to_tail, (s->deactivate_to_tail * 100) / total);
-}
-
-static void report(struct slabinfo *s)
-{
- if (strcmp(s->name, "*") == 0)
- return;
-
- printf("\nSlabcache: %-20s Aliases: %2d Order : %2d Objects: %lu\n",
- s->name, s->aliases, s->order, s->objects);
- if (s->hwcache_align)
- printf("** Hardware cacheline aligned\n");
- if (s->cache_dma)
- printf("** Memory is allocated in a special DMA zone\n");
- if (s->destroy_by_rcu)
- printf("** Slabs are destroyed via RCU\n");
- if (s->reclaim_account)
- printf("** Reclaim accounting active\n");
-
- printf("\nSizes (bytes) Slabs Debug Memory\n");
- printf("------------------------------------------------------------------------\n");
- printf("Object : %7d Total : %7ld Sanity Checks : %s Total: %7ld\n",
- s->object_size, s->slabs, onoff(s->sanity_checks),
- s->slabs * (page_size << s->order));
- printf("SlabObj: %7d Full : %7ld Redzoning : %s Used : %7ld\n",
- s->slab_size, s->slabs - s->partial - s->cpu_slabs,
- onoff(s->red_zone), s->objects * s->object_size);
- printf("SlabSiz: %7d Partial: %7ld Poisoning : %s Loss : %7ld\n",
- page_size << s->order, s->partial, onoff(s->poison),
- s->slabs * (page_size << s->order) - s->objects * s->object_size);
- printf("Loss : %7d CpuSlab: %7d Tracking : %s Lalig: %7ld\n",
- s->slab_size - s->object_size, s->cpu_slabs, onoff(s->store_user),
- (s->slab_size - s->object_size) * s->objects);
- printf("Align : %7d Objects: %7d Tracing : %s Lpadd: %7ld\n",
- s->align, s->objs_per_slab, onoff(s->trace),
- ((page_size << s->order) - s->objs_per_slab * s->slab_size) *
- s->slabs);
-
- ops(s);
- show_tracking(s);
- slab_numa(s, 1);
- slab_stats(s);
-}
-
-static void slabcache(struct slabinfo *s)
-{
- char size_str[20];
- char dist_str[40];
- char flags[20];
- char *p = flags;
-
- if (strcmp(s->name, "*") == 0)
- return;
-
- if (actual_slabs == 1) {
- report(s);
- return;
- }
-
- if (skip_zero && !show_empty && !s->slabs)
- return;
-
- if (show_empty && s->slabs)
- return;
-
- store_size(size_str, slab_size(s));
- snprintf(dist_str, 40, "%lu/%lu/%d", s->slabs - s->cpu_slabs,
- s->partial, s->cpu_slabs);
-
- if (!line++)
- first_line();
-
- if (s->aliases)
- *p++ = '*';
- if (s->cache_dma)
- *p++ = 'd';
- if (s->hwcache_align)
- *p++ = 'A';
- if (s->poison)
- *p++ = 'P';
- if (s->reclaim_account)
- *p++ = 'a';
- if (s->red_zone)
- *p++ = 'Z';
- if (s->sanity_checks)
- *p++ = 'F';
- if (s->store_user)
- *p++ = 'U';
- if (s->trace)
- *p++ = 'T';
-
- *p = 0;
- if (show_activity) {
- unsigned long total_alloc;
- unsigned long total_free;
-
- total_alloc = s->alloc_fastpath + s->alloc_slowpath;
- total_free = s->free_fastpath + s->free_slowpath;
-
- printf("%-21s %8ld %10ld %10ld %3ld %3ld %5ld %1d\n",
- s->name, s->objects,
- total_alloc, total_free,
- total_alloc ? (s->alloc_fastpath * 100 / total_alloc) : 0,
- total_free ? (s->free_fastpath * 100 / total_free) : 0,
- s->order_fallback, s->order);
- }
- else
- printf("%-21s %8ld %7d %8s %14s %4d %1d %3ld %3ld %s\n",
- s->name, s->objects, s->object_size, size_str, dist_str,
- s->objs_per_slab, s->order,
- s->slabs ? (s->partial * 100) / s->slabs : 100,
- s->slabs ? (s->objects * s->object_size * 100) /
- (s->slabs * (page_size << s->order)) : 100,
- flags);
-}
-
-/*
- * Analyze debug options. Return false if something is amiss.
- */
-static int debug_opt_scan(char *opt)
-{
- if (!opt || !opt[0] || strcmp(opt, "-") == 0)
- return 1;
-
- if (strcasecmp(opt, "a") == 0) {
- sanity = 1;
- poison = 1;
- redzone = 1;
- tracking = 1;
- return 1;
- }
-
- for ( ; *opt; opt++)
- switch (*opt) {
- case 'F' : case 'f':
- if (sanity)
- return 0;
- sanity = 1;
- break;
- case 'P' : case 'p':
- if (poison)
- return 0;
- poison = 1;
- break;
-
- case 'Z' : case 'z':
- if (redzone)
- return 0;
- redzone = 1;
- break;
-
- case 'U' : case 'u':
- if (tracking)
- return 0;
- tracking = 1;
- break;
-
- case 'T' : case 't':
- if (tracing)
- return 0;
- tracing = 1;
- break;
- default:
- return 0;
- }
- return 1;
-}
-
-static int slab_empty(struct slabinfo *s)
-{
- if (s->objects > 0)
- return 0;
-
- /*
- * We may still have slabs even if there are no objects. Shrinking will
- * remove them.
- */
- if (s->slabs != 0)
- set_obj(s, "shrink", 1);
-
- return 1;
-}
-
-static void slab_debug(struct slabinfo *s)
-{
- if (strcmp(s->name, "*") == 0)
- return;
-
- if (sanity && !s->sanity_checks) {
- set_obj(s, "sanity", 1);
- }
- if (!sanity && s->sanity_checks) {
- if (slab_empty(s))
- set_obj(s, "sanity", 0);
- else
- fprintf(stderr, "%s not empty cannot disable sanity checks\n", s->name);
- }
- if (redzone && !s->red_zone) {
- if (slab_empty(s))
- set_obj(s, "red_zone", 1);
- else
- fprintf(stderr, "%s not empty cannot enable redzoning\n", s->name);
- }
- if (!redzone && s->red_zone) {
- if (slab_empty(s))
- set_obj(s, "red_zone", 0);
- else
- fprintf(stderr, "%s not empty cannot disable redzoning\n", s->name);
- }
- if (poison && !s->poison) {
- if (slab_empty(s))
- set_obj(s, "poison", 1);
- else
- fprintf(stderr, "%s not empty cannot enable poisoning\n", s->name);
- }
- if (!poison && s->poison) {
- if (slab_empty(s))
- set_obj(s, "poison", 0);
- else
- fprintf(stderr, "%s not empty cannot disable poisoning\n", s->name);
- }
- if (tracking && !s->store_user) {
- if (slab_empty(s))
- set_obj(s, "store_user", 1);
- else
- fprintf(stderr, "%s not empty cannot enable tracking\n", s->name);
- }
- if (!tracking && s->store_user) {
- if (slab_empty(s))
- set_obj(s, "store_user", 0);
- else
- fprintf(stderr, "%s not empty cannot disable tracking\n", s->name);
- }
- if (tracing && !s->trace) {
- if (slabs == 1)
- set_obj(s, "trace", 1);
- else
- fprintf(stderr, "%s can only enable trace for one slab at a time\n", s->name);
- }
- if (!tracing && s->trace)
- set_obj(s, "trace", 1);
-}
-
-static void totals(void)
-{
- struct slabinfo *s;
-
- int used_slabs = 0;
- char b1[20], b2[20], b3[20], b4[20];
- unsigned long long max = 1ULL << 63;
-
- /* Object size */
- unsigned long long min_objsize = max, max_objsize = 0, avg_objsize;
-
- /* Number of partial slabs in a slabcache */
- unsigned long long min_partial = max, max_partial = 0,
- avg_partial, total_partial = 0;
-
- /* Number of slabs in a slab cache */
- unsigned long long min_slabs = max, max_slabs = 0,
- avg_slabs, total_slabs = 0;
-
- /* Size of the whole slab */
- unsigned long long min_size = max, max_size = 0,
- avg_size, total_size = 0;
-
- /* Bytes used for object storage in a slab */
- unsigned long long min_used = max, max_used = 0,
- avg_used, total_used = 0;
-
- /* Waste: Bytes used for alignment and padding */
- unsigned long long min_waste = max, max_waste = 0,
- avg_waste, total_waste = 0;
- /* Number of objects in a slab */
- unsigned long long min_objects = max, max_objects = 0,
- avg_objects, total_objects = 0;
- /* Waste per object */
- unsigned long long min_objwaste = max,
- max_objwaste = 0, avg_objwaste,
- total_objwaste = 0;
-
- /* Memory per object */
- unsigned long long min_memobj = max,
- max_memobj = 0, avg_memobj,
- total_objsize = 0;
-
- /* Percentage of partial slabs per slab */
- unsigned long min_ppart = 100, max_ppart = 0,
- avg_ppart, total_ppart = 0;
-
- /* Number of objects in partial slabs */
- unsigned long min_partobj = max, max_partobj = 0,
- avg_partobj, total_partobj = 0;
-
- /* Percentage of partial objects of all objects in a slab */
- unsigned long min_ppartobj = 100, max_ppartobj = 0,
- avg_ppartobj, total_ppartobj = 0;
-
-
- for (s = slabinfo; s < slabinfo + slabs; s++) {
- unsigned long long size;
- unsigned long used;
- unsigned long long wasted;
- unsigned long long objwaste;
- unsigned long percentage_partial_slabs;
- unsigned long percentage_partial_objs;
-
- if (!s->slabs || !s->objects)
- continue;
-
- used_slabs++;
-
- size = slab_size(s);
- used = s->objects * s->object_size;
- wasted = size - used;
- objwaste = s->slab_size - s->object_size;
-
- percentage_partial_slabs = s->partial * 100 / s->slabs;
- if (percentage_partial_slabs > 100)
- percentage_partial_slabs = 100;
-
- percentage_partial_objs = s->objects_partial * 100
- / s->objects;
-
- if (percentage_partial_objs > 100)
- percentage_partial_objs = 100;
-
- if (s->object_size < min_objsize)
- min_objsize = s->object_size;
- if (s->partial < min_partial)
- min_partial = s->partial;
- if (s->slabs < min_slabs)
- min_slabs = s->slabs;
- if (size < min_size)
- min_size = size;
- if (wasted < min_waste)
- min_waste = wasted;
- if (objwaste < min_objwaste)
- min_objwaste = objwaste;
- if (s->objects < min_objects)
- min_objects = s->objects;
- if (used < min_used)
- min_used = used;
- if (s->objects_partial < min_partobj)
- min_partobj = s->objects_partial;
- if (percentage_partial_slabs < min_ppart)
- min_ppart = percentage_partial_slabs;
- if (percentage_partial_objs < min_ppartobj)
- min_ppartobj = percentage_partial_objs;
- if (s->slab_size < min_memobj)
- min_memobj = s->slab_size;
-
- if (s->object_size > max_objsize)
- max_objsize = s->object_size;
- if (s->partial > max_partial)
- max_partial = s->partial;
- if (s->slabs > max_slabs)
- max_slabs = s->slabs;
- if (size > max_size)
- max_size = size;
- if (wasted > max_waste)
- max_waste = wasted;
- if (objwaste > max_objwaste)
- max_objwaste = objwaste;
- if (s->objects > max_objects)
- max_objects = s->objects;
- if (used > max_used)
- max_used = used;
- if (s->objects_partial > max_partobj)
- max_partobj = s->objects_partial;
- if (percentage_partial_slabs > max_ppart)
- max_ppart = percentage_partial_slabs;
- if (percentage_partial_objs > max_ppartobj)
- max_ppartobj = percentage_partial_objs;
- if (s->slab_size > max_memobj)
- max_memobj = s->slab_size;
-
- total_partial += s->partial;
- total_slabs += s->slabs;
- total_size += size;
- total_waste += wasted;
-
- total_objects += s->objects;
- total_used += used;
- total_partobj += s->objects_partial;
- total_ppart += percentage_partial_slabs;
- total_ppartobj += percentage_partial_objs;
-
- total_objwaste += s->objects * objwaste;
- total_objsize += s->objects * s->slab_size;
- }
-
- if (!total_objects) {
- printf("No objects\n");
- return;
- }
- if (!used_slabs) {
- printf("No slabs\n");
- return;
- }
-
- /* Per slab averages */
- avg_partial = total_partial / used_slabs;
- avg_slabs = total_slabs / used_slabs;
- avg_size = total_size / used_slabs;
- avg_waste = total_waste / used_slabs;
-
- avg_objects = total_objects / used_slabs;
- avg_used = total_used / used_slabs;
- avg_partobj = total_partobj / used_slabs;
- avg_ppart = total_ppart / used_slabs;
- avg_ppartobj = total_ppartobj / used_slabs;
-
- /* Per object object sizes */
- avg_objsize = total_used / total_objects;
- avg_objwaste = total_objwaste / total_objects;
- avg_partobj = total_partobj * 100 / total_objects;
- avg_memobj = total_objsize / total_objects;
-
- printf("Slabcache Totals\n");
- printf("----------------\n");
- printf("Slabcaches : %3d Aliases : %3d->%-3d Active: %3d\n",
- slabs, aliases, alias_targets, used_slabs);
-
- store_size(b1, total_size);store_size(b2, total_waste);
- store_size(b3, total_waste * 100 / total_used);
- printf("Memory used: %6s # Loss : %6s MRatio:%6s%%\n", b1, b2, b3);
-
- store_size(b1, total_objects);store_size(b2, total_partobj);
- store_size(b3, total_partobj * 100 / total_objects);
- printf("# Objects : %6s # PartObj: %6s ORatio:%6s%%\n", b1, b2, b3);
-
- printf("\n");
- printf("Per Cache Average Min Max Total\n");
- printf("---------------------------------------------------------\n");
-
- store_size(b1, avg_objects);store_size(b2, min_objects);
- store_size(b3, max_objects);store_size(b4, total_objects);
- printf("#Objects %10s %10s %10s %10s\n",
- b1, b2, b3, b4);
-
- store_size(b1, avg_slabs);store_size(b2, min_slabs);
- store_size(b3, max_slabs);store_size(b4, total_slabs);
- printf("#Slabs %10s %10s %10s %10s\n",
- b1, b2, b3, b4);
-
- store_size(b1, avg_partial);store_size(b2, min_partial);
- store_size(b3, max_partial);store_size(b4, total_partial);
- printf("#PartSlab %10s %10s %10s %10s\n",
- b1, b2, b3, b4);
- store_size(b1, avg_ppart);store_size(b2, min_ppart);
- store_size(b3, max_ppart);
- store_size(b4, total_partial * 100 / total_slabs);
- printf("%%PartSlab%10s%% %10s%% %10s%% %10s%%\n",
- b1, b2, b3, b4);
-
- store_size(b1, avg_partobj);store_size(b2, min_partobj);
- store_size(b3, max_partobj);
- store_size(b4, total_partobj);
- printf("PartObjs %10s %10s %10s %10s\n",
- b1, b2, b3, b4);
-
- store_size(b1, avg_ppartobj);store_size(b2, min_ppartobj);
- store_size(b3, max_ppartobj);
- store_size(b4, total_partobj * 100 / total_objects);
- printf("%% PartObj%10s%% %10s%% %10s%% %10s%%\n",
- b1, b2, b3, b4);
-
- store_size(b1, avg_size);store_size(b2, min_size);
- store_size(b3, max_size);store_size(b4, total_size);
- printf("Memory %10s %10s %10s %10s\n",
- b1, b2, b3, b4);
-
- store_size(b1, avg_used);store_size(b2, min_used);
- store_size(b3, max_used);store_size(b4, total_used);
- printf("Used %10s %10s %10s %10s\n",
- b1, b2, b3, b4);
-
- store_size(b1, avg_waste);store_size(b2, min_waste);
- store_size(b3, max_waste);store_size(b4, total_waste);
- printf("Loss %10s %10s %10s %10s\n",
- b1, b2, b3, b4);
-
- printf("\n");
- printf("Per Object Average Min Max\n");
- printf("---------------------------------------------\n");
-
- store_size(b1, avg_memobj);store_size(b2, min_memobj);
- store_size(b3, max_memobj);
- printf("Memory %10s %10s %10s\n",
- b1, b2, b3);
- store_size(b1, avg_objsize);store_size(b2, min_objsize);
- store_size(b3, max_objsize);
- printf("User %10s %10s %10s\n",
- b1, b2, b3);
-
- store_size(b1, avg_objwaste);store_size(b2, min_objwaste);
- store_size(b3, max_objwaste);
- printf("Loss %10s %10s %10s\n",
- b1, b2, b3);
-}
-
-static void sort_slabs(void)
-{
- struct slabinfo *s1,*s2;
-
- for (s1 = slabinfo; s1 < slabinfo + slabs; s1++) {
- for (s2 = s1 + 1; s2 < slabinfo + slabs; s2++) {
- int result;
-
- if (sort_size)
- result = slab_size(s1) < slab_size(s2);
- else if (sort_active)
- result = slab_activity(s1) < slab_activity(s2);
- else
- result = strcasecmp(s1->name, s2->name);
-
- if (show_inverted)
- result = -result;
-
- if (result > 0) {
- struct slabinfo t;
-
- memcpy(&t, s1, sizeof(struct slabinfo));
- memcpy(s1, s2, sizeof(struct slabinfo));
- memcpy(s2, &t, sizeof(struct slabinfo));
- }
- }
- }
-}
-
-static void sort_aliases(void)
-{
- struct aliasinfo *a1,*a2;
-
- for (a1 = aliasinfo; a1 < aliasinfo + aliases; a1++) {
- for (a2 = a1 + 1; a2 < aliasinfo + aliases; a2++) {
- char *n1, *n2;
-
- n1 = a1->name;
- n2 = a2->name;
- if (show_alias && !show_inverted) {
- n1 = a1->ref;
- n2 = a2->ref;
- }
- if (strcasecmp(n1, n2) > 0) {
- struct aliasinfo t;
-
- memcpy(&t, a1, sizeof(struct aliasinfo));
- memcpy(a1, a2, sizeof(struct aliasinfo));
- memcpy(a2, &t, sizeof(struct aliasinfo));
- }
- }
- }
-}
-
-static void link_slabs(void)
-{
- struct aliasinfo *a;
- struct slabinfo *s;
-
- for (a = aliasinfo; a < aliasinfo + aliases; a++) {
-
- for (s = slabinfo; s < slabinfo + slabs; s++)
- if (strcmp(a->ref, s->name) == 0) {
- a->slab = s;
- s->refs++;
- break;
- }
- if (s == slabinfo + slabs)
- fatal("Unresolved alias %s\n", a->ref);
- }
-}
-
-static void alias(void)
-{
- struct aliasinfo *a;
- char *active = NULL;
-
- sort_aliases();
- link_slabs();
-
- for(a = aliasinfo; a < aliasinfo + aliases; a++) {
-
- if (!show_single_ref && a->slab->refs == 1)
- continue;
-
- if (!show_inverted) {
- if (active) {
- if (strcmp(a->slab->name, active) == 0) {
- printf(" %s", a->name);
- continue;
- }
- }
- printf("\n%-12s <- %s", a->slab->name, a->name);
- active = a->slab->name;
- }
- else
- printf("%-20s -> %s\n", a->name, a->slab->name);
- }
- if (active)
- printf("\n");
-}
-
-
-static void rename_slabs(void)
-{
- struct slabinfo *s;
- struct aliasinfo *a;
-
- for (s = slabinfo; s < slabinfo + slabs; s++) {
- if (*s->name != ':')
- continue;
-
- if (s->refs > 1 && !show_first_alias)
- continue;
-
- a = find_one_alias(s);
-
- if (a)
- s->name = a->name;
- else {
- s->name = "*";
- actual_slabs--;
- }
- }
-}
-
-static int slab_mismatch(char *slab)
-{
- return regexec(&pattern, slab, 0, NULL, 0);
-}
-
-static void read_slab_dir(void)
-{
- DIR *dir;
- struct dirent *de;
- struct slabinfo *slab = slabinfo;
- struct aliasinfo *alias = aliasinfo;
- char *p;
- char *t;
- int count;
-
- if (chdir("/sys/kernel/slab") && chdir("/sys/slab"))
- fatal("SYSFS support for SLUB not active\n");
-
- dir = opendir(".");
- while ((de = readdir(dir))) {
- if (de->d_name[0] == '.' ||
- (de->d_name[0] != ':' && slab_mismatch(de->d_name)))
- continue;
- switch (de->d_type) {
- case DT_LNK:
- alias->name = strdup(de->d_name);
- count = readlink(de->d_name, buffer, sizeof(buffer));
-
- if (count < 0)
- fatal("Cannot read symlink %s\n", de->d_name);
-
- buffer[count] = 0;
- p = buffer + count;
- while (p > buffer && p[-1] != '/')
- p--;
- alias->ref = strdup(p);
- alias++;
- break;
- case DT_DIR:
- if (chdir(de->d_name))
- fatal("Unable to access slab %s\n", slab->name);
- slab->name = strdup(de->d_name);
- slab->alias = 0;
- slab->refs = 0;
- slab->aliases = get_obj("aliases");
- slab->align = get_obj("align");
- slab->cache_dma = get_obj("cache_dma");
- slab->cpu_slabs = get_obj("cpu_slabs");
- slab->destroy_by_rcu = get_obj("destroy_by_rcu");
- slab->hwcache_align = get_obj("hwcache_align");
- slab->object_size = get_obj("object_size");
- slab->objects = get_obj("objects");
- slab->objects_partial = get_obj("objects_partial");
- slab->objects_total = get_obj("objects_total");
- slab->objs_per_slab = get_obj("objs_per_slab");
- slab->order = get_obj("order");
- slab->partial = get_obj("partial");
- slab->partial = get_obj_and_str("partial", &t);
- decode_numa_list(slab->numa_partial, t);
- free(t);
- slab->poison = get_obj("poison");
- slab->reclaim_account = get_obj("reclaim_account");
- slab->red_zone = get_obj("red_zone");
- slab->sanity_checks = get_obj("sanity_checks");
- slab->slab_size = get_obj("slab_size");
- slab->slabs = get_obj_and_str("slabs", &t);
- decode_numa_list(slab->numa, t);
- free(t);
- slab->store_user = get_obj("store_user");
- slab->trace = get_obj("trace");
- slab->alloc_fastpath = get_obj("alloc_fastpath");
- slab->alloc_slowpath = get_obj("alloc_slowpath");
- slab->free_fastpath = get_obj("free_fastpath");
- slab->free_slowpath = get_obj("free_slowpath");
- slab->free_frozen= get_obj("free_frozen");
- slab->free_add_partial = get_obj("free_add_partial");
- slab->free_remove_partial = get_obj("free_remove_partial");
- slab->alloc_from_partial = get_obj("alloc_from_partial");
- slab->alloc_slab = get_obj("alloc_slab");
- slab->alloc_refill = get_obj("alloc_refill");
- slab->free_slab = get_obj("free_slab");
- slab->cpuslab_flush = get_obj("cpuslab_flush");
- slab->deactivate_full = get_obj("deactivate_full");
- slab->deactivate_empty = get_obj("deactivate_empty");
- slab->deactivate_to_head = get_obj("deactivate_to_head");
- slab->deactivate_to_tail = get_obj("deactivate_to_tail");
- slab->deactivate_remote_frees = get_obj("deactivate_remote_frees");
- slab->order_fallback = get_obj("order_fallback");
- chdir("..");
- if (slab->name[0] == ':')
- alias_targets++;
- slab++;
- break;
- default :
- fatal("Unknown file type %lx\n", de->d_type);
- }
- }
- closedir(dir);
- slabs = slab - slabinfo;
- actual_slabs = slabs;
- aliases = alias - aliasinfo;
- if (slabs > MAX_SLABS)
- fatal("Too many slabs\n");
- if (aliases > MAX_ALIASES)
- fatal("Too many aliases\n");
-}
-
-static void output_slabs(void)
-{
- struct slabinfo *slab;
-
- for (slab = slabinfo; slab < slabinfo + slabs; slab++) {
-
- if (slab->alias)
- continue;
-
-
- if (show_numa)
- slab_numa(slab, 0);
- else if (show_track)
- show_tracking(slab);
- else if (validate)
- slab_validate(slab);
- else if (shrink)
- slab_shrink(slab);
- else if (set_debug)
- slab_debug(slab);
- else if (show_ops)
- ops(slab);
- else if (show_slab)
- slabcache(slab);
- else if (show_report)
- report(slab);
- }
-}
-
-struct option opts[] = {
- { "aliases", 0, NULL, 'a' },
- { "activity", 0, NULL, 'A' },
- { "debug", 2, NULL, 'd' },
- { "display-activity", 0, NULL, 'D' },
- { "empty", 0, NULL, 'e' },
- { "first-alias", 0, NULL, 'f' },
- { "help", 0, NULL, 'h' },
- { "inverted", 0, NULL, 'i'},
- { "numa", 0, NULL, 'n' },
- { "ops", 0, NULL, 'o' },
- { "report", 0, NULL, 'r' },
- { "shrink", 0, NULL, 's' },
- { "slabs", 0, NULL, 'l' },
- { "track", 0, NULL, 't'},
- { "validate", 0, NULL, 'v' },
- { "zero", 0, NULL, 'z' },
- { "1ref", 0, NULL, '1'},
- { NULL, 0, NULL, 0 }
-};
-
-int main(int argc, char *argv[])
-{
- int c;
- int err;
- char *pattern_source;
-
- page_size = getpagesize();
-
- while ((c = getopt_long(argc, argv, "aAd::Defhil1noprstvzTS",
- opts, NULL)) != -1)
- switch (c) {
- case '1':
- show_single_ref = 1;
- break;
- case 'a':
- show_alias = 1;
- break;
- case 'A':
- sort_active = 1;
- break;
- case 'd':
- set_debug = 1;
- if (!debug_opt_scan(optarg))
- fatal("Invalid debug option '%s'\n", optarg);
- break;
- case 'D':
- show_activity = 1;
- break;
- case 'e':
- show_empty = 1;
- break;
- case 'f':
- show_first_alias = 1;
- break;
- case 'h':
- usage();
- return 0;
- case 'i':
- show_inverted = 1;
- break;
- case 'n':
- show_numa = 1;
- break;
- case 'o':
- show_ops = 1;
- break;
- case 'r':
- show_report = 1;
- break;
- case 's':
- shrink = 1;
- break;
- case 'l':
- show_slab = 1;
- break;
- case 't':
- show_track = 1;
- break;
- case 'v':
- validate = 1;
- break;
- case 'z':
- skip_zero = 0;
- break;
- case 'T':
- show_totals = 1;
- break;
- case 'S':
- sort_size = 1;
- break;
-
- default:
- fatal("%s: Invalid option '%c'\n", argv[0], optopt);
-
- }
-
- if (!show_slab && !show_alias && !show_track && !show_report
- && !validate && !shrink && !set_debug && !show_ops)
- show_slab = 1;
-
- if (argc > optind)
- pattern_source = argv[optind];
- else
- pattern_source = ".*";
-
- err = regcomp(&pattern, pattern_source, REG_ICASE|REG_NOSUB);
- if (err)
- fatal("%s: Invalid pattern '%s' code %d\n",
- argv[0], pattern_source, err);
- read_slab_dir();
- if (show_alias)
- alias();
- else
- if (show_totals)
- totals();
- else {
- link_slabs();
- rename_slabs();
- sort_slabs();
- output_slabs();
- }
- return 0;
-}
Index: linux-2.6/tools/slub/slabinfo.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/tools/slub/slabinfo.c 2010-10-05 16:26:25.000000000 -0500
@@ -0,0 +1,1364 @@
+/*
+ * Slabinfo: Tool to get reports about slabs
+ *
+ * (C) 2007 sgi, Christoph Lameter
+ *
+ * Compile by:
+ *
+ * gcc -o slabinfo slabinfo.c
+ */
+#include <stdio.h>
+#include <stdlib.h>
+#include <sys/types.h>
+#include <dirent.h>
+#include <strings.h>
+#include <string.h>
+#include <unistd.h>
+#include <stdarg.h>
+#include <getopt.h>
+#include <regex.h>
+#include <errno.h>
+
+#define MAX_SLABS 500
+#define MAX_ALIASES 500
+#define MAX_NODES 1024
+
+struct slabinfo {
+ char *name;
+ int alias;
+ int refs;
+ int aliases, align, cache_dma, cpu_slabs, destroy_by_rcu;
+ int hwcache_align, object_size, objs_per_slab;
+ int sanity_checks, slab_size, store_user, trace;
+ int order, poison, reclaim_account, red_zone;
+ unsigned long partial, objects, slabs, objects_partial, objects_total;
+ unsigned long alloc_fastpath, alloc_slowpath;
+ unsigned long free_fastpath, free_slowpath;
+ unsigned long free_frozen, free_add_partial, free_remove_partial;
+ unsigned long alloc_from_partial, alloc_slab, free_slab, alloc_refill;
+ unsigned long cpuslab_flush, deactivate_full, deactivate_empty;
+ unsigned long deactivate_to_head, deactivate_to_tail;
+ unsigned long deactivate_remote_frees, order_fallback;
+ int numa[MAX_NODES];
+ int numa_partial[MAX_NODES];
+} slabinfo[MAX_SLABS];
+
+struct aliasinfo {
+ char *name;
+ char *ref;
+ struct slabinfo *slab;
+} aliasinfo[MAX_ALIASES];
+
+int slabs = 0;
+int actual_slabs = 0;
+int aliases = 0;
+int alias_targets = 0;
+int highest_node = 0;
+
+char buffer[4096];
+
+int show_empty = 0;
+int show_report = 0;
+int show_alias = 0;
+int show_slab = 0;
+int skip_zero = 1;
+int show_numa = 0;
+int show_track = 0;
+int show_first_alias = 0;
+int validate = 0;
+int shrink = 0;
+int show_inverted = 0;
+int show_single_ref = 0;
+int show_totals = 0;
+int sort_size = 0;
+int sort_active = 0;
+int set_debug = 0;
+int show_ops = 0;
+int show_activity = 0;
+
+/* Debug options */
+int sanity = 0;
+int redzone = 0;
+int poison = 0;
+int tracking = 0;
+int tracing = 0;
+
+int page_size;
+
+regex_t pattern;
+
+static void fatal(const char *x, ...)
+{
+ va_list ap;
+
+ va_start(ap, x);
+ vfprintf(stderr, x, ap);
+ va_end(ap);
+ exit(EXIT_FAILURE);
+}
+
+static void usage(void)
+{
+ printf("slabinfo 5/7/2007. (c) 2007 sgi.\n\n"
+ "slabinfo [-ahnpvtsz] [-d debugopts] [slab-regexp]\n"
+ "-a|--aliases Show aliases\n"
+ "-A|--activity Most active slabs first\n"
+ "-d<options>|--debug=<options> Set/Clear Debug options\n"
+ "-D|--display-active Switch line format to activity\n"
+ "-e|--empty Show empty slabs\n"
+ "-f|--first-alias Show first alias\n"
+ "-h|--help Show usage information\n"
+ "-i|--inverted Inverted list\n"
+ "-l|--slabs Show slabs\n"
+ "-n|--numa Show NUMA information\n"
+ "-o|--ops Show kmem_cache_ops\n"
+ "-s|--shrink Shrink slabs\n"
+ "-r|--report Detailed report on single slabs\n"
+ "-S|--Size Sort by size\n"
+ "-t|--tracking Show alloc/free information\n"
+ "-T|--Totals Show summary information\n"
+ "-v|--validate Validate slabs\n"
+ "-z|--zero Include empty slabs\n"
+ "-1|--1ref Single reference\n"
+ "\nValid debug options (FZPUT may be combined)\n"
+ "a / A Switch on all debug options (=FZUP)\n"
+ "- Switch off all debug options\n"
+ "f / F Sanity Checks (SLAB_DEBUG_FREE)\n"
+ "z / Z Redzoning\n"
+ "p / P Poisoning\n"
+ "u / U Tracking\n"
+ "t / T Tracing\n"
+ );
+}
+
+static unsigned long read_obj(const char *name)
+{
+ FILE *f = fopen(name, "r");
+
+ if (!f)
+ buffer[0] = 0;
+ else {
+ if (!fgets(buffer, sizeof(buffer), f))
+ buffer[0] = 0;
+ fclose(f);
+ if (buffer[strlen(buffer)] == '\n')
+ buffer[strlen(buffer)] = 0;
+ }
+ return strlen(buffer);
+}
+
+
+/*
+ * Get the contents of an attribute
+ */
+static unsigned long get_obj(const char *name)
+{
+ if (!read_obj(name))
+ return 0;
+
+ return atol(buffer);
+}
+
+static unsigned long get_obj_and_str(const char *name, char **x)
+{
+ unsigned long result = 0;
+ char *p;
+
+ *x = NULL;
+
+ if (!read_obj(name)) {
+ x = NULL;
+ return 0;
+ }
+ result = strtoul(buffer, &p, 10);
+ while (*p == ' ')
+ p++;
+ if (*p)
+ *x = strdup(p);
+ return result;
+}
+
+static void set_obj(struct slabinfo *s, const char *name, int n)
+{
+ char x[100];
+ FILE *f;
+
+ snprintf(x, 100, "%s/%s", s->name, name);
+ f = fopen(x, "w");
+ if (!f)
+ fatal("Cannot write to %s\n", x);
+
+ fprintf(f, "%d\n", n);
+ fclose(f);
+}
+
+static unsigned long read_slab_obj(struct slabinfo *s, const char *name)
+{
+ char x[100];
+ FILE *f;
+ size_t l;
+
+ snprintf(x, 100, "%s/%s", s->name, name);
+ f = fopen(x, "r");
+ if (!f) {
+ buffer[0] = 0;
+ l = 0;
+ } else {
+ l = fread(buffer, 1, sizeof(buffer), f);
+ buffer[l] = 0;
+ fclose(f);
+ }
+ return l;
+}
+
+
+/*
+ * Put a size string together
+ */
+static int store_size(char *buffer, unsigned long value)
+{
+ unsigned long divisor = 1;
+ char trailer = 0;
+ int n;
+
+ if (value > 1000000000UL) {
+ divisor = 100000000UL;
+ trailer = 'G';
+ } else if (value > 1000000UL) {
+ divisor = 100000UL;
+ trailer = 'M';
+ } else if (value > 1000UL) {
+ divisor = 100;
+ trailer = 'K';
+ }
+
+ value /= divisor;
+ n = sprintf(buffer, "%ld",value);
+ if (trailer) {
+ buffer[n] = trailer;
+ n++;
+ buffer[n] = 0;
+ }
+ if (divisor != 1) {
+ memmove(buffer + n - 2, buffer + n - 3, 4);
+ buffer[n-2] = '.';
+ n++;
+ }
+ return n;
+}
+
+static void decode_numa_list(int *numa, char *t)
+{
+ int node;
+ int nr;
+
+ memset(numa, 0, MAX_NODES * sizeof(int));
+
+ if (!t)
+ return;
+
+ while (*t == 'N') {
+ t++;
+ node = strtoul(t, &t, 10);
+ if (*t == '=') {
+ t++;
+ nr = strtoul(t, &t, 10);
+ numa[node] = nr;
+ if (node > highest_node)
+ highest_node = node;
+ }
+ while (*t == ' ')
+ t++;
+ }
+}
+
+static void slab_validate(struct slabinfo *s)
+{
+ if (strcmp(s->name, "*") == 0)
+ return;
+
+ set_obj(s, "validate", 1);
+}
+
+static void slab_shrink(struct slabinfo *s)
+{
+ if (strcmp(s->name, "*") == 0)
+ return;
+
+ set_obj(s, "shrink", 1);
+}
+
+int line = 0;
+
+static void first_line(void)
+{
+ if (show_activity)
+ printf("Name Objects Alloc Free %%Fast Fallb O\n");
+ else
+ printf("Name Objects Objsize Space "
+ "Slabs/Part/Cpu O/S O %%Fr %%Ef Flg\n");
+}
+
+/*
+ * Find the shortest alias of a slab
+ */
+static struct aliasinfo *find_one_alias(struct slabinfo *find)
+{
+ struct aliasinfo *a;
+ struct aliasinfo *best = NULL;
+
+ for(a = aliasinfo;a < aliasinfo + aliases; a++) {
+ if (a->slab == find &&
+ (!best || strlen(best->name) < strlen(a->name))) {
+ best = a;
+ if (strncmp(a->name,"kmall", 5) == 0)
+ return best;
+ }
+ }
+ return best;
+}
+
+static unsigned long slab_size(struct slabinfo *s)
+{
+ return s->slabs * (page_size << s->order);
+}
+
+static unsigned long slab_activity(struct slabinfo *s)
+{
+ return s->alloc_fastpath + s->free_fastpath +
+ s->alloc_slowpath + s->free_slowpath;
+}
+
+static void slab_numa(struct slabinfo *s, int mode)
+{
+ int node;
+
+ if (strcmp(s->name, "*") == 0)
+ return;
+
+ if (!highest_node) {
+ printf("\n%s: No NUMA information available.\n", s->name);
+ return;
+ }
+
+ if (skip_zero && !s->slabs)
+ return;
+
+ if (!line) {
+ printf("\n%-21s:", mode ? "NUMA nodes" : "Slab");
+ for(node = 0; node <= highest_node; node++)
+ printf(" %4d", node);
+ printf("\n----------------------");
+ for(node = 0; node <= highest_node; node++)
+ printf("-----");
+ printf("\n");
+ }
+ printf("%-21s ", mode ? "All slabs" : s->name);
+ for(node = 0; node <= highest_node; node++) {
+ char b[20];
+
+ store_size(b, s->numa[node]);
+ printf(" %4s", b);
+ }
+ printf("\n");
+ if (mode) {
+ printf("%-21s ", "Partial slabs");
+ for(node = 0; node <= highest_node; node++) {
+ char b[20];
+
+ store_size(b, s->numa_partial[node]);
+ printf(" %4s", b);
+ }
+ printf("\n");
+ }
+ line++;
+}
+
+static void show_tracking(struct slabinfo *s)
+{
+ printf("\n%s: Kernel object allocation\n", s->name);
+ printf("-----------------------------------------------------------------------\n");
+ if (read_slab_obj(s, "alloc_calls"))
+ printf(buffer);
+ else
+ printf("No Data\n");
+
+ printf("\n%s: Kernel object freeing\n", s->name);
+ printf("------------------------------------------------------------------------\n");
+ if (read_slab_obj(s, "free_calls"))
+ printf(buffer);
+ else
+ printf("No Data\n");
+
+}
+
+static void ops(struct slabinfo *s)
+{
+ if (strcmp(s->name, "*") == 0)
+ return;
+
+ if (read_slab_obj(s, "ops")) {
+ printf("\n%s: kmem_cache operations\n", s->name);
+ printf("--------------------------------------------\n");
+ printf(buffer);
+ } else
+ printf("\n%s has no kmem_cache operations\n", s->name);
+}
+
+static const char *onoff(int x)
+{
+ if (x)
+ return "On ";
+ return "Off";
+}
+
+static void slab_stats(struct slabinfo *s)
+{
+ unsigned long total_alloc;
+ unsigned long total_free;
+ unsigned long total;
+
+ if (!s->alloc_slab)
+ return;
+
+ total_alloc = s->alloc_fastpath + s->alloc_slowpath;
+ total_free = s->free_fastpath + s->free_slowpath;
+
+ if (!total_alloc)
+ return;
+
+ printf("\n");
+ printf("Slab Perf Counter Alloc Free %%Al %%Fr\n");
+ printf("--------------------------------------------------\n");
+ printf("Fastpath %8lu %8lu %3lu %3lu\n",
+ s->alloc_fastpath, s->free_fastpath,
+ s->alloc_fastpath * 100 / total_alloc,
+ s->free_fastpath * 100 / total_free);
+ printf("Slowpath %8lu %8lu %3lu %3lu\n",
+ total_alloc - s->alloc_fastpath, s->free_slowpath,
+ (total_alloc - s->alloc_fastpath) * 100 / total_alloc,
+ s->free_slowpath * 100 / total_free);
+ printf("Page Alloc %8lu %8lu %3lu %3lu\n",
+ s->alloc_slab, s->free_slab,
+ s->alloc_slab * 100 / total_alloc,
+ s->free_slab * 100 / total_free);
+ printf("Add partial %8lu %8lu %3lu %3lu\n",
+ s->deactivate_to_head + s->deactivate_to_tail,
+ s->free_add_partial,
+ (s->deactivate_to_head + s->deactivate_to_tail) * 100 / total_alloc,
+ s->free_add_partial * 100 / total_free);
+ printf("Remove partial %8lu %8lu %3lu %3lu\n",
+ s->alloc_from_partial, s->free_remove_partial,
+ s->alloc_from_partial * 100 / total_alloc,
+ s->free_remove_partial * 100 / total_free);
+
+ printf("RemoteObj/SlabFrozen %8lu %8lu %3lu %3lu\n",
+ s->deactivate_remote_frees, s->free_frozen,
+ s->deactivate_remote_frees * 100 / total_alloc,
+ s->free_frozen * 100 / total_free);
+
+ printf("Total %8lu %8lu\n\n", total_alloc, total_free);
+
+ if (s->cpuslab_flush)
+ printf("Flushes %8lu\n", s->cpuslab_flush);
+
+ if (s->alloc_refill)
+ printf("Refill %8lu\n", s->alloc_refill);
+
+ total = s->deactivate_full + s->deactivate_empty +
+ s->deactivate_to_head + s->deactivate_to_tail;
+
+ if (total)
+ printf("Deactivate Full=%lu(%lu%%) Empty=%lu(%lu%%) "
+ "ToHead=%lu(%lu%%) ToTail=%lu(%lu%%)\n",
+ s->deactivate_full, (s->deactivate_full * 100) / total,
+ s->deactivate_empty, (s->deactivate_empty * 100) / total,
+ s->deactivate_to_head, (s->deactivate_to_head * 100) / total,
+ s->deactivate_to_tail, (s->deactivate_to_tail * 100) / total);
+}
+
+static void report(struct slabinfo *s)
+{
+ if (strcmp(s->name, "*") == 0)
+ return;
+
+ printf("\nSlabcache: %-20s Aliases: %2d Order : %2d Objects: %lu\n",
+ s->name, s->aliases, s->order, s->objects);
+ if (s->hwcache_align)
+ printf("** Hardware cacheline aligned\n");
+ if (s->cache_dma)
+ printf("** Memory is allocated in a special DMA zone\n");
+ if (s->destroy_by_rcu)
+ printf("** Slabs are destroyed via RCU\n");
+ if (s->reclaim_account)
+ printf("** Reclaim accounting active\n");
+
+ printf("\nSizes (bytes) Slabs Debug Memory\n");
+ printf("------------------------------------------------------------------------\n");
+ printf("Object : %7d Total : %7ld Sanity Checks : %s Total: %7ld\n",
+ s->object_size, s->slabs, onoff(s->sanity_checks),
+ s->slabs * (page_size << s->order));
+ printf("SlabObj: %7d Full : %7ld Redzoning : %s Used : %7ld\n",
+ s->slab_size, s->slabs - s->partial - s->cpu_slabs,
+ onoff(s->red_zone), s->objects * s->object_size);
+ printf("SlabSiz: %7d Partial: %7ld Poisoning : %s Loss : %7ld\n",
+ page_size << s->order, s->partial, onoff(s->poison),
+ s->slabs * (page_size << s->order) - s->objects * s->object_size);
+ printf("Loss : %7d CpuSlab: %7d Tracking : %s Lalig: %7ld\n",
+ s->slab_size - s->object_size, s->cpu_slabs, onoff(s->store_user),
+ (s->slab_size - s->object_size) * s->objects);
+ printf("Align : %7d Objects: %7d Tracing : %s Lpadd: %7ld\n",
+ s->align, s->objs_per_slab, onoff(s->trace),
+ ((page_size << s->order) - s->objs_per_slab * s->slab_size) *
+ s->slabs);
+
+ ops(s);
+ show_tracking(s);
+ slab_numa(s, 1);
+ slab_stats(s);
+}
+
+static void slabcache(struct slabinfo *s)
+{
+ char size_str[20];
+ char dist_str[40];
+ char flags[20];
+ char *p = flags;
+
+ if (strcmp(s->name, "*") == 0)
+ return;
+
+ if (actual_slabs == 1) {
+ report(s);
+ return;
+ }
+
+ if (skip_zero && !show_empty && !s->slabs)
+ return;
+
+ if (show_empty && s->slabs)
+ return;
+
+ store_size(size_str, slab_size(s));
+ snprintf(dist_str, 40, "%lu/%lu/%d", s->slabs - s->cpu_slabs,
+ s->partial, s->cpu_slabs);
+
+ if (!line++)
+ first_line();
+
+ if (s->aliases)
+ *p++ = '*';
+ if (s->cache_dma)
+ *p++ = 'd';
+ if (s->hwcache_align)
+ *p++ = 'A';
+ if (s->poison)
+ *p++ = 'P';
+ if (s->reclaim_account)
+ *p++ = 'a';
+ if (s->red_zone)
+ *p++ = 'Z';
+ if (s->sanity_checks)
+ *p++ = 'F';
+ if (s->store_user)
+ *p++ = 'U';
+ if (s->trace)
+ *p++ = 'T';
+
+ *p = 0;
+ if (show_activity) {
+ unsigned long total_alloc;
+ unsigned long total_free;
+
+ total_alloc = s->alloc_fastpath + s->alloc_slowpath;
+ total_free = s->free_fastpath + s->free_slowpath;
+
+ printf("%-21s %8ld %10ld %10ld %3ld %3ld %5ld %1d\n",
+ s->name, s->objects,
+ total_alloc, total_free,
+ total_alloc ? (s->alloc_fastpath * 100 / total_alloc) : 0,
+ total_free ? (s->free_fastpath * 100 / total_free) : 0,
+ s->order_fallback, s->order);
+ }
+ else
+ printf("%-21s %8ld %7d %8s %14s %4d %1d %3ld %3ld %s\n",
+ s->name, s->objects, s->object_size, size_str, dist_str,
+ s->objs_per_slab, s->order,
+ s->slabs ? (s->partial * 100) / s->slabs : 100,
+ s->slabs ? (s->objects * s->object_size * 100) /
+ (s->slabs * (page_size << s->order)) : 100,
+ flags);
+}
+
+/*
+ * Analyze debug options. Return false if something is amiss.
+ */
+static int debug_opt_scan(char *opt)
+{
+ if (!opt || !opt[0] || strcmp(opt, "-") == 0)
+ return 1;
+
+ if (strcasecmp(opt, "a") == 0) {
+ sanity = 1;
+ poison = 1;
+ redzone = 1;
+ tracking = 1;
+ return 1;
+ }
+
+ for ( ; *opt; opt++)
+ switch (*opt) {
+ case 'F' : case 'f':
+ if (sanity)
+ return 0;
+ sanity = 1;
+ break;
+ case 'P' : case 'p':
+ if (poison)
+ return 0;
+ poison = 1;
+ break;
+
+ case 'Z' : case 'z':
+ if (redzone)
+ return 0;
+ redzone = 1;
+ break;
+
+ case 'U' : case 'u':
+ if (tracking)
+ return 0;
+ tracking = 1;
+ break;
+
+ case 'T' : case 't':
+ if (tracing)
+ return 0;
+ tracing = 1;
+ break;
+ default:
+ return 0;
+ }
+ return 1;
+}
+
+static int slab_empty(struct slabinfo *s)
+{
+ if (s->objects > 0)
+ return 0;
+
+ /*
+ * We may still have slabs even if there are no objects. Shrinking will
+ * remove them.
+ */
+ if (s->slabs != 0)
+ set_obj(s, "shrink", 1);
+
+ return 1;
+}
+
+static void slab_debug(struct slabinfo *s)
+{
+ if (strcmp(s->name, "*") == 0)
+ return;
+
+ if (sanity && !s->sanity_checks) {
+ set_obj(s, "sanity", 1);
+ }
+ if (!sanity && s->sanity_checks) {
+ if (slab_empty(s))
+ set_obj(s, "sanity", 0);
+ else
+ fprintf(stderr, "%s not empty cannot disable sanity checks\n", s->name);
+ }
+ if (redzone && !s->red_zone) {
+ if (slab_empty(s))
+ set_obj(s, "red_zone", 1);
+ else
+ fprintf(stderr, "%s not empty cannot enable redzoning\n", s->name);
+ }
+ if (!redzone && s->red_zone) {
+ if (slab_empty(s))
+ set_obj(s, "red_zone", 0);
+ else
+ fprintf(stderr, "%s not empty cannot disable redzoning\n", s->name);
+ }
+ if (poison && !s->poison) {
+ if (slab_empty(s))
+ set_obj(s, "poison", 1);
+ else
+ fprintf(stderr, "%s not empty cannot enable poisoning\n", s->name);
+ }
+ if (!poison && s->poison) {
+ if (slab_empty(s))
+ set_obj(s, "poison", 0);
+ else
+ fprintf(stderr, "%s not empty cannot disable poisoning\n", s->name);
+ }
+ if (tracking && !s->store_user) {
+ if (slab_empty(s))
+ set_obj(s, "store_user", 1);
+ else
+ fprintf(stderr, "%s not empty cannot enable tracking\n", s->name);
+ }
+ if (!tracking && s->store_user) {
+ if (slab_empty(s))
+ set_obj(s, "store_user", 0);
+ else
+ fprintf(stderr, "%s not empty cannot disable tracking\n", s->name);
+ }
+ if (tracing && !s->trace) {
+ if (slabs == 1)
+ set_obj(s, "trace", 1);
+ else
+ fprintf(stderr, "%s can only enable trace for one slab at a time\n", s->name);
+ }
+ if (!tracing && s->trace)
+ set_obj(s, "trace", 1);
+}
+
+static void totals(void)
+{
+ struct slabinfo *s;
+
+ int used_slabs = 0;
+ char b1[20], b2[20], b3[20], b4[20];
+ unsigned long long max = 1ULL << 63;
+
+ /* Object size */
+ unsigned long long min_objsize = max, max_objsize = 0, avg_objsize;
+
+ /* Number of partial slabs in a slabcache */
+ unsigned long long min_partial = max, max_partial = 0,
+ avg_partial, total_partial = 0;
+
+ /* Number of slabs in a slab cache */
+ unsigned long long min_slabs = max, max_slabs = 0,
+ avg_slabs, total_slabs = 0;
+
+ /* Size of the whole slab */
+ unsigned long long min_size = max, max_size = 0,
+ avg_size, total_size = 0;
+
+ /* Bytes used for object storage in a slab */
+ unsigned long long min_used = max, max_used = 0,
+ avg_used, total_used = 0;
+
+ /* Waste: Bytes used for alignment and padding */
+ unsigned long long min_waste = max, max_waste = 0,
+ avg_waste, total_waste = 0;
+ /* Number of objects in a slab */
+ unsigned long long min_objects = max, max_objects = 0,
+ avg_objects, total_objects = 0;
+ /* Waste per object */
+ unsigned long long min_objwaste = max,
+ max_objwaste = 0, avg_objwaste,
+ total_objwaste = 0;
+
+ /* Memory per object */
+ unsigned long long min_memobj = max,
+ max_memobj = 0, avg_memobj,
+ total_objsize = 0;
+
+ /* Percentage of partial slabs per slab */
+ unsigned long min_ppart = 100, max_ppart = 0,
+ avg_ppart, total_ppart = 0;
+
+ /* Number of objects in partial slabs */
+ unsigned long min_partobj = max, max_partobj = 0,
+ avg_partobj, total_partobj = 0;
+
+ /* Percentage of partial objects of all objects in a slab */
+ unsigned long min_ppartobj = 100, max_ppartobj = 0,
+ avg_ppartobj, total_ppartobj = 0;
+
+
+ for (s = slabinfo; s < slabinfo + slabs; s++) {
+ unsigned long long size;
+ unsigned long used;
+ unsigned long long wasted;
+ unsigned long long objwaste;
+ unsigned long percentage_partial_slabs;
+ unsigned long percentage_partial_objs;
+
+ if (!s->slabs || !s->objects)
+ continue;
+
+ used_slabs++;
+
+ size = slab_size(s);
+ used = s->objects * s->object_size;
+ wasted = size - used;
+ objwaste = s->slab_size - s->object_size;
+
+ percentage_partial_slabs = s->partial * 100 / s->slabs;
+ if (percentage_partial_slabs > 100)
+ percentage_partial_slabs = 100;
+
+ percentage_partial_objs = s->objects_partial * 100
+ / s->objects;
+
+ if (percentage_partial_objs > 100)
+ percentage_partial_objs = 100;
+
+ if (s->object_size < min_objsize)
+ min_objsize = s->object_size;
+ if (s->partial < min_partial)
+ min_partial = s->partial;
+ if (s->slabs < min_slabs)
+ min_slabs = s->slabs;
+ if (size < min_size)
+ min_size = size;
+ if (wasted < min_waste)
+ min_waste = wasted;
+ if (objwaste < min_objwaste)
+ min_objwaste = objwaste;
+ if (s->objects < min_objects)
+ min_objects = s->objects;
+ if (used < min_used)
+ min_used = used;
+ if (s->objects_partial < min_partobj)
+ min_partobj = s->objects_partial;
+ if (percentage_partial_slabs < min_ppart)
+ min_ppart = percentage_partial_slabs;
+ if (percentage_partial_objs < min_ppartobj)
+ min_ppartobj = percentage_partial_objs;
+ if (s->slab_size < min_memobj)
+ min_memobj = s->slab_size;
+
+ if (s->object_size > max_objsize)
+ max_objsize = s->object_size;
+ if (s->partial > max_partial)
+ max_partial = s->partial;
+ if (s->slabs > max_slabs)
+ max_slabs = s->slabs;
+ if (size > max_size)
+ max_size = size;
+ if (wasted > max_waste)
+ max_waste = wasted;
+ if (objwaste > max_objwaste)
+ max_objwaste = objwaste;
+ if (s->objects > max_objects)
+ max_objects = s->objects;
+ if (used > max_used)
+ max_used = used;
+ if (s->objects_partial > max_partobj)
+ max_partobj = s->objects_partial;
+ if (percentage_partial_slabs > max_ppart)
+ max_ppart = percentage_partial_slabs;
+ if (percentage_partial_objs > max_ppartobj)
+ max_ppartobj = percentage_partial_objs;
+ if (s->slab_size > max_memobj)
+ max_memobj = s->slab_size;
+
+ total_partial += s->partial;
+ total_slabs += s->slabs;
+ total_size += size;
+ total_waste += wasted;
+
+ total_objects += s->objects;
+ total_used += used;
+ total_partobj += s->objects_partial;
+ total_ppart += percentage_partial_slabs;
+ total_ppartobj += percentage_partial_objs;
+
+ total_objwaste += s->objects * objwaste;
+ total_objsize += s->objects * s->slab_size;
+ }
+
+ if (!total_objects) {
+ printf("No objects\n");
+ return;
+ }
+ if (!used_slabs) {
+ printf("No slabs\n");
+ return;
+ }
+
+ /* Per slab averages */
+ avg_partial = total_partial / used_slabs;
+ avg_slabs = total_slabs / used_slabs;
+ avg_size = total_size / used_slabs;
+ avg_waste = total_waste / used_slabs;
+
+ avg_objects = total_objects / used_slabs;
+ avg_used = total_used / used_slabs;
+ avg_partobj = total_partobj / used_slabs;
+ avg_ppart = total_ppart / used_slabs;
+ avg_ppartobj = total_ppartobj / used_slabs;
+
+ /* Per object object sizes */
+ avg_objsize = total_used / total_objects;
+ avg_objwaste = total_objwaste / total_objects;
+ avg_partobj = total_partobj * 100 / total_objects;
+ avg_memobj = total_objsize / total_objects;
+
+ printf("Slabcache Totals\n");
+ printf("----------------\n");
+ printf("Slabcaches : %3d Aliases : %3d->%-3d Active: %3d\n",
+ slabs, aliases, alias_targets, used_slabs);
+
+ store_size(b1, total_size);store_size(b2, total_waste);
+ store_size(b3, total_waste * 100 / total_used);
+ printf("Memory used: %6s # Loss : %6s MRatio:%6s%%\n", b1, b2, b3);
+
+ store_size(b1, total_objects);store_size(b2, total_partobj);
+ store_size(b3, total_partobj * 100 / total_objects);
+ printf("# Objects : %6s # PartObj: %6s ORatio:%6s%%\n", b1, b2, b3);
+
+ printf("\n");
+ printf("Per Cache Average Min Max Total\n");
+ printf("---------------------------------------------------------\n");
+
+ store_size(b1, avg_objects);store_size(b2, min_objects);
+ store_size(b3, max_objects);store_size(b4, total_objects);
+ printf("#Objects %10s %10s %10s %10s\n",
+ b1, b2, b3, b4);
+
+ store_size(b1, avg_slabs);store_size(b2, min_slabs);
+ store_size(b3, max_slabs);store_size(b4, total_slabs);
+ printf("#Slabs %10s %10s %10s %10s\n",
+ b1, b2, b3, b4);
+
+ store_size(b1, avg_partial);store_size(b2, min_partial);
+ store_size(b3, max_partial);store_size(b4, total_partial);
+ printf("#PartSlab %10s %10s %10s %10s\n",
+ b1, b2, b3, b4);
+ store_size(b1, avg_ppart);store_size(b2, min_ppart);
+ store_size(b3, max_ppart);
+ store_size(b4, total_partial * 100 / total_slabs);
+ printf("%%PartSlab%10s%% %10s%% %10s%% %10s%%\n",
+ b1, b2, b3, b4);
+
+ store_size(b1, avg_partobj);store_size(b2, min_partobj);
+ store_size(b3, max_partobj);
+ store_size(b4, total_partobj);
+ printf("PartObjs %10s %10s %10s %10s\n",
+ b1, b2, b3, b4);
+
+ store_size(b1, avg_ppartobj);store_size(b2, min_ppartobj);
+ store_size(b3, max_ppartobj);
+ store_size(b4, total_partobj * 100 / total_objects);
+ printf("%% PartObj%10s%% %10s%% %10s%% %10s%%\n",
+ b1, b2, b3, b4);
+
+ store_size(b1, avg_size);store_size(b2, min_size);
+ store_size(b3, max_size);store_size(b4, total_size);
+ printf("Memory %10s %10s %10s %10s\n",
+ b1, b2, b3, b4);
+
+ store_size(b1, avg_used);store_size(b2, min_used);
+ store_size(b3, max_used);store_size(b4, total_used);
+ printf("Used %10s %10s %10s %10s\n",
+ b1, b2, b3, b4);
+
+ store_size(b1, avg_waste);store_size(b2, min_waste);
+ store_size(b3, max_waste);store_size(b4, total_waste);
+ printf("Loss %10s %10s %10s %10s\n",
+ b1, b2, b3, b4);
+
+ printf("\n");
+ printf("Per Object Average Min Max\n");
+ printf("---------------------------------------------\n");
+
+ store_size(b1, avg_memobj);store_size(b2, min_memobj);
+ store_size(b3, max_memobj);
+ printf("Memory %10s %10s %10s\n",
+ b1, b2, b3);
+ store_size(b1, avg_objsize);store_size(b2, min_objsize);
+ store_size(b3, max_objsize);
+ printf("User %10s %10s %10s\n",
+ b1, b2, b3);
+
+ store_size(b1, avg_objwaste);store_size(b2, min_objwaste);
+ store_size(b3, max_objwaste);
+ printf("Loss %10s %10s %10s\n",
+ b1, b2, b3);
+}
+
+static void sort_slabs(void)
+{
+ struct slabinfo *s1,*s2;
+
+ for (s1 = slabinfo; s1 < slabinfo + slabs; s1++) {
+ for (s2 = s1 + 1; s2 < slabinfo + slabs; s2++) {
+ int result;
+
+ if (sort_size)
+ result = slab_size(s1) < slab_size(s2);
+ else if (sort_active)
+ result = slab_activity(s1) < slab_activity(s2);
+ else
+ result = strcasecmp(s1->name, s2->name);
+
+ if (show_inverted)
+ result = -result;
+
+ if (result > 0) {
+ struct slabinfo t;
+
+ memcpy(&t, s1, sizeof(struct slabinfo));
+ memcpy(s1, s2, sizeof(struct slabinfo));
+ memcpy(s2, &t, sizeof(struct slabinfo));
+ }
+ }
+ }
+}
+
+static void sort_aliases(void)
+{
+ struct aliasinfo *a1,*a2;
+
+ for (a1 = aliasinfo; a1 < aliasinfo + aliases; a1++) {
+ for (a2 = a1 + 1; a2 < aliasinfo + aliases; a2++) {
+ char *n1, *n2;
+
+ n1 = a1->name;
+ n2 = a2->name;
+ if (show_alias && !show_inverted) {
+ n1 = a1->ref;
+ n2 = a2->ref;
+ }
+ if (strcasecmp(n1, n2) > 0) {
+ struct aliasinfo t;
+
+ memcpy(&t, a1, sizeof(struct aliasinfo));
+ memcpy(a1, a2, sizeof(struct aliasinfo));
+ memcpy(a2, &t, sizeof(struct aliasinfo));
+ }
+ }
+ }
+}
+
+static void link_slabs(void)
+{
+ struct aliasinfo *a;
+ struct slabinfo *s;
+
+ for (a = aliasinfo; a < aliasinfo + aliases; a++) {
+
+ for (s = slabinfo; s < slabinfo + slabs; s++)
+ if (strcmp(a->ref, s->name) == 0) {
+ a->slab = s;
+ s->refs++;
+ break;
+ }
+ if (s == slabinfo + slabs)
+ fatal("Unresolved alias %s\n", a->ref);
+ }
+}
+
+static void alias(void)
+{
+ struct aliasinfo *a;
+ char *active = NULL;
+
+ sort_aliases();
+ link_slabs();
+
+ for(a = aliasinfo; a < aliasinfo + aliases; a++) {
+
+ if (!show_single_ref && a->slab->refs == 1)
+ continue;
+
+ if (!show_inverted) {
+ if (active) {
+ if (strcmp(a->slab->name, active) == 0) {
+ printf(" %s", a->name);
+ continue;
+ }
+ }
+ printf("\n%-12s <- %s", a->slab->name, a->name);
+ active = a->slab->name;
+ }
+ else
+ printf("%-20s -> %s\n", a->name, a->slab->name);
+ }
+ if (active)
+ printf("\n");
+}
+
+
+static void rename_slabs(void)
+{
+ struct slabinfo *s;
+ struct aliasinfo *a;
+
+ for (s = slabinfo; s < slabinfo + slabs; s++) {
+ if (*s->name != ':')
+ continue;
+
+ if (s->refs > 1 && !show_first_alias)
+ continue;
+
+ a = find_one_alias(s);
+
+ if (a)
+ s->name = a->name;
+ else {
+ s->name = "*";
+ actual_slabs--;
+ }
+ }
+}
+
+static int slab_mismatch(char *slab)
+{
+ return regexec(&pattern, slab, 0, NULL, 0);
+}
+
+static void read_slab_dir(void)
+{
+ DIR *dir;
+ struct dirent *de;
+ struct slabinfo *slab = slabinfo;
+ struct aliasinfo *alias = aliasinfo;
+ char *p;
+ char *t;
+ int count;
+
+ if (chdir("/sys/kernel/slab") && chdir("/sys/slab"))
+ fatal("SYSFS support for SLUB not active\n");
+
+ dir = opendir(".");
+ while ((de = readdir(dir))) {
+ if (de->d_name[0] == '.' ||
+ (de->d_name[0] != ':' && slab_mismatch(de->d_name)))
+ continue;
+ switch (de->d_type) {
+ case DT_LNK:
+ alias->name = strdup(de->d_name);
+ count = readlink(de->d_name, buffer, sizeof(buffer));
+
+ if (count < 0)
+ fatal("Cannot read symlink %s\n", de->d_name);
+
+ buffer[count] = 0;
+ p = buffer + count;
+ while (p > buffer && p[-1] != '/')
+ p--;
+ alias->ref = strdup(p);
+ alias++;
+ break;
+ case DT_DIR:
+ if (chdir(de->d_name))
+ fatal("Unable to access slab %s\n", slab->name);
+ slab->name = strdup(de->d_name);
+ slab->alias = 0;
+ slab->refs = 0;
+ slab->aliases = get_obj("aliases");
+ slab->align = get_obj("align");
+ slab->cache_dma = get_obj("cache_dma");
+ slab->cpu_slabs = get_obj("cpu_slabs");
+ slab->destroy_by_rcu = get_obj("destroy_by_rcu");
+ slab->hwcache_align = get_obj("hwcache_align");
+ slab->object_size = get_obj("object_size");
+ slab->objects = get_obj("objects");
+ slab->objects_partial = get_obj("objects_partial");
+ slab->objects_total = get_obj("objects_total");
+ slab->objs_per_slab = get_obj("objs_per_slab");
+ slab->order = get_obj("order");
+ slab->partial = get_obj("partial");
+ slab->partial = get_obj_and_str("partial", &t);
+ decode_numa_list(slab->numa_partial, t);
+ free(t);
+ slab->poison = get_obj("poison");
+ slab->reclaim_account = get_obj("reclaim_account");
+ slab->red_zone = get_obj("red_zone");
+ slab->sanity_checks = get_obj("sanity_checks");
+ slab->slab_size = get_obj("slab_size");
+ slab->slabs = get_obj_and_str("slabs", &t);
+ decode_numa_list(slab->numa, t);
+ free(t);
+ slab->store_user = get_obj("store_user");
+ slab->trace = get_obj("trace");
+ slab->alloc_fastpath = get_obj("alloc_fastpath");
+ slab->alloc_slowpath = get_obj("alloc_slowpath");
+ slab->free_fastpath = get_obj("free_fastpath");
+ slab->free_slowpath = get_obj("free_slowpath");
+ slab->free_frozen= get_obj("free_frozen");
+ slab->free_add_partial = get_obj("free_add_partial");
+ slab->free_remove_partial = get_obj("free_remove_partial");
+ slab->alloc_from_partial = get_obj("alloc_from_partial");
+ slab->alloc_slab = get_obj("alloc_slab");
+ slab->alloc_refill = get_obj("alloc_refill");
+ slab->free_slab = get_obj("free_slab");
+ slab->cpuslab_flush = get_obj("cpuslab_flush");
+ slab->deactivate_full = get_obj("deactivate_full");
+ slab->deactivate_empty = get_obj("deactivate_empty");
+ slab->deactivate_to_head = get_obj("deactivate_to_head");
+ slab->deactivate_to_tail = get_obj("deactivate_to_tail");
+ slab->deactivate_remote_frees = get_obj("deactivate_remote_frees");
+ slab->order_fallback = get_obj("order_fallback");
+ chdir("..");
+ if (slab->name[0] == ':')
+ alias_targets++;
+ slab++;
+ break;
+ default :
+ fatal("Unknown file type %lx\n", de->d_type);
+ }
+ }
+ closedir(dir);
+ slabs = slab - slabinfo;
+ actual_slabs = slabs;
+ aliases = alias - aliasinfo;
+ if (slabs > MAX_SLABS)
+ fatal("Too many slabs\n");
+ if (aliases > MAX_ALIASES)
+ fatal("Too many aliases\n");
+}
+
+static void output_slabs(void)
+{
+ struct slabinfo *slab;
+
+ for (slab = slabinfo; slab < slabinfo + slabs; slab++) {
+
+ if (slab->alias)
+ continue;
+
+
+ if (show_numa)
+ slab_numa(slab, 0);
+ else if (show_track)
+ show_tracking(slab);
+ else if (validate)
+ slab_validate(slab);
+ else if (shrink)
+ slab_shrink(slab);
+ else if (set_debug)
+ slab_debug(slab);
+ else if (show_ops)
+ ops(slab);
+ else if (show_slab)
+ slabcache(slab);
+ else if (show_report)
+ report(slab);
+ }
+}
+
+struct option opts[] = {
+ { "aliases", 0, NULL, 'a' },
+ { "activity", 0, NULL, 'A' },
+ { "debug", 2, NULL, 'd' },
+ { "display-activity", 0, NULL, 'D' },
+ { "empty", 0, NULL, 'e' },
+ { "first-alias", 0, NULL, 'f' },
+ { "help", 0, NULL, 'h' },
+ { "inverted", 0, NULL, 'i'},
+ { "numa", 0, NULL, 'n' },
+ { "ops", 0, NULL, 'o' },
+ { "report", 0, NULL, 'r' },
+ { "shrink", 0, NULL, 's' },
+ { "slabs", 0, NULL, 'l' },
+ { "track", 0, NULL, 't'},
+ { "validate", 0, NULL, 'v' },
+ { "zero", 0, NULL, 'z' },
+ { "1ref", 0, NULL, '1'},
+ { NULL, 0, NULL, 0 }
+};
+
+int main(int argc, char *argv[])
+{
+ int c;
+ int err;
+ char *pattern_source;
+
+ page_size = getpagesize();
+
+ while ((c = getopt_long(argc, argv, "aAd::Defhil1noprstvzTS",
+ opts, NULL)) != -1)
+ switch (c) {
+ case '1':
+ show_single_ref = 1;
+ break;
+ case 'a':
+ show_alias = 1;
+ break;
+ case 'A':
+ sort_active = 1;
+ break;
+ case 'd':
+ set_debug = 1;
+ if (!debug_opt_scan(optarg))
+ fatal("Invalid debug option '%s'\n", optarg);
+ break;
+ case 'D':
+ show_activity = 1;
+ break;
+ case 'e':
+ show_empty = 1;
+ break;
+ case 'f':
+ show_first_alias = 1;
+ break;
+ case 'h':
+ usage();
+ return 0;
+ case 'i':
+ show_inverted = 1;
+ break;
+ case 'n':
+ show_numa = 1;
+ break;
+ case 'o':
+ show_ops = 1;
+ break;
+ case 'r':
+ show_report = 1;
+ break;
+ case 's':
+ shrink = 1;
+ break;
+ case 'l':
+ show_slab = 1;
+ break;
+ case 't':
+ show_track = 1;
+ break;
+ case 'v':
+ validate = 1;
+ break;
+ case 'z':
+ skip_zero = 0;
+ break;
+ case 'T':
+ show_totals = 1;
+ break;
+ case 'S':
+ sort_size = 1;
+ break;
+
+ default:
+ fatal("%s: Invalid option '%c'\n", argv[0], optopt);
+
+ }
+
+ if (!show_slab && !show_alias && !show_track && !show_report
+ && !validate && !shrink && !set_debug && !show_ops)
+ show_slab = 1;
+
+ if (argc > optind)
+ pattern_source = argv[optind];
+ else
+ pattern_source = ".*";
+
+ err = regcomp(&pattern, pattern_source, REG_ICASE|REG_NOSUB);
+ if (err)
+ fatal("%s: Invalid pattern '%s' code %d\n",
+ argv[0], pattern_source, err);
+ read_slab_dir();
+ if (show_alias)
+ alias();
+ else
+ if (show_totals)
+ totals();
+ else {
+ link_slabs();
+ rename_slabs();
+ sort_slabs();
+ output_slabs();
+ }
+ return 0;
+}

Christoph Lameter

2010-10-06 16:21:53 UTC

Modify the slabinfo tool to report the queueing statistics

Signed-off-by: Christoph Lameter <***@linux.com>

---
tools/slub/slabinfo.c | 120 +++++++++++++++++++++++---------------------------
1 file changed, 57 insertions(+), 63 deletions(-)

Index: linux-2.6/tools/slub/slabinfo.c
===================================================================
--- linux-2.6.orig/tools/slub/slabinfo.c 2010-10-05 16:26:48.000000000 -0500
+++ linux-2.6/tools/slub/slabinfo.c 2010-10-06 11:17:40.000000000 -0500
@@ -27,18 +27,19 @@ struct slabinfo {
char *name;
int alias;
int refs;
- int aliases, align, cache_dma, cpu_slabs, destroy_by_rcu;
+ int aliases, align, cache_dma, destroy_by_rcu;
int hwcache_align, object_size, objs_per_slab;
int sanity_checks, slab_size, store_user, trace;
int order, poison, reclaim_account, red_zone;
unsigned long partial, objects, slabs, objects_partial, objects_total;
- unsigned long alloc_fastpath, alloc_slowpath;
- unsigned long free_fastpath, free_slowpath;
- unsigned long free_frozen, free_add_partial, free_remove_partial;
- unsigned long alloc_from_partial, alloc_slab, free_slab, alloc_refill;
- unsigned long cpuslab_flush, deactivate_full, deactivate_empty;
- unsigned long deactivate_to_head, deactivate_to_tail;
- unsigned long deactivate_remote_frees, order_fallback;
+ unsigned long alloc_fastpath, alloc_shared, alloc_alien, alloc_alien_slow;
+ unsigned long alloc_direct, alloc_slowpath;
+ unsigned long free_fastpath, free_shared, free_alien, free_alien_slow;
+ unsigned long free_direct, free_slowpath;
+ unsigned long free_add_partial, free_remove_partial;
+ unsigned long alloc_from_partial, alloc_remove_partial, alloc_free_partial;
+ unsigned long alloc_slab, free_slab;
+ unsigned long order_fallback, queue_flush;
int numa[MAX_NODES];
int numa_partial[MAX_NODES];
} slabinfo[MAX_SLABS];
@@ -99,7 +100,7 @@ static void fatal(const char *x, ...)

static void usage(void)
{
- printf("slabinfo 5/7/2007. (c) 2007 sgi.\n\n"
+ printf("slabinfo 10/10/2010. (c) 2010 sgi/linux foundation.\n\n"
"slabinfo [-ahnpvtsz] [-d debugopts] [slab-regexp]\n"
"-a|--aliases Show aliases\n"
"-A|--activity Most active slabs first\n"
@@ -376,20 +377,17 @@ static void slab_numa(struct slabinfo *s

static void show_tracking(struct slabinfo *s)
{
- printf("\n%s: Kernel object allocation\n", s->name);
- printf("-----------------------------------------------------------------------\n");
- if (read_slab_obj(s, "alloc_calls"))
- printf(buffer);
- else
- printf("No Data\n");
-
- printf("\n%s: Kernel object freeing\n", s->name);
- printf("------------------------------------------------------------------------\n");
- if (read_slab_obj(s, "free_calls"))
- printf(buffer);
- else
- printf("No Data\n");
+ if (read_slab_obj(s, "alloc_calls")) {
+ printf("\n%s: Kernel object allocation\n", s->name);
+ printf("-----------------------------------------------------------------------\n");
+ printf("%s", buffer);
+ }

+ if (read_slab_obj(s, "free_calls")) {
+ printf("\n%s: Kernel object freeing\n", s->name);
+ printf("------------------------------------------------------------------------\n");
+ printf("%s", buffer);
+ }
}

static void ops(struct slabinfo *s)
@@ -400,7 +398,7 @@ static void ops(struct slabinfo *s)
if (read_slab_obj(s, "ops")) {
printf("\n%s: kmem_cache operations\n", s->name);
printf("--------------------------------------------\n");
- printf(buffer);
+ printf("%s", buffer);
} else
printf("\n%s has no kmem_cache operations\n", s->name);
}
@@ -421,8 +419,10 @@ static void slab_stats(struct slabinfo *
if (!s->alloc_slab)
return;

- total_alloc = s->alloc_fastpath + s->alloc_slowpath;
- total_free = s->free_fastpath + s->free_slowpath;
+ total_alloc = s->alloc_fastpath + s->alloc_shared + s->alloc_alien
+ + s->alloc_alien_slow + s->alloc_slowpath + s->alloc_direct;
+ total_free = s->free_fastpath + s->free_shared + s->free_alien
+ + s->free_alien_slow + s->free_slowpath + s->free_direct;

if (!total_alloc)
return;
@@ -434,47 +434,44 @@ static void slab_stats(struct slabinfo *
s->alloc_fastpath, s->free_fastpath,
s->alloc_fastpath * 100 / total_alloc,
s->free_fastpath * 100 / total_free);
+ printf("Shared Cache %8lu %8lu %3lu %3lu\n",
+ s->alloc_shared, s->free_shared,
+ s->alloc_shared * 100 / total_alloc,
+ s->free_shared * 100 / total_free);
+ printf("Alien Cache %8lu %8lu %3lu %3lu\n",
+ s->alloc_alien, s->free_alien,
+ s->alloc_alien * 100 / total_alloc,
+ s->free_alien * 100 / total_free);
printf("Slowpath %8lu %8lu %3lu %3lu\n",
total_alloc - s->alloc_fastpath, s->free_slowpath,
(total_alloc - s->alloc_fastpath) * 100 / total_alloc,
s->free_slowpath * 100 / total_free);
+ printf("Alien Cache Slow %8lu %8lu %3lu %3lu\n",
+ s->alloc_alien_slow, s->free_alien_slow,
+ s->alloc_alien_slow * 100 / total_alloc,
+ s->free_alien_slow * 100 / total_free);
+ printf("Direct %8lu %8lu %3lu %3lu\n",
+ s->alloc_direct, s->free_direct,
+ s->alloc_direct * 100 / total_alloc,
+ s->free_direct * 100 / total_free);
printf("Page Alloc %8lu %8lu %3lu %3lu\n",
s->alloc_slab, s->free_slab,
s->alloc_slab * 100 / total_alloc,
s->free_slab * 100 / total_free);
printf("Add partial %8lu %8lu %3lu %3lu\n",
- s->deactivate_to_head + s->deactivate_to_tail,
+ s->alloc_free_partial,
s->free_add_partial,
- (s->deactivate_to_head + s->deactivate_to_tail) * 100 / total_alloc,
+ s->alloc_free_partial * 100 / total_alloc,
s->free_add_partial * 100 / total_free);
printf("Remove partial %8lu %8lu %3lu %3lu\n",
s->alloc_from_partial, s->free_remove_partial,
s->alloc_from_partial * 100 / total_alloc,
s->free_remove_partial * 100 / total_free);

- printf("RemoteObj/SlabFrozen %8lu %8lu %3lu %3lu\n",
- s->deactivate_remote_frees, s->free_frozen,
- s->deactivate_remote_frees * 100 / total_alloc,
- s->free_frozen * 100 / total_free);
-
printf("Total %8lu %8lu\n\n", total_alloc, total_free);

- if (s->cpuslab_flush)
- printf("Flushes %8lu\n", s->cpuslab_flush);
-
- if (s->alloc_refill)
- printf("Refill %8lu\n", s->alloc_refill);
-
- total = s->deactivate_full + s->deactivate_empty +
- s->deactivate_to_head + s->deactivate_to_tail;
-
- if (total)
- printf("Deactivate Full=%lu(%lu%%) Empty=%lu(%lu%%) "
- "ToHead=%lu(%lu%%) ToTail=%lu(%lu%%)\n",
- s->deactivate_full, (s->deactivate_full * 100) / total,
- s->deactivate_empty, (s->deactivate_empty * 100) / total,
- s->deactivate_to_head, (s->deactivate_to_head * 100) / total,
- s->deactivate_to_tail, (s->deactivate_to_tail * 100) / total);
+ if (s->queue_flush)
+ printf("Flushes %8lu\n", s->queue_flush);
}

static void report(struct slabinfo *s)
@@ -499,13 +496,13 @@ static void report(struct slabinfo *s)
s->object_size, s->slabs, onoff(s->sanity_checks),
s->slabs * (page_size << s->order));
printf("SlabObj: %7d Full : %7ld Redzoning : %s Used : %7ld\n",
- s->slab_size, s->slabs - s->partial - s->cpu_slabs,
+ s->slab_size, s->slabs - s->partial,
onoff(s->red_zone), s->objects * s->object_size);
printf("SlabSiz: %7d Partial: %7ld Poisoning : %s Loss : %7ld\n",
page_size << s->order, s->partial, onoff(s->poison),
s->slabs * (page_size << s->order) - s->objects * s->object_size);
- printf("Loss : %7d CpuSlab: %7d Tracking : %s Lalig: %7ld\n",
- s->slab_size - s->object_size, s->cpu_slabs, onoff(s->store_user),
+ printf("Loss : %7d Tracking : %s Lalig: %7ld\n",
+ s->slab_size - s->object_size, onoff(s->store_user),
(s->slab_size - s->object_size) * s->objects);
printf("Align : %7d Objects: %7d Tracing : %s Lpadd: %7ld\n",
s->align, s->objs_per_slab, onoff(s->trace),
@@ -540,8 +537,7 @@ static void slabcache(struct slabinfo *s
return;

store_size(size_str, slab_size(s));
- snprintf(dist_str, 40, "%lu/%lu/%d", s->slabs - s->cpu_slabs,
- s->partial, s->cpu_slabs);
+ snprintf(dist_str, 40, "%lu/%lu", s->slabs, s->partial);

if (!line++)
first_line();
@@ -1149,7 +1145,6 @@ static void read_slab_dir(void)
slab->aliases = get_obj("aliases");
slab->align = get_obj("align");
slab->cache_dma = get_obj("cache_dma");
- slab->cpu_slabs = get_obj("cpu_slabs");
slab->destroy_by_rcu = get_obj("destroy_by_rcu");
slab->hwcache_align = get_obj("hwcache_align");
slab->object_size = get_obj("object_size");
@@ -1173,22 +1168,22 @@ static void read_slab_dir(void)
slab->store_user = get_obj("store_user");
slab->trace = get_obj("trace");
slab->alloc_fastpath = get_obj("alloc_fastpath");
+ slab->alloc_shared = get_obj("alloc_shared");
+ slab->alloc_alien = get_obj("alloc_alien");
+ slab->alloc_alien_slow = get_obj("alloc_alien_slow");
slab->alloc_slowpath = get_obj("alloc_slowpath");
+ slab->alloc_direct = get_obj("alloc_direct");
slab->free_fastpath = get_obj("free_fastpath");
+ slab->free_shared = get_obj("free_shared");
+ slab->free_alien = get_obj("free_alien");
+ slab->free_alien_slow = get_obj("free_alien_slow");
slab->free_slowpath = get_obj("free_slowpath");
- slab->free_frozen= get_obj("free_frozen");
+ slab->free_direct = get_obj("free_direct");
slab->free_add_partial = get_obj("free_add_partial");
slab->free_remove_partial = get_obj("free_remove_partial");
slab->alloc_from_partial = get_obj("alloc_from_partial");
slab->alloc_slab = get_obj("alloc_slab");
- slab->alloc_refill = get_obj("alloc_refill");
slab->free_slab = get_obj("free_slab");
- slab->cpuslab_flush = get_obj("cpuslab_flush");
- slab->deactivate_full = get_obj("deactivate_full");
- slab->deactivate_empty = get_obj("deactivate_empty");
- slab->deactivate_to_head = get_obj("deactivate_to_head");
- slab->deactivate_to_tail = get_obj("deactivate_to_tail");
- slab->deactivate_remote_frees = get_obj("deactivate_remote_frees");
slab->order_fallback = get_obj("order_fallback");
chdir("..");
if (slab->name[0] == ':')
@@ -1218,7 +1213,6 @@ static void output_slabs(void)
if (slab->alias)
continue;

-
if (show_numa)
slab_numa(slab, 0);
else if (show_track)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>

Christoph Lameter

2010-10-06 20:56:13 UTC

Created a unified branch in my slab.git on kernel.org as well. Based on
Pekka's for-next branch. There was an additional conflict caused by
another merge to for-next that was fixed.

git pull git://git.kernel.org/pub/scm/linux/kernel/git/christoph/slab.git unified

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>

Christoph Lameter

2010-10-06 16:00:25 UTC

Post by Richard Kennedy
Hi Christoph,
What tree are these patches against ? I'm getting patch failures on the
main tree.

The patches are against Pekkas for-next tree.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>

Wu Fengguang

2010-10-06 12:37:53 UTC

[add CC to Alex: he is now in charge of kernel performance tests]

Post by Pekka Enberg
(Adding more people who've taken interest in slab performance in the
past to CC.)

Post by Christoph Lameter
- Lots of debugging
- Performance optimizations (more would be good)...
- Drop per slab locking in favor of per node locking for
partial lists (queuing implies freeing large amounts of objects
to per node lists of slab).
- Implement object expiration via reclaim VM logic.
The following is a release of an allocator based on SLAB
and SLUB that integrates the best approaches from both allocators. The
per cpu queuing is like in SLAB whereas much of the infrastructure
comes from SLUB.
After this patches SLUB will track the cpu cache contents
1. SLUB accurately tracks cpu caches instead of assuming that there
is only a single cpu cache per node or system.
2. SLUB object expiration is tied into the page reclaim logic. There
is no periodic cache expiration.
3. SLUB caches are dynamically configurable via the sysfs filesystem.
4. There is no per slab page metadata structure to maintain (aside
from the object bitmap that usually fits into the page struct).
5. Has all the resiliency and diagnostic features of SLUB.
The unified allocator is a merging of SLUB with some queuing concepts from
SLAB and a new way of managing objects in the slabs using bitmaps. Memory
wise this is slightly more inefficient than SLUB (due to the need to place
large bitmaps --sized a few words--in some slab pages if there are more
than BITS_PER_LONG objects in a slab) but in general does not increase space
use too much.
The SLAB scheme of not touching the object during management is adopted.
The unified allocator can efficiently free and allocate cache cold objects
without causing cache misses.
Some numbers using tcp_rr on localhost
Dell R910 128G RAM, 64 processors, 4 NUMA nodes
threads unified slub slab
64 4141798 3729037 3884939
128 4146587 3890993 4105276
192 4003063 3876570 4110971
256 3928857 3942806 4099249
320 3922623 3969042 4093283
384 3827603 4002833 4108420
448 4140345 4027251 4118534
512 4163741 4050130 4122644
576 4175666 4099934 4149355
640 4190332 4142570 4175618
704 4198779 4173177 4193657
768 4662216 4200462 4222686

Are there any stability problems left? Have you tried other benchmarks
(e.g. hackbench, sysbench)? Can we merge the series in smaller
batches? For example, if we leave out the NUMA parts in the first
stage, do we expect to see performance regressions?
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
see: http://www.linux-mm.org/ .

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>

Alex,Shi

2010-10-13 02:21:12 UTC

Post by Wu Fengguang
[add CC to Alex: he is now in charge of kernel performance tests]

I got the code from
git://git.kernel.org/pub/scm/linux/kernel/git/christoph/slab.git unified
on branch "origin/unified" and do a patch base on 36-rc7 kernel. Then I
tested the patch on our 2P/4P core2 machines and 2P NHM, 2P WSM
machines. Most of benchmark have no clear improvement or regression. The
testing benchmarks is listed here.
http://kernel-perf.sourceforge.net/about_tests.php

But the following results worth to care:
1, netperf loopback testing has a little improvement.
2, hackbench process testing has about 10% regression on both Core2
machine
3, fio testing has about 10~20% regression on mmap rand read testing. on
our raw mode JBOD that attached on NHM and WSM machine. The JBOD use
We use "numactl --interleave all " to do fio testing. one of testing
file as below:
=============
[global]
direct=0
ioengine=mmap
size=8G
bs=4k
numjobs=1
loops=5
runtime=600
group_reporting
invalidate=0
directory=/mnt/stp/fiodata
file_service_type=random:36

[job_sdb1_sub0]
startdelay=0
rw=randread
filename=data0/f1:data0/f2:data0/f3:data0/f4:data0/f5:data0/f6:data0/f7:data0/f8

[job_sdb1_sub1]
startdelay=0
rw=randread
filename=data0/f2:data0/f3:data0/f4:data0/f5:data0/f6:data0/f7:data0/f8:data0/f1

[job_sdb1_sub2]
startdelay=0
rw=randread
filename=data0/f3:data0/f4:data0/f5:data0/f6:data0/f7:data0/f8:data0/f1:data0/f2

[job_sdb1_sub3]
startdelay=0
rw=randread
filename=data0/f4:data0/f5:data0/f6:data0/f7:data0/f8:data0/f1:data0/f2:data0/f3

[job_sdb1_sub4]
startdelay=0
rw=randread
filename=data0/f5:data0/f6:data0/f7:data0/f8:data0/f1:data0/f2:data0/f3:data0/f4

[job_sdb1_sub5]
startdelay=0
rw=randread
filename=data0/f6:data0/f7:data0/f8:data0/f1:data0/f2:data0/f3:data0/f4:data0/f5

[job_sdb1_sub6]
startdelay=0
rw=randread
filename=data0/f7:data0/f8:data0/f1:data0/f2:data0/f3:data0/f4:data0/f5:data0/f6

[job_sdb1_sub7]
startdelay=0
rw=randread
filename=data0/f8:data0/f1:data0/f2:data0/f3:data0/f4:data0/f5:data0/f6:data0/f7

[job_sdb2_sub0]
startdelay=0
rw=randread
filename=data1/f1:data1/f2:data1/f3:data1/f4:data1/f5:data1/f6:data1/f7:data1/f8

[job_sdb2_sub1]
startdelay=0
rw=randread
filename=data1/f2:data1/f3:data1/f4:data1/f5:data1/f6:data1/f7:data1/f8:data1/f1

[job_sdb2_sub2]
startdelay=0
rw=randread
filename=data1/f3:data1/f4:data1/f5:data1/f6:data1/f7:data1/f8:data1/f1:data1/f2

[job_sdb2_sub3]
startdelay=0
rw=randread
filename=data1/f4:data1/f5:data1/f6:data1/f7:data1/f8:data1/f1:data1/f2:data1/f3

[job_sdb2_sub4]
startdelay=0
rw=randread
filename=data1/f5:data1/f6:data1/f7:data1/f8:data1/f1:data1/f2:data1/f3:data1/f4

[job_sdb2_sub5]
startdelay=0
rw=randread
filename=data1/f6:data1/f7:data1/f8:data1/f1:data1/f2:data1/f3:data1/f4:data1/f5

[job_sdb2_sub6]
startdelay=0
rw=randread
filename=data1/f7:data1/f8:data1/f1:data1/f2:data1/f3:data1/f4:data1/f5:data1/f6

[job_sdb2_sub7]
startdelay=0
rw=randread
filename=data1/f8:data1/f1:data1/f2:data1/f3:data1/f4:data1/f5:data1/f6:data1/f7

[job_sdc1_sub0]
startdelay=0
rw=randread
filename=data2/f1:data2/f2:data2/f3:data2/f4:data2/f5:data2/f6:data2/f7:data2/f8

[job_sdc1_sub1]
startdelay=0
rw=randread
filename=data2/f2:data2/f3:data2/f4:data2/f5:data2/f6:data2/f7:data2/f8:data2/f1

[job_sdc1_sub2]
startdelay=0
rw=randread
filename=data2/f3:data2/f4:data2/f5:data2/f6:data2/f7:data2/f8:data2/f1:data2/f2

[job_sdc1_sub3]
startdelay=0
rw=randread
filename=data2/f4:data2/f5:data2/f6:data2/f7:data2/f8:data2/f1:data2/f2:data2/f3

[job_sdc1_sub4]
startdelay=0
rw=randread
filename=data2/f5:data2/f6:data2/f7:data2/f8:data2/f1:data2/f2:data2/f3:data2/f4

[job_sdc1_sub5]
startdelay=0
rw=randread
filename=data2/f6:data2/f7:data2/f8:data2/f1:data2/f2:data2/f3:data2/f4:data2/f5

[job_sdc1_sub6]
startdelay=0
rw=randread
filename=data2/f7:data2/f8:data2/f1:data2/f2:data2/f3:data2/f4:data2/f5:data2/f6

[job_sdc1_sub7]
startdelay=0
rw=randread
filename=data2/f8:data2/f1:data2/f2:data2/f3:data2/f4:data2/f5:data2/f6:data2/f7

[job_sdc2_sub0]
startdelay=0
rw=randread
filename=data3/f1:data3/f2:data3/f3:data3/f4:data3/f5:data3/f6:data3/f7:data3/f8

[job_sdc2_sub1]
startdelay=0
rw=randread
filename=data3/f2:data3/f3:data3/f4:data3/f5:data3/f6:data3/f7:data3/f8:data3/f1

[job_sdc2_sub2]
startdelay=0
rw=randread
filename=data3/f3:data3/f4:data3/f5:data3/f6:data3/f7:data3/f8:data3/f1:data3/f2

[job_sdc2_sub3]
startdelay=0
rw=randread
filename=data3/f4:data3/f5:data3/f6:data3/f7:data3/f8:data3/f1:data3/f2:data3/f3

[job_sdc2_sub4]
startdelay=0
rw=randread
filename=data3/f5:data3/f6:data3/f7:data3/f8:data3/f1:data3/f2:data3/f3:data3/f4

[job_sdc2_sub5]
startdelay=0
rw=randread
filename=data3/f6:data3/f7:data3/f8:data3/f1:data3/f2:data3/f3:data3/f4:data3/f5

[job_sdc2_sub6]
startdelay=0
rw=randread
filename=data3/f7:data3/f8:data3/f1:data3/f2:data3/f3:data3/f4:data3/f5:data3/f6

[job_sdc2_sub7]
startdelay=0
rw=randread
filename=data3/f8:data3/f1:data3/f2:data3/f3:data3/f4:data3/f5:data3/f6:data3/f7

[job_sdd1_sub0]
startdelay=0
rw=randread
filename=data4/f1:data4/f2:data4/f3:data4/f4:data4/f5:data4/f6:data4/f7:data4/f8

[job_sdd1_sub1]
startdelay=0
rw=randread
filename=data4/f2:data4/f3:data4/f4:data4/f5:data4/f6:data4/f7:data4/f8:data4/f1

[job_sdd1_sub2]
startdelay=0
rw=randread
filename=data4/f3:data4/f4:data4/f5:data4/f6:data4/f7:data4/f8:data4/f1:data4/f2

[job_sdd1_sub3]
startdelay=0
rw=randread
filename=data4/f4:data4/f5:data4/f6:data4/f7:data4/f8:data4/f1:data4/f2:data4/f3

[job_sdd1_sub4]
startdelay=0
rw=randread
filename=data4/f5:data4/f6:data4/f7:data4/f8:data4/f1:data4/f2:data4/f3:data4/f4

[job_sdd1_sub5]
startdelay=0
rw=randread
filename=data4/f6:data4/f7:data4/f8:data4/f1:data4/f2:data4/f3:data4/f4:data4/f5

[job_sdd1_sub6]
startdelay=0
rw=randread
filename=data4/f7:data4/f8:data4/f1:data4/f2:data4/f3:data4/f4:data4/f5:data4/f6

[job_sdd1_sub7]
startdelay=0
rw=randread
filename=data4/f8:data4/f1:data4/f2:data4/f3:data4/f4:data4/f5:data4/f6:data4/f7

[job_sdd2_sub0]
startdelay=0
rw=randread
filename=data5/f1:data5/f2:data5/f3:data5/f4:data5/f5:data5/f6:data5/f7:data5/f8

[job_sdd2_sub1]
startdelay=0
rw=randread
filename=data5/f2:data5/f3:data5/f4:data5/f5:data5/f6:data5/f7:data5/f8:data5/f1

[job_sdd2_sub2]
startdelay=0
rw=randread
filename=data5/f3:data5/f4:data5/f5:data5/f6:data5/f7:data5/f8:data5/f1:data5/f2

[job_sdd2_sub3]
startdelay=0
rw=randread
filename=data5/f4:data5/f5:data5/f6:data5/f7:data5/f8:data5/f1:data5/f2:data5/f3

[job_sdd2_sub4]
startdelay=0
rw=randread
filename=data5/f5:data5/f6:data5/f7:data5/f8:data5/f1:data5/f2:data5/f3:data5/f4

[job_sdd2_sub5]
startdelay=0
rw=randread
filename=data5/f6:data5/f7:data5/f8:data5/f1:data5/f2:data5/f3:data5/f4:data5/f5

[job_sdd2_sub6]
startdelay=0
rw=randread
filename=data5/f7:data5/f8:data5/f1:data5/f2:data5/f3:data5/f4:data5/f5:data5/f6

[job_sdd2_sub7]
startdelay=0
rw=randread
filename=data5/f8:data5/f1:data5/f2:data5/f3:data5/f4:data5/f5:data5/f6:data5/f7

[job_sde1_sub0]
startdelay=0
rw=randread
filename=data6/f1:data6/f2:data6/f3:data6/f4:data6/f5:data6/f6:data6/f7:data6/f8

[job_sde1_sub1]
startdelay=0
rw=randread
filename=data6/f2:data6/f3:data6/f4:data6/f5:data6/f6:data6/f7:data6/f8:data6/f1

[job_sde1_sub2]
startdelay=0
rw=randread
filename=data6/f3:data6/f4:data6/f5:data6/f6:data6/f7:data6/f8:data6/f1:data6/f2

[job_sde1_sub3]
startdelay=0
rw=randread
filename=data6/f4:data6/f5:data6/f6:data6/f7:data6/f8:data6/f1:data6/f2:data6/f3

[job_sde1_sub4]
startdelay=0
rw=randread
filename=data6/f5:data6/f6:data6/f7:data6/f8:data6/f1:data6/f2:data6/f3:data6/f4

[job_sde1_sub5]
startdelay=0
rw=randread
filename=data6/f6:data6/f7:data6/f8:data6/f1:data6/f2:data6/f3:data6/f4:data6/f5

[job_sde1_sub6]
startdelay=0
rw=randread
filename=data6/f7:data6/f8:data6/f1:data6/f2:data6/f3:data6/f4:data6/f5:data6/f6

[job_sde1_sub7]
startdelay=0
rw=randread
filename=data6/f8:data6/f1:data6/f2:data6/f3:data6/f4:data6/f5:data6/f6:data6/f7

[job_sde2_sub0]
startdelay=0
rw=randread
filename=data7/f1:data7/f2:data7/f3:data7/f4:data7/f5:data7/f6:data7/f7:data7/f8

[job_sde2_sub1]
startdelay=0
rw=randread
filename=data7/f2:data7/f3:data7/f4:data7/f5:data7/f6:data7/f7:data7/f8:data7/f1

[job_sde2_sub2]
startdelay=0
rw=randread
filename=data7/f3:data7/f4:data7/f5:data7/f6:data7/f7:data7/f8:data7/f1:data7/f2

[job_sde2_sub3]
startdelay=0
rw=randread
filename=data7/f4:data7/f5:data7/f6:data7/f7:data7/f8:data7/f1:data7/f2:data7/f3

[job_sde2_sub4]
startdelay=0
rw=randread
filename=data7/f5:data7/f6:data7/f7:data7/f8:data7/f1:data7/f2:data7/f3:data7/f4

[job_sde2_sub5]
startdelay=0
rw=randread
filename=data7/f6:data7/f7:data7/f8:data7/f1:data7/f2:data7/f3:data7/f4:data7/f5

[job_sde2_sub6]
startdelay=0
rw=randread
filename=data7/f7:data7/f8:data7/f1:data7/f2:data7/f3:data7/f4:data7/f5:data7/f6

[job_sde2_sub7]
startdelay=0
rw=randread
filename=data7/f8:data7/f1:data7/f2:data7/f3:data7/f4:data7/f5:data7/f6:data7/f7

[job_sdf1_sub0]
startdelay=0
rw=randread
filename=data8/f1:data8/f2:data8/f3:data8/f4:data8/f5:data8/f6:data8/f7:data8/f8

[job_sdf1_sub1]
startdelay=0
rw=randread
filename=data8/f2:data8/f3:data8/f4:data8/f5:data8/f6:data8/f7:data8/f8:data8/f1

[job_sdf1_sub2]
startdelay=0
rw=randread
filename=data8/f3:data8/f4:data8/f5:data8/f6:data8/f7:data8/f8:data8/f1:data8/f2

[job_sdf1_sub3]
startdelay=0
rw=randread
filename=data8/f4:data8/f5:data8/f6:data8/f7:data8/f8:data8/f1:data8/f2:data8/f3

[job_sdf1_sub4]
startdelay=0
rw=randread
filename=data8/f5:data8/f6:data8/f7:data8/f8:data8/f1:data8/f2:data8/f3:data8/f4

[job_sdf1_sub5]
startdelay=0
rw=randread
filename=data8/f6:data8/f7:data8/f8:data8/f1:data8/f2:data8/f3:data8/f4:data8/f5

[job_sdf1_sub6]
startdelay=0
rw=randread
filename=data8/f7:data8/f8:data8/f1:data8/f2:data8/f3:data8/f4:data8/f5:data8/f6

[job_sdf1_sub7]
startdelay=0
rw=randread
filename=data8/f8:data8/f1:data8/f2:data8/f3:data8/f4:data8/f5:data8/f6:data8/f7

[job_sdf2_sub0]
startdelay=0
rw=randread
filename=data9/f1:data9/f2:data9/f3:data9/f4:data9/f5:data9/f6:data9/f7:data9/f8

[job_sdf2_sub1]
startdelay=0
rw=randread
filename=data9/f2:data9/f3:data9/f4:data9/f5:data9/f6:data9/f7:data9/f8:data9/f1

[job_sdf2_sub2]
startdelay=0
rw=randread
filename=data9/f3:data9/f4:data9/f5:data9/f6:data9/f7:data9/f8:data9/f1:data9/f2

[job_sdf2_sub3]
startdelay=0
rw=randread
filename=data9/f4:data9/f5:data9/f6:data9/f7:data9/f8:data9/f1:data9/f2:data9/f3

[job_sdf2_sub4]
startdelay=0
rw=randread
filename=data9/f5:data9/f6:data9/f7:data9/f8:data9/f1:data9/f2:data9/f3:data9/f4

[job_sdf2_sub5]
startdelay=0
rw=randread
filename=data9/f6:data9/f7:data9/f8:data9/f1:data9/f2:data9/f3:data9/f4:data9/f5

[job_sdf2_sub6]
startdelay=0
rw=randread
filename=data9/f7:data9/f8:data9/f1:data9/f2:data9/f3:data9/f4:data9/f5:data9/f6

[job_sdf2_sub7]
startdelay=0
rw=randread
filename=data9/f8:data9/f1:data9/f2:data9/f3:data9/f4:data9/f5:data9/f6:data9/f7

[job_sdg1_sub0]
startdelay=0
rw=randread
filename=data10/f1:data10/f2:data10/f3:data10/f4:data10/f5:data10/f6:data10/f7:data10/f8

[job_sdg1_sub1]
startdelay=0
rw=randread
filename=data10/f2:data10/f3:data10/f4:data10/f5:data10/f6:data10/f7:data10/f8:data10/f1

[job_sdg1_sub2]
startdelay=0
rw=randread
filename=data10/f3:data10/f4:data10/f5:data10/f6:data10/f7:data10/f8:data10/f1:data10/f2

[job_sdg1_sub3]
startdelay=0
rw=randread
filename=data10/f4:data10/f5:data10/f6:data10/f7:data10/f8:data10/f1:data10/f2:data10/f3

[job_sdg1_sub4]
startdelay=0
rw=randread
filename=data10/f5:data10/f6:data10/f7:data10/f8:data10/f1:data10/f2:data10/f3:data10/f4

[job_sdg1_sub5]
startdelay=0
rw=randread
filename=data10/f6:data10/f7:data10/f8:data10/f1:data10/f2:data10/f3:data10/f4:data10/f5

[job_sdg1_sub6]
startdelay=0
rw=randread
filename=data10/f7:data10/f8:data10/f1:data10/f2:data10/f3:data10/f4:data10/f5:data10/f6

[job_sdg1_sub7]
startdelay=0
rw=randread
filename=data10/f8:data10/f1:data10/f2:data10/f3:data10/f4:data10/f5:data10/f6:data10/f7

[job_sdg2_sub0]
startdelay=0
rw=randread
filename=data11/f1:data11/f2:data11/f3:data11/f4:data11/f5:data11/f6:data11/f7:data11/f8

[job_sdg2_sub1]
startdelay=0
rw=randread
filename=data11/f2:data11/f3:data11/f4:data11/f5:data11/f6:data11/f7:data11/f8:data11/f1

[job_sdg2_sub2]
startdelay=0
rw=randread
filename=data11/f3:data11/f4:data11/f5:data11/f6:data11/f7:data11/f8:data11/f1:data11/f2

[job_sdg2_sub3]
startdelay=0
rw=randread
filename=data11/f4:data11/f5:data11/f6:data11/f7:data11/f8:data11/f1:data11/f2:data11/f3

[job_sdg2_sub4]
startdelay=0
rw=randread
filename=data11/f5:data11/f6:data11/f7:data11/f8:data11/f1:data11/f2:data11/f3:data11/f4

[job_sdg2_sub5]
startdelay=0
rw=randread
filename=data11/f6:data11/f7:data11/f8:data11/f1:data11/f2:data11/f3:data11/f4:data11/f5

[job_sdg2_sub6]
startdelay=0
rw=randread
filename=data11/f7:data11/f8:data11/f1:data11/f2:data11/f3:data11/f4:data11/f5:data11/f6

[job_sdg2_sub7]
startdelay=0
rw=randread
filename=data11/f8:data11/f1:data11/f2:data11/f3:data11/f4:data11/f5:data11/f6:data11/f7

[job_sdh1_sub0]
startdelay=0
rw=randread
filename=data12/f1:data12/f2:data12/f3:data12/f4:data12/f5:data12/f6:data12/f7:data12/f8

[job_sdh1_sub1]
startdelay=0
rw=randread
filename=data12/f2:data12/f3:data12/f4:data12/f5:data12/f6:data12/f7:data12/f8:data12/f1

[job_sdh1_sub2]
startdelay=0
rw=randread
filename=data12/f3:data12/f4:data12/f5:data12/f6:data12/f7:data12/f8:data12/f1:data12/f2

[job_sdh1_sub3]
startdelay=0
rw=randread
filename=data12/f4:data12/f5:data12/f6:data12/f7:data12/f8:data12/f1:data12/f2:data12/f3

[job_sdh1_sub4]
startdelay=0
rw=randread
filename=data12/f5:data12/f6:data12/f7:data12/f8:data12/f1:data12/f2:data12/f3:data12/f4

[job_sdh1_sub5]
startdelay=0
rw=randread
filename=data12/f6:data12/f7:data12/f8:data12/f1:data12/f2:data12/f3:data12/f4:data12/f5

[job_sdh1_sub6]
startdelay=0
rw=randread
filename=data12/f7:data12/f8:data12/f1:data12/f2:data12/f3:data12/f4:data12/f5:data12/f6

[job_sdh1_sub7]
startdelay=0
rw=randread
filename=data12/f8:data12/f1:data12/f2:data12/f3:data12/f4:data12/f5:data12/f6:data12/f7

[job_sdh2_sub0]
startdelay=0
rw=randread
filename=data13/f1:data13/f2:data13/f3:data13/f4:data13/f5:data13/f6:data13/f7:data13/f8

[job_sdh2_sub1]
startdelay=0
rw=randread
filename=data13/f2:data13/f3:data13/f4:data13/f5:data13/f6:data13/f7:data13/f8:data13/f1

[job_sdh2_sub2]
startdelay=0
rw=randread
filename=data13/f3:data13/f4:data13/f5:data13/f6:data13/f7:data13/f8:data13/f1:data13/f2

[job_sdh2_sub3]
startdelay=0
rw=randread
filename=data13/f4:data13/f5:data13/f6:data13/f7:data13/f8:data13/f1:data13/f2:data13/f3

[job_sdh2_sub4]
startdelay=0
rw=randread
filename=data13/f5:data13/f6:data13/f7:data13/f8:data13/f1:data13/f2:data13/f3:data13/f4

[job_sdh2_sub5]
startdelay=0
rw=randread
filename=data13/f6:data13/f7:data13/f8:data13/f1:data13/f2:data13/f3:data13/f4:data13/f5

[job_sdh2_sub6]
startdelay=0
rw=randread
filename=data13/f7:data13/f8:data13/f1:data13/f2:data13/f3:data13/f4:data13/f5:data13/f6

[job_sdh2_sub7]
startdelay=0
rw=randread
filename=data13/f8:data13/f1:data13/f2:data13/f3:data13/f4:data13/f5:data13/f6:data13/f7

[job_sdi1_sub0]
startdelay=0
rw=randread
filename=data14/f1:data14/f2:data14/f3:data14/f4:data14/f5:data14/f6:data14/f7:data14/f8

[job_sdi1_sub1]
startdelay=0
rw=randread
filename=data14/f2:data14/f3:data14/f4:data14/f5:data14/f6:data14/f7:data14/f8:data14/f1

[job_sdi1_sub2]
startdelay=0
rw=randread
filename=data14/f3:data14/f4:data14/f5:data14/f6:data14/f7:data14/f8:data14/f1:data14/f2

[job_sdi1_sub3]
startdelay=0
rw=randread
filename=data14/f4:data14/f5:data14/f6:data14/f7:data14/f8:data14/f1:data14/f2:data14/f3

[job_sdi1_sub4]
startdelay=0
rw=randread
filename=data14/f5:data14/f6:data14/f7:data14/f8:data14/f1:data14/f2:data14/f3:data14/f4

[job_sdi1_sub5]
startdelay=0
rw=randread
filename=data14/f6:data14/f7:data14/f8:data14/f1:data14/f2:data14/f3:data14/f4:data14/f5

[job_sdi1_sub6]
startdelay=0
rw=randread
filename=data14/f7:data14/f8:data14/f1:data14/f2:data14/f3:data14/f4:data14/f5:data14/f6

[job_sdi1_sub7]
startdelay=0
rw=randread
filename=data14/f8:data14/f1:data14/f2:data14/f3:data14/f4:data14/f5:data14/f6:data14/f7

[job_sdi2_sub0]
startdelay=0
rw=randread
filename=data15/f1:data15/f2:data15/f3:data15/f4:data15/f5:data15/f6:data15/f7:data15/f8

[job_sdi2_sub1]
startdelay=0
rw=randread
filename=data15/f2:data15/f3:data15/f4:data15/f5:data15/f6:data15/f7:data15/f8:data15/f1

[job_sdi2_sub2]
startdelay=0
rw=randread
filename=data15/f3:data15/f4:data15/f5:data15/f6:data15/f7:data15/f8:data15/f1:data15/f2

[job_sdi2_sub3]
startdelay=0
rw=randread
filename=data15/f4:data15/f5:data15/f6:data15/f7:data15/f8:data15/f1:data15/f2:data15/f3

[job_sdi2_sub4]
startdelay=0
rw=randread
filename=data15/f5:data15/f6:data15/f7:data15/f8:data15/f1:data15/f2:data15/f3:data15/f4

[job_sdi2_sub5]
startdelay=0
rw=randread
filename=data15/f6:data15/f7:data15/f8:data15/f1:data15/f2:data15/f3:data15/f4:data15/f5

[job_sdi2_sub6]
startdelay=0
rw=randread
filename=data15/f7:data15/f8:data15/f1:data15/f2:data15/f3:data15/f4:data15/f5:data15/f6

[job_sdi2_sub7]
startdelay=0
rw=randread
filename=data15/f8:data15/f1:data15/f2:data15/f3:data15/f4:data15/f5:data15/f6:data15/f7

[job_sdj1_sub0]
startdelay=0
rw=randread
filename=data16/f1:data16/f2:data16/f3:data16/f4:data16/f5:data16/f6:data16/f7:data16/f8

[job_sdj1_sub1]
startdelay=0
rw=randread
filename=data16/f2:data16/f3:data16/f4:data16/f5:data16/f6:data16/f7:data16/f8:data16/f1

[job_sdj1_sub2]
startdelay=0
rw=randread
filename=data16/f3:data16/f4:data16/f5:data16/f6:data16/f7:data16/f8:data16/f1:data16/f2

[job_sdj1_sub3]
startdelay=0
rw=randread
filename=data16/f4:data16/f5:data16/f6:data16/f7:data16/f8:data16/f1:data16/f2:data16/f3

[job_sdj1_sub4]
startdelay=0
rw=randread
filename=data16/f5:data16/f6:data16/f7:data16/f8:data16/f1:data16/f2:data16/f3:data16/f4

[job_sdj1_sub5]
startdelay=0
rw=randread
filename=data16/f6:data16/f7:data16/f8:data16/f1:data16/f2:data16/f3:data16/f4:data16/f5

[job_sdj1_sub6]
startdelay=0
rw=randread
filename=data16/f7:data16/f8:data16/f1:data16/f2:data16/f3:data16/f4:data16/f5:data16/f6

[job_sdj1_sub7]
startdelay=0
rw=randread
filename=data16/f8:data16/f1:data16/f2:data16/f3:data16/f4:data16/f5:data16/f6:data16/f7

[job_sdj2_sub0]
startdelay=0
rw=randread
filename=data17/f1:data17/f2:data17/f3:data17/f4:data17/f5:data17/f6:data17/f7:data17/f8

[job_sdj2_sub1]
startdelay=0
rw=randread
filename=data17/f2:data17/f3:data17/f4:data17/f5:data17/f6:data17/f7:data17/f8:data17/f1

[job_sdj2_sub2]
startdelay=0
rw=randread
filename=data17/f3:data17/f4:data17/f5:data17/f6:data17/f7:data17/f8:data17/f1:data17/f2

[job_sdj2_sub3]
startdelay=0
rw=randread
filename=data17/f4:data17/f5:data17/f6:data17/f7:data17/f8:data17/f1:data17/f2:data17/f3

[job_sdj2_sub4]
startdelay=0
rw=randread
filename=data17/f5:data17/f6:data17/f7:data17/f8:data17/f1:data17/f2:data17/f3:data17/f4

[job_sdj2_sub5]
startdelay=0
rw=randread
filename=data17/f6:data17/f7:data17/f8:data17/f1:data17/f2:data17/f3:data17/f4:data17/f5

[job_sdj2_sub6]
startdelay=0
rw=randread
filename=data17/f7:data17/f8:data17/f1:data17/f2:data17/f3:data17/f4:data17/f5:data17/f6

[job_sdj2_sub7]
startdelay=0
rw=randread
filename=data17/f8:data17/f1:data17/f2:data17/f3:data17/f4:data17/f5:data17/f6:data17/f7

[job_sdk1_sub0]
startdelay=0
rw=randread
filename=data18/f1:data18/f2:data18/f3:data18/f4:data18/f5:data18/f6:data18/f7:data18/f8

[job_sdk1_sub1]
startdelay=0
rw=randread
filename=data18/f2:data18/f3:data18/f4:data18/f5:data18/f6:data18/f7:data18/f8:data18/f1

[job_sdk1_sub2]
startdelay=0
rw=randread
filename=data18/f3:data18/f4:data18/f5:data18/f6:data18/f7:data18/f8:data18/f1:data18/f2

[job_sdk1_sub3]
startdelay=0
rw=randread
filename=data18/f4:data18/f5:data18/f6:data18/f7:data18/f8:data18/f1:data18/f2:data18/f3

[job_sdk1_sub4]
startdelay=0
rw=randread
filename=data18/f5:data18/f6:data18/f7:data18/f8:data18/f1:data18/f2:data18/f3:data18/f4

[job_sdk1_sub5]
startdelay=0
rw=randread
filename=data18/f6:data18/f7:data18/f8:data18/f1:data18/f2:data18/f3:data18/f4:data18/f5

[job_sdk1_sub6]
startdelay=0
rw=randread
filename=data18/f7:data18/f8:data18/f1:data18/f2:data18/f3:data18/f4:data18/f5:data18/f6

[job_sdk1_sub7]
startdelay=0
rw=randread
filename=data18/f8:data18/f1:data18/f2:data18/f3:data18/f4:data18/f5:data18/f6:data18/f7

[job_sdk2_sub0]
startdelay=0
rw=randread
filename=data19/f1:data19/f2:data19/f3:data19/f4:data19/f5:data19/f6:data19/f7:data19/f8

[job_sdk2_sub1]
startdelay=0
rw=randread
filename=data19/f2:data19/f3:data19/f4:data19/f5:data19/f6:data19/f7:data19/f8:data19/f1

[job_sdk2_sub2]
startdelay=0
rw=randread
filename=data19/f3:data19/f4:data19/f5:data19/f6:data19/f7:data19/f8:data19/f1:data19/f2

[job_sdk2_sub3]
startdelay=0
rw=randread
filename=data19/f4:data19/f5:data19/f6:data19/f7:data19/f8:data19/f1:data19/f2:data19/f3

[job_sdk2_sub4]
startdelay=0
rw=randread
filename=data19/f5:data19/f6:data19/f7:data19/f8:data19/f1:data19/f2:data19/f3:data19/f4

[job_sdk2_sub5]
startdelay=0
rw=randread
filename=data19/f6:data19/f7:data19/f8:data19/f1:data19/f2:data19/f3:data19/f4:data19/f5

[job_sdk2_sub6]
startdelay=0
rw=randread
filename=data19/f7:data19/f8:data19/f1:data19/f2:data19/f3:data19/f4:data19/f5:data19/f6

[job_sdk2_sub7]
startdelay=0
rw=randread
filename=data19/f8:data19/f1:data19/f2:data19/f3:data19/f4:data19/f5:data19/f6:data19/f7

[job_sdl1_sub0]
startdelay=0
rw=randread
filename=data20/f1:data20/f2:data20/f3:data20/f4:data20/f5:data20/f6:data20/f7:data20/f8

[job_sdl1_sub1]
startdelay=0
rw=randread
filename=data20/f2:data20/f3:data20/f4:data20/f5:data20/f6:data20/f7:data20/f8:data20/f1

[job_sdl1_sub2]
startdelay=0
rw=randread
filename=data20/f3:data20/f4:data20/f5:data20/f6:data20/f7:data20/f8:data20/f1:data20/f2

[job_sdl1_sub3]
startdelay=0
rw=randread
filename=data20/f4:data20/f5:data20/f6:data20/f7:data20/f8:data20/f1:data20/f2:data20/f3

[job_sdl1_sub4]
startdelay=0
rw=randread
filename=data20/f5:data20/f6:data20/f7:data20/f8:data20/f1:data20/f2:data20/f3:data20/f4

[job_sdl1_sub5]
startdelay=0
rw=randread
filename=data20/f6:data20/f7:data20/f8:data20/f1:data20/f2:data20/f3:data20/f4:data20/f5

[job_sdl1_sub6]
startdelay=0
rw=randread
filename=data20/f7:data20/f8:data20/f1:data20/f2:data20/f3:data20/f4:data20/f5:data20/f6

[job_sdl1_sub7]
startdelay=0
rw=randread
filename=data20/f8:data20/f1:data20/f2:data20/f3:data20/f4:data20/f5:data20/f6:data20/f7

[job_sdl2_sub0]
startdelay=0
rw=randread
filename=data21/f1:data21/f2:data21/f3:data21/f4:data21/f5:data21/f6:data21/f7:data21/f8

[job_sdl2_sub1]
startdelay=0
rw=randread
filename=data21/f2:data21/f3:data21/f4:data21/f5:data21/f6:data21/f7:data21/f8:data21/f1

[job_sdl2_sub2]
startdelay=0
rw=randread
filename=data21/f3:data21/f4:data21/f5:data21/f6:data21/f7:data21/f8:data21/f1:data21/f2

[job_sdl2_sub3]
startdelay=0
rw=randread
filename=data21/f4:data21/f5:data21/f6:data21/f7:data21/f8:data21/f1:data21/f2:data21/f3

[job_sdl2_sub4]
startdelay=0
rw=randread
filename=data21/f5:data21/f6:data21/f7:data21/f8:data21/f1:data21/f2:data21/f3:data21/f4

[job_sdl2_sub5]
startdelay=0
rw=randread
filename=data21/f6:data21/f7:data21/f8:data21/f1:data21/f2:data21/f3:data21/f4:data21/f5

[job_sdl2_sub6]
startdelay=0
rw=randread
filename=data21/f7:data21/f8:data21/f1:data21/f2:data21/f3:data21/f4:data21/f5:data21/f6

[job_sdl2_sub7]
startdelay=0
rw=randread
filename=data21/f8:data21/f1:data21/f2:data21/f3:data21/f4:data21/f5:data21/f6:data21/f7

[job_sdm1_sub0]
startdelay=0
rw=randread
filename=data22/f1:data22/f2:data22/f3:data22/f4:data22/f5:data22/f6:data22/f7:data22/f8

[job_sdm1_sub1]
startdelay=0
rw=randread
filename=data22/f2:data22/f3:data22/f4:data22/f5:data22/f6:data22/f7:data22/f8:data22/f1

[job_sdm1_sub2]
startdelay=0
rw=randread
filename=data22/f3:data22/f4:data22/f5:data22/f6:data22/f7:data22/f8:data22/f1:data22/f2

[job_sdm1_sub3]
startdelay=0
rw=randread
filename=data22/f4:data22/f5:data22/f6:data22/f7:data22/f8:data22/f1:data22/f2:data22/f3

[job_sdm1_sub4]
startdelay=0
rw=randread
filename=data22/f5:data22/f6:data22/f7:data22/f8:data22/f1:data22/f2:data22/f3:data22/f4

[job_sdm1_sub5]
startdelay=0
rw=randread
filename=data22/f6:data22/f7:data22/f8:data22/f1:data22/f2:data22/f3:data22/f4:data22/f5

[job_sdm1_sub6]
startdelay=0
rw=randread
filename=data22/f7:data22/f8:data22/f1:data22/f2:data22/f3:data22/f4:data22/f5:data22/f6

[job_sdm1_sub7]
startdelay=0
rw=randread
filename=data22/f8:data22/f1:data22/f2:data22/f3:data22/f4:data22/f5:data22/f6:data22/f7

[job_sdm2_sub0]
startdelay=0
rw=randread
filename=data23/f1:data23/f2:data23/f3:data23/f4:data23/f5:data23/f6:data23/f7:data23/f8

[job_sdm2_sub1]
startdelay=0
rw=randread
filename=data23/f2:data23/f3:data23/f4:data23/f5:data23/f6:data23/f7:data23/f8:data23/f1

[job_sdm2_sub2]
startdelay=0
rw=randread
filename=data23/f3:data23/f4:data23/f5:data23/f6:data23/f7:data23/f8:data23/f1:data23/f2

[job_sdm2_sub3]
startdelay=0
rw=randread
filename=data23/f4:data23/f5:data23/f6:data23/f7:data23/f8:data23/f1:data23/f2:data23/f3

[job_sdm2_sub4]
startdelay=0
rw=randread
filename=data23/f5:data23/f6:data23/f7:data23/f8:data23/f1:data23/f2:data23/f3:data23/f4

[job_sdm2_sub5]
startdelay=0
rw=randread
filename=data23/f6:data23/f7:data23/f8:data23/f1:data23/f2:data23/f3:data23/f4:data23/f5

[job_sdm2_sub6]
startdelay=0
rw=randread
filename=data23/f7:data23/f8:data23/f1:data23/f2:data23/f3:data23/f4:data23/f5:data23/f6

[job_sdm2_sub7]
startdelay=0
rw=randread
filename=data23/f8:data23/f1:data23/f2:data23/f3:data23/f4:data23/f5:data23/f6:data23/f7

BTW, I save several time kernel panic in fio testing:
===================

Post by Wu Fengguang
Pid: 776, comm: kswapd0 Not tainted 2.6.36-rc7-unified #1 X8DTN/X8DTN

RIP: 0010:[<ffffffff810cc21c>] [<ffffffff810cc21c>] slab_alloc
+0x562/0x6f2
RSP: 0000:ffff88023dbebbc0 EFLAGS: 00010002
RAX: 0000000000000000 RBX: ffff88023fc02600 RCX: 0000000000000000
RDX: ffff88023e4746a0 RSI: 0000000000000046 RDI: ffffffff81d8f294
RBP: 0000000000000000 R08: 0000000000000012 R09: 0000000000000006
R10: 0000000000000000 R11: 0000000000000000 R12: ffff880002076880
R13: 0000000000000000 R14: ffff88023fc012c0 R15: 0000000000000010
FS: 0000000000000000(0000) GS:ffff880002060000(0000)
knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000375d3c000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kswapd0 (pid: 776, threadinfo ffff88023dbea000, task
ffff88023e4746a0)
0000000000000000 000000d000000e00 ffff88000000003c 0000000000000282
<0> ffff88023e4746a0 000080d03e4746a0 ffff880002076888 0000000000000286
<0> ffff88023e4746a0 0000005a3dbebfd8 ffff88023e474cb8 ffff88023fc012d0
[<ffffffff810cd97c>] ? shared_caches+0x2e/0xd6
[<ffffffff810cd4f7>] ? __kmalloc+0xb4/0x108
[<ffffffff810cd97c>] ? shared_caches+0x2e/0xd6
[<ffffffff810cda39>] ? expire_alien_caches+0x15/0x8a
[<ffffffff810c984f>] ? __kmem_cache_expire_all+0x27/0x65
[<ffffffff810cd249>] ? kmem_cache_expire_all+0x86/0x9c
[<ffffffff810a61f0>] ? balance_pgdat+0x2eb/0x4dc
[<ffffffff810a6612>] ? kswapd+0x231/0x247
[<ffffffff81055513>] ? autoremove_wake_function+0x0/0x2a
[<ffffffff810a63e1>] ? kswapd+0x0/0x247
[<ffffffff810a63e1>] ? kswapd+0x0/0x247
[<ffffffff810550c0>] ? kthread+0x7a/0x82
[<ffffffff810036d4>] ? kernel_thread_helper+0x4/0x10
[<ffffffff81055046>] ? kthread+0x0/0x82
[<ffffffff810036d0>] ? kernel_thread_helper+0x0/0x10

kswapd0: page allocation failure. order:0, mode:0xd0
Pid: 714, comm: kswapd0 Not tainted 2.6.36-rc7-unified #1
Call Trace:
[<ffffffff8109fcf4>] ? __alloc_pages_nodemask+0x63f/0x6c7
[<ffffffff8100328e>] ? apic_timer_interrupt+0xe/0x20
[<ffffffff810cc6f7>] ? new_slab+0xac/0x277
[<ffffffff810cce1e>] ? slab_alloc+0x55c/0x6e8
[<ffffffff810ce58b>] ? shared_caches+0x31/0xd9
[<ffffffff810ce110>] ? __kmalloc+0xb0/0xff
[<ffffffff810ce58b>] ? shared_caches+0x31/0xd9
[<ffffffff810ce649>] ? expire_alien_caches+0x16/0x8d
[<ffffffff810cde25>] ? kmem_cache_expire_all+0xf6/0x14d
[<ffffffff810a6aaf>] ? kswapd+0x5c2/0x7ea
[<ffffffff810556aa>] ? autoremove_wake_function+0x0/0x2e
[<ffffffff810a64ed>] ? kswapd+0x0/0x7ea
[<ffffffff81055269>] ? kthread+0x7e/0x86
[<ffffffff810036d4>] ? kernel_thread_helper+0x4/0x10
[<ffffffff810551eb>] ? kthread+0x0/0x86
[<ffffffff810036d0>] ? kernel_thread_helper+0x0/0x10
Mem-Info:
Node 0 DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
CPU 1: hi: 0, btch: 1 usd: 0
CPU 2: hi: 0, btch: 1 usd: 0
CPU 3: hi: 0, btch: 1 usd: 0
CPU 4: hi: 0, btch: 1 usd: 0
CPU 5: hi: 0, btch: 1 usd: 0
CPU 6: hi: 0, btch: 1 usd: 0
CPU 7: hi: 0, btch: 1 usd: 0
CPU 8: hi: 0, btch: 1 usd: 0
CPU 9: hi: 0, btch: 1 usd: 0
CPU 10: hi: 0, btch: 1 usd: 0
CPU 11: hi: 0, btch: 1 usd: 0
CPU 12: hi: 0, btch: 1 usd: 0
CPU 13: hi: 0, btch: 1 usd: 0
CPU 14: hi: 0, btch: 1 usd: 0
CPU 15: hi: 0, btch: 1 usd: 0
Node 0 DMA32 per-cpu:
CPU 0: hi: 186, btch: 31 usd: 0
CPU 1: hi: 186, btch: 31 usd: 0
CPU 2: hi: 186, btch: 31 usd: 0
CPU 3: hi: 186, btch: 31 usd: 0
CPU 4: hi: 186, btch: 31 usd: 0
CPU 5: hi: 186, btch: 31 usd: 0
CPU 6: hi: 186, btch: 31 usd: 0
CPU 7: hi: 186, btch: 31 usd: 0
CPU 8: hi: 186, btch: 31 usd: 0
CPU 9: hi: 186, btch: 31 usd: 0
CPU 10: hi: 186, btch: 31 usd: 0
CPU 11: hi: 186, btch: 31 usd: 0
CPU 12: hi: 186, btch: 31 usd: 0
CPU 13: hi: 186, btch: 31 usd: 0
CPU 14: hi: 186, btch: 31 usd: 0
CPU 15: hi: 186, btch: 31 usd: 0
Node 1 Normal per-cpu:
CPU 0: hi: 186, btch: 31 usd: 0
CPU 1: hi: 186, btch: 31 usd: 0
CPU 2: hi: 186, btch: 31 usd: 0
CPU 3: hi: 186, btch: 31 usd: 0
CPU 4: hi: 186, btch: 31 usd: 0
CPU 5: hi: 186, btch: 31 usd: 0
CPU 6: hi: 186, btch: 31 usd: 0
CPU 7: hi: 186, btch: 31 usd: 0
CPU 8: hi: 186, btch: 31 usd: 0
CPU 9: hi: 186, btch: 31 usd: 0
CPU 10: hi: 186, btch: 31 usd: 0
CPU 11: hi: 186, btch: 31 usd: 0
CPU 12: hi: 186, btch: 31 usd: 0
CPU 13: hi: 186, btch: 31 usd: 0
CPU 14: hi: 186, btch: 31 usd: 0
CPU 15: hi: 186, btch: 31 usd: 0
active_anon:864 inactive_anon:1237 isolated_anon:0
active_file:178 inactive_file:60 isolated_file:32
unevictable:0 dirty:0 writeback:0 unstable:0
free:9 slab_reclaimable:2410 slab_unreclaimable:1513417
mapped:1 shmem:64 pagetables:346 bounce:0
Node 0 DMA free:0kB min:24kB low:28kB high:36kB active_anon:0kB
inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB
isolated(anon):0kB isolated(file):0kB present:15700kB mlocked:0kB
dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:192kB
slab_unreclaimable:15636kB kernel_stack:0kB pagetables:0kB unstable:0kB
bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 3003 3003 3003
Node 0 DMA32 free:0kB min:4940kB low:6172kB high:7408kB
active_anon:604kB inactive_anon:3292kB active_file:0kB
inactive_file:128kB unevictable:0kB isolated(anon):0kB
isolated(file):0kB present:3075164kB mlocked:0kB dirty:0kB writeback:0kB
mapped:4kB shmem:152kB slab_reclaimable:5840kB
slab_unreclaimable:2963060kB kernel_stack:1016kB pagetables:656kB
unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0
all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 1 Normal free:24kB min:4984kB low:6228kB high:7476kB
active_anon:2852kB inactive_anon:1656kB active_file:712kB
inactive_file:112kB unevictable:0kB isolated(anon):0kB
isolated(file):128kB present:3102720kB mlocked:0kB dirty:0kB
writeback:0kB mapped:0kB shmem:104kB slab_reclaimable:3608kB
slab_unreclaimable:3074972kB kernel_stack:312kB pagetables:728kB
unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1272
all_unreclaimable? yes
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB
0*1024kB 0*2048kB 0*4096kB = 0kB
Node 0 DMA32: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB
0*1024kB 0*2048kB 0*4096kB = 0kB
Node 1 Normal: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB
0*1024kB 0*2048kB 0*4096kB = 0kB
305 total pagecache pages
0 pages in swap cache
Swap cache stats: add 0, delete 0, find 0/0
Free swap = 0kB
Total swap = 0kB
1570800 pages RAM
42204 pages reserved
509 pages shared
1527199 pages non-shared

BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<ffffffff810cce27>] slab_alloc+0x565/0x6e8
PGD 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/block/sdm/stat
CPU 2
Modules linked in: igb ixgbe mdio

Pid: 714, comm: kswapd0 Not tainted 2.6.36-rc7-unified #1 X8DTN/X8DTN
RIP: 0010:[<ffffffff810cce27>] [<ffffffff810cce27>] slab_alloc
+0x565/0x6e8
RSP: 0018:ffff8800bd377c00 EFLAGS: 00010002
RAX: 0000000000000000 RBX: ffff8800bd2f0408 RCX: 0000000000020000
RDX: ffff8800be328000 RSI: ffff8801030ff000 RDI: 000000000000005a
RBP: ffff8800bd2f0400 R08: 0000000000000012 R09: 0000000000000004
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: ffff8800bec012d0 R14: 0000000000000010 R15: ffff8800bec02600
FS: 0000000000000000(0000) GS:ffff880002080000(0000)
knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000001c53000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kswapd0 (pid: 714, threadinfo ffff8800bd376000, task
ffff8800be328000)
Stack:
ffff8800be328000 ffff8800be328618 ffff8800be328000 ffff8800bd377fd8
<0> 0000000000000000 ffffffff810ce58b ffff880000000000 000000d0810a57a3
<0> 0000000000000001 000080d002097a88 ffff8800bec012d0 ffff880002096888
Call Trace:
[<ffffffff810ce58b>] ? shared_caches+0x31/0xd9
[<ffffffff810ce110>] ? __kmalloc+0xb0/0xff
[<ffffffff810ce58b>] ? shared_caches+0x31/0xd9
[<ffffffff810ce649>] ? expire_alien_caches+0x16/0x8d
[<ffffffff810cde25>] ? kmem_cache_expire_all+0xf6/0x14d
[<ffffffff810a6aaf>] ? kswapd+0x5c2/0x7ea
[<ffffffff810556aa>] ? autoremove_wake_function+0x0/0x2e
[<ffffffff810a64ed>] ? kswapd+0x0/0x7ea
[<ffffffff81055269>] ? kthread+0x7e/0x86
[<ffffffff810036d4>] ? kernel_thread_helper+0x4/0x10
[<ffffffff810551eb>] ? kthread+0x0/0x86
[<ffffffff810036d0>] ? kernel_thread_helper+0x0/0x10
Code: 74 24 4c 41 83 e6 10 e9 46 01 00 00 45 85 f6 74 01 fb 8b 54 24 30
8b 74 24 4c 4c 89 ff e8 2d f8 ff ff 45 85 f6 49 89 c4 74 01 fa <49> 8b
04 24 65 48 8b 14 25 18 d4 00 00 48 c1 e8 3a 89 44 24 30
RIP [<ffffffff810cce27>] slab_alloc+0x565/0x6e8
RSP <ffff8800bd377c00>
CR2: 0000000000000000
---[ end trace dd9ddd336d3f686a ]---

Another panic:
Kernel panic - not syncing: Fatal exception in interrupt
Pid: 0, comm: swapper Tainted: G D 2.6.36-rc7-unified #1
Call Trace:
<IRQ> [<ffffffff816139c0>] ? panic+0x92/0x198
[<ffffffff8161701f>] ? oops_end+0x9f/0xac
[<ffffffff81021da5>] ? no_context+0x1f2/0x201
[<ffffffff8103c05d>] ? __call_console_drivers+0x64/0x75
[<ffffffff81021f60>] ? __bad_area_nosemaphore+0x1ac/0x1d0
[<ffffffff81618f1e>] ? do_page_fault+0x1cb/0x3c0
[<ffffffff81613b06>] ? printk+0x40/0x45
[<ffffffff812d2f32>] ? show_mem+0x13a/0x17c
[<ffffffff8109fcf9>] ? __alloc_pages_nodemask+0x644/0x6c7
[<ffffffff8161651f>] ? page_fault+0x1f/0x30
[<ffffffff810cce27>] ? slab_alloc+0x565/0x6e8
[<ffffffff81520390>] ? sk_prot_alloc+0x29/0xe1
[<ffffffff8153de4b>] ? sch_direct_xmit+0x80/0x182
[<ffffffff810ce7bb>] ? kmem_cache_alloc+0x21/0x75
[<ffffffff81520390>] ? sk_prot_alloc+0x29/0xe1
[<ffffffff815204cb>] ? sk_clone+0x16/0x246
[<ffffffff8154f06b>] ? inet_csk_clone+0xf/0x7f
[<ffffffff81562e12>] ? tcp_create_openreq_child+0x23/0x476
[<ffffffff815618e1>] ? tcp_v4_syn_recv_sock+0x4e/0x18e
[<ffffffff81562c92>] ? tcp_check_req+0x21e/0x37b
[<ffffffff81561035>] ? tcp_v4_do_rcv+0xf7/0x22f
[<ffffffff815615b8>] ? tcp_v4_rcv+0x44b/0x726
[<ffffffff815b2cf9>] ? packet_rcv+0x2ea/0x2fd
[<ffffffff815473bb>] ? ip_local_deliver+0xd6/0x161
[<ffffffff815472bf>] ? ip_rcv+0x487/0x4ad
[<ffffffff8152aac2>] ? netif_receive_skb+0x67/0x6d
[<ffffffff8143cba0>] ? e100_poll+0x208/0x534
[<ffffffff8152ac29>] ? net_rx_action+0x72/0x1a3
[<ffffffff810587ff>] ? hrtimer_get_next_event+0x8b/0xa2
[<ffffffff8104192d>] ? __do_softirq+0xdb/0x19e
[<ffffffff810037cc>] ? call_softirq+0x1c/0x28
[<ffffffff81004bdd>] ? do_softirq+0x31/0x63
[<ffffffff810417b9>] ? irq_exit+0x36/0x78
[<ffffffff810042dd>] ? do_IRQ+0xa7/0xbd
[<ffffffff81616313>] ? ret_from_intr+0x0/0xa
<EOI> [<ffffffff81336c02>] ? acpi_idle_enter_c1+0x8c/0xf5
[<ffffffff81336bca>] ? acpi_idle_enter_c1+0x54/0xf5
[<ffffffff815012ce>] ? cpuidle_idle_call+0xa5/0x10b
[<ffffffff81001cf9>] ? cpu_idle+0x5c/0xc9
[<ffffffff81cf2c72>] ? start_kernel+0x355/0x361
[<ffffffff81d071f5>] ? __reserve_early+0xa4/0xba
[<ffffffff81cf2388>] ? x86_64_start_kernel+0xe8/0xee

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>

Christoph Lameter

2010-10-18 18:00:46 UTC

Post by Alex,Shi
I got the code from
git://git.kernel.org/pub/scm/linux/kernel/git/christoph/slab.git unified
on branch "origin/unified" and do a patch base on 36-rc7 kernel. Then I
tested the patch on our 2P/4P core2 machines and 2P NHM, 2P WSM
machines. Most of benchmark have no clear improvement or regression. The
testing benchmarks is listed here.
http://kernel-perf.sourceforge.net/about_tests.php

Ah. Thanks. The tests needs to show a clear benefit for this to be a
viable solution. They did earlier without all the NUMA queuing on SMP.

Post by Alex,Shi
===================

Post by Wu Fengguang
Pid: 776, comm: kswapd0 Not tainted 2.6.36-rc7-unified #1 X8DTN/X8DTN

RIP: 0010:[<ffffffff810cc21c>] [<ffffffff810cc21c>] slab_alloc
+0x562/0x6f2

Cannot see the error message? I guess this is the result of a BUG_ON()?
I'll try to run that fio test first.

Post by Alex,Shi
kswapd0: page allocation failure. order:0, mode:0xd0
Pid: 714, comm: kswapd0 Not tainted 2.6.36-rc7-unified #1
[<ffffffff8109fcf4>] ? __alloc_pages_nodemask+0x63f/0x6c7
[<ffffffff8100328e>] ? apic_timer_interrupt+0xe/0x20
[<ffffffff810cc6f7>] ? new_slab+0xac/0x277
[<ffffffff810cce1e>] ? slab_alloc+0x55c/0x6e8
[<ffffffff810ce58b>] ? shared_caches+0x31/0xd9
[<ffffffff810ce110>] ? __kmalloc+0xb0/0xff
[<ffffffff810ce58b>] ? shared_caches+0x31/0xd9
[<ffffffff810ce649>] ? expire_alien_caches+0x16/0x8d
[<ffffffff810cde25>] ? kmem_cache_expire_all+0xf6/0x14d

Expiration needs to get the gfp flags from the reclaim context. And we
now have more allocations in a reclaim context.

Post by Alex,Shi
slab_unreclaimable:2963060kB kernel_stack:1016kB pagetables:656kB

3GB unreclaimable.... Memory leak.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>

Alex,Shi

2010-10-19 00:01:44 UTC

Post by Christoph Lameter

Post by Alex,Shi
I got the code from
git://git.kernel.org/pub/scm/linux/kernel/git/christoph/slab.git unified
on branch "origin/unified" and do a patch base on 36-rc7 kernel. Then I
tested the patch on our 2P/4P core2 machines and 2P NHM, 2P WSM
machines. Most of benchmark have no clear improvement or regression. The
testing benchmarks is listed here.
http://kernel-perf.sourceforge.net/about_tests.php

Ah. Thanks. The tests needs to show a clear benefit for this to be a
viable solution. They did earlier without all the NUMA queuing on SMP.

Post by Alex,Shi
===================

Post by Wu Fengguang
Pid: 776, comm: kswapd0 Not tainted 2.6.36-rc7-unified #1 X8DTN/X8DTN

RIP: 0010:[<ffffffff810cc21c>] [<ffffffff810cc21c>] slab_alloc
+0x562/0x6f2

Cannot see the error message? I guess this is the result of a BUG_ON()?
I'll try to run that fio test first.

Can not see error messages since the machine hang when any ops appear.
And the panic ops just pop up randomly, don't know how to reproduce it
now.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>

Christoph Lameter

2010-10-06 15:56:47 UTC

Post by Pekka Enberg
Are there any stability problems left? Have you tried other benchmarks
(e.g. hackbench, sysbench)? Can we merge the series in smaller
batches? For example, if we leave out the NUMA parts in the first
stage, do we expect to see performance regressions?

I have tried hackbench but the number seem to be unstable on my system.
There may be various small optimizations still left to be done.

You cannot merge this without the patches up to the patch that implements
alien caches without performance issues. If you leave out the NUMA parts
then !NUMA is of course fine.

I would suggest to merge the cleanups first for the next upstream
merge cycle and give this patchset at least a whole -next cycle before
upstream merge.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>

Mel Gorman

2010-10-13 14:14:55 UTC

Post by Christoph Lameter

Post by Pekka Enberg
Are there any stability problems left? Have you tried other benchmarks
(e.g. hackbench, sysbench)? Can we merge the series in smaller
batches? For example, if we leave out the NUMA parts in the first
stage, do we expect to see performance regressions?

I have tried hackbench but the number seem to be unstable on my system.
There may be various small optimizations still left to be done.

I still haven't reviewed this and I confess it will be some time before I
get the chance but I sent them towards some testing automation and have the
results from one machine below.

Minimally, I see the same sort of hackbench socket performance regression
as reported elsewhere (10-15% regression). Otherwise, it isn't particularly
exciting results. The machine is very basic - 2 socket, 4 cores, x86-64,
2G RAM. Macine model is an IBM BladeCenter HS20. Processor is Xeon but I'm
not sure exact what model. It appears to be from around the P4 times.

Tests were based on three kernels. I used this tarball of scripts
http://www.csn.ul.ie/~mel/projects/mmtests/mmtests-0.01-slabunified-0.01.tar.gz
. The scripts have been pulled from all over the place with varying quality
so try not cry too much if you look at scripts closely :)

This tarball is capable of multiple different types of tests and was
implemented in response to LSF/MM people asking for a suite to MM tests
"of interest". I don't think a single suite is possible without taking a
week to run a test but what is possible is to configure a bunch of tests to
answer questions about a particular series (the tarball is capable of starting
monitoring, profiling and running analysis after the fact to answer questions
related to my own recent patch series). In this case, it's configured to
run a number of benchmarks known to be sensitive to slab and page allocator
performance. Roughly speaking, to test allocators the following happens

1. Configure 2.6.36-rc7 for SLAB, no other patches, build + boot
2. ./run-slabunified.sh slab-vanilla
3. Configure 2.6.36-rc7 for SLUB, no other patches, build + boot
4. ./run-slabunified.sh slub-vanilla
5. Configure 2.6.36-rc7 for SLUB, for-next and Christophs patches
applied, build + boot
6. ./run-slabunified.sh unified-v4r1 (version 4 of patches, release
candidate 1)
7. cd work/log
8. ../../compare-kernels.sh

This should be enough to run a cross-section of tests against this
series.

Christoph, in particular while it tests netperf, it is not binding to any
particular CPU (although it can), server and client are running on the local
machine (which has particular performance characterisitcs of its own) and
the tests is STREAM, not RR so the tarball is not a replacement for more
targetting testing or workload-specific testing. Still, it should catch
some of the common snags before getting into specific workloads without
taking an extraordinary amount of time to complete. sysbench might take a
long time for many-core machines, limit the number of threads it tests with
OLTP_MAX_THREADS in the config file.

I'm not going to go into details of how the scripts work - it's as-is
only. That said, most test parameters are specified in the top-level
config file with somewhat self-explanatory names and there is a basic
README.

KERNBENCH
kernbench-slab-vanilla-kernbenchkernbench-slub-vanilla-kernbench kernbench-slub
slab-vanilla slub-vanilla unified-v4r1
Elapsed min 382.95 ( 0.00%) 383.44 (-0.13%) 385.36 (-0.63%)
Elapsed mean 383.39 ( 0.00%) 383.61 (-0.06%) 386.07 (-0.70%)
Elapsed stddev 0.32 ( 0.00%) 0.20 (64.11%) 0.76 (-57.53%)
Elapsed max 383.94 ( 0.00%) 383.97 (-0.01%) 387.17 (-0.83%)
User min 1291.99 ( 0.00%) 1290.63 ( 0.11%) 1296.50 (-0.35%)
User mean 1293.05 ( 0.00%) 1291.71 ( 0.10%) 1298.28 (-0.40%)
User stddev 1.06 ( 0.00%) 0.97 ( 8.76%) 1.56 (-32.28%)
User max 1295.01 ( 0.00%) 1293.10 ( 0.15%) 1301.16 (-0.47%)
System min 164.46 ( 0.00%) 166.34 (-1.13%) 167.82 (-2.00%)
System mean 165.50 ( 0.00%) 167.38 (-1.12%) 168.70 (-1.89%)
System stddev 0.83 ( 0.00%) 0.67 (22.71%) 0.92 (-10.53%)
System max 166.98 ( 0.00%) 168.17 (-0.71%) 170.29 (-1.94%)
CPU min 379.00 ( 0.00%) 379.00 ( 0.00%) 378.00 ( 0.26%)
CPU mean 379.80 ( 0.00%) 379.80 ( 0.00%) 379.40 ( 0.11%)
CPU stddev 0.40 ( 0.00%) 0.40 ( 0.00%) 0.80 (-50.00%)
CPU max 380.00 ( 0.00%) 380.00 ( 0.00%) 380.00 ( 0.00%)
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 8784.52 8788.55 8837.18
Total Elapsed Time (seconds) 2369.98 2361.80 2384.12

FS-Mark
fsmark-slab-vanilla-fsmarkfsmark-slub-vanilla-fsmark fsmark-slub
slab-vanilla slub-vanilla unified-v4r1
Files/s min 437.30 ( 0.00%) 437.80 ( 0.11%) 436.80 (-0.11%)
Files/s mean 440.38 ( 0.00%) 440.36 (-0.00%) 440.68 ( 0.07%)
Files/s stddev 1.79 ( 0.00%) 1.91 ( 6.06%) 2.95 (39.25%)
Files/s max 442.60 ( 0.00%) 443.20 ( 0.14%) 445.90 ( 0.74%)
Overhead min 2851289.00 ( 0.00%) 2961679.00 (-3.73%) 2946715.00 (-3.24%)
Overhead mean 2964541.00 ( 0.00%) 3124801.80 (-5.13%) 3172446.40 (-6.55%)
Overhead stddev 64216.04 ( 0.00%) 115096.40 (-44.21%) 145393.19 (-55.83%)
Overhead max 3033464.00 ( 0.00%) 3269057.00 (-7.21%) 3386053.00 (-10.41%)
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 2250.39 2246.04 2252.47
Total Elapsed Time (seconds) 1187.28 1184.36 1184.45

IOZone
iozone-slab-vanilla-iozoneiozone-slub-vanilla-iozone iozone-slub
slab-vanilla slub-vanilla unified-v4r1
write-64 254006 ( 0.00%) 244297 (-3.97%) 246089 (-3.22%)
write-128 268327 ( 0.00%) 268327 ( 0.00%) 262296 (-2.30%)
write-256 277969 ( 0.00%) 275544 (-0.88%) 271777 (-2.28%)
write-512 276175 ( 0.00%) 272184 (-1.47%) 272460 (-1.36%)
write-1024 270893 ( 0.00%) 269838 (-0.39%) 264178 (-2.54%)
write-2048 263917 ( 0.00%) 264698 ( 0.30%) 258976 (-1.91%)
write-4096 261593 ( 0.00%) 261108 (-0.19%) 256141 (-2.13%)
write-8192 258725 ( 0.00%) 257950 (-0.30%) 253497 (-2.06%)
write-16384 256661 ( 0.00%) 256059 (-0.24%) 252683 (-1.57%)
write-32768 255359 ( 0.00%) 255866 ( 0.20%) 251637 (-1.48%)
write-65536 254740 ( 0.00%) 255058 ( 0.12%) 250648 (-1.63%)
write-131072 254432 ( 0.00%) 254863 ( 0.17%) 251498 (-1.17%)
write-262144 251304 ( 0.00%) 252769 ( 0.58%) 246666 (-1.88%)
write-524288 102207 ( 0.00%) 103622 ( 1.37%) 106770 ( 4.27%)
rewrite-64 889431 ( 0.00%) 969761 ( 8.28%) 942521 ( 5.63%)
rewrite-128 1007629 ( 0.00%) 985435 (-2.25%) 1007629 ( 0.00%)
rewrite-256 1007446 ( 0.00%) 1027695 ( 1.97%) 1016025 ( 0.84%)
rewrite-512 1009722 ( 0.00%) 1012101 ( 0.24%) 982917 (-2.73%)
rewrite-1024 933525 ( 0.00%) 939446 ( 0.63%) 923488 (-1.09%)
rewrite-2048 808859 ( 0.00%) 816626 ( 0.95%) 822175 ( 1.62%)
rewrite-4096 781834 ( 0.00%) 788472 ( 0.84%) 783474 ( 0.21%)
rewrite-8192 756475 ( 0.00%) 771023 ( 1.89%) 764385 ( 1.03%)
rewrite-16384 754911 ( 0.00%) 766865 ( 1.56%) 756490 ( 0.21%)
rewrite-32768 752752 ( 0.00%) 760559 ( 1.03%) 758942 ( 0.82%)
rewrite-65536 747709 ( 0.00%) 720857 (-3.73%) 758353 ( 1.40%)
rewrite-131072 748085 ( 0.00%) 757120 ( 1.19%) 759824 ( 1.54%)
rewrite-262144 727700 ( 0.00%) 733333 ( 0.77%) 732503 ( 0.66%)
rewrite-524288 105288 ( 0.00%) 104058 (-1.18%) 110624 ( 4.82%)
read-64 1562436 ( 0.00%) 1879725 (16.88%) 1526887 (-2.33%)
read-128 1526043 ( 0.00%) 1111981 (-37.24%) 1557024 ( 1.99%)
read-256 1729594 ( 0.00%) 1570244 (-10.15%) 1718521 (-0.64%)
read-512 1882427 ( 0.00%) 1672748 (-12.54%) 1917728 ( 1.84%)
read-1024 2031864 ( 0.00%) 2019445 (-0.61%) 1968537 (-3.22%)
read-2048 2298737 ( 0.00%) 2216868 (-3.69%) 2204352 (-4.28%)
read-4096 2356697 ( 0.00%) 2346077 (-0.45%) 2340643 (-0.69%)
read-8192 2480882 ( 0.00%) 2418030 (-2.60%) 2511345 ( 1.21%)
read-16384 2543770 ( 0.00%) 2522203 (-0.86%) 2621399 ( 2.96%)
read-32768 2573242 ( 0.00%) 2572664 (-0.02%) 2655424 ( 3.09%)
read-65536 2597014 ( 0.00%) 2598880 ( 0.07%) 2676570 ( 2.97%)
read-131072 2606221 ( 0.00%) 2607210 ( 0.04%) 2711057 ( 3.87%)
read-262144 2623777 ( 0.00%) 2627375 ( 0.14%) 2711992 ( 3.25%)
read-524288 2626615 ( 0.00%) 2611723 (-0.57%) 2711613 ( 3.13%)
reread-64 2278628 ( 0.00%) 4274062 (46.69%) 2467108 ( 7.64%)
reread-128 3277486 ( 0.00%) 3895854 (15.87%) 2969325 (-10.38%)
reread-256 3770085 ( 0.00%) 3879045 ( 2.81%) 3159869 (-19.31%)
reread-512 3580298 ( 0.00%) 3659616 ( 2.17%) 3220553 (-11.17%)
reread-1024 2877110 ( 0.00%) 2813041 (-2.28%) 2715230 (-5.96%)
reread-2048 2608697 ( 0.00%) 2602375 (-0.24%) 2653011 ( 1.67%)
reread-4096 2578086 ( 0.00%) 2592481 ( 0.56%) 2344476 (-9.96%)
reread-8192 2610564 ( 0.00%) 2598128 (-0.48%) 2717085 ( 3.92%)
reread-16384 2606781 ( 0.00%) 2612629 ( 0.22%) 2743532 ( 4.98%)
reread-32768 2625646 ( 0.00%) 2616449 (-0.35%) 2655014 ( 1.11%)
reread-65536 2628805 ( 0.00%) 2623110 (-0.22%) 2722026 ( 3.42%)
reread-131072 2611458 ( 0.00%) 2635027 ( 0.89%) 2724020 ( 4.13%)
reread-262144 2631362 ( 0.00%) 2630644 (-0.03%) 2751359 ( 4.36%)
reread-524288 2627836 ( 0.00%) 2626339 (-0.06%) 2724960 ( 3.56%)
randread-64 1768283 ( 0.00%) 1638743 (-7.90%) 1599680 (-10.54%)
randread-128 2098744 ( 0.00%) 2784517 (24.63%) 1911894 (-9.77%)
randread-256 2371308 ( 0.00%) 1704877 (-39.09%) 2065659 (-14.80%)
randread-512 2416145 ( 0.00%) 2438090 ( 0.90%) 2294796 (-5.29%)
randread-1024 2110750 ( 0.00%) 2106609 (-0.20%) 1943594 (-8.60%)
randread-2048 2036105 ( 0.00%) 1989882 (-2.32%) 1958129 (-3.98%)
randread-4096 2060231 ( 0.00%) 2006805 (-2.66%) 1979748 (-4.07%)
randread-8192 1931211 ( 0.00%) 2022730 ( 4.52%) 1982009 ( 2.56%)
randread-16384 1994886 ( 0.00%) 1988594 (-0.32%) 1978688 (-0.82%)
randread-32768 1953151 ( 0.00%) 1964148 ( 0.56%) 1944584 (-0.44%)
randread-65536 1917719 ( 0.00%) 1931844 ( 0.73%) 1906665 (-0.58%)
randread-131072 1900225 ( 0.00%) 1908756 ( 0.45%) 1894378 (-0.31%)
randread-262144 1890443 ( 0.00%) 1888592 (-0.10%) 1868879 (-1.15%)
randread-524288 1859164 ( 0.00%) 1855099 (-0.22%) 1843896 (-0.83%)
randwrite-64 1204796 ( 0.00%) 886494 (-35.91%) 1049372 (-14.81%)
randwrite-128 1254941 ( 0.00%) 1306873 ( 3.97%) 1162547 (-7.95%)
randwrite-256 1219045 ( 0.00%) 1286217 ( 5.22%) 1035624 (-17.71%)
randwrite-512 1171691 ( 0.00%) 1224470 ( 4.31%) 1130370 (-3.66%)
randwrite-1024 953418 ( 0.00%) 1001903 ( 4.84%) 1035229 ( 7.90%)
randwrite-2048 781058 ( 0.00%) 853377 ( 8.47%) 840682 ( 7.09%)
randwrite-4096 789341 ( 0.00%) 770646 (-2.43%) 760514 (-3.79%)
randwrite-8192 737352 ( 0.00%) 762824 ( 3.34%) 760208 ( 3.01%)
randwrite-16384 721920 ( 0.00%) 726622 ( 0.65%) 742698 ( 2.80%)
randwrite-32768 713991 ( 0.00%) 716963 ( 0.41%) 711835 (-0.30%)
randwrite-65536 703869 ( 0.00%) 707189 ( 0.47%) 702806 (-0.15%)
randwrite-131072 697603 ( 0.00%) 700211 ( 0.37%) 694241 (-0.48%)
randwrite-262144 674226 ( 0.00%) 688917 ( 2.13%) 682725 ( 1.24%)
randwrite-524288 3862 ( 0.00%) 3290 (-17.39%) 3678 (-5.00%)
bkwdread-64 1255511 ( 0.00%) 1879725 (33.21%) 1780008 (29.47%)
bkwdread-128 1487977 ( 0.00%) 947186 (-57.09%) 887675 (-67.63%)
bkwdread-256 1718521 ( 0.00%) 1219045 (-40.97%) 1254656 (-36.97%)
bkwdread-512 1772135 ( 0.00%) 1455126 (-21.79%) 1689859 (-4.87%)
bkwdread-1024 1825466 ( 0.00%) 1796451 (-1.62%) 1758930 (-3.78%)
bkwdread-2048 1705432 ( 0.00%) 1937372 (11.97%) 1892971 ( 9.91%)
bkwdread-4096 2004931 ( 0.00%) 2010798 ( 0.29%) 1963457 (-2.11%)
bkwdread-8192 2123881 ( 0.00%) 2090796 (-1.58%) 2099355 (-1.17%)
bkwdread-16384 2184219 ( 0.00%) 2167136 (-0.79%) 2156053 (-1.31%)
bkwdread-32768 2183067 ( 0.00%) 2202448 ( 0.88%) 2176705 (-0.29%)
bkwdread-65536 2199044 ( 0.00%) 2217637 ( 0.84%) 2202656 ( 0.16%)
bkwdread-131072 2224130 ( 0.00%) 2232222 ( 0.36%) 2218100 (-0.27%)
bkwdread-262144 2240966 ( 0.00%) 2247557 ( 0.29%) 2216583 (-1.10%)
bkwdread-524288 2236828 ( 0.00%) 2238212 ( 0.06%) 2152063 (-3.94%)
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 23.08 23.01 23.12
Total Elapsed Time (seconds) 303.03 329.62 303.42

NETPERF UDP
netperf-udp netperf-udp udp-slub
slab-vanilla slub-vanilla unified-v4r1
64 52.23 ( 0.00%)* 53.80 ( 2.92%) 50.56 (-3.30%)
1.36% 1.00% 1.00%
128 103.70 ( 0.00%) 107.43 ( 3.47%) 101.23 (-2.44%)
256 208.62 ( 0.00%)* 212.15 ( 1.66%) 202.35 (-3.10%)
1.73% 1.00% 1.00%
1024 814.86 ( 0.00%) 827.42 ( 1.52%) 799.13 (-1.97%)
2048 1585.65 ( 0.00%) 1614.76 ( 1.80%) 1563.52 (-1.42%)
3312 2512.44 ( 0.00%) 2556.70 ( 1.73%) 2460.37 (-2.12%)
4096 3016.81 ( 0.00%)* 3058.16 ( 1.35%) 2901.87 (-3.96%)
1.15% 1.00% 1.00%
8192 5384.46 ( 0.00%) 5092.95 (-5.72%) 4912.71 (-9.60%)
16384 8091.96 ( 0.00%)* 8249.26 ( 1.91%) 8004.40 (-1.09%)
1.70% 1.00% 1.00%
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 3318.1 1759.32 2037.7
Total Elapsed Time (seconds) 3986.21 2114.19 2450.19

NETPERF TCP
netperf-tcp netperf-tcp tcp-slub
slab-vanilla slub-vanilla unified-v4r1
64 559.86 ( 0.00%) 561.45 ( 0.28%) 553.43 (-1.16%)
128 1015.34 ( 0.00%) 1023.43 ( 0.79%) 1010.13 (-0.52%)
256 1758.20 ( 0.00%) 1790.91 ( 1.83%) 1761.10 ( 0.16%)
1024 3657.40 ( 0.00%) 3749.93 ( 2.47%) 3617.72 (-1.10%)
2048 4237.05 ( 0.00%)* 4338.38 ( 2.34%) 4214.48 (-0.54%)*
1.05% 1.00% 1.13%
3312 4490.72 ( 0.00%)* 4469.92 (-0.47%)* 4293.97 (-4.58%)*
2.56% 1.47% 2.32%
4096 4977.91 ( 0.00%) 5158.15 ( 3.49%) 4882.70 (-1.95%)
8192 5574.82 ( 0.00%) 5629.75 ( 0.98%) 5442.13 (-2.44%)
16384 7549.99 ( 0.00%)* 7839.95 ( 3.70%)* 7582.45 ( 0.43%)*
6.34% 6.80% 4.80%
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 2981.5 2171.64 2817.11
Total Elapsed Time (seconds) 3059.42 2218.98 2879.69

HACKBENCH PIPES
1 0.07 ( 0.00%) 0.06 ( 1.92%) 0.07 (-5.89%)
4 0.13 ( 0.00%) 0.14 (-8.39%) 0.14 (-7.73%)
8 0.24 ( 0.00%) 0.23 ( 3.26%) 0.23 ( 4.43%)
12 0.37 ( 0.00%) 0.37 (-1.97%) 0.37 (-1.20%)
16 0.42 ( 0.00%) 0.41 ( 2.42%) 0.44 (-4.16%)
20 0.51 ( 0.00%) 0.53 (-2.78%) 0.56 (-8.09%)
24 0.65 ( 0.00%) 0.61 ( 7.45%) 0.61 ( 7.84%)
28 0.72 ( 0.00%) 0.73 (-0.72%) 0.77 (-6.59%)
32 0.80 ( 0.00%) 0.80 (-0.12%) 0.81 (-1.32%)
36 0.97 ( 0.00%) 0.90 ( 7.69%) 0.92 ( 5.40%)
40 1.00 ( 0.00%) 0.98 ( 2.11%) 1.03 (-2.40%)
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 116.79 121.81 133.8
Total Elapsed Time (seconds) 143.91 142.96 147.49

HACKBENCH SOCKETS
1 0.09 ( 0.00%) 0.10 ( 7.91%) 0.09 ( 3.37%)
4 0.31 ( 0.00%) 0.31 ( 0.34%) 0.27 (-11.78%)
8 0.58 ( 0.00%) 0.58 (-0.64%) 0.53 (-10.77%)
12 0.90 ( 0.00%) 0.86 (-4.19%) 0.78 (-14.80%)
16 1.17 ( 0.00%) 1.13 (-3.53%) 1.02 (-14.57%)
20 1.46 ( 0.00%) 1.42 (-2.63%) 1.27 (-14.77%)
24 1.75 ( 0.00%) 1.70 (-3.13%) 1.54 (-13.98%)
28 2.04 ( 0.00%) 1.96 (-4.13%) 1.75 (-16.31%)
32 2.32 ( 0.00%) 2.25 (-3.11%) 2.00 (-16.34%)
36 2.60 ( 0.00%) 2.52 (-3.00%) 2.26 (-15.21%)
40 2.90 ( 0.00%) 2.81 (-3.35%) 2.52 (-15.03%)
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 1007.63 841.49 765.63
Total Elapsed Time (seconds) 348.97 339.07 309.28

CREATEDELETE

MMTests Statistics: duration
User/Sys Time Running Test (seconds) 2287.66 2174.03 2305.63
Total Elapsed Time (seconds) 1053.00 1012.81 1052.59

CACHEEFFECTS

MMTests Statistics: duration
User/Sys Time Running Test (seconds) 0.3 0.27 0.31
Total Elapsed Time (seconds) 3.87 4.35 3.96

STREAM
vmr-stream vmr-stream stream-slub
slab-vanilla slub-vanilla unified-v4r1
Add-static-small-17.0 12064.30 ( 0.00%) 12064.30 ( 0.00%) 11950.48 (-0.95%)
Add-static-small-17.33 14304.10 ( 0.00%) 14658.59 ( 2.42%) 14412.68 ( 0.75%)
Add-static-small-17.66 13037.24 ( 0.00%) 12728.52 (-2.43%) 12520.79 (-4.12%)
Add-static-small-18.0 13899.62 ( 0.00%) 13971.95 ( 0.52%) 13971.95 ( 0.52%)
Add-static-small-18.33 13602.88 ( 0.00%) 13630.44 ( 0.20%) 13562.15 (-0.30%)
Add-static-small-18.66 13140.73 ( 0.00%) 13254.87 ( 0.86%) 13588.01 ( 3.29%)
Add-static-small-19.0 11343.07 ( 0.00%) 9744.24 (-16.41%) 12011.02 ( 5.56%)
Add-static-small-19.33 10266.32 ( 0.00%) 9499.10 (-8.08%) 9659.62 (-6.28%)
Add-static-small-19.66 6755.96 ( 0.00%) 7122.05 ( 5.14%) 7167.80 ( 5.75%)
Add-static-small-20.0 5146.37 ( 0.00%) 4932.07 (-4.35%) 5014.26 (-2.63%)
Add-static-small-20.33 2797.35 ( 0.00%) 2841.12 ( 1.54%) 2913.44 ( 3.98%)
Add-static-small-20.66 2410.63 ( 0.00%) 2409.50 (-0.05%) 2444.73 ( 1.39%)
Add-static-small-21.0 2558.76 ( 0.00%) 2578.45 ( 0.76%) 2544.36 (-0.57%)
Add-static-small-21.33 2374.19 ( 0.00%) 2375.22 ( 0.04%) 2381.77 ( 0.32%)
Add-static-small-21.66 2542.64 ( 0.00%) 2548.12 ( 0.21%) 2552.61 ( 0.39%)
Add-static-small-22.0 2661.86 ( 0.00%) 2647.16 (-0.56%) 2655.80 (-0.23%)
Add-static-small-22.33 2671.42 ( 0.00%) 2677.47 ( 0.23%) 2674.40 ( 0.11%)
Add-static-small-22.66 2473.47 ( 0.00%) 2481.66 ( 0.33%) 2477.52 ( 0.16%)
Add-static-small-23.0 2709.03 ( 0.00%) 2703.36 (-0.21%) 2700.99 (-0.30%)
Add-static-small-23.33 2400.59 ( 0.00%) 2398.40 (-0.09%) 2398.53 (-0.09%)
Add-static-small-23.66 2311.42 ( 0.00%) 2314.08 ( 0.11%) 2315.33 ( 0.17%)
Add-static-small-24.0 2728.27 ( 0.00%) 2732.43 ( 0.15%) 2731.14 ( 0.11%)
Add-static-small-24.33 2671.22 ( 0.00%) 2671.66 ( 0.02%) 2667.95 (-0.12%)
Add-static-small-24.66 2537.32 ( 0.00%) 2540.08 ( 0.11%) 2540.08 ( 0.11%)
Add-static-small-25.0 2749.01 ( 0.00%) 2749.38 ( 0.01%) 2748.98 (-0.00%)
Add-static-small-25.33 2541.34 ( 0.00%) 2542.30 ( 0.04%) 2540.70 (-0.02%)
Add-static-small-25.66 2767.26 ( 0.00%) 2766.62 (-0.02%) 2766.04 (-0.04%)
Add-static-small-26.0 2762.67 ( 0.00%) 2763.35 ( 0.02%) 2763.48 ( 0.03%)
Add-static-small-26.33 2511.43 ( 0.00%) 2510.95 (-0.02%) 2511.01 (-0.02%)
Add-static-small-26.66 2612.54 ( 0.00%) 2611.50 (-0.04%) 2611.98 (-0.02%)
Add-static-small-27.0 2781.28 ( 0.00%) 2780.14 (-0.04%) 2781.63 ( 0.01%)
Add-static-small-27.33 2695.21 ( 0.00%) 2693.52 (-0.06%) 2692.99 (-0.08%)
Add-static-small-27.66 2769.12 ( 0.00%) 2768.34 (-0.03%) 2768.79 (-0.01%)
Add-static-small-28.0 2790.06 ( 0.00%) 2788.43 (-0.06%) 2791.53 ( 0.05%)
Add-static-small-28.33 2817.07 ( 0.00%) 2816.11 (-0.03%) 2816.23 (-0.03%)
Add-static-small-28.66 2733.30 ( 0.00%) 2733.20 (-0.00%) 2733.60 ( 0.01%)
Add-static-small-29.0 2797.55 ( 0.00%) 2796.75 (-0.03%) 2797.62 ( 0.00%)
Add-static-small-29.33 2747.68 ( 0.00%) 2747.16 (-0.02%) 2746.53 (-0.04%)
Add-static-small-29.66 2696.02 ( 0.00%) 2693.31 (-0.10%) 2693.99 (-0.08%)
Add-static-small-30.0 2781.97 ( 0.00%) 2779.25 (-0.10%) 2784.71 ( 0.10%)
Copy-static-small-17.0 14659.26 ( 0.00%) 14414.94 (-1.69%) 14142.10 (-3.66%)
Copy-static-small-17.33 14437.64 ( 0.00%) 14442.96 ( 0.04%) 14515.94 ( 0.54%)
Copy-static-small-17.66 11311.57 ( 0.00%) 11378.11 ( 0.58%) 11311.57 ( 0.00%)
Copy-static-small-18.0 11765.69 ( 0.00%) 11690.63 (-0.64%) 11784.45 ( 0.16%)
Copy-static-small-18.33 11623.77 ( 0.00%) 11522.60 (-0.88%) 11623.77 ( 0.00%)
Copy-static-small-18.66 11321.31 ( 0.00%) 11279.20 (-0.37%) 11443.65 ( 1.07%)
Copy-static-small-19.0 11222.00 ( 0.00%) 9620.40 (-16.65%) 11354.70 ( 1.17%)
Copy-static-small-19.33 10892.85 ( 0.00%) 9996.77 (-8.96%) 10127.48 (-7.56%)
Copy-static-small-19.66 8650.33 ( 0.00%) 8544.79 (-1.24%) 8292.51 (-4.31%)
Copy-static-small-20.0 6776.33 ( 0.00%) 5775.60 (-17.33%) 6393.53 (-5.99%)
Copy-static-small-20.33 3868.82 ( 0.00%) 3903.18 ( 0.88%) 3791.15 (-2.05%)
Copy-static-small-20.66 2962.61 ( 0.00%) 2997.96 ( 1.18%) 2945.59 (-0.58%)
Copy-static-small-21.0 2776.75 ( 0.00%) 2745.67 (-1.13%) 2767.47 (-0.34%)
Copy-static-small-21.33 2665.36 ( 0.00%) 2668.46 ( 0.12%) 2653.65 (-0.44%)
Copy-static-small-21.66 2581.35 ( 0.00%) 2568.24 (-0.51%) 2580.30 (-0.04%)
Copy-static-small-22.0 2449.33 ( 0.00%) 2448.14 (-0.05%) 2447.34 (-0.08%)
Copy-static-small-22.33 2233.25 ( 0.00%) 2229.02 (-0.19%) 2226.57 (-0.30%)
Copy-static-small-22.66 2083.14 ( 0.00%) 2090.15 ( 0.34%) 2084.96 ( 0.09%)
Copy-static-small-23.0 2376.04 ( 0.00%) 2376.96 ( 0.04%) 2374.23 (-0.08%)
Copy-static-small-23.33 2252.81 ( 0.00%) 2255.55 ( 0.12%) 2253.88 ( 0.05%)
Copy-static-small-23.66 2192.97 ( 0.00%) 2196.61 ( 0.17%) 2194.06 ( 0.05%)
Copy-static-small-24.0 2321.02 ( 0.00%) 2326.15 ( 0.22%) 2321.40 ( 0.02%)
Copy-static-small-24.33 2399.02 ( 0.00%) 2397.62 (-0.06%) 2397.38 (-0.07%)
Copy-static-small-24.66 2407.21 ( 0.00%) 2406.11 (-0.05%) 2407.45 ( 0.01%)
Copy-static-small-25.0 2312.30 ( 0.00%) 2312.73 ( 0.02%) 2313.24 ( 0.04%)
Copy-static-small-25.33 2009.71 ( 0.00%) 2008.83 (-0.04%) 2009.39 (-0.02%)
Copy-static-small-25.66 2134.22 ( 0.00%) 2132.93 (-0.06%) 2133.65 (-0.03%)
Copy-static-small-26.0 2306.28 ( 0.00%) 2306.54 ( 0.01%) 2307.05 ( 0.03%)
Copy-static-small-26.33 2184.55 ( 0.00%) 2183.33 (-0.06%) 2183.96 (-0.03%)
Copy-static-small-26.66 2234.97 ( 0.00%) 2234.81 (-0.01%) 2234.18 (-0.03%)
Copy-static-small-27.0 2317.53 ( 0.00%) 2317.06 (-0.02%) 2318.87 ( 0.06%)
Copy-static-small-27.33 2402.66 ( 0.00%) 2401.76 (-0.04%) 2402.54 (-0.00%)
Copy-static-small-27.66 2396.42 ( 0.00%) 2395.54 (-0.04%) 2395.94 (-0.02%)
Copy-static-small-28.0 2334.59 ( 0.00%) 2333.69 (-0.04%) 2334.68 ( 0.00%)
Copy-static-small-28.33 2228.03 ( 0.00%) 2226.75 (-0.06%) 2226.41 (-0.07%)
Copy-static-small-28.66 2168.20 ( 0.00%) 2168.81 ( 0.03%) 2169.26 ( 0.05%)
Copy-static-small-29.0 2343.31 ( 0.00%) 2343.03 (-0.01%) 2344.22 ( 0.04%)
Copy-static-small-29.33 2293.51 ( 0.00%) 2292.53 (-0.04%) 2293.24 (-0.01%)
Copy-static-small-29.66 2296.00 ( 0.00%) 2294.23 (-0.08%) 2296.50 ( 0.02%)
Copy-static-small-30.0 2349.76 ( 0.00%) 2348.91 (-0.04%) 2349.01 (-0.03%)
Scale-static-small-17.0 14254.87 ( 0.00%) 14198.49 (-0.40%) 14659.26 ( 2.76%)
Scale-static-small-17.33 13846.22 ( 0.00%) 13331.26 (-3.86%) 13919.20 ( 0.52%)
Scale-static-small-17.66 10460.06 ( 0.00%) 10477.91 ( 0.17%) 10477.91 ( 0.17%)
Scale-static-small-18.0 9176.40 ( 0.00%) 8959.46 (-2.42%) 9153.79 (-0.25%)
Scale-static-small-18.33 8031.20 ( 0.00%) 8043.81 ( 0.16%) 8051.07 ( 0.25%)
Scale-static-small-18.66 7882.66 ( 0.00%) 7882.66 ( 0.00%) 7923.37 ( 0.51%)
Scale-static-small-19.0 9827.80 ( 0.00%) 9196.63 (-6.86%) 9847.38 ( 0.20%)
Scale-static-small-19.33 9140.09 ( 0.00%) 9112.74 (-0.30%) 8947.76 (-2.15%)
Scale-static-small-19.66 7448.02 ( 0.00%) 7609.60 ( 2.12%) 7634.98 ( 2.45%)
Scale-static-small-20.0 5736.04 ( 0.00%) 5646.54 (-1.59%) 5499.92 (-4.29%)
Scale-static-small-20.33 3767.07 ( 0.00%) 3466.29 (-8.68%) 3731.70 (-0.95%)
Scale-static-small-20.66 2605.20 ( 0.00%) 2586.14 (-0.74%) 2591.36 (-0.53%)
Scale-static-small-21.0 2458.99 ( 0.00%) 2473.66 ( 0.59%) 2425.00 (-1.40%)
Scale-static-small-21.33 2301.60 ( 0.00%) 2316.69 ( 0.65%) 2302.14 ( 0.02%)
Scale-static-small-21.66 2113.64 ( 0.00%) 2116.43 ( 0.13%) 2111.01 (-0.12%)
Scale-static-small-22.0 2334.94 ( 0.00%) 2346.44 ( 0.49%) 2342.90 ( 0.34%)
Scale-static-small-22.33 2355.52 ( 0.00%) 2355.57 ( 0.00%) 2353.04 (-0.11%)
Scale-static-small-22.66 2374.13 ( 0.00%) 2368.80 (-0.23%) 2373.85 (-0.01%)
Scale-static-small-23.0 2193.07 ( 0.00%) 2192.18 (-0.04%) 2198.93 ( 0.27%)
Scale-static-small-23.33 1961.98 ( 0.00%) 1963.48 ( 0.08%) 1963.58 ( 0.08%)
Scale-static-small-23.66 2087.14 ( 0.00%) 2084.43 (-0.13%) 2086.74 (-0.02%)
Scale-static-small-24.0 2317.82 ( 0.00%) 2316.05 (-0.08%) 2319.08 ( 0.05%)
Scale-static-small-24.33 2088.54 ( 0.00%) 2090.38 ( 0.09%) 2090.72 ( 0.10%)
Scale-static-small-24.66 2249.70 ( 0.00%) 2248.20 (-0.07%) 2247.32 (-0.11%)
Scale-static-small-25.0 2191.54 ( 0.00%) 2191.17 (-0.02%) 2191.90 ( 0.02%)
Scale-static-small-25.33 2365.61 ( 0.00%) 2365.82 ( 0.01%) 2365.53 (-0.00%)
Scale-static-small-25.66 2353.54 ( 0.00%) 2355.43 ( 0.08%) 2353.49 (-0.00%)
Scale-static-small-26.0 2328.23 ( 0.00%) 2327.04 (-0.05%) 2327.79 (-0.02%)
Scale-static-small-26.33 2174.75 ( 0.00%) 2176.11 ( 0.06%) 2175.69 ( 0.04%)
Scale-static-small-26.66 2109.83 ( 0.00%) 2108.76 (-0.05%) 2109.16 (-0.03%)
Scale-static-small-27.0 2232.60 ( 0.00%) 2231.80 (-0.04%) 2233.96 ( 0.06%)
Scale-static-small-27.33 2298.19 ( 0.00%) 2298.13 (-0.00%) 2297.96 (-0.01%)
Scale-static-small-27.66 2210.86 ( 0.00%) 2209.71 (-0.05%) 2210.47 (-0.02%)
Scale-static-small-28.0 2347.95 ( 0.00%) 2347.62 (-0.01%) 2348.78 ( 0.04%)
Scale-static-small-28.33 2371.16 ( 0.00%) 2370.16 (-0.04%) 2370.52 (-0.03%)
Scale-static-small-28.66 2373.70 ( 0.00%) 2373.81 ( 0.00%) 2374.67 ( 0.04%)
Scale-static-small-29.0 2289.08 ( 0.00%) 2288.95 (-0.01%) 2290.30 ( 0.05%)
Scale-static-small-29.33 2251.49 ( 0.00%) 2250.53 (-0.04%) 2251.88 ( 0.02%)
Scale-static-small-29.66 2264.59 ( 0.00%) 2262.57 (-0.09%) 2263.30 (-0.06%)
Scale-static-small-30.0 2368.20 ( 0.00%) 2367.55 (-0.03%) 2368.02 (-0.01%)
Triad-static-small-17.0 12462.37 ( 0.00%) 12455.87 (-0.05%) 12405.74 (-0.46%)
Triad-static-small-17.33 13548.09 ( 0.00%) 13572.77 ( 0.18%) 13572.77 ( 0.18%)
Triad-static-small-17.66 12491.80 ( 0.00%) 12405.05 (-0.70%) 12506.85 ( 0.12%)
Triad-static-small-18.0 10050.94 ( 0.00%) 10086.65 ( 0.35%) 10050.94 ( 0.00%)
Triad-static-small-18.33 10622.90 ( 0.00%) 10408.87 (-2.06%) 10506.39 (-1.11%)
Triad-static-small-18.66 10297.50 ( 0.00%) 10025.29 (-2.72%) 10380.22 ( 0.80%)
Triad-static-small-19.0 10984.20 ( 0.00%) 9572.00 (-14.75%) 11701.99 ( 6.13%)
Triad-static-small-19.33 9854.11 ( 0.00%) 9090.86 (-8.40%) 8933.28 (-10.31%)
Triad-static-small-19.66 6082.61 ( 0.00%) 6673.00 ( 8.85%) 6430.92 ( 5.42%)
Triad-static-small-20.0 4571.62 ( 0.00%) 4482.53 (-1.99%) 4473.38 (-2.20%)
Triad-static-small-20.33 2880.73 ( 0.00%) 2891.05 ( 0.36%) 2938.72 ( 1.97%)
Triad-static-small-20.66 2326.63 ( 0.00%) 2320.30 (-0.27%) 2316.88 (-0.42%)
Triad-static-small-21.0 2759.25 ( 0.00%) 2764.61 ( 0.19%) 2744.24 (-0.55%)
Triad-static-small-21.33 2696.68 ( 0.00%) 2694.26 (-0.09%) 2700.09 ( 0.13%)
Triad-static-small-21.66 2595.89 ( 0.00%) 2590.86 (-0.19%) 2589.48 (-0.25%)
Triad-static-small-22.0 2715.27 ( 0.00%) 2709.88 (-0.20%) 2712.39 (-0.11%)
Triad-static-small-22.33 2559.21 ( 0.00%) 2560.17 ( 0.04%) 2559.76 ( 0.02%)
Triad-static-small-22.66 2780.62 ( 0.00%) 2771.40 (-0.33%) 2777.04 (-0.13%)
Triad-static-small-23.0 2725.92 ( 0.00%) 2735.80 ( 0.36%) 2734.08 ( 0.30%)
Triad-static-small-23.33 2197.15 ( 0.00%) 2197.16 ( 0.00%) 2198.22 ( 0.05%)
Triad-static-small-23.66 2491.03 ( 0.00%) 2492.60 ( 0.06%) 2490.21 (-0.03%)
Triad-static-small-24.0 2698.54 ( 0.00%) 2696.26 (-0.08%) 2696.67 (-0.07%)
Triad-static-small-24.33 2572.62 ( 0.00%) 2574.74 ( 0.08%) 2576.27 ( 0.14%)
Triad-static-small-24.66 2693.72 ( 0.00%) 2692.98 (-0.03%) 2694.17 ( 0.02%)
Triad-static-small-25.0 2727.56 ( 0.00%) 2726.54 (-0.04%) 2726.28 (-0.05%)
Triad-static-small-25.33 2774.01 ( 0.00%) 2773.77 (-0.01%) 2773.23 (-0.03%)
Triad-static-small-25.66 2568.71 ( 0.00%) 2569.79 ( 0.04%) 2569.52 ( 0.03%)
Triad-static-small-26.0 2717.27 ( 0.00%) 2717.70 ( 0.02%) 2717.99 ( 0.03%)
Triad-static-small-26.33 2627.58 ( 0.00%) 2627.41 (-0.01%) 2627.89 ( 0.01%)
Triad-static-small-26.66 2469.62 ( 0.00%) 2469.00 (-0.02%) 2468.88 (-0.03%)
Triad-static-small-27.0 2757.61 ( 0.00%) 2756.50 (-0.04%) 2757.91 ( 0.01%)
Triad-static-small-27.33 2756.80 ( 0.00%) 2756.63 (-0.01%) 2757.00 ( 0.01%)
Triad-static-small-27.66 2730.51 ( 0.00%) 2729.97 (-0.02%) 2730.23 (-0.01%)
Triad-static-small-28.0 2760.07 ( 0.00%) 2759.34 (-0.03%) 2760.34 ( 0.01%)
Triad-static-small-28.33 2724.95 ( 0.00%) 2723.63 (-0.05%) 2724.47 (-0.02%)
Triad-static-small-28.66 2818.62 ( 0.00%) 2819.25 ( 0.02%) 2820.00 ( 0.05%)
Triad-static-small-29.0 2787.79 ( 0.00%) 2787.43 (-0.01%) 2788.62 ( 0.03%)
Triad-static-small-29.33 2699.87 ( 0.00%) 2698.92 (-0.04%) 2698.81 (-0.04%)
Triad-static-small-29.66 2764.21 ( 0.00%) 2761.20 (-0.11%) 2762.21 (-0.07%)
Triad-static-small-30.0 2767.82 ( 0.00%) 2766.04 (-0.06%) 2767.55 (-0.01%)
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 833.15 833.07 833.27
Total Elapsed Time (seconds) 840.38 839.78 839.63

SYSBENCH
sysbench-slab-vanilla-sysbenchsysbench-slub-vanilla-sysbench sysbench-slub
slab-vanilla slub-vanilla unified-v4r1
1 7521.24 ( 0.00%) 7719.38 ( 2.57%) 7589.13 ( 0.89%)
2 14872.85 ( 0.00%) 15275.09 ( 2.63%) 15054.08 ( 1.20%)
3 16502.53 ( 0.00%) 16676.53 ( 1.04%) 16465.69 (-0.22%)
4 17831.19 ( 0.00%) 17900.09 ( 0.38%) 17819.03 (-0.07%)
5 18158.40 ( 0.00%) 18432.74 ( 1.49%) 18341.99 ( 1.00%)
6 18673.68 ( 0.00%) 18878.41 ( 1.08%) 18614.92 (-0.32%)
7 17689.75 ( 0.00%) 17871.89 ( 1.02%) 17633.19 (-0.32%)
8 16885.68 ( 0.00%) 16838.37 (-0.28%) 16498.41 (-2.35%)
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 2362.85 2367.19 2430.63
Total Elapsed Time (seconds) 2932.91 2936.30 2932.22
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>

Christoph Lameter

2010-10-18 18:13:42 UTC

Post by Mel Gorman
Minimally, I see the same sort of hackbench socket performance regression
as reported elsewhere (10-15% regression). Otherwise, it isn't particularly
exciting results. The machine is very basic - 2 socket, 4 cores, x86-64,
2G RAM. Macine model is an IBM BladeCenter HS20. Processor is Xeon but I'm
not sure exact what model. It appears to be from around the P4 times.

Looks not good. Something must still be screwed up. Trouble is to find
time to do this work. When working on SLAB we had a team to implement the
NUMA stuff and deal with the performance issues.

Post by Mel Gorman
Christoph, in particular while it tests netperf, it is not binding to any
particular CPU (although it can), server and client are running on the local
machine (which has particular performance characterisitcs of its own) and
the tests is STREAM, not RR so the tarball is not a replacement for more
targetting testing or workload-specific testing. Still, it should catch
some of the common snags before getting into specific workloads without
taking an extraordinary amount of time to complete. sysbench might take a
long time for many-core machines, limit the number of threads it tests with
OLTP_MAX_THREADS in the config file.

That should not matter too much. The performance results should replicate
SLABs caching behavior and I do not see that in the tests.

Post by Mel Gorman
NETPERF UDP
netperf-udp netperf-udp udp-slub
slab-vanilla slub-vanilla unified-v4r1
64 52.23 ( 0.00%)* 53.80 ( 2.92%) 50.56 (-3.30%) 1.36% 1.00% 1.00%
128 103.70 ( 0.00%) 107.43 ( 3.47%) 101.23 (-2.44%)
256 208.62 ( 0.00%)* 212.15 ( 1.66%) 202.35 (-3.10%) 1.73% 1.00% 1.00%
1024 814.86 ( 0.00%) 827.42 ( 1.52%) 799.13 (-1.97%)
2048 1585.65 ( 0.00%) 1614.76 ( 1.80%) 1563.52 (-1.42%)
3312 2512.44 ( 0.00%) 2556.70 ( 1.73%) 2460.37 (-2.12%)
4096 3016.81 ( 0.00%)* 3058.16 ( 1.35%) 2901.87 (-3.96%) 1.15% 1.00% 1.00%
8192 5384.46 ( 0.00%) 5092.95 (-5.72%) 4912.71 (-9.60%)
16384 8091.96 ( 0.00%)* 8249.26 ( 1.91%) 8004.40 (-1.09%) 1.70% 1.00% 1.00%

Seems that we lost some of the netperf wins.

Post by Mel Gorman
SYSBENCH
sysbench-slab-vanilla-sysbenchsysbench-slub-vanilla-sysbench sysbench-slub
slab-vanilla slub-vanilla unified-v4r1
1 7521.24 ( 0.00%) 7719.38 ( 2.57%) 7589.13 ( 0.89%)
2 14872.85 ( 0.00%) 15275.09 ( 2.63%) 15054.08 ( 1.20%)
3 16502.53 ( 0.00%) 16676.53 ( 1.04%) 16465.69 (-0.22%)
4 17831.19 ( 0.00%) 17900.09 ( 0.38%) 17819.03 (-0.07%)
5 18158.40 ( 0.00%) 18432.74 ( 1.49%) 18341.99 ( 1.00%)
6 18673.68 ( 0.00%) 18878.41 ( 1.08%) 18614.92 (-0.32%)
7 17689.75 ( 0.00%) 17871.89 ( 1.02%) 17633.19 (-0.32%)
8 16885.68 ( 0.00%) 16838.37 (-0.28%) 16498.41 (-2.35%)

Same here. Seems that we combined the worst of both.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>

Mel Gorman

2010-10-19 09:23:38 UTC

Post by Christoph Lameter

Post by Mel Gorman
Minimally, I see the same sort of hackbench socket performance regression
as reported elsewhere (10-15% regression). Otherwise, it isn't particularly
exciting results. The machine is very basic - 2 socket, 4 cores, x86-64,
2G RAM. Macine model is an IBM BladeCenter HS20. Processor is Xeon but I'm
not sure exact what model. It appears to be from around the P4 times.

Looks not good. Something must still be screwed up. Trouble is to find
time to do this work. When working on SLAB we had a team to implement the
NUMA stuff and deal with the performance issues.

Post by Mel Gorman
Christoph, in particular while it tests netperf, it is not binding to any
particular CPU (although it can), server and client are running on the local
machine (which has particular performance characterisitcs of its own) and
the tests is STREAM, not RR so the tarball is not a replacement for more
targetting testing or workload-specific testing. Still, it should catch
some of the common snags before getting into specific workloads without
taking an extraordinary amount of time to complete. sysbench might take a
long time for many-core machines, limit the number of threads it tests with
OLTP_MAX_THREADS in the config file.

That should not matter too much. The performance results should replicate
SLABs caching behavior and I do not see that in the tests.

On the other hand, the unified figures are very close to slab in terms of
behaviour. Very small gains and losses. Considering that the server and
clients are not bound to any particular CPU either and the data set it is
working on is quite large, a small amount of noise is expected.

Post by Christoph Lameter

Post by Mel Gorman
NETPERF UDP
netperf-udp netperf-udp udp-slub
slab-vanilla slub-vanilla unified-v4r1
64 52.23 ( 0.00%)* 53.80 ( 2.92%) 50.56 (-3.30%) 1.36% 1.00% 1.00%
128 103.70 ( 0.00%) 107.43 ( 3.47%) 101.23 (-2.44%)
256 208.62 ( 0.00%)* 212.15 ( 1.66%) 202.35 (-3.10%) 1.73% 1.00% 1.00%
1024 814.86 ( 0.00%) 827.42 ( 1.52%) 799.13 (-1.97%)
2048 1585.65 ( 0.00%) 1614.76 ( 1.80%) 1563.52 (-1.42%)
3312 2512.44 ( 0.00%) 2556.70 ( 1.73%) 2460.37 (-2.12%)
4096 3016.81 ( 0.00%)* 3058.16 ( 1.35%) 2901.87 (-3.96%) 1.15% 1.00% 1.00%
8192 5384.46 ( 0.00%) 5092.95 (-5.72%) 4912.71 (-9.60%)
16384 8091.96 ( 0.00%)* 8249.26 ( 1.91%) 8004.40 (-1.09%) 1.70% 1.00% 1.00%

Seems that we lost some of the netperf wins.

It's a different test being run here. UDP_STREAM versus UDP_RR and that could
be one factor in the differences between my results and your own. I'll look
into redoing these for *_RR to rule that out as one factor. The results
are outside statistical noise though.

Post by Christoph Lameter

Post by Mel Gorman
SYSBENCH
sysbench-slab-vanilla-sysbenchsysbench-slub-vanilla-sysbench sysbench-slub
slab-vanilla slub-vanilla unified-v4r1
1 7521.24 ( 0.00%) 7719.38 ( 2.57%) 7589.13 ( 0.89%)
2 14872.85 ( 0.00%) 15275.09 ( 2.63%) 15054.08 ( 1.20%)
3 16502.53 ( 0.00%) 16676.53 ( 1.04%) 16465.69 (-0.22%)
4 17831.19 ( 0.00%) 17900.09 ( 0.38%) 17819.03 (-0.07%)
5 18158.40 ( 0.00%) 18432.74 ( 1.49%) 18341.99 ( 1.00%)
6 18673.68 ( 0.00%) 18878.41 ( 1.08%) 18614.92 (-0.32%)
7 17689.75 ( 0.00%) 17871.89 ( 1.02%) 17633.19 (-0.32%)
8 16885.68 ( 0.00%) 16838.37 (-0.28%) 16498.41 (-2.35%)

Same here. Seems that we combined the worst of both.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>

Mel Gorman

2010-10-12 18:25:31 UTC

Post by Pekka Enberg
(Adding more people who've taken interest in slab performance in the
past to CC.)

I have not come even close to reviewing this yet but I made a start on
putting it through a series of tests. It fails to build on ppc64

CC mm/slub.o
mm/slub.c:1477: warning: 'drain_alien_caches' declared inline after being called
mm/slub.c:1477: warning: previous declaration of 'drain_alien_caches' was here
mm/slub.c: In function `alloc_shared_caches':
mm/slub.c:1748: error: `cpu_info' undeclared (first use in this function)
mm/slub.c:1748: error: (Each undeclared identifier is reported only once
mm/slub.c:1748: error: for each function it appears in.)
mm/slub.c:1748: warning: type defaults to `int' in declaration of `type name'
mm/slub.c:1748: warning: type defaults to `int' in declaration of `type name'
mm/slub.c:1748: warning: type defaults to `int' in declaration of `type name'
mm/slub.c:1748: warning: type defaults to `int' in declaration of `type name'
mm/slub.c:1748: error: invalid type argument of `unary *'
make[1]: *** [mm/slub.o] Error 1
make: *** [mm] Error 2

I didn't look closely yet but cpu_info is an arch-specific variable.
Checking to see if there is a known fix yet before setting aside time to
dig deeper.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

Pekka Enberg

2010-10-13 07:16:45 UTC

Post by Mel Gorman

Post by Pekka Enberg
(Adding more people who've taken interest in slab performance in the
past to CC.)

I have not come even close to reviewing this yet but I made a start on
putting it through a series of tests. It fails to build on ppc64
CC mm/slub.o
mm/slub.c:1477: warning: 'drain_alien_caches' declared inline after being called
mm/slub.c:1477: warning: previous declaration of 'drain_alien_caches' was here

Can you try the attached patch to see if it fixes the problem?

Post by Mel Gorman
mm/slub.c:1748: error: `cpu_info' undeclared (first use in this function)
mm/slub.c:1748: error: (Each undeclared identifier is reported only once
mm/slub.c:1748: error: for each function it appears in.)
mm/slub.c:1748: warning: type defaults to `int' in declaration of `type name'
mm/slub.c:1748: warning: type defaults to `int' in declaration of `type name'
mm/slub.c:1748: warning: type defaults to `int' in declaration of `type name'
mm/slub.c:1748: warning: type defaults to `int' in declaration of `type name'
mm/slub.c:1748: error: invalid type argument of `unary *'
make[1]: *** [mm/slub.o] Error 1
make: *** [mm] Error 2
I didn't look closely yet but cpu_info is an arch-specific variable.
Checking to see if there is a known fix yet before setting aside time to
dig deeper.

Yeah, cpu_info.llc_shared_map is an x86ism. Christoph?

Pekka

Mel Gorman

2010-10-13 13:46:56 UTC

Post by Pekka Enberg

Post by Mel Gorman

Post by Pekka Enberg
(Adding more people who've taken interest in slab performance in the
past to CC.)

I have not come even close to reviewing this yet but I made a start on
putting it through a series of tests. It fails to build on ppc64
CC mm/slub.o
mm/slub.c:1477: warning: 'drain_alien_caches' declared inline after being called
mm/slub.c:1477: warning: previous declaration of 'drain_alien_caches' was here

Can you try the attached patch to see if it fixes the problem?

I didn't resend it though testing and I don't have my hands on the tree
right now but your patch looks reasonable.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>

Christoph Lameter

2010-10-13 16:10:29 UTC

Post by Mel Gorman
mm/slub.c:1748: error: `cpu_info' undeclared (first use in this function)
I didn't look closely yet but cpu_info is an arch-specific variable.
Checking to see if there is a known fix yet before setting aside time to
dig deeper.

Argh we have no arch independant way to figuring out the shared cpu mask?
The scheduler at least needs it. Need to look at it when I get back from
the conference next week.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>

Andi Kleen

2010-10-06 10:47:21 UTC

Christoph Lameter <***@linux.com> writes:

Not looked at code so far, but just comments based on the
description. But thanks for working on this, it's good
to have alternatives to the ugly slab.c

Post by Christoph Lameter
- Lots of debugging
- Performance optimizations (more would be good)...
- Drop per slab locking in favor of per node locking for
partial lists (queuing implies freeing large amounts of objects
to per node lists of slab).

Is that really a good idea? Nodes (= sockets) are getting larger and
larger and they are quite substantial SMPs by themselves now.
On Xeon 75xx you have 16 virtual CPUs per node.

Post by Christoph Lameter
1. SLUB accurately tracks cpu caches instead of assuming that there
is only a single cpu cache per node or system.
2. SLUB object expiration is tied into the page reclaim logic. There
is no periodic cache expiration.

Hmm, but that means that you could fill a lot of memory with caches
before they get pruned right? Is there another limit too?

-Andi

--
***@linux.intel.com -- Speaking for myself only.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>

Christoph Lameter

2010-10-06 15:59:55 UTC

Post by Andi Kleen
Not looked at code so far, but just comments based on the
description. But thanks for working on this, it's good
to have alternatives to the ugly slab.c

Post by Christoph Lameter
- Lots of debugging
- Performance optimizations (more would be good)...
- Drop per slab locking in favor of per node locking for
partial lists (queuing implies freeing large amounts of objects
to per node lists of slab).

Is that really a good idea? Nodes (= sockets) are getting larger and
larger and they are quite substantial SMPs by themselves now.
On Xeon 75xx you have 16 virtual CPUs per node.

True. The shared caches can compensate for that. Without this I got
regression because of too many atomic operations during draining and
refilling.

The other alternative is to stay with the current approach
which minimizes the queuing etc overhead and can affort to have the
overhead.

Post by Andi Kleen

Post by Christoph Lameter
2. SLUB object expiration is tied into the page reclaim logic. There
is no periodic cache expiration.

Hmm, but that means that you could fill a lot of memory with caches
before they get pruned right? Is there another limit too?

The cache all have an limit on the number of objects in them (like SLAB).
If you want less you can limit the sizes of the queues.
Otherwise there is no other limit.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>

Andi Kleen

2010-10-06 16:25:47 UTC

Post by Christoph Lameter

Post by Andi Kleen
Not looked at code so far, but just comments based on the
description. But thanks for working on this, it's good
to have alternatives to the ugly slab.c

Post by Christoph Lameter
- Lots of debugging
- Performance optimizations (more would be good)...
- Drop per slab locking in favor of per node locking for
partial lists (queuing implies freeing large amounts of objects
to per node lists of slab).

Is that really a good idea? Nodes (= sockets) are getting larger and
larger and they are quite substantial SMPs by themselves now.
On Xeon 75xx you have 16 virtual CPUs per node.

True. The shared caches can compensate for that. Without this I got
regression because of too many atomic operations during draining and
refilling.

Could you just do it by smaller units? (e.g. cores on SMT systems)

I agree some sharing is a good idea, just a node is likely too large.

Post by Christoph Lameter

Post by Andi Kleen

Post by Christoph Lameter
2. SLUB object expiration is tied into the page reclaim logic. There
is no periodic cache expiration.

Hmm, but that means that you could fill a lot of memory with caches
before they get pruned right? Is there another limit too?

The cache all have an limit on the number of objects in them (like SLAB).
If you want less you can limit the sizes of the queues.
Otherwise there is no other limit.

So it would depend on that total number of caches in the system?
-Andi

--
***@linux.intel.com -- Speaking for myself only.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>

Christoph Lameter

2010-10-06 16:37:12 UTC

Post by Andi Kleen

Post by Christoph Lameter
True. The shared caches can compensate for that. Without this I got
regression because of too many atomic operations during draining and
refilling.

Could you just do it by smaller units? (e.g. cores on SMT systems)

The shared caches are not per node but per sharing domain (l3).

The difficulty with making the partial lists work for a smaller unit is
that this would require a mechanism to fallback to other partial lists for
the same node if one would be exhausted?

Also how does one figure out which partial list a slab belongs to? Right
now this is by node. We would have to store the partial list number in the
page struct.

Post by Andi Kleen
I agree some sharing is a good idea, just a node is likely too large.

You can increase the batching in order to reduce the load on the node
locks. The sharing caches will take care of a lot of the intra node
movement also.

Post by Andi Kleen

Post by Christoph Lameter

Post by Andi Kleen

Post by Christoph Lameter
2. SLUB object expiration is tied into the page reclaim logic. There
is no periodic cache expiration.

Hmm, but that means that you could fill a lot of memory with caches
before they get pruned right? Is there another limit too?

The cache all have an limit on the number of objects in them (like SLAB).
If you want less you can limit the sizes of the queues.
Otherwise there is no other limit.

So it would depend on that total number of caches in the system?

Yes. Also the expiration is triggerable from user space. You can set up a
cron job that triggers cache expiration every minute or so.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>

Andi Kleen

2010-10-06 16:43:26 UTC

Post by Christoph Lameter

Post by Andi Kleen

Post by Christoph Lameter
True. The shared caches can compensate for that. Without this I got
regression because of too many atomic operations during draining and
refilling.

Could you just do it by smaller units? (e.g. cores on SMT systems)

The shared caches are not per node but per sharing domain (l3).

That's the same at least on Intel servers.

Post by Christoph Lameter

Post by Andi Kleen
So it would depend on that total number of caches in the system?

Yes. Also the expiration is triggerable from user space. You can set up a
cron job that triggers cache expiration every minute or so.
movement also.

That doesn't seem like a good way to do this to me. Such things should work
without special cron jobs.

-Andi

--
***@linux.intel.com -- Speaking for myself only.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>

Christoph Lameter

2010-10-06 16:49:38 UTC

Post by Andi Kleen

Post by Christoph Lameter

Post by Andi Kleen
So it would depend on that total number of caches in the system?

Yes. Also the expiration is triggerable from user space. You can set up a
cron job that triggers cache expiration every minute or so.
movement also.

That doesn't seem like a good way to do this to me. Such things should work
without special cron jobs.

Its trivial to add a 2 second timer (or another variant) if we want the
exact slab cleanup behavior. However, then you have the disturbances again
of running code by checking all the caches in the system on all cpus.
Running the cleaning from reclaim avoids that.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>

Christoph Lameter

2010-10-06 16:52:27 UTC

Post by Andi Kleen

Post by Christoph Lameter
The shared caches are not per node but per sharing domain (l3).

That's the same at least on Intel servers.

Only true for recent intel servers. My old Dell 1950 has sharing domains
for each 2 cpus.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>

David Rientjes

2010-10-19 20:39:55 UTC

Post by Christoph Lameter
- Lots of debugging
- Performance optimizations (more would be good)...
- Drop per slab locking in favor of per node locking for
partial lists (queuing implies freeing large amounts of objects
to per node lists of slab).
- Implement object expiration via reclaim VM logic.

I applied this set on top of Pekka's for-next tree reverted back to
5d1f57e4 since it doesn't apply later then that.

Overall, the results are _much_ better than the vanilla slub allocator
that I frequently saw ~20% regressions with the TCP_RR netperf benchmark
on a couple of my machines with larger cpu counts. However, there still
is a significant performance degradation compared to slab.

When running this patchset on two (client and server running
netperf-2.4.5) four 2.2GHz quad-core AMD processors with 64GB of memory,
here're the results:

threads SLAB SLUB diff
16 207038 184389 -10.9%
32 266105 234386 -11.9%
48 287989 252733 -12.2%
64 307572 277221 - 9.9%
80 309802 284199 - 8.3%
96 302959 291743 - 3.7%
112 307381 297459 - 3.2%
128 314582 299340 - 4.8%
144 331945 299648 - 9.7%
160 321882 314192 - 2.4%

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>

Christoph Lameter

2010-10-20 13:47:30 UTC

Post by David Rientjes
Overall, the results are _much_ better than the vanilla slub allocator
that I frequently saw ~20% regressions with the TCP_RR netperf benchmark
on a couple of my machines with larger cpu counts. However, there still
is a significant performance degradation compared to slab.

It seems that the memory leak is still present. This likely skews the
results. Thought I had it fixed. Thanks.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>

47 Replies
39 Views
Permalink to this page
Disable enhanced parsing

Thread Navigation

Christoph Lameter 2010-10-05 18:57:25 UTC

Christoph Lameter 2010-10-05 18:57:27 UTC

Pekka Enberg 2010-10-06 14:02:17 UTC

Christoph Lameter 2010-10-05 18:57:30 UTC

Christoph Lameter 2010-10-05 18:57:28 UTC

Christoph Lameter 2010-10-05 18:57:29 UTC

Christoph Lameter 2010-10-05 18:57:36 UTC

Christoph Lameter 2010-10-05 18:57:39 UTC

Christoph Lameter 2010-10-05 18:57:34 UTC

Christoph Lameter 2010-10-05 18:57:37 UTC

Christoph Lameter 2010-10-05 18:57:41 UTC

Christoph Lameter 2010-10-05 18:57:38 UTC

Christoph Lameter 2010-10-05 18:57:32 UTC

Christoph Lameter 2010-10-05 18:57:35 UTC

Christoph Lameter 2010-10-05 18:57:26 UTC

Pekka Enberg 2010-10-06 14:02:05 UTC

Christoph Lameter 2010-10-05 18:57:33 UTC

Christoph Lameter 2010-10-05 18:57:40 UTC

Christoph Lameter 2010-10-05 18:57:31 UTC

Pekka Enberg 2010-10-06 08:01:35 UTC

Richard Kennedy 2010-10-06 11:03:27 UTC

Pekka Enberg 2010-10-06 11:19:13 UTC

Richard Kennedy 2010-10-06 15:46:19 UTC

Christoph Lameter 2010-10-06 16:21:13 UTC

Christoph Lameter 2010-10-06 16:21:53 UTC

Christoph Lameter 2010-10-06 20:56:13 UTC

Christoph Lameter 2010-10-06 16:00:25 UTC

Wu Fengguang 2010-10-06 12:37:53 UTC

Alex,Shi 2010-10-13 02:21:12 UTC

Christoph Lameter 2010-10-18 18:00:46 UTC

Alex,Shi 2010-10-19 00:01:44 UTC

Christoph Lameter 2010-10-06 15:56:47 UTC

Mel Gorman 2010-10-13 14:14:55 UTC

Christoph Lameter 2010-10-18 18:13:42 UTC

Mel Gorman 2010-10-19 09:23:38 UTC

Mel Gorman 2010-10-12 18:25:31 UTC

Pekka Enberg 2010-10-13 07:16:45 UTC

Mel Gorman 2010-10-13 13:46:56 UTC

Christoph Lameter 2010-10-13 16:10:29 UTC

Andi Kleen 2010-10-06 10:47:21 UTC

Christoph Lameter 2010-10-06 15:59:55 UTC

Andi Kleen 2010-10-06 16:25:47 UTC

Christoph Lameter 2010-10-06 16:37:12 UTC

Andi Kleen 2010-10-06 16:43:26 UTC

Christoph Lameter 2010-10-06 16:49:38 UTC

Christoph Lameter 2010-10-06 16:52:27 UTC

David Rientjes 2010-10-19 20:39:55 UTC

Christoph Lameter 2010-10-20 13:47:30 UTC

about - legalese

Loading...