Discussion:
[PATCH 00/16] Swap-over-NBD without deadlocking V9
Mel Gorman
2012-04-16 12:16:47 UTC
Changelog since V8
o Rebase to 3.4-rc2
o Use page flag instead of slab fields to keep structures the same size
o Properly detect allocations from softirq context that use PF_MEMALLOC
o Ensure kswapd does not sleep while processes are throttled
o Do not accidentally throttle !__GFP_FS processes indefinitely

Changelog since V7
o Rebase to 3.3-rc2
o Take greater care propagating page->pfmemalloc to skb
o Propagate pfmemalloc from netdev_alloc_page to skb where possible
o Release RCU lock properly on preempt kernel

Changelog since V6
o Rebase to 3.1-rc8
o Use wake_up instead of wake_up_interruptible()
o Do not throttle kernel threads
o Avoid a potential race between kswapd going to sleep and processes being
throttled

Changelog since V5
o Rebase to 3.1-rc5

Changelog since V4
o Update comment clarifying what protocols can be used (Michal)
o Rebase to 3.0-rc3

Changelog since V3
o Propagate pfmemalloc from packet fragment pages to skb (Neil)
o Rebase to 3.0-rc2

Changelog since V2
o Document that __GFP_NOMEMALLOC overrides __GFP_MEMALLOC (Neil)
o Use wait_event_interruptible (Neil)
o Use !! when casting to bool to avoid any possibility of type
truncation (Neil)
o Nicer logic when using skb_pfmemalloc_protocol (Neil)

Changelog since V1
o Rebase on top of mmotm
o Use atomic_t for memalloc_socks (David Miller)
o Remove use of sk_memalloc_socks in vmscan (Neil Brown)
o Check throttle within prepare_to_wait (Neil Brown)
o Add statistics on throttling instead of printk

When a user or administrator requires swap for their application, they
create a swap partition or file, format it with mkswap and activate it
with swapon. Swap over the network is considered an option in diskless
systems. The two likely scenarios are blade servers used as part of a
cluster where the form factor or maintenance costs do not allow the use
of disks, and thin clients.

The Linux Terminal Server Project recommends the use of the
Network Block Device (NBD) for swap according to the manual at
https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download
There are also documentation and tutorials on how to set up swap over NBD
at places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP
The nbd-client also documents the use of NBD as swap. Despite this, the
fact is that a machine using NBD for swap can deadlock within minutes if
swap is used intensively. This patch series addresses the problem.

The core issue is that network block devices do not use mempools like
normal block devices do. As the host cannot control where it receives
packets from, it cannot reliably work out in advance how much memory
it might need. Some years ago, Peter Zijlstra developed a series of
patches that supported swap over NFS that at least one distribution
is carrying within its kernels. This patch series borrows very heavily
from Peter's work to support swapping over NBD as a prerequisite to
supporting swap-over-NFS. The bulk of the complexity is concerned with
preserving memory that is allocated from the PFMEMALLOC reserves for use
by the network layer which is needed for both NBD and NFS.

Patch 1 serialises access to min_free_kbytes. It's not strictly needed
by this series but as the series cares about watermarks in
general, it's a harmless fix. It could be merged independently,
and may well be if CMA is merged in advance.

Patch 2 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to
preserve access to pages allocated under low memory situations
to callers that are freeing memory.

Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC
reserves without setting PFMEMALLOC.

Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves
for later use by network packet processing.

Patch 5 ignores memory policies when ALLOC_NO_WATERMARKS is set.

Patches 6-13 allow network processing to use PFMEMALLOC reserves when
the socket has been marked as being used by the VM to clean pages. If
packets are received and stored in pages that were allocated under
low-memory situations and are unrelated to the VM, the packets
are dropped.

Patch 11 reintroduces __netdev_alloc_page, which the networking
folk may object to but which is needed in some cases to propagate
pfmemalloc from a newly allocated page to an skb. If there is a
strong objection, this patch can be dropped with the impact being
that swap-over-network will be slower in some cases but it should
not fail.
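
As a rough illustration of what propagating pfmemalloc from a page to an
skb means (the helpers in patch 11 are not quoted here, so treat the names
below as assumptions rather than the patch itself):

        /* sketch: remember that the page backing this skb came from the
         * reserves so it can later be restricted to SOCK_MEMALLOC sockets */
        page = __netdev_alloc_page(dev, gfp_mask);
        if (page && page->pfmemalloc)
                skb->pfmemalloc = true;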

Patch 13 is a micro-optimisation to avoid a function call in the
common case.

Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use
PFMEMALLOC if necessary.

Patch 15 notes that it is still possible for the PFMEMALLOC reserve
to be depleted. To prevent this, direct reclaimers get throttled on
a waitqueue if 50% of the PFMEMALLOC reserves are depleted. It is
expected that kswapd and the direct reclaimers already running
will clean enough pages for the low watermark to be reached, at which
point the throttled processes are woken up.
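
To illustrate the throttling idea (patch 15 itself is not quoted in this
cover letter, so the helper and waitqueue names below are assumptions):

        /* sketch: a direct reclaimer sleeps until kswapd and the other
         * reclaimers bring free pages back above the watermark */
        if (!pfmemalloc_watermark_ok(pgdat))
                wait_event_killable(pgdat->pfmemalloc_wait,
                                    pfmemalloc_watermark_ok(pgdat));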

Patch 16 adds a statistic to track how often processes get throttled.

Some basic performance testing was run using kernel builds, netperf
on loopback for UDP and TCP, hackbench (pipes and sockets), iozone
and sysbench. Each of them was expected to use the sl*b allocators
reasonably heavily, but there did not appear to be any significant
performance variance.

For testing swap-over-NBD, a machine was booted with 2G of RAM with a
swapfile backed by NBD. 8*NUM_CPU processes were started that create
anonymous memory mappings and read them linearly in a loop. The total
size of the mappings was 4*PHYSICAL_MEMORY so that swap was used heavily
under memory pressure.

Without the patches and using SLUB, the machine locks up within minutes;
with them applied, the test runs to completion. With SLAB, the story is
different as an unpatched kernel runs to completion. However, the patched
kernel completed the test 40% faster.

                                         3.4.0-rc2    3.4.0-rc2
                                      vanilla-slab      swapnbd
Sys Time Running Test (seconds)              87.90        73.45
User+Sys Time Running Test (seconds)         91.93        76.91
Total Elapsed Time (seconds)               4174.37      2953.96

drivers/block/nbd.c | 6 +-
drivers/net/ethernet/chelsio/cxgb4/sge.c | 2 +-
drivers/net/ethernet/chelsio/cxgb4vf/sge.c | 2 +-
drivers/net/ethernet/intel/igb/igb_main.c | 2 +-
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 2 +-
drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 3 +-
drivers/net/usb/cdc-phonet.c | 2 +-
drivers/usb/gadget/f_phonet.c | 2 +-
include/linux/gfp.h | 13 +-
include/linux/mm_types.h | 9 +
include/linux/mmzone.h | 1 +
include/linux/page-flags.h | 28 +++
include/linux/sched.h | 7 +
include/linux/skbuff.h | 83 +++++++-
include/linux/vm_event_item.h | 1 +
include/net/sock.h | 19 ++
include/trace/events/gfpflags.h | 1 +
kernel/softirq.c | 3 +
mm/page_alloc.c | 58 ++++--
mm/slab.c | 212 +++++++++++++++++++--
mm/slub.c | 28 ++-
mm/vmscan.c | 112 +++++++++++
mm/vmstat.c | 1 +
net/core/dev.c | 52 ++++-
net/core/filter.c | 8 +
net/core/skbuff.c | 93 +++++++--
net/core/sock.c | 42 ++++
net/ipv4/tcp.c | 3 +-
net/ipv4/tcp_output.c | 16 +-
net/ipv6/tcp_ipv6.c | 8 +-
30 files changed, 742 insertions(+), 77 deletions(-)
--
1.7.9.2

Mel Gorman
2012-04-16 12:16:48 UTC
There is a race between the min_free_kbytes sysctl, memory hotplug
and transparent hugepage support enablement. Memory hotplug uses a
zonelists_mutex to avoid a race when building zonelists. Reuse it to
serialise watermark updates.

[***@chello.nl: Older patch fixed the race with spinlock]
Signed-off-by: Mel Gorman <***@suse.de>
Reviewed-by: Rik van Riel <***@redhat.com>
---
mm/page_alloc.c | 23 +++++++++++++++--------
1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a712fb9..280eabe 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4976,14 +4976,7 @@ static void setup_per_zone_lowmem_reserve(void)
calculate_totalreserve_pages();
}

-/**
- * setup_per_zone_wmarks - called when min_free_kbytes changes
- * or when memory is hot-{added|removed}
- *
- * Ensures that the watermark[min,low,high] values for each zone are set
- * correctly with respect to min_free_kbytes.
- */
-void setup_per_zone_wmarks(void)
+static void __setup_per_zone_wmarks(void)
{
unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
unsigned long lowmem_pages = 0;
@@ -5038,6 +5031,20 @@ void setup_per_zone_wmarks(void)
calculate_totalreserve_pages();
}

+/**
+ * setup_per_zone_wmarks - called when min_free_kbytes changes
+ * or when memory is hot-{added|removed}
+ *
+ * Ensures that the watermark[min,low,high] values for each zone are set
+ * correctly with respect to min_free_kbytes.
+ */
+void setup_per_zone_wmarks(void)
+{
+ mutex_lock(&zonelists_mutex);
+ __setup_per_zone_wmarks();
+ mutex_unlock(&zonelists_mutex);
+}
+
/*
* The inactive anon list should be small enough that the VM never has to
* do too much work, but large enough that each inactive page has a chance
--
1.7.9.2

David Rientjes
2012-04-23 23:50:48 UTC
Post by Mel Gorman
There is a race between the min_free_kbytes sysctl, memory hotplug
and transparent hugepage support enablement. Memory hotplug uses a
zonelists_mutex to avoid a race when building zonelists. Reuse it to
serialise watermark updates.
Acked-by: David Rientjes <***@google.com>

Mel Gorman
2012-04-16 12:16:49 UTC
Allocations of pages below the min watermark run a risk of the
machine hanging due to a lack of memory. To prevent this, only
callers who have PF_MEMALLOC or TIF_MEMDIE set and are not processing
an interrupt are allowed to allocate with ALLOC_NO_WATERMARKS. Once
they are allocated to a slab though, nothing prevents other callers
consuming free objects within those slabs. This patch limits access
to slab pages that were allocated from the PFMEMALLOC reserves.

Pages allocated from the reserve are returned with page->pfmemalloc
set and it is up to the caller to determine how the page should be
protected. SLAB restricts access to any page with page->pfmemalloc set
to callers which are known to be able to access the PFMEMALLOC reserve. If
one is not available, an attempt is made to allocate a new page rather
than use a reserve. SLUB is a bit more relaxed in that it only records
if the current per-CPU page was allocated from the PFMEMALLOC reserve and
uses another partial slab if the caller does not have the necessary
GFP or process flags. This was found to be sufficient in tests to
avoid hangs due to SLUB generally maintaining smaller lists than SLAB.

In low-memory conditions it does mean that !PFMEMALLOC allocators
can fail a slab allocation even though free objects are available
because they are being preserved for callers that are freeing pages.
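
In condensed form, the check added to the SLAB allocation path described
above (see ac_get_obj() in the diff below) behaves roughly as follows;
this is a simplified sketch, not a hunk from the patch:

        /* objects from pfmemalloc slabs are only handed out to callers
         * that are themselves entitled to the reserves */
        objp = ac->entry[--ac->avail];
        if (unlikely(is_obj_pfmemalloc(objp))) {
                if (gfp_pfmemalloc_allowed(flags)) {
                        clear_obj_pfmemalloc(&objp);
                } else {
                        /* put it back and fall into the slow path to find
                         * or grow a !pfmemalloc slab instead */
                        ac->avail++;
                        objp = NULL;
                }
        }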

[***@chello.nl: Original implementation]
Signed-off-by: Mel Gorman <***@suse.de>
---
include/linux/mm_types.h | 9 +++
include/linux/page-flags.h | 28 +++++++
mm/internal.h | 3 +
mm/page_alloc.c | 27 +++++--
mm/slab.c | 188 +++++++++++++++++++++++++++++++++++++++-----
mm/slub.c | 27 ++++++-
6 files changed, 257 insertions(+), 25 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 3cc3062..56a465f 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -53,6 +53,15 @@ struct page {
union {
pgoff_t index; /* Our offset within mapping. */
void *freelist; /* slub first free object */
+ bool pfmemalloc; /* If set by the page allocator,
+ * ALLOC_PFMEMALLOC was set
+ * and the low watermark was not
+ * met implying that the system
+ * is under some pressure. The
+ * caller should try ensure
+ * this page is only used to
+ * free other pages.
+ */
};

union {
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index c88d2a9..e66eb0d 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -453,6 +453,34 @@ static inline int PageTransTail(struct page *page)
}
#endif

+/*
+ * If network-based swap is enabled, sl*b must keep track of whether pages
+ * were allocated from pfmemalloc reserves.
+ */
+static inline int PageSlabPfmemalloc(struct page *page)
+{
+ VM_BUG_ON(!PageSlab(page));
+ return PageActive(page);
+}
+
+static inline void SetPageSlabPfmemalloc(struct page *page)
+{
+ VM_BUG_ON(!PageSlab(page));
+ SetPageActive(page);
+}
+
+static inline void __ClearPageSlabPfmemalloc(struct page *page)
+{
+ VM_BUG_ON(!PageSlab(page));
+ __ClearPageActive(page);
+}
+
+static inline void ClearPageSlabPfmemalloc(struct page *page)
+{
+ VM_BUG_ON(!PageSlab(page));
+ ClearPageActive(page);
+}
+
#ifdef CONFIG_MMU
#define __PG_MLOCKED (1 << PG_mlocked)
#else
diff --git a/mm/internal.h b/mm/internal.h
index 2189af4..bff60d8 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -239,6 +239,9 @@ static inline struct page *mem_map_next(struct page *iter,
#define __paginginit __init
#endif

+/* Returns true if the gfp_mask allows use of ALLOC_NO_WATERMARK */
+bool gfp_pfmemalloc_allowed(gfp_t gfp_mask);
+
/* Memory initialisation debug and verification */
enum mminit_level {
MMINIT_WARNING,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 280eabe..0fa2c72 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1463,6 +1463,7 @@ failed:
#define ALLOC_HARDER 0x10 /* try to alloc harder */
#define ALLOC_HIGH 0x20 /* __GFP_HIGH set */
#define ALLOC_CPUSET 0x40 /* check for correct cpuset */
+#define ALLOC_PFMEMALLOC 0x80 /* Caller has PF_MEMALLOC set */

#ifdef CONFIG_FAIL_PAGE_ALLOC

@@ -2208,16 +2209,22 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
} else if (unlikely(rt_task(current)) && !in_interrupt())
alloc_flags |= ALLOC_HARDER;

- if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
- if (!in_interrupt() &&
- ((current->flags & PF_MEMALLOC) ||
- unlikely(test_thread_flag(TIF_MEMDIE))))
+ if ((current->flags & PF_MEMALLOC) ||
+ unlikely(test_thread_flag(TIF_MEMDIE))) {
+ alloc_flags |= ALLOC_PFMEMALLOC;
+
+ if (likely(!(gfp_mask & __GFP_NOMEMALLOC)) && !in_interrupt())
alloc_flags |= ALLOC_NO_WATERMARKS;
}

return alloc_flags;
}

+bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
+{
+ return !!(gfp_to_alloc_flags(gfp_mask) & ALLOC_PFMEMALLOC);
+}
+
static inline struct page *
__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist, enum zone_type high_zoneidx,
@@ -2407,8 +2414,16 @@ nopage:
got_pg:
if (kmemcheck_enabled)
kmemcheck_pagealloc_alloc(page, order, gfp_mask);
- return page;

+ /*
+ * page->pfmemalloc is set when the caller had PFMEMALLOC set or is
+ * been OOM killed. The expectation is that the caller is taking
+ * steps that will free more memory. The caller should avoid the
+ * page being used for !PFMEMALLOC purposes.
+ */
+ page->pfmemalloc = !!(alloc_flags & ALLOC_PFMEMALLOC);
+
+ return page;
}

/*
@@ -2459,6 +2474,8 @@ retry_cpuset:
page = __alloc_pages_slowpath(gfp_mask, order,
zonelist, high_zoneidx, nodemask,
preferred_zone, migratetype);
+ else
+ page->pfmemalloc = false;

trace_mm_page_alloc(page, order, gfp_mask, migratetype);

diff --git a/mm/slab.c b/mm/slab.c
index e901a36..7f483d6 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -123,6 +123,8 @@

#include <trace/events/kmem.h>

+#include "internal.h"
+
/*
* DEBUG - 1 for kmem_cache_create() to honour; SLAB_RED_ZONE & SLAB_POISON.
* 0 for faster, smaller code (especially in the critical paths).
@@ -151,6 +153,12 @@
#define ARCH_KMALLOC_FLAGS SLAB_HWCACHE_ALIGN
#endif

+/*
+ * true if a page was allocated from pfmemalloc reserves for network-based
+ * swap
+ */
+static bool pfmemalloc_active __read_mostly;
+
/* Legal flag mask for kmem_cache_create(). */
#if DEBUG
# define CREATE_MASK (SLAB_RED_ZONE | \
@@ -256,9 +264,30 @@ struct array_cache {
* Must have this definition in here for the proper
* alignment of array_cache. Also simplifies accessing
* the entries.
+ *
+ * Entries should not be directly dereferenced as
+ * entries belonging to slabs marked pfmemalloc will
+ * have the lower bits set SLAB_OBJ_PFMEMALLOC
*/
};

+#define SLAB_OBJ_PFMEMALLOC 1
+static inline bool is_obj_pfmemalloc(void *objp)
+{
+ return (unsigned long)objp & SLAB_OBJ_PFMEMALLOC;
+}
+
+static inline void set_obj_pfmemalloc(void **objp)
+{
+ *objp = (void *)((unsigned long)*objp | SLAB_OBJ_PFMEMALLOC);
+ return;
+}
+
+static inline void clear_obj_pfmemalloc(void **objp)
+{
+ *objp = (void *)((unsigned long)*objp & ~SLAB_OBJ_PFMEMALLOC);
+}
+
/*
* bootstrap: The caches do not work without cpuarrays anymore, but the
* cpuarrays are allocated from the generic caches...
@@ -951,6 +980,98 @@ static struct array_cache *alloc_arraycache(int node, int entries,
return nc;
}

+static inline bool is_slab_pfmemalloc(struct slab *slabp)
+{
+ struct page *page = virt_to_page(slabp->s_mem);
+
+ return PageSlabPfmemalloc(page);
+}
+
+/* Clears ac->pfmemalloc if no slabs have pfmalloc set */
+static void check_ac_pfmemalloc(struct kmem_cache *cachep,
+ struct array_cache *ac)
+{
+ struct kmem_list3 *l3 = cachep->nodelists[numa_mem_id()];
+ struct slab *slabp;
+
+ if (!pfmemalloc_active)
+ return;
+
+ list_for_each_entry(slabp, &l3->slabs_full, list)
+ if (is_slab_pfmemalloc(slabp))
+ return;
+
+ list_for_each_entry(slabp, &l3->slabs_partial, list)
+ if (is_slab_pfmemalloc(slabp))
+ return;
+
+ list_for_each_entry(slabp, &l3->slabs_free, list)
+ if (is_slab_pfmemalloc(slabp))
+ return;
+
+ pfmemalloc_active = false;
+}
+
+static void *ac_get_obj(struct kmem_cache *cachep, struct array_cache *ac,
+ gfp_t flags, bool force_refill)
+{
+ int i;
+ void *objp = ac->entry[--ac->avail];
+
+ /* Ensure the caller is allowed to use objects from PFMEMALLOC slab */
+ if (unlikely(is_obj_pfmemalloc(objp))) {
+ struct kmem_list3 *l3;
+
+ if (gfp_pfmemalloc_allowed(flags)) {
+ clear_obj_pfmemalloc(&objp);
+ return objp;
+ }
+
+ /* The caller cannot use PFMEMALLOC objects, find another one */
+ for (i = 1; i < ac->avail; i++) {
+ /* If a !PFMEMALLOC object is found, swap them */
+ if (!is_obj_pfmemalloc(ac->entry[i])) {
+ objp = ac->entry[i];
+ ac->entry[i] = ac->entry[ac->avail];
+ ac->entry[ac->avail] = objp;
+ return objp;
+ }
+ }
+
+ /*
+ * If there are empty slabs on the slabs_free list and we are
+ * being forced to refill the cache, mark this one !pfmemalloc.
+ */
+ l3 = cachep->nodelists[numa_mem_id()];
+ if (!list_empty(&l3->slabs_free) && force_refill) {
+ struct slab *slabp = virt_to_slab(objp);
+ ClearPageSlabPfmemalloc(virt_to_page(slabp->s_mem));
+ clear_obj_pfmemalloc(&objp);
+ check_ac_pfmemalloc(cachep, ac);
+ return objp;
+ }
+
+ /* No !PFMEMALLOC objects available */
+ ac->avail++;
+ objp = NULL;
+ }
+
+ return objp;
+}
+
+static void ac_put_obj(struct kmem_cache *cachep, struct array_cache *ac,
+ void *objp)
+{
+ if (unlikely(pfmemalloc_active)) {
+ /* Some pfmemalloc slabs exist, check if this is one */
+ struct page *page = virt_to_page(objp);
+ if (PageSlabPfmemalloc(page))
+ set_obj_pfmemalloc(&objp);
+ }
+
+ ac->entry[ac->avail++] = objp;
+}
+
/*
* Transfer objects in one arraycache to another.
* Locking must be handled by the caller.
@@ -1127,7 +1248,7 @@ static inline int cache_free_alien(struct kmem_cache *cachep, void *objp)
STATS_INC_ACOVERFLOW(cachep);
__drain_alien_cache(cachep, alien, nodeid);
}
- alien->entry[alien->avail++] = objp;
+ ac_put_obj(cachep, alien, objp);
spin_unlock(&alien->lock);
} else {
spin_lock(&(cachep->nodelists[nodeid])->list_lock);
@@ -1809,6 +1930,10 @@ static void *kmem_getpages(struct kmem_cache *cachep, gfp_t flags, int nodeid)
return NULL;
}

+ /* Record if ALLOC_PFMEMALLOC was set when allocating the slab */
+ if (unlikely(page->pfmemalloc))
+ pfmemalloc_active = true;
+
nr_pages = (1 << cachep->gfporder);
if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
add_zone_page_state(page_zone(page),
@@ -1816,9 +1941,13 @@ static void *kmem_getpages(struct kmem_cache *cachep, gfp_t flags, int nodeid)
else
add_zone_page_state(page_zone(page),
NR_SLAB_UNRECLAIMABLE, nr_pages);
- for (i = 0; i < nr_pages; i++)
+ for (i = 0; i < nr_pages; i++) {
__SetPageSlab(page + i);

+ if (page->pfmemalloc)
+ SetPageSlabPfmemalloc(page + i);
+ }
+
if (kmemcheck_enabled && !(cachep->flags & SLAB_NOTRACK)) {
kmemcheck_alloc_shadow(page, cachep->gfporder, flags, nodeid);

@@ -1851,6 +1980,7 @@ static void kmem_freepages(struct kmem_cache *cachep, void *addr)
while (i--) {
BUG_ON(!PageSlab(page));
__ClearPageSlab(page);
+ __ClearPageSlabPfmemalloc(page);
page++;
}
if (current->reclaim_state)
@@ -3120,16 +3250,19 @@ bad:
#define check_slabp(x,y) do { } while(0)
#endif

-static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags)
+static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags,
+ bool force_refill)
{
int batchcount;
struct kmem_list3 *l3;
struct array_cache *ac;
int node;

-retry:
check_irq_off();
node = numa_mem_id();
+ if (unlikely(force_refill))
+ goto force_grow;
+retry:
ac = cpu_cache_get(cachep);
batchcount = ac->batchcount;
if (!ac->touched && batchcount > BATCHREFILL_LIMIT) {
@@ -3179,8 +3312,8 @@ retry:
STATS_INC_ACTIVE(cachep);
STATS_SET_HIGH(cachep);

- ac->entry[ac->avail++] = slab_get_obj(cachep, slabp,
- node);
+ ac_put_obj(cachep, ac, slab_get_obj(cachep, slabp,
+ node));
}
check_slabp(cachep, slabp);

@@ -3199,18 +3332,22 @@ alloc_done:

if (unlikely(!ac->avail)) {
int x;
+force_grow:
x = cache_grow(cachep, flags | GFP_THISNODE, node, NULL);

/* cache_grow can reenable interrupts, then ac could change. */
ac = cpu_cache_get(cachep);
- if (!x && ac->avail == 0) /* no objects in sight? abort */
+
+ /* no objects in sight? abort */
+ if (!x && (ac->avail == 0 || force_refill))
return NULL;

if (!ac->avail) /* objects refilled by interrupt? */
goto retry;
}
ac->touched = 1;
- return ac->entry[--ac->avail];
+
+ return ac_get_obj(cachep, ac, flags, force_refill);
}

static inline void cache_alloc_debugcheck_before(struct kmem_cache *cachep,
@@ -3292,23 +3429,35 @@ static inline void *____cache_alloc(struct kmem_cache *cachep, gfp_t flags)
{
void *objp;
struct array_cache *ac;
+ bool force_refill = false;

check_irq_off();

ac = cpu_cache_get(cachep);
if (likely(ac->avail)) {
- STATS_INC_ALLOCHIT(cachep);
ac->touched = 1;
- objp = ac->entry[--ac->avail];
- } else {
- STATS_INC_ALLOCMISS(cachep);
- objp = cache_alloc_refill(cachep, flags);
+ objp = ac_get_obj(cachep, ac, flags, false);
+
/*
- * the 'ac' may be updated by cache_alloc_refill(),
- * and kmemleak_erase() requires its correct value.
+ * Allow for the possibility all avail objects are not allowed
+ * by the current flags
*/
- ac = cpu_cache_get(cachep);
+ if (objp) {
+ STATS_INC_ALLOCHIT(cachep);
+ goto out;
+ }
+ force_refill = true;
}
+
+ STATS_INC_ALLOCMISS(cachep);
+ objp = cache_alloc_refill(cachep, flags, force_refill);
+ /*
+ * the 'ac' may be updated by cache_alloc_refill(),
+ * and kmemleak_erase() requires its correct value.
+ */
+ ac = cpu_cache_get(cachep);
+
+out:
/*
* To avoid a false negative, if an object that is in one of the
* per-CPU caches is leaked, we need to make sure kmemleak doesn't
@@ -3630,9 +3779,12 @@ static void free_block(struct kmem_cache *cachep, void **objpp, int nr_objects,
struct kmem_list3 *l3;

for (i = 0; i < nr_objects; i++) {
- void *objp = objpp[i];
+ void *objp;
struct slab *slabp;

+ clear_obj_pfmemalloc(&objpp[i]);
+ objp = objpp[i];
+
slabp = virt_to_slab(objp);
l3 = cachep->nodelists[node];
list_del(&slabp->list);
@@ -3750,7 +3902,7 @@ static inline void __cache_free(struct kmem_cache *cachep, void *objp,
cache_flusharray(cachep, ac);
}

- ac->entry[ac->avail++] = objp;
+ ac_put_obj(cachep, ac, objp);
}

/**
diff --git a/mm/slub.c b/mm/slub.c
index ffe13fd..f0909bf 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -33,6 +33,8 @@

#include <trace/events/kmem.h>

+#include "internal.h"
+
/*
* Lock order:
* 1. slub_lock (Global Semaphore)
@@ -1370,6 +1372,8 @@ static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
inc_slabs_node(s, page_to_nid(page), page->objects);
page->slab = s;
page->flags |= 1 << PG_slab;
+ if (page->pfmemalloc)
+ SetPageSlabPfmemalloc(page);

start = page_address(page);

@@ -1414,6 +1418,7 @@ static void __free_slab(struct kmem_cache *s, struct page *page)
-pages);

__ClearPageSlab(page);
+ __ClearPageSlabPfmemalloc(page);
reset_page_mapcount(page);
if (current->reclaim_state)
current->reclaim_state->reclaimed_slab += pages;
@@ -2153,6 +2158,14 @@ static inline void *new_slab_objects(struct kmem_cache *s, gfp_t flags,
return object;
}

+static inline bool pfmemalloc_match(struct kmem_cache_cpu *c, gfp_t gfpflags)
+{
+ if (unlikely(PageSlabPfmemalloc(c->page)))
+ return gfp_pfmemalloc_allowed(gfpflags);
+
+ return true;
+}
+
/*
* Check the page->freelist of a page and either transfer the freelist to the per cpu freelist
* or deactivate the page.
@@ -2225,6 +2238,16 @@ redo:
goto new_slab;
}

+ /*
+ * By rights, we should be searching for a slab page that was
+ * PFMEMALLOC but right now, we are losing the pfmemalloc
+ * information when the page leaves the per-cpu allocator
+ */
+ if (unlikely(!pfmemalloc_match(c, gfpflags))) {
+ deactivate_slab(s, c);
+ goto new_slab;
+ }
+
/* must check again c->freelist in case of cpu migration or IRQ */
object = c->freelist;
if (object)
@@ -2329,8 +2352,8 @@ redo:
barrier();

object = c->freelist;
- if (unlikely(!object || !node_match(c, node)))
-
+ if (unlikely(!object || !node_match(c, node) ||
+ !pfmemalloc_match(c, gfpflags)))
object = __slab_alloc(s, gfpflags, node, addr, c);

else {
--
1.7.9.2

David Rientjes
2012-04-23 23:51:02 UTC
Post by Mel Gorman
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 280eabe..0fa2c72 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
#define ALLOC_HARDER 0x10 /* try to alloc harder */
#define ALLOC_HIGH 0x20 /* __GFP_HIGH set */
#define ALLOC_CPUSET 0x40 /* check for correct cpuset */
+#define ALLOC_PFMEMALLOC 0x80 /* Caller has PF_MEMALLOC set */
#ifdef CONFIG_FAIL_PAGE_ALLOC
@@ -2208,16 +2209,22 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
} else if (unlikely(rt_task(current)) && !in_interrupt())
alloc_flags |= ALLOC_HARDER;
- if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
- if (!in_interrupt() &&
- ((current->flags & PF_MEMALLOC) ||
- unlikely(test_thread_flag(TIF_MEMDIE))))
+ if ((current->flags & PF_MEMALLOC) ||
+ unlikely(test_thread_flag(TIF_MEMDIE))) {
+ alloc_flags |= ALLOC_PFMEMALLOC;
+
+ if (likely(!(gfp_mask & __GFP_NOMEMALLOC)) && !in_interrupt())
alloc_flags |= ALLOC_NO_WATERMARKS;
}
return alloc_flags;
}
+bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
+{
+ return !!(gfp_to_alloc_flags(gfp_mask) & ALLOC_PFMEMALLOC);
+}
+
static inline struct page *
__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist, enum zone_type high_zoneidx,
if (kmemcheck_enabled)
kmemcheck_pagealloc_alloc(page, order, gfp_mask);
- return page;
+ /*
+ * page->pfmemalloc is set when the caller had PFMEMALLOC set or is
+ * been OOM killed. The expectation is that the caller is taking
+ * steps that will free more memory. The caller should avoid the
+ * page being used for !PFMEMALLOC purposes.
+ */
+ page->pfmemalloc = !!(alloc_flags & ALLOC_PFMEMALLOC);
+
+ return page;
}
/*
I think this is slightly inconsistent if the page allocation succeeded
without needing ALLOC_NO_WATERMARKS, meaning that page was allocated above
the min watermark. That's possible if the slowpath's first call to
get_page_from_freelist() succeeds without needing
__alloc_pages_high_priority(). So perhaps we need to do something like

got_pg_memalloc:
...
page->pfmemalloc = !!(alloc_flags & ALLOC_PFMEMALLOC);
got_pg:
if (kmemcheck_enabled)
kmemcheck_pagealloc_alloc(page, order, gfp_mask);
return page;

and use got_pg_memalloc everywhere we currently use got_pg other than
when it succeeds with ALLOC_NO_WATERMARKS.
Post by Mel Gorman
page = __alloc_pages_slowpath(gfp_mask, order,
zonelist, high_zoneidx, nodemask,
preferred_zone, migratetype);
+ else
+ page->pfmemalloc = false;
trace_mm_page_alloc(page, order, gfp_mask, migratetype);
Mel Gorman
2012-04-25 15:05:56 UTC
Post by David Rientjes
Post by Mel Gorman
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 280eabe..0fa2c72 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
#define ALLOC_HARDER 0x10 /* try to alloc harder */
#define ALLOC_HIGH 0x20 /* __GFP_HIGH set */
#define ALLOC_CPUSET 0x40 /* check for correct cpuset */
+#define ALLOC_PFMEMALLOC 0x80 /* Caller has PF_MEMALLOC set */
#ifdef CONFIG_FAIL_PAGE_ALLOC
@@ -2208,16 +2209,22 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
} else if (unlikely(rt_task(current)) && !in_interrupt())
alloc_flags |= ALLOC_HARDER;
- if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
- if (!in_interrupt() &&
- ((current->flags & PF_MEMALLOC) ||
- unlikely(test_thread_flag(TIF_MEMDIE))))
+ if ((current->flags & PF_MEMALLOC) ||
+ unlikely(test_thread_flag(TIF_MEMDIE))) {
+ alloc_flags |= ALLOC_PFMEMALLOC;
+
+ if (likely(!(gfp_mask & __GFP_NOMEMALLOC)) && !in_interrupt())
alloc_flags |= ALLOC_NO_WATERMARKS;
}
return alloc_flags;
}
+bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
+{
+ return !!(gfp_to_alloc_flags(gfp_mask) & ALLOC_PFMEMALLOC);
+}
+
static inline struct page *
__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist, enum zone_type high_zoneidx,
if (kmemcheck_enabled)
kmemcheck_pagealloc_alloc(page, order, gfp_mask);
- return page;
+ /*
+ * page->pfmemalloc is set when the caller had PFMEMALLOC set or is
+ * been OOM killed. The expectation is that the caller is taking
+ * steps that will free more memory. The caller should avoid the
+ * page being used for !PFMEMALLOC purposes.
+ */
+ page->pfmemalloc = !!(alloc_flags & ALLOC_PFMEMALLOC);
+
+ return page;
}
/*
I think this is slightly inconsistent if the page allocation succeeded
without needing ALLOC_NO_WATERMARKS, meaning that page was allocated above
the min watermark. That's possible if the slowpath's first call to
get_page_from_freelist() succeeds without needing __alloc_pages_high_priority().
So perhaps we need to do something like
...
page->pfmemalloc = !!(alloc_flags & ALLOC_PFMEMALLOC);
if (kmemcheck_enabled)
kmemcheck_pagealloc_alloc(page, order, gfp_mask);
return page;
and use got_pg_memalloc everywhere we currently use got_pg other than the
when it succeeds with ALLOC_NO_WATERMARKS.
In this path, we are under memory pressure and at least below the low
watermark. kswapd is woken up and should start reclaiming pages. The
calling process may enter direct reclaim. As we are paging out and some
of that could be over the network, the pages are preserved until the
paging is complete. The problem is that in the context of this patch,
ALLOC_NO_WATERMARKS is not being set for all the cases we need, so what
the code does at this point makes sense.

That said, I see your point. The changelog is inconsistent with what
the code is doing. I updated the description to read "When this patch is
applied, pages allocated from below the low watermark are returned with
page->pfmemalloc set and it is up to the caller to determine how the page
should be protected."

You are also right in that the series as a whole is actually protecting pages
below the low watermark and not below the min watermark as stated. I'll add a
new patch that will be patch 6 in the series that only sets page->pfmemalloc
when ALLOC_NO_WATERMARKS is used and test that.

One could argue that the patch "mm: Introduce __GFP_MEMALLOC to allow access
to emergency reserves" should be updated but I think a separate patch will
be easier to review and also easier to revert if regressions are reported.

Well spotted and thanks for reviewing.

This is not tested but is this what you had in mind? It only sets
page->pfmemalloc if ALLOC_NO_WATERMARKS was necessary to satisfy the
allocation. To do that, ALLOC_NO_WATERMARKS has to be stripped from
__alloc_pages_direct_reclaim and friends but it should goto rebalance later.

---8<---
mm: Only set page->pfmemalloc when ALLOC_NO_WATERMARKS was used

__alloc_pages_slowpath() is called when the number of free pages is below
the low watermark. If the caller is entitled to use ALLOC_NO_WATERMARKS
then the page will be marked page->pfmemalloc. This protects more pages
than are strictly necessary as we only need to protect pages allocated
below the min watermark (the pfmemalloc reserves).

This patch only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was
required to allocate the page.

[***@google.com: David noticed the problem during review]
Signed-off-by: Mel Gorman <***@suse.de>
---
mm/page_alloc.c | 27 ++++++++++++++-------------
1 file changed, 14 insertions(+), 13 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3738596..363ca90 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2043,8 +2043,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,

page = get_page_from_freelist(gfp_mask, nodemask,
order, zonelist, high_zoneidx,
- alloc_flags, preferred_zone,
- migratetype);
+ alloc_flags & ~ALLOC_NO_WATERMARKS,
+ preferred_zone, migratetype);
if (page) {
preferred_zone->compact_considered = 0;
preferred_zone->compact_defer_shift = 0;
@@ -2124,8 +2124,8 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
retry:
page = get_page_from_freelist(gfp_mask, nodemask, order,
zonelist, high_zoneidx,
- alloc_flags, preferred_zone,
- migratetype);
+ alloc_flags & ~ALLOC_NO_WATERMARKS,
+ preferred_zone, migratetype);

/*
* If an allocation failed after direct reclaim, it could be because
@@ -2296,8 +2296,17 @@ rebalance:
page = __alloc_pages_high_priority(gfp_mask, order,
zonelist, high_zoneidx, nodemask,
preferred_zone, migratetype);
- if (page)
+ if (page) {
+ /*
+ * page->pfmemalloc is set when ALLOC_NO_WATERMARKS was
+ * necessary to allocate the page. The expectation is
+ * that the caller is taking steps that will free more
+ * memory. The caller should avoid the page being used
+ * for !PFMEMALLOC purposes.
+ */
+ page->pfmemalloc = true;
goto got_pg;
+ }
}

/* Atomic allocations - we can't balance anything */
@@ -2414,14 +2423,6 @@ nopage:
warn_alloc_failed(gfp_mask, order, NULL);
return page;
got_pg:
- /*
- * page->pfmemalloc is set when the caller had PFMEMALLOC set, is
- * been OOM killed or specified __GFP_MEMALLOC. The expectation is
- * that the caller is taking steps that will free more memory. The
- * caller should avoid the page being used for !PFMEMALLOC purposes.
- */
- page->pfmemalloc = !!(alloc_flags & ALLOC_NO_WATERMARKS);
-
if (kmemcheck_enabled)
kmemcheck_pagealloc_alloc(page, order, gfp_mask);


Mel Gorman
2012-04-16 12:16:50 UTC
From: Christoph Lameter <***@linux.com>

This patch removes the check for pfmemalloc from the alloc hotpath and
puts the logic after the election of a new per cpu slab. For a pfmemalloc
page we do not use the fast path but force the use of the slow path which
is also used for the debug case.

This has the side-effect of weakening pfmemalloc processing in the
following way:

1. A process that is allocating for network swap calls __slab_alloc.
pfmemalloc_match is true so the freelist is loaded and c->freelist is
now pointing to a pfmemalloc page.

2. A process that is attempting normal allocations calls slab_alloc,
finds the pfmemalloc page on the freelist and uses it because it did
not check pfmemalloc_match().

The patch allows non-pfmemalloc allocations to use pfmemalloc pages with
the kmalloc slabs being the most vulnerable caches on the grounds they
are most likely to have a mix of pfmemalloc and !pfmemalloc requests. A
later patch will still protect the system as processes will get throttled
if the pfmemalloc reserves get depleted but performance will not degrade
as smoothly.

[***@suse.de: Expanded changelog]
Signed-off-by: Christoph Lameter <***@linux.com>
Signed-off-by: Mel Gorman <***@suse.de>
---
mm/slub.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index f0909bf..f8cbec4 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2298,11 +2298,11 @@ new_slab:
}
}

- if (likely(!kmem_cache_debug(s)))
+ if (likely(!kmem_cache_debug(s) && pfmemalloc_match(c, gfpflags)))
goto load_freelist;

/* Only entered in the debug case */
- if (!alloc_debug_processing(s, c->page, object, addr))
+ if (kmem_cache_debug(s) && !alloc_debug_processing(s, c->page, object, addr))
goto new_slab; /* Slab failed checks. Next slab needed */

c->freelist = get_freepointer(s, object);
@@ -2352,8 +2352,7 @@ redo:
barrier();

object = c->freelist;
- if (unlikely(!object || !node_match(c, node) ||
- !pfmemalloc_match(c, gfpflags)))
+ if (unlikely(!object || !node_match(c, node)))
object = __slab_alloc(s, gfpflags, node, addr, c);

else {
--
1.7.9.2

Mel Gorman
2012-04-16 12:16:51 UTC
__GFP_MEMALLOC will allow the allocation to disregard the watermarks,
much like PF_MEMALLOC. It allows one to pass along the memalloc state
in object related allocation flags as opposed to task related flags,
such as sk->sk_allocation. This removes the need for ALLOC_PFMEMALLOC
as callers using __GFP_MEMALLOC can get the ALLOC_NO_WATERMARKS flag
which is now enough to identify allocations related to page reclaim.
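
For example, an allocation on the reclaim path can then opt into the
reserves without touching task flags (a hedged usage sketch, not a hunk
from this patch):

        /* may dip into the emergency reserves; __GFP_NOMEMALLOC, if also
         * set, takes precedence and disables this */
        skb = alloc_skb(size, GFP_ATOMIC | __GFP_MEMALLOC);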

Signed-off-by: Peter Zijlstra <***@chello.nl>
Signed-off-by: Mel Gorman <***@suse.de>
---
include/linux/gfp.h | 10 ++++++++--
include/linux/mm_types.h | 2 +-
include/trace/events/gfpflags.h | 1 +
mm/page_alloc.c | 14 ++++++--------
mm/slab.c | 2 +-
5 files changed, 17 insertions(+), 12 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 581e74b..94af4a2 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -23,6 +23,7 @@ struct vm_area_struct;
#define ___GFP_REPEAT 0x400u
#define ___GFP_NOFAIL 0x800u
#define ___GFP_NORETRY 0x1000u
+#define ___GFP_MEMALLOC 0x2000u
#define ___GFP_COMP 0x4000u
#define ___GFP_ZERO 0x8000u
#define ___GFP_NOMEMALLOC 0x10000u
@@ -76,9 +77,14 @@ struct vm_area_struct;
#define __GFP_REPEAT ((__force gfp_t)___GFP_REPEAT) /* See above */
#define __GFP_NOFAIL ((__force gfp_t)___GFP_NOFAIL) /* See above */
#define __GFP_NORETRY ((__force gfp_t)___GFP_NORETRY) /* See above */
+#define __GFP_MEMALLOC ((__force gfp_t)___GFP_MEMALLOC)/* Allow access to emergency reserves */
#define __GFP_COMP ((__force gfp_t)___GFP_COMP) /* Add compound page metadata */
#define __GFP_ZERO ((__force gfp_t)___GFP_ZERO) /* Return zeroed page on success */
-#define __GFP_NOMEMALLOC ((__force gfp_t)___GFP_NOMEMALLOC) /* Don't use emergency reserves */
+#define __GFP_NOMEMALLOC ((__force gfp_t)___GFP_NOMEMALLOC) /* Don't use emergency reserves.
+ * This takes precedence over the
+ * __GFP_MEMALLOC flag if both are
+ * set
+ */
#define __GFP_HARDWALL ((__force gfp_t)___GFP_HARDWALL) /* Enforce hardwall cpuset memory allocs */
#define __GFP_THISNODE ((__force gfp_t)___GFP_THISNODE)/* No fallback, no policies */
#define __GFP_RECLAIMABLE ((__force gfp_t)___GFP_RECLAIMABLE) /* Page is reclaimable */
@@ -129,7 +135,7 @@ struct vm_area_struct;
/* Control page allocator reclaim behavior */
#define GFP_RECLAIM_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS|\
__GFP_NOWARN|__GFP_REPEAT|__GFP_NOFAIL|\
- __GFP_NORETRY|__GFP_NOMEMALLOC)
+ __GFP_NORETRY|__GFP_MEMALLOC|__GFP_NOMEMALLOC)

/* Control slab gfp mask during early boot */
#define GFP_BOOT_MASK (__GFP_BITS_MASK & ~(__GFP_WAIT|__GFP_IO|__GFP_FS))
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 56a465f..7718903 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -54,7 +54,7 @@ struct page {
pgoff_t index; /* Our offset within mapping. */
void *freelist; /* slub first free object */
bool pfmemalloc; /* If set by the page allocator,
- * ALLOC_PFMEMALLOC was set
+ * ALLOC_NO_WATERMARKS was set
* and the low watermark was not
* met implying that the system
* is under some pressure. The
diff --git a/include/trace/events/gfpflags.h b/include/trace/events/gfpflags.h
index 9fe3a366..d6fd8e5 100644
--- a/include/trace/events/gfpflags.h
+++ b/include/trace/events/gfpflags.h
@@ -30,6 +30,7 @@
{(unsigned long)__GFP_COMP, "GFP_COMP"}, \
{(unsigned long)__GFP_ZERO, "GFP_ZERO"}, \
{(unsigned long)__GFP_NOMEMALLOC, "GFP_NOMEMALLOC"}, \
+ {(unsigned long)__GFP_MEMALLOC, "GFP_MEMALLOC"}, \
{(unsigned long)__GFP_HARDWALL, "GFP_HARDWALL"}, \
{(unsigned long)__GFP_THISNODE, "GFP_THISNODE"}, \
{(unsigned long)__GFP_RECLAIMABLE, "GFP_RECLAIMABLE"}, \
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0fa2c72..ee9f49c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1463,7 +1463,6 @@ failed:
#define ALLOC_HARDER 0x10 /* try to alloc harder */
#define ALLOC_HIGH 0x20 /* __GFP_HIGH set */
#define ALLOC_CPUSET 0x40 /* check for correct cpuset */
-#define ALLOC_PFMEMALLOC 0x80 /* Caller has PF_MEMALLOC set */

#ifdef CONFIG_FAIL_PAGE_ALLOC

@@ -2209,11 +2208,10 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
} else if (unlikely(rt_task(current)) && !in_interrupt())
alloc_flags |= ALLOC_HARDER;

- if ((current->flags & PF_MEMALLOC) ||
- unlikely(test_thread_flag(TIF_MEMDIE))) {
- alloc_flags |= ALLOC_PFMEMALLOC;
-
- if (likely(!(gfp_mask & __GFP_NOMEMALLOC)) && !in_interrupt())
+ if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
+ if (gfp_mask & __GFP_MEMALLOC)
+ alloc_flags |= ALLOC_NO_WATERMARKS;
+ else if (likely(!(gfp_mask & __GFP_NOMEMALLOC)) && !in_interrupt())
alloc_flags |= ALLOC_NO_WATERMARKS;
}

@@ -2222,7 +2220,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)

bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
{
- return !!(gfp_to_alloc_flags(gfp_mask) & ALLOC_PFMEMALLOC);
+ return !!(gfp_to_alloc_flags(gfp_mask) & ALLOC_NO_WATERMARKS);
}

static inline struct page *
@@ -2421,7 +2419,7 @@ got_pg:
* steps that will free more memory. The caller should avoid the
* page being used for !PFMEMALLOC purposes.
*/
- page->pfmemalloc = !!(alloc_flags & ALLOC_PFMEMALLOC);
+ page->pfmemalloc = !!(alloc_flags & ALLOC_NO_WATERMARKS);

return page;
}
diff --git a/mm/slab.c b/mm/slab.c
index 7f483d6..8e660b3 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1930,7 +1930,7 @@ static void *kmem_getpages(struct kmem_cache *cachep, gfp_t flags, int nodeid)
return NULL;
}

- /* Record if ALLOC_PFMEMALLOC was set when allocating the slab */
+ /* Record if ALLOC_NO_WATERMARKS was set when allocating the slab */
if (unlikely(page->pfmemalloc))
pfmemalloc_active = true;
--
1.7.9.2

Mel Gorman
2012-04-16 12:16:53 UTC
The reserve is proportionally distributed over all !highmem zones
in the system. So we need to allow an emergency allocation access to
all zones. In order to do that we need to break out of any mempolicy
boundaries we might have.

In my opinion that does not break mempolicies as those are user
oriented and not system oriented. That is, system allocations are
not guaranteed to be within mempolicy boundaries. For instance IRQs
do not even have a mempolicy.

So breaking out of mempolicy boundaries for 'rare' emergency
allocations, which are always system allocations (as opposed to user
allocations), is OK.

Signed-off-by: Peter Zijlstra <***@chello.nl>
Signed-off-by: Mel Gorman <***@suse.de>
---
mm/page_alloc.c | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 15c0ff7..0d7bea5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2293,6 +2293,13 @@ rebalance:

/* Allocate without watermarks if the context allows */
if (alloc_flags & ALLOC_NO_WATERMARKS) {
+ /*
+ * Ignore mempolicies if ALLOC_NO_WATERMARKS on the grounds
+ * the allocation is high priority and these type of
+ * allocations are system rather than user orientated
+ */
+ zonelist = node_zonelist(numa_node_id(), gfp_mask);
+
page = __alloc_pages_high_priority(gfp_mask, order,
zonelist, high_zoneidx, nodemask,
preferred_zone, migratetype);
--
1.7.9.2

Mel Gorman
2012-04-16 12:16:52 UTC
This is needed to allow network softirq packet processing to make
use of PF_MEMALLOC.

Currently softirq context cannot use PF_MEMALLOC due to it not being
associated with a task, and therefore not having task flags to fiddle
with - thus the gfp to alloc flag mapping ignores the task flags when
in interrupt (hard or soft) context.

Allowing softirqs to make use of PF_MEMALLOC therefore requires some
trickery. We basically borrow the task flags from whatever process
happens to be preempted by the softirq.

So we modify the gfp to alloc flags mapping to not exclude task flags
in softirq context, and modify the softirq code to save, clear and
restore the PF_MEMALLOC flag.

The save and clear ensure that the preempted task's PF_MEMALLOC flag
doesn't leak into the softirq. The restore ensures that a softirq's
PF_MEMALLOC flag cannot leak back into the preempted process.
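
The borrowed-flags pattern used in __do_softirq() below boils down to:

        /* sketch of the save/clear/restore added by the diff that follows */
        unsigned long pflags = current->flags;
        current->flags &= ~PF_MEMALLOC;
        /* ... run the softirq handlers ... */
        tsk_restore_flags(current, pflags, PF_MEMALLOC);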

Signed-off-by: Peter Zijlstra <***@chello.nl>
Signed-off-by: Mel Gorman <***@suse.de>
---
include/linux/sched.h | 7 +++++++
kernel/softirq.c | 3 +++
mm/page_alloc.c | 6 +++++-
3 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 81a173c..2ad408f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1913,6 +1913,13 @@ static inline void rcu_copy_process(struct task_struct *p)

#endif

+static inline void tsk_restore_flags(struct task_struct *p,
+ unsigned long pflags, unsigned long mask)
+{
+ p->flags &= ~mask;
+ p->flags |= pflags & mask;
+}
+
#ifdef CONFIG_SMP
extern void do_set_cpus_allowed(struct task_struct *p,
const struct cpumask *new_mask);
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 671f959..d349caa 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -210,6 +210,8 @@ asmlinkage void __do_softirq(void)
__u32 pending;
int max_restart = MAX_SOFTIRQ_RESTART;
int cpu;
+ unsigned long pflags = current->flags;
+ current->flags &= ~PF_MEMALLOC;

pending = local_softirq_pending();
account_system_vtime(current);
@@ -265,6 +267,7 @@ restart:

account_system_vtime(current);
__local_bh_enable(SOFTIRQ_OFFSET);
+ tsk_restore_flags(current, pflags, PF_MEMALLOC);
}

#ifndef __ARCH_HAS_DO_SOFTIRQ
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ee9f49c..15c0ff7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2211,7 +2211,11 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
if (gfp_mask & __GFP_MEMALLOC)
alloc_flags |= ALLOC_NO_WATERMARKS;
- else if (likely(!(gfp_mask & __GFP_NOMEMALLOC)) && !in_interrupt())
+ else if (in_serving_softirq() && (current->flags & PF_MEMALLOC))
+ alloc_flags |= ALLOC_NO_WATERMARKS;
+ else if (!in_interrupt() &&
+ ((current->flags & PF_MEMALLOC) ||
+ unlikely(test_thread_flag(TIF_MEMDIE))))
alloc_flags |= ALLOC_NO_WATERMARKS;
}
--
1.7.9.2

Andrew Morton
2012-05-01 22:08:13 UTC
On Mon, 16 Apr 2012 13:16:52 +0100
Post by Mel Gorman
This is needed to allow network softirq packet processing to make
use of PF_MEMALLOC.
hm, why? You just added __GFP_MEMALLOC so we don't need to futz with
PF_MEMALLOC?
Post by Mel Gorman
Currently softirq context cannot use PF_MEMALLOC due to it not being
associated with a task, and therefore not having task flags to fiddle
with - thus the gfp to alloc flag mapping ignores the task flags when
in interrupts (hard or soft) context.
Allowing softirqs to make use of PF_MEMALLOC therefore requires some
trickery. We basically borrow the task flags from whatever process
happens to be preempted by the softirq.
So we modify the gfp to alloc flags mapping to not exclude task flags
in softirq context, and modify the softirq code to save, clear and
restore the PF_MEMALLOC flag.
The save and clear, ensures the preempted task's PF_MEMALLOC flag
doesn't leak into the softirq. The restore ensures a softirq's
PF_MEMALLOC flag cannot leak back into the preempted process.
...
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1913,6 +1913,13 @@ static inline void rcu_copy_process(struct task_struct *p)
#endif
+static inline void tsk_restore_flags(struct task_struct *p,
+ unsigned long pflags, unsigned long mask)
The naming is poor.

p -> "tsk" or "task"
pflags -> "old_flags"
mask -> "flags"
Post by Mel Gorman
+{
+ p->flags &= ~mask;
+ p->flags |= pflags & mask;
+}
+
#ifdef CONFIG_SMP
extern void do_set_cpus_allowed(struct task_struct *p,
const struct cpumask *new_mask);
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 671f959..d349caa 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -210,6 +210,8 @@ asmlinkage void __do_softirq(void)
__u32 pending;
int max_restart = MAX_SOFTIRQ_RESTART;
int cpu;
+ unsigned long pflags = current->flags;
"old_flags"
Post by Mel Gorman
+ current->flags &= ~PF_MEMALLOC;
The line before this one would be a suitable place for a comment!
Post by Mel Gorman
pending = local_softirq_pending();
account_system_vtime(current);
account_system_vtime(current);
__local_bh_enable(SOFTIRQ_OFFSET);
+ tsk_restore_flags(current, pflags, PF_MEMALLOC);
}
...
Mel Gorman
2012-05-02 16:24:03 UTC
Post by Andrew Morton
On Mon, 16 Apr 2012 13:16:52 +0100
Post by Mel Gorman
This is needed to allow network softirq packet processing to make
use of PF_MEMALLOC.
hm, why? You just added __GFP_MEMALLOC so we don't need to futz with
PF_MEMALLOC?
The number of call sites is a problem. In patch 12, PF_MEMALLOC is set
where required. For example it is set in __netif_receive_skb() before it
calls packet_type->func() which is a per-protocol receive function such
as net/ipv4/ip_input.c#ip_rcv(). To use __GFP_MEMALLOC, every allocation
on this path would need to check the skb and set the flag as appropriate
for every protocol. This would make a mess and, seeing as it is needed for
every allocation, it makes more sense to set PF_MEMALLOC.
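
Roughly, what patch 12 does (it is not quoted in this thread, so the
helper names below are assumptions) is the following, inside
__netif_receive_skb():

        /* sketch: borrow reserve access for the per-protocol handlers
         * when the skb is backed by pfmemalloc pages */
        unsigned long pflags = current->flags;

        if (sk_memalloc_socks() && skb_pfmemalloc(skb))
                current->flags |= PF_MEMALLOC;
        /* ... deliver to packet_type->func(), e.g. ip_rcv() ... */
        tsk_restore_flags(current, pflags, PF_MEMALLOC);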
Post by Andrew Morton
Post by Mel Gorman
Currently softirq context cannot use PF_MEMALLOC due to it not being
associated with a task, and therefore not having task flags to fiddle
with - thus the gfp to alloc flag mapping ignores the task flags when
in interrupts (hard or soft) context.
Allowing softirqs to make use of PF_MEMALLOC therefore requires some
trickery. We basically borrow the task flags from whatever process
happens to be preempted by the softirq.
So we modify the gfp to alloc flags mapping to not exclude task flags
in softirq context, and modify the softirq code to save, clear and
restore the PF_MEMALLOC flag.
The save and clear, ensures the preempted task's PF_MEMALLOC flag
doesn't leak into the softirq. The restore ensures a softirq's
PF_MEMALLOC flag cannot leak back into the preempted process.
...
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1913,6 +1913,13 @@ static inline void rcu_copy_process(struct task_struct *p)
#endif
+static inline void tsk_restore_flags(struct task_struct *p,
+ unsigned long pflags, unsigned long mask)
The naming is poor.
p -> "tsk" or "task"
pflags -> "old_flags"
mask -> "flags"
I went with orig_flags instead of old_flags so it reads as "restore the
original task flags".
Post by Andrew Morton
Post by Mel Gorman
+{
+ p->flags &= ~mask;
+ p->flags |= pflags & mask;
+}
+
#ifdef CONFIG_SMP
extern void do_set_cpus_allowed(struct task_struct *p,
const struct cpumask *new_mask);
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 671f959..d349caa 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -210,6 +210,8 @@ asmlinkage void __do_softirq(void)
__u32 pending;
int max_restart = MAX_SOFTIRQ_RESTART;
int cpu;
+ unsigned long pflags = current->flags;
"old_flags"
Post by Mel Gorman
+ current->flags &= ~PF_MEMALLOC;
The line before this one would be a suitable place for a comment!
/*
* Mask out PF_MEMALLOC as the current task context is borrowed for the
* softirq. A softirq handler such as network RX might set PF_MEMALLOC
* again if the socket is related to swap
*/

?
Post by Andrew Morton
Post by Mel Gorman
pending = local_softirq_pending();
account_system_vtime(current);
account_system_vtime(current);
__local_bh_enable(SOFTIRQ_OFFSET);
+ tsk_restore_flags(current, pflags, PF_MEMALLOC);
}
...
--
Mel Gorman
SUSE Labs

Mel Gorman
2012-04-16 12:16:54 UTC
Introduce sk_allocation(); this function allows sock-specific flags to be
injected into each sock-related allocation. It is only used on allocation
paths that may be required for writing pages back to network storage.

Signed-off-by: Peter Zijlstra <***@chello.nl>
Signed-off-by: Mel Gorman <***@suse.de>
---
include/net/sock.h | 5 +++++
net/ipv4/tcp.c | 3 ++-
net/ipv4/tcp_output.c | 16 +++++++++-------
net/ipv6/tcp_ipv6.c | 8 +++++---
4 files changed, 21 insertions(+), 11 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index a6ba1f8..ba8a339 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -643,6 +643,11 @@ static inline int sock_flag(struct sock *sk, enum sock_flags flag)
return test_bit(flag, &sk->sk_flags);
}

+static inline gfp_t sk_allocation(struct sock *sk, gfp_t gfp_mask)
+{
+ return gfp_mask;
+}
+
static inline void sk_acceptq_removed(struct sock *sk)
{
sk->sk_ack_backlog--;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 5d54ed3..bb17467 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -698,7 +698,8 @@ struct sk_buff *sk_stream_alloc_skb(struct sock *sk, int size, gfp_t gfp)
/* The TCP header must be at least 32-bit aligned. */
size = ALIGN(size, 4);

- skb = alloc_skb_fclone(size + sk->sk_prot->max_header, gfp);
+ skb = alloc_skb_fclone(size + sk->sk_prot->max_header,
+ sk_allocation(sk, gfp));
if (skb) {
if (sk_wmem_schedule(sk, skb->truesize)) {
/*
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 364784a..45c69d9 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2345,7 +2345,7 @@ void tcp_send_fin(struct sock *sk)
/* Socket is locked, keep trying until memory is available. */
for (;;) {
skb = alloc_skb_fclone(MAX_TCP_HEADER,
- sk->sk_allocation);
+ sk_allocation(sk, sk->sk_allocation));
if (skb)
break;
yield();
@@ -2371,7 +2371,7 @@ void tcp_send_active_reset(struct sock *sk, gfp_t priority)
struct sk_buff *skb;

/* NOTE: No TCP options attached and we never retransmit this. */
- skb = alloc_skb(MAX_TCP_HEADER, priority);
+ skb = alloc_skb(MAX_TCP_HEADER, sk_allocation(sk, priority));
if (!skb) {
NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPABORTFAILED);
return;
@@ -2444,7 +2444,8 @@ struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst,

if (cvp != NULL && cvp->s_data_constant && cvp->s_data_desired)
s_data_desired = cvp->s_data_desired;
- skb = sock_wmalloc(sk, MAX_TCP_HEADER + 15 + s_data_desired, 1, GFP_ATOMIC);
+ skb = sock_wmalloc(sk, MAX_TCP_HEADER + 15 + s_data_desired, 1,
+ sk_allocation(sk, GFP_ATOMIC));
if (skb == NULL)
return NULL;

@@ -2634,7 +2635,8 @@ int tcp_connect(struct sock *sk)

tcp_connect_init(sk);

- buff = alloc_skb_fclone(MAX_TCP_HEADER + 15, sk->sk_allocation);
+ buff = alloc_skb_fclone(MAX_TCP_HEADER + 15,
+ sk_allocation(sk, sk->sk_allocation));
if (unlikely(buff == NULL))
return -ENOBUFS;

@@ -2740,7 +2742,7 @@ void tcp_send_ack(struct sock *sk)
* tcp_transmit_skb() will set the ownership to this
* sock.
*/
- buff = alloc_skb(MAX_TCP_HEADER, GFP_ATOMIC);
+ buff = alloc_skb(MAX_TCP_HEADER, sk_allocation(sk, GFP_ATOMIC));
if (buff == NULL) {
inet_csk_schedule_ack(sk);
inet_csk(sk)->icsk_ack.ato = TCP_ATO_MIN;
@@ -2755,7 +2757,7 @@ void tcp_send_ack(struct sock *sk)

/* Send it off, this clears delayed acks for us. */
TCP_SKB_CB(buff)->when = tcp_time_stamp;
- tcp_transmit_skb(sk, buff, 0, GFP_ATOMIC);
+ tcp_transmit_skb(sk, buff, 0, sk_allocation(sk, GFP_ATOMIC));
}

/* This routine sends a packet with an out of date sequence
@@ -2775,7 +2777,7 @@ static int tcp_xmit_probe_skb(struct sock *sk, int urgent)
struct sk_buff *skb;

/* We don't queue it, tcp_transmit_skb() sets ownership. */
- skb = alloc_skb(MAX_TCP_HEADER, GFP_ATOMIC);
+ skb = alloc_skb(MAX_TCP_HEADER, sk_allocation(sk, GFP_ATOMIC));
if (skb == NULL)
return -1;

diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 12c6ece..765a25a 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1352,7 +1352,8 @@ static struct sock * tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
/* Clone pktoptions received with SYN */
newnp->pktoptions = NULL;
if (treq->pktopts != NULL) {
- newnp->pktoptions = skb_clone(treq->pktopts, GFP_ATOMIC);
+ newnp->pktoptions = skb_clone(treq->pktopts,
+ sk_allocation(sk, GFP_ATOMIC));
kfree_skb(treq->pktopts);
treq->pktopts = NULL;
if (newnp->pktoptions)
@@ -1401,7 +1402,8 @@ static struct sock * tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
* across. Shucks.
*/
tcp_md5_do_add(newsk, (union tcp_md5_addr *)&newnp->daddr,
- AF_INET6, key->key, key->keylen, GFP_ATOMIC);
+ AF_INET6, key->key, key->keylen,
+ sk_allocation(sk, GFP_ATOMIC));
}
#endif

@@ -1496,7 +1498,7 @@ static int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb)
--ANK (980728)
*/
if (np->rxopt.all)
- opt_skb = skb_clone(skb, GFP_ATOMIC);
+ opt_skb = skb_clone(skb, sk_allocation(sk, GFP_ATOMIC));

if (sk->sk_state == TCP_ESTABLISHED) { /* Fast path */
sock_rps_save_rxhash(sk, skb);
--
1.7.9.2

Mel Gorman
2012-04-16 12:16:55 UTC
Permalink
Allow specific sockets to be tagged SOCK_MEMALLOC and use
__GFP_MEMALLOC for their allocations. These sockets will be able to go
below watermarks and allocate from the emergency reserve. Such sockets
are to be used to service the VM (iow. to swap over). They must be
handled kernel side; exposing such a socket to user-space is a bug.

There is a risk that the reserves will be depleted, so for now the
administrator is responsible for increasing min_free_kbytes as
necessary to prevent deadlock for their workloads.
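
As a usage sketch (the surrounding driver context is hypothetical; a
real user is wired up in the NBD patch later in this series):

/* A kernel-side driver that services swap marks its transport socket
 * once; sk_allocation() then ORs __GFP_MEMALLOC into every allocation
 * for that socket until sk_clear_memalloc() is called.
 */
static void example_enable_swap_socket(struct socket *sock)
{
	sk_set_memalloc(sock->sk);
}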

[***@chello.nl: Original patches]
Signed-off-by: Mel Gorman <***@suse.de>
---
include/net/sock.h | 5 ++++-
net/core/sock.c | 22 ++++++++++++++++++++++
2 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index ba8a339..d8df4a1 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -606,6 +606,7 @@ enum sock_flags {
SOCK_RCVTSTAMPNS, /* %SO_TIMESTAMPNS setting */
SOCK_LOCALROUTE, /* route locally only, %SO_DONTROUTE setting */
SOCK_QUEUE_SHRUNK, /* write queue has been shrunk recently */
+ SOCK_MEMALLOC, /* VM depends on this socket for swapping */
SOCK_TIMESTAMPING_TX_HARDWARE, /* %SOF_TIMESTAMPING_TX_HARDWARE */
SOCK_TIMESTAMPING_TX_SOFTWARE, /* %SOF_TIMESTAMPING_TX_SOFTWARE */
SOCK_TIMESTAMPING_RX_HARDWARE, /* %SOF_TIMESTAMPING_RX_HARDWARE */
@@ -645,7 +646,7 @@ static inline int sock_flag(struct sock *sk, enum sock_flags flag)

static inline gfp_t sk_allocation(struct sock *sk, gfp_t gfp_mask)
{
- return gfp_mask;
+ return gfp_mask | (sk->sk_allocation & __GFP_MEMALLOC);
}

static inline void sk_acceptq_removed(struct sock *sk)
@@ -786,6 +787,8 @@ extern int sk_stream_wait_memory(struct sock *sk, long *timeo_p);
extern void sk_stream_wait_close(struct sock *sk, long timeo_p);
extern int sk_stream_error(struct sock *sk, int flags, int err);
extern void sk_stream_kill_queues(struct sock *sk);
+extern void sk_set_memalloc(struct sock *sk);
+extern void sk_clear_memalloc(struct sock *sk);

extern int sk_wait_data(struct sock *sk, long *timeo);

diff --git a/net/core/sock.c b/net/core/sock.c
index b2e14c0..b0f0613 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -266,6 +266,28 @@ __u32 sysctl_rmem_default __read_mostly = SK_RMEM_MAX;
int sysctl_optmem_max __read_mostly = sizeof(unsigned long)*(2*UIO_MAXIOV+512);
EXPORT_SYMBOL(sysctl_optmem_max);

+/**
+ * sk_set_memalloc - sets %SOCK_MEMALLOC
+ * @sk: socket to set it on
+ *
+ * Set %SOCK_MEMALLOC on a socket for access to emergency reserves.
+ * It's the responsibility of the admin to adjust min_free_kbytes
+ * to meet the requirements
+ */
+void sk_set_memalloc(struct sock *sk)
+{
+ sock_set_flag(sk, SOCK_MEMALLOC);
+ sk->sk_allocation |= __GFP_MEMALLOC;
+}
+EXPORT_SYMBOL_GPL(sk_set_memalloc);
+
+void sk_clear_memalloc(struct sock *sk)
+{
+ sock_reset_flag(sk, SOCK_MEMALLOC);
+ sk->sk_allocation &= ~__GFP_MEMALLOC;
+}
+EXPORT_SYMBOL_GPL(sk_clear_memalloc);
+
#if defined(CONFIG_CGROUPS)
#if !defined(CONFIG_NET_CLS_CGROUP)
int net_cls_subsys_id = -1;
--
1.7.9.2

Mel Gorman
2012-04-16 12:16:56 UTC
Permalink
Change the skb allocation API to indicate RX usage and use this to fall
back to the PFMEMALLOC reserve when needed. SKBs allocated from the
reserve are tagged in skb->pfmemalloc. If an SKB is allocated from
the reserve and the socket is later found to be unrelated to page
reclaim, the packet is dropped so that the memory remains available
for page reclaim. Network protocols are expected to recover from this
packet loss.
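
The receive-side contract can be sketched as follows; the delivery
function here is hypothetical and the real check added by this patch
lives in sk_filter():

/* Packets that borrowed pfmemalloc reserves are only delivered to
 * sockets helping the VM make progress; for any other socket the
 * packet is dropped so the memory can be reclaimed, and the protocol
 * is expected to recover from the loss.
 */
static int example_deliver(struct sock *sk, struct sk_buff *skb)
{
	if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC))
		return -ENOMEM;
	return sk->sk_backlog_rcv(sk, skb);
}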

[***@chello.nl: Ideas taken from various patches]
Signed-off-by: Mel Gorman <***@suse.de>
---
include/linux/gfp.h | 3 ++
include/linux/skbuff.h | 17 +++++++--
include/net/sock.h | 6 ++++
mm/internal.h | 3 --
net/core/filter.c | 8 +++++
net/core/skbuff.c | 93 ++++++++++++++++++++++++++++++++++++++++--------
net/core/sock.c | 4 +++
7 files changed, 114 insertions(+), 20 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 94af4a2..83cd7b6 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -385,6 +385,9 @@ void drain_local_pages(void *dummy);
*/
extern gfp_t gfp_allowed_mask;

+/* Returns true if the gfp_mask allows use of ALLOC_NO_WATERMARK */
+bool gfp_pfmemalloc_allowed(gfp_t gfp_mask);
+
extern void pm_restrict_gfp_mask(void);
extern void pm_restore_gfp_mask(void);

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 3337027..e889594 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -464,6 +464,7 @@ struct sk_buff {
#ifdef CONFIG_IPV6_NDISC_NODETYPE
__u8 ndisc_nodetype:2;
#endif
+ __u8 pfmemalloc:1;
__u8 ooo_okay:1;
__u8 l4_rxhash:1;
__u8 wifi_acked_valid:1;
@@ -502,6 +503,15 @@ struct sk_buff {
#include <linux/slab.h>


+#define SKB_ALLOC_FCLONE 0x01
+#define SKB_ALLOC_RX 0x02
+
+/* Returns true if the skb was allocated from PFMEMALLOC reserves */
+static inline bool skb_pfmemalloc(struct sk_buff *skb)
+{
+ return unlikely(skb->pfmemalloc);
+}
+
/*
* skb might have a dst pointer attached, refcounted or not.
* _skb_refdst low order bit is set if refcount was _not_ taken
@@ -559,7 +569,7 @@ extern void kfree_skb(struct sk_buff *skb);
extern void consume_skb(struct sk_buff *skb);
extern void __kfree_skb(struct sk_buff *skb);
extern struct sk_buff *__alloc_skb(unsigned int size,
- gfp_t priority, int fclone, int node);
+ gfp_t priority, int flags, int node);
extern struct sk_buff *build_skb(void *data);
static inline struct sk_buff *alloc_skb(unsigned int size,
gfp_t priority)
@@ -570,7 +580,7 @@ static inline struct sk_buff *alloc_skb(unsigned int size,
static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
gfp_t priority)
{
- return __alloc_skb(size, priority, 1, NUMA_NO_NODE);
+ return __alloc_skb(size, priority, SKB_ALLOC_FCLONE, NUMA_NO_NODE);
}

extern void skb_recycle(struct sk_buff *skb);
@@ -1665,7 +1675,8 @@ static inline void __skb_queue_purge(struct sk_buff_head *list)
static inline struct sk_buff *__dev_alloc_skb(unsigned int length,
gfp_t gfp_mask)
{
- struct sk_buff *skb = alloc_skb(length + NET_SKB_PAD, gfp_mask);
+ struct sk_buff *skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask,
+ SKB_ALLOC_RX, NUMA_NO_NODE);
if (likely(skb))
skb_reserve(skb, NET_SKB_PAD);
return skb;
diff --git a/include/net/sock.h b/include/net/sock.h
index d8df4a1..a936614 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -644,6 +644,12 @@ static inline int sock_flag(struct sock *sk, enum sock_flags flag)
return test_bit(flag, &sk->sk_flags);
}

+extern atomic_t memalloc_socks;
+static inline int sk_memalloc_socks(void)
+{
+ return atomic_read(&memalloc_socks);
+}
+
static inline gfp_t sk_allocation(struct sock *sk, gfp_t gfp_mask)
{
return gfp_mask | (sk->sk_allocation & __GFP_MEMALLOC);
diff --git a/mm/internal.h b/mm/internal.h
index bff60d8..2189af4 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -239,9 +239,6 @@ static inline struct page *mem_map_next(struct page *iter,
#define __paginginit __init
#endif

-/* Returns true if the gfp_mask allows use of ALLOC_NO_WATERMARK */
-bool gfp_pfmemalloc_allowed(gfp_t gfp_mask);
-
/* Memory initialisation debug and verification */
enum mminit_level {
MMINIT_WARNING,
diff --git a/net/core/filter.c b/net/core/filter.c
index 6f755cc..bc620e5 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -82,6 +82,14 @@ int sk_filter(struct sock *sk, struct sk_buff *skb)
int err;
struct sk_filter *filter;

+ /*
+ * If the skb was allocated from pfmemalloc reserves, only
+ * allow SOCK_MEMALLOC sockets to use it as this socket is
+ * helping free memory
+ */
+ if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC))
+ return -ENOMEM;
+
err = security_sock_rcv_skb(sk, skb);
if (err)
return err;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index baf8d28..ec8cb9b 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -146,6 +146,43 @@ static void skb_under_panic(struct sk_buff *skb, int sz, void *here)
BUG();
}

+
+/*
+ * kmalloc_reserve is a wrapper around kmalloc_node_track_caller that tells
+ * the caller if emergency pfmemalloc reserves are being used. If they are and
+ * the socket is later found to be SOCK_MEMALLOC then PFMEMALLOC reserves
+ * may be used. Otherwise, the packet data may be discarded until enough
+ * memory is free
+ */
+#define kmalloc_reserve(size, gfp, node, pfmemalloc) \
+ __kmalloc_reserve(size, gfp, node, _RET_IP_, pfmemalloc)
+void *__kmalloc_reserve(size_t size, gfp_t flags, int node, unsigned long ip,
+ bool *pfmemalloc)
+{
+ void *obj;
+ bool ret_pfmemalloc = false;
+
+ /*
+ * Try a regular allocation, when that fails and we're not entitled
+ * to the reserves, fail.
+ */
+ obj = kmalloc_node_track_caller(size,
+ flags | __GFP_NOMEMALLOC | __GFP_NOWARN,
+ node);
+ if (obj || !(gfp_pfmemalloc_allowed(flags)))
+ goto out;
+
+ /* Try again but now we are using pfmemalloc reserves */
+ ret_pfmemalloc = true;
+ obj = kmalloc_node_track_caller(size, flags, node);
+
+out:
+ if (pfmemalloc)
+ *pfmemalloc = ret_pfmemalloc;
+
+ return obj;
+}
+
/* Allocate a new skbuff. We do this ourselves so we can fill in a few
* 'private' fields and also do memory statistics to find all the
* [BEEP] leaks.
@@ -156,8 +193,10 @@ static void skb_under_panic(struct sk_buff *skb, int sz, void *here)
* __alloc_skb - allocate a network buffer
* @size: size to allocate
* @gfp_mask: allocation mask
- * @fclone: allocate from fclone cache instead of head cache
- * and allocate a cloned (child) skb
+ * @flags: If SKB_ALLOC_FCLONE is set, allocate from fclone cache
+ * instead of head cache and allocate a cloned (child) skb.
+ * If SKB_ALLOC_RX is set, __GFP_MEMALLOC will be used for
+ * allocations in case the data is required for writeback
* @node: numa node to allocate memory on
*
* Allocate a new &sk_buff. The returned buffer has no headroom and a
@@ -168,14 +207,19 @@ static void skb_under_panic(struct sk_buff *skb, int sz, void *here)
* %GFP_ATOMIC.
*/
struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
- int fclone, int node)
+ int flags, int node)
{
struct kmem_cache *cache;
struct skb_shared_info *shinfo;
struct sk_buff *skb;
u8 *data;
+ bool pfmemalloc;
+
+ cache = (flags & SKB_ALLOC_FCLONE)
+ ? skbuff_fclone_cache : skbuff_head_cache;

- cache = fclone ? skbuff_fclone_cache : skbuff_head_cache;
+ if (sk_memalloc_socks() && (flags & SKB_ALLOC_RX))
+ gfp_mask |= __GFP_MEMALLOC;

/* Get the HEAD */
skb = kmem_cache_alloc_node(cache, gfp_mask & ~__GFP_DMA, node);
@@ -190,7 +234,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
*/
size = SKB_DATA_ALIGN(size);
size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
- data = kmalloc_node_track_caller(size, gfp_mask, node);
+ data = kmalloc_reserve(size, gfp_mask, node, &pfmemalloc);
if (!data)
goto nodata;
/* kmalloc(size) might give us more room than requested.
@@ -208,6 +252,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
memset(skb, 0, offsetof(struct sk_buff, tail));
/* Account for allocated memory : skb + skb->head */
skb->truesize = SKB_TRUESIZE(size);
+ skb->pfmemalloc = pfmemalloc;
atomic_set(&skb->users, 1);
skb->head = data;
skb->data = data;
@@ -223,7 +268,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
atomic_set(&shinfo->dataref, 1);
kmemcheck_annotate_variable(shinfo->destructor_arg);

- if (fclone) {
+ if (flags & SKB_ALLOC_FCLONE) {
struct sk_buff *child = skb + 1;
atomic_t *fclone_ref = (atomic_t *) (child + 1);

@@ -233,6 +278,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
atomic_set(fclone_ref, 1);

child->fclone = SKB_FCLONE_UNAVAILABLE;
+ child->pfmemalloc = pfmemalloc;
}
out:
return skb;
@@ -310,7 +356,8 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
{
struct sk_buff *skb;

- skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, 0, NUMA_NO_NODE);
+ skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask,
+ SKB_ALLOC_RX, NUMA_NO_NODE);
if (likely(skb)) {
skb_reserve(skb, NET_SKB_PAD);
skb->dev = dev;
@@ -605,6 +652,7 @@ static void __copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
#if IS_ENABLED(CONFIG_IP_VS)
new->ipvs_property = old->ipvs_property;
#endif
+ new->pfmemalloc = old->pfmemalloc;
new->protocol = old->protocol;
new->mark = old->mark;
new->skb_iif = old->skb_iif;
@@ -763,6 +811,9 @@ struct sk_buff *skb_clone(struct sk_buff *skb, gfp_t gfp_mask)
n->fclone = SKB_FCLONE_CLONE;
atomic_inc(fclone_ref);
} else {
+ if (skb_pfmemalloc(skb))
+ gfp_mask |= __GFP_MEMALLOC;
+
n = kmem_cache_alloc(skbuff_head_cache, gfp_mask);
if (!n)
return NULL;
@@ -799,6 +850,13 @@ static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
skb_shinfo(new)->gso_type = skb_shinfo(old)->gso_type;
}

+static inline int skb_alloc_rx_flag(const struct sk_buff *skb)
+{
+ if (skb_pfmemalloc((struct sk_buff *)skb))
+ return SKB_ALLOC_RX;
+ return 0;
+}
+
/**
* skb_copy - create private copy of an sk_buff
* @skb: buffer to copy
@@ -820,7 +878,8 @@ struct sk_buff *skb_copy(const struct sk_buff *skb, gfp_t gfp_mask)
{
int headerlen = skb_headroom(skb);
unsigned int size = (skb_end_pointer(skb) - skb->head) + skb->data_len;
- struct sk_buff *n = alloc_skb(size, gfp_mask);
+ struct sk_buff *n = __alloc_skb(size, gfp_mask,
+ skb_alloc_rx_flag(skb), NUMA_NO_NODE);

if (!n)
return NULL;
@@ -855,7 +914,8 @@ EXPORT_SYMBOL(skb_copy);
struct sk_buff *__pskb_copy(struct sk_buff *skb, int headroom, gfp_t gfp_mask)
{
unsigned int size = skb_headlen(skb) + headroom;
- struct sk_buff *n = alloc_skb(size, gfp_mask);
+ struct sk_buff *n = __alloc_skb(size, gfp_mask,
+ skb_alloc_rx_flag(skb), NUMA_NO_NODE);

if (!n)
goto out;
@@ -952,7 +1012,10 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
goto adjust_others;
}

- data = kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
+ if (skb_pfmemalloc(skb))
+ gfp_mask |= __GFP_MEMALLOC;
+ data = kmalloc_reserve(size + sizeof(struct skb_shared_info), gfp_mask,
+ NUMA_NO_NODE, NULL);
if (!data)
goto nodata;

@@ -1060,8 +1123,9 @@ struct sk_buff *skb_copy_expand(const struct sk_buff *skb,
/*
* Allocate the copy buffer
*/
- struct sk_buff *n = alloc_skb(newheadroom + skb->len + newtailroom,
- gfp_mask);
+ struct sk_buff *n = __alloc_skb(newheadroom + skb->len + newtailroom,
+ gfp_mask, skb_alloc_rx_flag(skb),
+ NUMA_NO_NODE);
int oldheadroom = skb_headroom(skb);
int head_copy_len, head_copy_off;
int off;
@@ -2727,8 +2791,9 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
skb_release_head_state(nskb);
__skb_push(nskb, doffset);
} else {
- nskb = alloc_skb(hsize + doffset + headroom,
- GFP_ATOMIC);
+ nskb = __alloc_skb(hsize + doffset + headroom,
+ GFP_ATOMIC, skb_alloc_rx_flag(skb),
+ NUMA_NO_NODE);

if (unlikely(!nskb))
goto err;
diff --git a/net/core/sock.c b/net/core/sock.c
index b0f0613..cd2ad2b 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -266,6 +266,8 @@ __u32 sysctl_rmem_default __read_mostly = SK_RMEM_MAX;
int sysctl_optmem_max __read_mostly = sizeof(unsigned long)*(2*UIO_MAXIOV+512);
EXPORT_SYMBOL(sysctl_optmem_max);

+atomic_t memalloc_socks __read_mostly;
+
/**
* sk_set_memalloc - sets %SOCK_MEMALLOC
* @sk: socket to set it on
@@ -278,6 +280,7 @@ void sk_set_memalloc(struct sock *sk)
{
sock_set_flag(sk, SOCK_MEMALLOC);
sk->sk_allocation |= __GFP_MEMALLOC;
+ atomic_inc(&memalloc_socks);
}
EXPORT_SYMBOL_GPL(sk_set_memalloc);

@@ -285,6 +288,7 @@ void sk_clear_memalloc(struct sock *sk)
{
sock_reset_flag(sk, SOCK_MEMALLOC);
sk->sk_allocation &= ~__GFP_MEMALLOC;
+ atomic_dec(&memalloc_socks);
}
EXPORT_SYMBOL_GPL(sk_clear_memalloc);
--
1.7.9.2

Mel Gorman
2012-04-16 12:16:57 UTC
Permalink
The skb->pfmemalloc flag gets set to true if and only if the
PFMEMALLOC reserves were used during the slab allocation of data in
__alloc_skb. If the packet is fragmented, it is possible that pages
will be allocated from the PFMEMALLOC reserve without propagating
this information to the skb. This patch propagates page->pfmemalloc
from pages allocated for fragments to the skb.
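
For context, a hypothetical driver fill path that benefits from the
change:

/* Attach a freshly allocated page as fragment 0. After this patch,
 * skb_fill_page_desc() carries page->pfmemalloc over to
 * skb->pfmemalloc (when the page has no mapping).
 */
static void example_attach_frag(struct sk_buff *skb)
{
	struct page *page = alloc_page(GFP_ATOMIC);

	if (page)
		skb_fill_page_desc(skb, 0, page, 0, PAGE_SIZE);
}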

Signed-off-by: Mel Gorman <***@suse.de>
---
include/linux/skbuff.h | 11 +++++++++++
1 file changed, 11 insertions(+)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index e889594..b01a993 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1227,6 +1227,17 @@ static inline void __skb_fill_page_desc(struct sk_buff *skb, int i,
{
skb_frag_t *frag = &skb_shinfo(skb)->frags[i];

+ /*
+ * Propagate page->pfmemalloc to the skb if we can. The problem is
+ * that not all callers have unique ownership of the page. If
+ * pfmemalloc is set, we check the mapping as a mapping implies
+ * page->index is set (index and pfmemalloc share space).
+ * If it's a valid mapping, we cannot use page->pfmemalloc but we
+ * do not lose pfmemalloc information as the pages would not be
+ * allocated using __GFP_MEMALLOC.
+ */
+ if (page->pfmemalloc && !page->mapping)
+ skb->pfmemalloc = true;
frag->page.p = page;
frag->page_offset = off;
skb_frag_size_set(frag, size);
--
1.7.9.2

Mel Gorman
2012-04-16 12:16:58 UTC
Permalink
The skb->pfmemalloc flag gets set to true if and only if the
PFMEMALLOC reserves were used during the slab allocation of data in
__alloc_skb. If page splitting is used, it is possible that pages will
be allocated from the PFMEMALLOC reserve without propagating this
information to the skb. This patch propagates page->pfmemalloc from
pages allocated for fragments to the skb.

It works by reintroducing and expanding the netdev_alloc_page() API
to take an skb. If the page was allocated from pfmemalloc reserves,
it is automatically copied. If the driver allocates the page before
the skb, it should call propagate_pfmemalloc_skb() after the skb is
allocated to ensure the flag is copied properly.

Failure to do so is not critical. The resulting driver may perform
more slowly if it is used for swap-over-NBD or swap-over-NFS, but it
should not result in failure.
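
Both driver patterns can be sketched as follows; the rx_buf structure
and the names bi, dev and len are hypothetical:

static void example_refill(struct net_device *dev, struct rx_buf *bi)
{
	/* Pattern 1: the skb exists first, so pass it and pfmemalloc
	 * is propagated at allocation time.
	 */
	bi->page = __netdev_alloc_page(GFP_ATOMIC, bi->skb);

	/* Pattern 2: the page is allocated before the skb, so the
	 * flag must be propagated once the skb exists.
	 */
	bi->page = __netdev_alloc_page(GFP_ATOMIC, NULL);
	bi->skb = netdev_alloc_skb(dev, 1500);
	if (bi->skb)
		propagate_pfmemalloc_skb(bi->page, bi->skb);
}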

Signed-off-by: Mel Gorman <***@suse.de>
---
drivers/net/ethernet/chelsio/cxgb4/sge.c | 2 +-
drivers/net/ethernet/chelsio/cxgb4vf/sge.c | 2 +-
drivers/net/ethernet/intel/igb/igb_main.c | 2 +-
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 2 +-
drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 3 +-
drivers/net/usb/cdc-phonet.c | 2 +-
drivers/usb/gadget/f_phonet.c | 2 +-
include/linux/skbuff.h | 55 +++++++++++++++++++++
8 files changed, 63 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/sge.c b/drivers/net/ethernet/chelsio/cxgb4/sge.c
index 2dae795..05f02b3 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/sge.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/sge.c
@@ -528,7 +528,7 @@ static unsigned int refill_fl(struct adapter *adap, struct sge_fl *q, int n,
#endif

while (n--) {
- pg = alloc_page(gfp);
+ pg = __netdev_alloc_page(gfp, NULL);
if (unlikely(!pg)) {
q->alloc_failed++;
break;
diff --git a/drivers/net/ethernet/chelsio/cxgb4vf/sge.c b/drivers/net/ethernet/chelsio/cxgb4vf/sge.c
index 0bd585b..e8a372e 100644
--- a/drivers/net/ethernet/chelsio/cxgb4vf/sge.c
+++ b/drivers/net/ethernet/chelsio/cxgb4vf/sge.c
@@ -653,7 +653,7 @@ static unsigned int refill_fl(struct adapter *adapter, struct sge_fl *fl,

alloc_small_pages:
while (n--) {
- page = alloc_page(gfp | __GFP_NOWARN | __GFP_COLD);
+ page = __netdev_alloc_page(gfp | __GFP_NOWARN, NULL);
if (unlikely(!page)) {
fl->alloc_failed++;
break;
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index 5ec3159..9c2d7f6 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -6229,7 +6229,7 @@ static bool igb_alloc_mapped_page(struct igb_ring *rx_ring,
return true;

if (!page) {
- page = alloc_page(GFP_ATOMIC | __GFP_COLD);
+ page = __netdev_alloc_page(GFP_ATOMIC, bi->skb);
bi->page = page;
if (unlikely(!page)) {
rx_ring->rx_stats.alloc_failed++;
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 3e26b1f9..3a60339 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -1126,7 +1126,7 @@ static bool ixgbe_alloc_mapped_page(struct ixgbe_ring *rx_ring,

/* alloc new page for storage */
if (likely(!page)) {
- page = alloc_pages(GFP_ATOMIC | __GFP_COLD,
+ page = __netdev_alloc_pages(GFP_ATOMIC, bi->skb,
ixgbe_rx_pg_order(rx_ring));
if (unlikely(!page)) {
rx_ring->rx_stats.alloc_rx_page_failed++;
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 307611a..4f1c14f 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -369,7 +369,7 @@ static void ixgbevf_alloc_rx_buffers(struct ixgbevf_adapter *adapter,
if (!bi->page_dma &&
(adapter->flags & IXGBE_FLAG_RX_PS_ENABLED)) {
if (!bi->page) {
- bi->page = alloc_page(GFP_ATOMIC | __GFP_COLD);
+ bi->page = __netdev_alloc_page(GFP_ATOMIC, NULL);
if (!bi->page) {
adapter->alloc_rx_page_failed++;
goto no_buffers;
@@ -403,6 +403,7 @@ static void ixgbevf_alloc_rx_buffers(struct ixgbevf_adapter *adapter,
*/
skb_reserve(skb, NET_IP_ALIGN);

+ propagate_pfmemalloc_skb(bi->page_dma, skb);
bi->skb = skb;
}
if (!bi->dma) {
diff --git a/drivers/net/usb/cdc-phonet.c b/drivers/net/usb/cdc-phonet.c
index 3e41b00..5d46473 100644
--- a/drivers/net/usb/cdc-phonet.c
+++ b/drivers/net/usb/cdc-phonet.c
@@ -130,7 +130,7 @@ static int rx_submit(struct usbpn_dev *pnd, struct urb *req, gfp_t gfp_flags)
struct page *page;
int err;

- page = alloc_page(gfp_flags);
+ page = __netdev_alloc_page(gfp_flags | __GFP_NOMEMALLOC, NULL);
if (!page)
return -ENOMEM;

diff --git a/drivers/usb/gadget/f_phonet.c b/drivers/usb/gadget/f_phonet.c
index 965a629..19adb15 100644
--- a/drivers/usb/gadget/f_phonet.c
+++ b/drivers/usb/gadget/f_phonet.c
@@ -301,7 +301,7 @@ pn_rx_submit(struct f_phonet *fp, struct usb_request *req, gfp_t gfp_flags)
struct page *page;
int err;

- page = alloc_page(gfp_flags);
+ page = __netdev_alloc_page(gfp_flags | __GFP_NOMEMALLOC, NULL);
if (!page)
return -ENOMEM;

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index b01a993..4e31d4a 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1733,6 +1733,61 @@ static inline struct sk_buff *netdev_alloc_skb_ip_align(struct net_device *dev,
return __netdev_alloc_skb_ip_align(dev, length, GFP_ATOMIC);
}

+/*
+ * __netdev_alloc_pages - allocate pages for ps-rx on a specific device
+ * @gfp_mask: alloc_pages_node mask. Set __GFP_NOMEMALLOC if not for network packet RX
+ * @skb: skb to set pfmemalloc on if __GFP_MEMALLOC is used
+ * @order: order of the allocation
+ *
+ * Allocate new pages.
+ *
+ * %NULL is returned if there is no free memory.
+*/
+static inline struct page *__netdev_alloc_pages(gfp_t gfp_mask,
+ struct sk_buff *skb,
+ unsigned int order)
+{
+ struct page *page;
+
+ gfp_mask |= __GFP_COLD;
+
+ if (!(gfp_mask & __GFP_NOMEMALLOC))
+ gfp_mask |= __GFP_MEMALLOC;
+
+ page = alloc_pages_node(NUMA_NO_NODE, gfp_mask, order);
+ if (skb && page && page->pfmemalloc)
+ skb->pfmemalloc = true;
+
+ return page;
+}
+
+/**
+ * __netdev_alloc_page - allocate a page for ps-rx on a specific device
+ * @gfp_mask: alloc_pages_node mask. Set __GFP_NOMEMALLOC if not for network packet RX
+ * @skb: skb to set pfmemalloc on if __GFP_MEMALLOC is used
+ *
+ * Allocate a new page.
+ *
+ * %NULL is returned if there is no free memory.
+ */
+static inline struct page *__netdev_alloc_page(gfp_t gfp_mask,
+ struct sk_buff *skb)
+{
+ return __netdev_alloc_pages(gfp_mask, skb, 0);
+}
+
+/**
+ * propagate_pfmemalloc_skb - Propagate pfmemalloc if skb is allocated after RX page
+ * @page: The page that was allocated from netdev_alloc_page
+ * @skb: The skb that may need pfmemalloc set
+ */
+static inline void propagate_pfmemalloc_skb(struct page *page,
+ struct sk_buff *skb)
+{
+ if (page && page->pfmemalloc)
+ skb->pfmemalloc = true;
+}
+
/**
* skb_frag_page - retrieve the page refered to by a paged fragment
* @frag: the paged fragment
--
1.7.9.2

Mel Gorman
2012-04-16 12:16:59 UTC
Permalink
In order to make sure pfmemalloc packets receive all memory
needed to proceed, ensure processing of pfmemalloc SKBs happens
under PF_MEMALLOC. This is limited to a subset of protocols that
are expected to be used for writing to swap. Taps are not allowed to
use PF_MEMALLOC as these are expected to communicate with userspace
processes which could be paged out.
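
The recurring pattern throughout the diff below is to borrow
PF_MEMALLOC for the duration of processing a single pfmemalloc skb and
then restore the original flag state; process_one_skb() here is a
hypothetical stand-in for the real handlers:

static int example_handle(struct sk_buff *skb)
{
	unsigned long pflags = current->flags;
	int ret;

	current->flags |= PF_MEMALLOC;
	ret = process_one_skb(skb);
	tsk_restore_flags(current, pflags, PF_MEMALLOC);

	return ret;
}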

[***@chello.nl: Ideas taken from various patches]
[***@suse.cz: Lock imbalance fix]
Signed-off-by: Mel Gorman <***@suse.de>
---
include/net/sock.h | 5 +++++
net/core/dev.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++------
net/core/sock.c | 16 ++++++++++++++++
3 files changed, 67 insertions(+), 6 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index a936614..de6ebda 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -728,8 +728,13 @@ static inline __must_check int sk_add_backlog(struct sock *sk, struct sk_buff *s
return 0;
}

+extern int __sk_backlog_rcv(struct sock *sk, struct sk_buff *skb);
+
static inline int sk_backlog_rcv(struct sock *sk, struct sk_buff *skb)
{
+ if (skb_pfmemalloc(skb))
+ return __sk_backlog_rcv(sk, skb);
+
return sk->sk_backlog_rcv(sk, skb);
}

diff --git a/net/core/dev.c b/net/core/dev.c
index c25d453..ca7f264 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3153,6 +3153,23 @@ void netdev_rx_handler_unregister(struct net_device *dev)
}
EXPORT_SYMBOL_GPL(netdev_rx_handler_unregister);

+/*
+ * Limit the use of PFMEMALLOC reserves to those protocols that implement
+ * the special handling of PFMEMALLOC skbs.
+ */
+static bool skb_pfmemalloc_protocol(struct sk_buff *skb)
+{
+ switch (skb->protocol) {
+ case __constant_htons(ETH_P_ARP):
+ case __constant_htons(ETH_P_IP):
+ case __constant_htons(ETH_P_IPV6):
+ case __constant_htons(ETH_P_8021Q):
+ return true;
+ default:
+ return false;
+ }
+}
+
static int __netif_receive_skb(struct sk_buff *skb)
{
struct packet_type *ptype, *pt_prev;
@@ -3162,14 +3179,27 @@ static int __netif_receive_skb(struct sk_buff *skb)
bool deliver_exact = false;
int ret = NET_RX_DROP;
__be16 type;
+ unsigned long pflags = current->flags;

net_timestamp_check(!netdev_tstamp_prequeue, skb);

trace_netif_receive_skb(skb);

+ /*
+ * PFMEMALLOC skbs are special, they should
+ * - be delivered to SOCK_MEMALLOC sockets only
+ * - stay away from userspace
+ * - have bounded memory usage
+ *
+ * Use PF_MEMALLOC as this saves us from propagating the allocation
+ * context down to all allocation sites.
+ */
+ if (skb_pfmemalloc(skb))
+ current->flags |= PF_MEMALLOC;
+
/* if we've gotten here through NAPI, check netpoll */
if (netpoll_receive_skb(skb))
- return NET_RX_DROP;
+ goto out;

if (!skb->skb_iif)
skb->skb_iif = skb->dev->ifindex;
@@ -3190,7 +3220,7 @@ another_round:
if (skb->protocol == cpu_to_be16(ETH_P_8021Q)) {
skb = vlan_untag(skb);
if (unlikely(!skb))
- goto out;
+ goto unlock;
}

#ifdef CONFIG_NET_CLS_ACT
@@ -3200,6 +3230,9 @@ another_round:
}
#endif

+ if (skb_pfmemalloc(skb))
+ goto skip_taps;
+
list_for_each_entry_rcu(ptype, &ptype_all, list) {
if (!ptype->dev || ptype->dev == skb->dev) {
if (pt_prev)
@@ -3208,13 +3241,17 @@ another_round:
}
}

+skip_taps:
#ifdef CONFIG_NET_CLS_ACT
skb = handle_ing(skb, &pt_prev, &ret, orig_dev);
if (!skb)
- goto out;
+ goto unlock;
ncls:
#endif

+ if (skb_pfmemalloc(skb) && !skb_pfmemalloc_protocol(skb))
+ goto drop;
+
rx_handler = rcu_dereference(skb->dev->rx_handler);
if (vlan_tx_tag_present(skb)) {
if (pt_prev) {
@@ -3224,7 +3261,7 @@ ncls:
if (vlan_do_receive(&skb, !rx_handler))
goto another_round;
else if (unlikely(!skb))
- goto out;
+ goto unlock;
}

if (rx_handler) {
@@ -3234,7 +3271,7 @@ ncls:
}
switch (rx_handler(&skb)) {
case RX_HANDLER_CONSUMED:
- goto out;
+ goto unlock;
case RX_HANDLER_ANOTHER:
goto another_round;
case RX_HANDLER_EXACT:
@@ -3264,6 +3301,7 @@ ncls:
if (pt_prev) {
ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
} else {
+drop:
atomic_long_inc(&skb->dev->rx_dropped);
kfree_skb(skb);
/* Jamal, now you will not able to escape explaining
@@ -3272,8 +3310,10 @@ ncls:
ret = NET_RX_DROP;
}

-out:
+unlock:
rcu_read_unlock();
+out:
+ tsk_restore_flags(current, pflags, PF_MEMALLOC);
return ret;
}

diff --git a/net/core/sock.c b/net/core/sock.c
index cd2ad2b..9ad1ed9 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -292,6 +292,22 @@ void sk_clear_memalloc(struct sock *sk)
}
EXPORT_SYMBOL_GPL(sk_clear_memalloc);

+int __sk_backlog_rcv(struct sock *sk, struct sk_buff *skb)
+{
+ int ret;
+ unsigned long pflags = current->flags;
+
+ /* these should have been dropped before queueing */
+ BUG_ON(!sock_flag(sk, SOCK_MEMALLOC));
+
+ current->flags |= PF_MEMALLOC;
+ ret = sk->sk_backlog_rcv(sk, skb);
+ tsk_restore_flags(current, pflags, PF_MEMALLOC);
+
+ return ret;
+}
+EXPORT_SYMBOL(__sk_backlog_rcv);
+
#if defined(CONFIG_CGROUPS)
#if !defined(CONFIG_NET_CLS_CGROUP)
int net_cls_subsys_id = -1;
--
1.7.9.2

Mel Gorman
2012-04-16 12:17:00 UTC
Permalink
Getting and putting objects in SLAB currently requires a function call
but the bulk of the work is related to PFMEMALLOC reserves which are
only consumed when network-backed storage is critical. Use an inline
function to determine if the function call is required.

Signed-off-by: Mel Gorman <***@suse.de>
---
mm/slab.c | 28 ++++++++++++++++++++++++++--
1 file changed, 26 insertions(+), 2 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index 8e660b3..5b2ae1c 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -117,6 +117,8 @@
#include <linux/memory.h>
#include <linux/prefetch.h>

+#include <net/sock.h>
+
#include <asm/cacheflush.h>
#include <asm/tlbflush.h>
#include <asm/page.h>
@@ -1012,7 +1014,7 @@ static void check_ac_pfmemalloc(struct kmem_cache *cachep,
pfmemalloc_active = false;
}

-static void *ac_get_obj(struct kmem_cache *cachep, struct array_cache *ac,
+static void *__ac_get_obj(struct kmem_cache *cachep, struct array_cache *ac,
gfp_t flags, bool force_refill)
{
int i;
@@ -1059,7 +1061,20 @@ static void *ac_get_obj(struct kmem_cache *cachep, struct array_cache *ac,
return objp;
}

-static void ac_put_obj(struct kmem_cache *cachep, struct array_cache *ac,
+static inline void *ac_get_obj(struct kmem_cache *cachep,
+ struct array_cache *ac, gfp_t flags, bool force_refill)
+{
+ void *objp;
+
+ if (unlikely(sk_memalloc_socks()))
+ objp = __ac_get_obj(cachep, ac, flags, force_refill);
+ else
+ objp = ac->entry[--ac->avail];
+
+ return objp;
+}
+
+static void *__ac_put_obj(struct kmem_cache *cachep, struct array_cache *ac,
void *objp)
{
if (unlikely(pfmemalloc_active)) {
@@ -1069,6 +1084,15 @@ static void ac_put_obj(struct kmem_cache *cachep, struct array_cache *ac,
set_obj_pfmemalloc(&objp);
}

+ return objp;
+}
+
+static inline void ac_put_obj(struct kmem_cache *cachep, struct array_cache *ac,
+ void *objp)
+{
+ if (unlikely(sk_memalloc_socks()))
+ objp = __ac_put_obj(cachep, ac, objp);
+
ac->entry[ac->avail++] = objp;
}
--
1.7.9.2

Mel Gorman
2012-04-16 12:17:03 UTC
Permalink
Under significant pressure when writing back to network-backed storage,
direct reclaimers may get throttled. This is expected to be a
short-lived event: the throttled processes are woken up again, but they
do stall in the meantime. This patch counts how many times such
stalling occurs. It's up to the administrator whether to reduce these
stalls by increasing min_free_kbytes.
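
After this patch the counter is visible as "pgscan_direct_throttle" in
/proc/vmstat; a value that climbs steadily under normal load is the
signal that min_free_kbytes should be raised.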

Signed-off-by: Mel Gorman <***@suse.de>
---
include/linux/vm_event_item.h | 1 +
mm/vmscan.c | 3 +++
mm/vmstat.c | 1 +
3 files changed, 5 insertions(+)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 03b90cdc..652e5f3 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -29,6 +29,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
FOR_ALL_ZONES(PGSTEAL),
FOR_ALL_ZONES(PGSCAN_KSWAPD),
FOR_ALL_ZONES(PGSCAN_DIRECT),
+ PGSCAN_DIRECT_THROTTLE,
#ifdef CONFIG_NUMA
PGSCAN_ZONE_RECLAIM_FAILED,
#endif
diff --git a/mm/vmscan.c b/mm/vmscan.c
index fdb63db..ff2322e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2480,6 +2480,9 @@ static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
if (pfmemalloc_watermark_ok(pgdat))
return;

+ /* Account for the throttling */
+ count_vm_event(PGSCAN_DIRECT_THROTTLE);
+
/*
* If the caller cannot enter the filesystem, it's possible that it
* is processing a journal transaction. In this case, it is not safe
diff --git a/mm/vmstat.c b/mm/vmstat.c
index f600557..0fff13d 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -741,6 +741,7 @@ const char * const vmstat_text[] = {
TEXTS_FOR_ZONES("pgsteal")
TEXTS_FOR_ZONES("pgscan_kswapd")
TEXTS_FOR_ZONES("pgscan_direct")
+ "pgscan_direct_throttle",

#ifdef CONFIG_NUMA
"zone_reclaim_failed",
--
1.7.9.2
Mel Gorman
2012-04-16 12:17:01 UTC
Permalink
Set SOCK_MEMALLOC on the NBD socket to allow access to PFMEMALLOC
reserves so pages backed by NBD, particularly if swap related, can
be cleaned to prevent the machine being deadlocked. It is still
possible that the PFMEMALLOC reserves get depleted resulting in
deadlock but this can be resolved by the administrator by increasing
min_free_kbytes.

Signed-off-by: Mel Gorman <***@suse.de>
---
drivers/block/nbd.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 061427a75d..76bc96f 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -154,6 +154,7 @@ static int sock_xmit(struct nbd_device *nbd, int send, void *buf, int size,
struct msghdr msg;
struct kvec iov;
sigset_t blocked, oldset;
+ unsigned long pflags = current->flags;

if (unlikely(!sock)) {
dev_err(disk_to_dev(nbd->disk),
@@ -167,8 +168,9 @@ static int sock_xmit(struct nbd_device *nbd, int send, void *buf, int size,
siginitsetinv(&blocked, sigmask(SIGKILL));
sigprocmask(SIG_SETMASK, &blocked, &oldset);

+ current->flags |= PF_MEMALLOC;
do {
- sock->sk->sk_allocation = GFP_NOIO;
+ sock->sk->sk_allocation = GFP_NOIO | __GFP_MEMALLOC;
iov.iov_base = buf;
iov.iov_len = size;
msg.msg_name = NULL;
@@ -214,6 +216,7 @@ static int sock_xmit(struct nbd_device *nbd, int send, void *buf, int size,
} while (size > 0);

sigprocmask(SIG_SETMASK, &oldset, NULL);
+ tsk_restore_flags(current, pflags, PF_MEMALLOC);

return result;
}
@@ -405,6 +408,7 @@ static int nbd_do_it(struct nbd_device *nbd)

BUG_ON(nbd->magic != NBD_MAGIC);

+ sk_set_memalloc(nbd->sock->sk);
nbd->pid = task_pid_nr(current);
ret = device_create_file(disk_to_dev(nbd->disk), &pid_attr);
if (ret) {
--
1.7.9.2

Mel Gorman
2012-04-16 12:17:02 UTC
Permalink
If swap is backed by network storage such as NBD, there is a risk
that a large number of reclaimers can hang the system by consuming
all PF_MEMALLOC reserves. To avoid these hangs, the administrator
must tune min_free_kbytes in advance which is a bit fragile.

This patch throttles direct reclaimers if half the PF_MEMALLOC reserves
are in use. If the system is routinely getting throttled the system
administrator can increase min_free_kbytes so degradation is smoother
but the system will keep running.
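
A worked example with illustrative numbers: if the min watermarks of
ZONE_DMA through ZONE_NORMAL on a node sum to 16384 pages, then direct
reclaimers are throttled once NR_FREE_PAGES summed over those zones
drops below 16384 / 2 = 8192 pages, and they are woken again once
kswapd restores the watermark.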

Signed-off-by: Mel Gorman <***@suse.de>
---
include/linux/mmzone.h | 1 +
mm/page_alloc.c | 1 +
mm/vmscan.c | 109 ++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 111 insertions(+)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index dff7115..e6b733d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -663,6 +663,7 @@ typedef struct pglist_data {
range, including holes */
int node_id;
wait_queue_head_t kswapd_wait;
+ wait_queue_head_t pfmemalloc_wait;
struct task_struct *kswapd;
int kswapd_max_order;
enum zone_type classzone_idx;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0d7bea5..71802cb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4325,6 +4325,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
pgdat_resize_init(pgdat);
pgdat->nr_zones = 0;
init_waitqueue_head(&pgdat->kswapd_wait);
+ init_waitqueue_head(&pgdat->pfmemalloc_wait);
pgdat->kswapd_max_order = 0;
pgdat_page_cgroup_init(pgdat);

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 33c332b..fdb63db 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2431,6 +2431,73 @@ out:
return 0;
}

+static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
+{
+ struct zone *zone;
+ unsigned long pfmemalloc_reserve = 0;
+ unsigned long free_pages = 0;
+ int i;
+ bool wmark_ok;
+
+ for (i = 0; i <= ZONE_NORMAL; i++) {
+ zone = &pgdat->node_zones[i];
+ pfmemalloc_reserve += min_wmark_pages(zone);
+ free_pages += zone_page_state(zone, NR_FREE_PAGES);
+ }
+
+ wmark_ok = (free_pages > pfmemalloc_reserve / 2) ? true : false;
+
+ /* kswapd must be awake if processes are being throttled */
+ if (!wmark_ok && waitqueue_active(&pgdat->kswapd_wait)) {
+ pgdat->classzone_idx = min(pgdat->classzone_idx,
+ (enum zone_type)ZONE_NORMAL);
+ wake_up_interruptible(&pgdat->kswapd_wait);
+ }
+
+ return wmark_ok;
+}
+
+/*
+ * Throttle direct reclaimers if backing storage is backed by the network
+ * and the PFMEMALLOC reserve for the preferred node is getting dangerously
+ * depleted. kswapd will continue to make progress and wake the processes
+ * when the low watermark is reached
+ */
+static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
+ nodemask_t *nodemask)
+{
+ struct zone *zone;
+ int high_zoneidx = gfp_zone(gfp_mask);
+ pg_data_t *pgdat;
+
+ /* Kernel threads such as kjournald should not be throttled */
+ if (current->flags & PF_KTHREAD)
+ return;
+
+ /* Check if the pfmemalloc reserves are ok */
+ first_zones_zonelist(zonelist, high_zoneidx, NULL, &zone);
+ pgdat = zone->zone_pgdat;
+ if (pfmemalloc_watermark_ok(pgdat))
+ return;
+
+ /*
+ * If the caller cannot enter the filesystem, it's possible that it
+ * is processing a journal transaction. In this case, it is not safe
+ * to block on pfmemalloc_wait as kswapd could also be blocked waiting
+ * to start a transaction. Instead, throttle for up to a second before
+ * the reclaim must continue.
+ */
+ if (!(gfp_mask & __GFP_FS)) {
+ wait_event_interruptible_timeout(pgdat->pfmemalloc_wait,
+ pfmemalloc_watermark_ok(pgdat), HZ);
+ return;
+ }
+
+ /* Throttle until kswapd wakes the process */
+ wait_event_killable(zone->zone_pgdat->pfmemalloc_wait,
+ pfmemalloc_watermark_ok(pgdat));
+}
+
unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
gfp_t gfp_mask, nodemask_t *nodemask)
{
@@ -2449,6 +2516,15 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
.gfp_mask = sc.gfp_mask,
};

+ throttle_direct_reclaim(gfp_mask, zonelist, nodemask);
+
+ /*
+ * Do not enter reclaim if fatal signal is pending. 1 is returned so
+ * that the page allocator does not consider triggering OOM
+ */
+ if (fatal_signal_pending(current))
+ return 1;
+
trace_mm_vmscan_direct_reclaim_begin(order,
sc.may_writepage,
gfp_mask);
@@ -2610,6 +2686,20 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
if (remaining)
return true;

+ /*
+ * There is a potential race between when kswapd checks it watermarks
+ * and a process gets throttled. There is also a potential race if
+ * processes get throttled, kswapd wakes, a large process exits therby
+ * balancing the zones that causes kswapd to miss a wakeup. If kswapd
+ * is going to sleep, no process should be sleeping on pfmemalloc_wait
+ * so wake them now if necessary. If necessary, processes will wake
+ * kswapd and get throttled again
+ */
+ if (waitqueue_active(&pgdat->pfmemalloc_wait)) {
+ wake_up(&pgdat->pfmemalloc_wait);
+ return true;
+ }
+
/* Check the watermark levels */
for (i = 0; i <= classzone_idx; i++) {
struct zone *zone = pgdat->node_zones + i;
@@ -2871,6 +2961,12 @@ loop_again:
}

}
+
+ /* Wake throttled direct reclaimers if low watermark is met */
+ if (waitqueue_active(&pgdat->pfmemalloc_wait) &&
+ pfmemalloc_watermark_ok(pgdat))
+ wake_up(&pgdat->pfmemalloc_wait);
+
if (all_zones_ok || (order && pgdat_balanced(pgdat, balanced, *classzone_idx)))
break; /* kswapd: all done */
/*
@@ -3005,6 +3101,19 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
trace_mm_vmscan_kswapd_sleep(pgdat->node_id);

/*
+ * There is a potential race between when kswapd checks it
+ * watermarks and a process gets throttled. There is also
+ * a potential race if processes get throttled, kswapd wakes,
+ * a large process exits therby balancing the zones that causes
+ * kswapd to miss a wakeup. If kswapd is going to sleep, no
+ * process should be sleeping on pfmemalloc_wait so wake them
+ * now if necessary. If necessary, processes will wake kswapd
+ * and get throttled again
+ */
+ if (waitqueue_active(&pgdat->pfmemalloc_wait))
+ wake_up(&pgdat->pfmemalloc_wait);
+
+ /*
* vmstat counters are not perfectly accurate and the estimated
* value for counters such as NR_FREE_PAGES can deviate from the
* true value by nr_online_cpus * threshold. To avoid the zone
--
1.7.9.2

Andrew Morton
2012-05-01 22:24:37 UTC
Permalink
On Mon, 16 Apr 2012 13:17:02 +0100
Post by Mel Gorman
If swap is backed by network storage such as NBD, there is a risk
that a large number of reclaimers can hang the system by consuming
all PF_MEMALLOC reserves. To avoid these hangs, the administrator
must tune min_free_kbytes in advance which is a bit fragile.
This patch throttles direct reclaimers if half the PF_MEMALLOC reserves
are in use. If the system is routinely getting throttled the system
administrator can increase min_free_kbytes so degradation is smoother
but the system will keep running.
...
+static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
+{
+ struct zone *zone;
+ unsigned long pfmemalloc_reserve = 0;
+ unsigned long free_pages = 0;
+ int i;
+ bool wmark_ok;
+
+ for (i = 0; i <= ZONE_NORMAL; i++) {
+ zone = &pgdat->node_zones[i];
+ pfmemalloc_reserve += min_wmark_pages(zone);
+ free_pages += zone_page_state(zone, NR_FREE_PAGES);
+ }
+
+ wmark_ok = (free_pages > pfmemalloc_reserve / 2) ? true : false;
wmark_ok = free_pages > pfmemalloc_reserve / 2;
Post by Mel Gorman
+
+ /* kswapd must be awake if processes are being throttled */
+ if (!wmark_ok && waitqueue_active(&pgdat->kswapd_wait)) {
+ pgdat->classzone_idx = min(pgdat->classzone_idx,
+ (enum zone_type)ZONE_NORMAL);
+ wake_up_interruptible(&pgdat->kswapd_wait);
+ }
+
+ return wmark_ok;
+}
+
+/*
+ * Throttle direct reclaimers if backing storage is backed by the network
+ * and the PFMEMALLOC reserve for the preferred node is getting dangerously
+ * depleted. kswapd will continue to make progress and wake the processes
+ * when the low watermark is reached
+ */
+static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
+ nodemask_t *nodemask)
+{
+ struct zone *zone;
+ int high_zoneidx = gfp_zone(gfp_mask);
+ pg_data_t *pgdat;
+
+ /* Kernel threads such as kjournald should not be throttled */
The comment should explain "why", not "what". Particularly when the
"what" was bleedin obvious ;)

Also... why?
Post by Mel Gorman
+ if (current->flags & PF_KTHREAD)
+ return;
+
+ /* Check if the pfmemalloc reserves are ok */
+ first_zones_zonelist(zonelist, high_zoneidx, NULL, &zone);
+ pgdat = zone->zone_pgdat;
+ if (pfmemalloc_watermark_ok(pgdat))
+ return;
+
+ /*
+ * If the caller cannot enter the filesystem, it's possible that it
+ * is processing a journal transaction. In this case, it is not safe
+ * to block on pfmemalloc_wait as kswapd could also be blocked waiting
+ * to start a transaction. Instead, throttle for up to a second before
+ * the reclaim must continue.
+ */
I suppose this applies to fs locks in general, not just to
journal_start()?
Post by Mel Gorman
+ if (!(gfp_mask & __GFP_FS)) {
+ wait_event_interruptible_timeout(pgdat->pfmemalloc_wait,
+ pfmemalloc_watermark_ok(pgdat), HZ);
+ return;
+ }
+
+ /* Throttle until kswapd wakes the process */
+ wait_event_killable(zone->zone_pgdat->pfmemalloc_wait,
+ pfmemalloc_watermark_ok(pgdat));
+}
+
unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
gfp_t gfp_mask, nodemask_t *nodemask)
{
...
@@ -2610,6 +2686,20 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
if (remaining)
return true;
+ /*
+ * There is a potential race between when kswapd checks it watermarks
"its"
Post by Mel Gorman
+ * and a process gets throttled. There is also a potential race if
+ * processes get throttled, kswapd wakes, a large process exits therby
+ * balancing the zones that causes kswapd to miss a wakeup. If kswapd
+ * is going to sleep, no process should be sleeping on pfmemalloc_wait
+ * so wake them now if necessary. If necessary, processes will wake
+ * kswapd and get throttled again
+ */
Yes, the possibility for missed wakeups here worried me. There's no
synchronization and it would be easy to leave holes.

It's good that there is no timeout on the throttling - a timeout would
cover up rare races most nastily.
Post by Mel Gorman
+ if (waitqueue_active(&pgdat->pfmemalloc_wait)) {
+ wake_up(&pgdat->pfmemalloc_wait);
+ return true;
+ }
A bool-returning function called "sleeping_prematurely" should have no
side-effects. But it now performs wakeups. Wanna see if there is a
way of making this nicer?
Post by Mel Gorman
/* Check the watermark levels */
for (i = 0; i <= classzone_idx; i++) {
struct zone *zone = pgdat->node_zones + i;
}
}
+
+ /* Wake throttled direct reclaimers if low watermark is met */
s/"what"/"why"/ !
Post by Mel Gorman
+ if (waitqueue_active(&pgdat->pfmemalloc_wait) &&
+ pfmemalloc_watermark_ok(pgdat))
+ wake_up(&pgdat->pfmemalloc_wait);
+
if (all_zones_ok || (order && pgdat_balanced(pgdat, balanced, *classzone_idx)))
break; /* kswapd: all done */
/*
...
Mel Gorman
2012-05-02 16:24:12 UTC
Permalink
Post by Andrew Morton
On Mon, 16 Apr 2012 13:17:02 +0100
Post by Mel Gorman
If swap is backed by network storage such as NBD, there is a risk
that a large number of reclaimers can hang the system by consuming
all PF_MEMALLOC reserves. To avoid these hangs, the administrator
must tune min_free_kbytes in advance which is a bit fragile.
This patch throttles direct reclaimers if half the PF_MEMALLOC reserves
are in use. If the system is routinely getting throttled the system
administrator can increase min_free_kbytes so degradation is smoother
but the system will keep running.
...
+static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
+{
+ struct zone *zone;
+ unsigned long pfmemalloc_reserve = 0;
+ unsigned long free_pages = 0;
+ int i;
+ bool wmark_ok;
+
+ for (i = 0; i <= ZONE_NORMAL; i++) {
+ zone = &pgdat->node_zones[i];
+ pfmemalloc_reserve += min_wmark_pages(zone);
+ free_pages += zone_page_state(zone, NR_FREE_PAGES);
+ }
+
+ wmark_ok = (free_pages > pfmemalloc_reserve / 2) ? true : false;
wmark_ok = free_pages > pfmemalloc_reserve / 2;
Of course, I don't know what I was on when I wrote that particular line.
Post by Andrew Morton
Post by Mel Gorman
+
+ /* kswapd must be awake if processes are being throttled */
+ if (!wmark_ok && waitqueue_active(&pgdat->kswapd_wait)) {
+ pgdat->classzone_idx = min(pgdat->classzone_idx,
+ (enum zone_type)ZONE_NORMAL);
+ wake_up_interruptible(&pgdat->kswapd_wait);
+ }
+
+ return wmark_ok;
+}
+
+/*
+ * Throttle direct reclaimers if backing storage is backed by the network
+ * and the PFMEMALLOC reserve for the preferred node is getting dangerously
+ * depleted. kswapd will continue to make progress and wake the processes
+ * when the low watermark is reached
+ */
+static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
+ nodemask_t *nodemask)
+{
+ struct zone *zone;
+ int high_zoneidx = gfp_zone(gfp_mask);
+ pg_data_t *pgdat;
+
+ /* Kernel threads such as kjournald should not be throttled */
The comment should explain "why", not "what". Particularly when the
"what" was bleedin obvious ;)
Also... why?
/*
* Kernel threads should not be throttled as they may be indirectly
* responsible for cleaning pages necessary for reclaim to make forward
* progress. kjournald for example may enter direct reclaim while
* committing a transaction where throttling it could force other
* processes to block on log_wait_commit()
*/

Does that help?
Post by Andrew Morton
Post by Mel Gorman
+ if (current->flags & PF_KTHREAD)
+ return;
+
+ /* Check if the pfmemalloc reserves are ok */
+ first_zones_zonelist(zonelist, high_zoneidx, NULL, &zone);
+ pgdat = zone->zone_pgdat;
+ if (pfmemalloc_watermark_ok(pgdat))
+ return;
+
+ /*
+ * If the caller cannot enter the filesystem, it's possible that it
+ * is processing a journal transaction. In this case, it is not safe
+ * to block on pfmemalloc_wait as kswapd could also be blocked waiting
+ * to start a transaction. Instead, throttle for up to a second before
+ * the reclaim must continue.
+ */
I suppose this applies to fs locks in general, not just to
journal_start()?
Yes. I updated the comment to reflect that.

/*
* If the caller cannot enter the filesystem, it's possible that it
* is due to the caller holding an FS lock or performing a journal
* transaction in the case of a filesystem like ext[3|4]. In this case,
* it is not safe to block on pfmemalloc_wait as kswapd could be
* blocked waiting on the same lock. Instead, throttle for up to a
* second before continuing.
*/
Post by Andrew Morton
Post by Mel Gorman
+ if (!(gfp_mask & __GFP_FS)) {
+ wait_event_interruptible_timeout(pgdat->pfmemalloc_wait,
+ pfmemalloc_watermark_ok(pgdat), HZ);
+ return;
+ }
+
+ /* Throttle until kswapd wakes the process */
+ wait_event_killable(zone->zone_pgdat->pfmemalloc_wait,
+ pfmemalloc_watermark_ok(pgdat));
+}
+
unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
gfp_t gfp_mask, nodemask_t *nodemask)
{
...
@@ -2610,6 +2686,20 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
if (remaining)
return true;
+ /*
+ * There is a potential race between when kswapd checks it watermarks
"its"
Fixed.
Post by Andrew Morton
Post by Mel Gorman
+ * and a process gets throttled. There is also a potential race if
+ * processes get throttled, kswapd wakes, a large process exits therby
+ * balancing the zones that causes kswapd to miss a wakeup. If kswapd
+ * is going to sleep, no process should be sleeping on pfmemalloc_wait
+ * so wake them now if necessary. If necessary, processes will wake
+ * kswapd and get throttled again
+ */
Yes, the possibility for missed wakeups here worried me. There's no
synchronization and it would be easy to leave holes.
It's good that there is no timeout on the throttling - a timeout would
cover up rare races most nastily.
Yes and I wanted to avoid that. If there is a lost wakeup, sysrq+t should
show processes stuck in throttle_direct_reclaim() while kswapd is asleep.
Post by Andrew Morton
Post by Mel Gorman
+ if (waitqueue_active(&pgdat->pfmemalloc_wait)) {
+ wake_up(&pgdat->pfmemalloc_wait);
+ return true;
+ }
A bool-returning function called "sleeping_prematurely" should have no
side-effects. But it now performs wakeups. Wanna see if there is a
way of making this nicer?
Minimally, having two instances of "There is a potential race" was a
merging mistake, so I deleted the one in kswapd_try_to_sleep().

I looked at moving this wake_up outside sleeping_prematurely() but it
looked worse really. What I did instead was rename
sleeping_prematurely() to prepare_kswapd_sleep() and comment it
like this

/*
* Prepare kswapd for sleeping. This verifies that there are no processes
* waiting in throttle_direct_reclaim() and that watermarks have been
* met.
*
* Returns true if kswapd is ready to sleep
*/
static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
int classzone_idx)

It's cheating a bit but a name like "prepare" implies that it may have
side-effects.
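
For the curious, a rough sketch of the side-effect-free alternative that was
considered (illustration only, not part of the patch; kswapd_ready_to_sleep()
is a made-up name and the watermark checks are elided):

static bool kswapd_ready_to_sleep(pg_data_t *pgdat, int order, long remaining,
				  int classzone_idx)
{
	/* Pure predicate, no side-effects */
	if (remaining)
		return false;
	/* ... the existing watermark checks, unchanged ... */
	return true;
}

kswapd_try_to_sleep() would then need the wakeup open-coded before *both*
of its sleep attempts, something like:

	/* Wake throttled reclaimers rather than sleeping while they wait */
	if (!remaining && waitqueue_active(&pgdat->pfmemalloc_wait)) {
		wake_up(&pgdat->pfmemalloc_wait);
	} else if (kswapd_ready_to_sleep(pgdat, order, remaining,
					 classzone_idx)) {
		/* Try to sleep for a short interval */
		remaining = schedule_timeout(HZ/10);
		finish_wait(&pgdat->kswapd_wait, &wait);
		prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
	}

Duplicating that boilerplate at both call sites is what made the hoisted
version look worse than the rename.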
Post by Andrew Morton
Post by Mel Gorman
/* Check the watermark levels */
for (i = 0; i <= classzone_idx; i++) {
struct zone *zone = pgdat->node_zones + i;
}
}
+
+ /* Wake throttled direct reclaimers if low watermark is met */
s/"what"/"why"/ !
/*
* If the low watermark is met there is no need for processes
 * to be throttled on pfmemalloc_wait as they should now be
* able to safely make forward progress. Wake them
*/

?

Here is how the patch currently stands

---8<---
mm: Throttle direct reclaimers if PF_MEMALLOC reserves are low and swap is backed by network storage

If swap is backed by network storage such as NBD, there is a risk
that a large number of reclaimers can hang the system by consuming
all PF_MEMALLOC reserves. To avoid these hangs, the administrator
must tune min_free_kbytes in advance, which is fragile.

This patch throttles direct reclaimers if half the PF_MEMALLOC reserves
are in use. If the system is routinely getting throttled, the administrator
can increase min_free_kbytes so that degradation is smoother, but the
system will keep running.

Signed-off-by: Mel Gorman <***@suse.de>
---
include/linux/mmzone.h | 1 +
mm/page_alloc.c | 1 +
mm/vmscan.c | 128 +++++++++++++++++++++++++++++++++++++++++++++---
3 files changed, 122 insertions(+), 8 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index dff7115..e6b733d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -663,6 +663,7 @@ typedef struct pglist_data {
range, including holes */
int node_id;
wait_queue_head_t kswapd_wait;
+ wait_queue_head_t pfmemalloc_wait;
struct task_struct *kswapd;
int kswapd_max_order;
enum zone_type classzone_idx;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e225a7c..b9eb64a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4326,6 +4326,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
pgdat_resize_init(pgdat);
pgdat->nr_zones = 0;
init_waitqueue_head(&pgdat->kswapd_wait);
+ init_waitqueue_head(&pgdat->pfmemalloc_wait);
pgdat->kswapd_max_order = 0;
pgdat_page_cgroup_init(pgdat);

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 33c332b..6f322e8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2431,6 +2431,80 @@ out:
return 0;
}

+static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
+{
+ struct zone *zone;
+ unsigned long pfmemalloc_reserve = 0;
+ unsigned long free_pages = 0;
+ int i;
+ bool wmark_ok;
+
+ for (i = 0; i <= ZONE_NORMAL; i++) {
+ zone = &pgdat->node_zones[i];
+ pfmemalloc_reserve += min_wmark_pages(zone);
+ free_pages += zone_page_state(zone, NR_FREE_PAGES);
+ }
+
+ wmark_ok = free_pages > pfmemalloc_reserve / 2;
+
+ /* kswapd must be awake if processes are being throttled */
+ if (!wmark_ok && waitqueue_active(&pgdat->kswapd_wait)) {
+ pgdat->classzone_idx = min(pgdat->classzone_idx,
+ (enum zone_type)ZONE_NORMAL);
+ wake_up_interruptible(&pgdat->kswapd_wait);
+ }
+
+ return wmark_ok;
+}
+
+/*
+ * Throttle direct reclaimers if backing storage is backed by the network
+ * and the PFMEMALLOC reserve for the preferred node is getting dangerously
+ * depleted. kswapd will continue to make progress and wake the processes
+ * when the low watermark is reached
+ */
+static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
+ nodemask_t *nodemask)
+{
+ struct zone *zone;
+ int high_zoneidx = gfp_zone(gfp_mask);
+ pg_data_t *pgdat;
+
+ /*
+ * Kernel threads should not be throttled as they may be indirectly
+ * responsible for cleaning pages necessary for reclaim to make forward
+ * progress. kjournald for example may enter direct reclaim while
+ * committing a transaction where throttling it could force other
+ * processes to block on log_wait_commit().
+ */
+ if (current->flags & PF_KTHREAD)
+ return;
+
+ /* Check if the pfmemalloc reserves are ok */
+ first_zones_zonelist(zonelist, high_zoneidx, NULL, &zone);
+ pgdat = zone->zone_pgdat;
+ if (pfmemalloc_watermark_ok(pgdat))
+ return;
+
+ /*
+ * If the caller cannot enter the filesystem, it's possible that it
+ * is due to the caller holding an FS lock or performing a journal
+ * transaction in the case of a filesystem like ext[3|4]. In this case,
+ * it is not safe to block on pfmemalloc_wait as kswapd could be
+ * blocked waiting on the same lock. Instead, throttle for up to a
+ * second before continuing.
+ */
+ if (!(gfp_mask & __GFP_FS)) {
+ wait_event_interruptible_timeout(pgdat->pfmemalloc_wait,
+ pfmemalloc_watermark_ok(pgdat), HZ);
+ return;
+ }
+
+ /* Throttle until kswapd wakes the process */
+ wait_event_killable(zone->zone_pgdat->pfmemalloc_wait,
+ pfmemalloc_watermark_ok(pgdat));
+}
+
unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
gfp_t gfp_mask, nodemask_t *nodemask)
{
@@ -2449,6 +2523,15 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
.gfp_mask = sc.gfp_mask,
};

+ throttle_direct_reclaim(gfp_mask, zonelist, nodemask);
+
+ /*
+ * Do not enter reclaim if fatal signal is pending. 1 is returned so
+ * that the page allocator does not consider triggering OOM
+ */
+ if (fatal_signal_pending(current))
+ return 1;
+
trace_mm_vmscan_direct_reclaim_begin(order,
sc.may_writepage,
gfp_mask);
@@ -2598,8 +2681,13 @@ static bool pgdat_balanced(pg_data_t *pgdat, unsigned long balanced_pages,
return balanced_pages >= (present_pages >> 2);
}

-/* is kswapd sleeping prematurely? */
-static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
+/*
+ * Prepare kswapd for sleeping. This verifies that there are no processes
+ * waiting in throttle_direct_reclaim() and that watermarks have been met.
+ *
+ * Returns true if kswapd is ready to sleep
+ */
+static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
int classzone_idx)
{
int i;
@@ -2608,7 +2696,21 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,

/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
if (remaining)
- return true;
+ return false;
+
+ /*
+ * There is a potential race between when kswapd checks its watermarks
+ * and a process gets throttled. There is also a potential race if
+ * processes get throttled, kswapd wakes, a large process exits, thereby
+ * balancing the zones, which causes kswapd to miss a wakeup. If kswapd
+ * is going to sleep, no process should be sleeping on pfmemalloc_wait
+ * so wake them now if necessary. If necessary, processes will wake
+ * kswapd and get throttled again
+ */
+ if (waitqueue_active(&pgdat->pfmemalloc_wait)) {
+ wake_up(&pgdat->pfmemalloc_wait);
+ return false;
+ }

/* Check the watermark levels */
for (i = 0; i <= classzone_idx; i++) {
@@ -2641,9 +2743,9 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
* must be balanced
*/
if (order)
- return !pgdat_balanced(pgdat, balanced, classzone_idx);
+ return pgdat_balanced(pgdat, balanced, classzone_idx);
else
- return !all_zones_ok;
+ return all_zones_ok;
}

/*
@@ -2871,6 +2973,16 @@ loop_again:
}

}
+
+ /*
+ * If the low watermark is met there is no need for processes
+ * to be throttled on pfmemalloc_wait as they should now be
+ * able to safely make forward progress. Wake them
+ */
+ if (waitqueue_active(&pgdat->pfmemalloc_wait) &&
+ pfmemalloc_watermark_ok(pgdat))
+ wake_up(&pgdat->pfmemalloc_wait);
+
if (all_zones_ok || (order && pgdat_balanced(pgdat, balanced, *classzone_idx)))
break; /* kswapd: all done */
/*
@@ -2971,7 +3083,7 @@ out:
}

/*
- * Return the order we were reclaiming at so sleeping_prematurely()
+ * Return the order we were reclaiming at so prepare_kswapd_sleep()
* makes a decision on the order we were last reclaiming at. However,
* if another caller entered the allocator slow path while kswapd
* was awake, order will remain at the higher level
@@ -2991,7 +3103,7 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);

/* Try to sleep for a short interval */
- if (!sleeping_prematurely(pgdat, order, remaining, classzone_idx)) {
+ if (prepare_kswapd_sleep(pgdat, order, remaining, classzone_idx)) {
remaining = schedule_timeout(HZ/10);
finish_wait(&pgdat->kswapd_wait, &wait);
prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
@@ -3001,7 +3113,7 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
* After a short sleep, check if it was a premature sleep. If not, then
* go fully to sleep until explicitly woken up.
*/
- if (!sleeping_prematurely(pgdat, order, remaining, classzone_idx)) {
+ if (prepare_kswapd_sleep(pgdat, order, remaining, classzone_idx)) {
trace_mm_vmscan_kswapd_sleep(pgdat->node_id);

/*

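To put the changelog's "half the PF_MEMALLOC reserves" trigger into concrete
terms, a back-of-the-envelope example (made-up numbers, 4K pages, ignoring
how min_free_kbytes is split between zones):

	/*
	 * min_free_kbytes = 4096 gives min watermarks totalling roughly
	 * 1024 pages across the zones up to ZONE_NORMAL, so:
	 *
	 *	pfmemalloc_reserve ~= 1024 pages
	 *	wmark_ok == (free_pages > 1024 / 2)
	 *
	 * i.e. direct reclaimers start being throttled once free memory in
	 * those zones drops below roughly 512 pages (~2MB), and are woken
	 * again once kswapd brings it back above that.
	 */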
Eric B Munson
2012-04-21 18:15:41 UTC
Permalink
I have tested these with an artificial swap benchmark and with a large project
compile on a beagle board. They work great for me. My tests only used this
set via swap over NFS so it probably wasn't very thorough coverage.

Tested-by: Eric B Munson <***@mgebm.net>
Andrew Morton
2012-05-01 22:28:26 UTC
Permalink
This patchset is far less ghastly than I feared/remembered/dreamed ;)

The mm parts, anyway. Are the net guys on board with it all?

Mel Gorman
2012-05-03 15:00:48 UTC
Permalink
Post by Andrew Morton
This patchset is far less ghastly than I feared/remembered/dreamed ;)
That might be the best comment the series ever received :)
Post by Andrew Morton
The mm parts, anyway. Are the net guys on board with it all?
They are cc'd but have not given any feedback in a while. That could be
because they are happy with it, or because they felt the MM parts were
blocking the series and so there was no need to review the network parts yet.

Any of the networking people care to comment?
--
Mel Gorman
SUSE Labs

David Miller
2012-05-03 17:06:11 UTC
Permalink
From: Mel Gorman <***@suse.de>
Date: Thu, 3 May 2012 16:00:48 +0100
Post by Mel Gorman
Any of the networking people care to comment?
Post another series addressing any lingering feedback you've received and
I'll make sure it gets queued up in patchwork so that it gets properly
reviewed by us.

Thanks.

Mel Gorman
2012-05-04 10:16:59 UTC
Permalink
Post by David Miller
Date: Thu, 3 May 2012 16:00:48 +0100
Post by Mel Gorman
Any of the networking people care to comment?
Post another series with any lingering feedback you've received and
I'll make sure it gets queued up in patchwork so that it gets properly
reviewed by us.
Will do, thanks a lot. Tests are currently running on a rebased series
and I'll post after a successful completion.
--
Mel Gorman
SUSE Labs
