unaligned accesses in SLAB etc.

David Miller

2014-10-12 02:15:10 UTC

I'm getting tons of the following on sparc64:

[603965.383447] Kernel unaligned access at TPC[546b58] free_block+0x98/0x1a0
[603965.396987] Kernel unaligned access at TPC[546b60] free_block+0xa0/0x1a0
[603965.410523] Kernel unaligned access at TPC[546b58] free_block+0x98/0x1a0
[603965.424061] Kernel unaligned access at TPC[546b60] free_block+0xa0/0x1a0
[603965.437617] Kernel unaligned access at TPC[546b58] free_block+0x98/0x1a0
[603970.554394] log_unaligned: 333 callbacks suppressed
[603970.564041] Kernel unaligned access at TPC[546b58] free_block+0x98/0x1a0
[603970.577576] Kernel unaligned access at TPC[546b60] free_block+0xa0/0x1a0
[603970.591122] Kernel unaligned access at TPC[546b58] free_block+0x98/0x1a0
[603970.604669] Kernel unaligned access at TPC[546b60] free_block+0xa0/0x1a0
[603970.618216] Kernel unaligned access at TPC[546b58] free_block+0x98/0x1a0
[603976.515633] log_unaligned: 31 callbacks suppressed
[603976.525092] Kernel unaligned access at TPC[548080] cache_alloc_refill+0x180/0x3a0
[603976.540196] Kernel unaligned access at TPC[548080] cache_alloc_refill+0x180/0x3a0
[603976.555308] Kernel unaligned access at TPC[548080] cache_alloc_refill+0x180/0x3a0
[603976.570411] Kernel unaligned access at TPC[548080] cache_alloc_refill+0x180/0x3a0
[603976.585526] Kernel unaligned access at TPC[548080] cache_alloc_refill+0x180/0x3a0
[603982.476424] log_unaligned: 43 callbacks suppressed
[603982.485881] Kernel unaligned access at TPC[549378] kmem_cache_alloc+0xd8/0x1e0
[603982.501590] Kernel unaligned access at TPC[5470a8] kmem_cache_free+0xc8/0x200
[603982.501605] Kernel unaligned access at TPC[549378] kmem_cache_alloc+0xd8/0x1e0
[603982.530382] Kernel unaligned access at TPC[5470a8] kmem_cache_free+0xc8/0x200
[603982.544820] Kernel unaligned access at TPC[549378] kmem_cache_alloc+0xd8/0x1e0
[603987.567130] log_unaligned: 11 callbacks suppressed
[603987.576582] Kernel unaligned access at TPC[548080] cache_alloc_refill+0x180/0x3a0
[603987.591696] Kernel unaligned access at TPC[548080] cache_alloc_refill+0x180/0x3a0
[603987.606811] Kernel unaligned access at TPC[548080] cache_alloc_refill+0x180/0x3a0
[603987.621904] Kernel unaligned access at TPC[548080] cache_alloc_refill+0x180/0x3a0
[603987.637017] Kernel unaligned access at TPC[548080] cache_alloc_refill+0x180/0x3a0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to ***@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"***@kvack.org"> ***@kvack.org </a>

David Miller

2014-10-12 17:20:12 UTC

From: David Miller <***@davemloft.net>
Date: Sat, 11 Oct 2014 22:15:10 -0400 (EDT)

Post by David Miller
[603965.383447] Kernel unaligned access at TPC[546b58] free_block+0x98/0x1a0
[603965.396987] Kernel unaligned access at TPC[546b60] free_block+0xa0/0x1a0
[603965.410523] Kernel unaligned access at TPC[546b58] free_block+0x98/0x1a0

The unaligned accesses are happening in the SLAB_OBJ_PFMEMALLOC code,
which assumes that all object pointers are "unsigned long" aligned:

static inline void set_obj_pfmemalloc(void **objp)
{
	*objp = (void *)((unsigned long)*objp | SLAB_OBJ_PFMEMALLOC);
	return;
}

etc. etc.

But that code has been there working forever. Something changed
recently such that this assumption no longer holds.

In all of the cases, the address is 4-byte aligned but not 8-byte
aligned. And they are vmalloc addresses.

Which made me suspect the percpu commit:

====================
commit bf0dea23a9c094ae869a88bb694fbe966671bf6d
Author: Joonsoo Kim <***@lge.com>
Date: Thu Oct 9 15:26:27 2014 -0700

mm/slab: use percpu allocator for cpu cache
====================

And indeed, reverting this commit fixes the problem.
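
As an illustration of the alignment assumption described in this message, here is a minimal userspace sketch. The names demo_is_aligned, demo_set_tag, demo_clear_tag, demo_is_tagged and the flag value DEMO_OBJ_PFMEMALLOC are hypothetical stand-ins, not the kernel's code. The sketch only shows that an address can be 4-byte aligned without being 8-byte aligned, and how a flag can be carried in the lowest bit of a pointer value.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's SLAB_OBJ_PFMEMALLOC flag bit. */
#define DEMO_OBJ_PFMEMALLOC 1UL

/* Nonzero when addr is a multiple of align (align must be a power of two). */
static int demo_is_aligned(uintptr_t addr, size_t align)
{
	return (addr & (align - 1)) == 0;
}

/* Carry a flag in the lowest bit of a pointer value. */
static void *demo_set_tag(void *obj)
{
	return (void *)((uintptr_t)obj | DEMO_OBJ_PFMEMALLOC);
}

/* Recover the original pointer value by masking the flag bit off again. */
static void *demo_clear_tag(void *obj)
{
	return (void *)((uintptr_t)obj & ~(uintptr_t)DEMO_OBJ_PFMEMALLOC);
}

/* Nonzero when the flag bit is set in the pointer value. */
static int demo_is_tagged(const void *obj)
{
	return ((uintptr_t)obj & DEMO_OBJ_PFMEMALLOC) != 0;
}
```

An address such as 0x1004 passes demo_is_aligned(addr, 4) but fails demo_is_aligned(addr, 8). On sparc64 an 8-byte load or store through a pointer slot at such an address is what gets reported as a kernel unaligned access.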

m***@linux.ee

2014-10-13 20:22:37 UTC

Post by David Miller
Date: Sat, 11 Oct 2014 22:15:10 -0400 (EDT)

Post by David Miller
[603965.383447] Kernel unaligned access at TPC[546b58] free_block+0x98/0x1a0
[603965.396987] Kernel unaligned access at TPC[546b60] free_block+0xa0/0x1a0
[603965.410523] Kernel unaligned access at TPC[546b58] free_block+0x98/0x1a0

In all of the cases, the address is 4-byte aligned but not 8-byte
aligned. And they are vmalloc addresses.
====================
commit bf0dea23a9c094ae869a88bb694fbe966671bf6d
Date: Thu Oct 9 15:26:27 2014 -0700
mm/slab: use percpu allocator for cpu cache
====================
And indeed, reverting this commit fixes the problem.

I tested Joonsoo Kim's fix and it gets rid of the kernel unaligned
access messages, yes.

But the instability on UltraSparc II era machines still remains -
occasional Bus Errors during kernel compilation, messages like this:

sh[11771]: segfault at ffd6a4d1 ip 00000000f7cc5714 (rpc 00000000f7cc562c) sp 00000000ffd69d90 error 30002 in libc-2.19.so[f7c44000+16a000]

--
Meelis Roos (***@linux.ee)

Joonsoo Kim

2014-10-13 23:52:19 UTC

Post by m***@linux.ee

Post by David Miller
Date: Sat, 11 Oct 2014 22:15:10 -0400 (EDT)

Post by David Miller
[603965.383447] Kernel unaligned access at TPC[546b58] free_block+0x98/0x1a0
[603965.396987] Kernel unaligned access at TPC[546b60] free_block+0xa0/0x1a0
[603965.410523] Kernel unaligned access at TPC[546b58] free_block+0x98/0x1a0

In all of the cases, the address is 4-byte aligned but not 8-byte
aligned. And they are vmalloc addresses.
====================
commit bf0dea23a9c094ae869a88bb694fbe966671bf6d
Date: Thu Oct 9 15:26:27 2014 -0700
mm/slab: use percpu allocator for cpu cache
====================
And indeed, reverting this commit fixes the problem.

I tested Joonsoo Kim's fix and it gets rid of the kernel unaligned
access messages, yes.
But the instability on UltraSparc II era machines still remains -
sh[11771]: segfault at ffd6a4d1 ip 00000000f7cc5714 (rpc 00000000f7cc562c) sp 00000000ffd69d90 error 30002 in libc-2.19.so[f7c44000+16a000]

Hello, Meelis.

Thanks for testing.

I'd like to know whether your other problem is related to commit
bf0dea23a9c0 ("mm/slab: use percpu allocator for cpu cache").
So, if the commit is reverted, does your other problem also go away completely?

Thanks.

David Miller

2014-10-14 00:04:16 UTC

From: Joonsoo Kim <***@lge.com>
Date: Tue, 14 Oct 2014 08:52:19 +0900

Post by Joonsoo Kim
I'd like to know whether your other problem is related to commit
bf0dea23a9c0 ("mm/slab: use percpu allocator for cpu cache"). So,
if the commit is reverted, does your other problem also go away
completely?

The other problem has been present forever.
--
To unsubscribe from this list: send the line "unsubscribe sparclinux" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Joonsoo Kim

2014-10-14 00:14:54 UTC

Post by David Miller
Date: Tue, 14 Oct 2014 08:52:19 +0900

Post by Joonsoo Kim
I'd like to know whether your other problem is related to commit
bf0dea23a9c0 ("mm/slab: use percpu allocator for cpu cache"). So,
if the commit is reverted, does your other problem also go away
completely?

The other problem has been present forever.

Okay.
Thanks for notifying me.

Thanks.

m***@linux.ee

2014-10-14 21:19:36 UTC

Post by David Miller

Post by Joonsoo Kim
I'd like to know whether your other problem is related to commit
bf0dea23a9c0 ("mm/slab: use percpu allocator for cpu cache"). So,
if the commit is reverted, does your other problem also go away
completely?

The other problem has been present forever.

Umm? I am afraid I have been describing it badly. This random
SIGBUS+SIGSEGV problem is new - I have not seen it before.

I have been able to do kernel compiles for years on sparc64 (modulo
specific bugs in specific configurations) and 3.17 + start/end swap
patch also seems stable for most machines. With yesterday's git + align
patch, it dies with SIGBUS multiple times during compilation so it's a
new regression for me.

Will try reverting that commit tomorrow.

My only other current sparc64 problems that I am seeing - V210/V440 die
during bootup if compiled with gcc 4.9 and V480 dies with FATAL
exceptions during bootups since previous kernel release. Maybe also
exit_mmap warning - I do not know if they have been fixed, I see them
rarely.

--
Meelis Roos (***@linux.ee)

David Miller

2014-10-14 21:32:46 UTC

From: ***@linux.ee
Date: Wed, 15 Oct 2014 00:19:36 +0300 (EEST)

Post by m***@linux.ee

Post by David Miller

Post by Joonsoo Kim
I'd like to know whether your other problem is related to commit
bf0dea23a9c0 ("mm/slab: use percpu allocator for cpu cache"). So,
if the commit is reverted, does your other problem also go away
completely?

The other problem has been present forever.

Umm? I am afraid I have been describing it badly. This random
SIGBUS+SIGSEGV problem is new - I have not seen it before.

Sorry, I thought it was the same bug that causes git corruptions
for you. I misunderstood.

Post by m***@linux.ee
I have been able to do kernel compiles for years on sparc64 (modulo
specific bugs in specific configurations) and 3.17 + start/end swap
patch also seems stable for most machines. With yesterday's git + align
patch, it dies with SIGBUS multiple times during compilation so it's a
new regression for me.
Will try reverting that commit tomorrow.

If that fails, please try to bisect, it will help us a lot.

Post by m***@linux.ee
My only other current sparc64 problems that I am seeing - V210/V440 die
during bootup if compiled with gcc 4.9 and V480 dies with FATAL
exceptions during bootups since previous kernel release. Maybe also
exit_mmap warning - I do not know if they have been fixed, I see them
rarely.

The gcc-4.9 case is interesting, are you saying that a gcc-4.9 compiled
kernel works fine on other systems?

Meelis Roos

2014-10-15 08:04:49 UTC

Post by David Miller

Post by m***@linux.ee
My only other current sparc64 problems that I am seeing - V210/V440 die
during bootup if compiled with gcc 4.9 and V480 dies with FATAL
exceptions during bootups since previous kernel release. Maybe also
exit_mmap warning - I do not know if they have been fixed, I see them
rarely.

The gcc-4.9 case is interesting, are you saying that a gcc-4.9 compiled
kernel works fine on other systems?

Yes, all USII based systems work fine with Debian gcc-4.9, as does
T2000. Of USIII* systems, V210 and V440 exhibit the boot hang with
gcc-4.9 and V480 crashes with FATAL exception during boot that is
probably earlier than the gcc boot hang so I do not know about V480 and
gcc-4.9. V240 not tested because of fan failures, V245 is in the queue
for setup but not tested so far.

--
Meelis Roos (***@linux.ee)

David Miller

2014-10-15 18:36:24 UTC

From: Meelis Roos <***@linux.ee>
Date: Wed, 15 Oct 2014 11:04:49 +0300 (EEST)

Post by Meelis Roos

Post by David Miller

Post by m***@linux.ee
My only other current sparc64 problems that I am seeing - V210/V440 die
during bootup if compiled with gcc 4.9 and V480 dies with FATAL
exceptions during bootups since previous kernel release. Maybe also
exit_mmap warning - I do not know if they have been fixed, I see them
rarely.

The gcc-4.9 case is interesting, are you saying that a gcc-4.9 compiled
kernel works fine on other systems?

Yes, all USII based systems work fine with Debian gcc-4.9, as does
T2000. Of USIII* systems, V210 and V440 exhibit the boot hang with
gcc-4.9 and V480 crashes with FATAL exception during boot that is
probably earlier than the gcc boot hang so I do not know about V480 and
gcc-4.9. V240 not tested because of fan failures, V245 is in the queue
for setup but not tested so far.

Ok, on the V210/V440 can you boot with "-p" on the kernel boot command
line and post the log? Let's start by seeing how far it gets, maybe
we can figure out roughly where it dies.

A boot hang should be relatively easy to diagnose and pinpoint.

Thanks.

Meelis Roos

2014-10-15 20:11:34 UTC

Post by David Miller

Post by Meelis Roos

Post by David Miller
The gcc-4.9 case is interesting, are you saying that a gcc-4.9 compiled
kernel works fine on other systems?

Yes, all USII based systems work fine with Debian gcc-4.9, as does
T2000. Of USIII* systems, V210 and V440 exhibit the boot hang with
gcc-4.9 and V480 crashes with FATAL exception during boot that is
probably earlier than the gcc boot hang so I do not know about V480 and
gcc-4.9. V240 not tested because of fan failures, V245 is in the queue
for setup but not tested so far.

Ok, on the V210/V440 can you boot with "-p" on the kernel boot command
line and post the log? Let's start by seeing how far it gets, maybe
we can figure out roughly where it dies.

http://www.spinics.net/lists/sparclinux/msg12238.html and
http://www.spinics.net/lists/sparclinux/msg12468.html are my relevant
posts about it. Should I get something more? It would be easy because of
ALOM.

--
Meelis Roos (***@linux.ee)

David Miller

2014-10-16 03:11:54 UTC

From: Meelis Roos <***@linux.ee>
Date: Wed, 15 Oct 2014 23:11:34 +0300 (EEST)

Post by Meelis Roos

Post by David Miller

Post by Meelis Roos

Post by David Miller
The gcc-4.9 case is interesting, are you saying that a gcc-4.9 compiled
kernel works fine on other systems?

Yes, all USII based systems work fine with Debian gcc-4.9, as does
T2000. Of USIII* systems, V210 and V440 exhibit the boot hang with
gcc-4.9 and V480 crashes with FATAL exception during boot that is
probably earlier than the gcc boot hang so I do not know about V480 and
gcc-4.9. V240 not tested because of fan failures, V245 is in the queue
for setup but not tested so far.

Ok, on the V210/V440 can you boot with "-p" on the kernel boot command
line and post the log? Let's start by seeing how far it gets, maybe
we can figure out roughly where it dies.

http://www.spinics.net/lists/sparclinux/msg12238.html and
http://www.spinics.net/lists/sparclinux/msg12468.html are my relevant
posts about it. Should I get something more? It would be easy because of
ALOM.

Less information than I had hoped :-/

I thought it was hanging "during boot" meaning before we try to
execute userspace. When in fact it seems to die exactly when we start
running the init process.

Wrt. disassembly of fault_in_user_windows(), that's not likely the
cause because if it were being miscompiled it would equally not work
on the other systems.

Something in the UltraSPARC-III specific code paths is going wrong
(either it is miscompiled, or the code makes an assumption that isn't
valid which has happened in the past).

Do you happen to have both gcc-4.9 and a previously working compiler
on these systems? If you do, we can build a kernel with gcc-4.9 and
then selectively compile certain files with the older working
compiler to narrow down what compiles into something non-working with
gcc-4.9

I would start with the following files:

arch/sparc/mm/init_64.c
arch/sparc/mm/tlb.c
arch/sparc/mm/tsb.c
arch/sparc/mm/fault_64.c

And failing that, go for various files under arch/sparc/kernel/ such as:

arch/sparc/kernel/process_64.c
arch/sparc/kernel/smp_64.c
arch/sparc/kernel/sys_sparc_64.c
arch/sparc/kernel/sys_sparc32.c
arch/sparc/kernel/traps_64.c

Hopefully, this should be a simple matter of doing a complete build
with gcc-4.9, then removing the object file we want to selectively
build with the older compiler and then going:

make CC="gcc-4.6" arch/sparc/mm/init_64.o

then relinking with plain 'make'.

If the build system rebuilds the object file on you when you try
to relink the final kernel image, we'll have to do some of this
by hand to make the test.

Thanks.

Meelis Roos

2014-10-16 07:22:56 UTC

Post by David Miller
Do you happen to have both gcc-4.9 and a previously working compiler
on these systems? If you do, we can build a kernel with gcc-4.9 and
then selectively compile certain files with the older working
compiler to narrow down what compiles into something non-working with
gcc-4.9

Yes, I kept gcc-4.6 to help resolving it.

[...]

Post by David Miller
Hopefully, this should be a simple matter of doing a complete build
with gcc-4.9, then removing the object file we want to selectively
make CC="gcc-4.6" arch/sparc/mm/init_64.o
then relinking with plain 'make'.
If the build system rebuilds the object file on you when you try
to relink the final kernel image, we'll have to do some of this
by hand to make the test.

Unfortunately it starts a full rebuild with plain make after compiling
some files with gcc-4.6 - detects CC change?

--
Meelis Roos (***@linux.ee)

Meelis Roos

2014-10-16 20:11:49 UTC

Post by Meelis Roos

Post by David Miller
Hopefully, this should be a simple matter of doing a complete build
with gcc-4.9, then removing the object file we want to selectively
make CC="gcc-4.6" arch/sparc/mm/init_64.o
then relinking with plain 'make'.
If the build system rebuilds the object file on you when you try
to relink the final kernel image, we'll have to do some of this
by hand to make the test.

Unfortunately it starts a full rebuild with plain make after compiling
some files with gcc-4.6 - detects CC change?

Figured out from make V=1 how to call gcc-4.6 directly. So far my
bisection shows that it is one or more of arch/sparc/kernel/*.c, and
probably more than one: both halves of it failed. Still bisecting.

--
Meelis Roos (***@linux.ee)

David Miller

2014-10-16 20:18:23 UTC

From: Meelis Roos <***@linux.ee>
Date: Thu, 16 Oct 2014 23:11:49 +0300 (EEST)

Post by Meelis Roos

Post by Meelis Roos

Post by David Miller
Hopefully, this should be a simple matter of doing a complete build
with gcc-4.9, then removing the object file we want to selectively
make CC="gcc-4.6" arch/sparc/mm/init_64.o
then relinking with plain 'make'.
If the build system rebuilds the object file on you when you try
to relink the final kernel image, we'll have to do some of this
by hand to make the test.

Unfortunately it starts a full rebuild with plain make after compiling
some files with gcc-4.6 - detects CC change?

Figured out from make V=1 how to call gcc-4.6 directly. So far my
bisection shows that it is one or more of arch/sparc/kernel/*.c, and
probably more than one: both halves of it failed. Still bisecting.

Thanks a lot for working this out.

I'm going to also try to setup a test environment so I can try this
gcc-4.9 stuff on my T4-2 as well.

Meelis Roos

2014-10-16 07:02:57 UTC

Post by David Miller

Post by m***@linux.ee

Post by David Miller

Post by Joonsoo Kim
I'd like to know whether your other problem is related to commit
bf0dea23a9c0 ("mm/slab: use percpu allocator for cpu cache"). So,
if the commit is reverted, does your other problem also go away
completely?

The other problem has been present forever.

Umm? I am afraid I have been describing it badly. This random
SIGBUS+SIGSEGV problem is new - I have not seen it before.

Sorry, I thought it was the same bug that causes git corruptions
for you. I misunderstood.

Post by m***@linux.ee
I have been able to do kernel compiles for years on sparc64 (modulo
specific bugs in specific configurations) and 3.17 + start/end swap
patch also seems stable for most machines. With yesterday's git + align
patch, it dies with SIGBUS multiple times during compilation so it's a
new regression for me.
Will try reverting that commit tomorrow.

If that fails, please try to bisect, it will help us a lot.

Commit bf0dea23a9c0 is working OK with no revert needed (checked out
this revision and it tested OK).

So far I know that the breakage seems to have happened between
cadbb58039f7cab1def9c931012ab04c953a6997 (first sparc commit of
the batch, working OK on V100) and
bdcf81b658ebc4c2640c3c2c55c8b31c601b6996 (last sparc commit before the
merge, breaks on E3500). Will continue bisecting the sparc64 commits.

Also, I noticed that when the problem happens, it's deterministic - with
some kernels, sshd dies reproducibly on login. With most kernels,
building kernel breaks in one specific location, not randomly.

scripts/Makefile.build:352: recipe for target 'sound/modules.order' failed
make[1]: *** [sound/modules.order] Bus error
make[1]: *** Deleting file 'sound/modules.order'
Makefile:929: recipe for target 'sound' failed

Will tell when I get more details.

--
Meelis Roos (***@linux.ee)

David Miller

2014-10-16 20:07:42 UTC

From: Meelis Roos <***@linux.ee>
Date: Thu, 16 Oct 2014 10:02:57 +0300 (EEST)

Post by Meelis Roos
scripts/Makefile.build:352: recipe for target 'sound/modules.order' failed
make[1]: *** [sound/modules.order] Bus error
make[1]: *** Deleting file 'sound/modules.order'
Makefile:929: recipe for target 'sound' failed

I just reproduced this on my Sun Blade 2500, so it can trigger on UltraSPARC-IIIi
systems too.

Meelis Roos

2014-10-16 20:16:44 UTC

Post by David Miller

Post by Meelis Roos
scripts/Makefile.build:352: recipe for target 'sound/modules.order' failed
make[1]: *** [sound/modules.order] Bus error
make[1]: *** Deleting file 'sound/modules.order'
Makefile:929: recipe for target 'sound' failed

I just reproduced this on my Sun Blade 2500, so it can trigger on UltraSPARC-IIIi
systems too.

My bisection led to the following commit but it seems irrelevant (I
have no sun4v on these machines):

4ccb9272892c33ef1c19a783cfa87103b30c2784 is the first bad commit
commit 4ccb9272892c33ef1c19a783cfa87103b30c2784
Author: bob picco <***@meloft.net>
Date: Tue Sep 16 09:26:47 2014 -0400

sparc64: sun4v TLB error power off events

However, the following chunk sounds slightly suspicious:

+ if (fault_code & FAULT_CODE_BAD_RA)
+ goto do_sigbus;
+

because SIGBUS is what I got. On some machines it killed checkroot
during startup, on others shells under some circumstances, on others sshd.

--
Meelis Roos (***@linux.ee)

David Miller

2014-10-16 20:20:01 UTC

From: Meelis Roos <***@linux.ee>
Date: Thu, 16 Oct 2014 23:16:44 +0300 (EEST)

Post by Meelis Roos

Post by David Miller

Post by Meelis Roos
scripts/Makefile.build:352: recipe for target 'sound/modules.order' failed
make[1]: *** [sound/modules.order] Bus error
make[1]: *** Deleting file 'sound/modules.order'
Makefile:929: recipe for target 'sound' failed

I just reproduced this on my Sun Blade 2500, so it can trigger on UltraSPARC-IIIi
systems too.

My bisection led to the following commit but it seems irrelevant (I
4ccb9272892c33ef1c19a783cfa87103b30c2784 is the first bad commit
commit 4ccb9272892c33ef1c19a783cfa87103b30c2784
Date: Tue Sep 16 09:26:47 2014 -0400
sparc64: sun4v TLB error power off events
+ if (fault_code & FAULT_CODE_BAD_RA)
+ goto do_sigbus;
+
because SIGBUS is what I got. On some machines it killed checkroot
during startup, on others shells under some circumstances, on others sshd.

Good catch!

So I'm going to audit all the code paths to make sure we don't put garbage
into the fault_code value.

David Miller

2014-10-16 20:50:17 UTC

From: David Miller <***@redhat.com>
Date: Thu, 16 Oct 2014 16:20:01 -0400 (EDT)

Post by David Miller
So I'm going to audit all the code paths to make sure we don't put garbage
into the fault_code value.

There are two code paths where we can put garbage into the fault_code
value. And for the dtlb_prot.S case, the value we put in there is
TLB_TAG_ACCESS which is 0x30, which includes bit 0x20, the
FAULT_CODE_BAD_RA indication that is erroneously triggering.

The other path is via hugepage TLB misses, for the situation where
we haven't allocated the huge TSB for the thread yet. That might
explain some other longer-term problems we've had.

I'm about to test the following fix:

diff --git a/arch/sparc/kernel/dtlb_prot.S b/arch/sparc/kernel/dtlb_prot.S
index b2c2c5b..d668ca14 100644
--- a/arch/sparc/kernel/dtlb_prot.S
+++ b/arch/sparc/kernel/dtlb_prot.S
@@ -24,11 +24,11 @@
mov TLB_TAG_ACCESS, %g4 ! For reload of vaddr

/* PROT ** ICACHE line 2: More real fault processing */
+ ldxa [%g4] ASI_DMMU, %g5 ! Put tagaccess in %g5
bgu,pn %xcc, winfix_trampoline ! Yes, perform winfixup
- ldxa [%g4] ASI_DMMU, %g5 ! Put tagaccess in %g5
- ba,pt %xcc, sparc64_realfault_common ! Nope, normal fault
mov FAULT_CODE_DTLB | FAULT_CODE_WRITE, %g4
- nop
+ ba,pt %xcc, sparc64_realfault_common ! Nope, normal fault
+ nop
nop
nop
nop
diff --git a/arch/sparc/kernel/tsb.S b/arch/sparc/kernel/tsb.S
index 14158d4..be98685 100644
--- a/arch/sparc/kernel/tsb.S
+++ b/arch/sparc/kernel/tsb.S
@@ -162,10 +162,10 @@ tsb_miss_page_table_walk_sun4v_fastpath:
nop
.previous

- rdpr %tl, %g3
- cmp %g3, 1
+ rdpr %tl, %g7
+ cmp %g7, 1
bne,pn %xcc, winfix_trampoline
- nop
+ mov %g3, %g4
ba,pt %xcc, etrap
rd %pc, %g7
call hugetlb_setup
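
To make the bit arithmetic in this message concrete, here is a minimal userspace sketch. The values 0x30 for TLB_TAG_ACCESS and 0x20 for FAULT_CODE_BAD_RA are the ones quoted above. The values used for the DTLB and WRITE flags are assumptions for illustration only, and demo_would_sigbus is a hypothetical helper that mirrors the "fault_code & FAULT_CODE_BAD_RA" test, not kernel code.

```c
#include <stdint.h>

/* Values quoted in the thread. */
#define DEMO_TLB_TAG_ACCESS    0x30u
#define DEMO_FAULT_CODE_BAD_RA 0x20u

/* Assumed values, for illustration only. */
#define DEMO_FAULT_CODE_WRITE  0x01u
#define DEMO_FAULT_CODE_DTLB   0x02u

/* Mirrors the check that jumps to do_sigbus when the BAD_RA bit is set. */
static int demo_would_sigbus(uint32_t fault_code)
{
	return (fault_code & DEMO_FAULT_CODE_BAD_RA) != 0;
}
```

With these values, a register that still holds TLB_TAG_ACCESS (0x30) makes demo_would_sigbus() return nonzero, while the intended DTLB | WRITE fault code does not. That matches the erroneous SIGBUS described above.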

Meelis Roos

2014-10-17 11:12:09 UTC

Post by David Miller
Date: Thu, 16 Oct 2014 16:20:01 -0400 (EDT)

Post by David Miller
So I'm going to audit all the code paths to make sure we don't put garbage
into the fault_code value.

There are two code paths where we can put garbage into the fault_code
value. And for the dtlb_prot.S case, the value we put in there is
TLB_TAG_ACCESS which is 0x30, which includes bit 0x20, the
FAULT_CODE_BAD_RA indication that is erroneously triggering.
The other path is via hugepage TLB misses, for the situation where
we haven't allocated the huge TSB for the thread yet. That might
explain some other longer-term problems we've had.

Thank you - it seems to work fine for me on E3500 on top of
3.17.0-07551-g052db7e + slab alignment fix.

However, on top of mainline HEAD 3.17.0-09670-g0429fbc it explodes with
scheduler BUG - just reported to LKML + sched maintainers.

--
Meelis Roos (***@linux.ee)

David Miller

2014-10-18 17:59:07 UTC

From: Meelis Roos <***@linux.ee>
Date: Fri, 17 Oct 2014 14:12:09 +0300 (EEST)

Post by Meelis Roos
However, on top of mainline HEAD 3.17.0-09670-g0429fbc it explodes with
scheduler BUG - just reported to LKML + sched maintainers.

task_stack_end_corrupted() cannot work properly on sparc64.

It stores the magic value at "task_thread_info(p) + 1", but on
sparc64 that's where we store the nested array of FPU register
saves.

In fact this facility could be corrupting FPU register state in
certain circumstances.

The current sparc64 design is intentional, the CPU stack grows down
toward the thread_info, and the FPU stack saving area grows up from
the end of thread_info.

I don't want to define the array size of the fpregs save area
explicitly and thereby place an artificial limit there.
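
The overlap described in this message can be sketched in userspace C, assuming GCC or Clang for the aligned attribute. struct demo_thread_info is a hypothetical, trimmed-down stand-in for the real structure, and both helper functions are invented for illustration. The point is only that when the trailing array has no declared size, "pointer to struct + 1" lands exactly on the first element of that array.

```c
#include <stddef.h>

/* Hypothetical, trimmed-down stand-in for sparc64's struct thread_info. */
struct demo_thread_info {
	unsigned long flags;
	unsigned int kern_una_insn;
	/* No declared size: the save area starts right where the struct ends. */
	unsigned long fpregs[] __attribute__((aligned(64)));
};

/* Byte offset of the first FPU save slot within the struct. */
static size_t demo_fpregs_offset(void)
{
	return offsetof(struct demo_thread_info, fpregs);
}

/* Byte offset that "thread_info + 1" points at. */
static size_t demo_stack_end_offset(void)
{
	return sizeof(struct demo_thread_info);
}
```

Because both helpers return the same offset, a magic value written at "thread_info + 1" would land in the first FPU save slot in this layout.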

David Miller

2014-10-18 18:23:35 UTC

From: David Miller <***@davemloft.net>
Date: Sat, 18 Oct 2014 13:59:07 -0400 (EDT)

Post by David Miller
I don't want to define the array size of the fpregs save area
explicitly and thereby placing an artificial limit there.

Nevermind, it seems we have a hard limit of 7 FPU save areas anyways.

Meelis, please try this patch:

diff --git a/arch/sparc/include/asm/thread_info_64.h b/arch/sparc/include/asm/thread_info_64.h
index f85dc85..cc6275c 100644
--- a/arch/sparc/include/asm/thread_info_64.h
+++ b/arch/sparc/include/asm/thread_info_64.h
@@ -63,7 +63,8 @@ struct thread_info {
struct pt_regs *kern_una_regs;
unsigned int kern_una_insn;

- unsigned long fpregs[0] __attribute__ ((aligned(64)));
+ unsigned long fpregs[(7 * 256) / sizeof(unsigned long)]
+ __attribute__ ((aligned(64)));
};

#endif /* !(__ASSEMBLY__) */
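
A rough userspace sketch of what the sizing in this patch implies, again assuming GCC or Clang for the aligned attribute. struct demo_thread_info_fixed and the DEMO_ constants are hypothetical, and the reading of 256 as the per-area size in bytes is taken from the patch expression rather than from any stated definition. With a declared array size, sizeof() covers the whole save area, so "pointer to struct + 1" points past it rather than at its first element.

```c
#include <stddef.h>

#define DEMO_FPU_SAVE_AREAS 7   /* hard limit mentioned in the thread */
#define DEMO_FPU_AREA_BYTES 256 /* per-area size implied by the patch */

/* Hypothetical, trimmed-down stand-in with an explicitly sized save area. */
struct demo_thread_info_fixed {
	unsigned long flags;
	unsigned int kern_una_insn;
	unsigned long fpregs[(DEMO_FPU_SAVE_AREAS * DEMO_FPU_AREA_BYTES) /
			     sizeof(unsigned long)]
		__attribute__((aligned(64)));
};

/* Bytes between the start of fpregs and the end of the struct. */
static size_t demo_fixed_save_area_bytes(void)
{
	return sizeof(struct demo_thread_info_fixed) -
	       offsetof(struct demo_thread_info_fixed, fpregs);
}
```

In this layout the save area occupies 7 * 256 = 1792 bytes inside the struct, independent of the width of unsigned long.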

Meelis Roos

2014-10-19 12:31:56 UTC

Post by David Miller

Post by David Miller
I don't want to define the array size of the fpregs save area
explicitly and thereby placing an artificial limit there.

Nevermind, it seems we have a hard limit of 7 FPU save areas anyways.

Works fine with 3.17.0-09670-g0429fbc + fault patch.

Will try current git next to find any new problems :)

--
Meelis Roos (***@linux.ee)

Meelis Roos

2014-10-19 17:12:43 UTC

Post by Meelis Roos

Post by David Miller

Post by David Miller
I don't want to define the array size of the fpregs save area
explicitly and thereby placing an artificial limit there.

Nevermind, it seems we have a hard limit of 7 FPU save areas anyways.

Works fine with 3.17.0-09670-g0429fbc + fault patch.
Will try current git next to find any new problems :)

Works on all 3 machines, with latest git (only had to apply the no-ipv6
patch on one of them). Thank you for the good work!

--
Meelis Roos (***@linux.ee)


David Miller

2014-10-19 17:18:55 UTC

From: Meelis Roos <***@linux.ee>
Date: Sun, 19 Oct 2014 20:12:43 +0300 (EEST)

Post by Meelis Roos

Post by Meelis Roos

Post by David Miller

Post by David Miller
I don't want to define the array size of the fpregs save area
explicitly and thereby placing an artificial limit there.

Nevermind, it seems we have a hard limit of 7 FPU save areas anyways.

Works fine with 3.17.0-09670-g0429fbc + fault patch.
Will try current git next to find any new problems :)

Works on all 3 machines, with latest git (only had to apply the no-ipv6
patch on one of them). Thank you for the good work!

Thanks for testing.

Hopefully we can kill the gcc-4.9 bug next, and then see if that
exit_mmap() crash is still happening.

Sam Ravnborg

2014-10-19 15:32:20 UTC

Post by David Miller
Date: Sat, 18 Oct 2014 13:59:07 -0400 (EDT)

Post by David Miller
I don't want to define the array size of the fpregs save area
explicitly and thereby placing an artificial limit there.

Nevermind, it seems we have a hard limit of 7 FPU save areas anyways.
diff --git a/arch/sparc/include/asm/thread_info_64.h b/arch/sparc/include/asm/thread_info_64.h
index f85dc85..cc6275c 100644
--- a/arch/sparc/include/asm/thread_info_64.h
+++ b/arch/sparc/include/asm/thread_info_64.h
@@ -63,7 +63,8 @@ struct thread_info {
struct pt_regs *kern_una_regs;
unsigned int kern_una_insn;
- unsigned long fpregs[0] __attribute__ ((aligned(64)));
+ unsigned long fpregs[(7 * 256) / sizeof(unsigned long)]
+ __attribute__ ((aligned(64)));

Could be written as __aligned(64)
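
A small sketch of the equivalence being suggested, for anyone unfamiliar with the shorthand. In the kernel, `__aligned(x)` is a macro that expands to the GCC attribute; in the userspace snippet below it is defined locally as a stand-in. The struct and helper names are hypothetical and exist only for the comparison.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Userspace stand-in for the kernel shorthand. Guarded in case a libc
 * header already provides a macro with the same name.
 */
#ifndef __aligned
#define __aligned(x) __attribute__((aligned(x)))
#endif

/* Long-hand form, as written in the patch. */
struct demo_long_form {
	unsigned int	insn;
	unsigned long	fpregs[4] __attribute__ ((aligned(64)));
};

/* Short form, as suggested in the review comment. */
struct demo_short_form {
	unsigned int	insn;
	unsigned long	fpregs[4] __aligned(64);
};

static inline size_t demo_long_offset(void)
{
	return offsetof(struct demo_long_form, fpregs);
}

static inline size_t demo_short_offset(void)
{
	return offsetof(struct demo_short_form, fpregs);
}
```

Both structures should end up with the same layout, so the change is purely cosmetic.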

Sam

David Miller

2014-10-19 17:27:37 UTC

From: Sam Ravnborg <***@ravnborg.org>
Date: Sun, 19 Oct 2014 17:32:20 +0200

Post by Sam Ravnborg

Post by David Miller
+ __attribute__ ((aligned(64)));

Could be written as __aligned(64)

I'll try to remember to sweep this up in sparc-next, thanks Sam.

We probably use this long-hand form in a lot of other places in
the sparc code too, so I'll try to do a full sweep.

Thanks again.


Sam Ravnborg

2014-10-19 19:55:53 UTC

Post by David Miller
Date: Sun, 19 Oct 2014 17:32:20 +0200

Post by Sam Ravnborg

Post by David Miller
+ __attribute__ ((aligned(64)));

Could be written as __aligned(64)

I'll try to remember to sweep this up in sparc-next, thanks Sam.
We probably use this long-hand form in a lot of other places in
the sparc code too, so I'll try to do a full sweep.

Another related one would be a full sweep of "__asm__ __volatile__"
to the shorter version "asm volatile".

The latter is already used in a few places in sparc, so the toolchain supports it.

I got hits in:
include/asm/irqflags_32.h: asm volatile("rd %%psr, %0" : "=r" (flags));
include/asm/processor_64.h:#define cpu_relax() asm volatile("\n99:\n\t" \
kernel/kprobes.c: asm volatile(".global kretprobe_trampoline\n"

But this would touch 93 files. That's too much crunch :-(
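
For reference, a minimal sketch of the two spellings side by side, using an empty asm statement with a "memory" clobber so it builds on any architecture. The helper names are invented for this example. One assumption worth stating: GCC only accepts the plain `asm` keyword in its GNU dialects, so the sketch falls back to the long form when a strict ISO mode (`-ansi`, `-std=c99` and similar) is in effect.

```c
#include <assert.h>

/* Long-hand spelling, accepted by GCC in every -std mode. */
static inline void demo_barrier_long(void)
{
	__asm__ __volatile__("" : : : "memory");
}

/*
 * Short spelling. Strict ISO modes define __STRICT_ANSI__ and disable
 * the plain `asm` keyword, so fall back to the long form there.
 */
static inline void demo_barrier_short(void)
{
#ifdef __STRICT_ANSI__
	__asm__ __volatile__("" : : : "memory");
#else
	asm volatile("" : : : "memory");
#endif
}

/* Store a value, issue one of the two barriers, and read it back. */
static inline int demo_roundtrip(int value, int use_short)
{
	int slot = value;

	if (use_short)
		demo_barrier_short();
	else
		demo_barrier_long();
	return slot;
}
```

The two helpers are intended to compile to the same thing in a GNU dialect, so a sweep like the one described would be a spelling change only.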

Sam


Meelis Roos

2014-10-16 20:20:11 UTC

Post by David Miller
I just reproduced this on my Sun Blade 2500, so it can trigger on UltraSPARC-IIIi
systems too.

I looked it up - V210 and V440 are also IIIi, not plain III. So I do not
have information about real USIII, sorry for confusion.

--
Meelis Roos (***@linux.ee)

Meelis Roos

2014-10-16 20:40:49 UTC

Post by Meelis Roos

Post by David Miller
I just reproduced this on my Sun Blade 2500, so it can trigger on UltraSPARC-IIIi
systems too.

I looked it up - V210 and V440 are also IIIi, not plain III. So I do not
have information about real USIII, sorry for confusion.

Brr, I just realized I confused two problems with the same subject. You
are talking about the SIGBUS problem that is also happening on IIIi; my last
comment was about the gcc-4.9 problem, so please just ignore it.

--
Meelis Roos (***@linux.ee)


Joonsoo Kim

2014-10-12 17:22:15 UTC

Post by David Miller
[603965.383447] Kernel unaligned access at TPC[546b58] free_block+0x98/0x1a0
[603965.396987] Kernel unaligned access at TPC[546b60] free_block+0xa0/0x1a0
[603965.410523] Kernel unaligned access at TPC[546b58] free_block+0x98/0x1a0
[603965.424061] Kernel unaligned access at TPC[546b60] free_block+0xa0/0x1a0
[603965.437617] Kernel unaligned access at TPC[546b58] free_block+0x98/0x1a0
[603970.554394] log_unaligned: 333 callbacks suppressed
[603970.564041] Kernel unaligned access at TPC[546b58] free_block+0x98/0x1a0
[603970.577576] Kernel unaligned access at TPC[546b60] free_block+0xa0/0x1a0
[603970.591122] Kernel unaligned access at TPC[546b58] free_block+0x98/0x1a0
[603970.604669] Kernel unaligned access at TPC[546b60] free_block+0xa0/0x1a0
[603970.618216] Kernel unaligned access at TPC[546b58] free_block+0x98/0x1a0
[603976.515633] log_unaligned: 31 callbacks suppressed
[603976.525092] Kernel unaligned access at TPC[548080] cache_alloc_refill+0x180/0x3a0
[603976.540196] Kernel unaligned access at TPC[548080] cache_alloc_refill+0x180/0x3a0
[603976.555308] Kernel unaligned access at TPC[548080] cache_alloc_refill+0x180/0x3a0
[603976.570411] Kernel unaligned access at TPC[548080] cache_alloc_refill+0x180/0x3a0
[603976.585526] Kernel unaligned access at TPC[548080] cache_alloc_refill+0x180/0x3a0
[603982.476424] log_unaligned: 43 callbacks suppressed
[603982.485881] Kernel unaligned access at TPC[549378] kmem_cache_alloc+0xd8/0x1e0
[603982.501590] Kernel unaligned access at TPC[5470a8] kmem_cache_free+0xc8/0x200
[603982.501605] Kernel unaligned access at TPC[549378] kmem_cache_alloc+0xd8/0x1e0
[603982.530382] Kernel unaligned access at TPC[5470a8] kmem_cache_free+0xc8/0x200
[603982.544820] Kernel unaligned access at TPC[549378] kmem_cache_alloc+0xd8/0x1e0
[603987.567130] log_unaligned: 11 callbacks suppressed
[603987.576582] Kernel unaligned access at TPC[548080] cache_alloc_refill+0x180/0x3a0
[603987.591696] Kernel unaligned access at TPC[548080] cache_alloc_refill+0x180/0x3a0
[603987.606811] Kernel unaligned access at TPC[548080] cache_alloc_refill+0x180/0x3a0
[603987.621904] Kernel unaligned access at TPC[548080] cache_alloc_refill+0x180/0x3a0
[603987.637017] Kernel unaligned access at TPC[548080] cache_alloc_refill+0x180/0x3a0

Hello,

Could you test the patch below?
If it fixes your problem, I will send it with a proper description.

Thanks.

---------->8----------------
diff --git a/mm/slab.c b/mm/slab.c
index 154aac8..eb2b2ea 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1992,7 +1992,7 @@ static struct array_cache __percpu *alloc_kmem_cache_cpus(
 	struct array_cache __percpu *cpu_cache;

 	size = sizeof(void *) * entries + sizeof(struct array_cache);
-	cpu_cache = __alloc_percpu(size, 0);
+	cpu_cache = __alloc_percpu(size, sizeof(void *));

 	if (!cpu_cache)
 		return NULL;
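
To illustrate why the alignment argument matters here, the following userspace sketch models a header of counters followed by a pointer array, sized with the same formula as in the hunk above. If the base address of such a block is not a multiple of `sizeof(void *)`, every access to the pointer array is unaligned. The struct layout and all `demo_` names are assumptions made for this example, not the kernel's `struct array_cache`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical layout: four counters followed by a pointer array.
 * Loosely modeled on a per-CPU array cache; NOT the kernel's struct.
 */
struct demo_array_cache {
	unsigned int	avail;
	unsigned int	limit;
	unsigned int	batchcount;
	unsigned int	touched;
	void		*entry[];
};

/* Same size formula as in the patch: header plus `entries` pointers. */
static inline size_t demo_cache_size(size_t entries)
{
	return sizeof(void *) * entries + sizeof(struct demo_array_cache);
}

/* Address of entry[0] for a cache placed at `base`. */
static inline uintptr_t demo_entry_addr(uintptr_t base)
{
	return base + offsetof(struct demo_array_cache, entry);
}

/* True when `addr` is a multiple of `align` (a power of two). */
static inline int demo_is_aligned(uintptr_t addr, size_t align)
{
	return (addr & (align - 1)) == 0;
}
```

With a pointer-aligned base, `demo_entry_addr()` lands on a pointer boundary; shift the base by half a pointer and it no longer does, which is the kind of address that would trigger the unaligned-access messages on a strict-alignment machine.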


David Miller

2014-10-12 17:30:47 UTC

From: Joonsoo Kim <***@gmail.com>
Date: Mon, 13 Oct 2014 02:22:15 +0900

Post by Joonsoo Kim
Could you test below patch?
If it fixes your problem, I will send it with proper description.

It works. I just tested using ARCH_KMALLOC_MINALIGN, which would be
better.


Joonsoo Kim

2014-10-12 17:43:57 UTC

Post by David Miller
Date: Mon, 13 Oct 2014 02:22:15 +0900

Post by Joonsoo Kim
Could you test below patch?
If it fixes your problem, I will send it with proper description.

It works, I just tested using ARCH_KMALLOC_MINALIGN which would be
better.

Oops, resending with the whole Cc list.

Thanks for testing.
ARCH_KMALLOC_MINALIGN is for object alignment,
but the current problem is caused by the alignment of the cpu cache array.
I think my fix is more appropriate in this situation.
I will send the fix tomorrow,
because I'd like to test more and it's 2:42 am. :)

Thanks.
