JFFS2 "scheduling while atomic" problem with L138 running linux kernel 3.2

Added by Fred Weiser about 4 years ago

I have been getting a lot of "scheduling while atomic" messages seemingly centered around JFFS2:

Jun 12 12:08:37 ultrasonic kernel: BUG: scheduling while atomic: flash-writer/2221/0x00000002
Jun 12 12:08:37 ultrasonic kernel: Modules linked in: ads7843(O) fpga_uart(O) fpga_spi(O) fpga_i2c(O) fpga_gpio(O) fpga_ctrl(O) dsplinkk(O)
Jun 12 12:08:37 ultrasonic kernel: [<c000d5a8>] (unwind_backtrace+0x0/0xe0) from [<c039b6e8>] (__schedule+0x58/0x3b4)
Jun 12 12:08:37 ultrasonic kernel: [<c039b6e8>] (__schedule+0x58/0x3b4) from [<c039c7e8>] (__mutex_lock_slowpath+0x90/0x100)
Jun 12 12:08:37 ultrasonic kernel: [<c039c7e8>] (__mutex_lock_slowpath+0x90/0x100) from [<c012a2c4>] (jffs2_garbage_collect_pass+0x2a4/0x90c)
Jun 12 12:08:37 ultrasonic kernel: [<c012a2c4>] (jffs2_garbage_collect_pass+0x2a4/0x90c) from [<c012e610>] (jffs2_flush_wbuf_gc+0x60/0xe0)
Jun 12 12:08:37 ultrasonic kernel: [<c012e610>] (jffs2_flush_wbuf_gc+0x60/0xe0) from [<c0121acc>] (jffs2_fsync+0x44/0x54)
Jun 12 12:08:37 ultrasonic kernel: [<c0121acc>] (jffs2_fsync+0x44/0x54) from [<c00a545c>] (vfs_fsync_range+0x34/0x44)
Jun 12 12:08:37 ultrasonic kernel: [<c00a545c>] (vfs_fsync_range+0x34/0x44) from [<c00a548c>] (vfs_fsync+0x20/0x28)
Jun 12 12:08:37 ultrasonic kernel: [<c00a548c>] (vfs_fsync+0x20/0x28) from [<c00a552c>] (do_fsync+0x20/0x34)
Jun 12 12:08:37 ultrasonic kernel: [<c00a552c>] (do_fsync+0x20/0x34) from [<c00093e0>] (ret_fast_syscall+0x0/0x2c)

A Stack Overflow question describes what looks like the same bug, which was eventually corrected:

http://stackoverflow.com/questions/17198046/jffs2-scheduling-while-atomic-error-on-kernel-2-6

Has this correction been included in the MDK for the L138 (linux kernel 3.2)? In the new Yocto build? Our current MDK is 2012-08-10.

What does the scheduler actually do when this error is thrown? Does the thread survive and keep running?

Thanks


Replies (6)

RE: JFFS2 "scheduling while atomic" problem with L138 running linux kernel 3.2 - Added by Bob Duke about 4 years ago

Fred,

The current MityDSP-L138 kernel (Branch: mitydsp-linux-v3.2) has the following code in fs/jffs2/gc.c:

D1(printk(KERN_DEBUG "No progress from erasing blocks; doing GC anyway\n"));
spin_lock(&c->erase_completion_lock);
mutex_lock(&c->alloc_sem);

These lines were last updated in commits made in 2010, and they match the code snippet in the SO question you posted. So I think the fix is in the current MityDSP-L138 kernel (Yocto-based or otherwise).

I am not sure exactly how the scheduler handles this error. I have seen JFFS2 kernel errors "recover", but I don't know whether that is due to a GC thread restart or whether any data is lost.

If you can reproduce this error, you can try enabling JFFS2 debug information to see if it provides any useful information about the state of the filesystem before and after the crash.

To obtain JFFS2 debugging information, enable CONFIG_JFFS2_FS_DEBUG in your kernel config. Then raise the console log level so that all kernel messages, including debug output, are printed (echo 9 > /proc/sys/kernel/printk). That should let you see what is happening while the GC is running.

RE: JFFS2 "scheduling while atomic" problem with L138 running linux kernel 3.2 - Added by Fred Weiser about 4 years ago

Ah, yes, so this is indeed the problem: the code you show above is from before the fix. The issue is that mutex_lock must not be called while a spinlock is held. spin_lock puts the kernel into atomic context, and in that context the scheduler must not be invoked. Invoking the scheduler is exactly what mutex_lock does if it has to wait for the mutex.
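The ordering problem described above can be illustrated with a small userspace model. This is only a sketch, not kernel code: preempt_count, atomic_sleep_bugs and all of the model_* and gc_pass_* names are hypothetical stand-ins, and the model assumes the simplest possible rule (a spinlock raises an "atomic" counter, and taking a mutex while that counter is non-zero counts as one "scheduling while atomic" event).

```c
#include <assert.h>

/* Toy userspace model, NOT kernel code. Every name here is hypothetical. */
static int preempt_count;      /* > 0 means "atomic context" in this model */
static int atomic_sleep_bugs;  /* times a sleeping lock was taken while atomic */

/* Taking a spinlock enters atomic context; releasing it leaves again. */
static void model_spin_lock(void)
{
    preempt_count++;
}

static void model_spin_unlock(void)
{
    assert(preempt_count > 0);
    preempt_count--;
}

/* A mutex may have to sleep, so taking it while atomic is recorded as a bug. */
static void model_mutex_lock(void)
{
    if (preempt_count > 0)
        atomic_sleep_bugs++;
}

static void model_mutex_unlock(void)
{
}

/* Order from the snippet quoted in this thread: spinlock first, then mutex. */
static int gc_pass_buggy_order(void)
{
    atomic_sleep_bugs = 0;
    model_spin_lock();
    model_mutex_lock();
    model_mutex_unlock();
    model_spin_unlock();
    return atomic_sleep_bugs;
}

/* Swapped order: mutex first, then spinlock. */
static int gc_pass_fixed_order(void)
{
    atomic_sleep_bugs = 0;
    model_mutex_lock();
    model_spin_lock();
    model_spin_unlock();
    model_mutex_unlock();
    return atomic_sleep_bugs;
}
```

In this model gc_pass_buggy_order() returns 1 and gc_pass_fixed_order() returns 0, which is the whole point of swapping the two lock calls. The real kernel check is more involved than a single counter, so treat this purely as an illustration of why the order matters.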

Does the new kernel (3.14) JFFS2 code fix this? How difficult would it be to get a more modern jffs2 module into the 3.2 kernel? I imagine a patch must be floating around for this somewhere... --Thanks

RE: JFFS2 "scheduling while atomic" problem with L138 running linux kernel 3.2 - Added by Jonathan Cormier about 4 years ago

The patch was included in the 3.2.18 kernel release. You can cherry-pick the patch from there to verify it fixes your issue.

$ git remote add linux-stable git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
$ git fetch linux-stable
remote: Counting objects: 495396, done.
...
$ git checkout -b 3.2_test_jffs_fix
$ git cherry-pick 5a6cc206dfa8de733e848590b43d30706216733c
[mitydsp-linux-v3.2 266458992ce0] jffs2: Fix lock acquisition order bug in gc path
 Author: Josh Cartwright <joshc@linux.com>
 1 file changed, 1 insertion(+), 1 deletion(-)

RE: JFFS2 "scheduling while atomic" problem with L138 running linux kernel 3.2 - Added by Jonathan Cormier about 4 years ago

Note that we can list any other jffs patches that are in the linux-3.2.y branch but not in ours by doing the following:

$ git log --pretty=oneline mitydsp-linux-v3.2...linux-stable/linux-3.2.y | grep jffs
ee2c1c3f794a4b184d51fa9dce27185b2af550fe jffs2: fix handling of corrupted summary length
65417cd845c9e6217bcbdd7c9b91b01d57a09a9d jffs2: Fix crash due to truncation of csize
9634d073d823def05f7b4f2b89b35c4c55be10ac jffs2: Fix segmentation fault found in stress test
62a9eee371b54d6c56f9867b193f4a2410c50c05 jffs2: avoid soft-lockup in jffs2_reserve_space_gc()
22af2c501a2c1980e848ce138031ed1cab278f55 jffs2: remove from wait queue after schedule()
e3bb00e3b611c48b83730c880d8406c35de5bb4b jffs2: Fix lock acquisition order bug in jffs2_write_begin
5a6cc206dfa8de733e848590b43d30706216733c jffs2: Fix lock acquisition order bug in gc path

Note that the "Fix lock acquisition order bug in jffs2_write_begin" commit was recently reverted after some errors were found. I just happened to notice this:
http://osdir.com/ml/linux-kernel/2016-03/msg01015.html

RE: JFFS2 "scheduling while atomic" problem with L138 running linux kernel 3.2 - Added by Bob Duke about 4 years ago

Sorry I missed that, Fred.

You can just make the change (swapping the two lines) yourself for testing, or you can apply the commit by following Jon's instructions.

If that fixes your problem, we can push it to our main branch.
