NAND Flash garbage collector and performance issues

Added by Fred Weiser almost 5 years ago

Hi. We are using the MityDSP L138 module, and are witnessing our main program (Linux/ARM) freeze for long periods of time (>2 minutes). "Top" is showing a status of "D" (blocked from running) for that process, and the jffs2 garbage collector is very active. Most likely the main app is attempting to access flash when the gc is busy with it. We have seen this problem grow worse over time (years) with short periods of blocking growing into longer ones. Concerns are:

1) Is the flash wearing out, so that gc operations take much longer than before?
2) Has the flash simply become fragmented over time, leaving the gc busy reorganizing (perhaps because it is poorly tuned)?
3) Are there any tools (mtd?) available to evaluate what's going on (e.g. flash wear, flash fragmentation)?
4) Are there any tools to monitor file read/write activity to determine which processes are "hogs"?
5) Are there any jffs2 tuning parameters that might help the gc stay ahead of system needs?

If anyone has had similar problems with flash, I'd like to hear how you approached examining the problem.

Thanks!


Replies (6)

RE: NAND Flash garbage collector and performance issues - Added by Jonathan Cormier almost 5 years ago

Fred Weiser wrote:

Hi. We are using the MityDSP L138 module, and are witnessing our main program (Linux/ARM) freeze for long periods of time (>2 minutes). "Top" is showing a status of "D" (blocked from running) for that process, and the jffs2 garbage collector is very active. Most likely the main app is attempting to access flash when the gc is busy with it. We have seen this problem grow worse over time (years) with short periods of blocking growing into longer ones. Concerns are:

You might be able to use strace to determine where your app is hanging (strace ./appname). I'm not sure whether it's installed on the L138 filesystem. Alternatively, you could use gdb to debug your application.
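
If strace isn't available, the "D" state reported by top can also be read straight from /proc. The following is only a minimal sketch, assuming a Python interpreter is present on the target (it may not be on the stock filesystem) and that the kernel exposes /proc/<pid>/wchan; the function names are made up for illustration.

```python
import os


def parse_stat(text):
    """Split one /proc/<pid>/stat line into (pid, comm, state)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so locate the last ')' instead of splitting on whitespace.
    lparen = text.index("(")
    rparen = text.rindex(")")
    pid = int(text[:lparen])
    comm = text[lparen + 1:rparen]
    state = text[rparen + 1:].split()[0]
    return pid, comm, state


def blocked_tasks(proc_root="/proc"):
    """Return (pid, comm, wchan) for every process currently in state D."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError):
            continue  # process exited while we were looking
        if state != "D":
            continue
        try:
            # wchan names the kernel function the task is sleeping in,
            # when the kernel provides it.
            with open(os.path.join(base, "wchan")) as f:
                wchan = f.read().strip()
        except OSError:
            wchan = "?"
        found.append((pid, comm, wchan))
    return found
```

Calling blocked_tasks() once a second while the freeze is happening should show which processes are stuck and, if wchan is populated, roughly where in the kernel. This only looks at processes; individual threads would need the same treatment under /proc/<pid>/task/.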

1) Is the flash wearing out, so that gc operations take much longer than before?
2) Has the flash simply become fragmented over time, leaving the gc busy reorganizing (perhaps because it is poorly tuned)?

Is there sufficient free space? Perhaps the garbage collector is struggling to find free blocks to move data around with.
According to this page, the garbage collector thread is a low-priority task that is intended to run when the system is otherwise idle: http://www.ecoscentric.com/ecospro/doc/html/ref/fs-jffs2-usage.html

This thread provides some commands you could try to pause the gc thread:
http://lists.infradead.org/pipermail/linux-mtd/2009-March/024871.html

Some information on jffs2: http://www.linux-mtd.infradead.org/doc/jffs2.html

3) Are there any tools (mtd?) available to evaluate what's going on (e.g. flash wear, flash fragmentation)?

Not that I'm aware of. It may take some extensive Google searching.

4) Are there any tools to monitor file read/write activity to determine which processes are "hogs"?

http://unix.stackexchange.com/questions/55212/how-can-i-monitor-disk-io
You may have to cross-compile some of these tools if they aren't in the opkg repos.
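
If none of those tools can be installed, a rough substitute is to sample /proc/<pid>/io yourself. This is a sketch under a few assumptions: the kernel was built with task I/O accounting (otherwise the file does not exist), the script runs as root so it can read other processes' counters, and Python is available on the target. The helper names are hypothetical. Note that rchar/wchar count bytes passed through read and write system calls, which is not necessarily what ends up on the flash.

```python
import os


def parse_io(text):
    """Turn the contents of /proc/<pid>/io into a dict of integer counters."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            counters[key.strip()] = int(value)
    return counters


def snapshot(proc_root="/proc"):
    """Collect the I/O counters of every readable process, keyed by pid."""
    snap = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "io")) as f:
                snap[int(entry)] = parse_io(f.read())
        except OSError:
            continue  # process gone, no permission, or accounting disabled
    return snap


def busiest(before, after, field="wchar", count=5):
    """Return the (delta, pid) pairs with the largest growth in `field`."""
    deltas = []
    for pid, now in after.items():
        then = before.get(pid, {})
        deltas.append((now.get(field, 0) - then.get(field, 0), pid))
    deltas.sort(reverse=True)
    return deltas[:count]
```

Taking one snapshot(), sleeping for a few seconds, taking another and passing both to busiest() gives a crude list of the heaviest writers over that interval.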

5) Are there any jffs2 tuning parameters that might help the gc stay ahead of system needs?

Not sure.

If anyone has had similar problems with flash, I'd like to hear how you approached examining the problem.

Thanks!

RE: NAND Flash garbage collector and performance issues - Added by Gregory Gluszek almost 5 years ago

Hi Fred,

We've run into issues here with system boot times slowing down after extended use. This does not sound exactly like what you're experiencing, but the cause and fix we found might be relevant to your issue. The culprit seems to be the systemd journal growing too large. The fix we've found is to explicitly specify SystemMaxUse= in /etc/systemd/journald.conf. I've been setting this to 4M and have not experienced slowdown issues since.

Thanks,
\Greg

RE: NAND Flash garbage collector and performance issues - Added by Fred Weiser over 4 years ago

Answering questions:

The jffs2 flash is divided into 2 partitions of which I'm using only one; I have just over half the space within that partition used (51% according to the df command).

I have run some tests to determine where the "hangs" are occurring: any thread that performs disk I/O blocks, while threads that do no disk I/O keep running. Pausing the GC process keeps them blocked. The tools that can deal with jffs2 partitions are pretty sparse; I debugged using printf in my code.

My belief at this point is flash is functioning ok but is so fragmented that the jffs2 GC is struggling to clear out blocks to erase.

RE: NAND Flash garbage collector and performance issues - Added by Fred Weiser over 4 years ago

After transferring some files to the target flash (about 15 MB worth, overwriting old files), a short time later I encountered the following. I'm now wondering if there are some holes in the jffs2 drivers in use case areas that do not get exercised very often. A reboot was required, and the GC simmered at 2.5% cpu for a minute or two, then rocketed up to 95% for a few minutes before settling down again.

raw node at 0x3ff00000 is off the end of device!
------------[ cut here ]------------
kernel BUG at fs/jffs2/nodemgmt.c:524!
Internal error: Oops - undefined instruction: 0 [#8] PREEMPT
Modules linked in: ads7843(O) fpga_uart(O) fpga_spi(O) fpga_i2c(O) fpga_gpio(O) fpga_ctrl(O) dsplinkk(O)
CPU: 0 Tainted: G D O (3.2.0 #2)
PC is at jffs2_mark_node_obsolete+0x54/0x57c
LR is at jffs2_mark_node_obsolete+0x54/0x57c
pc : [<c01232d8>] lr : [<c01232d8>] psr: 60000013
sp : c5877ef0 ip : c056bf85 fp : 00000000
r10: c517a440 r9 : 00012844 r8 : c5aca02c
r7 : 0000d7bc r6 : 3ff00000 r5 : c517a440 r4 : c5aca000
r3 : 00000011 r2 : c5877ee4 r1 : c049590f r0 : 00000037
Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment kernel
Control: 0005317f Table: c2f0c000 DAC: 00000017
Process sync_supers (pid: 177, stack limit = 0xc5876270)
Stack: (0xc5877ef0 to 0xc5878000)
7ee0: c2c5e000 00000002 c056b34c c003b6d8
7f00: 00000000 00000002 c2c5e000 c2c5e01c c5854e40 00000015 c5a94bc0 c5aca000
7f20: c5aebb48 c517a440 0000d7bc c5aca02c 00012844 c517a440 00000000 c012a5e8
7f40: c5854e40 c05447b8 c05447b8 c5854e70 ffffffff 00000000 c51dc000 00000002
7f60: c5aca000 c5aca02c 02c71000 00000559 c054eacc 00000000 00000000 c012e610
7f80: 00000000 c5aca200 c5aca000 c5876000 c5aca240 c012d294 c5aca200 00000000
7fa0: c5876000 c0084d80 c5876000 00000001 c006cb90 00000013 00000000 c006cbc4
7fc0: 00000000 c581df88 00000000 c0036410 00000000 00000000 00000000 00000000
7fe0: c5877fe0 c5877fe0 c581df88 c0036394 c0009cf8 c0009cf8 00000000 00000000
[<c01232d8>] (jffs2_mark_node_obsolete+0x54/0x57c) from [<c012a5e8>] (jffs2_garbage_collect_pass+0x5c8/0x90c)
[<c012a5e8>] (jffs2_garbage_collect_pass+0x5c8/0x90c) from [<c012e610>] (jffs2_flush_wbuf_gc+0x60/0xe0)
[<c012e610>] (jffs2_flush_wbuf_gc+0x60/0xe0) from [<c012d294>] (jffs2_write_super+0x2c/0x38)
[<c012d294>] (jffs2_write_super+0x2c/0x38) from [<c0084d80>] (sync_supers+0xac/0x118)
[<c0084d80>] (sync_supers+0xac/0x118) from [<c006cbc4>] (bdi_sync_supers+0x34/0x48)
[<c006cbc4>] (bdi_sync_supers+0x34/0x48) from [<c0036410>] (kthread+0x7c/0x84)
[<c0036410>] (kthread+0x7c/0x84) from [<c0009cf8>] (kernel_thread_exit+0x0/0x8)
Code: 3a000003 e1a01006 e59f0504 eb09e0ab (e7f001f2)
---[ end trace c4c663ec3877c7f7 ]---

I wonder how easily the MDK could be rebuilt using UBIFS? At this point, I might be ready to trade one set of problems for another...
/Fred

RE: NAND Flash garbage collector and performance issues - Added by Jonathan Cormier over 4 years ago

Fred Weiser wrote:

Answering questions:

The jffs2 flash is divided into 2 partitions of which I'm using only one; I have just over half the space within that partition used (51% according to the df command).

I have run some tests to determine where the "hangs" are occurring: any thread that performs disk I/O blocks, while threads that do no disk I/O keep running. Pausing the GC process keeps them blocked. The tools that can deal with jffs2 partitions are pretty sparse; I debugged using printf in my code.

My belief at this point is flash is functioning ok but is so fragmented that the jffs2 GC is struggling to clear out blocks to erase.

This is quite possible if you are writing a lot of data to the NAND. If you have no need for the 2nd partition, you could reduce fragmentation by removing it and expanding your main partition. The more free space there is, the easier it is for the garbage collector to do its job.

If you have no need to write to the NAND, you could mount the filesystem read-only to prevent fragmentation and garbage collection.

RE: NAND Flash garbage collector and performance issues - Added by Jonathan Cormier over 4 years ago

Fred Weiser wrote:

After transferring some files to the target flash (about 15 MB worth, overwriting old files), a short time later I encountered the following. I'm now wondering if there are some holes in the jffs2 drivers in use case areas that do not get exercised very often. A reboot was required, and the GC simmered at 2.5% cpu for a minute or two, then rocketed up to 95% for a few minutes before settling down again.

raw node at 0x3ff00000 is off the end of device!

The error seems to indicate that it's trying to access data beyond the end of the NAND's address space during garbage collection. Very strange.
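
One way to sanity-check that reading is to compare the reported offset (0x3ff00000, i.e. 1023 MiB) against the partition sizes listed in /proc/mtd. Below is a minimal sketch, assuming Python is available on the target; the function names are invented for illustration, and the 128 MiB "rootfs" partition in the usage note is a made-up value, not the actual L138 layout.

```python
import re

# A /proc/mtd data line looks like: mtd0: 08000000 00020000 "rootfs"
# (device, size in hex, erase block size in hex, quoted name).
_MTD_LINE = re.compile(
    r'^(mtd\d+):\s+([0-9a-fA-F]+)\s+([0-9a-fA-F]+)\s+"(.*)"$'
)


def parse_proc_mtd(text):
    """Parse the contents of /proc/mtd into a list of partition dicts."""
    parts = []
    for line in text.splitlines():
        m = _MTD_LINE.match(line.strip())
        if m:  # the "dev: size erasesize name" header line does not match
            parts.append({
                "dev": m.group(1),
                "size": int(m.group(2), 16),
                "erasesize": int(m.group(3), 16),
                "name": m.group(4),
            })
    return parts


def offset_fits(offset, part):
    """True if a byte offset lies inside the given partition."""
    return 0 <= offset < part["size"]
```

For example, with a hypothetical line mtd0: 08000000 00020000 "rootfs", offset_fits(0x3ff00000, part) is False, which would be consistent with the "off the end of device" message. On the real board the text would come from open("/proc/mtd").read().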

------------[ cut here ]------------
kernel BUG at fs/jffs2/nodemgmt.c:524!
Internal error: Oops - undefined instruction: 0 [#8] PREEMPT
Modules linked in: ads7843(O) fpga_uart(O) fpga_spi(O) fpga_i2c(O) fpga_gpio(O) fpga_ctrl(O) dsplinkk(O)
CPU: 0 Tainted: G D O (3.2.0 #2)
PC is at jffs2_mark_node_obsolete+0x54/0x57c
LR is at jffs2_mark_node_obsolete+0x54/0x57c
pc : [<c01232d8>] lr : [<c01232d8>] psr: 60000013
sp : c5877ef0 ip : c056bf85 fp : 00000000
r10: c517a440 r9 : 00012844 r8 : c5aca02c
r7 : 0000d7bc r6 : 3ff00000 r5 : c517a440 r4 : c5aca000
r3 : 00000011 r2 : c5877ee4 r1 : c049590f r0 : 00000037
Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment kernel
Control: 0005317f Table: c2f0c000 DAC: 00000017
Process sync_supers (pid: 177, stack limit = 0xc5876270)
Stack: (0xc5877ef0 to 0xc5878000)
7ee0: c2c5e000 00000002 c056b34c c003b6d8
7f00: 00000000 00000002 c2c5e000 c2c5e01c c5854e40 00000015 c5a94bc0 c5aca000
7f20: c5aebb48 c517a440 0000d7bc c5aca02c 00012844 c517a440 00000000 c012a5e8
7f40: c5854e40 c05447b8 c05447b8 c5854e70 ffffffff 00000000 c51dc000 00000002
7f60: c5aca000 c5aca02c 02c71000 00000559 c054eacc 00000000 00000000 c012e610
7f80: 00000000 c5aca200 c5aca000 c5876000 c5aca240 c012d294 c5aca200 00000000
7fa0: c5876000 c0084d80 c5876000 00000001 c006cb90 00000013 00000000 c006cbc4
7fc0: 00000000 c581df88 00000000 c0036410 00000000 00000000 00000000 00000000
7fe0: c5877fe0 c5877fe0 c581df88 c0036394 c0009cf8 c0009cf8 00000000 00000000
[<c01232d8>] (jffs2_mark_node_obsolete+0x54/0x57c) from [<c012a5e8>] (jffs2_garbage_collect_pass+0x5c8/0x90c)
[<c012a5e8>] (jffs2_garbage_collect_pass+0x5c8/0x90c) from [<c012e610>] (jffs2_flush_wbuf_gc+0x60/0xe0)
[<c012e610>] (jffs2_flush_wbuf_gc+0x60/0xe0) from [<c012d294>] (jffs2_write_super+0x2c/0x38)
[<c012d294>] (jffs2_write_super+0x2c/0x38) from [<c0084d80>] (sync_supers+0xac/0x118)
[<c0084d80>] (sync_supers+0xac/0x118) from [<c006cbc4>] (bdi_sync_supers+0x34/0x48)
[<c006cbc4>] (bdi_sync_supers+0x34/0x48) from [<c0036410>] (kthread+0x7c/0x84)
[<c0036410>] (kthread+0x7c/0x84) from [<c0009cf8>] (kernel_thread_exit+0x0/0x8)
Code: 3a000003 e1a01006 e59f0504 eb09e0ab (e7f001f2)
---[ end trace c4c663ec3877c7f7 ]---

I wonder how easily the MDK could be rebuilt using UBIFS? At this point, I might be ready to trade one set of problems for another...
/Fred

I haven't used UBIFS on the L138, though I have used it on our 335x. Which kernel are you using? I would recommend using the 3.2 kernel for UBIFS and possibly applying the 3.2 backports: git://git.infradead.org/users/dedekind/ubifs-v3.2.git

It shouldn't be difficult to create a UBIFS filesystem image from the MDK filesystem tarball. The steps would be similar to the following instructions, but modified for the L138 NAND: https://support.criticallink.com/redmine/projects/armc8-platforms/wiki/UBIFS_Nand_Boot#Creating-UBIFS-file-system
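
The main thing that changes between NAND parts is the geometry passed to mkfs.ubifs and ubinize, in particular the logical erase block (LEB) size, which is the physical erase block minus the space UBI reserves for its two 64-byte headers. The arithmetic can be sketched as below. The geometry in the usage note (128 KiB erase blocks, 2048-byte pages) is an assumption for illustration only; check the actual L138 NAND, for example with mtdinfo, before using any of these numbers.

```python
UBI_HDR_SIZE = 64  # size in bytes of each UBI header (EC and VID)


def align_up(value, alignment):
    """Round value up to the next multiple of alignment."""
    return -(-value // alignment) * alignment


def leb_size(peb_size, min_io, subpage=None):
    """Compute the UBI logical erase block size for a NAND geometry.

    peb_size: physical erase block size in bytes
    min_io:   minimum I/O unit (the NAND page size) in bytes
    subpage:  sub-page size in bytes, or None if sub-pages are not used
    """
    if subpage is None:
        subpage = min_io
    # The EC header sits at offset 0; the VID header starts at the next
    # sub-page boundary; data starts at the next page boundary after it.
    vid_hdr_offset = align_up(UBI_HDR_SIZE, subpage)
    data_offset = align_up(vid_hdr_offset + UBI_HDR_SIZE, min_io)
    return peb_size - data_offset
```

With the assumed geometry, leb_size(128 * 1024, 2048) gives 126976 without sub-pages and leb_size(128 * 1024, 2048, 512) gives 129024 with 512-byte sub-pages. The result is the value that would go to mkfs.ubifs as the LEB size (-e), alongside the page size as the minimum I/O size (-m).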
