The Arduino Watchdog Timer
Most microcontrollers, including the Atmel chips used on the Arduino, have a watchdog timer. It’s a sort of policeman that resets the micro if it thinks the program has stopped working properly. You signal that the program is working correctly simply by calling wdt_reset() each time through the loop, after enabling the watchdog in the setup routine. So although the following program will get stuck in the while loop, the watchdog timer will reset the micro one second later each time, and the program will do at least part of its job. Of course, it is better to fix the bug causing the lockup, and for that we can use the two-stage watchdog timer: interrupt and reset.
    #include <avr/wdt.h>

    void setup()
    {
      Serial.begin(9600);
      wdt_enable(WDTO_1S);
    }

    void loop()
    {
      // the program is alive...for now.
      wdt_reset();

      Serial.println("Hello");

      while (1)
        ; // do nothing. the program will lockup here.

      Serial.println("Can't get here");
    }
A word of caution here: there is a bug in the early Mega2560 bootloader. If you turn on the watchdog timer on this board, it can get stuck in the bootloader. Update the bootloader using Arduino IDE version 1.0.4 (or later) and all should be well.
Watchdog Interrupt and Reset
The watchdog timer on the Arduino microcontrollers can operate in four different modes:
- Disabled: the program simply gets stuck if it runs into an infinite loop
- Reset: if wdt_reset() is not called within every timeout period, the watchdog timer will reset the program and it will start again from the beginning
- Interrupt: an interrupt will be generated if wdt_reset() isn’t called within every timeout period
- Interrupt + reset: a combination of the last two. If wdt_reset() isn’t called within the timeout period, the watchdog timer will first raise an interrupt. Then, if wdt_reset() isn’t called within another timeout period, the program will reset and start again
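As a rough illustration of the four modes listed above, the following host-side C++ sketch models what happens each time a full timeout period passes without a call to wdt_reset(). It is a simplified model under stated assumptions, not AVR register code: the names WatchdogModel, WatchdogEvent and onTimeout are made up for this example, and the two flags only loosely stand in for the watchdog's reset-enable and interrupt-enable settings.

```cpp
// Hypothetical host-side model of the four watchdog modes (not AVR register code).
enum class WatchdogEvent { None, Interrupt, Reset };

struct WatchdogModel {
    bool resetEnabled;      // stands in for the "reset" setting
    bool interruptEnabled;  // stands in for the "interrupt" setting

    // Call once for every timeout period that passes without wdt_reset().
    WatchdogEvent onTimeout() {
        if (interruptEnabled) {
            // In interrupt + reset mode the first timeout uses up the
            // interrupt stage, so the following timeout resets the micro.
            if (resetEnabled) {
                interruptEnabled = false;
            }
            return WatchdogEvent::Interrupt;
        }
        if (resetEnabled) {
            return WatchdogEvent::Reset;
        }
        return WatchdogEvent::None;  // disabled: the program simply stays stuck
    }
};
```

With both flags set, the first call to onTimeout() reports an interrupt and the second reports a reset, which is the two-stage behaviour the crash tracker relies on.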
It is the two interrupt modes that are useful for detecting lockups. Why? When an interrupt fires, the micro pushes the current program counter onto the stack. That’s the address where the Arduino was executing code when the watchdog fired, and a good place to look for the bug. Of course, if the program has just crashed it is going to be a bit risky trying to send the address out the serial port. Instead, the address can be saved in the EEPROM and printed the next time the program starts.
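To make the idea concrete, here is a minimal sketch of how a return address could be rebuilt from the bytes the micro pushed onto the stack. It assumes the bytes are read starting at the lowest memory address and that the most significant byte comes first, which is how AVR parts lay out return addresses. It also assumes nothing else sits on top of them; a compiler-generated interrupt prologue normally pushes registers as well, so real handler code has to account for that. The function name decodeReturnAddress is hypothetical and is not taken from the ApplicationMonitor source.

```cpp
#include <cstddef>
#include <cstdint>

// Rebuild a return address from bytes read off the stack, lowest memory
// address first. Assumes the most significant byte is stored first.
// byteCount would be 2 on parts with a 16-bit program counter and 3 on
// larger parts such as the Mega2560.
std::uint32_t decodeReturnAddress(const std::uint8_t* stackBytes, std::size_t byteCount) {
    std::uint32_t address = 0;
    for (std::size_t i = 0; i < byteCount; ++i) {
        address = (address << 8) | stackBytes[i];
    }
    return address;
}
```

For example, the two bytes 0x03 and 0x98 decode to 0x398. On AVR this value counts 16-bit instruction words rather than bytes.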
An Arduino Crash Location Detection Example
We packaged all this up into a simple example, which is available on GitHub. The program works with the Arduino IDE or our Visual Studio Arduino build tool, and it is made up of three key files:
- ApplicationMonitor.h / ApplicationMonitor.cpp: this is the core part of the implementation. It handles the watchdog interrupt, stores the information in the EEPROM and prints diagnostic information at startup.
- Program.cpp: an example program which locks up after a few iterations to illustrate using the library.
The example program shows how the crash handler could be used in a typical program. The ApplicationMonitor.h and ApplicationMonitor.cpp files could be included in your program folder, as here, or moved into a library folder to be used across several programs. The whole test program is reproduced below, but the key lines are:
- Include the crash monitor header file.
    #include "ApplicationMonitor.h"
- Create a global monitor object. This must be called ApplicationMonitor, otherwise you will get errors when you build the program. You can optionally set the EEPROM address where data is saved and the number of crash reports that will be saved here. By default, 10 reports are saved, starting from address 500.
    Watchdog::CApplicationMonitor ApplicationMonitor;
- Write the crash data out the serial port, at a convenient location:
    ApplicationMonitor.Dump(Serial);
- Initialize the application monitor and set the timeout. This would normally live in the setup function, as here:
    ApplicationMonitor.EnableWatchdog(Watchdog::CApplicationMonitor::Timeout_4s);
- Keep calling the IAmAlive function within the timeout period to prevent automatic reset by the watchdog timer. This would normally happen at the top of the loop function.
    ApplicationMonitor.IAmAlive();
- Finally, you can save a 32-bit value with the next crash report. This could record any information that might be helpful for tracking down the bug, such as the current program mode.
    ApplicationMonitor.SetData(g_nIterations++);
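The fixed number of reports and the starting EEPROM address mentioned in the list above suggest a small circular log. The sketch below shows one way such a log could map a report slot to an EEPROM address and wrap around once it is full. The struct CrashLogLayout, its field names and the header and report sizes are assumptions made for illustration; they are not taken from the ApplicationMonitor source, so check the library before relying on any of these numbers.

```cpp
#include <cstdint>

// Hypothetical layout: a small header followed by fixed-size report slots.
struct CrashLogLayout {
    std::uint16_t baseAddress;  // first EEPROM byte used by the log
    std::uint8_t  headerSize;   // bytes reserved for the header (assumed)
    std::uint8_t  reportSize;   // bytes per crash report (assumed)
    std::uint8_t  maxReports;   // number of report slots

    // EEPROM address of the report slot with the given index.
    std::uint16_t slotAddress(std::uint8_t slot) const {
        return static_cast<std::uint16_t>(baseAddress + headerSize + slot * reportSize);
    }

    // Slot to write after `slot`, wrapping to 0 so older reports get overwritten.
    std::uint8_t nextSlot(std::uint8_t slot) const {
        return static_cast<std::uint8_t>((slot + 1) % maxReports);
    }
};
```

With the default base address of 500 and 10 slots, and assuming a 2-byte header and 6-byte reports, slot 0 would start at address 502 and slot 9 would be followed by slot 0 again.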
The complete example program:
     1  #include "ApplicationMonitor.h"
     2
     3  Watchdog::CApplicationMonitor ApplicationMonitor;
     4
     5  // An LED to flash while we can.
     6  const int gc_nLEDPin = 13;
     7
     8  // countdown until the program locks up.
     9  int g_nEndOfTheWorld = 15;
    10
    11  // number of iterations completed.
    12  int g_nIterations = 0;
    13
    14  void setup()
    15  {
    16    Serial.begin(9600);
    17    pinMode(gc_nLEDPin, OUTPUT);
    18    Serial.println("Ready");
    19
    20    ApplicationMonitor.Dump(Serial);
    21    ApplicationMonitor.EnableWatchdog(Watchdog::CApplicationMonitor::Timeout_4s);
    22    //ApplicationMonitor.DisableWatchdog();
    23
    24    Serial.println("Hello World!");
    25  }
    26
    27  void loop()
    28  {
    29    ApplicationMonitor.IAmAlive();
    30    ApplicationMonitor.SetData(g_nIterations++);
    31
    32    Serial.println("The end is nigh!!!");
    33
    34    digitalWrite(gc_nLEDPin, HIGH); // turn the LED on (HIGH is the voltage level)
    35    delay(200); // wait 200 ms
    36    digitalWrite(gc_nLEDPin, LOW); // turn the LED off by making the voltage LOW
    37    delay(200); // wait 200 ms
    38
    39    if (g_nEndOfTheWorld == 0)
    40    {
    41      Serial.println("The end is here. Goodbye cruel world.");
    42      while(1)
    43        ; // do nothing until the watchdog timer kicks in and resets the program.
    44    }
    45
    46    --g_nEndOfTheWorld;
    47  }
Here’s a screenshot of the MegunoLink Pro monitor window after the program has been running for a minute or two showing the program output:
You can see it has saved 9 crash reports so far, and the next one will be saved at location 9 (of 10). The interesting part, the location in our program where the lockup occurred, is included too. For crash 8, the program was executing instructions at program memory address 0x398. For tracking down the offending line of code in the program, we need the byte address though. In this case: 0x730 (the byte address is simply twice the program memory address).
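The conversion described above is a single multiplication, sketched here as a small helper. The function name wordToByteAddress is made up for this example; the factor of two comes from the text above.

```cpp
#include <cstdint>

// Convert a program memory (word) address, as reported by the crash log,
// into the byte address used in the disassembly listing.
constexpr std::uint32_t wordToByteAddress(std::uint32_t wordAddress) {
    return wordAddress * 2;
}

// The crash report discussed above: word address 0x398 is byte address 0x730.
static_assert(wordToByteAddress(0x398) == 0x730, "0x398 words is 0x730 bytes");
```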
The hunting of the source code
By itself, the program memory address is not very useful. It isn’t a line number or file name; it is simply the location in program memory of the instruction that the Arduino was executing when the lock-up occurred. To get from the program memory address to the offending line of source code, we need to use the disassembler. To upload your program to an Arduino, the IDE first compiles the code in each file into instructions the microcontroller can understand (machine code), creating a separate object file for each source file. A linker joins these together into an Executable and Linkable Format, or ELF, file. The ELF file contains the instructions at each program address, and the disassembler lets us connect those addresses back to the original program code using this magical incantation:
    avr-objdump -d -S -j .text CrashTracking.elf > Disassembly.txt
Here, CrashTracking.elf is the linker output for the crashing program and avr-objdump is part of the Arduino installation that you’ll find in “arduino\hardware\tools\avr\bin”. You’ll need to add this folder to your system path to run the object dump program. Running this command will create a text file named “Disassembly.txt”. The disassembly file contains the original source code mixed in with the assembly instructions, as in the excerpt below. Lines with assembly instructions start with the program counter address. Typically, you’ll see a line of source code followed by the assembly instructions that the microcontroller executes to implement the source line.
     1  if (g_nEndOfTheWorld == 0)
     2  718: 80 91 4c 02 lds r24, 0x024C
     3  71c: 90 91 4d 02 lds r25, 0x024D
     4  720: 00 97 sbiw r24, 0x00 ; 0
     5  722: 39 f4 brne .+14 ; 0x732 <loop+0x82>
     6  {
     7  Serial.println("The end is here. Goodbye cruel world.");
     8  724: 8b e8 ldi r24, 0x8B ; 139
     9  726: 94 e0 ldi r25, 0x04 ; 4
    10  728: 63 e1 ldi r22, 0x13 ; 19
    11  72a: 72 e0 ldi r23, 0x02 ; 2
    12  72c: 0e 94 aa 09 call 0x1354 ; 0x1354 <_ZN5Print7printlnEPKc>
    13  730: ff cf rjmp .-2 ; 0x730 <loop+0x80>
    14  while(1)
    15  ; // do nothing until the watchdog timer kicks in and resets the program.
    16  }
    17
    18  --g_nEndOfTheWorld;
    19  732: 01 97 sbiw r24, 0x01 ; 1
    20  734: 90 93 4d 02 sts 0x024D, r25
    21  738: 80 93 4c 02 sts 0x024C, r24
    22  73c: 08 95 ret
So to find the offending lock-up, search through the disassembly until you find an assembly line with the same address as the byte address in the application monitor output. In this case, the lock-up occurs at address 0x730, or line 13 in the listing above. Ah-ha! The problem is obvious: address 0x730 simply jumps back to itself in an endless loop that will end only with the collapse of the Universe (or when the power plug is pulled, whichever comes first). But you don’t need to read the assembly code to make use of the application monitor: lines 14 and 15 show the corresponding program code (sometimes the program code comes after the instructions it generates rather than before them). It is an endless while loop keeping the Arduino busy doing absolutely nothing. Remove this line and the program will cheerfully, like all soothsayers, predict a doom that never arises.
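The search described above can also be automated. The following sketch scans the lines of a disassembly file for the first instruction line whose leading hex label matches a byte address. It assumes instruction lines begin with optional whitespace, a hex address and a colon, as in the excerpt above; the function name findInstructionLine is hypothetical, and a source line that happens to start with a hex-looking label could produce a false match.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Returns the zero-based index of the first line that starts with
// "<hex address>:" for the given byte address, or -1 if there is none.
int findInstructionLine(const std::vector<std::string>& lines, std::uint32_t byteAddress) {
    for (std::size_t i = 0; i < lines.size(); ++i) {
        const std::string& line = lines[i];
        const std::size_t start = line.find_first_not_of(" \t");
        if (start == std::string::npos) continue;  // blank line
        const std::size_t colon = line.find(':', start);
        if (colon == std::string::npos || colon == start) continue;
        const std::string label = line.substr(start, colon - start);
        // Skip anything that is not a short, purely hexadecimal label.
        if (label.size() > 8 ||
            label.find_first_not_of("0123456789abcdefABCDEF") != std::string::npos) {
            continue;
        }
        if (std::stoul(label, nullptr, 16) == byteAddress) {
            return static_cast<int>(i);
        }
    }
    return -1;
}
```

Reading Disassembly.txt into a vector of lines and calling findInstructionLine(lines, 0x730) would point at the rjmp instruction; the source lines just before or after that index show the code responsible.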
The end (of all bugs)
Okay, perhaps this won’t track down every bug in your program, but if you think this will be a useful tool in your quest for the perfect program, post a comment below. You can download the Arduino lockup monitor from GitHub. And download MegunoLink Pro too. Once you’ve fixed the bugs in your program, you can use MegunoLink Pro to monitor, plot or log your data in real time, or build a simple user interface to control your Arduino program from your PC.
Leave a Comment
Great stuff, thanks!
Very good recommendations. I inserted it in my home automation. Thanks!
I’m using Due SAM3X on 1.6.7 ide and I have a bug which made me crazy for days… I’ve found this very helpful topic but when I try to implement the files, I get the error while compiling…
In file included from F:\# All docs ever\Documents\# Arduino\AirPlatform_GND_V4.3-bmpdebug_WD\AirPlatform_GND_V4.3-bmpdebug_WD.ino:47:0:
sketch\ApplicationMonitor.h:10:21: fatal error: avr/wdt.h: No such file or directory
I’ve copied the files from github to C:\Program Files (x86)\Arduino\libraries\CrashTracking as a folder and after one try and getting the error, I’ve also copied ApplicationMonitor.h and .cpp files inside my sketch folder too but still no use… PS. I did restart the ide after the process and also reboot the pc. A conflict with SAM3X uC ?
Hi Emre, unfortunately it looks like the Due is quite a different device from the traditional AVR boards (Uno etc.). It can’t find wdt.h because it doesn’t exist for the Due in that location, and judging from this
http://forum.arduino.cc/index.php?topic=358228.0
it’s probably not compatible anyway. You would need to figure out how to port the crash detection from AVR to the Due.
Good luck
Phil
Can this be used to dump the memory addresses into a string rather than to Console / serial? Thanks
It should be possible but is not part of the current version. You would need to modify our application monitor to add the functionality.
Cheers
Phil
Excellent – This dug me out of a bad situation very quickly 🙂
As it stands this will loop forever – resetting the Arduino each time it freezes.
For very rare events how could it be made to stop resetting the Arduino after 1st trap??
ie so that the report dump is on the terminal screen without having to scroll back 3days.
Obviously this stops the application running but this is more convenient in a debug/test environment.
TIA
Bob
The approach we use is to implement a serial command that can dump the crash log whenever you like. So, for example, you could use our serial library to implement a command that calls ApplicationMonitor.Dump(Serial);
For our serial command handler, check out: https://www.megunolink.com/documentation/getting-started-process-serial-commands/
Does it overwrite oldest entry after DEFAULT_ENTRIES number of resets has happened? I have reset my micro many times and the addresses of the 10 entries is not changing which is suspicious
Thanks
It will overwrite older entries. The address is only saved during watchdog faults however. A normal reset won’t do it.
I am running your Program.cpp “sketch” above on an Arduino Nano, using Arduino IDE 1.6.12. Everything seems to run correctly as far as line 41 (“The end is here. Goodbye cruel world.”), and then the micro evidently goes into the while(1) loop. The problem is that it stays there and the watchdog timer does not manage to reset. While in this “spin” the built-in LED (13) flashes very rapidly (about 10 on/off cycles per second), for some reason. There should be no LED flashing in the simulated lock-up loop. If that rapid flashing is somehow caused by repeated resetting, it still doesn’t explain why it happens at approximately 10 Hz, since the watchdog timer should fire at an interval of 4 seconds.
I can’t understand what is going wrong here. The only thing I have changed in your original crashTracking code is delay(200) to delay(1000) in lines 35 and 37, to slow the loop down a little.
Grateful for any tips,
Martin
I found a post from 2012 by guru Nick Gammon on the Arduino forum addressing a similar issue with a Nano.
I quote: “Sounds like … have one of the bootloaders that doesn’t reset the watchdog timer. The flash you see is the bootloader initial flash of pin 13. Then it resets again and flashes it again.”
So now I’ve burned the Optiboot bootloader to my board, and everything is hunky-dory.
Great great code ! Thank you !
This looks like a nice way of wrapping up watchdog usage with Arduino.
I haven’t been able to benefit from it yet, though. I am using a bare-metal ATmega2560 without a bootloader
and Visual Studio with Visual Micro. My own application works as expected.
So I wanted to test your sample application on the same chip.
I built and flashed your sample application onto the chip without problems. For uploading I am using Atmel Studio.
But all I get is a continuous reset: the terminal window only prints “Application�Ready” and LED 13 blinks rapidly.
I know about the issue with the old bootloader, but in my case there is no bootloader.
Any idea about that?
Thank you in advance for your input.
Hi Sener, not sure about this one. Did you have any luck getting it going?
How would you suggest clearing the wdtLog in eeprom? I’m thinking only the headers need clearing. Any suggestions how?
Easiest way would be something like:
    void CCrashReport::Clear()
    {
      CApplicationMonitorHeader Header;
      Header.m_uSavedReports = 0;
      Header.m_uNextReport = 0;
      SaveHeader(Header);
    }
I don’t understand this line:
if (g_nEndOfTheWorld == 0)
If g_nEndOfTheWorld is set to 15, where is it decremented to reach zero?
Hi Allan, line 46: “--g_nEndOfTheWorld;”
looks like 6 Bytes stored for a UNO, example: 0B 03 header, then record 1 = 0A 48 25 00 00 00
byte address is not stored, simple 2x ‘word address’
0B = total records saved (weird, should be 0A)
03 = next position to save
then each record x 10 = 0A 48 25 00 00 00
0A 48 = 2 bytes for word address
25 00 00 00 ?? what is the uData that is stored ? where does it come from ? and easiest way to add an additional byte to store context if triggered for a manual reset ?
Any advice on why combining this code with other code that uses USART interrupts kills the micro? I guess it’s stuck in the bootloader.
The program actually locked up while downloading new code over USB-Serial on a Leonardo, so I assume the code download took longer than the WDT period (set at 4s)? So the code was only half-loaded? Is the bootloader now damaged and unable to load new code, or stuck in an infinite loop?
appreciate your advice, stuck…