I’m trying to create an SSM document that will install software on an EC2 instance. For the most part it’s all working, but I’ve tried to add some error handling and it does not behave the way I expect. I’m struggling to find a definitive explanation of what behavior is reasonable to expect, so I could easily be doing something wrong.
I’ve tried to simplify this issue as much as possible into a barebones SSM JSON document that exhibits my problem. I apologize for the length of this example, but thought it best to include the whole thing for context. It’s a sequence of five steps. Step0 is unimportant: it just does some prep and cleans up after a previous invocation. Step1 simply echoes some stuff to a file. Step2 echoes to a file and then performs a bad mv operation. The idea is that this should trigger an error, control should pass to step BuhBye at the bottom, and the whole process should end. Step3 is like Step1, but in this scenario it should never be executed (at least that’s what I thought), since step BuhBye should end it all.
{
  "schemaVersion": "2.2",
  "description": "A very simple HelloWorld SSM document for exploring issues with error handling",
  "mainSteps": [
    {
      "action": "aws:runShellScript",
      "name": "Step0",
      "inputs": {
        "runCommand": [
          "set -e",
          "set -o | grep errexit",
          "echo 'Step0 START...'",
          "rm -rf /tmp/HWSimple.txt"
        ]
      }
    },
    {
      "action": "aws:runShellScript",
      "name": "Step1",
      "onFailure": "step:BuhBye",
      "inputs": {
        "runCommand": [
          "set -e",
          "date >> /tmp/HWSimple.txt",
          "echo 'Step1...' >> /tmp/HWSimple.txt",
          "echo '--------' >> /tmp/HWSimple.txt"
        ]
      }
    },
    {
      "action": "aws:runShellScript",
      "name": "Step2",
      "onFailure": "step:BuhBye",
      "inputs": {
        "runCommand": [
          "set -e",
          "date >> /tmp/HWSimple.txt",
          "echo 'Step2 before bad statement...' >> /tmp/HWSimple.txt",
          "echo '--------' >> /tmp/HWSimple.txt",
          "mv /BOGUS/OldFile /BOGUS/NewFile",
          "if [ $? -ne 0 ]; then date >> /tmp/HWSimple.txt; echo 'Step2 failed' >> /tmp/HWSimple.txt; exit 1; fi",
          "date >> /tmp/HWSimple.txt",
          "echo 'Step2 After bad statement...' >> /tmp/HWSimple.txt",
          "echo '--------' >> /tmp/HWSimple.txt"
        ]
      }
    },
    {
      "action": "aws:runShellScript",
      "name": "Step3",
      "onFailure": "step:BuhBye",
      "isEnd": true,
      "inputs": {
        "runCommand": [
          "set -e",
          "date >> /tmp/HWSimple.txt",
          "echo 'Step3...' >> /tmp/HWSimple.txt",
          "echo '--------' >> /tmp/HWSimple.txt"
        ]
      }
    },
    {
      "action": "aws:runShellScript",
      "name": "BuhBye",
      "inputs": {
        "runCommand": [
          "date >> /tmp/HWSimple.txt",
          "echo 'BuhBye Error Handler...' >> /tmp/HWSimple.txt",
          "echo 'An error occurred. Exiting the SSM document.'",
          "exit 127"
        ]
      }
    }
  ]
}
When I run this and then look at the output file /tmp/HWSimple.txt on the instance, it shows that 1) in Step2, execution stops at the bad mv (the 'Step2 failed' message from my conditional check is never written), and 2) execution then continues to Step3 and, despite the "isEnd": true setting, goes on to run the BuhBye step:
$ cat /tmp/HWSimple.txt
Sat Sep 21 07:23:23 PM UTC 2024
Step1...
--------
Sat Sep 21 07:23:25 PM UTC 2024
Step2 before bad statement...
--------
Sat Sep 21 07:23:28 PM UTC 2024
Step3...
--------
Sat Sep 21 07:23:30 PM UTC 2024
BuhBye Error Handler...
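To separate the shell behavior from the SSM behavior, here is a minimal sketch of Step2's script body run locally under bash, outside SSM. The scratch file created with mktemp and the shortened messages are made up for illustration and are not part of my document. As far as I understand set -e, the shell exits at the failing mv, so the $? check on the next line is never reached, which would match the missing 'Step2 failed' line above:

```shell
#!/usr/bin/env bash
# Local reproduction of Step2's script body, outside SSM.
# LOG is a hypothetical scratch file, not the path used in the SSM document.
LOG="$(mktemp)"

# Run the step body in a child shell so that 'set -e' ends the child,
# not this wrapper script. "$1" inside the child is the log file path.
bash -c '
  set -e
  echo "before bad statement" >> "$1"
  mv /BOGUS/OldFile /BOGUS/NewFile   # fails; with set -e the child shell exits here
  if [ $? -ne 0 ]; then echo "Step2 failed" >> "$1"; exit 1; fi   # not reached
  echo "after bad statement" >> "$1"
' _ "$LOG" 2>/dev/null
status=$?

# The child exits non-zero, and only the first message is in the log.
echo "child exit status: $status"
cat "$LOG"
```

If that reading is right, the script itself does fail with a non-zero exit code, so the open question is only what SSM does with that failure.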
I’m really at a loss, and feel like I’m missing something fundamental. ChatGPT has been pretty helpful for a number of the many problems I’ve stumbled through, but this one seems elusive.