Troubleshooting Failed Cron Jobs: A Debugging Guide
Your monitoring alerts you that a cron job failed. Now what? The job ran fine for months, you haven't changed anything, but suddenly it's not working. Finding the root cause requires systematic investigation.
This guide covers the most common cron job failures and how to diagnose them.
Step 1: Verify the Job Actually Ran
Before debugging the job itself, confirm whether it ran at all. A job that didn't execute has different causes than a job that ran and failed.
Check the system cron logs:

    # Debian/Ubuntu
    grep CRON /var/log/syslog | tail -50

    # CentOS/RHEL
    grep CRON /var/log/cron | tail -50

    # macOS
    log show --predicate 'process == "cron"' --last 1h

Look for your job's command. If it's not in the logs, the job never started. Causes include:
- Crontab syntax error - A typo can stop some or all jobs in the file from running
- Cron service not running - Check with systemctl status cron (the service is named crond on CentOS/RHEL)
- User doesn't have cron access - Check /etc/cron.allow and /etc/cron.deny
If the job did run, move to the next step.
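The log check above can be wrapped in a small helper so it is easy to repeat. This is a minimal sketch rather than a standard tool: the function name job_in_cron_log is hypothetical, and it assumes cron's log lines contain the string CRON together with the job's command, as the grep commands above do.

```shell
# Hypothetical helper: succeed if LOG_FILE has a CRON line mentioning JOB_COMMAND.
# Assumes log lines contain "CRON" and the command text; adjust the first
# grep for systems that log cron activity differently.
job_in_cron_log() {
    log_file=$1
    job_command=$2
    # -F treats the command as a fixed string, so characters such as "." or "*"
    # in paths and arguments are not interpreted as regex syntax.
    grep CRON "$log_file" | grep -qF -- "$job_command"
}

# Example usage (paths are placeholders):
#   if job_in_cron_log /var/log/syslog "/path/to/your/script.sh"; then
#       echo "job started; check its exit code next"
#   else
#       echo "no log entry; check crontab syntax, the cron service, and access files"
#   fi
```

Matching on a fixed string keeps the check predictable when the command contains shell metacharacters.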
Step 2: Check the Exit Code
Every command exits with a code. Zero means success, non-zero means failure. If your job logged its exit code (it should), check what it was:

    # Common exit codes
    # 0 - Success
    # 1 - General error
    # 2 - Misuse of shell command
    # 126 - Command not executable
    # 127 - Command not found
    # 128+n - Killed by signal n (e.g., 137 = killed by SIGKILL)

Exit code 137 (128 + 9) often means the job was killed by the OOM killer or exceeded a memory limit. Exit code 143 (128 + 15) means it received SIGTERM - often from a timeout.
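If a job does not record its exit code yet, a thin wrapper can do it. The following is a sketch under stated assumptions: the function names run_logged and describe_exit, the log line format, and the log path in the usage comment are all hypothetical, and describe_exit only covers the codes listed above.

```shell
# Hypothetical wrapper: run a command, append its exit code to LOG_FILE,
# and return the same exit code to the caller.
run_logged() {
    log_file=$1
    shift
    status=0
    "$@" || status=$?
    printf '%s exit=%d cmd=%s\n' \
        "$(date '+%Y-%m-%dT%H:%M:%S')" "$status" "$*" >> "$log_file"
    return "$status"
}

# Translate a numeric exit code into the meanings listed in this guide.
describe_exit() {
    code=$1
    case "$code" in
        0)   echo "success" ;;
        1)   echo "general error" ;;
        2)   echo "misuse of shell command" ;;
        126) echo "command not executable" ;;
        127) echo "command not found" ;;
        *)
            if [ "$code" -gt 128 ] && [ "$code" -le 255 ]; then
                # 128+n means the process was terminated by signal n
                echo "killed by signal $((code - 128))"
            else
                echo "application-specific error"
            fi
            ;;
    esac
}

# Example crontab command (paths are placeholders), after sourcing this file:
#   run_logged /var/log/myjob-exit.log /path/to/your/script.sh
```

Returning the original status matters: anything watching the job's exit code still sees the real result.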
Step 3: Run the Command Manually
The most effective debugging step: run the exact cron command as the cron user.

    # Switch to the user that runs the cron job
    sudo -u cronuser bash

    # Run the script with the same minimal environment cron uses
    # (cron's default shell is /bin/sh unless the crontab sets SHELL)
    env -i HOME="$HOME" SHELL=/bin/sh PATH=/usr/bin:/bin /path/to/your/script.sh

This surfaces problems that only appear in cron's minimal environment:
- Missing PATH entries
- Missing environment variables
- Permission issues for the cron user
Watch for error messages that wouldn't appear when running as yourself.
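One quick way to find missing PATH entries is to ask a cron-like environment whether each command the script needs can be resolved. This is a minimal sketch: run_like_cron and missing_on_cron_path are hypothetical names, and the environment values are an assumption based on common cron defaults (/bin/sh and PATH=/usr/bin:/bin). Your crontab may override them.

```shell
# Hypothetical helper: run a command with an environment similar to cron's.
# The values below are assumed defaults; match them to your crontab if it
# sets SHELL or PATH explicitly.
run_like_cron() {
    env -i HOME="${HOME:-/}" SHELL=/bin/sh PATH=/usr/bin:/bin "$@"
}

# Print every named command that cannot be found on the cron-like PATH.
# Returns 0 if all commands resolve, 1 otherwise.
missing_on_cron_path() {
    missing=0
    for cmd in "$@"; do
        if ! run_like_cron sh -c 'command -v "$1" >/dev/null 2>&1' sh "$cmd"; then
            echo "not on cron PATH: $cmd"
            missing=1
        fi
    done
    return "$missing"
}

# Example usage (command names are placeholders for your script's dependencies):
#   missing_on_cron_path python3 aws pg_dump
```

Any command reported here needs either an absolute path in the script or a PATH line in the crontab.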
Step 4: Check Recent System Changes
Jobs that worked for months don't break randomly. Something changed. Common culprits:
Package Updates

    # Debian/Ubuntu - check recent package changes
    grep " install\| upgrade" /var/log/dpkg.log | tail -20

    # CentOS/RHEL
    yum history list

A Python upgrade can break scripts. A database client update can change behavior. An SSL certificate update can break API calls.
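When you know roughly when the job last succeeded, it helps to list only the package changes since that date. The sketch below assumes the usual dpkg.log layout of date, time, action, and package on each line. The function name package_changes_since is hypothetical.

```shell
# Hypothetical helper: print install/upgrade entries dated on or after SINCE
# (format YYYY-MM-DD). Assumes each dpkg.log line starts with
# "DATE TIME ACTION PACKAGE ..."; ISO dates compare correctly as strings.
package_changes_since() {
    since=$1
    log_file=${2:-/var/log/dpkg.log}
    awk -v since="$since" '
        # Appending "" forces a string comparison rather than a numeric one.
        ($1 "") >= (since "") && ($3 == "install" || $3 == "upgrade")
    ' "$log_file"
}

# Example usage (the date is a placeholder for the last known good run):
#   package_changes_since 2024-01-15
```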
Configuration Management
If you use Ansible, Puppet, Chef, or similar tools, check recent runs:

    # Example: Ansible log
    tail -100 /var/log/ansible.log

    # Check if crontab was modified
    ls -la /var/spool/cron/crontabs/

Config management can overwrite environment files, change permissions, or modify crontabs.
Disk Space

    df -h

Full disks cause strange failures. Logs can't be written, temp files can't be created, and databases can't operate.
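A job can also run this check itself before doing any work. The following is a sketch, assuming the POSIX output format of df -P, where the fifth column of the second line is the capacity percentage. The function names and the 95 percent threshold in the usage comment are hypothetical.

```shell
# Read `df -P` output on stdin and print the capacity column without "%".
# Assumes the filesystem name contains no spaces, so capacity is field 5.
df_capacity() {
    awk 'NR == 2 { sub(/%$/, "", $5); print $5 }'
}

# Hypothetical check: succeed if the filesystem holding PATH_TO_CHECK is at
# or above THRESHOLD percent used.
disk_usage_at_least() {
    path_to_check=$1
    threshold=$2
    used=$(df -P "$path_to_check" | df_capacity)
    [ "$used" -ge "$threshold" ]
}

# Example usage at the top of a job (path and threshold are placeholders):
#   if disk_usage_at_least /path/to/output/directory 95; then
#       echo "disk nearly full, aborting" >&2
#       exit 1
#   fi
```

Failing early with a clear message is easier to debug than a half-written output file.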
Memory Pressure

    dmesg | grep -i "out of memory" | tail -10

The OOM killer terminates processes when memory runs out. Cron jobs are often victims because they run briefly and aren't "protected."
Step 5: Examine Dependencies
Cron jobs rarely work in isolation. They connect to databases, call APIs, read files, and write output. Each dependency can fail.
Database Connectivity

    # Test database connection
    mysql -u user -p -h hostname -e "SELECT 1"
    # or
    psql -h hostname -U user -d database -c "SELECT 1"

Database issues: connection limits reached, password changed, firewall rules modified.
Network and API Access

    # Test network connectivity
    curl -I https://api.example.com

API issues: SSL certificate expired, endpoint changed, rate limits hit, authentication expired.
File System Access

    # Check file permissions
    ls -la /path/to/required/files
    ls -la /path/to/output/directory

    # Check if directories exist
    test -d /expected/directory && echo "exists" || echo "missing"

File issues: permission changes, directory removed, mount point not mounted.
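The same file checks can run as a preflight step inside the job, so a missing mount or changed permission produces a clear error instead of a confusing failure later. This is a sketch only. The function names require_readable and require_writable_dir and the paths in the usage comment are hypothetical.

```shell
# Fail with a message if the given file cannot be read by the current user.
require_readable() {
    if [ ! -r "$1" ]; then
        echo "preflight: cannot read $1" >&2
        return 1
    fi
}

# Fail with a message if the given directory is missing or not writable.
require_writable_dir() {
    if [ ! -d "$1" ]; then
        echo "preflight: directory missing: $1" >&2
        return 1
    fi
    if [ ! -w "$1" ]; then
        echo "preflight: directory not writable: $1" >&2
        return 1
    fi
}

# Example usage at the top of a job (paths are placeholders):
#   require_readable /path/to/required/files/input.csv || exit 1
#   require_writable_dir /path/to/output/directory || exit 1
```

Because the checks run as the cron user, they test the permissions that matter, not yours.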
Step 6: Check Resource Limits
Cron jobs run with system-imposed limits that can cause silent failures.
Open File Limits

    # Check current limits
    ulimit -a

    # Check system-wide limits for the cron user
    grep cronuser /etc/security/limits.conf

Jobs that open many files or network connections can hit limits.
Memory Limits
If using systemd, check for memory limits:

    systemctl show cron | grep Memory

Container environments (Docker, Kubernetes) impose their own limits.
Step 7: Review Recent Code Changes
If the job's script changed recently, review the diff:

    git log --oneline -10 -- /path/to/script.sh
    git diff HEAD~1 -- /path/to/script.sh

Common code problems:
- New dependency not installed in production
- Hard-coded path that doesn't exist in production
- Environment variable expected but not set
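The last problem in the list is cheap to catch at the start of a script. Below is a minimal sketch of a guard that fails fast when a variable is unset or empty. The function name require_env and the variable names in the usage comment are hypothetical.

```shell
# Hypothetical guard: report every named variable that is unset or empty in
# the current shell. Returns 0 if all are set, 1 if any are missing, and 2
# if a name is not a valid shell identifier.
require_env() {
    missing=0
    for name in "$@"; do
        case $name in
            ''|[0-9]*|*[!A-Za-z0-9_]*)
                echo "require_env: invalid variable name: $name" >&2
                return 2
                ;;
        esac
        # Indirect lookup; safe because the name was validated above.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing required environment variable: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example usage at the top of a job (variable names are placeholders):
#   require_env DB_HOST API_TOKEN || exit 1
```

Reporting all missing variables in one pass avoids fixing them one failed run at a time.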
Common Failure Patterns
The "Works On My Machine" Failure
Symptom: Script runs fine manually but fails in cron
Cause: Different environment. Cron has minimal PATH, no shell profile, different working directory.
Fix: Use absolute paths for everything. Explicitly set required environment variables. Don't rely on shell aliases or functions.
The Midnight Failure
Symptom: Job fails specifically at midnight or day boundaries
Cause: Date arithmetic issues. Logs rolling over. Backup processes competing for resources.
Fix: Stagger jobs away from the hour. Check date handling in scripts. Review what else runs at the same time.
The "First of the Month" Failure
Symptom: Monthly job fails but weekly/daily jobs work fine
Cause: Often data-related. Monthly aggregations process more data. Reports cover different date ranges. More concurrent users during business hours.
Fix: Check for timeouts. Increase resource allocations for monthly jobs. Verify the job can handle a full month's data.
The Gradual Degradation
Symptom: Job gets slower over time, eventually times out
Cause: Growing data. Missing database indexes. Accumulating temp files. Log files getting huge.
Fix: Add monitoring for job duration, not just success/failure. Set up alerts for jobs that take longer than expected.
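Duration can be tracked with a small wrapper even before a monitoring system is in place. This sketch warns on stderr when a run exceeds an expected number of seconds. The function name run_timed and the threshold in the usage comment are hypothetical, and it assumes date +%s is available (it is on GNU and BSD systems).

```shell
# Hypothetical wrapper: run a command, warn on stderr if it took longer than
# MAX_SECONDS, and return the command's own exit code.
run_timed() {
    max_seconds=$1
    shift
    start=$(date +%s)
    status=0
    "$@" || status=$?
    elapsed=$(( $(date +%s) - start ))
    if [ "$elapsed" -gt "$max_seconds" ]; then
        echo "WARNING: job took ${elapsed}s (expected <= ${max_seconds}s): $*" >&2
    fi
    return "$status"
}

# Example crontab command (path and threshold are placeholders), after
# sourcing this file:
#   run_timed 300 /path/to/your/script.sh
```

Since cron mails anything a job writes to stderr when MAILTO is configured, the warning can reach you without extra tooling.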
Building a Debugging Checklist
When a job fails, work through this list:
- Did the job run? (Check cron logs)
- What was the exit code? (Check job logs)
- Can you reproduce manually? (Run as cron user)
- What changed recently? (System, config, code)
- Are dependencies working? (Database, APIs, files)
- Are resources available? (Disk, memory, limits)
Document what you find. The next failure might have the same cause, and future-you will thank present-you for the notes.
Preventing Future Failures
Once you've fixed the immediate issue, prevent recurrence:
- Add more logging to capture details that would have helped
- Add monitoring for the failure mode you discovered
- Document the fix so others can find it
- Consider automation to detect the condition before it causes failure
Most cron job failures are preventable with proper monitoring and logging. Set up monitoring with Cronzy to catch failures before they become incidents.