Incident response – Complete LMS

This is a real incident from a client we helped last year, anonymised. The numbers and timings are accurate. We’ve changed identifying details and kept it deliberately practical — if you’re ever in this position, the order things happen in is the part that matters.

The call

2:11 AM UK time, Tuesday in mid-November. A WhatsApp message:

“Moodle is down. Showing ‘Error reading from database’. We have final exams starting Thursday at 09:00. Please help.”

The client was a vocational college in continental Europe, roughly 1,400 students, running Moodle™ 4.1 LTS on a single VPS (Apache, PHP-FPM, MariaDB 10.6) at a generic hosting provider. They were not, at this point, our managed-hosting client. We handled their off-site backups and nothing else.

The on-call engineer (me) replied within four minutes — fast for that hour because the WhatsApp ringer had been on for a different reason — and we got to work.

What we knew at 2:15 AM

Moodle™ was showing the database-connection error page.
The site had been working at 22:00 Monday, when a teacher posted an announcement.
Two students had reported the error at 00:40 and 01:20.
The host’s status page was green. No incidents announced.
We had off-site backups from 03:00 UTC the previous day — about 23 hours old.
Final exams started in roughly 54 hours.

The exam window mattered because Moodle™ was the delivery mechanism: the exam was an in-person, on-paper assessment, but the question papers, the rubrics, the timing, and crucially the assignment-submission window for the take-home component all lived in Moodle™ courses. Losing Moodle™ for 30 hours would have been bad but recoverable. Losing it on Thursday morning would have been a serious incident for the college.

02:15–02:35 — Triage

I asked the client to do three things and not log into anything else:

1. Give me read-only SSH access to the VPS.
2. Forward me the last 200 lines of /var/log/mysql/error.log and /var/log/apache2/error.log.
3. Take a photo of the actual error page on the site so I had the literal wording.

Step 3 sounds trivial. It’s actually one of the most useful things you can ask for under pressure: the exact text of the visible error often tells you which subsystem failed first, and clients often paraphrase (“the database is broken”) in a way that loses information.

While they did that, I pulled up our backup status. The last verified restore for this client was 11 days earlier, the standard monthly check. It had passed. That meant the backup we had on the shelf was almost certainly restorable — useful to know before we needed it.
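If you have shell access yourself, the log request in step 2 can be scripted so the tails arrive as one bundle. This is a minimal sketch: collect_log_tails and the /tmp/triage destination are hypothetical names for illustration, and only the two log paths come from the incident.

```shell
# collect_log_tails OUT_DIR LINES FILE...
# Copies the last LINES lines of each readable log into OUT_DIR and
# records any log that could not be read, so nothing goes missing silently.
collect_log_tails() {
    out_dir=$1
    lines=$2
    shift 2
    mkdir -p "$out_dir" || return 1
    for log in "$@"; do
        if [ -r "$log" ]; then
            # Flatten the path into a file name: var_log_mysql_error.log
            name=$(printf '%s' "$log" | sed 's|^/||; s|/|_|g')
            tail -n "$lines" "$log" > "$out_dir/$name"
        else
            echo "unreadable: $log" >> "$out_dir/MISSING.txt"
        fi
    done
}

# Example with the paths from this incident:
# collect_log_tails /tmp/triage 200 /var/log/mysql/error.log /var/log/apache2/error.log
```

Recording unreadable logs matters here: a log you cannot read is itself a clue about permissions.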

The Apache error log was full of:

PHP Fatal error: Uncaught dml_connection_exception:
Error connecting to database

The MariaDB error log was much more interesting:

[ERROR] [FATAL] InnoDB: Table mdl_question_attempt_step_data in
file ./moodle/mdl_question_attempt_step_data.ibd is encrypted but
encryption service or used key_id 1 is not available.

That was the smoking gun. The MariaDB encryption key file had become unreadable. The database was up, but some tables couldn’t be opened. Moodle™’s connection check tries to query a system table that, on this site, lived in the same tablespace, so every page returned the connection error.

This is not a Moodle™ bug. It’s a MariaDB-level failure with several possible causes: a botched OS upgrade, a permissions change on /etc/mysql/encryption/, a stray chmod -R from a backup script, or, as it turned out here, an unattended-upgrades run that had silently updated mariadb-server-core two hours earlier and changed the ownership of the key directory.
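Both halves of that diagnosis can be checked from the shell: who owns the key file, and which packages changed recently. A rough sketch, assuming GNU stat and the standard Debian/Ubuntu dpkg log location; check_owner and recent_package_changes are hypothetical helper names, not tools we shipped.

```shell
# check_owner PATH EXPECTED_USER:GROUP
# Prints the actual owner; returns 1 on a mismatch, 2 if stat fails.
check_owner() {
    path=$1
    expected=$2
    actual=$(stat -c '%U:%G' "$path") || return 2   # GNU coreutils syntax
    if [ "$actual" = "$expected" ]; then
        echo "ok: $path is $actual"
    else
        echo "MISMATCH: $path is $actual, expected $expected"
        return 1
    fi
}

# recent_package_changes PACKAGE_PREFIX [DPKG_LOG]
# Lists install/upgrade lines for packages whose name starts with the prefix.
recent_package_changes() {
    prefix=$1
    log=${2:-/var/log/dpkg.log}
    grep -E " (install|upgrade) ${prefix}" "$log"
}

# Example with this incident's path and package family:
# check_owner /etc/mysql/encryption/keyfile mysql:mysql
# recent_package_changes mariadb
```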

02:35–02:55 — Decide between repair and restore

We had two paths:

Path A: fix the encryption-key permission, restart MariaDB, and the site comes back. Estimated time: 10–30 minutes if it’s only the permissions.
Path B: restore the previous day’s backup onto a fresh VM, point DNS at it, and accept the loss of one day of forum posts, file submissions and grade changes. Estimated time: 90–120 minutes for restore, plus DNS propagation (low TTL was already in place because we’d insisted on it during onboarding).

The temptation in a crisis is to try Path A first because it’s faster if it works. But the cost of failing Path A and then falling back to Path B is the time you already spent on Path A. So we did both in parallel.

I asked the client to authorise spinning up a recovery VM with the same provider in the same region. Path B started running on a new box while I worked on Path A on the live one.
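To put numbers on that trade-off, here is the worst case for each approach using the upper estimates from the two paths. It is back-of-envelope arithmetic only: it ignores DNS propagation and assumes Path A fails at the very end of its estimate.

```shell
# Upper-bound estimates from the two paths, in minutes.
path_a_max=30
path_b_max=120

# Sequential: Path A is tried first, fails, and only then does Path B start.
sequential_worst=$((path_a_max + path_b_max))

# Parallel: both start together, so the worst case is the slower of the two.
if [ "$path_a_max" -gt "$path_b_max" ]; then
    parallel_worst=$path_a_max
else
    parallel_worst=$path_b_max
fi

echo "sequential worst case: ${sequential_worst} min"   # 150 min
echo "parallel worst case:   ${parallel_worst} min"     # 120 min
```

The half hour saved is the whole argument: running in parallel costs a second VM, not extra time.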

02:55–03:20 — Path A: the fix

The encryption key file lived at /etc/mysql/encryption/keyfile. Owner had been changed during the package update from mysql:mysql to root:root. MariaDB couldn’t read it.

$ ls -la /etc/mysql/encryption/
total 12
drwxr-xr-x 2 root root 4096 Nov 12 00:13 .
drwxr-xr-x 4 root root 4096 Nov 12 00:13 ..
-rw------- 1 root root  768 Aug  4 11:22 keyfile

Compare to a server where it works:

-rw------- 1 mysql mysql 768 Aug  4 11:22 keyfile

The fix was two commands:

sudo chown mysql:mysql /etc/mysql/encryption/keyfile
sudo systemctl restart mariadb

I did not run that yet. The temptation in a crisis is to act fast. Acting fast on a broken database is how you turn a recoverable incident into an unrecoverable one. Instead:

I took a file-level copy of /var/lib/mysql first: rsync -a to a sibling directory. About six minutes for 18 GB.
I made a copy of the encryption keyfile to a path the recovery VM could pull from. Critically, the keyfile is needed to read the encrypted data later. If you lose it, your encrypted tables are unreadable forever. Backups that don’t include the key are useless against this specific failure.
Only then did I run the chown.
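That ordering can be sketched as a single helper that must succeed before the fix is allowed to run. This is an illustration, not the exact commands from the night: preserve_before_fix and the /root/incident-copy destination are hypothetical, and the fallback to cp -a is my addition for hosts without rsync. A copy of a data directory taken while MariaDB is still running is not guaranteed to be consistent, so treat it as a safety net rather than a backup.

```shell
# preserve_before_fix DATADIR KEYFILE DEST
# Copies the data directory and the key file aside before anything is changed.
preserve_before_fix() {
    datadir=$1
    keyfile=$2
    dest=$3
    mkdir -p "$dest/datadir" || return 1
    if command -v rsync >/dev/null 2>&1; then
        rsync -a "$datadir"/ "$dest/datadir"/ || return 1
    else
        cp -a "$datadir"/. "$dest/datadir"/ || return 1
    fi
    # Keep the key copy private, then confirm it is byte-identical.
    cp -p "$keyfile" "$dest/keyfile" || return 1
    chmod 600 "$dest/keyfile"
    cmp -s "$keyfile" "$dest/keyfile"
}

# Example, run as root. The fix only runs if the copies succeeded:
# preserve_before_fix /var/lib/mysql /etc/mysql/encryption/keyfile /root/incident-copy \
#   && chown mysql:mysql /etc/mysql/encryption/keyfile \
#   && systemctl restart mariadb
```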

I restarted MariaDB. It came up. I checked the error log for new lines: clean. I ran mysql moodle --execute "SELECT COUNT(*) FROM mdl_question_attempt_step_data;" and got a number. The site loaded. Login worked. Course list rendered.
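Checking the error log for new lines is easier if you note its length before the restart. A small sketch, assuming the MariaDB log marks errors with [ERROR] as in the excerpt earlier; count_new_errors is a hypothetical helper.

```shell
# count_new_errors LOG LINES_BEFORE
# Prints how many [ERROR] lines were appended after the first LINES_BEFORE lines.
count_new_errors() {
    log=$1
    before=$2
    # grep -c exits 1 when the count is 0, so its exit status is ignored here.
    tail -n "+$((before + 1))" "$log" | grep -c '\[ERROR\]' || true
}

# Example around a restart:
# before=$(wc -l < /var/log/mysql/error.log)
# systemctl restart mariadb
# count_new_errors /var/log/mysql/error.log "$before"
```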

03:21 AM. We were back.

03:20–04:00 — Don’t trust the win

Coming back from an incident is the moment you most want to declare victory and go to bed. It’s also the moment you’re most likely to miss the second, smaller problem hiding under the first.

I asked a colleague at our end, the one who’d been pulled in as a second pair of eyes at 02:35, to run a different test in parallel: take the off-site backup from 03:00 UTC the previous day and restore it onto Path B’s recovery VM. Not to use it, but to have it ready, fully smoke-tested, in case anything else surfaced before Thursday.

This is the part most teams skip. They get the site back and stop. We treated the situation as still-fragile until the exam window had passed.

While that ran, I worked through a checklist on the live site:

Random sample of courses load: ✓
Quiz attempt opens and renders: ✓
Assignment submission upload works: ✓
Forum post reads: ✓
Email send (test message via /admin/testoutgoingmailconf.php): ✓
Cron is running, latest task < 1 minute ago: ✓
Backup verification: re-run, passed.
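Parts of a checklist like this can be automated so the same checks run against the live site and the recovery VM. Below is a minimal sketch of a check runner. run_checks is a hypothetical helper, the hostname moodle.example.org is a placeholder, and the two example checks are far shallower than the manual ones above: they only show the shape.

```shell
# run_checks reads "label|command" lines on stdin, runs each command,
# prints PASS or FAIL per label, and returns non-zero if anything failed.
run_checks() {
    failed=0
    while IFS='|' read -r label cmd; do
        [ -n "$label" ] || continue
        # stdin is redirected so a check cannot swallow the remaining lines
        if sh -c "$cmd" >/dev/null 2>&1 </dev/null; then
            echo "PASS  $label"
        else
            echo "FAIL  $label"
            failed=1
        fi
    done
    return "$failed"
}

# Example:
# run_checks <<'EOF'
# login page responds|curl -fsS -o /dev/null https://moodle.example.org/login/index.php
# database answers|mysql moodle -e 'SELECT 1'
# EOF
```

A script like this does not replace opening a quiz attempt by hand. It catches the obvious regressions quickly and leaves your attention for the checks that need a human.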

The recovery VM finished its restore at 03:48. Smoke tests on the recovery VM also passed. We had two working sites: the live one, plus a 23-hour-old backup ready to swap in if anything else broke.

I left both up. It cost the client about $30 to keep the recovery VM running until Thursday, a small price for the option value.

What actually caused it

The post-mortem is more useful than the war story.

Root cause: unattended-upgrades had run a mariadb-server-core update at 00:13. As part of the package update, a postinst script had run chown root:root on /etc/mysql/encryption/keyfile, almost certainly a bug or an interaction with a non-standard config the client’s previous admin had set up.

Contributing factors:

Unattended security upgrades were enabled, with no notification to the team.
The encryption keyfile lived outside /var/lib/mysql, in a path the team had forgotten about.
Nobody on the client’s team had MariaDB encryption documented anywhere.
The Moodle™ error page didn’t expose the database error, so the team couldn’t even tell where to start looking.

The fixes we put in place that week:

unattended-upgrades reconfigured to allow security updates but never restart services — so any package that wants a restart gets queued for a Tuesday-morning manual run with a human watching.
A pre-update snapshot of /etc/mysql is taken automatically before any MariaDB package update.
The encryption keyfile is now backed up to two separate off-site locations, with documented restoration steps.
A custom Moodle™ error page that includes a unique incident ID and a “what to do next” link — so the client’s team knows to contact us, not refresh the page for two hours.
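The pre-update snapshot is the easiest of these to reproduce. A minimal sketch, not our production script: snapshot_config and the /var/backups/mysql-config destination are hypothetical. One way to trigger it is apt's DPkg::Pre-Invoke hook, which I'm assuming here rather than describing our exact setup. That hook fires before every dpkg run, not only MariaDB updates.

```shell
# snapshot_config SRC_DIR DEST_DIR
# Writes a timestamped tar.gz of SRC_DIR into DEST_DIR and prints its path.
snapshot_config() {
    src=$1
    dest=$2
    stamp=$(date -u +%Y%m%dT%H%M%SZ)
    archive="$dest/$(basename "$src")-$stamp.tar.gz"
    mkdir -p "$dest" || return 1
    # -C keeps archive paths relative (mysql/...), which makes restores safer
    tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" || return 1
    # The archive contains the encryption key, so keep it private
    chmod 600 "$archive"
    echo "$archive"
}

# Example, run as root:
# snapshot_config /etc/mysql /var/backups/mysql-config
```

Listing the archive with tar -tvzf shows the owner and mode of each file at snapshot time, which is exactly the evidence that would have shortened this incident.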

What the client paid

Three things.

A flat emergency fee for the night — quoted in the WhatsApp thread before I logged in. The client agreed in writing.

A monthly maintenance retainer afterwards, which they hadn’t been paying before but signed up to within a week.

A small once-off fee to write the post-mortem and put the preventative measures in place.

We don’t bill hourly. We never have. The flat-fee model means we have no incentive to drag an incident out, and the client knows their downside before authorising us to start.

If you remember one thing

Two things, actually.

One: in a crisis, work two paths in parallel. The fast path and the safe path. The fast path saves the day when it works; the safe path saves your career when it doesn’t.

Two: the moment you get a site back is the moment you most want to declare victory. Don’t. Smoke-test, leave the recovery option warm, and write the post-mortem the next morning while it’s fresh. The next outage is being prepared right now by something nobody’s looking at.

If you’d like a similar level of “we keep your backup warm and rehearsed” without it being your job, that’s our backup & DR service. Or if your Moodle™ is on fire right now, email us and we’ll be on it within 30 minutes.