Building a Rock‑Solid RAID Monitoring & Alerting System – Protect Against Drive Failure

By admin
September 8, 2025

At MegaHost, we believe infrastructure should be transparent, resilient, and efficient. Recently, we completed a full‑stack RAID monitoring setup that not only keeps our data safe but also delivers clear, actionable reports straight to our inboxes — every single day.

This guide walks you through what we did, why we did it, and how you can apply the same principles to your own hosting environment.

1. Building the Foundation: RAID with `mdadm`

We started by configuring RAID‑1 arrays using mdadm — a tried‑and‑true Linux tool for software RAID management.

Our setup:

- Three arrays:
  - /dev/md0 – System partition (4 GiB)
  - /dev/md1 – Boot/EFI partition (~1 GiB)
  - /dev/md2 – Data partition (3.49 TiB) with an internal bitmap for faster resyncs
- Mirrored NVMe drives for redundancy: if one fails, the other keeps the system running without downtime.
- Persistent superblocks so arrays auto‑assemble on reboot.

Why it matters: RAID‑1 ensures data availability even during hardware failure. For hosting providers, that’s non‑negotiable.
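
As a rough illustration of the setup above, the sketch below prints the mdadm invocation for a two-member RAID-1 mirror with an internal bitmap. The helper name mirror_cmd and the partition names are assumptions for illustration, not our actual layout, and the command is only printed, never executed.

```shell
#!/usr/bin/env bash
# Sketch: print (do not run) the mdadm command for a two-disk RAID-1 mirror.
# mirror_cmd and the partition names are hypothetical placeholders.

mirror_cmd() {
  local array="$1" part_a="$2" part_b="$3"
  # --bitmap=internal keeps a write-intent bitmap so a resync only
  # copies regions that changed while a member was missing
  printf 'mdadm --create %s --level=1 --raid-devices=2 --bitmap=internal %s %s\n' \
    "$array" "$part_a" "$part_b"
}

mirror_cmd /dev/md2 /dev/nvme0n1p3 /dev/nvme1n1p3
```

Once arrays exist, `mdadm --detail --scan` prints ARRAY lines that can be saved to the distribution's mdadm.conf so the arrays assemble under stable names at boot.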

2. Multi‑Recipient Email Forwarding for Alerts

A monitoring system is only as good as its ability to reach the right people.

We configured Postfix to forward health reports to multiple email addresses — ensuring that:

The primary sysadmin gets the report.
A backup contact receives it in case the primary is unavailable.
Optionally, a client or stakeholder can be looped in for transparency.

Pro tip: Use /etc/aliases or Postfix’s virtual_alias_maps to manage multiple recipients without editing the script.
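
A minimal sketch of the alias approach: the helper below builds one /etc/aliases line that fans a single local name out to several inboxes. The alias name raid-report and the example.com addresses are placeholders, and alias_entry is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch: build an /etc/aliases entry that delivers one local name to several inboxes.
# The alias name and addresses are placeholders.

alias_entry() {
  local name="$1"; shift
  local IFS=,              # "$*" joins the remaining arguments with commas
  printf '%s: %s\n' "$name" "$*"
}

alias_entry raid-report sysadmin@example.com backup@example.com
# To apply (as root): append the printed line to /etc/aliases, then run `newaliases`.
```

With an entry like this in place, the report script can mail a single local address, and recipients change without touching the script.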

3. Automated Disk Health Reports via `smartctl`

We integrated smartctl to pull NVMe SMART data:

Drive temperature
Available spare percentage
Wear level
Media/data integrity errors
Lifetime read/write totals
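
The fields listed above can be extracted from `smartctl -A` output with a small awk filter. This is a sketch rather than our exact script: nvme_field is a hypothetical helper, and the sample text only mimics smartctl's NVMe layout with made-up values.

```shell
#!/usr/bin/env bash
# Sketch: pull one field out of `smartctl -A` output for an NVMe drive.
# nvme_field is a hypothetical helper; the sample values are invented.

nvme_field() {
  # Split each line on a colon followed by optional spaces and match the
  # label exactly, so "Available Spare" does not also match
  # "Available Spare Threshold".
  awk -F': *' -v key="$1" '$1 == key { print $2; exit }'
}

sample='Temperature:                        35 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    3%
Data Units Read:                    1,234,567 [632 GB]
Media and Data Integrity Errors:    0'

printf '%s\n' "$sample" | nvme_field "Available Spare"    # prints 100%
printf '%s\n' "$sample" | nvme_field "Percentage Used"    # prints 3%
```

On a live system the input would come from something like `smartctl -A /dev/nvme0n1 | nvme_field "Temperature"`.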

4. From GB to TB: Readability Upgrade

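Originally, the report showed lifetime reads and writes in gigabytes. NVMe drives count traffic in data units of 1,000 × 512 bytes (512,000 bytes each), so we updated the script to convert the raw unit counts to terabytes:

# UNITS_READ / UNITS_WRITTEN hold the raw Data Units counts, thousands separators removed
# 1 data unit = 512,000 bytes; divide by 1000^4 for TB (1024^4 would give TiB)
TB_READ=$(awk -v u="$UNITS_READ" 'BEGIN {printf "%.2f", u * 512000 / (1000^4)}')
TB_WRITTEN=$(awk -v u="$UNITS_WRITTEN" 'BEGIN {printf "%.2f", u * 512000 / (1000^4)}')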

Why it matters: For enterprise‑grade NVMe drives, GB numbers quickly become unwieldy. TB values are easier to read, compare, and trend over time.
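
The conversion is easy to check by hand. The sketch below assumes the raw count arrives the way smartctl prints it (with thousands separators) and shows how the decimal (TB) and binary (TiB) divisors differ. The helper names units_to_tb and units_to_tib are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: convert an NVMe "Data Units" count to terabytes.
# units_to_tb / units_to_tib are hypothetical helper names.

units_to_tb() {
  local units="${1//,/}"    # smartctl prints 1,234,567; awk needs 1234567
  # One data unit is 1000 blocks of 512 bytes = 512,000 bytes; 1 TB = 1000^4 bytes
  awk -v u="$units" 'BEGIN { printf "%.2f\n", u * 512000 / (1000^4) }'
}

units_to_tib() {
  local units="${1//,/}"
  # Same byte count, binary divisor: 1 TiB = 1024^4 bytes
  awk -v u="$units" 'BEGIN { printf "%.2f\n", u * 512000 / (1024^4) }'
}

units_to_tb  "2,000,000"    # prints 1.02
units_to_tib "2,000,000"    # prints 0.93
```

Two million data units are 1.024 × 10^12 bytes, so the same drive reads as 1.02 TB or 0.93 TiB depending on the divisor. The label in the report should match whichever one the script uses.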

5. Cron‑Driven Daily Reports

We set up a cron job to run the report script every morning:

0 4 * * * /usr/local/sbin/raid-daily-report.sh

This ensures:

Daily visibility into RAID status and disk health.
Next-morning notice if a drive starts failing or a RAID array degrades.
Historical tracking for wear and usage trends.
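
One way the daily script could flag a degraded array is to scan /proc/mdstat for an underscore in the member status field (for example `[U_]`). The sketch below uses invented sample text instead of the live file, and degraded_arrays is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch: list md arrays whose status field shows a missing member.
# degraded_arrays is a hypothetical helper; the sample below is invented.

degraded_arrays() {
  awk '
    /^md[0-9]+ :/         { array = $1 }     # remember which array we are in
    /\[[U_]*_[U_]*\]/     { print array }    # "_" marks a missing or failed member
  '
}

degraded_arrays <<'EOF'
Personalities : [raid1]
md2 : active raid1 nvme1n1p3[1] nvme0n1p3[0](F)
      3750606144 blocks super 1.2 [2/1] [_U]
      bitmap: 3/28 pages [12KB], 65536KB chunk

md0 : active raid1 nvme1n1p1[1] nvme0n1p1[0]
      4190208 blocks super 1.2 [2/2] [UU]

unused devices: <none>
EOF
# prints md2
```

In the real job the input would be `degraded_arrays < /proc/mdstat`, and a non-empty result could change the email subject so a degraded array stands out in the inbox.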

6. Email‑Friendly Formatting

The script outputs a human‑readable report:

=== RAID & Disk Health Report ===
Generated: Mon Sep 8 08:01:41 AM CEST 2025

--- RAID Status ---
[...]
--- SMART / NVMe Health ---
Device: /dev/nvme0n1
Temperature: 35 Celsius
Total Bytes Read: 438.78 TB
Total Bytes Written: 84.67 TB

This makes it instantly scannable — no need to SSH in unless something’s wrong.
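
A layout like this takes only a few printf calls to produce. The following is a sketch, not the actual raid-daily-report.sh: build_report is a hypothetical helper, and the raid-report recipient in the trailing comment is a placeholder alias.

```shell
#!/usr/bin/env bash
# Sketch: assemble the plain-text report from already-collected sections.
# build_report is a hypothetical helper, not the actual raid-daily-report.sh.

build_report() {
  local raid_status="$1" smart_summary="$2"
  # Headings are passed as %s arguments so a leading "-" is never
  # mistaken for a printf option
  printf '%s\n' "=== RAID & Disk Health Report ==="
  printf 'Generated: %s\n\n' "$(date)"
  printf '%s\n' "--- RAID Status ---" "$raid_status"
  printf '%s\n' "--- SMART / NVMe Health ---" "$smart_summary"
}

build_report "md0 : active raid1 [2/2] [UU]" "Temperature: 35 Celsius"

# In a real script the arguments would come from the system, for example:
#   build_report "$(cat /proc/mdstat)" "$(smartctl -A /dev/nvme0n1)" \
#     | mail -s "RAID & Disk Health Report" raid-report
```

Keeping the formatting in one function means the same text can go to email, a log file, or the terminal without changes.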

7. Why This Matters for Hosting Clients

By combining RAID redundancy, SMART monitoring, and automated reporting, MegaHost delivers:

Proactive maintenance — issues are caught before they cause downtime.
Transparency — clients can see the health of the infrastructure hosting their data.
Peace of mind — knowing there’s a daily safety check running in the background.

8. Key Takeaways for Other Sysadmins

If you want to replicate this:

Set up RAID with mdadm and persistent superblocks.
Enable SMART monitoring for your drives.
Write a reporting script that combines RAID and SMART data.
Format for humans — clear headings, easy units (TB over GB).
Automate with cron and send to multiple recipients.

Installing the Right Version of smartmontools

Many Linux distributions ship an older smartmontools release in their default repositories. For our monitoring setup, we needed smartmontools 7.4 because:

It has full NVMe support for the attributes we parse (Data Units Read/Written, Available Spare, Percentage Used, etc.).
Output format is consistent with our parsing logic.
Older versions either omit these fields or display them differently.

We compiled v7.4 from source in /usr/local/src/smartmontools-7.4 and installed it system‑wide. This ensures our daily RAID & Disk Health Report script always gets complete, accurate NVMe data.
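
Because the parser depends on the 7.4 output format, the report script can guard against an older smartctl being picked up from PATH. The sketch below assumes GNU coreutils (for `sort -V`) and parses a sample first line of `smartctl --version`. Both helper names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: check that smartctl is at least the version the parser expects.
# Helper names are hypothetical; sort -V requires GNU coreutils.

smartctl_version() {
  # First line of `smartctl --version` looks like:
  #   smartctl 7.4 2023-08-01 r5530 [x86_64-linux] (local build)
  awk 'NR == 1 && $1 == "smartctl" { print $2 }'
}

version_at_least() {
  # True when $1 >= $2: the smaller of the two under version sort must be $2
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

found=$(printf '%s\n' "smartctl 7.4 2023-08-01 r5530 [x86_64-linux] (local build)" | smartctl_version)
if version_at_least "$found" 7.4; then
  echo "smartctl $found is new enough"
else
  echo "smartctl $found is too old" >&2
fi
```

On a live host the input would be `smartctl --version | smartctl_version`. The source tarball follows the usual ./configure, make, make install sequence, which by default installs under /usr/local, so it is worth confirming that this copy comes first on cron's PATH.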

Final Thoughts

This project reflects MegaHost’s philosophy: Infrastructure should be resilient, transparent, and client‑empowering. By investing in proactive monitoring and clear communication, we’re not just protecting data — we’re building trust.

💡 Want to learn more? We’ll be publishing more deep‑dive guides on:

CDN caching best practices
Ethical UX in hosting control panels
Failover strategies for high‑availability setups