Building a Rock‑Solid RAID Monitoring & Alerting System – Protect Against Drive Failure
At MegaHost, we believe infrastructure should be transparent, resilient, and efficient. Recently, we completed a full‑stack RAID monitoring setup that not only keeps our data safe but also delivers clear, actionable reports straight to our inboxes — every single day.
This guide walks you through what we did, why we did it, and how you can apply the same principles to your own hosting environment.
1. Building the Foundation: RAID with mdadm
We started by configuring RAID‑1 arrays using mdadm, a tried‑and‑true Linux tool for software RAID management.
Our setup:
- Three arrays:
  - /dev/md0 – System partition (4 GiB)
  - /dev/md1 – Boot/EFI partition (~1 GiB)
  - /dev/md2 – Data partition (3.49 TiB) with an internal bitmap for faster resyncs
- Mirrored NVMe drives for redundancy — if one fails, the other keeps the system running without downtime.
- Persistent superblocks so arrays auto‑assemble on reboot.
Why it matters: RAID‑1 keeps data available through a single‑drive failure. For hosting providers, that’s non‑negotiable.
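As a rough sketch, the three mirrors described above could be created along these lines. The partition names (/dev/nvme0n1p1 and so on) and the mapping of partitions to arrays are assumptions for illustration, not our actual layout. The helper only prints the commands unless DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
# Sketch: create three RAID-1 mirrors with mdadm.
# ASSUMPTION: the partition names below are hypothetical; adjust to your layout.
# By default the commands are only printed (DRY_RUN=1), never executed.

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

create_mirror() {
    # $1 = array device, $2 and $3 = member partitions, rest = extra mdadm options
    local md="$1" first="$2" second="$3"
    shift 3
    run mdadm --create "$md" --level=1 --raid-devices=2 "$@" "$first" "$second"
}

create_mirror /dev/md0 /dev/nvme0n1p1 /dev/nvme1n1p1
create_mirror /dev/md1 /dev/nvme0n1p2 /dev/nvme1n1p2
# The internal write-intent bitmap lets a resync copy only the changed regions.
create_mirror /dev/md2 /dev/nvme0n1p3 /dev/nvme1n1p3 --bitmap=internal
```

Once the arrays exist, `mdadm --detail --scan` prints ARRAY lines that can be added to your distribution's mdadm.conf so the arrays assemble under stable names at boot.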
2. Multi‑Recipient Email Forwarding for Alerts
A monitoring system is only as good as its ability to reach the right people.
We configured Postfix to forward health reports to multiple email addresses — ensuring that:
- The primary sysadmin gets the report.
- A backup contact receives it in case the primary is unavailable.
- Optionally, a client or stakeholder can be looped in for transparency.
Pro tip: Use /etc/aliases or Postfix’s virtual_alias_maps to manage multiple recipients without editing the script.
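To make the alias approach concrete, here is a minimal sketch that builds an /etc/aliases entry fanning one local name out to several recipients. The alias name raid-report and the example.com addresses are placeholders invented for this example.

```shell
#!/usr/bin/env bash
# Sketch: build an /etc/aliases entry that forwards one local name to
# several recipients.
# ASSUMPTION: "raid-report" and the addresses are hypothetical placeholders.

make_alias_line() {
    # $1 = alias name, remaining args = recipient addresses
    local name="$1"
    shift
    local IFS=','          # join the remaining arguments with commas
    printf '%s: %s\n' "$name" "$*"
}

make_alias_line raid-report admin@example.com backup@example.com

# To apply it (as root): append the printed line to /etc/aliases, then run
# `newaliases` so the alias database is rebuilt.
```

The report script can then mail to the single address raid-report, and the recipient list changes in one place.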
3. Automated Disk Health Reports via smartctl
We integrated smartctl to pull NVMe SMART data:
- Drive temperature
- Available spare percentage
- Wear level
- Media/data integrity errors
- Lifetime read/write totals
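One way to pull those fields out of `smartctl -A` output is a small label lookup, sketched below. It assumes the "Label:   value" layout that smartmontools 7.x prints for NVMe drives; the function name nvme_field is our own invention for this example.

```shell
#!/usr/bin/env bash
# Sketch: extract single fields from `smartctl -A` output for an NVMe drive.
# ASSUMPTION: the "Label:   value" layout of smartmontools 7.x.

nvme_field() {
    # Reads smartctl output on stdin; prints the value for the exact label in $1.
    # Exact matching keeps "Available Spare" apart from "Available Spare Threshold".
    awk -F': +' -v label="$1" '$1 == label { print $2; exit }'
}

# Usage (needs root and a real device, so it is left commented out):
#   smart=$(smartctl -A /dev/nvme0n1)
#   printf '%s\n' "$smart" | nvme_field "Temperature"
#   printf '%s\n' "$smart" | nvme_field "Available Spare"
#   printf '%s\n' "$smart" | nvme_field "Percentage Used"
#   printf '%s\n' "$smart" | nvme_field "Media and Data Integrity Errors"
#   printf '%s\n' "$smart" | nvme_field "Data Units Read"
```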
4. From GB to TB: Readability Upgrade
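Originally, SMART output showed lifetime reads/writes in gigabytes. NVMe drives report these counters in data units of 1,000 × 512 bytes (512,000 bytes each), so the conversion multiplies by 512000. Note that dividing by 1024^4, as below, yields binary terabytes (TiB); divide by 10^12 instead if you want decimal TB. We updated the script to calculate and display terabytes instead: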
TB_READ=$(awk -v u="$UNITS_READ" 'BEGIN {printf "%.2f", u * 512000 / (1024^4)}')
TB_WRITTEN=$(awk -v u="$UNITS_WRITTEN" 'BEGIN {printf "%.2f", u * 512000 / (1024^4)}')
Why it matters: For enterprise‑grade NVMe drives, GB numbers quickly become unwieldy. TB values are easier to read, compare, and trend over time.
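A slightly more defensive variant of the conversion is sketched below. It assumes the raw value comes straight from smartctl, which typically prints the counter with thousands separators and a bracketed size (for example "1,234,567 [632 GB]"); awk would read such a string as just its leading digits. This version divides by 10^12, so the result is decimal TB. The function name units_to_tb is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: convert an NVMe "Data Units" counter to decimal terabytes.
# One data unit = 1000 * 512 bytes = 512,000 bytes.

units_to_tb() {
    # $1 = raw counter, e.g. "1,234,567 [632 GB]" or "1234567"
    local units="${1%% *}"     # drop the bracketed size, if present
    units="${units//,/}"       # drop thousands separators
    awk -v u="$units" 'BEGIN { printf "%.2f", u * 512000 / 1e12 }'
}

# Usage:
#   TB_READ=$(units_to_tb "$UNITS_READ")
#   TB_WRITTEN=$(units_to_tb "$UNITS_WRITTEN")
```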
5. Cron‑Driven Daily Reports
We set up a cron job to run the report script every morning:
0 4 * * * /usr/local/sbin/raid-daily-report.sh
This ensures:
- Daily visibility into RAID status and disk health.
- Next‑day notice if a drive starts failing or a RAID array degrades (a daily job catches problems within 24 hours, not instantly).
- Historical tracking for wear and usage trends.
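To make a degraded array stand out in the daily mail, the script can scan /proc/mdstat for a member map containing an underscore (for example [U_] or [_U]). The sketch below is one way to do that; the function name degraded_arrays is hypothetical, and it only recognises arrays named mdN.

```shell
#!/usr/bin/env bash
# Sketch: list md arrays whose member map shows a missing device.

degraded_arrays() {
    # Reads /proc/mdstat-style text on stdin and prints one array name per line.
    awk '
        /^md[0-9]+ :/      { name = $1 }
        /\[[U_]*_[U_]*\]/  { if (name != "") { print name; name = "" } }
    '
}

# Usage:
#   bad=$(degraded_arrays < /proc/mdstat)
#   if [ -n "$bad" ]; then echo "DEGRADED: $bad"; fi
```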
6. Email‑Friendly Formatting
The script outputs a human‑readable report:
=== RAID & Disk Health Report ===
Generated: Mon Sep 8 08:01:41 AM CEST 2025
--- RAID Status ---
[...]
--- SMART / NVMe Health ---
Device: /dev/nvme0n1
Temperature: 35 Celsius
Total Bytes Read: 438.78 TB
Total Bytes Written: 84.67 TB
This makes it instantly scannable — no need to SSH in unless something’s wrong.
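A report in this shape can be assembled from a few small printing helpers, roughly as sketched here. The helper names are hypothetical, the disk values are passed in by hand (copied from the sample above) rather than read from a drive, and reading the RAID status from /proc/mdstat is an assumption about how that section is filled.

```shell
#!/usr/bin/env bash
# Sketch: print a plain-text report in the layout shown above.

report_header() {
    echo "=== RAID & Disk Health Report ==="
    echo "Generated: $(date)"
}

disk_section() {
    # $1 = device, $2 = temperature, $3 = TB read, $4 = TB written
    printf 'Device: %s\n' "$1"
    printf 'Temperature: %s\n' "$2"
    printf 'Total Bytes Read: %s TB\n' "$3"
    printf 'Total Bytes Written: %s TB\n' "$4"
}

{
    report_header
    echo "--- RAID Status ---"
    # ASSUMPTION: RAID status is taken from /proc/mdstat.
    cat /proc/mdstat 2>/dev/null || echo "(no /proc/mdstat on this host)"
    echo "--- SMART / NVMe Health ---"
    disk_section /dev/nvme0n1 "35 Celsius" 438.78 84.67
}
```

Because everything goes to standard output, the same script works under cron (which mails the output) or piped into a mail command of your choice.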
7. Why This Matters for Hosting Clients
By combining RAID redundancy, SMART monitoring, and automated reporting, MegaHost delivers:
- Proactive maintenance — issues are caught before they cause downtime.
- Transparency — clients can see the health of the infrastructure hosting their data.
- Peace of mind — knowing there’s a daily safety check running in the background.
8. Key Takeaways for Other Sysadmins
If you want to replicate this:
- Set up RAID with mdadm and persistent superblocks.
- Enable SMART monitoring for your drives.
- Write a reporting script that combines RAID and SMART data.
- Format for humans — clear headings, easy units (TB over GB).
- Automate with cron and send to multiple recipients.
Installing the Right Version of smartmontools
Most Linux distributions ship an older version of smartctl in their default repositories. For our monitoring setup, we needed smartmontools 7.4 because:
- It has full NVMe support for the attributes we parse (Data Units Read/Written, Available Spare, Percentage Used, etc.).
- Output format is consistent with our parsing logic.
- Older versions either omit these fields or display them differently.
We compiled v7.4 from source in /usr/local/src/smartmontools-7.4 and installed it system‑wide. This ensures our daily RAID & Disk Health Report script always gets complete, accurate NVMe data.
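Since the report depends on a minimum smartctl version, it can help to check that version before parsing. The sketch below shows one way; the function names are hypothetical, the comparison relies on GNU `sort -V`, and the build line in the comments is the usual autotools sequence rather than a record of our exact commands.

```shell
#!/usr/bin/env bash
# Sketch: confirm that smartctl is at least the version the report expects.

smartctl_version() {
    # Reads `smartctl --version` output on stdin and prints the version number.
    # The first line looks like: "smartctl 7.4 <date> <revision> [...] (local build)"
    awk 'NR == 1 && $1 == "smartctl" { print $2 }'
}

version_at_least() {
    # Succeeds if version $1 >= version $2 (GNU sort -V does the comparison).
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Typical autotools build, assuming the 7.4 source is already unpacked in
# /usr/local/src/smartmontools-7.4 (commented out; installing needs root):
#   cd /usr/local/src/smartmontools-7.4 && ./configure && make && make install
#
# Usage:
#   v=$(smartctl --version | smartctl_version)
#   version_at_least "$v" 7.4 || echo "smartctl $v is too old" >&2
```

A source build with the default prefix typically installs under /usr/local, and cron often runs with a minimal PATH, so calling smartctl by its absolute path in the report script avoids picking up the older packaged binary.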
Final Thoughts
This project reflects MegaHost’s philosophy: Infrastructure should be resilient, transparent, and client‑empowering. By investing in proactive monitoring and clear communication, we’re not just protecting data — we’re building trust.
💡 Want to learn more? We’ll be publishing more deep‑dive guides on:
- CDN caching best practices
- Ethical UX in hosting control panels
- Failover strategies for high‑availability setups