Designing a Server Monitoring and Alerting Service (Cheat Sheet)
Kevin Moreland, Software Engineer
Go ahead, try out products like Nagios, Zabbix, and Zenoss and see if they provide the right feature set for you. Here's a list of requirements for consideration, should you decide to roll your own:
- Implement as a command line tool first
- Add an argument capability to run in one-shot audit mode or as a service
- Open a secure channel to server/VPS via an encrypted tunnel (agent)
- Alert levels: nominal, warning, critical
- Alert properties: name, type, query, expected nominal state, query interval
- Track bad actor attempts and alert when they exceed thresholds
- Use parallel threads for monitoring items
- Automatic restart of failed applications and services
- Start/stop maintenance mode on public-facing servers for planned upgrades
- Detect DNS cache poisoning
- Test DNS response times
- Notify when web site registration expiration is approaching
- Detect web site registration changes
- Detect bad actors modifying DNS:
  - Verify only the expected nameserver answers are returned for the domain
  - Verify only the expected IP address answers are returned for NS records
  - Verify only the expected answers are present for A records
  - Verify only the expected answers are present for CAA records
- Use HSTS preload's API to check status of domains in Firefox and Chrome
- Perform load test and benchmarking of response times during off-hours
- Send an HTTP request once a minute to ensure the redirect server is working and returning the expected 301 response
- Ping once a minute to ensure primary secure server is running
- Perform a full non-stealth port scan (parallelize to reduce time), verifying that only expected ports are open
- Verify web site cert is signed by expected issuer only
- Verify web site cert issuer's CA (if any)
- Verify web site cert is valid only for expected domains
- Verify all certs are valid (not expired)
- Monitor network status and utilization
- Monitor processor load
- Monitor disk usage
- Monitor I/O usage
- Monitor memory usage
- Monitor temperature
- Monitor user list
- Monitor SSH auth stats
- Monitor running services and their status
- When running the tool as a service, provide a local secure server for web-gui dashboard monitoring and admin functions
- Notifications section (administer email and SMS alerts)
- Prevent duplicate notifications from going out for each alert type (don't let your tool DoS you)
- Monitor system logs (view, delete, rotate frequency, archive)
- Flag abnormal entries in system logs
- Ability to issue commands from a web-gui admin interface
- Provide a visual history/archives of items being monitored and trends
- Perform system integrity scans using checksums for packages
- Provide an interface for viewing web server logs, visitor stats, search terms, bandwidth, etc.
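The alert levels, alert properties, and duplicate-notification items above could be modeled roughly as follows. This is a minimal sketch, not a definitive design: the names `Level`, `Alert`, `Notifier`, `kind`, `failure_level`, and `cooldown_s` are all hypothetical, and the cooldown-based suppression is just one assumed way to keep the tool from flooding you.

```python
import time
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, Optional, Tuple


class Level(Enum):
    NOMINAL = "nominal"
    WARNING = "warning"
    CRITICAL = "critical"


@dataclass(frozen=True)
class Alert:
    name: str
    kind: str                      # the "type" property from the list
    query: Callable[[], object]    # returns the currently observed state
    expected: object               # expected nominal state
    interval_s: float              # query interval, in seconds
    failure_level: Level = Level.CRITICAL  # assumption: one level per alert

    def evaluate(self) -> Level:
        """Run the query and compare it with the expected nominal state."""
        if self.query() == self.expected:
            return Level.NOMINAL
        return self.failure_level


class Notifier:
    """Sends at most one notification per (alert, level) per cooldown window."""

    def __init__(self, send: Callable[[str], None],
                 cooldown_s: float = 900.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._send = send
        self._cooldown_s = cooldown_s
        self._clock = clock
        self._last_sent: Dict[Tuple[str, Level], float] = {}

    def notify(self, alert: Alert, level: Level) -> bool:
        """Return True if a notification was actually sent."""
        if level is Level.NOMINAL:
            # Recovered: forget history so the next failure notifies at once.
            for key in [k for k in self._last_sent if k[0] == alert.name]:
                del self._last_sent[key]
            return False
        key = (alert.name, level)
        now = self._clock()
        last: Optional[float] = self._last_sent.get(key)
        if last is not None and now - last < self._cooldown_s:
            return False  # duplicate inside the cooldown window: suppress
        self._last_sent[key] = now
        self._send(f"[{level.value}] {alert.name}")
        return True
```

In service mode, a scheduler would call `alert.evaluate()` every `interval_s` seconds and pass the result to `notifier.notify(...)`; the `send` callable would wrap whatever email or SMS gateway you choose.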
Bonus points for:
- Running your tool on two systems for redundancy
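Returning to the requirements list, the parallelized non-stealth port scan that verifies only expected ports are open might look something like the sketch below. It assumes a plain TCP connect scan using only the standard library; the function names (`is_open`, `scan`, `audit_ports`) and the default worker count and timeout are hypothetical choices, not recommendations.

```python
import socket
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, Set, Tuple


def is_open(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if a full TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, unreachable: treat all of these as "not open".
        return False


def scan(host: str, ports: Iterable[int],
         workers: int = 64, timeout: float = 0.5) -> Set[int]:
    """Probe the given ports in parallel and return the set that is open."""
    port_list = list(ports)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda p: is_open(host, p, timeout), port_list)
        return {p for p, ok in zip(port_list, results) if ok}


def audit_ports(host: str, ports: Iterable[int], expected: Iterable[int],
                **scan_options) -> Tuple[Set[int], Set[int]]:
    """Return (unexpected_open, expected_but_closed) for the host.

    Expected ports that were not included in `ports` are reported as
    closed, since they were never observed open.
    """
    open_ports = scan(host, ports, **scan_options)
    expected_set = set(expected)
    return open_ports - expected_set, expected_set - open_ports
```

A call such as `audit_ports("127.0.0.1", range(1, 1025), expected=[22, 443])` returns two sets; anything in the first set is a candidate for a critical alert. Only scan hosts you own or are authorized to test.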