# 2021-11-18 Low effort website monitoring

While the laziest website monitoring implementation would probably be to check the site from `cron` with `curl` every few minutes and send a message on outage, I'm going to present a slightly more sophisticated solution to make it a little more usable. Note that the idea below probably isn't the best for the long term, but often any solution is better than no solution at all.

## Requirements

By operator I mean the person responsible for service reliability. Ideally that would be someone involved in website development or the hosting maintainer.

> **Protip**: Notifications are the most effective when the receiver knows how to react to them.

- Operator must be notified on outage.
- Monitoring must not rely on website hosting availability. I'm talking about single node deployments here. The operator must be informed not only when a check failed, but also when a check wasn't performed.
- Monitoring must provide the reason for the notification. Messages like *It's not working* aren't very insightful.
- Implementation must be simple, lightweight and cheap. Preferably open source and allowing self-hosting.

## Incremental development

Let's assume that we're dealing with a website hosted on a cheap GNU/Linux VPS instance, consisting of an application exposed via HTTPS. It makes sense to check liveness at minimum. For now we don't care whether our application performs well. Availability is our only concern.

### Basic HTTP service check with notification

To keep things simple and increase reliability we will keep the service checking and reporting/notifying logic separate.

- *Service checking* - Since a shell is usually available out-of-the-box, we will build service checks using widely available command line tools.
- *Reporting/notifying* - This is a bit harder. Keep in mind that we also need notifications when service checking hasn't finished on time, so hosting anything on the same instance as the application will not be sufficient. Using an external service might introduce vendor lock-in, but we want it cheap. Fortunately there's some middle ground like [Healthchecks](https://healthchecks.io) - an open source monitoring platform, also hosted as a service with a free subscription tier.

Using *Healthchecks* means that we will need to notify an external service about service check results. `curl` will do the job here. Let's start with a minimal script:

```
#!/usr/bin/env bash

CURL_OPTIONS="--max-time 5 --retry 3 --fail --silent --show-error"
SERVICE_URL=https://app.local/
HEALTHCHECK_URL=https://hc-ping.com/82583301-bd04-4028-9695-fc5473abc15d

# Check the service, then append curl's exit status to the ping URL:
# /0 reports success to Healthchecks, anything else reports failure.
curl $CURL_OPTIONS "$SERVICE_URL"
curl $CURL_OPTIONS "${HEALTHCHECK_URL}/${?}"
```

The above script will send `curl`'s exit status in a ping to *Healthchecks*, where we should configure notifications (Telegram integration works well) and a **Schedule** - it's crucial to receive an alert when our service is not sending any pings, as it may mean that the whole VPS is down. Yet we still don't have any context on what exactly went wrong.
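Before we add more context to the alerts, one optional aside: *Healthchecks* also supports a `/start` ping endpoint, so it can alert not only when pings stop arriving entirely, but also when a run starts and then hangs past the configured grace time. A minimal sketch, reusing the example URLs from the script above:

```
#!/usr/bin/env bash

CURL_OPTIONS="--max-time 5 --retry 3 --fail --silent --show-error"
SERVICE_URL=https://app.local/
HEALTHCHECK_URL=https://hc-ping.com/82583301-bd04-4028-9695-fc5473abc15d

# Signal that a run has begun; if no result ping follows within the
# grace time configured in Healthchecks, an alert fires.
curl $CURL_OPTIONS "${HEALTHCHECK_URL}/start" >/dev/null

# The service check has to stay directly above the result ping,
# so that ${?} still holds its exit status.
curl $CURL_OPTIONS "$SERVICE_URL"
curl $CURL_OPTIONS "${HEALTHCHECK_URL}/${?}"
```

As a bonus, pairing `/start` with the result ping makes *Healthchecks* record how long each run took.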
Let's use [Bash strict mode](http://redsymbol.net/articles/unofficial-bash-strict-mode/) and change the script to send `curl`'s error text in the ping:

```
#!/usr/bin/env bash
set -euo pipefail

CURL_OPTIONS="--max-time 5 --retry 3 --fail --silent --show-error"
SERVICE_URL=https://app.local/
HEALTHCHECK_URL=https://hc-ping.com/82583301-bd04-4028-9695-fc5473abc15d
RESULT=""

on_exit() {
    # ${?} still holds the exit status that triggered the trap,
    # because curl is the first command in this function.
    curl $CURL_OPTIONS --data-raw "$RESULT" "${HEALTHCHECK_URL}/${?}" >/dev/null
}

main() {
    trap on_exit EXIT
    # Capture stderr (the error text) while discarding the response body.
    RESULT=$(curl $CURL_OPTIONS "$SERVICE_URL" 2>&1 1>/dev/null)
}

main "$@"
```

Now the operator will be notified about possible outages, and the alert will include some context on what exactly went wrong, e.g.:

```
curl: (7) Failed to connect to app.local port 443: Connection refused
```

### Service is available, but does it perform well?

While performance can be measured in many ways, we will focus on one simple metric: *response time*. `curl` can measure it out-of-the-box, but we will need to invoke it differently and format the output message:

```
# on_exit version supporting a custom EXIT_CODE
on_exit() {
    local exit_code=$?
    # ${EXIT_CODE:-} keeps strict mode's `set -u` happy when unset.
    if [ -n "${EXIT_CODE:-}" ]; then
        exit_code=$EXIT_CODE
    fi
    curl $CURL_OPTIONS --data-raw "$RESULT" "${HEALTHCHECK_URL}/${exit_code}" >/dev/null
    exit "$exit_code"
}

# Alert if curl failed or the response takes longer than 2 seconds
CHECK_HTTP_TIMEOUT=2
main() {
    trap on_exit EXIT
    RESULT="Response time: $(curl $CURL_OPTIONS "$SERVICE_URL" -o /dev/null --write-out '%{time_total}' 2>&1)"
    local time_total
    # Extract the last space-separated field: the measured time.
    time_total=$(echo "$RESULT" | rev | cut -d ' ' -f 1 | rev)
    if [ "$(echo "${time_total} > ${CHECK_HTTP_TIMEOUT}" | bc)" -eq 1 ]; then
        RESULT="${RESULT}"$'\n'"Too high response time!"
        EXIT_CODE=1
    fi
}
```

### What about other metrics?

*Healthchecks* limits us to a 10 kilobyte payload per ping, but remember that we want to keep it simple, so that's not much of a problem. Rearranging our script into self-descriptive functions will be useful for including more checks:

```
# Trigger an alert when used memory exceeds 95%
CHECK_MEMORY_THRESHOLD=95
check_memory() {
    RESULT="${RESULT}"$'\n'"Memory usage: $(exec 2>&1; free | grep Mem | awk '{print $3/$2 * 100.0}')"
    local used
    used=$(echo "$RESULT" | tail -n 1 | rev | cut -d ' ' -f 1 | rev)
    if [ "$(echo "${used} > ${CHECK_MEMORY_THRESHOLD}" | bc)" -eq 1 ]; then
        RESULT="${RESULT}"$'\n'"Too high memory usage!"
        EXIT_CODE=1
    fi
}

main() {
    trap on_exit EXIT
    check_http
    check_memory
}
```

Observing the SSL certificate expiration date can be enclosed in a single Bash function as well:

```
CHECK_SSL_DAYS=90
check_ssl() {
    local host
    host=$(echo "$SERVICE_URL" | cut -d '/' -f 3)
    RESULT="${RESULT}"$'\n'"SSL certificate expiration date: $(
        exec 2>&1
        echo | openssl s_client -servername "$host" -connect "${host}:443" 2>/dev/null \
            | openssl x509 -noout -enddate | cut -d '=' -f 2
    )"
    local end_seconds
    end_seconds=$((CHECK_SSL_DAYS * 24 * 60 * 60))
    # -checkend fails when the certificate expires within end_seconds
    if ! echo | openssl s_client -servername "$host" -connect "${host}:443" 2>/dev/null \
            | openssl x509 -noout -checkend "$end_seconds" >/dev/null; then
        RESULT="${RESULT}"$'\n'"Certificate will expire soon!"
        EXIT_CODE=1
    fi
}
```

## Conclusion

It took us ~100 lines of Bash to check:

- Service availability via HTTP
- Service response time
- Service SSL certificate expiration date
- Instance memory usage

We can run this script every minute from cron:

```
* * * * * /opt/low-effort-monitoring.sh
```

Of course it's not the most beautiful and performance-oriented way of monitoring a website, but it serves the purpose.
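For reference, here is a sketch of how the fragments above might be assembled into one complete script. It reuses the example URLs, thresholds and placeholder Healthchecks UUID from this post; `check_http` is slightly adapted here to append to `RESULT` the same way the other checks do:

```
#!/usr/bin/env bash
set -euo pipefail

CURL_OPTIONS="--max-time 5 --retry 3 --fail --silent --show-error"
SERVICE_URL=https://app.local/
HEALTHCHECK_URL=https://hc-ping.com/82583301-bd04-4028-9695-fc5473abc15d
CHECK_HTTP_TIMEOUT=2        # seconds
CHECK_MEMORY_THRESHOLD=95   # percent
CHECK_SSL_DAYS=90           # alert this many days before expiration

RESULT=""
EXIT_CODE=""

on_exit() {
    local exit_code=$?
    if [ -n "$EXIT_CODE" ]; then
        exit_code=$EXIT_CODE
    fi
    # Report collected context and the final status to Healthchecks.
    curl $CURL_OPTIONS --data-raw "$RESULT" "${HEALTHCHECK_URL}/${exit_code}" >/dev/null
    exit "$exit_code"
}

check_http() {
    RESULT="${RESULT}"$'\n'"Response time: $(curl $CURL_OPTIONS "$SERVICE_URL" -o /dev/null --write-out '%{time_total}' 2>&1)"
    local time_total
    time_total=$(echo "$RESULT" | tail -n 1 | rev | cut -d ' ' -f 1 | rev)
    if [ "$(echo "${time_total} > ${CHECK_HTTP_TIMEOUT}" | bc)" -eq 1 ]; then
        RESULT="${RESULT}"$'\n'"Too high response time!"
        EXIT_CODE=1
    fi
}

check_memory() {
    RESULT="${RESULT}"$'\n'"Memory usage: $(exec 2>&1; free | grep Mem | awk '{print $3/$2 * 100.0}')"
    local used
    used=$(echo "$RESULT" | tail -n 1 | rev | cut -d ' ' -f 1 | rev)
    if [ "$(echo "${used} > ${CHECK_MEMORY_THRESHOLD}" | bc)" -eq 1 ]; then
        RESULT="${RESULT}"$'\n'"Too high memory usage!"
        EXIT_CODE=1
    fi
}

check_ssl() {
    local host
    host=$(echo "$SERVICE_URL" | cut -d '/' -f 3)
    RESULT="${RESULT}"$'\n'"SSL certificate expiration date: $(
        exec 2>&1
        echo | openssl s_client -servername "$host" -connect "${host}:443" 2>/dev/null \
            | openssl x509 -noout -enddate | cut -d '=' -f 2
    )"
    if ! echo | openssl s_client -servername "$host" -connect "${host}:443" 2>/dev/null \
            | openssl x509 -noout -checkend "$((CHECK_SSL_DAYS * 24 * 60 * 60))" >/dev/null; then
        RESULT="${RESULT}"$'\n'"Certificate will expire soon!"
        EXIT_CODE=1
    fi
}

main() {
    trap on_exit EXIT
    check_http
    check_memory
    check_ssl
}

main "$@"
```

Saved as `/opt/low-effort-monitoring.sh` and made executable, it matches the crontab line above.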
No special software or planning is required here, so it may be a great ad-hoc solution while you plan a long-term monitoring system like Icinga2, Prometheus or Zabbix.