This documentation is part of the "Projects with Books" initiative at zenOSmosis.

The source code for this project is available on GitHub.

Monitoring and Health Checks

Purpose and Scope

This document explains monitoring and health check strategies for the Docker MQTT Mosquitto with Cloudflare Tunnel system. It covers container status verification, automated health checks in the CI pipeline, logging approaches, and techniques for implementing custom monitoring solutions. For production deployment considerations, see Production Deployment Considerations. For troubleshooting specific issues, see Troubleshooting.


Container Status Monitoring

Docker Inspect Command Pattern

The system uses Docker’s native inspection capabilities to verify container health. The primary method queries container state using docker inspect with formatted output.

Container State Check Pattern:

This command returns the current container state, which can be one of: created, restarting, running, removing, paused, exited, or dead.
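A minimal sketch of the state check, wrapped in a function so it can be reused by other scripts. The function name `container_status` and the container name `mosquitto` in the usage line are illustrative assumptions; the actual container name depends on how Compose names it.

```shell
# Print the state of one container: created, restarting, running,
# removing, paused, exited, or dead.
#   $1  container name or ID (supplied by the caller)
container_status() {
  docker inspect --format '{{.State.Status}}' "$1"
}

# Usage (container name is an assumption):
#   container_status mosquitto
```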

Container Status Fields

Docker provides multiple state fields that can be inspected for comprehensive health monitoring:

| Field | Path | Description |
| --- | --- | --- |
| Status | .State.Status | Current container state |
| Running | .State.Running | Boolean indicating if container is running |
| ExitCode | .State.ExitCode | Exit code if container has stopped |
| StartedAt | .State.StartedAt | Timestamp when container started |
| Error | .State.Error | Error message if container failed |
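The fields listed above can be read in a single `docker inspect` call by putting several template expressions in one format string. The sketch below is one possible layout; the function name `container_state_report` is hypothetical.

```shell
# Print the five State fields, one "Name=value" pair per line.
#   $1  container name or ID
container_state_report() {
  docker inspect --format \
'Status={{.State.Status}}
Running={{.State.Running}}
ExitCode={{.State.ExitCode}}
StartedAt={{.State.StartedAt}}
Error={{.State.Error}}' "$1"
}
```

Reading all fields in one call keeps the values consistent with each other, since they come from the same snapshot of the container state.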

Sources: .github/workflows/ci.yml:30


CI Pipeline Health Verification

Automated Health Check Implementation

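The GitHub Actions CI pipeline uses a bounded retry loop to verify that the mosquitto container reaches the running state after startup. The check reads Docker's container status only; it does not probe the broker itself. Bounding the loop accounts for non-deterministic container initialization times without letting a failed start hang the build.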

flowchart TD
    Start["Start Docker Compose\ndocker-compose up -d mosquitto"]
    Init["Initialize retry counter\ni=1, max=10"]
    Inspect["Execute docker inspect\n--format='{{.State.Status}}'"]
    Check{"Status == 'running'?"}
    Success["Log: Mosquitto is running\nExit 0"]
    Wait["Log: Waiting for Mosquitto...\nSleep 10 seconds"]
    Increment["Increment counter\ni++"]
    CheckMax{"i > 10?"}
    Timeout["Log: Did not become healthy\nExit 1"]
    Start --> Init
    Init --> Inspect
    Inspect --> Check
    Check -->|Yes| Success
    Check -->|No| Wait
    Wait --> Increment
    Increment --> CheckMax
    CheckMax -->|No| Inspect
    CheckMax -->|Yes| Timeout

Health Check Flow Diagram

Sources: .github/workflows/ci.yml:27-39

Retry Loop Parameters

The health check uses the following parameters:

| Parameter | Value | Rationale |
| --- | --- | --- |
| Max Attempts | 10 | Provides 100 seconds total wait time |
| Sleep Interval | 10 seconds | Balances responsiveness and system load |
| Total Timeout | 100 seconds | Sufficient for cold starts and image pulls |
| Check Method | docker inspect | Native Docker API, no external dependencies |
| Success Criteria | Status contains running | Container process is active |

The implementation uses a shell loop structure:
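The sketch below is reconstructed from the flow diagram and parameter table, not copied from the workflow file, so details may differ from `.github/workflows/ci.yml`. The function name `wait_for_running` and its arguments are assumptions made for illustration.

```shell
# Poll a container until its status contains "running".
#   $1  container name or ID
#   $2  maximum attempts (default 10)
#   $3  seconds to sleep between attempts (default 10)
wait_for_running() {
  name=$1; max=${2:-10}; delay=${3:-10}
  i=1
  while [ "$i" -le "$max" ]; do
    # A failed inspect (container not created yet) counts as "not running".
    status=$(docker inspect --format '{{.State.Status}}' "$name" 2>/dev/null) || status=unknown
    case "$status" in
      *running*) echo "Mosquitto is running"; return 0 ;;
    esac
    echo "Waiting for Mosquitto... ($i/$max)"
    sleep "$delay"
    i=$((i + 1))
  done
  echo "Mosquitto did not become healthy" >&2
  return 1
}
```

With the defaults, the loop sleeps after each of the 10 failed checks, which gives the 100 second upper bound from the table.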

Sources: .github/workflows/ci.yml:28-39


Docker Native Healthcheck Configuration

Current Implementation

The system currently relies on Docker Compose’s restart: unless-stopped policy for automatic recovery but does not define explicit healthcheck directives in the compose configuration.

Current Restart Policy:
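One way to confirm which policy Docker actually applied to a container is to read it back from the container's host configuration. A minimal sketch, assuming the caller knows the container name or ID; the function name `restart_policy` is hypothetical.

```shell
# Print the restart policy name of a container, for example "unless-stopped".
#   $1  container name or ID
restart_policy() {
  docker inspect --format '{{.HostConfig.RestartPolicy.Name}}' "$1"
}
```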

Adding Native Docker Healthchecks

Docker Compose supports native healthcheck definitions that provide more sophisticated monitoring than simple state checks. Below are example configurations for both services.

Mosquitto Healthcheck Example

This healthcheck:

  • Subscribes to the system topic $SYS/broker/uptime
  • Waits up to 3 seconds for a message
  • Receives 1 message (-C 1) to confirm broker is operational
  • Runs every 30 seconds
  • Allows 10 seconds for the test to complete
  • Requires 3 consecutive failures before marking unhealthy
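The settings listed above could be written as a Compose override file, sketched here as a shell heredoc so that a script can generate and apply it. Several things are assumptions: the file name `docker-compose.healthcheck.yml` is hypothetical, `mosquitto_sub` must be present in the broker image, and the broker must accept unauthenticated connections from inside the container on its default port.

```shell
# Write a Compose override that adds a healthcheck to the mosquitto service.
# "$$" is Compose's escape for a literal "$" in the topic name.
# If the base file declares a "version:" key, the override may need the same one.
cat > docker-compose.healthcheck.yml <<'EOF'
services:
  mosquitto:
    healthcheck:
      test: ["CMD", "mosquitto_sub", "-t", "$$SYS/broker/uptime", "-C", "1", "-W", "3"]
      interval: 30s
      timeout: 10s
      retries: 3
EOF

# Apply it together with the base file:
#   docker-compose -f docker-compose.yml -f docker-compose.healthcheck.yml up -d
```

Once a healthcheck is defined, `docker inspect --format '{{.State.Health.Status}}'` reports `starting`, `healthy`, or `unhealthy` for the container.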

Cloudflared Healthcheck Example

This healthcheck:

  • Executes cloudflared tunnel info to verify tunnel connectivity
  • Runs every 60 seconds (tunnel state changes slowly)
  • Allows 5 retries (tunnel reconnections may take time)
  • Provides 30 seconds startup grace period
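A possible override for the tunnel service is sketched below under several assumptions. The service name `cloudflared`, the file name, and the tunnel name `my-tunnel` are hypothetical placeholders. `cloudflared tunnel info` needs a tunnel name or ID and may need credentials available inside the container, so a token-only deployment might not be able to run it; treat this as a starting point to verify rather than a working configuration.

```shell
# Write a Compose override that adds a healthcheck to the tunnel service.
# The exec form is used because the test runs the cloudflared binary directly.
cat > docker-compose.cloudflared-health.yml <<'EOF'
services:
  cloudflared:
    healthcheck:
      test: ["CMD", "cloudflared", "tunnel", "info", "my-tunnel"]
      interval: 60s
      retries: 5
      start_period: 30s
EOF
```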

Sources: docker-compose.yml:4-17


System Monitoring Architecture

Container Monitoring Topology

Sources: docker-compose.yml:1-18 .github/workflows/ci.yml:27-39


Cloudflared Tunnel Monitoring

Tunnel Status Verification

The cloudflared container maintains an outbound connection to Cloudflare’s network. Monitoring tunnel health requires checking both container state and tunnel connectivity.

Log-Based Monitoring

The cloudflared process outputs connection status to stdout:
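A minimal sketch of a log-based check, assuming the container name is known to the caller. It searches the captured output for the `Registered tunnel connection` message; the function name `tunnel_registered` is hypothetical.

```shell
# Succeed if the container's logs contain a tunnel registration message.
#   $1  cloudflared container name or ID
tunnel_registered() {
  # docker logs replays the container's stderr on stderr, so merge both streams.
  docker logs "$1" 2>&1 | grep -q 'Registered tunnel connection'
}

# Usage (container name is an assumption):
#   tunnel_registered cloudflared && echo "tunnel is up"
```

Note that this only shows the tunnel registered at some point since the container started; it does not prove the connection is still alive.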

Tunnel Status Commands

Common Cloudflared Log Patterns

| Log Pattern | Meaning | Action Required |
| --- | --- | --- |
| Registered tunnel connection | Tunnel successfully connected | Normal operation |
| unable to register connection | Authentication or network failure | Verify token and connectivity |
| Retrying in | Temporary connection loss | Monitor for recovery |
| SIGTERM received | Graceful shutdown initiated | Expected during restarts |
| certificate error | TLS/SSL verification failed | Check system time and CA certificates |
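The patterns above can be turned into a small classifier for use in scripts or alerts. This is a sketch: the function name `classify_cloudflared_line` and the severity prefixes are assumptions, and exact log wording may vary between cloudflared versions.

```shell
# Map one cloudflared log line to a severity and suggested action.
#   $1  a single log line
classify_cloudflared_line() {
  case "$1" in
    *"Registered tunnel connection"*)  echo "ok: tunnel connected" ;;
    *"unable to register connection"*) echo "error: verify token and connectivity" ;;
    *"Retrying in"*)                   echo "warn: monitor for recovery" ;;
    *"SIGTERM received"*)              echo "info: graceful shutdown" ;;
    *"certificate error"*)             echo "error: check system time and CA certificates" ;;
    *)                                 echo "unclassified" ;;
  esac
}

# Usage (container name is an assumption):
#   docker logs cloudflared 2>&1 | while IFS= read -r line; do
#     classify_cloudflared_line "$line"
#   done
```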

Sources: docker-compose.yml:11-17


Mosquitto Broker Monitoring

MQTT System Topics

The Mosquitto broker publishes operational metrics to reserved $SYS topics. These provide real-time broker statistics without external dependencies.

Key System Topics

| Topic | Description | Type |
| --- | --- | --- |
| $SYS/broker/version | Broker version string | Static |
| $SYS/broker/uptime | Seconds since broker start | Counter |
| $SYS/broker/clients/connected | Current connected client count | Gauge |
| $SYS/broker/clients/total | Active and inactive clients currently registered on the broker | Gauge |
| $SYS/broker/messages/received | Total messages received | Counter |
| $SYS/broker/messages/sent | Total messages sent | Counter |
| $SYS/broker/bytes/received | Total bytes received | Counter |
| $SYS/broker/bytes/sent | Total bytes sent | Counter |

Monitoring via MQTT Client
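A sketch of reading one `$SYS` value with `mosquitto_sub`, assuming the client is installed where the script runs and the broker is reachable at the given host and port (for example through a published port, or by running the command inside the container with `docker exec`). The function names are hypothetical.

```shell
# Read one message from a $SYS topic and print its payload.
#   $1  topic, for example '$SYS/broker/uptime' (single quotes stop shell expansion)
#   $2  broker host (default localhost)
#   $3  broker port (default 1883)
sys_topic_value() {
  mosquitto_sub -h "${2:-localhost}" -p "${3:-1883}" -t "$1" -C 1 -W 3
}

# Print broker uptime as a bare number.
# The payload may carry a unit suffix such as "seconds"; awk keeps the first field.
broker_uptime_seconds() {
  sys_topic_value '$SYS/broker/uptime' "$@" | awk '{print $1}'
}
```

The `-C 1` flag exits after one message and `-W 3` gives up after three seconds, so the call cannot block a monitoring script indefinitely.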

Log-Based Monitoring

Sources: docker-compose.yml:4-9


Health Check Decision Tree

Sources: .github/workflows/ci.yml:27-42 docker-compose.yml:1-18


Custom Health Check Strategies

Script-Based Monitoring

Implement a shell script for comprehensive health verification:
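A minimal sketch of such a script is shown below. It only checks container state; broker and tunnel probes could be layered on top. The function name `check_stack` and the container names in the usage line are assumptions.

```shell
# Return 0 only if every named container reports "running".
#   $@  container names or IDs
check_stack() {
  failed=0
  for name in "$@"; do
    # A container that does not exist is reported as "missing".
    status=$(docker inspect --format '{{.State.Status}}' "$name" 2>/dev/null) || status=missing
    if [ "$status" = "running" ]; then
      echo "OK   $name"
    else
      echo "FAIL $name ($status)"
      failed=1
    fi
  done
  return "$failed"
}

# Usage (container names are assumptions):
#   check_stack mosquitto cloudflared || echo "stack degraded"
```

Returning a non-zero status on any failure lets the script plug into cron jobs or CI steps that already treat exit codes as pass or fail.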

External Monitoring Integration

Prometheus Exporter Pattern

For production deployments, integrate with monitoring systems using exporters:

  1. Docker Container Exporter: Exposes container metrics
  2. MQTT Exporter: Subscribes to $SYS topics and exposes them as Prometheus metrics
  3. Cloudflare Tunnel Exporter: Monitors tunnel status via Cloudflare API

Example Prometheus Configuration
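A sketch of a scrape configuration with one job per exporter, written through a shell heredoc. Every host name and port below is a hypothetical placeholder, as is the file name `prometheus.example.yml`; substitute the addresses of whichever exporters are actually deployed.

```shell
# Write an example Prometheus scrape configuration.
# All targets are placeholders, not real exporter addresses.
cat > prometheus.example.yml <<'EOF'
global:
  scrape_interval: 30s
scrape_configs:
  - job_name: docker-containers
    static_configs:
      - targets: ["container-exporter.example:9100"]  # hypothetical host:port
  - job_name: mosquitto
    static_configs:
      - targets: ["mqtt-exporter.example:9100"]       # hypothetical host:port
  - job_name: cloudflare-tunnel
    static_configs:
      - targets: ["tunnel-exporter.example:9100"]     # hypothetical host:port
EOF

# Validate before use, if promtool is installed:
#   promtool check config prometheus.example.yml
```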

Sources: .github/workflows/ci.yml:27-39


Logging and Log Analysis

Container Log Access

Both containers output logs to stdout/stderr, which Docker captures:
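A short sketch of reading those captured logs with `docker logs`. The function names and the container names in the usage lines are assumptions.

```shell
# Print the last N log lines (default 50) of a container, with timestamps.
#   $1  container name or ID
#   $2  number of lines (optional)
recent_logs() {
  docker logs --timestamps --tail "${2:-50}" "$1" 2>&1
}

# Print only the log lines from a recent period such as "10m" or "1h".
#   $1  container name or ID
#   $2  duration
logs_since() {
  docker logs --since "$2" "$1" 2>&1
}

# Usage (container names are assumptions):
#   recent_logs mosquitto 100
#   logs_since cloudflared 10m
```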

Log Patterns and Indicators

Mosquitto Startup Success Indicators

Opening ipv4 listen socket on port 1883.
Opening ipv4 listen socket on port 9001.
mosquitto version X.X.X running

Cloudflared Connection Success Indicators

Registered tunnel connection
Connection established

Log Aggregation

For production deployments, consider implementing centralized logging:
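One option is Docker's `syslog` logging driver, sketched below as a Compose override generated from a shell heredoc. The syslog address, the file name, and the service name `cloudflared` are hypothetical placeholders; other drivers may suit a given environment better.

```shell
# Write a Compose override that forwards both services' logs to a syslog server.
# The address below is a placeholder, not a real endpoint.
cat > docker-compose.logging.yml <<'EOF'
services:
  mosquitto:
    logging:
      driver: syslog
      options:
        syslog-address: "udp://logs.example.internal:514"
        tag: mosquitto
  cloudflared:
    logging:
      driver: syslog
      options:
        syslog-address: "udp://logs.example.internal:514"
        tag: cloudflared
EOF
```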

Sources: docker-compose.yml:4-17


Restart Policies and Recovery

Current Configuration

Both services use the unless-stopped restart policy, which provides:

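  • Automatic restart whenever the container exits, whatever the exit code
  • Respect for manual stops (docker-compose stop): a stopped container stays stopped
  • Restart when the Docker daemon restarts, unless the container had been stopped manually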

Configuration locations:

Monitoring Restart Behavior

Crash Loop Detection

Frequent restarts indicate underlying issues:

If restart count increases rapidly (>3 restarts in 1 minute), investigate:

  1. Check container logs for error messages
  2. Verify configuration file syntax
  3. Confirm environment variables are set
  4. Check resource constraints (memory/CPU limits)
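The threshold above can be checked by sampling Docker's `RestartCount` field twice and comparing the values. This is a sketch with hypothetical function names; the defaults mirror the "more than 3 restarts in 1 minute" rule.

```shell
# Print how many times Docker has restarted a container.
#   $1  container name or ID
restart_count() {
  docker inspect --format '{{.RestartCount}}' "$1"
}

# Succeed (crash loop suspected) if the restart count grows by more than
# the threshold within the sampling window. Returns 2 if inspect fails.
#   $1  container name or ID
#   $2  threshold (default 3)
#   $3  window in seconds (default 60)
is_crash_looping() {
  before=$(restart_count "$1") || return 2
  sleep "${3:-60}"
  after=$(restart_count "$1") || return 2
  [ $((after - before)) -gt "${2:-3}" ]
}
```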

Sources: docker-compose.yml:9 docker-compose.yml:15


CI Pipeline Integration

The GitHub Actions CI workflow serves as a reference implementation for automated health verification. The workflow demonstrates:

  1. Isolated Environment: Starts only the mosquitto service without dependencies
  2. Bounded Wait: Implements a timeout to prevent hanging builds
  3. State Verification: Uses Docker's native state inspection
  4. Clean Teardown: Ensures resources are released after testing
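The four points above can be sketched as a single function. This is not the workflow's actual script: the function name `ci_health_check` is hypothetical, and resolving the container ID with `docker-compose ps -q` is one assumed way to avoid hard-coding a container name.

```shell
# Start only the mosquitto service, wait for it to run, and always tear down.
#   $1  maximum attempts (default 10)
#   $2  seconds between attempts (default 10)
# The body is a subshell, so the EXIT trap fires when the function finishes
# and teardown runs on both success and failure.
ci_health_check() (
  trap 'docker-compose down' EXIT
  docker-compose up -d mosquitto || exit 1
  id=$(docker-compose ps -q mosquitto)
  i=1
  while [ "$i" -le "${1:-10}" ]; do
    if [ "$(docker inspect --format '{{.State.Status}}' "$id")" = "running" ]; then
      exit 0
    fi
    sleep "${2:-10}"
    i=$((i + 1))
  done
  exit 1
)
```

Putting the teardown in a trap rather than at the end of the script means a timeout or an unexpected error still releases the containers.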

CI Health Check Sequence

Sources: .github/workflows/ci.yml:24-42


Best Practices Summary

| Practice | Implementation | Benefit |
| --- | --- | --- |
| Bounded retries | Max 10 attempts with 10s interval | Prevents infinite waits |
| Docker native checks | Use docker inspect for state | No external dependencies |
| Log monitoring | Regularly check container logs | Early problem detection |
| Restart tracking | Monitor RestartCount metric | Identify crash loops |
| System topics | Subscribe to $SYS/# for MQTT stats | Broker-native monitoring |
| Health scripts | Automate multi-component checks | Consistent verification |
| External integration | Export metrics to monitoring systems | Production observability |

Sources: .github/workflows/ci.yml:27-39 docker-compose.yml:1-18
