Table of Contents
Practical Monitoring - Effective Strategies for the Real World by Mike Julian
I. Monitoring Principles
- Anti-Pattern #1: Tool Obsession
Monitoring Is Multiple Complex Problems Under One Name
Avoid Cargo-Culting Tools
Sometimes, You Really Do Have to Build It
The Single Pane of Glass Is a Myth
- Anti-Pattern #2: Monitoring-as-a-Job
- Anti-Pattern #3: Checkbox Monitoring
What Does “Working” Actually Mean? Monitor That.
OS Metrics Aren’t Very Useful — for Alerting
Collect Your Metrics More Often
Anti-Pattern #4: Using Monitoring as a Crutch
Anti-Pattern #5: Manual Configuration
Wrap-Up
2. Monitoring Design Patterns
- Pattern #1: Composable Monitoring
The Components of a Monitoring Service
Pattern #2: Monitor from the User Perspective
Pattern #3: Buy, Not Build
It’s Cheaper
You’re (Probably) Not an Expert at Architecting These Tools
SaaS Allows You to Focus on the Company’s Product
No, Really, SaaS Is Actually Better
Pattern #4: Continual Improvement
Wrap-Up
3. Alerts, On-Call, and Incident Management
What Makes a Good Alert?
Stop Using Email for Alerts
Write Runbooks
Arbitrary Static Thresholds Aren’t the Only Way
Delete and Tune Alerts
Use Maintenance Periods
Attempt Automated Self-Healing First
On-Call
Fixing False Alarms
Cutting Down on Needless Firefighting
Building a Better On-Call Rotation
Incident Management
Postmortems
Wrap-Up
4. Statistics Primer
Before Statistics in Systems Operations
Math to the Rescue!
Statistics Isn’t Magic
Mean and Average
Median
Seasonality
Quantiles
Standard Deviation
Wrap-Up
II. Monitoring Tactics
5. Monitoring the Business
Business KPIs
Two Real-World Examples
Yelp
Tying Business KPIs to Technical Metrics
My App Doesn’t Have Those Metrics!
Finding Your Company’s Business KPIs
Wrap-Up
6. Frontend Monitoring
The Cost of a Slow App
Two Approaches to Frontend Monitoring
Document Object Model (DOM)
Frontend Performance Metrics
OK, That’s Great, but How Do I Use This?
Logging
Synthetic Monitoring
Wrap-Up
7. Application Monitoring
Instrumenting Your Apps with Metrics
How It Works Under the Hood
Monitoring Build and Release Pipelines
Health Endpoint Pattern
Application Logging
Wait a Minute…Should I Have a Metric or a Log Entry?
What Should I Be Logging?
Write to Disk or Write to Network?
Serverless / Function-as-a-Service
Monitoring Microservice Architectures
Wrap-Up
8. Server Monitoring
Standard OS Metrics
CPU
Memory
Network
Disk
Load
SSL Certificates
SNMP
Web Servers
Database Servers
Load Balancers
Message Queues
Caching
DNS
NTP
Miscellaneous Corporate Infrastructure
DHCP
SMTP
Monitoring Scheduled Jobs
Logging
Collection
Storage
Analysis
Wrap-Up
9. Network Monitoring
The Pains of SNMP
What Is SNMP?
How Does It Work?
A Word on Security
How Do I Use SNMP?
Interface Metrics
Interface and Logging
Recap
Configuration Tracking
Voice and Video
Routing
Spanning Tree Protocol (STP)
Chassis CPU and Memory
Hardware
Flow Monitoring
Capacity Planning
Working Backward
Forecasting
Wrap-up
10. Security Monitoring
Monitoring and Compliance
User, Command, and Filesystem Auditing
Setting Up auditd
auditd and Remote Logs
Host Intrusion Detection System (HIDS)
rkhunter
Network Intrusion Detection System (NIDS)
Wrap-Up
11. Conducting a Monitoring Assessment
Business KPIs
Frontend Monitoring
Application and Server Monitoring
Security Monitoring
Alerting
Wrap-Up
A. An Example Runbook: Demo App
Demo App
Metadata
Escalation Procedure
External Dependencies
Internal Dependencies
Tech Stack
Metrics and Logs
Alerts
B. Availability Chart
Index
A
- alerts-alerting, Alerting, Alerts, On-Call, and Incident Management-Attempt Automated Self-Healing First
arbitrary static thresholds and, Arbitrary Static Thresholds Aren’t the Only Way
assessment example, Alerting
- auto-healing, Attempt Automated Self-Healing First
defining, What Makes a Good Alert?
desensitization to, Delete and Tune Alerts
email for, Stop Using Email for Alerts
flapping detection, Before Statistics in Systems Operations
maintenance period use for, Use Maintenance Periods
with Nagios, Before Statistics in Systems Operations
Amazon, The Cost of a Slow App
- analytics and reporting, Analytics and Reporting-Analytics and Reporting
anti-patterns, Monitoring Anti-Patterns-Wrap-Up
checkbox monitoring, Anti-Pattern #3: Checkbox Monitoring-Collect Your Metrics More Often
manual configuration, Anti-Pattern #5: Manual Configuration
monitoring as a crutch, Anti-Pattern #4: Using Monitoring as a Crutch
monitoring-as-a-job, Anti-Pattern #2: Monitoring-as-a-Job
tool obsession, Anti-Pattern #1: Tool Obsession-The Single Pane of Glass Is a Myth
- APM tools, Monitoring Is Multiple Complex Problems Under One Name, Monitoring Is Multiple Complex Problems Under One Name, Instrumenting Your Apps with Metrics (see also StatsD)
- application monitoring, Application Monitoring-Wrap-Up
assessment example, Application and Server Monitoring-Application and Server Monitoring
build and release pipeline monitoring, Monitoring Build and Release Pipelines-Monitoring Build and Release Pipelines
health endpoint patterns, Health Endpoint Pattern-Health Endpoint Pattern
instrumenting with metrics, Instrumenting Your Apps with Metrics-How It Works Under the Hood
logging, Application Logging-Write to Disk or Write to Network?
metrics versus log entries, Wait a Minute…Should I Have a Metric or a Log Entry?
microservice architectures, Monitoring Microservice Architectures-Monitoring Microservice Architectures
serverless platforms, Serverless / Function-as-a-Service
application performance monitoring (APM) tools (see APM tools)
arbitrary static thresholds, Arbitrary Static Thresholds Aren’t the Only Way
arithmetic mean, Mean and Average-Mean and Average
audisp-remote, auditd and Remote Logs
auditd, User, Command, and Filesystem Auditing-auditd and Remote Logs
auto-healing, Attempt Automated Self-Healing First
automation importance, Anti-Pattern #5: Manual Configuration
availability chart, Availability Chart
availability reporting, Analytics and Reporting-Analytics and Reporting
average, Mean and Average-Mean and Average
B
bad habits (see anti-patterns)
bandwidth, Interface Metrics, Interface Metrics
BGP routing, Routing
blackbox monitoring, Two Approaches to Frontend Monitoring
buffers, Memory
build and release pipeline monitoring, Monitoring Build and Release Pipelines-Monitoring Build and Release Pipelines
burn rate, Business KPIs
business KPIs (see KPIs (key performance indicators))
C
caches/caching, Memory, Caching
canary endpoint monitoring, Health Endpoint Pattern
capacity planning, Capacity Planning
churn rate, Business KPIs
cloud infrastructures, Monitoring Is Multiple Complex Problems Under One Name
cloud versus traditional architectures, Anti-Pattern #5: Manual Configuration
communication liaison, Incident Management
compliance, Monitoring and Compliance
composable monitoring, Pattern #1: Composable Monitoring-Alerting
alerting, Alerting
analytics and reporting, Analytics and Reporting-Analytics and Reporting
data collection, Data collection-Logs
data storage, Data storage-Data storage
visualization, Visualization-Visualization
configuration tracking, Configuration Tracking
console statement, Logging
consumption rate, Message Queues
continual improvement, Pattern #4: Continual Improvement
cost considerations, It’s Cheaper, No, Really, SaaS Is Actually Better
cost of goods sold (COGS), Business KPIs
cost per customer, Business KPIs
counters, Metrics
CPU usage, CPU, CPU and Memory
customer acquisition cost (CAC), Business KPIs
customer churn, Business KPIs
customer lifetime value (LTV), Business KPIs
D
daily active users (DAU), Business KPIs
dashboards, Visualization
data collection, Data collection-Logs
data storage, Data storage-Data storage
data visualization, Visualization-Visualization
database server performance, Database Servers-Database Servers
design patterns, Monitoring Design Patterns-Wrap-Up
buying tools versus building, Pattern #3: Buy, Not Build-No, Really, SaaS Is Actually Better
composable monitoring, Pattern #1: Composable Monitoring-Alerting
continual improvement, Pattern #4: Continual Improvement
monitoring from user perspective, Pattern #2: Monitor from the User Perspective, Monitoring the Business
DHCP, DHCP
disk performance, Disk-Disk
distributed tracing, Monitoring Microservice Architectures-Monitoring Microservice Architectures
DNS servers, DNS
DOM (Document Object Model), Document Object Model (DOM)
E
email alerts, Stop Using Email for Alerts
errors, Interface Metrics, Interface Metrics
Etsy, Monitoring Build and Release Pipelines
evicted items, Caching
F
false alarms, Fixing False Alarms
firefighting mode, Cutting Down on Needless Firefighting
flapping detection, Before Statistics in Systems Operations
flow monitoring, Flow Monitoring-Flow Monitoring
follow-the-sun (FTS) rotations, Building a Better On-Call Rotation
forecasting, Forecasting
frontend monitoring, Frontend Monitoring-Wrap-Up
assessment example, Frontend Monitoring
defining, Frontend Monitoring
logging, Logging
Navigation Timing API, Navigation Timing API-Navigation Timing API
performance importance, The Cost of a Slow App-The Cost of a Slow App
Real User Monitoring (RUM), Two Approaches to Frontend Monitoring
speed index, Speed Index
synthetic monitoring, Two Approaches to Frontend Monitoring, Synthetic Monitoring
function-as-a-service, Serverless / Function-as-a-Service
G
gauges, Metrics
Google Analytics, Two Approaches to Frontend Monitoring, OK, That’s Great, but How Do I Use This?
gross profit margin, Business KPIs
H
habits, bad (see anti-patterns)
health endpoint pattern monitoring, Health Endpoint Pattern-Health Endpoint Pattern
hit/miss ratio, Caching
host intrusion detection system (HIDS), Host Intrusion Detection System (HIDS)-rkhunter
I
incident commander (IC), Incident Management
incident management, Incident Management-Incident Management
IOPS (I/O per Second), Disk, Database Servers
iostat, Disk
IPFIX, Flow Monitoring
J
J-Flow, Flow Monitoring
JavaScript, Document Object Model (DOM)
jitter, Interface Metrics
K
keepalives, Web Servers
KPIs (key performance indicators), Business KPIs-Wrap-Up
determining, Finding Your Company’s Business KPIs-Finding Your Company’s Business KPIs, Business KPIs-Business KPIs
Reddit example, Reddit-Tying Business KPIs to Technical Metrics
tying to technical metrics, Tying Business KPIs to Technical Metrics-My App Doesn’t Have Those Metrics!
Yelp example, Yelp-Yelp
L
latency, Interface Metrics
line graphs, Visualization
load, Load
load balancers, Load Balancers
log analysis, Analysis
log collection, Logs-Logs, Logging
log entries, Application Logging-Write to Disk or Write to Network?
log levels, What Should I Be Logging?
log storage, Data storage, Storage, auditd and Remote Logs
logging, Logging
LTV (lifetime value), Business KPIs
M
maintenance periods, Use Maintenance Periods
manual configuration, Anti-Pattern #5: Manual Configuration
mean, Mean and Average-Mean and Average
median, Median
memory usage, CPU and Memory
memory used, Memory-Memory
message queues, Message Queues
metrics
bandwidth, Interface Metrics, Interface Metrics
CPU usage, CPU
disk performance, Disk-Disk
errors, Interface Metrics, Interface Metrics
jitter, Interface Metrics
latency, Interface Metrics
load, Load
memory used, Memory-Memory
network performance, Network
SNMP (Simple Network Management Protocol), Interface Metrics-Interface Metrics
standard OS, Standard OS Metrics-Load
throughput, Interface Metrics, Interface Metrics
versus log entries, Wait a Minute…Should I Have a Metric or a Log Entry?
metrics collection, Metrics
metrics collection frequency, Collect Your Metrics More Often
metrics storage, Data storage
MIBs (management information base files), How Does It Work?
microservice architectures, Monitoring Microservice Architectures-Monitoring Microservice Architectures
monitoring
reasons for ineffectiveness of, Anti-Pattern #3: Checkbox Monitoring-Collect Your Metrics More Often
monitoring assessment example, Conducting a Monitoring Assessment-Wrap-Up
monitoring service components, The Components of a Monitoring Service-Alerting (see also monitoring components)
monthly active users (MAU), Business KPIs
monthly recurring revenue, Business KPIs
N
Nagios, Pattern #1: Composable Monitoring, Arbitrary Static Thresholds Aren’t the Only Way
alerting with, Before Statistics in Systems Operations
statistics in, Math to the Rescue!
Navigation Timing API, Navigation Timing API-Navigation Timing API
NetFlow, Flow Monitoring
network intrusion detection system (NIDS), Network Intrusion Detection System (NIDS)-Network Intrusion Detection System (NIDS)
network monitoring, Network Monitoring-Wrap-up
capacity planning, Capacity Planning
configuration tracking, Configuration Tracking
CPU and memory usage, CPU and Memory
device chassis, Chassis
flow monitoring, Flow Monitoring-Flow Monitoring
hardware, Hardware
routing protocols, Routing
SNMP (see SNMP (Simple Network Management Protocol))
spanning tree protocol (STP), Spanning Tree Protocol (STP)
voice and video performance, Voice and Video
network performance, Network
network taps, Network Intrusion Detection System (NIDS)
normal distributions, Standard Deviation
NPS (net promoter score), Business KPIs
number of paying customers, Business KPIs
O
Observability Teams, Anti-Pattern #2: Monitoring-as-a-Job
Observer Effect, The, Monitoring Is Multiple Complex Problems Under One Name
OIDs (object identifiers), How Does It Work?
on-call, On-Call-Building a Better On-Call Rotation
compensation, Building a Better On-Call Rotation
rotations for, Building a Better On-Call Rotation-Building a Better On-Call Rotation
tools for, Building a Better On-Call Rotation
OOMKiller, Memory
OS metrics alerts, OS Metrics Aren’t Very Useful — for Alerting
OSPF routing, Routing
overreliance on monitoring, Anti-Pattern #4: Using Monitoring as a Crutch
P
page load times, The Cost of a Slow App-The Cost of a Slow App
percentiles, Quantiles
persistent connections, Web Servers
pie charts, Visualization
Pinterest, The Cost of a Slow App
postmortems, Postmortems
protocol changes, Spanning Tree Protocol (STP)
pull model of data collection, Data collection
push model of data collection, Data collection
Q
QoS (quality of service) monitoring, Voice and Video
qps (queries per second), Database Servers, DNS
quantiles, Quantiles
queue length, Message Queues
R
real user monitoring (RUM), Two Approaches to Frontend Monitoring
Reddit, Reddit-Tying Business KPIs to Technical Metrics
reporting and analytics, Analytics and Reporting-Analytics and Reporting
req/sec (requests per second), Web Servers
return codes, Health Endpoint Pattern
revenue per customer, Business KPIs
rkhunter, rkhunter-rkhunter
root bridge changes, Spanning Tree Protocol (STP)
rootkits, Host Intrusion Detection System (HIDS)-rkhunter
routing protocols, Routing
rsyslog, auditd and Remote Logs
run rate, Business KPIs
runbooks
abuse of, Anti-Pattern #5: Manual Configuration
example, An Example Runbook: Demo App-Alerts
linking alerts to, Write Runbooks
S
SaaS services, Pattern #3: Buy, Not Build-No, Really, SaaS Is Actually Better
scheduled jobs, Monitoring Scheduled Jobs-Monitoring Scheduled Jobs
scribe, Incident Management
seasonality, Seasonality
security information and event management (SIEM) system, Network Intrusion Detection System (NIDS)
security monitoring, Security Monitoring-Wrap-Up
assessment example, Security Monitoring
auditing users, commands, and filesystems, User, Command, and Filesystem Auditing-auditd and Remote Logs
compliance, Monitoring and Compliance
host intrusion detection system (HIDS), Host Intrusion Detection System (HIDS)-rkhunter
network intrusion detection system (NIDS), Network Intrusion Detection System (NIDS)-Network Intrusion Detection System (NIDS)
server monitoring, Server Monitoring-Wrap-Up
assessment example, Application and Server Monitoring-Application and Server Monitoring
caching, Caching
database servers, Database Servers-Database Servers
DHCP, DHCP
DNS, DNS
load balancer metrics, Load Balancers
log analysis, Analysis
log collection, Logging
log storage, Storage
message queues, Message Queues
NTP servers, NTP
scheduled jobs, Monitoring Scheduled Jobs-Monitoring Scheduled Jobs
SMTP, SMTP
SNMP, SNMP (see also SNMP (Simple Network Management Protocol))
SSL certificates, SSL Certificates
standard OS metrics, Standard OS Metrics-Load
web server performance, Web Servers-Web Servers
serverless platforms, Serverless / Function-as-a-Service
severity levels, What Should I Be Logging?
sFlow, Flow Monitoring
Shopzilla, The Cost of a Slow App
SLA (service-level availability), Analytics and Reporting-Analytics and Reporting
slaves, Database Servers, DNS
smoothing, Mean and Average
SNMP (Simple Network Management Protocol), SNMP, The Pains of SNMP-Recap
background, What Is SNMP?
codec in use, Voice and Video
command line use, How Do I Use SNMP?-That’s great, Mike. But where’s the list of OIDs I should monitor?
interface and logging, Interface and Logging
interface metrics, Interface Metrics-Interface Metrics
securing, A Word on Security
traps, How Does It Work?
versions, How Does It Work?
spanning tree protocol (STP), Spanning Tree Protocol (STP)
SPAs (single-page apps), Frontend Monitoring
speed index, Speed Index
SSL certificates, SSL Certificates
standard deviation, Standard Deviation-Standard Deviation
statistics, Statistics Primer-Wrap-Up
mean and average, Mean and Average-Mean and Average
median, Median
quantiles/percentiles, Quantiles
seasonality, Seasonality
standard deviation, Standard Deviation-Standard Deviation
StatsD, Instrumenting Your Apps with Metrics-How It Works Under the Hood, Serverless / Function-as-a-Service
status endpoint monitoring, Health Endpoint Pattern
strip charts, Visualization
structured logs, Logs-Logs, Application Logging
subject matter experts (SMEs), Incident Management
synthetic monitoring, Two Approaches to Frontend Monitoring, Synthetic Monitoring
syslog forwarding, Collection
syslogd/syslog-ng, auditd and Remote Logs
systems resiliency and stability, Cutting Down on Needless Firefighting
T
TCP versus UDP, Collection
throughput, Interface Metrics, Interface Metrics
tools, Anti-Pattern #1: Tool Obsession-The Single Pane of Glass Is a Myth
building, Sometimes, You Really Do Have to Build It
buying versus building, Pattern #3: Buy, Not Build-No, Really, SaaS Is Actually Better
cargo-culting tools, Avoid Cargo-Culting Tools
choosing, Monitoring Is Multiple Complex Problems Under One Name-Avoid Cargo-Culting Tools
cost considerations, It’s Cheaper, No, Really, SaaS Is Actually Better
mapping to dashboards, The Single Pane of Glass Is a Myth
observation tools, Monitoring Is Multiple Complex Problems Under One Name
standardization of, Monitoring Is Multiple Complex Problems Under One Name
tool creep, Monitoring Is Multiple Complex Problems Under One Name
tool fragmentation, Monitoring Is Multiple Complex Problems Under One Name
total addressable market (TAM), Business KPIs
traditional versus cloud architectures, Anti-Pattern #5: Manual Configuration
TSDB (time series database), Data storage
U
UDP versus TCP, Collection
unstructured logs, Logs-Logs, Application Logging
user perspective in monitoring, Pattern #2: Monitor from the User Perspective, Monitoring the Business
V
visualization of data, Visualization-Visualization
voice and video performance, Voice and Video
W
web server performance, Web Servers-Web Servers
WebpageTest.org, Two Approaches to Frontend Monitoring, Speed Index, Synthetic Monitoring
weekly active users (WAU), Business KPIs
whitebox monitoring, Two Approaches to Frontend Monitoring
Y
Yelp, Yelp-Yelp
Z
zone transfers, DNS
Fair Use Source: Practical Monitoring - Effective Strategies for the Real World by Mike Julian