Monitoring Microsoft SQL Server
The 'Managed Microsoft Sql Server' platform products (DBMS, DBA, AlwaysOn) are monitored by SCOM System Center Operations Manager
The Microsoft System Center Management Packs relevant for all managed sql products are 'Management Pack for SQL Server' and 'Management Pack for Windows Server Operating System'
The monitoring suite is used internally to fulfil the service level agreements and provide extra insight on customer support cases.
Managed MS SQL DBMS
The Managed MS SQL products are 'Managed MS SQL DBMS' and 'Managed MS SQL Always On'. This section describes DBMS-SLA-Alerts.
SQL data disk alert
Monitor Name | Specific | Warning Threshold | Critical Threshold |
---|---|---|---|
Disk Space Available (%) | D: E: T: X: (**) | 7% | 6% |
Disk Space Available (Gigabyte) | D: E: T: X: (**) | 200GB | 140GB |
Note: (**) both disk symptoms (percent and gigabyte) need to be TRUE to trigger an alert
Note: (**) for examples navigate to end-of-page
Windows OS disk alert
Monitor Name | Specific | Warning Threshold | Critical Threshold |
---|---|---|---|
Disk Space Available (%) | C: Disk (**) | 10% | 2% |
Disk Space Available (Gigabyte) | C: Disk (**) | 4GB | 3GB |
SQL Services
The following SQL services are alerted if stopped and not in any maintenance state:
Service | Display name | State |
---|---|---|
MSSQLSERVER | SQL Server (MSSQLSERVER) | Not Running |
SQLSERVERAGENT | SQL Server Agent (MSSQLSERVER) | Not Running |
Managed MS SQL DBMS DBA Mode
The Managed MS SQL DBMS DBA Mode gets alerted on Operating System level only
Please also refer to Monitoring SQL DBMS DBA Mode and the corresponding Service Matrix look for "DBMS monitoring"
Windows Operating System - all managed sql products
Windows OS Services
The following Windows services are enforced by desired state configuration:
Service | Display name | State |
---|---|---|
RPC Service | All RPC dependant services | Not Running |
VirusAgnt | Virus Protection Agent | Not Running |
wuauserv | Windows Update | Disabled |
EventLog | Windows Event Log | Not Running |
MpsSvc | Windows Firewall | Not Running |
Netlogon | Netlogon | Not Running |
Schedule | Task Scheduler | Not Running |
VMTools | VMTools | Not Running |
W32Time | Windows Time | Not Running |
WinRM | Windows Remote Management | Not Running |
Alarming
If one of the defined threshold values is exceeded, an alert is sent to the Swisscom support organisation.
Disk Capacity plan
Swisscom relies on the disk capacity plan of its customers.
In an alert event we plan capacity and reconfigure based on the data available, we do not have any information of the space needs in future.
Monitoring edge cases
Not enough time to proactively reconfigure the virtual machine due to sudden growth (disk full in seconds), automation dependencies timewise (storage vmotion time) or virtual infrastructure dependencies (active VM snapshots). These cases may cause outages of applications consuming the user databases or sql tempdb.
Year over year growth towards the maximum virtual machine size and not enough time left for planning and executing a migration.
Edge case scenario: first monitoring metric under warning threshold => large transaction fills up the disk in less than 5 minutes => transaction rolls back => second monitoring metric again under warning threshold => no alert is created
All these cases are rare but are avoided mainly by creating and maintaining a capacity plan.
Examples - Disk Space Available
Small disk example
Monitor Name | Total | Current | Warning | Critical | Details |
---|---|---|---|---|---|
Disk Space Available (%) | 100% | 4% | > 7% | > 6% | percent symptom changes to TRUE |
Disk Space Available (Gigabyte) | 100GB | 4GB | < 200GB | < 140GB | GB symptom is always TRUE |
Small disk example - the most common scenario => the percent threshold is the last symptom which changed from FALSE to TRUE
It alerts as a critical severity
Very large disk example
Monitor Name | Total | Current | Warning | Critical | Details |
---|---|---|---|---|---|
Disk Space Available (%) | 100% | 3.5% | > 7% | > 6% | percent symptom TRUE and Critical |
Disk Space Available (Gigabyte) | 5TB | 180GB | < 200GB | < 140GB | GB symptom changes to Warning |
Very large disk example - the gigabyte threshold is the last symptom which changed from FALSE to TRUE
It alerts as a warning severity alert
Extra large disk example
Monitor Name | Total | Current | Warning | Critical | Details |
---|---|---|---|---|---|
Disk Space Available (%) | 100% | 0.5% | > 7% | > 6% | percent symptom TRUE and Critical |
Disk Space Available (Gigabyte) | 9TB | 50GB | < 200GB | < 140GB | GB symptom changes to Critical |
Extra large disk example - the gigabyte threshold is the last symptom which changed from FALSE to TRUE
It alerts as a critical severity alert
Monitoring with Telegraf
Using sencond day action "Set MS SQL Monitoring" the user is able to enable Telegraf monitoring agent with custom defined parameters. This can be used to create alerting or visualization dashboards.
[global_tags]
apmid = "SQLTEST"
[agent]
interval = "30s"
round_interval = true
metric_batch_size = 1000
metric_buffer_limit = 10000
collection_jitter = "0s"
flush_interval = "10s"
flush_jitter = "0s"
precision = ""
hostname = ""
omit_hostname = false
debug = true
quiet = false
logtarget = "eventlog"
###############################################################################
# OUTPUT PLUGINS #
###############################################################################
[[outputs.influxdb]]
###############################################################################
# INPUT PLUGINS #
###############################################################################
# Read metrics about cpu usage
[[inputs.cpu]]
## Whether to report per-cpu stats or not
percpu = true
## Whether to report total system cpu stats or not
totalcpu = true
## If true, collect raw CPU time metrics
collect_cpu_time = false
## If true, compute and report the sum of all non-idle CPU states
report_active = false
# Read metrics about disk usage by mount point
[[inputs.disk]]
## By default stats will be gathered for all mount points.
## Set mount_points will restrict the stats to only the specified mount points.
# mount_points = ["/"]
## Ignore mount points by filesystem type.
ignore_fs = ["tmpfs", "devtmpfs", "devfs", "iso9660", "overlay", "aufs", "squashfs"]
# Read metrics about disk IO by device
[[inputs.diskio]]
## By default, telegraf will gather stats for all devices including
## disk partitions.
## Setting devices will restrict the stats to the specified devices.
# devices = ["sda", "sdb", "vd*"]
## Uncomment the following line if you need disk serial numbers.
# skip_serial_number = false
#
## On systems which support it, device metadata can be added in the form of
## tags.
## Currently only Linux is supported via udev properties. You can view
## available properties for a device by running:
## 'udevadm info -q property -n /dev/sda'
## Note: Most, but not all, udev properties can be accessed this way. Properties
## that are currently inaccessible include DEVTYPE, DEVNAME, and DEVPATH.
# device_tags = ["ID_FS_TYPE", "ID_FS_USAGE"]
#
## Using the same metadata source as device_tags, you can also customize the
## name of the device via templates.
## The 'name_templates' parameter is a list of templates to try and apply to
## the device. The template may contain variables in the form of '$PROPERTY' or
## '${PROPERTY}'. The first template which does not contain any variables not
## present for the device is used as the device name tag.
## The typical use case is for LVM volumes, to get the VG/LV name instead of
## the near-meaningless DM-0 name.
# name_templates = ["$ID_FS_LABEL","$DM_VG_NAME/$DM_LV_NAME"]
# Get kernel statistics from /proc/stat
[[inputs.kernel]]
# no configuration
# Read metrics about memory usage
[[inputs.mem]]
# no configuration
# Get the number of processes and group them by status
[[inputs.processes]]
# no configuration
# Read metrics about swap memory usage
[[inputs.swap]]
# no configuration
# Read metrics about system load & uptime
[[inputs.system]]
# Read metrics from Microsoft SQL Server
[[inputs.sqlserver]]
## Specify instances to monitor with a list of connection strings.
## All connection parameters are optional.
## By default, the host is localhost, listening on default port, TCP 1433.
## for Windows, the user is the currently running AD user (SSO).
## See https://github.com/denisenkom/go-mssqldb for detailed connection
## parameters, in particular, tls connections can be created like so:
## "encrypt=true;certificate=<cert>;hostNameInCertificate=<SqlServer host fqdn>"
servers = [
"Server=localhost;Port=1433;Database=TempDB;app name=telegraf;log=1;",
]
## Timeout for query execution operation
## Note that the timeout for queries is per query not per gather.
## 0 value means no timeout
# query_timeout = "0s"
## Authentication method
## valid methods: "connection_string", "AAD"
# auth_method = "connection_string"
## "database_type" enables a specific set of queries depending on the database type. If specified, it replaces azuredb = true/false and query_version = 2
## In the config file, the sql server plugin section should be repeated each with a set of servers for a specific database_type.
## Possible values for database_type are - "SQLServer" or "AzureSQLDB" or "AzureSQLManagedInstance" or "AzureSQLPool"
database_type = "SQLServer"
## A list of queries to include. If not specified, all the below listed queries are used.
include_query = []
## A list of queries to explicitly ignore.
exclude_query = ["SQLServerAvailabilityReplicaStates", "SQLServerDatabaseReplicaStates"]