Monitoring Microsoft SQL Server

The 'Managed Microsoft Sql Server' platform products (DBMS, DBA, AlwaysOn) are monitored by SCOM System Center Operations Manager

The Microsoft System Center Management Packs relevant for all managed sql products are 'Management Pack for SQL Server' and 'Management Pack for Windows Server Operating System'

The monitoring suite is used internally to fulfil the service level agreements and provide extra insight on customer support cases.

Managed MS SQL DBMS

The Managed MS SQL products are 'Managed MS SQL DBMS' and 'Managed MS SQL Always On'. This section describes DBMS-SLA-Alerts.

SQL data disk alert

Monitor Name	Specific	Warning Threshold	Critical Threshold
Disk Space Available (%)	D: E: T: X: (**)	7%	6%
Disk Space Available (Gigabyte)	D: E: T: X: (**)	200GB	140GB

Note: (**) both disk symptoms (percent and gigabyte) need to be TRUE to trigger an alert

Note: (**) for examples navigate to end-of-page

Windows OS disk alert

Monitor Name	Specific	Warning Threshold	Critical Threshold
Disk Space Available (%)	C: Disk (**)	10%	2%
Disk Space Available (Gigabyte)	C: Disk (**)	4GB	3GB

SQL Services

The following SQL services are alerted if stopped and not in any maintenance state:

Service	Display name	State
MSSQLSERVER	SQL Server (MSSQLSERVER)	Not Running
SQLSERVERAGENT	SQL Server Agent (MSSQLSERVER)	Not Running

Managed MS SQL DBMS DBA Mode

The Managed MS SQL DBMS DBA Mode gets alerted on Operating System level only

Please also refer to Monitoring SQL DBMS DBA Mode and the corresponding Service Matrix look for "DBMS monitoring"

Windows Operating System - all managed sql products

Windows OS Services

The following Windows services are enforced by desired state configuration:

Service	Display name	State
RPC Service	All RPC dependant services	Not Running
VirusAgnt	Virus Protection Agent	Not Running
wuauserv	Windows Update	Disabled
EventLog	Windows Event Log	Not Running
MpsSvc	Windows Firewall	Not Running
Netlogon	Netlogon	Not Running
Schedule	Task Scheduler	Not Running
VMTools	VMTools	Not Running
W32Time	Windows Time	Not Running
WinRM	Windows Remote Management	Not Running

Alarming

If one of the defined threshold values is exceeded, an alert is sent to the Swisscom support organisation.

Disk Capacity plan

Swisscom relies on the disk capacity plan of its customers.

In an alert event we plan capacity and reconfigure based on the data available, we do not have any information of the space needs in future.

Monitoring edge cases

Not enough time to proactively reconfigure the virtual machine due to sudden growth (disk full in seconds), automation dependencies timewise (storage vmotion time) or virtual infrastructure dependencies (active VM snapshots). These cases may cause outages of applications consuming the user databases or sql tempdb.
Year over year growth towards the maximum virtual machine size and not enough time left for planning and executing a migration.
Edge case scenario: first monitoring metric under warning threshold => large transaction fills up the disk in less than 5 minutes => transaction rolls back => second monitoring metric again under warning threshold => no alert is created

All these cases are rare but are avoided mainly by creating and maintaining a capacity plan.

Examples - Disk Space Available

Small disk example

Monitor Name	Total	Current	Warning	Critical	Details
Disk Space Available (%)	100%	4%	> 7%	> 6%	percent symptom changes to TRUE
Disk Space Available (Gigabyte)	100GB	4GB	< 200GB	< 140GB	GB symptom is always TRUE

Small disk example - the most common scenario => the percent threshold is the last symptom which changed from FALSE to TRUE

It alerts as a critical severity

Very large disk example

Monitor Name	Total	Current	Warning	Critical	Details
Disk Space Available (%)	100%	3.5%	> 7%	> 6%	percent symptom TRUE and Critical
Disk Space Available (Gigabyte)	5TB	180GB	< 200GB	< 140GB	GB symptom changes to Warning

Very large disk example - the gigabyte threshold is the last symptom which changed from FALSE to TRUE

It alerts as a warning severity alert

Extra large disk example

Monitor Name	Total	Current	Warning	Critical	Details
Disk Space Available (%)	100%	0.5%	> 7%	> 6%	percent symptom TRUE and Critical
Disk Space Available (Gigabyte)	9TB	50GB	< 200GB	< 140GB	GB symptom changes to Critical

Extra large disk example - the gigabyte threshold is the last symptom which changed from FALSE to TRUE

It alerts as a critical severity alert

Monitoring with Telegraf

Using sencond day action "Set MS SQL Monitoring" the user is able to enable Telegraf monitoring agent with custom defined parameters. This can be used to create alerting or visualization dashboards.

[global_tags]

 apmid = "SQLTEST"
 
[agent]

 interval = "30s"
 round_interval = true
 metric_batch_size = 1000
 metric_buffer_limit = 10000
 collection_jitter = "0s"
 flush_interval = "10s"
 flush_jitter = "0s"
 precision = ""
 hostname = ""
 omit_hostname = false
 debug = true
 quiet = false
  logtarget = "eventlog"
 
###############################################################################
#                            OUTPUT PLUGINS                                   #
###############################################################################
 
[[outputs.influxdb]]
 

###############################################################################
#                            INPUT PLUGINS                                    #
###############################################################################
 

# Read metrics about cpu usage
[[inputs.cpu]]
  ## Whether to report per-cpu stats or not
  percpu = true
  ## Whether to report total system cpu stats or not
  totalcpu = true
  ## If true, collect raw CPU time metrics
  collect_cpu_time = false
  ## If true, compute and report the sum of all non-idle CPU states
  report_active = false
 

# Read metrics about disk usage by mount point
[[inputs.disk]]
  ## By default stats will be gathered for all mount points.
  ## Set mount_points will restrict the stats to only the specified mount points.
  # mount_points = ["/"]
 
  ## Ignore mount points by filesystem type.
  ignore_fs = ["tmpfs", "devtmpfs", "devfs", "iso9660", "overlay", "aufs", "squashfs"]
 

# Read metrics about disk IO by device
[[inputs.diskio]]
  ## By default, telegraf will gather stats for all devices including
  ## disk partitions.
  ## Setting devices will restrict the stats to the specified devices.
  # devices = ["sda", "sdb", "vd*"]
  ## Uncomment the following line if you need disk serial numbers.
  # skip_serial_number = false
  #
  ## On systems which support it, device metadata can be added in the form of
  ## tags.
  ## Currently only Linux is supported via udev properties. You can view
  ## available properties for a device by running:
  ## 'udevadm info -q property -n /dev/sda'
  ## Note: Most, but not all, udev properties can be accessed this way. Properties
  ## that are currently inaccessible include DEVTYPE, DEVNAME, and DEVPATH.
  # device_tags = ["ID_FS_TYPE", "ID_FS_USAGE"]
  #
  ## Using the same metadata source as device_tags, you can also customize the
  ## name of the device via templates.
  ## The 'name_templates' parameter is a list of templates to try and apply to
  ## the device. The template may contain variables in the form of '$PROPERTY' or
  ## '${PROPERTY}'. The first template which does not contain any variables not
  ## present for the device is used as the device name tag.
  ## The typical use case is for LVM volumes, to get the VG/LV name instead of
  ## the near-meaningless DM-0 name.
  # name_templates = ["$ID_FS_LABEL","$DM_VG_NAME/$DM_LV_NAME"]
 

# Get kernel statistics from /proc/stat
[[inputs.kernel]]
  # no configuration
 

# Read metrics about memory usage
[[inputs.mem]]
  # no configuration
 

# Get the number of processes and group them by status
[[inputs.processes]]
  # no configuration
 

# Read metrics about swap memory usage
[[inputs.swap]]
  # no configuration
 

# Read metrics about system load & uptime
[[inputs.system]]
 
# Read metrics from Microsoft SQL Server
[[inputs.sqlserver]]
  ## Specify instances to monitor with a list of connection strings.
  ## All connection parameters are optional.
  ## By default, the host is localhost, listening on default port, TCP 1433.
  ##   for Windows, the user is the currently running AD user (SSO).
  ##   See https://github.com/denisenkom/go-mssqldb for detailed connection
  ##   parameters, in particular, tls connections can be created like so:
  ##   "encrypt=true;certificate=<cert>;hostNameInCertificate=<SqlServer host fqdn>"
  servers = [
    "Server=localhost;Port=1433;Database=TempDB;app name=telegraf;log=1;",
  ]
 
  ## Timeout for query execution operation
  ## Note that the timeout for queries is per query not per gather.
  ## 0 value means no timeout
  # query_timeout = "0s"
 
  ## Authentication method
  ## valid methods: "connection_string", "AAD"
  # auth_method = "connection_string"
 
  ## "database_type" enables a specific set of queries depending on the database type. If specified, it replaces azuredb = true/false and query_version = 2
  ## In the config file, the sql server plugin section should be repeated each with a set of servers for a specific database_type.
  ## Possible values for database_type are - "SQLServer" or "AzureSQLDB" or "AzureSQLManagedInstance" or "AzureSQLPool"
  database_type = "SQLServer"
 
  ## A list of queries to include. If not specified, all the below listed queries are used.
  include_query = []
 
  ## A list of queries to explicitly ignore.
  exclude_query = ["SQLServerAvailabilityReplicaStates", "SQLServerDatabaseReplicaStates"]

# Monitoring Microsoft SQL Server

# Managed MS SQL DBMS

# SQL data disk alert

# Windows OS disk alert

# SQL Services

# Managed MS SQL DBMS DBA Mode

# Windows Operating System - all managed sql products

# Windows OS Services

# Alarming

# Disk Capacity plan

# Monitoring edge cases

# Examples - Disk Space Available

# Small disk example

# Very large disk example

# Extra large disk example

# Monitoring with Telegraf