Monitoring Microsoft SQL Server

The 'Managed Microsoft Sql Server' platform products (DBMS, DBA, AlwaysOn) are monitored by SCOM System Center Operations Manager

The Microsoft System Center Management Packs relevant for all managed sql products are 'Management Pack for SQL Server' and 'Management Pack for Windows Server Operating System'

The monitoring suite is used internally to fulfil the service level agreements and provide extra insight on customer support cases.


Managed MS SQL DBMS

The Managed MS SQL products are 'Managed MS SQL DBMS' and 'Managed MS SQL Always On'. This section describes DBMS-SLA-Alerts.

SQL data disk alert

Monitor NameSpecificWarning ThresholdCritical Threshold
Disk Space Available (%)D: E: T: X: (**)7%6%
Disk Space Available (Gigabyte)D: E: T: X: (**)200GB140GB

Note: (**) both disk symptoms (percent and gigabyte) need to be TRUE to trigger an alert

Note: (**) for examples navigate to end-of-page

Windows OS disk alert

Monitor NameSpecificWarning ThresholdCritical Threshold
Disk Space Available (%)C: Disk (**)10%2%
Disk Space Available (Gigabyte)C: Disk (**)4GB3GB

SQL Services

The following SQL services are alerted if stopped and not in any maintenance state:

ServiceDisplay nameState
MSSQLSERVERSQL Server (MSSQLSERVER)Not Running
SQLSERVERAGENTSQL Server Agent (MSSQLSERVER)Not Running

Managed MS SQL DBMS DBA Mode

The Managed MS SQL DBMS DBA Mode gets alerted on Operating System level only

Please also refer to Monitoring SQL DBMS DBA Mode and the corresponding Service Matrix look for "DBMS monitoring"


Windows Operating System - all managed sql products

Windows OS Services

The following Windows services are enforced by desired state configuration:

ServiceDisplay nameState
RPC ServiceAll RPC dependant servicesNot Running
VirusAgntVirus Protection AgentNot Running
wuauservWindows UpdateDisabled
EventLogWindows Event LogNot Running
MpsSvcWindows FirewallNot Running
NetlogonNetlogonNot Running
ScheduleTask SchedulerNot Running
VMToolsVMToolsNot Running
W32TimeWindows TimeNot Running
WinRMWindows Remote ManagementNot Running

Alarming

If one of the defined threshold values is exceeded, an alert is sent to the Swisscom support organisation.


Disk Capacity plan

Swisscom relies on the disk capacity plan of its customers.

In an alert event we plan capacity and reconfigure based on the data available, we do not have any information of the space needs in future.


Monitoring edge cases

  • Not enough time to proactively reconfigure the virtual machine due to sudden growth (disk full in seconds), automation dependencies timewise (storage vmotion time) or virtual infrastructure dependencies (active VM snapshots). These cases may cause outages of applications consuming the user databases or sql tempdb.

  • Year over year growth towards the maximum virtual machine size and not enough time left for planning and executing a migration.

  • Edge case scenario: first monitoring metric under warning threshold => large transaction fills up the disk in less than 5 minutes => transaction rolls back => second monitoring metric again under warning threshold => no alert is created

All these cases are rare but are avoided mainly by creating and maintaining a capacity plan.

Examples - Disk Space Available

Small disk example
Monitor NameTotalCurrentWarningCriticalDetails
Disk Space Available (%)100%4%> 7%> 6%percent symptom changes to TRUE
Disk Space Available (Gigabyte)100GB4GB< 200GB< 140GBGB symptom is always TRUE

Small disk example - the most common scenario => the percent threshold is the last symptom which changed from FALSE to TRUE

It alerts as a critical severity

Very large disk example
Monitor NameTotalCurrentWarningCriticalDetails
Disk Space Available (%)100%3.5%> 7%> 6%percent symptom TRUE and Critical
Disk Space Available (Gigabyte)5TB180GB< 200GB< 140GBGB symptom changes to Warning

Very large disk example - the gigabyte threshold is the last symptom which changed from FALSE to TRUE

It alerts as a warning severity alert

Extra large disk example
Monitor NameTotalCurrentWarningCriticalDetails
Disk Space Available (%)100%0.5%> 7%> 6%percent symptom TRUE and Critical
Disk Space Available (Gigabyte)9TB50GB< 200GB< 140GBGB symptom changes to Critical

Extra large disk example - the gigabyte threshold is the last symptom which changed from FALSE to TRUE

It alerts as a critical severity alert

Monitoring with Telegraf

Using sencond day action "Set MS SQL Monitoring" the user is able to enable Telegraf monitoring agent with custom defined parameters. This can be used to create alerting or visualization dashboards.

[global_tags]

 apmid = "SQLTEST"
 
[agent]

 interval = "30s"
 round_interval = true
 metric_batch_size = 1000
 metric_buffer_limit = 10000
 collection_jitter = "0s"
 flush_interval = "10s"
 flush_jitter = "0s"
 precision = ""
 hostname = ""
 omit_hostname = false
 debug = true
 quiet = false
  logtarget = "eventlog"
 
###############################################################################
#                            OUTPUT PLUGINS                                   #
###############################################################################
 
[[outputs.influxdb]]
 

###############################################################################
#                            INPUT PLUGINS                                    #
###############################################################################
 

# Read metrics about cpu usage
[[inputs.cpu]]
  ## Whether to report per-cpu stats or not
  percpu = true
  ## Whether to report total system cpu stats or not
  totalcpu = true
  ## If true, collect raw CPU time metrics
  collect_cpu_time = false
  ## If true, compute and report the sum of all non-idle CPU states
  report_active = false
 

# Read metrics about disk usage by mount point
[[inputs.disk]]
  ## By default stats will be gathered for all mount points.
  ## Set mount_points will restrict the stats to only the specified mount points.
  # mount_points = ["/"]
 
  ## Ignore mount points by filesystem type.
  ignore_fs = ["tmpfs", "devtmpfs", "devfs", "iso9660", "overlay", "aufs", "squashfs"]
 

# Read metrics about disk IO by device
[[inputs.diskio]]
  ## By default, telegraf will gather stats for all devices including
  ## disk partitions.
  ## Setting devices will restrict the stats to the specified devices.
  # devices = ["sda", "sdb", "vd*"]
  ## Uncomment the following line if you need disk serial numbers.
  # skip_serial_number = false
  #
  ## On systems which support it, device metadata can be added in the form of
  ## tags.
  ## Currently only Linux is supported via udev properties. You can view
  ## available properties for a device by running:
  ## 'udevadm info -q property -n /dev/sda'
  ## Note: Most, but not all, udev properties can be accessed this way. Properties
  ## that are currently inaccessible include DEVTYPE, DEVNAME, and DEVPATH.
  # device_tags = ["ID_FS_TYPE", "ID_FS_USAGE"]
  #
  ## Using the same metadata source as device_tags, you can also customize the
  ## name of the device via templates.
  ## The 'name_templates' parameter is a list of templates to try and apply to
  ## the device. The template may contain variables in the form of '$PROPERTY' or
  ## '${PROPERTY}'. The first template which does not contain any variables not
  ## present for the device is used as the device name tag.
  ## The typical use case is for LVM volumes, to get the VG/LV name instead of
  ## the near-meaningless DM-0 name.
  # name_templates = ["$ID_FS_LABEL","$DM_VG_NAME/$DM_LV_NAME"]
 

# Get kernel statistics from /proc/stat
[[inputs.kernel]]
  # no configuration
 

# Read metrics about memory usage
[[inputs.mem]]
  # no configuration
 

# Get the number of processes and group them by status
[[inputs.processes]]
  # no configuration
 

# Read metrics about swap memory usage
[[inputs.swap]]
  # no configuration
 

# Read metrics about system load & uptime
[[inputs.system]]
 
# Read metrics from Microsoft SQL Server
[[inputs.sqlserver]]
  ## Specify instances to monitor with a list of connection strings.
  ## All connection parameters are optional.
  ## By default, the host is localhost, listening on default port, TCP 1433.
  ##   for Windows, the user is the currently running AD user (SSO).
  ##   See https://github.com/denisenkom/go-mssqldb for detailed connection
  ##   parameters, in particular, tls connections can be created like so:
  ##   "encrypt=true;certificate=<cert>;hostNameInCertificate=<SqlServer host fqdn>"
  servers = [
    "Server=localhost;Port=1433;Database=TempDB;app name=telegraf;log=1;",
  ]
 
  ## Timeout for query execution operation
  ## Note that the timeout for queries is per query not per gather.
  ## 0 value means no timeout
  # query_timeout = "0s"
 
  ## Authentication method
  ## valid methods: "connection_string", "AAD"
  # auth_method = "connection_string"
 
  ## "database_type" enables a specific set of queries depending on the database type. If specified, it replaces azuredb = true/false and query_version = 2
  ## In the config file, the sql server plugin section should be repeated each with a set of servers for a specific database_type.
  ## Possible values for database_type are - "SQLServer" or "AzureSQLDB" or "AzureSQLManagedInstance" or "AzureSQLPool"
  database_type = "SQLServer"
 
  ## A list of queries to include. If not specified, all the below listed queries are used.
  include_query = []
 
  ## A list of queries to explicitly ignore.
  exclude_query = ["SQLServerAvailabilityReplicaStates", "SQLServerDatabaseReplicaStates"]

Last Updated: