The SPS/Alert Manager is designed to solve this common problem. It monitors system elements, reads messages originating from various systems, modules and applications and funnels the information to a single terminal. Message filters assure that only important information is passed to the operator. Critical alerts cannot be ignored; messages will remain posted on the screen until the problem is attended to, acknowledged and given a proper entry by the operator.
The SPS/Alert Manager automatically executes user-defined recovery and corrective procedures and submits critical information to selected terminals, E-Mail users or other presentation platforms.
Access to certain Classes may restricted on a per-user basis. To restrict access to a class code, create a [class-code].ftr file in the alert_manager directory and set standard access rights (give/remove_access) to determine who has access to this Class.
give_access null Scheduler_tab.ftr -user John_Smith.*
A user can be limited to a single function. For example, Tim might need access only to the Scheduler interface. In this case, Tim will have null access to all FTR files and read access to the Scheduler_tab.ftr and go directly to the Scheduler window as soon as he logs into the system.
Features files:
Features can also be used to designate certain Classes of messages to individual users. The idea here is that some users may need to look only at some specific errors while others are interested or should have access to other error types. To create a Feature file for a Class of messages, create a [class-code].ftr file and assign standard access-right (acls) to it. While looking at the Alerts Monitor or when running reports, records with class codes to which the user has no access to will be masked off and will not appear on the Monitor or the report.
Introduction
While operating modern sophisticated systems, it is extremely difficult, if not impossible, to notice and resolve all system and application events as they occur. Applications may produce any number of logs, errors and output files. The operating system alone may generate thousands of messages every day. Often, serious production problems may be prevented if alerts are noticed and responded to in time.
Definitions
Objects
An object is any monitored element on the system. For example, CPU level, disk-space, a file, communication line etc. An object is identified with a unique logical-nameClasses (Alerts)
Classes identifies the severity level and define what actions the Server must take for specific conditions. AlertManager assigns each message a class code. A Class code can also be referred to as the state of a monitored Object or simply an Alert. For example CPU level can change from Normal (NOR) to Warning (WAR) and then back to Normal. The user may define any number of new Class codes and use them throughout the system. Thresholds
A set of numeric values that is used to indicate a certain condition. Most Objects, will have three levels of thresholds - normal, warning and critical. Time Windows
A period during which some known Thresholds apply. Time windows allow the use of different Threshold depending on the expected system load.Intervals
Intervals are short timeframes, specified in seconds or minutes, between two monitored events.Filters
Filters are predefined set of keywords and phrases. Filters are used to process operating system and application messages that may appear in any file on the system. Filters determine how messages are classified and treated. Acknowledgements
Certain important conditions may be configured to require operator's Acknowledgement. AlertManager audits this activity and record the name of the operator and time of the acknowledgement. Escalation
Messages that require operator's acknowledgement, may be escalated, if not acknowledged in time, to a higher severity level thus triggering different actions.Carry-Over
Ever day at midnight, AlertManager creates a new daily log file. Carry-Over is a process that AlertManager performs at this time, and is basically the copying over of all un-acknowledged messages to the new daily log flle.Special Keywords
AlertManager uses special keywords which are replaced during run-time when recording Alerts and when building command lines for execution (see =start_command). The following Special Keywords are used:
WEB Reports
Alert Manager can be configured to deliver reports, out files, configuration tables and directory listings directly to your web Browser. For more information on this WEB-Server functionality, see The WEB Table (sps_web.table)
Features
Features allow the SysAdmin to tailor the Alert-Manager interface by enabling and disabling different features and functions based on the user name. Features are enabled or disabled by setting up standard access rights (give-access) to the FTR (feature) files. For example: you can remove John's access to the Scheduler interface by removing the Scheduler Tab as follows:
Alerts_Monitor_tab.ftr
Batch_tab.ftr
cpu_performance_box.ftr
Disk_Usage_tab.ftr
drms_operations_menu_box.ftr
DRMS_tab.ftr
Explorer_tab.ftr
List_Users_tab.ftr
Menu_System_tab.ftr
Network_tab.ftr
operations_menu_box.ftr
performance_graphs.ftr
process_breakdown_box.ftr
Queue_Monitor_tab.ftr
recent_status_box.ftr
Scheduler_tab.ftr
System_Usage_tab.ftr
vos_daily_logs_box.ftr
Environments
Environments are complete or partial sets of configuration files. Using multiple Environments allow running multiple instances of AlertManager servers each working with its own separate configuration set (Environment).
| System Meters | |||||||||||||||||||||||||||||||||||||||||||||||||||
| Monitored Object | Warning Level | Critical Level | Alert Message(s) | ||||||||||||||||||||||||||||||||||||||||||||||||
| CPU | 40% | 80% | CPU level at XX% Core level at XX%. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Empty-Idle | 30% | 10% | Empty Idle level at XX%. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Memory Used | 60% | 80% | Low on memory. Used: XX%. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Paging Used | 60% | 80% | Low on paging area. Used: XX%. | ||||||||||||||||||||||||||||||||||||||||||||||||
| I/O rate | Not Set | Not Set | I/O rate at XX per second. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Page Faults | Not Set | Not Set | Page Faults at XX%. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Interrupts | Not Set | Not Set | Interrupts at XX%. | ||||||||||||||||||||||||||||||||||||||||||||||||
| All SYSTEM meteres are within normal range | Performance meters are back to normal levels. | ||||||||||||||||||||||||||||||||||||||||||||||||||
| Other Meters | |||||||||||||||||||||||||||||||||||||||||||||||||||
| Monitored Object | Warning Level | Critical Level | Alert Message(s) | ||||||||||||||||||||||||||||||||||||||||||||||||
| Disk Space Used | 80% | 90% | Disk [disk-name] is at XX% used. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Disk Read Busy | 40% | 60% | Disk [disk-name] is read-busy at XX%. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Disk Write Busy | 40% | 60% | Disk [disk-name] is write-busy at XX%. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Disk I/O Rate | Not Set | Not Set | Disk [disk-name] is busy at XX I/O per second. | ||||||||||||||||||||||||||||||||||||||||||||||||
| All DISK meteres are within normal range | Free space on [disk-name] and I/O rate back to normal levels. | ||||||||||||||||||||||||||||||||||||||||||||||||||
| VOS Queues | 100 msgs. | 200 msgs. | Pending message count on [queue-name]is XXX. Pending message count on [queue-name] is back to normal levels. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Process(es) | Conditions: running / not-running on schedule
Resources: CPU, I/O rate, memory, idle too long. | Process[es] [process-name] not running. Process[es] [process-name] is running; Not intended to run now. Process[es] [process-name] has been idle in the last XX minute(s). Process[es] [process-name] consuming XX% cpu. Process[es] [process-name] performing XXX I/Os per second. Process[es] [process-name] using XX% of total memory.' Process[es] [process-name] running now. (as planned!). Process[es] [process-name] are not running. (as planned!). | |||||||||||||||||||||||||||||||||||||||||||||||||
| File Watchdog | Conditions: file arrived / created, file missing | File arrived/created: [nick-name] File [nick-name] has not arrived/created. | |||||||||||||||||||||||||||||||||||||||||||||||||
| VOS System Log (syserr) | Conditions: any user-supplied filtering / conditions. | Original syserr_log.(date) message. | |||||||||||||||||||||||||||||||||||||||||||||||||
| VOS Security Log (syserr) | Conditions: any user-supplied filtering / conditions. | Original security_log.(date) message. | |||||||||||||||||||||||||||||||||||||||||||||||||
| Application Logs | Conditions: any user-supplied filtering / conditions. | Original log message. | |||||||||||||||||||||||||||||||||||||||||||||||||
| Security Filters | |||||||||||||||||||||||||||||||||||||||||||||||||||
| Filter name | Class Code | Action taken / Notes
| Internal filter | LIO | Records all login and logout events; records user name and session length in the daily log and in a separate database.
| Internal filter | PRI | Records all privileged commands executed by non-privileged users.
| Security_01 | S01 | Detected changes to login_admin; AM automatically restores the default settings.
| Security_02 | S02 | Detected changes to logout_admin; AM automatically restores the default settings.
| Security_03 | S03 | Detected changes to audit_admin; AM automatically restores the default settings.
| Security_04 | S04 | Detected changes to password-security; AM automatically restores the default settings.
| Security_05 | S05 | Detected changes to system tuning parameters; AM automatically restores the default settings.
| Security_06 | S06 | Detected an unauthorized attempt to modify user's start_up.cm macro. AM automatically logs the user out of the system and optionally bans him from the system.
| Security_07 | S07 | Detected an unauthorized attempt to access one of the SPS audit trace files. AM automatically logs the user out of the system and optionally bans him from the system.
| Security_08 | S08 | Detected an unauthorized or authorized access to a monitored file , marked for object-audit.
| Security_09 | S09 | Detected a failed attempt to log into the system. This includes unauthorized attempts, bad-passwords and users whose account has been terminated trying to get in.
| Security_10 | S10 | Detected an access-right violation - user attempting to access a directory or file to which he has no sufficient access.
| | |||||||||||||
------------------------------ SPS/Alert Manager ----------------------------- vos_console: 0 -command: -name: |
As a SysAdmin you can configure the Console to disable selected tabs based on a user name. This is done by giving null access to one or more of the *.tab files. For example, the following command will remove Joe's access to the Alerts Monitor tab: give_access null Alerts_Monitor.tab -user joe_smith.dev
The Monitor has two mode or operations:
In Search Mode you can:
To switch back to Tail Mode either click the Tail link or simply wait for one minute and the system will switch it automatically to Tail Mode for you.
If there are any messages that require acknowledgement and have not yet been acknowledged, the Pending-Ack counter will turn red and will become a clickable link. Clinking it will present a screen with only Ack-Required messages. The operator can then Ack. individual messages or click the Ack-all-pending link to acknowledge all pending messages in the database.
Here is an example for a HTML-formatted report (click to enlarge):
The second area of the Monitor shows graphs and information on:
The third part of the display lists any performance related conditions that have been identified by Alert Manager. If all meters are within their Normal operating range this area will not be displayed.
The second part of the display lists any disk-related conditions that have been identified by Alert Manager. If all meters are within their Normal operating range this area will not be displayed.
The information may be sorted by CPU, Faults, Memory, I-O Rate, User-name and number of processes. To change the sort order, simply click on any of the table's headings.
In the example below we can see that the 3 DRMS servers are taking 26.5% of the module's CPU and 8.3% of its memory.
The second part of the display lists any performance related conditions that have been identified by Alert Manager. If all meters are within their Normal operating range this area will not be displayed.
In the example below we can see that the DRMS servers are not running.
The information may be sorted by Queue-name, ON-queue, Pending count, Highest, Total, TXN (transaction processing rate). To change the sort order, simply click on any of the table's headings.
In the example below we can see that DRMS-X queues are processing at a rate of 70 messages per second and that there are only one or two messages in the queue (pending).
The second part of the display lists any performance related conditions that have been identified by Alert Manager. If all meters are within their Normal operating range this area will not be displayed.
In the example below we can see that the DRMS_App queue went into his Critical state because there are 5973 messages pending processing on it.
In the example below we can see a directory listing all TIN files. You may change the sorting order of the table by clicking any of the table's headings.The Alert Manager Console
The Console is the first screen that is presented to the operator. It is used to open all other system monitors but the most significant area is its Recent Status box that lists any objects that are marked as having a problem - i.e, not in their expected Normal State.The Alerts Monitor
The Alerts Monitor is the heart of the system. All system and application events that were recorded into the database can researched here. The operator can use the monitor to detect potential problem, react to critical events and acknowledge certain important messages.
![]()
VOS Alerts
The Search Mode
The Search Mode can be recognized by a new line that offers some input fields - MsgID, Class and Match followed by a Go button. The Search Mode allows the operator the freely move around the database and search for any message(s). The Monitor stops tailing today's log and goes into the Search Mode simply by clicking one of the following:
Creating reports
Clicking the Reports link will open the following form. Note that dates and time fields must follow VOS standards, i.e yy-mm-dd and hh:mm:ss format. To Email the report to an authorized recipient, you may used the Email-To field and specify a Nick-Name as defined in the sps_email_server configuration table.The System Usage Monitor
The System Usage Monitor shows standard VOS meters: CPU, page-faults, Interrupts, empty-idle, file-io, disk-io over the last 10 seconds, 1 minute, 5 minutes and one hour. The Monitor refreshes the information every 10 seconds. For more information, see the sps_performance configuration table.
The Disk Space Monitor
The Disk Space Monitor shows two graphs for every disk on the system - Free/used space and I/O rate with a breakdown of reads and writes. For more information see the sps_disks configuration table. Note the scale for the I/O rate graphs is determined by the =max_disk_io field in the alert_manager.tin configuration file.The Grouped List-Users Monitor
The List-Users monitor is based on logical groups of processes that are defined in the sps_processes configuration file. A Group is defined by a standard VOS user name (*.System) and a process-name. Each group is graphed to show its relative CPU, page-faults, memory consumption as well as I/O rate, and number of processes running.The List-Users Monitor
The List-Users (All) monitor lists the processes that are taking to most resources. The information may be sorted by clicking on one of the table's headers: CPU, Faults, Memory, Reads, Writes etc.
The Queue Monitor
The Queue monitor is based on queues defined in the sps_qmon configuration configuration table. Queues can be server queues (1 or 2 way) and message queued.The Batch Monitor
The Batch monitor does not require any configuration as it automatically detects the active queues on the system.![]()
Batch Monitor
The Network Monitor
The Network monitor does not require any configuration. It monitors all STCP and UDP activities on the system.![]()
Network Monitor - Devices
The Explorer Menu
The Explorer Menu is created based on a list of allowed files, reports and directory listings as specified in the sps_web configuration table.
output_device | AlertManager may write messages to a dedicated device like a dedicated printer device. | ||||||||||||||||||||||||||
max_allowed_subprocesses | Based on configuration and filtering rules the AlertManager server may start additional sub-processes as configured by the user in the sps_clasess.table. To limit the risk of starting too many sub-processes (filtering rules needs adjustment), you may specify a limit beyond AlertManger will not start any additional sub-processes. | ||||||||||||||||||||||||||
max_carry_over |
max_carry_over defines how ack-required messages are copied over to the next day:
| ||||||||||||||||||||||||||
network_normal_class | The class/state code that should be assigned when all modules are online. | ||||||||||||||||||||||||||
network_critical_class | The class/state code that should be assigned when one or more modules become offline. | ||||||||||||||||||||||||||
user_logins_class | If given a valid class code, AlertManager will create entries whenever users log into or out of the system. To disable this feature, remove the class code (blank). | ||||||||||||||||||||||||||
priv_command_class | If given a valid class code, AlertManager will create entries whenever users execute a Privileged command. class code (blank). | ||||||||||||||||||||||||||
netstat_device_NN | A name of a Streams devices you wish to monitor. | ||||||||||||||||||||||||||
netstat_critical | The critical threshold of device utilization as a percentage of the line's capacity. | ||||||||||||||||||||||||||
netstat_warnging | The warnging threshold of device utilization as a percentage of the line's capacity. | ||||||||||||||||||||||||||
netstat_normal_class | The class code given to a message written to the log once the Streams meters are back to normal levels - below the warning threshold. | ||||||||||||||||||||||||||
netstat_critical_class | The class code given to a message written to the log once the Streams meters are higher than the the critical threshold. | ||||||||||||||||||||||||||
netstat_warning_class | The class code given to a message written to the log once the Streams meters are higher than the the warning threshold. | ||||||||||||||||||||||||||
netstat_options |
A 16-bit encoding that sets the initial options in the Network-monitor window as follows:
| ||||||||||||||||||||||||||
netstat_states |
A 16-bit encoding that sets the initial State options in the NetworK-monitor window as follows:
| ||||||||||||||||||||||||||
output_ip_addr | AlertManager may be configured to send message via TCP/IP. Specify the IP address of the target computer. | ||||||||||||||||||||||||||
output_ip_port | Specify the port number of the target computer. | ||||||||||||||||||||||||||
output_ip_udp | Set to 1 if you are forwarding messages to a UNIX/syslog system. | ||||||||||||||||||||||||||
output_ip_for_lams | By default, when sending messages via TCP/IP, AlertManager formats the message based on the =output_q_message field in the Classes table. Set this switch to 1 if the target module is your Master AlertManager. This applies only to environment with multiple Stratus modules each running its own copy of AlertManager. | ||||||||||||||||||||||||||
disks_interval | An interval, specified in seconds, that defines how often system's disks should be checked (see sps_disks table). | ||||||||||||||||||||||||||
performance_interval | An interval, specified in seconds, that defines how often system's performance meters should be checked (see sps_performance table). | ||||||||||||||||||||||||||
processes_interval | An interval, specified in seconds, that defines how often running processes should be checked (see sps_processes table). | ||||||||||||||||||||||||||
qmon_interval | An interval, specified in seconds, that defines how often application queues should be checked (see sps_qmon table). | ||||||||||||||||||||||||||
fwd_interval | An interval, specified in seconds, that defines how often the File-Watchdog layer should be evoked to look for new file arrivals(see sps_fwd table). | ||||||||||||||||||||||||||
files_interval | An interval, specified in seconds, that defines how often the application log files should checked (see sps_files table). | ||||||||||||||||||||||||||
tcp_interval | An interval, specified in seconds, that defines how often the Streams devicesshould checked. | ||||||||||||||||||||||||||
web_window | A unique-id number for the modlue. | ||||||||||||||||||||||||||
web_session_timeout | An interval, specified in minutes , that defines how the session's inactivity timeout after which the user will be forced to log into the system. | ||||||||||||||||||||||||||
web_command_timeout | An interval, specified in seconds, that defines the maximum allowed execution time for a command triggered by a Menu item. | ||||||||||||||||||||||||||
scaling_factor |
The scaling factor to be used on V-Series modules. For more information see VOS documentation on display_system_usage and how to adjust the scaling factor for hyper-threaded processors.
| ||||||||||||||||||||||||||
web_xxx_interval |
An interval, specified in seconds, between each screen refresh. These intervals apply to all Browser sub-windows monitors - console, alerts, system-usage, disk-info, list-users, queue monitor, batch monitor and network monitor.
| ||||||||||||||||||||||||||
web_no_lines |
The number of lines you wish to set for the Alerts sub-window.
| ||||||||||||||||||||||||||
web_lui_no_lines |
The number of lines you wish to set for the List-Users-All sub-window.
| ||||||||||||||||||||||||||
web_show_scrollbar |
By default, most windows are opened without a scrollbar. Set this switch to enable scrollbars.
| ||||||||||||||||||||||||||
web_initial_menu |
The SPS/Menu can be access via the web interface. If you are licensed to use this package, you may specify any existing menu name as the initial menu.
| ||||||||||||||||||||||||||
web_require_vos_pass |
By default, the initial login screen requires the user to enter a valid user-id and a valid password that is checked against VOS registration file. If this is not required, you can reset this switch to zero and the user can just enter any name that will identify him to the system in which case it will not have to be a real user-id and a password will not be required.
| ||||||||||||||||||||||||||
web_console_menu_text | You may configure up to 5 commands that the user can execute from the console window. Each command has a text description that appears on the screen and the associated VOS command-line. | ||||||||||||||||||||||||||
web_console_menu_command | See explanation on web_console_text. | ||||||||||||||||||||||||||
drms_options |
A 16-bit encoding that sets the initial options for the DRMS window as follows:
| ||||||||||||||||||||||||||
max_processes | The maximum number of processes running on the system. | ||||||||||||||||||||||||||
retain_logs | As part of Alert-Manager's Midnight processing, it will automatically purge old logs depending on the number of days specified in retain_logs. | ||||||||||||||||||||||||||
trace |
Use the following values:
|
class_code |
A unique code that identifies the class. You may make up your own codes or use a numeric scheme. Suggested codes are:
|
ignore_on_holidays | When set, Class processing will suspended during holidays. |
days_mask | Use to suppress the Class during certain days such as weekends. For example,to suppress on Saturday/Sunday, use nyyyyyn. |
active_from-to | Use hh:mm format These fields are optional. By default Classes are always active. When disabled,there will be no logging and no action taken for the specified Class. |
dups_interval | A time in minutes. Any duplicate (IDENTICAL) message that arrive within this interval will be ignored. |
help_text | If given, the text appear on the terminal's status line (25th) when the users tabs to the message. When using a Browser, clicking on the CLASS code link will open a message box with the help_text. |
wait_for_acknowledgment |
Used to set messages that require the operator's acknowledgement.
|
acknowledgment_escalation_time | For alerts that require acknowledgement, you can set an escalation procedure where after a certain time (if not acknowledged), AlertManager will produce another alert with a different class code acknowledgment_escalation_class. acknowledgment_escalation_time is specified in minutes. AlertManager executes the escalation feature every 2 minutes so messages may be escalated within up to 2 minutes of their assigned escalation time. |
acknowledgment_escalation_class | The new Class code that AlertManager will use for the Escalated message. |
output_q_message | Send to output_q if given in the sps_lam.table control rec. Also used to reply to a Client reading the database. Standard keywords can be used (see below under start_command). |
email_user_1-10 | Specify up to 10 email addresses to receive AlertManager messages. |
start_job_1-10 |
Specify up to 10 jobs / process-names to be started by the Scheduler. For more information, refer to the The SPS/Application Scheduler & Monitor (PCS) User's Guide.
|
scheduler_env |
The name of the Scheduler's Environment that governs the started job(s). For more information, refer to the The SPS/Application Scheduler & Monitor (PCS) User's Guide.
|
start_command |
Commands may be started by the Logs Manager. You may include the following keywords that will be substituted at execution time:
|
minimum_reporting_interval | A time interval given in minutes within if a message is classified with this Class code, it will not be logged or executed. |
minimum_execution_interval | A time interval given in minutes within if a message is classified with this Class code, it will be logged but the associated command (=start_command) if any will not be executed. |
require_min_occurrences |
The require_min occurrences is used to throttle execution of commands (see =start_command). The interval is specified in minutes. Example: =require_min_occurrences 10 =require_min_occurrences_interval 5 |
daily_max | The maximum number of messages that can be recorded daily for the class code. |
color | The unique color you wish to assign the Class. To find out more on choosing colors goto http://www.w3schools.com/html/html_colornames. |
logical_name | A unique name that identifies the Object type. For example, use "pre-market-open" for peak hours. |
active_from-to | Use the HH:MM format to define the time window for this check. This feature allows multiple settings for peak/low periods. |
cpu_usage |
The percent CPU used by the current module during the last minute. AlertManager will produce an alert when the system meter exceeds the specified threshold. Example: CPU level is at [percent CPU]%. |
empty_idle |
The percent empty-idle used by the current module during the last minute. AlertManager will produce an alert when the system meter exceeds the specified threshold. Example: Empty Idle level is at [percent E.I.]%. |
io_rate |
The critical level of I/O rate performed by the current module during the last minute. AlertManager will produce an alert when the system meter exceeds the specified threshold. Example: I/O rate is at [I/O rate] per second |
page_fault | The critical level of I/O rate performed by the current module in the last minute. AlertManager will produce an alert when the system meter exceeds the specified threshold. AlertManager will use the "performance_critical_class" class code and post the following message: Page Fault rate is at [PF Rate] per second. |
interrupts |
The critical level of interrupt rate performed by the current module in the last minute. AlertManager will produce an alert when the system meter exceeds the specified threshold. Example: Interrupt rate is at [Int. Rate] per second. |
core |
The critical level of percent core used by the current module in the last minute. AlertManager will produce an alert when the system meter exceeds the specified threshold. Example: Core level is at [percent Core]%. |
used_memory | The critical level of percent-used memory pages. AlertManager will produce an alert when the meter goes beyond this threshold. |
used_paging | The critical level of percent-used of Paging area. AlertManager will produce an alert when the meter goes beyond this threshold. |
normal_class | The class/state code that should be assigned to the system performance object when all meters are within the allowed ranges. The code must identify a valid record in the Classes table. |
warning/critical_class | The class/state code that should be assigned to the system performance object when one of the meters violates its threshold and an alert is produced. The Performance object will remain in the critical state until all meters are back to normal levels. The code must identify a valid record in the Classes table. |
logical_name |
A unique logical name for the Object. Any name that starts with Any_Process allow to create an alerts if any of all running processes exceed the critical_cpu threshold. Multiple Any_Process records can be set for different time-windows using the active_from/active_to fields - for example:
=logical_name Any_Process_morning =active_from 08:00 =active_to 17:00 =critical_cpu 40 =logical_name Any_Process_night =active_from 17:00 =active_to 23:00 =critical_cpu 30
|
process_name | A unique name that identifies the process. The name must be identical to the process name as started by VOS. To get an accurate, up-to-date list of valid process names, execute the "list_users" command and record the process names that appear in parenthesis. You may used star-names. |
user_name | The name or the starname (e.g. *.Operator) of the user that is executing the processes. To get an accurate, up-to-date list of valid user names, execute the "list_users" command. You may used star-names. |
active_from-to | Use the HH:MM format to define the time window for this check. This feature allows multiple settings for peak/low periods. |
no_of_processes | The number of processes you expect to be up and running under the same user-name and process-name. |
must_be_up_from-to |
Use the hh:mm format to define the time of day the process is expected to be up and running. Note that if "must_be_up_from" and "must_be_up_to" are not specified, then AlertManager assumes that the process should be up and running at all times. When the process violates this condition, AlertManager will post a message with the "process_up_critical_class" class code along with the message: Process [logical_name] is not running. |
must_be_down_from-to |
Use the hh:mm format to define the time of day the process is not expected to execute. This feature is useful to cover situations where a running process may interfere with normal production activity. When the process violates this condition, AlertManager will post a message with the "process_down_critical_class" class code along with the message: Process [logical_name] is running; Not intended to run now. |
critical_cpu |
The Critical CPU consumption threshold represented as the percentage CPU by the process(es) every interval. AlertManager will bypass with check if you do not supply a value (zero). To find what is the current consumption of a process: When the process violates this condition, then AlertManager will post a message with the "process_busy_critical_class" class code along with the message: Process [logical_name] is consuming XX %CPU. |
critical_io_rate |
The I/O rate threshold represented as the number of disk I/Os performed by the process every second. AlertManager will bypass with check if you do not supply a value (zero). To find what is the current consumption of a process: Start "list_users -interval 10". Record the typical number of Reads(DDKR) and Writes (DDLW) the process performs. The average I/O rate is: DDKR+DDKW / interval-of-list_users (10). When the process violates this condition, then AlertManager will post a message with the "process_busy_critical_class" class code along with the message: Process [logical_name] performing XX I/Os per second. |
critical_memory | The critical level of memory utilization as a percentage of the memory pages on the system. |
process_normal_class | The class/state code that should be assigned to the monitored process when the process is in its Normal state. The code must identify a valid record in the Classes table. |
process_down_critical_class |
The class/state code that should be assigned to the monitored process when the process is in its Critical state, specifically when it is NOT running as expected. The code must identify a valid record in the Classes table. When a process is in violation of this condition, AlertManager produces the following message: Process [logical name] is not running. |
process_up_critical_class |
The class/state code that should be assigned to the monitored process when the process is in its Critical state, specifically when it is running during times when it is scheduled to be down. The code must identify a valid record in the Classes table. When a process is in violation of this condition, AlertManager produces the following message: Process [logical name] is running; Not intended to run now. |
process_busy_critical_class | The class/state code that should be assigned to the monitored process when the process is in its Critical state, specifically when it is violating its performance related thresholds. |
logical_name | A unique logical name for the Object. |
q_path | The relative or full path-name of the monitored queue. |
active_from-to | Use the HH:MM format to define the time window for this check. This feature allows multiple settings for peak/low periods. |
pending_alert | Set the "Pending Message count" threshold. An alert will be produced when the queue's actual pending messages exceeds this limit. |
normal_class | The class code that AlertManager assigns to monitored queues that are in the normal state. The code must identify a valid record in the Classes table. |
warning/critical_class |
The class code that AlertManager assigns to monitored queues that are in the critical state. The code must identify a valid record in the Classes table. When a monitored queue in violation of its threshold, AlertManager produces the following message: Pending message count on [logical name] [# of msgs.]. |
logical_name | A unique logical name for the Object. |
active_from-to | Use the HH:MM format to define the time window for this check. This feature allows multiple settings for peak/low periods. |
input_path | The relative or full path-name of the monitored file. |
must_arrive | Set to "1" if file must appear within the specified time window. If the file does not arrive within the specified window, AlertManager will log and execute the critical_class class-code. |
allow_lockers | By default, AlertManager will not log or execute the file_arrival_class as long as there is one or more processes locking the file. When the switch is set, AlertManager will execute the file_arrival_class regardless if there are any lockers. |
normal_class | A class code that AlertManager should set for the object while waiting for the file to arrive. |
file_arrived_class | A class code that AlertManager should log and execute upon file arrival. Important: It is the responsibility of this class-code to delete or rename the file to prevent unnecessary executions. |
critical_class | The class/state code that should be assigned to the File Monitor object when the file does not get created within the required time window provided that must_arrive is set. |
logical_name | A unique logical name for the Object. |
path | The actual name of the disk (%sys#d01, %sys#d02). Paths can also be any directory or file on the system. In the case of a file, AlertManager will create an alert when the number of block exceed either the warning_percent_used or the critical_percent_used meters using the =file_max_block (start-names are supported). In the case of a directory, AlertManager will create an alert when the number of objects (files,links,dirs) exceed either the warning_percent_used or the critical_percent_used meters using the =dir_max_objects. |
active_from-to | Use the HH:MM format to define the time window for this check. This feature allows multiple settings for peak/low periods. |
critical_percent_used |
In the case of a disk - the percent of used space. When the disk pack is in violation of this condition, AlertManager produces the following message: Disk [logical name] is at [percent used]% used. In the case of a file - the percent of disk blocks used compared to =file_max_blocks. When the file is in violation of this condition, AlertManager produces the following message: File [logical name] is NNN blocks - NN% of its allowed size. In the case of a directory - the percent of number of objects compared to =dir_max_objects. When the dirctory is in violation of this condition, AlertManager produces the following message: Dir [logical name] has NNN objects - NN% of its allowed limit. |
critical_io_rate |
The number of I/O rate performed on any given disk on the current module during the current interval. AlertManager will produce an alert when the system meter exceeds the allowed threshold. When a disk pack is in violation of this condition, AlertManager produces the following message: Disk [logical name] is busy at [# of I/Os] I/Os per second. |
critical_read_busy |
A threshold for the disk's %read-busy meter.
|
critical_write_busy |
A threshold for the disk's %write-busy meter.
|
file_max_blocks |
A threshold for file size given in block when monitoring a file. When the file exceeds the given threshold (actual number of blocks reaches the warning/critical% used settings), AlertManager produces the following message: File [logical name] is NNN blocks - NN% of its allowed size. |
dir_max_objects |
A threshold for maximum number of objects (files,links,dirs) that is expected for a directory. When the directory exceeds the given threshold (actual number of objects reaches the warning/critical% used settings), AlertManager produces the following message: Dir [logical name] has NNN objects - NN% of its allowed limit. |
normal_class | The class code that AlertManager assigns to monitored disks that are in the normal state. The code must identify a valid record in the Classes table. |
warning/critical_class | The class code that AlertManager assigns to monitored disks that are in the critical state. The code must identify a valid record in the Classes table. |
organization: relative; index : logical_name no_duplicates; fields: logical_name char (32) var, input_path char (256) var, input_event_path char (256) var, input_ip_addr char (32) var, input_ip_port bin (15), start_pos bin (15), end_pos bin (15), tcam_log bit (1) aligned, classify_once bit (1) aligned, user_exit char (32) var, use_filter_01 char (32) var, use_filter_02 char (32) var, use_filter_03 char (32) var, . . use_filter_16 char (32) var; include_classes char (256) var, exclude_classes char (256) var; end;
logical_name | A unique logical name for the Object. |
input_path | A path of a file or a 1-way-server-q from which AlertManager will read incoming messages. You may use star names (e.g xxx*.out). AlertManager will automatically use and switch to the most recently created file that matches the star-name. If possible, the user should avoid using star-names. Use star-names when a time-stamp is used as part of the file name(s). |
input_event_path | The path of an event file that is associated with the input file. For the syserr_log. files, the event file is syserr_log_event. Using this field is optional, if the user does not specify an event file, AlertManager will use the interval parameter. |
input_ip_addr | The TCP/IP address if Alert Manager is required to read alerts form another computer. This could be a to another system or a network-ed Stratus running AlertManager. |
input_ip_port | The TCP/IP port number if Alert Manager is required to read alerts form another computer. This could be a to another system or a network-ed Stratus running AlertManager. |
start_pos | Specifies the position from which AlertManager should examine the messages. The VOS syserr_log files have a time stamp in the first 8 position followed by two spaces. The user should therefore set the starting position to 11. |
end_pos | Specifies the last position that AlertManager should use when examining the contents of messages. By default, the server will use the entire message buffer. |
tcam_log |
When set (on TCAM system only), AlertManager will strip off leading data fields of the message that are not necessary for display. Example: 07:09:50 CTPS: (TPServer_1) [task 5] tp$server.pm :%INFO- Session starting. AlertManager will log: TPServer_1(i): Session starting. |
classify_once |
If classify_once is set (the default): In this mode, each message will be classified only once. As soon as the message complies with a filter, any other subsequent filters are not evaluated. Using this feature may simplify the configuration process and is therefore recommended. When using this method, make sure that the more important filters (e.g. Fatal, Critical) will appear first, before any information-level filters. If classify_once is NOT set, all filters will be examined. With this approach, a message can comply with one or more filters in which case, they can be classified more than once. |
use_filter_01..16 | The use_filter_XX fields are links to Filter records (see the Filters Table). Filters are used to classify messages. You may use up to 10 filters. |
include_classes, exclude_classes | Include/exclude are used ONLY when processing an AlertManager log on a remote module. These are lists of class codes separated by commas that you wish to be included or ignored (exclude_classes) by the "central" AlertManager Server. |
an_alert_manager_input | Set to 1 (yes) if this is an incoming flow of messages that originate from a remote AlertManager server. |
filter_name | A unique name that identifies the Filter. |
class_code | A code that links the Filter to a Class record. This field is required. AlertManager uses it to classify messages and take certain actions. |
match_and_01..16 | All match_and strings must exist in the message, otherwise AlertManager will ignore the message. |
match_or_01..16 | At least one of the match_or strings must exist in the message, otherwise AlertManager will ignore the message. |
omit_and_01..16 | AlertManager will ignore messages if all omit_and keywords appear in the message. |
omit_or_01..16 | AlertManager will ignore messages if at least one omit_or keywords appear in the message. |
msg_prefix | The user may specify a message prefix. AlertManager will append this text to the beginning of incoming message. This may be useful for easy message identification. |
caseless | A switch that defines the case sensitivity. Set to 0 for case sensitive matching (default). |
logical_name | A unique name that identifies the Report. |
category | User-defined categories, for example: reports, configuration, release_notes etc. |
report_path | The full path name of the report file or files (star-names are supported but limited to 1024 files). At least one of the match_or strings must exist in the message, otherwise AlertManager will ignore the message. |
notes | Notes, description, instructions etc. |
acl_name |
A name of a FTR control file that governs the user's access to the item. This will be best explained by an example. Consider the following entry:
/ =logical_name my_file =category special =report_path #d01>special>file_1 =note This is my file. =acl_name security_1 In the SPS>alert_manager directory, create a security_1.ftr file. If the current user has access (read or write) to the security_1.ftr file, the item will be displayed on the list otherwise it will not. This provides an easy way to control privileges based on the user-id.
|
report |
A unique name that identifies the Report. To produce a report execute: alert_manager_server.pm -command report -name [report-name] |
from | A standard-VOS data-time that define the time start date/time selection criteria. |
to | A standard-VOS data-time that define the time end date/time selection criteria. |
include_class_NN | A list of one or more classes to be included in the report. Star-names may be used. |
exclude_class_NN | A list of one or more classes to be excluded in the report. Star-names may be used. |
sources | You can specify any number of Sources from which messages originated. |
modules | You can specify any number of module names from which messages originated. |
match | If you choose a match-string, then only messages that contain the string will appear in the report. |
output_path | The relative or full path name of the report file. |
| You can choose to send the report via you E-Mail Server to selected users. Simply enter their email nick-names as defined in the sps_email_server.table. |
The E-Mail Facility
The Email Facility sends you and your staff selected reports and performance graphs directly form your Stratus. Automated reports and graph generators with the Email capabilities are now integrated into all SPS reports. For more information and samples, click here.
You'll need to define your Email Server in the SPS>alert_manager>sps_email_server.tin and then create the table.
organization: relative;
index : server_name no_duplicates;
fields :
server_name char (32) var,
server_ip_address char (32) var,
server_port_number bin (15) default ('0'),
user_name char (32) var,
password char (32) var,
use_domain char (32) var default ('softmark.com'),
nick_name_XX char (32) var,
e_address_XX char (256) var,
debug bit (1) aligned default ('0'),
trace bit (1) aligned default ('0');
end;
server_name | A logical name for the Email server. The default settings should not be changed. |
server_ip_address | The server IP address of the Server. You might need to get this from your Network manager. |
server_port_number | The server IP address of the Server. You might need to get this from your Network manager. |
user_name | The user_name is only required if the Email server requires authentication. |
password | The password is only required if the Email server requires authentication. |
use_domain | The domain name from which the message will arrive. |
nick_name_xx | Set up to nick-names for persons authorized to receive reports and alerts from the system. |
e_address_xx | The corresponding email addresses for everying nick-name that was defiened. |
debug | Set to 1 if you neet to troubleshoot this service. |
trace | Set to 1 if you neet to troubleshoot this service. |
Example:
/ =server_name sps_mail_server =server_ip_address outgoing.myserver.net =user_name Bob =password my_pass123 =nick_name_01 Joe =e_address_01 job@mycompany.com =trace 0
If your Class definition (sps_classes.tin) includes:
=output_q_message '@time: @msg'
then your messages on UNIX/Linux will look like the following example:
Oct 7 14:40:08 localhost 14:40:58: All networked modules are online.
Installation:
In SPS>alert_manager>alert_manager.tin define the following fields where the =output_ip_saddr and =outupt_io_port are the IP Address of your UNIX. box and any port number you've assigned to LamsClient program.
The AlertManager-Netview/Tivoli interface
Alert Manager allows you to send alarms and notifications to UNIX/Netview/Tivoli/Openview. You can choose to send all meesages or restrict the message flow to a few selected Class Codes. Message can be formatted using special keywords as follows:
Example:
=output_ip_addr XXX.XX.XX.XX
=output_ip_port 514
=output_ip_udp 1