Key Performance Indicators
From GEANT2-JRA1 Wiki
This page serves as a discussion forum for Key Performance Indicators (KPIs). KPIs are a concept from business economics: indicators that show whether a company is operating properly. The concept is, for example, part of the balanced scorecard approach from Harvard Business School. In IT management, commercial products are available that show KPIs in a dashboard with respect to service level agreements (e.g. Mercury SLM).
Why should KPIs be used in JRA1? We monitor many different parameters, but we have no overview of all those measurements. To get one, the user sometimes needs to check each of them. If something happens in the network, it is not easily observable, and it is not possible to review what happened in the past. A KPI would summarise one aspect of network performance; a change in its value indicates that something is happening.
The definition of KPIs depends on the user's perspective. Therefore, KPIs derived from the basic metrics measured in JRA1 need to be designed with different user groups in mind.
User perspective
Besides overview information (from a non-technical management perspective?), these KPIs could contain:
- connection quality to favorite partners expressed by aggregate values for e.g. status, delay, available bandwidth
- expected application quality assuming known metric requirements (e.g. is it reasonable to schedule a videoconference tomorrow afternoon, or will the expected utilisation lead to a lot of packet loss?)
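The application-quality indicator above could be sketched as a comparison between forecast metrics and an application's requirements. This is only an illustration: the profile name `VIDEOCONFERENCE`, the metric keys and the limit values are assumptions made for the example, not values agreed in JRA1, and how the forecast itself is produced is left open.

```python
# Hypothetical requirement profile; the limits are illustrative, not agreed values.
VIDEOCONFERENCE = {"delay_ms": 150.0, "loss_pct": 1.0}


def meets_requirements(forecast, requirements):
    """Return True if every required metric is forecast and within its upper limit."""
    return all(
        metric in forecast and forecast[metric] <= limit
        for metric, limit in requirements.items()
    )


# Assumed forecasts for two time slots (made-up numbers).
quiet_slot = {"delay_ms": 40.0, "loss_pct": 0.2}
busy_slot = {"delay_ms": 40.0, "loss_pct": 3.0}

print(meets_requirements(quiet_slot, VIDEOCONFERENCE))
print(meets_requirements(busy_slot, VIDEOCONFERENCE))
```

A missing metric is treated as "requirement not met", which errs on the side of caution when no forecast is available.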
Management perspective
Indicators providing a quick overview of the general network status.
- aggregated status calculated from the status of network components (open issues: should components be weighted differently by importance? If yes, how can this be done, e.g. are higher-bandwidth interfaces more important than lower-speed ones?)
- aggregated delay (differentiation between IP Premium, Best Effort?)
- aggregated utilisation
- aggregated availability
Example: At the Leibniz Supercomputing Center (DFN Munich) there is a KPI for the general availability of the Munich Scientific Network, provided on a daily and weekly basis. It is aggregated from the availabilities of all interfaces. The interfaces are equally weighted, which has led to some discussion: for example, access to the German Research Network can be considered much more important than the connection to a smaller research institution. However, no other weighting formula has been found.
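To make the weighting question concrete, here is a minimal sketch that compares an equally weighted availability with a weighted one. Weighting by bandwidth is only one candidate, taken from the open issue listed above; it is not a formula in use at any site. The `Interface` class, the interface names and all numbers are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Interface:
    name: str              # hypothetical interface label
    availability: float    # fraction of the period the interface was up (0.0 to 1.0)
    bandwidth_mbps: float  # nominal capacity, used here only as an example weight


def equal_weighted_availability(interfaces):
    """Plain mean: every interface counts the same."""
    return sum(i.availability for i in interfaces) / len(interfaces)


def weighted_availability(interfaces, weight=lambda i: i.bandwidth_mbps):
    """Weighted mean; the default weight (bandwidth) is just one possible choice."""
    total_weight = sum(weight(i) for i in interfaces)
    return sum(weight(i) * i.availability for i in interfaces) / total_weight


# Made-up data: one large uplink that was almost always up, one small link down half the time.
interfaces = [
    Interface("uplink-to-research-network", 0.99, 10000),
    Interface("small-institute-link", 0.50, 100),
]

print(round(equal_weighted_availability(interfaces), 4))
print(round(weighted_availability(interfaces), 4))
```

With equal weights the outage on the small link pulls the KPI down strongly; with bandwidth weights it barely registers. Whether that is desirable is exactly the open question, and the `weight` parameter can be swapped for any other importance measure.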
Project perspective
Indicators concerning project partners and their interconnections.
- aggregated status
- aggregated delay
- aggregated utilisation
NOC/PERT perspective
Indicators for presence/absence of errors
- are there faults anywhere?
- utilisation threshold exceeded somewhere?
- delay, jitter threshold exceeded somewhere?
A dashboard could show these indicators. In case of warnings/faults they should be linked to the originating problems.
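A minimal sketch of such threshold indicators follows. Each metric collapses to a single status flag while the offending measurements are kept, so a dashboard entry can link back to the originating problem. The `Measurement` class, the metric names and the threshold values are assumptions for illustration; real limits would be set by the NOC.

```python
from dataclasses import dataclass


@dataclass
class Measurement:
    source: str   # hypothetical identifier of the measured link or path
    metric: str   # e.g. "utilisation", "delay_ms", "jitter_ms"
    value: float


# Example thresholds only; not values agreed in the project.
THRESHOLDS = {"utilisation": 0.9, "delay_ms": 50.0, "jitter_ms": 5.0}


def evaluate(measurements, thresholds=THRESHOLDS):
    """Return {metric: [measurements above the threshold]} for every known metric."""
    violations = {metric: [] for metric in thresholds}
    for m in measurements:
        limit = thresholds.get(m.metric)
        if limit is not None and m.value > limit:
            violations[m.metric].append(m)
    return violations


def indicator(violations):
    """Collapse each metric to a status flag: 'ok' or 'warning'."""
    return {metric: ("warning" if items else "ok") for metric, items in violations.items()}


# Made-up sample data.
measurements = [
    Measurement("link-A", "utilisation", 0.95),
    Measurement("link-B", "utilisation", 0.40),
    Measurement("path-A-B", "delay_ms", 12.0),
]

violations = evaluate(measurements)
status = indicator(violations)
print(status)
print([m.source for m in violations["utilisation"]])
```

The `violations` dictionary is what provides the link from a warning to its cause: the dashboard shows `status`, and a click on a warning would list the sources stored under that metric.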
KPI Examples
As food for thought [of interest to: n = NOC, u = users, m = managers]
- Infrastructure:
  - equipment availability
    - percentage expressed as an average over all the equipment of the network [n, m]
    - average uptime period over all the equipment of the network [n, m]
    - percentage or number of equipment items up (variations are highlighted) [n, m]
  - interface availability
    - percentage expressed as an average over all the interfaces of the network
    - average uptime period over all the interfaces of the network [n]
    - percentage of interfaces up (variations are highlighted) [n, m]
- Protocols:
  - IGP: [n]
  - BGP:
    - total number of neighbours (or BGP peerings up), with variations highlighted
    - total number of routes seen [n]
- Performance:
  - Losses (input errors, output drops, OWPL): total number and average per interface
    - total sum of the input errors or output drops or OWPL for the whole network [n]
    - average of the sum of the input errors or output drops or OWPL per interface [n]
    - total sum of the input errors and output drops and OWPL for the whole network [n, u, m]
    - average of the sum of the input errors and output drops and OWPL per interface [n, u, m]
  - Delay
    - average delay over the network (sum of the delays divided by the number of paths) [n, u, m]
    - variation over a minimum trend (e.g. monthly delay trend) [n, u, m]
  - TCP throughput
    - average TCP throughput over the network [n, u, m]
    - throughput over 900 Mbps [n, u, m]
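Some of the performance indicators listed above can be sketched in a few lines. The sketch makes two assumptions that the list does not settle: "variation over a minimum trend" is read as the difference between the current value and the minimum observed over the reference period, and OWPL is taken as a packet count so that it can be summed with the error and drop counters. All names and numbers are made up.

```python
def average_delay(path_delays_ms):
    """Sum of the per-path delays divided by the number of paths."""
    return sum(path_delays_ms.values()) / len(path_delays_ms)


def variation_over_minimum(current, history):
    """Difference between the current value and the minimum seen in the history."""
    return current - min(history)


def loss_totals(per_interface):
    """per_interface maps name -> (input_errors, output_drops, owpl_count).

    Returns (network-wide total, average per interface).
    """
    sums = {name: sum(counts) for name, counts in per_interface.items()}
    total = sum(sums.values())
    return total, total / len(sums)


# Made-up sample data.
path_delays = {"A-B": 10.0, "A-C": 20.0, "B-C": 30.0}
daily_averages = [15.0, 18.0, 16.0]  # e.g. daily network averages over a month
losses = {"if1": (3, 2, 1), "if2": (0, 4, 0)}

network_delay = average_delay(path_delays)
print(network_delay)
print(variation_over_minimum(network_delay, daily_averages))
print(loss_totals(losses))
```

Each function returns a single number (or a pair), which is what makes it usable as an indicator: the value can be plotted over time and its variation highlighted.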
Suggestion: talk about Network Performance Indicators instead, to avoid confusion with KPIs as used more generally (which also cover finances).
KPI Display
The following links help to visualise how a KPI can be represented and how it can be used.
- http://trafic.lesoir.be/?act=infotraf
- http://www.edwardtufte.com/bboard/q-and-a-fetch-msg?msg_id=0000bx
- http://dashboardspy.wordpress.com/2006/04/03/an-executive-console-kpi-dashboard-screenshots-showing-financials-sales-production-it-system/ (bottom page)
- Stock-exchange-like display: http://www.serence.com/site.php?action=ser_products,klipfolio_enterprise
- Several KPIs displayed together: http://www.flickr.com/photos/40452729@N00/1283829277/in/set-72157603036269679/
- Nice because it indicates whether the value is increasing or decreasing (arrow at the bottom right of the gauge; you may have to move from one view to another): http://www.flickr.com/photos/40452729@N00/sets/72157603036269679/show/with/1283829277/
