Hades Status Analysis

From GEANT2-JRA1 Wiki


Introduction

This work is based on two specifications:

NREN Status Matrix page: http://www.win-labor.dfn.de/cgi-bin/hades/nren2nren.pl


ToDo, quirks and feature proposal list


Necessary and useful common data analysis optimizations and corrections

These optimizations and corrections are not only of interest for status monitoring, because they affect the common data analysis as a whole! Most likely it is best to make all the changes described below in one go.

IPDV/OWDV inter group calculation

IPDV/OWDV should only be calculated using the packets from one group. In other words: Don't calculate the variation between the first packet of a group and the last packet of the previous group.

Reason: If the last packets, or all packets, of a group have a higher delay, congestion is detected. This group has a high maximum IPDV, because the delay is obviously rising. The following group then has one very low IPDV value, because there it is the other way round: its first packet is much faster than the last packet of the previous group. See example.

Implementing this change may take some time and may cause problems with existing code.
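The difference between the two calculations can be sketched as follows. This is only an illustration, not the Hades code: the function names are hypothetical, and the delays are made-up integer milliseconds.

```python
def ipdv_within_groups(groups):
    """Return one IPDV list per group; differences never cross a group boundary."""
    result = []
    for delays in groups:
        # Difference between each packet and its predecessor in the same group.
        result.append([b - a for a, b in zip(delays, delays[1:])])
    return result


def ipdv_across_groups(groups):
    """Treat all groups as one continuous packet stream (the behaviour criticised above)."""
    flat = [d for delays in groups for d in delays]
    return [b - a for a, b in zip(flat, flat[1:])]


# Made-up delays in ms; the first group ends congested, the second is normal.
groups = [[10, 10, 14, 20], [10, 10, 10, 10]]
print(ipdv_within_groups(groups))  # [[0, 4, 6], [0, 0, 0]]
print(ipdv_across_groups(groups))  # [0, 4, 6, -10, 0, 0, 0]
```

The -10 in the second output is the artificial value between the last packet of the first group and the first packet of the second group.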

IPDV/OWDV minimum and median calculation

There is another issue with IPDV/OWDV aggregation in the common data analysis. As specified by the RFC, Hades calculates the delay variation as the difference between the delays of two packets and leaves the sign untouched. Therefore negative delay variation occurs. This is OK, but it has a problematic influence on the calculation of minimum, median, and maximum. At the moment the data analysis uses the sign of the variation as is. Most likely it would be much better to use the absolute value, at least for the classification of min, med, and max.

Just think about it! If one of the delays is much higher than the others (say by about 10 ms), you get a maximum delay variation of about +10 ms. That looks OK at first glance, but it isn't: the jump back to "normal" may be even larger in magnitude (say -11 ms). Instead of taking this into account, the analysis records the (negative) jump back as the minimum. The median, on the other hand, is normally very near 0. I think this should change, but it is a big change! The plots from the new calculation will look completely different from the current ones, and re-analysing the old data is not really practicable. The question also arises whether the plotting should use absolute values too, to prevent strange-looking plots.

Suggestion: Calculate both the RFC-conformant (signed) delay variation and the more useful absolute one!
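A minimal sketch of calculating both variants side by side, using the +10 ms / -11 ms case from above. The function name and the sample values are assumptions for illustration only.

```python
from statistics import median


def ipdv_summary(ipdv):
    """Summarise IPDV samples as (min, median, max), signed and by absolute value."""
    magnitudes = [abs(v) for v in ipdv]
    return {
        "signed": (min(ipdv), median(ipdv), max(ipdv)),
        "absolute": (min(magnitudes), median(magnitudes), max(magnitudes)),
    }


# Made-up IPDV samples in ms: a +10 ms jump followed by a -11 ms jump back.
samples = [0, 1, 10, -11, 0, -1, 1]
summary = ipdv_summary(samples)
print(summary["signed"])    # (-11, 0, 10)
print(summary["absolute"])  # (0, 1, 11)
```

With signed values the largest jump (-11 ms) ends up as the minimum; with absolute values it becomes the maximum.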

Partial data analysis

For on-the-fly status analysis on the boxes and faster status analysis on the central server, we need a partial, step-by-step common data analysis. An important new property here is the packet timeout: when is a packet considered lost?
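One possible shape for such a step-by-step analysis is sketched below. Everything here is an assumption for illustration: the class name, the microsecond units, and the rule that a packet counts as lost once it has been outstanding for longer than the timeout.

```python
class PartialGroupAnalysis:
    """Incrementally aggregates one-way delays of a single packet group."""

    def __init__(self, timeout_us):
        self.timeout_us = timeout_us
        self.send_times_us = {}  # sequence number -> send time
        self.delays_us = {}      # sequence number -> one-way delay

    def sent(self, seq, send_time_us):
        self.send_times_us[seq] = send_time_us

    def received(self, seq, recv_time_us):
        self.delays_us[seq] = recv_time_us - self.send_times_us[seq]

    def snapshot(self, now_us):
        """Return the partial result as of now_us without waiting for the whole group."""
        # A packet is lost if it is still missing after the timeout has passed.
        lost = sum(
            1
            for seq, sent_at in self.send_times_us.items()
            if seq not in self.delays_us and now_us - sent_at > self.timeout_us
        )
        pending = len(self.send_times_us) - len(self.delays_us) - lost
        delays = list(self.delays_us.values())
        return {
            "min_owd_us": min(delays) if delays else None,
            "max_owd_us": max(delays) if delays else None,
            "lost": lost,
            "pending": pending,
        }


analysis = PartialGroupAnalysis(timeout_us=2_000_000)
analysis.sent(0, 0)
analysis.sent(1, 100_000)
analysis.sent(2, 1_900_000)
analysis.received(0, 12_000)
print(analysis.snapshot(now_us=2_500_000))
# {'min_owd_us': 12000, 'max_owd_us': 12000, 'lost': 1, 'pending': 1}
```

Packet 1 has exceeded the timeout and is counted as lost, while packet 2 is still pending. In this sketch a packet that arrives after its timeout simply moves from lost to received in the next snapshot.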

Bug in common data analysis

There is an important problem in the common data analysis. Because of a floating point number overflow, the aggregated data and the timestamps are rounded. The data is not really incorrect, but it is imprecise, and this leads to quite a lot of issues for the status analysis. At the moment workarounds are in place, but this issue has to be addressed as soon as possible! The solution is to use integer arithmetic. This affects quite a lot of code fragments, however, so implementation and especially testing will be time consuming.
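The page does not say which values are affected or how they are stored. As a hedged illustration only, assume timestamps are kept as Unix time in seconds in a 64-bit float: a delay of one microsecond then no longer comes out exactly, while integer microseconds stay exact.

```python
# Assumed representation: Unix time in seconds as a 64-bit float.
t_send = 1214560000.0
t_recv = 1214560000.000001  # intended to be exactly 1 microsecond later

float_delay_us = (t_recv - t_send) * 1e6

# Alternative: Unix time as integer microseconds.
t_send_us = 1214560000_000000
t_recv_us = 1214560000_000001

int_delay_us = t_recv_us - t_send_us

print(float_delay_us)  # close to, but not exactly, 1.0
print(int_delay_us)    # 1
```

Near this timestamp the float can only resolve steps of roughly a quarter of a microsecond, so every difference is rounded to that grid.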

Lost group and data gap handling

There are major issues regarding completely lost groups and other gaps in the data that are extremely problematic for status analysis.

New measurement infrastructure on the boxes (hadesd)

Needed for near real time status analysis! A lot of work to do!

On the fly data analysis for IPPM data

Depends on the partial data analysis described above (Hades_Status_Analysis#Partial_data_analysis).

Rework of traceroute support

Many issues to investigate and solve.


Status data storage issues

Sophisticated storage system [SOLVED]

There is a storage problem that has to be addressed as soon as possible! Most likely it's OK till after Berlin, though. All status information is stored in one file (status.dat). This file is growing quite fast!

Ideas: Save data on a day by day basis? This might lead to display and performance problems. Use a database? Use RRD?

[SOLVED]: Using day by day file based solution.
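How the day-by-day files are actually named is not documented here. A minimal sketch of one possible mapping from a status timestamp to its daily file, with a hypothetical directory layout and file name pattern:

```python
from datetime import datetime, timezone
from pathlib import Path


def status_file_for(timestamp_s, base_dir="status"):
    """Return the (hypothetical) daily status file for a Unix timestamp, using UTC days."""
    day = datetime.fromtimestamp(timestamp_s, tz=timezone.utc)
    return Path(base_dir) / f"{day:%Y}" / f"{day:%m}" / f"status-{day:%Y-%m-%d}.dat"


print(status_file_for(1214560000).as_posix())  # status/2008/06/status-2008-06-27.dat
```

Displaying a history that spans several days then means reading several files, which is where the display and performance concerns mentioned above come in.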

Packet loss

At the moment the packet loss has no influence on the quality of a link. There should be a threshold for the loss rate!
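Such a threshold could look roughly like this. The default of 1 % and the choice to downgrade the link to "poor" are assumptions for illustration; the real values would come from the threshold configuration.

```python
def loss_rate(sent, received):
    """Fraction of the packets of one group that never arrived."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    return (sent - received) / sent


def apply_loss_threshold(delay_status, sent, received, max_loss_rate=0.01):
    """Downgrade a delay-based status to 'poor' when the loss rate exceeds the threshold."""
    if loss_rate(sent, received) > max_loss_rate:
        return "poor"
    return delay_status


print(apply_loss_threshold("OK", sent=100, received=100))  # OK
print(apply_loss_threshold("OK", sent=100, received=95))   # poor
```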

Lost groups

Completely lost groups are ignored! That makes the history problematic: If a link was OK, then down for weeks and afterwards comes up with status OK, the history shows the complete downtime as OK. Somehow this is also a question of specification. Indeed this behavior can be seen as valid: "The link was OK, because there was neither congestion nor re-routing."

But most likely it will be better to put this as "blue"/"no data" state into the history.
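A sketch of how lost groups could show up in the history as their own state. It assumes groups are expected at a fixed interval and that the status per group is keyed by the group start time; both assumptions, like the function names, are only for illustration.

```python
from itertools import groupby

NO_DATA = "no data"  # the "blue" state


def fill_gaps(states, start_s, end_s, interval_s):
    """Return (time, status) for every expected group; missing groups become NO_DATA."""
    return [(t, states.get(t, NO_DATA)) for t in range(start_s, end_s, interval_s)]


def to_history(timeline, interval_s):
    """Merge consecutive equal states into (start, duration_s, status) entries."""
    history = []
    for status, run in groupby(timeline, key=lambda item: item[1]):
        run = list(run)
        history.append((run[0][0], len(run) * interval_s, status))
    return history


# Made-up data: groups at 120 s and 180 s are completely lost.
measured = {0: "OK", 60: "OK", 240: "OK"}
timeline = fill_gaps(measured, start_s=0, end_s=300, interval_s=60)
print(to_history(timeline, interval_s=60))
# [(0, 120, 'OK'), (120, 120, 'no data'), (240, 60, 'OK')]
```

Without the gap filling, the three "OK" groups would merge into one uninterrupted "OK" period.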

Include all measurements on path

At the moment the status information is only calculated for one measurement per link (the first, default measurement). There are two possibilities to incorporate the others:

  • Calculate status information for every measurement, just as if they were different links. Perhaps aggregate them with a simple rule such as "If one has the status poor, the link has the status poor. Only if all are OK, the link is OK."
  • Aggregate all measurements into one link status. Needs some more thinking and most likely quite a lot of coding and testing, but it should be possible.
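The simple rule from the first possibility ("worst status wins") can be sketched in a few lines. The severity table is an assumption and only covers the two states named above.

```python
# Assumed severity order; only the states named in the rule are covered.
SEVERITY = {"OK": 0, "poor": 1}


def link_status(measurement_statuses):
    """Return the worst status among all measurements on a link."""
    if not measurement_statuses:
        raise ValueError("at least one measurement status is required")
    return max(measurement_statuses, key=lambda status: SEVERITY[status])


print(link_status(["OK", "poor", "OK"]))  # poor
print(link_status(["OK", "OK"]))          # OK
```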


Threshold handling

There is a configuration (in YAML) that defines the thresholds for every measurement for which status information is calculated. A simple script was used to create this configuration file with default values. The only exception is the baseline OWD, which was determined by looking at the data of one day and taking the minimum of the minimum OWDs of that day.
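The baseline rule described above (minimum of the per-group minimum OWDs of one day) amounts to the following. The function name, the microsecond unit, and the use of None for groups without data are assumptions.

```python
def baseline_owd_us(group_min_owds_us):
    """Return the smallest per-group minimum OWD of a day, ignoring groups without data."""
    values = [v for v in group_min_owds_us if v is not None]
    return min(values) if values else None


# Made-up per-group minimum OWDs of one day in microseconds; None marks a lost group.
day = [10450, 10420, None, 10431, 10800]
print(baseline_owd_us(day))  # 10420
```

A single unusually fast group determines the baseline with this rule, which is one reason a more sophisticated calculation may be worthwhile.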

Better calculation

Make baseline calculation more sophisticated and enable some sort of calculation for the other thresholds.

More flexible configuration script

Enhance the configuration script to enable reconfiguration, consistency checks and so on.

User interface

Make it easy for normal users to change these values. Most likely via a web interface.


Traceroute support

See also [1].

Depends on the rework of traceroute support described above (Hades_Status_Analysis#Rework_of_traceroute_support).


BWCTL support

What can we do here? On-demand tests when problems are detected via IPPM data? Take scheduled measurements into account for status data analysis?


Visualisation enhancements

Implement Link Status Overview

See [2].

NREN-to-NREN zoomed view

NREN-to-NREN matrix should provide a detailed/zoomed view. See also [3].

Another addition to this: The duration of the current state should be displayed.

Additional information in the detailed link status (history) table

Displaying the OWD and OWDV of the first group, and the OWD in %, is not really helpful and can be misleading.

What else can we display here?

The data has to be included in the status information (at the moment a file, but perhaps soon a database) for easy reference and performance. So keep it small!

Should we use the data from the last group? Which last group? Normally the last group is the first group showing the behavior of the next state. If there is no next group (e.g. for the current state!), we can use the second-to-last group. But this one might be a rather unusual group for the state. Use some sort of aggregation? Max OWD over the complete duration?

New approach to anomaly detection

Based on the diploma thesis "Statistical Analysis of IP Performance Metrics in International Research and Educational Networks" by Thomas Holleczek, it is planned to enhance the status analysis with more sophisticated approaches for anomaly detection.

There are also slides available from a presentation given at the GEANT2 Technical Meeting in Berlin on 27.06.2008.
