Lightpath Monitoring
From GEANT2-JRA1 Wiki
We have been aware since the beginning of the project that we will have to deal with the monitoring of the lower layers. However, the e2e lower-layer services are not yet (AFAIK) fully defined by JRA3, so it is not possible for us at this stage to work on the exchange of lightpath monitoring information on a multi-domain basis.
At this stage, we are keeping a (rather loose) eye on what is happening at the lower layers. So far we have had informal discussions with members of JRA3 (Michael, Maarten), Canarie (Rene), and Internet2 (Matt). From those discussions, several views of what "lightpath" monitoring is have emerged.
Lightpath discussion by Victor Reijs
Current Issues
The main problems we are facing right now:
- There is no clear definition of what a lightpath is and what its characteristics are (which need to be monitored).
- Various technologies can be involved, and they don't all offer the same visibility of the metrics which need to be defined for a lightpath.
- There are several different purposes for which you would wish to monitor a lightpath:
  - Setting up the lightpath
  - Health verification by the NOCs and the users
  - Diagnosing where a problem lies
JRA1-JRA3 distribution of work
Once lightpaths and their monitoring have been defined by JRA3 (to which JRA1 can contribute), it will have to be agreed with JRA3 who will design and implement which part of the monitoring infrastructure. The current trend is that JRA3 would work on the tools themselves, including the visualisation tools, and JRA1 on the way the monitoring information is exchanged between the domains.
Once there is a clear definition of lightpath monitoring, we will have to see with JRA3 how to split the work between the two activities. I would foresee the following points needing modification to include lightpath monitoring:
- integration of new measurement retrieval methods within an MP Service.
- integration of existing data archives within the MA services.
- modification of the interface to deal with those new types of data.
- modification of the look-up/topology service to take into account the existence of L1 and/or L2 infrastructure (the most challenging part for the framework itself).
- creation of visualisation tools and potentially of test tools (the most interesting part for lightpath monitoring and troubleshooting).
- a challenge will be to retrieve the information relevant to a lightpath (this includes tracking the lightpath across different pieces of equipment).
Different Visions of lightpath monitoring
Depending on who you talk to, you get different answers/views of the situation:
"Canarie" lightpath monitoring approach (SDH lightpath)
You don't need an infrastructure like the one you are building because the lower layers used to carry the lightpath (SDH) provide all the information you need at both edges. So you know if there is a problem. Comment: you shouldn't apply L3 concepts to the layer 2 world.
There are a few points I am not sure about because of my lack of understanding of SDH.
- Problems:
  - It assumes that a single technology is being used and that the transport of monitoring information is done by this layer. This will most likely not be the case within the European community.
  - In a multi-domain environment, you won't see in which domain a problem lies. You will only be able to know that there is a problem and to verify whether it is within your own network (but I am no expert in SDH).
"I2 lightpath monitoring approach" (piPEs like approach, divide the problem in tiny pieces)
Usage: troubleshooting and setting up a lightpath and, to some extent, e2e monitoring of the lightpath. They would place dedicated boxes in each PoP at the network boundaries, with which you could test the set-up of the lightpath or of a section of it, or troubleshoot it.
- Setting up a lightpath: each time you set up a new portion of a lightpath towards the next domain, you test the current path from your side to the end of the newly set-up portion. If the test is successful, you add another portion of the lightpath towards the destination. If the test is unsuccessful, you know that the problem is in the latest section. (The goal is to avoid sending 200 emails when setting up a lightpath.)
- Troubleshooting a lightpath when problems are being experienced: you cut the lightpath in two and run tests between an edge and the middle box where you have cut the lightpath. If you notice a problem in one half, you repeat the same operation on that half until you have isolated the segment where the problem is.
- Problems (?)
  - It requires the installation of dedicated test boxes within the PoPs, which may not be realistic across all the NRENs.
  - The technique is IP-based as far as I understood. This has advantages (a PC capable of blasting loads of traffic and permanently connected to a switch or a router is something we are used to, even if it is not always easy to generate 1Gbps flows) and disadvantages (using IP to evaluate the performance of an L1-L2 service is not ideal: if you need to complain to your provider, they won't understand a report that so many packets were lost, and it is difficult to convert IP metrics into L1-L2 types of information).
  - If you want to monitor your lightpath e2e, you need to do it from the edges with active probes. But to measure it accurately enough, you would most likely need to generate a quantity of traffic which will be intrusive.
  - With this technique, you don't cover the day-to-day operational monitoring of the whole lightpath. You may or may not even be able to see what is happening in your own network (this will depend on the technology used to carry your lightpath).
  - You don't keep the lightpath in one piece: you need to break it to evaluate each piece of it.
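The incremental set-up and the halving procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the lightpath is modelled as numbered segments between test boxes at boundaries 0..n, the `test` callable stands in for whatever tool is run between two boxes (it is hypothetical, not an existing piPEs interface), and the halving search assumes a single faulty segment.

```python
from typing import Callable, Optional

# Hypothetical test hook: returns True if a test between the boxes at
# boundaries `a` and `b` succeeds. Segment i lies between boundaries i and i + 1.
SegmentTest = Callable[[int, int], bool]


def build_incrementally(n_segments: int, test: SegmentTest) -> Optional[int]:
    """Extend the lightpath one segment at a time, testing from the first edge.

    Returns the index of the first segment whose addition makes the test
    fail, or None if the complete path passes.
    """
    for end in range(1, n_segments + 1):
        if not test(0, end):
            return end - 1  # the segment just added is the suspect
    return None


def bisect_fault(n_segments: int, test: SegmentTest) -> Optional[int]:
    """Cut the path in two repeatedly to isolate a single faulty segment.

    Assumes at most one faulty segment, and that a test over a span fails
    if and only if the span contains that segment.
    """
    lo, hi = 0, n_segments  # boundaries enclosing the suspect span
    if test(lo, hi):
        return None  # the whole path is healthy
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if test(lo, mid):
            lo = mid  # first half is clean, so the fault is in the second half
        else:
            hi = mid  # the fault is in the first half
    return lo


if __name__ == "__main__":
    faulty = 5  # simulated fault, for illustration only

    def simulated_test(a: int, b: int) -> bool:
        return not (a <= faulty < b)

    print(build_incrementally(8, simulated_test))
    print(bisect_fault(8, simulated_test))
```

With n segments, the incremental build needs up to n tests, while the halving search needs about log2(n) tests after the initial end-to-end test, but it requires breaking an already established lightpath.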
JRA1/SONAR type of approach
This assumes that, for all the technologies deployed along a lightpath, a common subset of metrics has been agreed and is measurable. The JRA1/SONAR framework would then be used to make available the status and performance of the sub-part of the lightpath within a given network. If you have access to this information from all the networks along the path, it should ease problem localisation. The main challenge will be to find the information along a given lightpath. The framework would also provide the capability to start lightpath tests between machines located at different points (the AA service would be used to authenticate a NOC member allowed to start such tests, even towards a distant machine he is not managing). What is currently not covered is the capability to "break" the lightpath to perform partial path analyses.
It is foreseeable that the final lightpath monitoring solution will be a mix of the above-mentioned methods.
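As a rough illustration of how per-domain information could ease problem localisation, the sketch below combines one status report per domain along the path. Everything in it is an assumption made for the example: the `SegmentReport` record, the status vocabulary and its ordering, and the domain names are hypothetical and are not part of the JRA1/SONAR framework.

```python
from dataclasses import dataclass
from typing import List

# Assumed status vocabulary, ordered from best to worst.
SEVERITY = {"up": 0, "unknown": 1, "degraded": 2, "down": 3}


@dataclass
class SegmentReport:
    domain: str  # hypothetical domain name
    status: str  # one of the SEVERITY keys


def end_to_end_status(path: List[SegmentReport]) -> str:
    """Worst status reported by any domain along the lightpath.

    Raises ValueError if `path` is empty.
    """
    return max((r.status for r in path), key=SEVERITY.__getitem__)


def suspect_domains(path: List[SegmentReport]) -> List[str]:
    """Domains, in path order, that report anything other than "up"."""
    return [r.domain for r in path if r.status != "up"]


if __name__ == "__main__":
    path = [
        SegmentReport("domain-a", "up"),
        SegmentReport("domain-b", "degraded"),
        SegmentReport("domain-c", "unknown"),
    ]
    print(end_to_end_status(path), suspect_domains(path))
```

A domain that cannot or does not publish its sub-path status shows up as "unknown" here, which is exactly the case where one of the other approaches would still be needed.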
Notes from various discussions
Routing
If an IP network is using a lightpath and is also connected to the general IP network, one has to be very careful that the traffic goes where it is expected to go: for instance, that the regular IP traffic does not go through the lightpath and that the lightpath traffic does not go through the regular IP connection.
How can this be verified?
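One partial check, offered only as a sketch, is to compare the interface a router would select for a set of sample destinations against the interface you expect. The code below assumes you can export the routing table as a prefix-to-interface map. The prefixes (taken from the documentation address ranges) and the interface names are hypothetical. It checks the routing decision at one router only and says nothing about what the far end does with return traffic.

```python
import ipaddress
from typing import Dict, Optional


def selected_interface(routes: Dict[str, str], destination: str) -> Optional[str]:
    """Longest-prefix match of `destination` against a prefix -> interface map."""
    addr = ipaddress.ip_address(destination)
    best_len = -1
    best_interface = None
    for prefix, interface in routes.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and net.prefixlen > best_len:
            best_len = net.prefixlen
            best_interface = interface
    return best_interface


def misrouted(routes: Dict[str, str], expected: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Destinations whose selected interface differs from the expected one."""
    result = {}
    for destination, wanted in expected.items():
        actual = selected_interface(routes, destination)
        if actual != wanted:
            result[destination] = actual
    return result


if __name__ == "__main__":
    routes = {
        "0.0.0.0/0": "ip-uplink",       # general IP connectivity
        "192.0.2.0/24": "lightpath-0",  # remote site reached over the lightpath
    }
    expected = {
        "192.0.2.10": "lightpath-0",   # lightpath traffic
        "198.51.100.5": "ip-uplink",   # regular IP traffic
    }
    print(misrouted(routes, expected))
```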
IP OWD tests
What would the deployment of OWD measurement boxes at the sites connected to a lightpath provide?
- OWD would give you a flat line (except if the lightpath is using MPLS in a domain). Since most of the time you would see a flat line, an OWD variation would reveal gross errors or queuing problems when something is going wrong. This is also ideal for the user to measure the up-time of the lightpath (regular tests).
- If the tools also perform traceroutes, this is ideal for verifying that there is no major routing problem (cf. the previous point on routing). However, the fact that a traceroute goes over the lightpath does not mean that part of the regular IP traffic isn't going over the lightpath, or that part of the lightpath traffic isn't going over the IP network.
- An advantage of the method is its relative non-intrusiveness with respect to the lightpath traffic.
- What would it be used for?
  - Daily monitoring: this provides the heartbeat of the lightpath. A short probing interval helps to easily spot major faults or changes of path. It can be used to verify the up-time of the lightpath.
  - Setting up a lightpath: may be used to get some calibration information once the lightpath has been created. Traceroute tests may be helpful to check the path used (routing problems are not easy to spot). This wouldn't bring much information apart from calibration hints (interesting in case the lightpath is carried over MPLS).
  - Diagnostic: end-to-end historical information can be used to verify when something may have happened and to get a first grasp of the cause of the problem. If the problem is not spotted with this method, then the throughput method will be the one to use.
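The "flat line" and "heartbeat" ideas above can be made concrete with a small sketch. It assumes the OWD probes are available as a list of delays in milliseconds, with None marking a lost probe. The median is used as the flat-line baseline and the tolerance is a value you would have to choose yourself. None of these choices come from an existing tool.

```python
from statistics import median
from typing import List, Optional, Sequence


def owd_baseline(samples_ms: Sequence[Optional[float]]) -> float:
    """Median of the received probes, taken as the expected flat line."""
    return median(s for s in samples_ms if s is not None)


def deviations(
    samples_ms: Sequence[Optional[float]],
    baseline_ms: float,
    tolerance_ms: float,
) -> List[int]:
    """Indices of received probes that depart from the flat line."""
    return [
        i
        for i, s in enumerate(samples_ms)
        if s is not None and abs(s - baseline_ms) > tolerance_ms
    ]


def uptime_fraction(samples_ms: Sequence[Optional[float]]) -> float:
    """Fraction of probes that arrived (None means the probe was lost)."""
    if not samples_ms:
        raise ValueError("no samples")
    return sum(s is not None for s in samples_ms) / len(samples_ms)


if __name__ == "__main__":
    samples = [10.0, 10.1, 10.0, None, 14.5, 10.0]  # illustrative values only
    base = owd_baseline(samples)
    print(base, deviations(samples, base, 0.5), uptime_fraction(samples))
```

A sustained shift of the baseline itself, rather than isolated spikes, would point to a change of path, which this sketch does not try to detect.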
TCP/UDP Throughput Tests
What would the deployment of TCP/UDP throughput probes at each site connected to a lightpath provide?
- It enables verification that a given level of TCP/UDP throughput is achievable (1Gbps can easily be verified; 10Gbps is much more difficult).
- It helps to spot more fine-grained queueing and burst problems than the OWD tests.
- As it generates a large amount of data, it allows the detection of low-rate errors.
- Main drawback: it is very intrusive and can compete with the operational traffic.
- What would it be used for?
  - Setting up a lightpath: end-to-end test to verify that the lightpath is working fine. Such tests give confidence that the path is established and that the level of errors is acceptable. If such throughput tools have been made available along the path by the network providers, the lightpath can be built bit by bit and tested between one edge and the new end of the lightpath (where a throughput box is connected).
    - For a 1Gbps lightpath, it would even be possible to find out whether the total bandwidth has been properly allocated to the lightpath.
    - This requires dedicated, properly tuned equipment (server, interfaces, OS, TCP stack); otherwise the tool would test the end-system itself rather than the connection.
    - For a 10Gbps lightpath, it is more difficult to test the capacity, as I haven't heard of any test running at 10Gbps so far (7Gbps is the maximum that has been achieved, and this is typically done by performance-tuning "gurus").
  - Daily monitoring: the objective is to check the lightpath health, with regular tests performed during periods of lower utilisation of the lightpath.
    - Useful for long-term statistics and trends, to see the evolution of the health of the lightpath and to spot any degradation of the lightpath performance (with respect to the previous tests).
  - Diagnostic: find out what a problem can be and where it is located.
    - When a performance problem occurs, the tool can be used to help find out where the problem is coming from (the lightpath itself, the end-hosts, etc.).
    - Once the type of problem has been identified, and if it seems to be coming from the lightpath, one has to find out where the problem is located along the lightpath. This is the most difficult part of all (because the different technologies involved do not all provide the same information). When it is possible to have access to several test probes along the path (and this is an ideal case), it is possible to break the lightpath and perform tests between those test probes. Even when you find the segment of the lightpath from which the problem seems to be coming, there may be several pieces of equipment in between. So further investigation (which ones?) needs to be performed to find out exactly where the problem is located.
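To give a feel for why a large amount of data is needed to detect low-rate errors, the following sketch estimates how long a test has to run before at least one errored bit is likely to be seen. It assumes independent bit errors at a constant rate and a sender that keeps the link full at a constant bit rate (for example a UDP stream). Both are simplifications, and the error rate used in the example is purely illustrative.

```python
import math


def bits_needed(bit_error_rate: float, confidence: float = 0.95) -> float:
    """Bits to send so that at least one bit error occurs with the given
    probability, assuming independent errors.

    Solves 1 - (1 - BER)^n >= confidence for n.
    """
    if not 0.0 < bit_error_rate < 1.0:
        raise ValueError("bit_error_rate must be between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return math.log(1.0 - confidence) / math.log1p(-bit_error_rate)


def duration_needed_s(
    bit_error_rate: float, rate_bps: float, confidence: float = 0.95
) -> float:
    """Seconds of traffic at `rate_bps` needed to send that many bits."""
    return bits_needed(bit_error_rate, confidence) / rate_bps


if __name__ == "__main__":
    # Illustrative figures only: an error rate of 1e-12 on a 1Gbps lightpath.
    print(duration_needed_s(1e-12, 1e9) / 60.0, "minutes")
```

Under these assumptions, an error rate of 1e-12 needs roughly 3e12 bits, that is about 50 minutes of traffic at a full 1Gbps, which illustrates how intrusive such a test is.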
Ethernet Monitoring
The information you can retrieve from an Ethernet switch is quite restricted: circuit utilisation, drops on the switch, runts and CRC errors. When using an Ethernet service over a WAN, there is no way of finding out exactly where between two switches a problem is coming from when one arises. (Add notes on cases where this is a disadvantage.)
- Ethernet provides the capability to monitor circuit/VLAN utilisation. If you have several VLANs on an Ethernet circuit, you can get the utilisation of each of them.
- It also gives access to drops, runts and CRC errors on the switch, which indicate a degradation/over-utilisation of the service.
- What is the meaning of a drop, a runt or a CRC error on an Ethernet switch?
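Switch counters are cumulative, so utilisation and error trends have to be derived from the difference between two polls. The sketch below shows that arithmetic, assuming 32-bit wrapping counters, at most one wrap between polls, and counter readings already collected by some other means. The dictionary keys ("drops", "runts", "crc_errors") are hypothetical labels, not names from a particular switch or MIB.

```python
from typing import Dict

COUNTER_MODULUS_32 = 2 ** 32  # assumed counter width


def counter_delta(previous: int, current: int, modulus: int = COUNTER_MODULUS_32) -> int:
    """Increase of a wrapping counter between two polls (at most one wrap)."""
    return (current - previous) % modulus


def utilisation(prev_octets: int, cur_octets: int, interval_s: float, link_bps: float) -> float:
    """Fraction of the link capacity used during the polling interval."""
    bits = counter_delta(prev_octets, cur_octets) * 8
    return bits / (interval_s * link_bps)


def error_deltas(previous: Dict[str, int], current: Dict[str, int]) -> Dict[str, int]:
    """Per-counter increase between two polls of the error counters."""
    return {name: counter_delta(previous[name], current[name]) for name in previous}


if __name__ == "__main__":
    # Illustrative readings only.
    print(utilisation(0, 125_000_000, 1.0, 1e9))
    print(error_deltas(
        {"drops": 10, "runts": 0, "crc_errors": 4_294_967_290},
        {"drops": 12, "runts": 0, "crc_errors": 10},
    ))
```

On fast links a 32-bit octet counter can wrap more than once between polls, in which case the delta is underestimated. Shorter polling intervals or wider counters, where the equipment offers them, avoid this.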
