MDM Pilot 28 February 2008

From GEANT2-JRA1 Wiki

Date and Time: Monday, the 28th of January 2008, 11:00 - 12:00 CET.

  1. We will use the Hungarnet MCU:
    • GDS numbering plan JRA1: 0036100309901
    • Phone access: +36 1 450 3099 then type the GDS number 0036100309901
  2. Backup: DFN

Actions

  1. NS to investigate with Stijn the status of MP: are all the commands working? Since which release? How to move ahead: install 3.0 or change a configuration file?
  2. Add former meeting actions to the list of actions
  3. NS to identify examples of discrepancies between pSUI and traceroute.
  4. NS to contact TR about having the PERT use the self-paced training and look at some data (based on exercises).
  5. NS to send pointer to the start page. - Done
  6. NS to ensure that the SWITCH and Hungarnet Telnet/SSH MPs are being monitored on the status page
  7. NS to ensure that the SWITCH circuit status MP is being monitored on the status page
  8. CW to describe more precisely what he expects with respect to the IPv4 and IPv6 addresses and pSUI
  9. NS to investigate the Telnet/SSH MP issue with Hungarnet
  10. LM to check whether MDM-pSUI works better than the general public one.

Minutes

Attendance: CW, GI, LM, NS

  1. Last Meeting Actions
    • Skipped for today
  2. Deployment status update
    • SSH/Telnet - proposal: deploy 2.2 without ping and traceroute - rejected for the time being, as too many commands were not working in the previous phase. How to move forward? Wait for the next release or simply change the configuration file?
  3. Training the NOC
    • LM: Self-paced training is a good way to train the NOC and PERT and to give an introduction to the tools
    • Suggestion: Giving examples of practical cases is good. See how they can be observed with perfsonarUI. CW: This should be in a separate training module. Improve the resolution of the pictures (some look fuzzy - poor compression, the letters cannot be read properly). Example of use: run a BWCTL test and look at the results, then look at the link utilisation along the path of the test to check whether the BW is the right one.
    • Suggestion: running a traceroute and a ping before or after the BWCTL test would help the user understand the path followed by the test and would provide the delay information needed to tune the TCP stack.
    • LM reported that sometimes the pSUI was not showing the utilisation graph corresponding to the traceroute
    • Suggestion: have the PERT members look at the training whilst on duty and provide some exercises for looking at the data.
  4. Looking at measurements and understanding them
    • LM: Nothing unusual. No possibility of running tests to a GEANT2 BWCTL server. Interfaces all OK. BWCTL test: inside FCCN we got what we were expecting.
    • CW: didn't find anything interesting. Everything looked normal. Hades looked OK. Telnet/SSH MP and status page not monitored on the status page. BWCTL test didn't work for any machine.
      • On Friday, CNM DoS-ed the SWITCH RRD MA (CPU was around 180%). Tomcat crashed every few hours. CW spent quite a bit of time investigating the issue. He installed some filters against CNM. SD worked with the CNM team and the issue was resolved. He also installed pS-ps from Internet2 but has not had the opportunity to go much further. Limiting the number of requests received will be investigated.
      • Issue with respect to IPv4 and IPv6 addresses: they represent the same interface. To be enhanced.
    • GI: RRD MA almost all there. Telnet/SSH MP: all routers configured. Tested 10% of them. It was not working (no answer received). There was an internal policy decision that forbids Hungarnet from providing VRF commands.
      • BWCTL: looking at the last two weeks of data, there are only 6 combinations out of 12. Found a machine with an older version of iperf (running on a JRA1 machine) on which there was an issue. Issue raised with Erlangen. On-demand test worked (except on the old "JRA1" box).
      • Hades measurement - misunderstanding of what was expected. Found some higher values, though they were due to a file transfer and are considered OK.
    • General comment: this is an unusual way of looking at the data. Most of the time you look at it when there is an issue. But OK for training purposes.
  5. Training
    • GI gave a training on basic concepts. Waiting for a more stable SSH/Telnet MP to be up and running before doing a second round.
    • CW: Pity the Telnet/SSH MP doesn't work OK right now. Training should only happen on things that really work. (pSUI: there are always one or two things that don't work properly, e.g. dual stack routers)

Agenda

  1. Last Meeting Actions
  2. Deployment status update
    • SSH/Telnet - proposal: deploy 2.2 without ping and traceroute (to avoid VTY exhaustion when targeting an end host * * *)
  3. Training the NOC
  4. Looking at measurements and understanding them
  5. Chris demo - information
  6. Next steps to finish the MDM Pilot

Data Investigation

I suggest using the MDM In Service - Check List and performing the checks for the RRD MA, BWCTL MP and the Hades tool.

The objectives of those checks are to verify that for the last two weeks

  1. the measurements are all there
  2. they are working properly (RRD MA - all data there; BWCTL MP - average throughput over 900 Mbps; Hades - delay variation around a few ms, no packet loss, traceroute stable)
  3. the network is working properly
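
As an illustration only, the BWCTL and Hades criteria above could be expressed as a small script. This is a minimal sketch, not part of the MDM tools: the function names and the record layout (`delay_variation_ms`, `packets_lost`, `route`) are hypothetical, and "a few ms" is read here as 5 ms, which is an assumption.

```python
from statistics import mean

# Threshold taken from the check list text above.
MIN_AVG_THROUGHPUT_MBPS = 900.0
# Assumption: "around a few ms" is interpreted as at most 5 ms.
MAX_DELAY_VARIATION_MS = 5.0


def check_bwctl(throughputs_mbps):
    """True if there is data and the average throughput is over 900 Mbps."""
    return bool(throughputs_mbps) and mean(throughputs_mbps) > MIN_AVG_THROUGHPUT_MBPS


def check_hades(samples):
    """True if delay variation is low, no packet was lost and the route is stable.

    samples is a list of dicts with the hypothetical keys
    'delay_variation_ms', 'packets_lost' and 'route' (a list of hops).
    """
    if not samples:
        return False  # missing data fails the "measurements are all there" check
    low_variation = all(s["delay_variation_ms"] <= MAX_DELAY_VARIATION_MS for s in samples)
    no_loss = all(s["packets_lost"] == 0 for s in samples)
    # The traceroute is considered stable if every sample saw the same path.
    stable_route = len({tuple(s["route"]) for s in samples}) == 1
    return low_variation and no_loss and stable_route
```

A result of False from either function would correspond to a variation worth flagging in the Check List.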

Fill in the Check List and send it back to me; flag any variation in the measurement behavior and report it at: http://wiki.perfsonar.net/jra1-wiki/index.php/Cases_to_Investigare.
I expect that the exercise will not take more than one hour.

You can also conduct additional investigation, looking at the delay between the servers and the link utilisation/capacity. This would give a better understanding of the observed behavior.


What's next?


We would then investigate those anomalies (JRA1, MDM participants, MDM NOC, PERT if need be) to understand what is happening (eradicate the problems related to the measurement and investigate the network problems).

A guide for interpreting the data (possibly in the same format as: http://wiki.perfsonar.net/jra1-wiki/index.php/Measurement_Observations - format to be confirmed) would be produced.

This exercise is important: it will be used to flag anomalies as part of the work JRA1 does to assess the reliability of the measurements, and it will be a starting point for the data interpretation documentation (starting from the observations you made).
