Tools and Roles

From GEANT2-JRA1 Wiki


Introduction

As several administrative domains will provide data to perfSONAR users, it is essential to have a minimum common understanding of user roles; otherwise there cannot be a consistent data release policy. Along with the definition of a common set of user groups, there must be agreement on which data each group is entitled to receive, based on the group and on the relationship between the user's home domain and the domain the data is requested from.

Note that the text below is targeted at the GEANT2 community. The same work should be carried out for the collaboration within North America, ALICE, etc., with the aim of reaching a global agreement between GEANT2, Internet2/ESnet, ALICE, TEIN, etc.

User groups

Several groups of users have been identified (some overlapping):

  • NOCs (other than the NOC responsible for the local network)
  • PERT
  • Projects or virtual organisations spread across several physical locations (GRID, video-conferencing, etc.)
    • L2 projects (making use of CBF services or making use of e2e gigabit ethernet services)
    • L3 projects (making use of IP services - IPv4, Best Effort or IPv6).
  • General (unqualified) user.

See below for a refinement of the different groups.

Usage

These groups can make one or several uses of the data and functionalities:

  • Operation (Service health check, Project service health check, SLA verification)
  • Advanced troubleshooting, troubleshooting and project troubleshooting.
  • Security
  • Planning
  • Tailored functionalities.


JRA1 User segmentation

Each of these groups provided some indications in DJ1.1.1 regarding the type of data they wish to access and how they intend to use it.
User segmentation

  • [Optional] means that these are potential functionalities that can be provided to the user group depending on the user need. This will be agreed with each project.
  • Security monitoring is being handled by JRA2, which has stated that at this stage of the project it does not foresee any need for multi-domain monitoring information. This is why the row is empty.
  • Network researchers are a very specific group whose needs vary from one researcher to another. This is why the row is empty: no common need has been defined. Overall, though, giving them access to any monitoring information would be an advantage, as they could check their research against real network information and improve the network models they use in their simulations.
  • Note that a NOC currently has a complete view of the data within its own domain. The table highlights only the additional functionalities that a NOC could get from other domains.

Consistency

Within GEANT2, there is a strong need to offer a common set of information and functionalities to these user groups; the service would bring little benefit if every network provided an entirely different set of functionalities and data. Between the GEANT2 community and Internet2/ESnet, the same user groups also need to be recognised by both sides. Several categories have been identified so far: Remote NOC, Trusted remote NOC member (e.g. PERT), and end-users. These groups should be recognised by all the domains involved.

(**) Under restriction: the security group is a highly trusted group of people from the NRENs.

Note: it has been suggested to distinguish, within each category, between trusted users and the others. For example, a trusted remote NOC or PERT member could run TCP tests lasting several minutes or unlimited UDP tests, whilst a generic remote NOC user would be limited to 10 or 30 seconds for TCP tests and to 100 Mbps for UDP tests.
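The trusted/generic distinction above could be enforced by a simple limit check before a test is scheduled. The following is a minimal sketch, not a perfSONAR API: the names `TestLimits`, `LIMITS` and `is_allowed` are hypothetical, the 30 s generic TCP cap is one of the two values mentioned, and the 600 s trusted TCP cap is an assumed stand-in for "several minutes".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TestLimits:
    """Hypothetical per-trust-level limits; None means no limit."""
    max_tcp_seconds: Optional[int]
    max_udp_mbps: Optional[int]


# Assumed values: 600 s stands in for "several minutes"; 30 s is one of the
# two generic TCP limits mentioned (10 or 30 s); 100 Mbps is the UDP limit.
LIMITS = {
    "trusted": TestLimits(max_tcp_seconds=600, max_udp_mbps=None),
    "generic": TestLimits(max_tcp_seconds=30, max_udp_mbps=100),
}


def is_allowed(trust_level: str, protocol: str, duration_s: int, rate_mbps: int = 0) -> bool:
    """Return True if the requested test fits within the limits of the trust level."""
    limits = LIMITS[trust_level]
    if protocol == "tcp":
        cap = limits.max_tcp_seconds
        return cap is None or duration_s <= cap
    if protocol == "udp":
        cap = limits.max_udp_mbps
        return cap is None or rate_mbps <= cap
    raise ValueError(f"unknown protocol: {protocol}")


print(is_allowed("generic", "tcp", 20))       # True
print(is_allowed("generic", "udp", 10, 500))  # False
print(is_allowed("trusted", "udp", 10, 500))  # True
```

Keeping the limits in a table rather than in the checking code means each domain could adjust the values without touching the logic.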

Usage

This section gives high-level indications of the different uses of network monitoring information and actions.

Service Health Verification

A NOC looks after a service and checks whether it behaves correctly end-to-end or across multiple networks (either preventively or when starting to troubleshoot, so that it can see the normal behaviour and then the changes). It is mostly about having access to historical information to assess how the network/service behaves: looking at the status, the load and health parameters such as delay, path and throughput, with this information displayed on a map or a dashboard. Alarms should be raised when thresholds are not respected.
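The threshold alarms described above can be sketched as follows, assuming each measurement arrives as a dictionary of parameter values. The parameter names, limits and the `Threshold`/`check` helpers are hypothetical; real thresholds would be set per service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    parameter: str
    limit: float
    upper: bool  # True: value must stay <= limit; False: value must stay >= limit


# Hypothetical thresholds for one monitored path.
THRESHOLDS = [
    Threshold("one_way_delay_ms", 30.0, upper=True),
    Threshold("throughput_mbps", 500.0, upper=False),
]


def check(measurement: dict) -> list:
    """Return an alarm message for every threshold the measurement violates."""
    alarms = []
    for t in THRESHOLDS:
        value = measurement.get(t.parameter)
        if value is None:
            continue  # parameter not measured in this sample
        violated = value > t.limit if t.upper else value < t.limit
        if violated:
            alarms.append(f"ALARM {t.parameter}={value} (limit {t.limit})")
    return alarms


print(check({"one_way_delay_ms": 45.0, "throughput_mbps": 800.0}))
# ['ALARM one_way_delay_ms=45.0 (limit 30.0)']
```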

SLA

Verification that the service is within the Service Level Agreement envelope. It is a specialisation of Service Health Verification in which the health of a service for a group of users is summarised by a few Key Performance Indicators (KPIs), which need to stay under given thresholds; alarms are triggered if a threshold is not respected. Results are visualised in a dashboard and in a monthly report.
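A monthly SLA summary could reduce a month of samples to one value per KPI and flag any KPI above its threshold. This is a minimal sketch under assumptions: the KPI names, the choice of the monthly mean as the KPI value, the limits and the `kpi_report` function are all hypothetical, since the SLA definitions are not specified here.

```python
from statistics import mean


def kpi_report(samples: list, limits: dict) -> dict:
    """Summarise samples into one entry per KPI: value (monthly mean), limit, alarm flag."""
    report = {}
    for kpi, limit in limits.items():
        values = [s[kpi] for s in samples if kpi in s]
        if not values:
            continue  # KPI not measured this month
        value = mean(values)
        report[kpi] = {"value": value, "limit": limit, "alarm": value > limit}
    return report


# Hypothetical month of two samples for one customer service.
month = [
    {"packet_loss_pct": 0.0, "one_way_delay_ms": 20.0},
    {"packet_loss_pct": 0.3, "one_way_delay_ms": 22.0},
]
report = kpi_report(month, {"packet_loss_pct": 0.1, "one_way_delay_ms": 30.0})
for kpi, entry in report.items():
    print(f"{kpi}: {entry['value']:.2f} (limit {entry['limit']}) alarm={entry['alarm']}")
```

An SLA may instead be defined on percentiles or on the share of samples within bounds; only the line computing `value` would change.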

Troubleshooting

Actions required to investigate an incident so that a work-around can be proposed or the root cause of a problem can be found. Troubleshooting requires access to more detailed information than service health verification, as well as the ability to trigger on-demand tests. It also requires access to service health verification data and SLA data. Typical actions are starting on-demand tests (TCP throughput, limited UDP throughput, traceroute) and looking at parameters that give an indication of the cause (input errors, output drops, show commands, etc.).

Advanced Troubleshooting

Same as troubleshooting, but for cases where the root cause of an incident is more difficult to identify and also requires analysis of the transport protocol, tuning of transport protocol stacks, the OS, the application, etc. It requires going into more detail and is typically done by the PERT, a trusted body of network engineers who are allowed to go further into the details. They have no limitation on the type of tests they run; they can have access to packet capture tools, access more information within the MIBs, etc.

Project Troubleshooting

It basically gives a project operational group access to some of the troubleshooting functionalities and data that the NOCs have access to. Access is to be granted on a case-by-case basis.
Once they have found out that the problem is not coming from the project side but from the backbone networks, the project passes the problem to the NOCs, which follow their usual procedure (see the Service Health Verification and Troubleshooting sections).
We may wish to ask the project to give the NOC and the PERT access to its data (end-to-end measurements or local site information). This data would present a view complementary to the one obtained from the backbone. It would then be essential to have the NOC and PERT roles recognised by the projects as well, and to have a way of bridging the project AA infrastructure with the JRA5 one. (This assumes that the project has deployed perfSONAR-compliant services within its premises.)

Project Service Health Check

It is a sub-case of the NOC Service Health Check in which the data analysis and presentation are tailored for the project users.

Tailored Added Value Functionalities

Tailored visualisation for end-users or for some projects. Currently unspecified.

Other

The data can be used in other ways, which are not detailed here because they are sub-cases of the ones mentioned above:

  • Planning: looking into the future of a service: what are the trends, where do we stand, what is still available. Basically, this means accessing the Service Health Verification data. It is not expected to be a function provided through perfSONAR (unless someone provides automated scripts to do so).
  • Security: accessing data to evaluate whether there are security threats (not expected to be done through perfSONAR, as JRA2 is taking care of it).
  • Added value services: tailored visualisation offered to the projects.



User Group Definitions

Note: these definitions are presented from the point of view of who could have access to the data you hold locally.

NOC

This group consists of the people in charge of constantly monitoring the health of the network, solving e2e problems (e.g. when a customer complains that an application doesn't work over the network) and verifying that the SLAs are respected. It covers the European NREN NOCs and the E2ECU.

The "Remote NOC" role covers:

  • the European NRENs Network Operation Centers.
  • the European Regional and Metropolitan network NOCs.
  • [optional] the end-institution network or system administrators.

Additionally, the role could be extended to NOCs from non-European peering research networks. As end-to-end Internet paths cross many network boundaries, it shouldn't be restricted to the European NRENs only: Abilene, Alice, Canarie, Russia, Japan, etc.

PERT

The Performance Enhancement and Response Team (PERT) is a group in charge of troubleshooting problems, not stopping at the network level but also investigating the transport protocols, the application, the operating system and the hardware. It can be seen as a specialisation of the NOC troubleshooting function, which may need finer-grained visibility for a given problem.

Projects

Projects are groups of end-users or institutions that share a common goal and need network services. Some of these projects may need to access monitoring data between their sites to verify that the service is working fine or that the SLA is respected, or to investigate a problem (and potentially rule out a network problem). Different projects can have different access rights, e.g. evlbi, EGEE, LHC, etc. These projects can be classified as:

  • L3 projects when using IP services (mostly using IPv4 best effort)
  • PiP projects when making use of the Premium IP service (it is a sub-category of the L3 projects)
  • L2 projects when making use of e2e circuits (CBF or e2e gigabit ethernet services).

Different projects within a category may have different access, unless specified otherwise.

The projects may already have their own AA scheme and may wish to have hundreds of people accessing network monitoring information, or they may have no AA infrastructure and only let a few of their members access the data/functionalities. This has to be investigated on a case-by-case basis.

An open issue is whether the identities for project members are provided by the NRENs and the GIdP or whether a project has to have its own certification authority to provide its members with AA certificates.

NREN Non Technical Staff

They are mostly interested in a high-level view of the network, its health and the SLAs. Their access will mostly be through visualisation or analysis tools.

End User

An end-user is somebody who actually uses the network and associated services and who is based at a research institute, a university, a school or any institution connected to the networks connected to European NRENs (or European regional networks).

Security Users

Security users are those working for a CERT who need either to analyse what is happening during an attack or to discover the pattern of an attack (source, etc.). Security monitoring is being handled by JRA2, which has stated that at this stage of the project it does not foresee any need for multi-domain monitoring information.

Network Researchers

Network researchers are researchers studying the network or doing research in areas that make use of the network (protocols, applications, etc.). Their needs vary depending on the type of research. Giving them access to any monitoring information would be an advantage, as they could check their research against real network information and improve the network models they use in their simulations.

It is not clear whether it makes sense to define an extra group for these people. Alternatively, a research activity could be defined as a small project.

Attributes (JRA1 perspective)

A set of attributes needs to be specified, based on which an NREN can define the rights that a user should have. The following list therefore does not specify the rights themselves, but the attributes that are needed from a user in order to specify its rights.

  • user login: login id for identifying the user
  • user groups: which of the user groups defined above the user belongs to
  • home NREN: needed to differentiate the rights (e.g. NREN non-technical staff will be allowed to see much more data in their home network than in other networks)
  • home subdomain: each domain can define subdomains to distinguish between different groups (e.g. when they have an internal and external NOC); it can be decided to make a single field out of this and the previous field and to enforce some kind of structuring like Internet domain names
  • project identification (project only): Needed to distinguish people from the project member group so that they can only access information about the project they belong to.
  • user identity: This should be some kind of pointer allowing to find out the user's identity. Needed to really get to know who is using a tool and to allow for revoking the user rights in case of abuse (e.g. for BWCTL).
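The attribute list above maps naturally onto a small record type. The sketch below is illustrative only: `UserAttributes` and its fields are hypothetical names, and `home_domain` shows one possible way of merging the home NREN and subdomain into a single dotted name "like Internet domain names", as the list suggests.

```python
from dataclasses import dataclass, field


@dataclass
class UserAttributes:
    """Hypothetical container for the JRA1 attributes listed above."""
    login: str                                        # login id identifying the user
    groups: set                                       # e.g. {"NOC", "PERT"}
    home_nren: str                                    # e.g. "grnet"
    home_subdomain: str = ""                          # e.g. "noc" for an internal NOC
    projects: set = field(default_factory=set)        # only for project members
    identity_ref: str = ""                            # opaque pointer to the real identity

    def home_domain(self) -> str:
        """Merge subdomain and NREN into one dotted name, most specific part first."""
        if self.home_subdomain:
            return f"{self.home_subdomain}.{self.home_nren}"
        return self.home_nren

    def is_home(self, nren: str) -> bool:
        """True if data is requested from the user's home NREN, where rights may be wider."""
        return nren == self.home_nren

    def may_see_project(self, project: str) -> bool:
        """Project members may only access information about their own projects."""
        return project in self.projects


u = UserAttributes("jdoe", {"NOC"}, "grnet", "noc", identity_ref="opaque-1234")
print(u.home_domain())  # noc.grnet
```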

A comparison with the requirements from AMPS shows that they are basically similar, so Maurizio provided the following summary.


Attributes (both JRA1 and SA3/AMPS)

  • Attributes relative to the "role" of the user. I.e. what the user does within an organization. There are some slight differences between SA3 and JRA1:
    • JRA1 attempted to define a vocabulary (allowed values) for that: NOC, PERT, NREN-non-techn-staff, End User (ordinary), Security User, Network Researcher.
    • SA3 did not define a vocabulary, but just gave examples
    • JRA1: attribute must be multiple valued
    • SA3: attribute is single valued
    • Both: the attribute can have an associated "hierarchy", e.g. urn:geant:edugain:component:perfsonar:user:gr:ntua. SA3 wants to do a "longest prefix match" AuthZ. For example, if the group is urn:geant:edugain:component:perfsonar:user:gr:ntua, this user may get only generic privileges, but if it is urn:geant:edugain:component:perfsonar:user:gr:ntua:noc this user may get higher privileges, and if it is urn:geant:edugain:component:perfsonar:user:gr:ntua:noc:head_of_noc, even higher.
    • Both: the attribute can be used for AuthZ
  • Attributes relative to the "project" or "scope" the user belongs to. This information is "orthogonal" to the previous one (and possibly overrides or complements it for AuthZ purposes). It is likely to be shorter-lived than the previous one.
    • JRA1: examples are L3 projects, PiP project, L2 projects
    • SA3: examples are default, egee, gn2
    • Both: the attribute must be multiple valued
    • Both: the attribute is non-hierarchical, although JRA1 thinks it may be useful to extend it with more specific information like "project:role_within_the_project:sub_role_within_the_project", etc. Harmonising the role and sub-role definitions across projects is of course too complicated, but an application may decide to take its AuthZ decision considering this attribute only up to the "depth" it is able to understand.
    • Both: the attribute can be used for AuthZ
  • Attributes relative to the "network connectivity", or "network domain" of the user, which is "normally" (but not always) NREN:<end_institution>.
    • Both: the attribute has a hierarchy associated with it
    • Both: the attribute is single-valued
    • Both: the attribute can be used for AuthZ
  • Attributes relative to who the user *is* and how she/he/it can be contacted: full name, e-mail, phone, work address, etc. These attributes are not used for AuthZ, but may be logged for other purposes (e.g. to trace back abuse or misuse). AuthZ may be denied if these attributes are not transferred because of privacy restrictions, or if it is not at least possible to obtain an opaque, persistent unique ID that the resource can use as a key for obtaining (possibly off-line) the full information from the IdP in case of abuse.
  • Attributes relative to the login of the user. Looks like a "nickname" or mnemonic userid. Also used for logging purposes rather than AuthZ.
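The "longest prefix match" AuthZ proposed by SA3, and the JRA1 idea of reading a project attribute only up to a given "depth", can both be sketched by splitting values on ":". The privilege table below is hypothetical and only mirrors the NTUA example; matching is done on whole ":"-separated components so that, for instance, "...:gr:ntua" does not accidentally match "...:gr:ntuax".

```python
from typing import Optional

# Hypothetical privilege table keyed by role URN prefix (levels are illustrative).
PRIVILEGES = {
    "urn:geant:edugain:component:perfsonar:user:gr:ntua": "generic",
    "urn:geant:edugain:component:perfsonar:user:gr:ntua:noc": "noc",
    "urn:geant:edugain:component:perfsonar:user:gr:ntua:noc:head_of_noc": "head_of_noc",
}


def longest_prefix_match(role: str, table: dict) -> Optional[str]:
    """Return the privilege of the longest table entry that is a component-wise prefix of role."""
    parts = role.split(":")
    for n in range(len(parts), 0, -1):  # try the most specific prefix first
        candidate = ":".join(parts[:n])
        if candidate in table:
            return table[candidate]
    return None  # no matching entry: no privileges


def project_scope(attr: str, depth: int) -> str:
    """Keep only the first `depth` components of a project:role:sub_role value."""
    return ":".join(attr.split(":")[:depth])


base = "urn:geant:edugain:component:perfsonar:user:gr:ntua"
print(longest_prefix_match(base + ":noc:shift_lead", PRIVILEGES))  # noc
print(project_scope("egee:operations:site_admin", 1))              # egee
```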

Actions

Users can mainly perform two types of actions: (get) data and (perform) tests. Below is a list of points to keep in mind, as they have implications in terms of the access given to the different user groups.

  • Privacy-sensitive data: data which cannot be provided unless there are very good operational or security reasons.
    • Examples of such data: raw netflow, or packet capture tool output.
    • Usage: security, advanced troubleshooting of specific problems, anomaly detection.
  • Intrusive active test: a test which generates a large amount of traffic on the network and which can harm the network. These tests would most likely only be accessible to NOCs and PERT.
    • Example: 1 Gbps TCP/UDP throughput tests.
    • Some categories of tests will only be available to PERT (e.g. large UDP throughput tests).
    • These tests, with very limited parameters/boundaries, can be opened to the project operational teams.
  • Active test: a test which generates a light load on the network and does not risk harming the network.
    • Example: ping, traceroute, OWD test, light TCP throughput test.
    • Usage: Troubleshooting.
    • They can be given to any operational group.
  • Intrusive and non-intrusive active test data: the results of tests not triggered by the user.
    • Example: retrieve data from ping, traceroute, OWD test, TCP or UDP throughput test.
    • Usage: troubleshooting, service verification
  • Passive data: data retrieved from network equipment
    • Example: interface load, interface errors, filter configuration
    • Usage: troubleshooting, service verification.
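The action categories above could feed an access-control table keyed by action and user group. This is only one possible reading of the list, stated as an assumption: the group names, the choice of security users and PERT for privacy-sensitive data, and the open access to test and passive data for operational groups are not decided in the text, and the sketch ignores the restricted, limited-parameter variant of intrusive tests that may be opened to project operational teams.

```python
from enum import Enum


class Action(Enum):
    PRIVACY_SENSITIVE_DATA = "privacy_sensitive_data"
    INTRUSIVE_ACTIVE_TEST = "intrusive_active_test"
    ACTIVE_TEST = "active_test"
    ACTIVE_TEST_DATA = "active_test_data"
    PASSIVE_DATA = "passive_data"


# Assumed mapping (hypothetical group names); each domain would set its own policy.
OPERATIONAL = {"NOC", "PERT", "PROJECT_OPS"}
ALLOWED = {
    Action.PRIVACY_SENSITIVE_DATA: {"SECURITY", "PERT"},
    Action.INTRUSIVE_ACTIVE_TEST: {"NOC", "PERT"},
    Action.ACTIVE_TEST: OPERATIONAL,
    Action.ACTIVE_TEST_DATA: OPERATIONAL,
    Action.PASSIVE_DATA: OPERATIONAL,
}


def may_perform(groups: set, action: Action) -> bool:
    """True if any of the user's (multi-valued) groups is allowed to perform the action."""
    return bool(groups & ALLOWED[action])


print(may_perform({"PROJECT_OPS"}, Action.ACTIVE_TEST))            # True
print(may_perform({"PROJECT_OPS"}, Action.INTRUSIVE_ACTIVE_TEST))  # False
```

Using set intersection reflects the requirement that the role attribute be multiple-valued: a user holding several groups gets the union of their rights.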
