Distributed LS discussion
From GEANT2-JRA1 Wiki
Discussion on distributed LS document. To be edited by authors of the document only.
Overall document comments
A (MG): To review the document and make sure it specifies only things which are agreed and done. Add a new section where mention things mhich may be done in next steps.
- new comment:
In fact, new section.
- JB comment:
There are many places in the document that describe things that 'could' be done. They no longer belong in this document unless they are used to describe the way things are done, and why a particular decision was made. For example:
"The service may know the name of an LS via static configuration (the most common case for legacy deployments), or other forms of bootstrapping such as multicast may occur."
The 'or other forms of bootstrapping such as multicast may occur' only belongs in this document if we are specifying exactly how this is done, and specifying when it should be done.
Upper/lower scope vs. global/local scope
A (JZ): To make an edit pass of the specification document and change scope references to general terms instead of only two tiers. [done]
- MG comment:
I think we all agree to use upper/lover terms. The document should be corrected in all places where global/local scopes remained (mostly comments)
- JB comment:
I think the algorithms should be written with simply higher or lower level descriptions (relative to the current level). I don't really want *any* assumption of only two levels anywhere in this document. There are some locations in the document where upper clearly means the one above, and any above it. However, there are other locations where it is clear the idea is only 2 tiers.
I have no problem with the initial deployments being done that way, but this document is a design document. The algorithms need to be written to handle more than 2 levels from the beginning.
To be clear - I'm fine with the 'top' scope being defined by a domain name - and only subdividing by geography if we need to later. What I'm not fine with is the idea that there is only one level of scope below the domain.
- JZ comment:
Fixed, although I may have missed things, or created more confusion in the process of trying to 'generalize' please proofread for correctness.
Radix algorithm, summary computing
A (MS, KR): Provide algorithm description and first Java code (until 16 Nov).
- MG comment:
Thanks to Jason we have some description of the algorithm in general, but I believe we need one of those: a) detailed description how it works for IP address summary (with examples) b) pseudo code c) implementation in one of known languages (perl/java) with some comments. This may be in attachment
- MG comment:
As far as I understand the Lower Scope summarization will be based just on transformation (XSLT/XQ) and Upper Scope will use Radix Tree for summarizing IP addresses. That was JZ's proposal, but is it already confirmed by MS and JB - AFAIR they wanted to use Radix on both levels but with different 'precision'.
- MS comment:
I have a student that I can devote to helping out to write some experimental summarization code here.
- JB comment:
One important part about the summary computation is knowing when you are done. When is it compressed enough? One possible solution is to have a 'target' size for each dimension of the summary data (IP address is one dimension). Then you can use the radix structure to determine which 'branches' to collapse to reduce the list. (You might want to minimize the difference between the maximum/minimum mask value. Or alternatively minimize the amount of false-positive address space that is consumed by masking.) These decisions all need to be part of the algorithm that is described in this document. It MUST be this detailed.
When summarization occurs
A (all): Agreed that it is left up to the implementation.
- MS comment:
It is left up to the implementation when the summarization occurs (i.e. at message send time, or also as a periodic event).
- MG comment:
Agree.
Leader election metric - content size
A (JZ, MS): Update specification document about ID computation [JZ done, MS pending]
- MG comment:
Question: I am not convinced "contentSize" should be the only criterion. When new LS joins the ring it may have empty database, but in some time it'll get a lot of metadata. The mechanism of leader election bases on that, so its quite important. If such LS will "update" its "contentSize" value not all other LS-es may know about it immediately (token passing cycle) and they may not have the same knowledge about the leader. I"d prefer more deterministic way of leader election.
- JZ comment:
I have added some more reasoning behind this(...). I am not in favor of something complex when calculating a leader; we must choose something that everyone can figure out [instantly] so that the ring continues to function even during times when bad things are happening. Everyone will be able to see (at any given time) how much information someone has, and the notion that an unburdened system should be given more work (as opposed to a system that may be very busy with local queries yet chosen anyway based on separate criteria) is very appealing.
- MS comment:
I don't think that we can debate this in comments. I agree with Maciej. There's no way this should be the only option. Again, we describe the generic protocol based on an ID, and discuss, in one place in the doc, how to choose that ID.
- MG comment:
contentSize - I'm ok with the idea. If it's well tested in Gnutella. My only concern is what to do if contentSize of particular LS changes (new metadata comes). How does it influent on leader election. How to prevent passing leadership from one to another and back. This wasn't described in the document clearly (or at all).
- MS comment:
I looked at the doc including the comments. I agree with Maciej that the proposed metric for leader election isn't perfect (...) In the first proposal, we used the ip address + a priority to allow an admin to specify or to degrade to random. Jason's proposal to follow the gnutella model is good, but there are issues. The key is that we can describe the protocol and have N ways to compute the identifier. As long as a given scope does it the same way it doesn't matter.
- JB comment:
In my opinion - the leader election 'identifier' should not be something that can change too easily. If we go with the gnutella model, two LS's that are close to the same size could effectively 'fight' over the leader token. I would prefer the algorithm determine a more stable result. (We could go with the gnutella model and add some kind of *inertia parameter* to make it more stable. For example, the previous election winner could get some small number of bonus points.) Additionally, I would prefer not to use different 'identifier' algorithms at different scopes. The same LS instance is operating at multiple scopes and I would prefer to keep this simple until/unless it is proven that a single algorithm is insufficient.
- MS comment:
(From the comments inside document) My laptop with 2 elements might not be less busy than an 8 core superbox with 10K elements.
- JZ comment:
All references to contentSize (algorithms and examples) are removed. Because I am unclear on the direction we are taking this now, this is all I can do for this section. I will await MS's additions finalizing the ID based approach.
ID computing in isolated function
A (all): Agreed to be an isolated function.
- MS comment:
(...) code to compute the ID can be isolated into a function.
- MG comment:
Do we all agree? For me that's clear. But we need such description in the document - algorithm of computing the ID
- JB comment:
I agree that we should isolate this and do some implementation testing. However, I believe *this needs to be decided and finalized* in the document *BEFORE* public deployments. I'm concerned with trying to change this once we have deployments - it could be very complex to in in a backwards compatible way.
- MS comment:
There could be a static ID based on a scope (in my little world, my biggest machine is always the leader, but in a larger scope, it's dynamic). I think that with a URI (or a namespace?!?) to Identify the "ID scheme" in use, that we could slot another function in easily.
- JB comment:
I agree, the administrator defined priority is a prime example of why you might define the 'id' algorithm differently at different levels. I also agree that it can be isolated as a function for some experimentation.
- JB comment:
The more I think about this the more I believe the algorithm should be the same for all scope levels. The parameters that are used at each level (say an administrator defined portion) can be different. But, since a different instance can ascend to supremacy at any time, you need all instances knowing all algorithms.
Join: Broadcast summary after join vs. waiting for token, see 2.3.1, 2.3.3
A (MS): To send his thoughts to MG A (MG): Work on final solution based also on information received from MS. [done]
- MG comment:
I am still not convinced to sending broadcast summary when joining. It seems to be good idea, it's well motivated in the description. I can see more advantages of such solution than disadvantages, but my concern is how we prevent false 'registrations'. There must be such mechanism as 'community' password/hash or something. I am not sure whether our AA will solve this problem. Perhaps such "hash" should be generated by Leader and passed inside token between LS ring? Then the LS candidate joining the ring would be given the password and use it in broadcasting his summary (at the join phase). Otherwise ring participants won't know who is allowed to join the ring (and from whom to accept summary information, especially when it's not in the token)
- MG comment:
Section 2.3.3 contains two approaches. I think we should choose only one for dLS and argue why.
- JB comment:
I agree there is a possible DOS vector here. It is likewise true for normal service registrations. We should have a common solution for both.
- "MS comment"
I hadn't thought about security here. I think that we need to solve the security problem separately. Even if an LS should wait for the token, there's no good way for the other LSes to verify that it has it. I think that
immediate registration is good for responsiveness and that we should solve security in another way.
- JZ Comment:
This is listed as done above, but this is an ongoing discussion in e-mail (unless the thread has been dropped).
Token Rotation Time
A (all): Agreed to MS's proposal of simple computations of TRT as above.
A (MS): To describe leader election mechanism.
- JZ comment:
(in algorithm) 4. LS1 waits for some amount of time
- MG comment:
As far as I understand JZ proposed constant amount of time rather than time evaluated dynamically. But my concern is that time set once may be too low in case of many new LSes come or holding token time increases (if many metadata). Then new token will be generated very often. In my opinion there should be an algorithm for computing token rotation time (I thought it was realted to TTRT). Next idea could be to start from pre-set time and then compute time basing on previous token rotation times (with assigning some weight. All nodes could save their last timestams into token and the leader would evaluate full-rotation time (x2 or something). It'd be smart and quite easy to be implemented.
- MS comment:
(...) there is some concern about the token rotation time. I agree with Maciej that computing the summary on the fly will cause delay. But again, I think that we can say that the leader computes the TTRT and for day 1, it can be as simple as "2 minutes X #nodes." Easy. Get more complicated as we need to. And again, that should not be a barrier to getting started.
Token algorithm - who updates what?
A (MG): To propose solution for the updates in token rotation algorithm [done]
- MG comment:
When receiving the token the LS must update its local peer list (LSRing) but it should update the token peer list as well (add newly joined LSes)
- JZ Comment:
This is listed as done above, but this is an ongoing discussion in e-mail (unless the thread has been dropped).
Deadlines
A (all): Deadlines are known to all and we do our best to keep them.
- JB comment:
My biggest concern is in getting the document finalized before we have real-world deployments. I'm all for test deployments before that to try out some of these things.
- ST comment:
Our target date for prototype is end of this year and we must do our best to keep it. In the other case Nicolas will kick us off the project (the next release which should include dLS is planned even earlier)
- MG comment:
We need to address all open issues as quick as possible and begin the implemnetation (even if we do not finalize all things)
- JZ comment:
Because we will be working independently for the most part I am not in favor of leaving parts of the document unspecified (even a rough level of specification is very necessary). I understand there are deadlines to meet but creating sound, inter-operable software should be the most important aspect we are trying to achieve, and the specification is the first step in this.
- JB comment:
I agree we should do our best to keep to timelines.
Discovery Algorithm
A (JB, MS): To work on a more detailed description of discovery algorithm.
- new comment:
In fact, new section
- JB comment:
Success must *always* also return the referral to the next higher scope. Otherwise, a positive match in the local scope will hide any potential matches in a remote scope.
- "MS comment"
+1 except, it we should stop referring "up" when we get to someone who know there's nothing up there. The lower scope leader will know. (Here's a place where lower and upper don't make sense and where local and global make more. If the "upper" scope is internet2.edu and I'm in slc.internet2.edu, then I have to go to something that touches the global scope.
It might make sense to add something to the summarization algorithm here. If the leader LS sees something on both upper and lower scopes that match, it sends a pointer annotation down.
