taskAdun - Tasks: task #6928, Speedup Wildcard Interaction...

 
 
Show feedback again

You are not allowed to post comments on this tracker with your current authentification level.

task #6928: Speedup Wildcard Interaction Search Section

Submitted by:  Nils Drechsel <nilsdrechsel>
Submitted on:  Tue 26 Jan 2010 03:16:16 PM UTC  
 
Should Start On: Tue 26 Jan 2010 12:00:00 AM UTCShould be Finished on: Sun 21 Feb 2010 12:00:00 AM UTC
Category: UL Priority: 1 - Later
Status: In ProgressPrivacy: Public
Percent Complete: 60%Assigned to: Nils Drechsel <nilsdrechsel>
Open/Closed: OpenPlanned Release: Adun 0.83
Effort: 0.00

(Jump to the original submission Jump to the original submission)

Wed 27 Jan 2010 10:14:41 AM UTC, comment #6:

Alright, I'll implement your approach. If it's still too slow, I'll add mine.

Nils Drechsel <nilsdrechsel>
Project MemberIn charge of this item.
Tue 26 Jan 2010 05:31:27 PM UTC, comment #5:

Yes - if an interaction was on the bottom of the list you would have to iterate over all of them - but this would only happen to a few. A similar amount should be near the top of the list. This is also the way it was before the wildcard matching was added and then the build only took a couple of seconds at most (rather than minutes).

So if current the time m*n*x is (m iterations for n atoms each iteration taking x time). And I want to make it a hundred times faster i.e. (m*n*x)/100 - I can do it either by reducing the number of iterations by a factor of a 100 for each atom (maybe the tree would do even more? - but m is unlikely to be bigger than a couple of hundred which limits the speed up there) or I could reduce x by a factor of 100.

The reason I would do the X first is that from the profile it seems it is the string operations, not the number of iterations, that is main expense. I'm sure that simply removing all these operations would radically speed up the search - it would be similar to just commenting out the wildcard check and if i do this here the speed goes up dramatically (the search for multiple torsions then becomes the main bottleneck)

i'm sure that having a more efficient search would give further speed-up but I think getting from a couple of minutes to a couple of seconds is alot, and it would require very little extra programming.

Michael Johnston <mjohnston>
Project Administrator
Tue 26 Jan 2010 04:57:36 PM UTC, comment #4:

This would speed up the comparison itself, but it would not reduce the number of comparisons, or am I wrong?
If the interaction I'm interested in is at the very bottom of the interaction list, I still have to go through all the others, because I search sequentially for a match. I'd like to access those elements like a tree (as if we would not use wildcards at all)

Nils Drechsel <nilsdrechsel>
Project MemberIn charge of this item.
Tue 26 Jan 2010 04:38:54 PM UTC, comment #3:

It probably easier if I explain a bit what happens when the FFML is parsed.
As the program runs through the FFML it instantiates classes for each XML element (depending on its type). What I propose is modifying ULParameterNode (the class for parameter FFML elements) so on instantiation it checks
1 - If it is a wildcard
2 - The types of each of the atoms in the interaction.

Then you add three new methods to ULInteractionNode
/**
Returns true if the parameter entry describes a wildcard interaction
*/
- (BOOL) isWildcard;
/**
Returns the type for the i'th atom in the interaction. This is
ULStandardAtomType - A normal atom
ULGeneralWildcardType - An atom type of X
ULSpecificWildcardType - An atom whose name contains *
*/
- (ULParameterAtomType) typeForAtom: (int) atomIndex;
/**
Returns True if string \e type matches the type of the interaction atom at index \e atomIndex.
It matches if
- The type at atom index is ULGeneralWildcardType
- The type at atom index is ULSpecificWildcardType and the first characters of the two atoms are the same e.g. C* and CA
- The atomTypes are identical
*/
- (BOOL) atomTypeAtIndex: (int) atomindex matchesAtomType: (NSString*) type;

In the last method you can have cached what the first character is of each ULSpecificWildcardType so you don't have to call
substringWithRange: all the time.

Then in the ULInteractionsBuilder you go

//Replace checking if something is a wildcard with a call to the interaction instance
if([interaction is Wildcard])
{

}

Then when checking for a match you can just do (forwards and backwards)

atomMatchDetected = YES;
for(i=0; i<numberAtoms; i++)
{
myAtomType = [atoms objectAtIndex: i]
if(![interaction atomTypeAtIndex: i matchesAtomType: myAtomType]
{
atomMatchDetected = NO;
break;
}
}

Michael Johnston <mjohnston>
Project Administrator
Tue 26 Jan 2010 04:19:31 PM UTC, comment #2:

I'm not quite sure if I understand your idea.

I totally agree, that we have to reduce the number of comparison operations and the number of interactions checks which lead to no further improvement of the quality of the interaction.

So, you suggest, that we go through the interaction list one time and note which interactions have wildcards and which have none. When looking for a match, we quickly access the "pure" interaction list for a match and when none is found sequentially search the "wildcard" interaction list?

I think I'm getting it quite wrong.

Nils Drechsel <nilsdrechsel>
Project MemberIn charge of this item.
Tue 26 Jan 2010 03:40:14 PM UTC, comment #1:

Seems like an interesting idea, but running a profile on the code it seem that it is the string comparison operations that are most costly (finding if its a wildcard and checking what the types of each atom in the wildcard is)

Another way I would suggest is that when you parse the FFML library the first time you note which entries are wildcards, and what type each atom is in it. These can then be set as ivars of the ULInteractionNode instance associated with each parameter entry. Then instead of potentially (where m is number of wildcards and n is number of atoms)
- m*n checks to see is a parameter entry a wildcard
- up to m*n subsequent checks to see is atom i in the wildcard is an X
- up to m*n subsequent checks to see if atom i in the wildcard contains a *

You would only have to do each m times. This would not only remove a lot of operations but would give richer information about each parameter entry in the FFML through the programmatic interface (which could be used again in other situations e.g. an FFML editor etc.)

I think you should also move the check if you already have a strong match outside the check to see if something is a wildcard. Currently once you've found a sting match you no longer take any others - but you still check if any of the other wildcards match, leading to lots of redundant operations (as far as I can see)

Michael Johnston <mjohnston>
Project Administrator
Tue 26 Jan 2010 03:16:16 PM UTC, original submission:

I have a plan to create a custom tree-based datastructure which can be accessed like a dictionary, only that it allows fuzzy search, so that we can find interactions with wildcards in O(m * log n) instead of O(m * n) (which we have at present). I wanted to present it at the conference, but there was no time.
I want to use this datastructure again for other wildcard search problems (I have a few in GB-OBC).

Nils Drechsel <nilsdrechsel>
Project MemberIn charge of this item.

 

No files currently attached

 

Depends on the following items: None found

Items that depend on this one: None found

 

Carbon-Copy List
  • -unavailable- added by mjohnston (Posted a comment)
  • -unavailable- added by nilsdrechsel (Submitted the item)
  •  

    Do you think this task is very important?
    If so, you can click here to add your encouragement to it.
    This task has 0 encouragements so far.

    Only logged-in users can vote.

     

    Please enter the title of George Orwell's famous dystopian book (it's a date):

     

     

    Follows 1 latest change.

    Date Changed By Updated Field Previous Value => Replaced By
    Wed 28 Apr 2010 08:06:57 AM UTCnilsdrechselPercent Complete0%=>60%
    Show feedback again

    Back to the top


    Powered by Savane 3.1-cleanup