« Robot Chicken : Season 1 | The Accent Archive » |

I was helping a co-worker puzzle out an algorithm or two today, and he got me to go back through the steps I took to come up with an appropriate algorithm. It's a skill I have that comes from the combination of experience and a little bit of ability, and it's to the point where I'm actually not sure how the solutions even pop out of my head.

It's a skill I'd like to help others acquire, though, since putting together workaday algorithms can be really helpful in day-to-day development work.

So I'll try my best to explain what was going on, and intersperse some general commentary in between.

For this particular problem, take two XML trees:

*Source Tree*

Root

+-Users

..+-User

+-Export

..+-Artifact run-before=false

..+-Artifact

*Target Tree*

Root

+-Users

..+-User

+-Export

..+-Artifact

The objective of the code was to fill in the blanks of the target tree, adding any attributes (like 'run-before=false') to the target, and filling in any missing items.

The code to do this had actually been working pretty well before, but since items were now allowed to have attributes, you could actually have two of the same kind of item in these trees. Looking up items by name is now no longer sufficient, since asking for Artifact twice simply by name in the first tree will return the first one both times.

This leads me to my first rule of thumb:

#1 Algorithms that assume only *one* item can be involved can break down horribly when you now need the involvement of *multiple* items

This can totally catch you off guard. If you code for a loading up a single configuration and now require multiple configurations to co-exist, or you code up something that happens once in the lifetime of an oil well, and new business practices suddenly require it to be done on a semi-daily basis, you will find out how fragile your algorithm is.

Back to the problem at hand, I asked whether the processing that needed to be done could be done at any node, that is, could we consider just one node at a time. The answer was yes, which leads me to number two:

#2 If your algorithm loops or recurses, but the processing you need can work independently with a single item, take the looping or recursion as given and work on the process that happens on each item

(I'm sure there's a way I could more succinctly put these, but hey, it's not a best-selling book :) )

We started cracking out a few cases:

(A) Source item has no children

(B) Source item has one child

(C) Source item has multiple children

..(a) Source item's children are unique

..(b) Source item's children are not unique

When we looked at the problem, and laid out how a copy from target to source would work, number of children was not the contentious issue, because it worked exactly the same for multiple children as for no children:

1. "If Target item is missing a particularly-named child, then add a copy of Source item child to Target item"

2. "If Target item has a particularly-named child, then copy attributes from Source item child to Target item child"

If the Source item had no children, it did not need to bother. If the Source item had one child, it would perform these steps. (A), (B) and (C)(a) collapse into one feature, since they can all be phrased in the manner "for every child under the Source item", whether that be no children, one child, or many children.

So rule number three:

#3 Any cases that differ by number, but not behaviour, should be collapsed into one case

It looked here like the real issue was with children with duplicate names. At this point, we start looking at scenarios involving this.

(i) Source item has no children, Target item has one child

(ii) Source item has one child, Target item has one child

(iii) Source item has two non-unique children, Target item has one child

(iv) Source item has two non-unique children, Target item has two non-unique children

(v) Source item has three non-unique children, Target item has one child

Rule number four:

#4 Your scenarios usually only need a difference of two to simulate a difference of many

What that means is that, for example, if I proved that my algorithm works for case (v), where Source item has *two* more children than the Target item, then chances are it would work even if Source item had 159 items more than that. A difference of one usually isn't enough to discover many of the algorithm's problems.

Upon asking further, I found out that he wanted the algorithm such that the first non-unique child of the Source would map to the first non-unique child of the Target, second to the second, etc. If the Target did not have a corresponding non-unique child (so, say, Source has two [artifact]s, but Target has only one), he wanted one created and cloned from the Source. This check could be done for every item:

1. "If Source has an *n*th child item of the same name, and so does the Target, copy the attributes from Source child to Target child"

2. "If Source has an *n*th child item of the same name, but the Target does not, create a Target child item and copy it entirely"

Getting down to the nitty-gritty, we should set aside such functions as "copy the attributes from Source child to Target child" as functions to be implemented. So, number five:

#5 Find out what simple functions your algorithm needs to do, and set those aside as functions to be created

We get down to the point here where our checks now need to know information that may not be directly available. In this case, we now need to know:

- Given a child item, what is its position in the list of child items of the same name
- Given a position and a name, what is the specific child item

As it turns out, there seemed to be enough information present to answer these questions. Each node can find out its previous sibling and next sibling, and that's enough to tell us. With those two functions, we can now ask a child item "which [artifact] are you?", to which it can answer 'first', 'second', etc. This gets us past the problem that we could not determine just by *name* which item we were referring to.

So, a rule of thumb number six:

#6 Find out what questions your algorithm needs to ask, and set those questions aside as functions to be created

So all in all, we have the big recursion to implement, a couple of functions to implement for copying and adding, a couple of functions to ask about children and their order in the list, and now we've got one final step to do: see if our algorithm to handle special case actually covers off other cases.

It turns out that it does, for if you take the logic of our special case, it turns out to apply to the normal cases as well. The question "which child are you?" will simply always return "the first". This means that we do *not* have to separate out the "non-unique children" case from the "all-unique children" case any more. This is **good**.

Next to last rule:

#7 Take your special case algorithms and see if they apply to other cases. If so, and no performance concerns are present, take out the normal case.

...and the final one (for now):

#8 If you can combine your special case and other case algorithms with a little extra effort, *do* it

Algorithms you put through #7 and #8 are ones that can often survive future change a lot better.

My apologies in advance for the poor wording :) As I say, it's something that can be pretty hard to explain when you do it nearly automatically. If I have further insight, I will keep you posted :)