It’s only been a couple of weeks since the last update about Review Board, but too much is happening to stay quiet.
Ever go through a long patch review process only to find it’s getting harder and harder to review any new changes due to the size of the patch? I’m sure most large projects have had to deal with this at one point or another.
Review Board can now display interdiffs, which are diffs between versions of diffs. This makes it easy to see what was modified since the last change. Currently, there’s no UI for this, as there’s still a few backend changes that need to be made before we allow commenting on interdiffs, but it’ll be a full, proper feature very soon.
New Diff Algorithm
We were previously using Python’s SequenceMatcher, along with some hacks, to generate the diffs. SequenceMatcher, unfortunately, just isn’t a desirable diff generator, due to its algorithm, quirks (assuming every change between matching blocks consists entirely of inserts, deletes, or changes instead of a mix of them), and lack of extensibility (no leading whitespace trimming to simplify diffs containing lots of indents).
I’ve known for a while that we needed a new algorithm, so I finally sat down to write one. The obvious choice was the algorithm used in GNU diff (amongst other programs) — Eugene Myers’ O(ND) Difference Algorithm. After a few frustrating nights that left me nearly bald (ok, not really), it worked! And the diffs were much better than the older diffs. I think this is the first Python implementation of this algorithm.
This also gives us the ability to add some heuristics and other logic to clean up the diffs. The result is much more readable diffs that no longer show indentation changes or huge misaligned fragments.
Improved support for project maintainers
Project maintainers can now quickly download the uploaded diff from the diff viewer in order to apply the patch to their own tree. They’re also able to mark contributors’ review requests as submitted. While small additions, these make Review Board much more useful to open source projects.
Better browser and SCM compatibility
We’re working on better browser compatibility and better SCM compatibility.
CVS is almost implemented. Contributors are working on Git and BZR support. We’re working on the best ways to get these distributed systems integrated cleanly into the system.
Subversion post-review script
We just received a patch for adding a Subversion-specific post-review script. Soon, we’ll combine the two and add support for CVS and other systems. This will give us one tool for creating review requests in all systems.
We’ve had a number of good contributors come out of the wood work, and from the sounds of it, several companies are using it now. I’m planning to put up a page listing teams, companies and projects using Review Board, so let me know if you’d like to be on the list! 🙂
Maybe you could have made use of the diff code that Google published a week or so ago. I think they might have beaten you to it!
It could have saved you some work … oh well. Maybe yours is better. 😉
“This also gives us the ability to add some heuristics and other logic to clean up the diffs. The result is much more readable diffs that no longer show indentation changes or huge misaligned fragments.”
How will one know if a diff trashes the indentation or leaves whitespaces all over the place?
Apan: Very valid point. The whitespace issue isn’t a problem unless it’s leading whitespace. Trailing whitespace, as well as lines containing only whitespace, are left alone.
The issue of trashing indentation is a potential problem. There are things we may be able to do, such as mark up indentation-only changes specially while keeping them out of the actual diff chunks. It’s also possible for us to add an option for showing the diff with/without whitespace changes.
So far, the potential for a cleaned diff far outweighs the risks of a change completely screwing up indentation, and there are definitely things we can do about that.
It’s too bad I couldn’t bring it to your attention before you spent the effort to implement the GNU diff algorithm, but there’s a diff algorithm that a couple VCSes are using that gives really good results (when used on source code). Basically, rather than doing a longest common sub-sequence, it finds all lines that occur exactly once on both sides and matches on those first. Then it subdivides the file based on those matches and uses the same algorithm on those sub-parts (i.e. matching on lines that are unique within that section of the file). This algorithm seems to have a much higher probability of producing short, readable diffs, and a much lower chance of producing a diff that seems “wrong” to a human.
It was developed by Bram Cohen. You can find one Python implementation of it included in the file attached here (which is actually a merge algorithm that uses this diff algorithm internally):
Bram mentioned recently that bazaar is now using it:
Ken: It’s interesting that you said that. As I was going to sleep last night, I started wondering about that same approach and figured it would be especially good to find out if lines have been moved, so that we could mark them up in a certain way. I have some notes with almost that exact approach, but I wasn’t sure how much better it would actually work. If this is indeed being used in other VCSes, maybe I’ll give it a shot.
Thanks for the links!
ChipX86: Bram’s algorithm will give good diff results for cases where text has been moved, but it will still appear as “lines deleted here, lines added there.” The main thing it avoids is the kind of confusion GNU diff can produce, like “these curly braces on lines by themselves are unchanged but everything around them is different.”
P.S. FWIW, the merge tool I personally rely on (and a bunch of microprocessor designers at Intel rely on) is using Bram’s diff algorithm (actually the whole precise Codeville text merger which he developed).
It is likely that Apple will be using Review Board to some extent internally, although, as with most things at work, I can’t really talk about it.
Mercurial has an improved diff module as they experienced some problems with the difflib Python module. It is faster and gives good results in terms of patch size: http://www.selenic.com/mercurial/wiki/index.cgi/FAQ#head-4173e68b299fdcfb38baa08f08c3fbb1868e7d1a
It also uses a C module for the speed-critical portions and it can be found here: http://selenic.com/repo/hg/file/tip/mercurial/mdiff.py