Archive for August, 2008

Cache implementation tests

I have just committed the test suite I’ve been using in order to test the cache implementation. You can find it at:

I have created a basic plain text format to simulate repositories and perform queries to the cache. This is an example:

A	/trunk
A	/branches
A	/tags
A	/trunk/project
A	/trunk/project/foo.txt
M	/trunk/project/foo.txt
A	/trunk/project/bar.txt
M	/trunk/project/foo.txt
M	/trunk/project/bar.txt
D	/trunk/project/foo.txt
test	/trunk/project/foo.txt	3
3	A
4	M
6	M
7	D

The first line clears the cache. The second line says that the cache is going to be loaded with data. The following lines contain the log messages that will be saved into the cache. The lines with a number start a revision. That number is the revision number. The other lines contain the changed paths information. For example A /trunk indicates that the /trunk folder is created (A). The author, date and log messages are not indicated. The action (A) and the path (/trunk) are separated with a tab. Every argument in each line is separated with a tab. To finish loading data into the cache the update command is used. Some change paths require more information. For example:

A	/branches/1.0	/trunk	2

That changed path indicates that the folder “/branches/1.0” is copied from “/trunk” at revision “2”.

To test the queries a against the database the test command is used. The test keyword is followed by a path and a revision number that identify the file. Then the cache creates a graph of the specified file in memory. To test verify the graph is ok all branch paths are listed followed by their nodes. This is:

3	A
4	M
6	M
7	D

Those lines tell that the branch /trunk/project/foo.txt should have the following nodes: A at revision 3, M at revision 4 and 6 and D at revision 7.

The test files are stored in a folder called “testfiles” and to run them you just need to run test.CacheTest using JUnit.

Note: the code depends on the following projects:

  • org.tigris.subversion.clientadapter
  • org.tigris.subversion.subclipse.core
  • org.tigris.subversion.subclipse.graph

Leave a Comment

Near to the end

The Google Summer of Code is finishing. The final evaluation starts next Monday and it finishes 1st September according to the timeline.

I think I’m doing a good job. I’m going to continue working on the revision graph these weeks and I also would like to continue working on it after the GSoC. It’s being a great experience.

This is a screenshot that shows the current status of the project.

Click to enlarge

Comments (1)

I was at Campus Party 2008

Last week I was invited to Campus Party 2008 by Google as I told in a previous post. Thanks to my friend plunchete, that put them on touch with me. I got two invitations so Daniel Latorre came with me.

I met Clara Rivera from Google Spain and Joaquín Cuenca co-founder of Panoramio (bought by Google). And we also were invited to the Google Developer Day on September in Madrid.

There were also other important people like Jon “maddog” Hall, and Tim Berners Lee.

Daniel, like me, is also working on the Google Summer of Code so we talked a little about our experience and the timeline of the program at the Google stand. The video was recorded and here it is (in Spanish, starting at minute 4).

In other areas there were talks about programming with the iPhone, how to create a startup, JSF with Oracle,… Some of them very interesting. It was a great experience.

Here you can see some photos and more photos.

Leave a Comment

Cache implementation reloaded (II)

Last week I talked about many changes in the cache implementation. Well, yesterday I made more changes.

My proposal on Summer of Code was based on using an embedded database. But I changed the algorithm that builds the graph and the queries against the database also changed. These queries now are very simple so I’ve decided to remove the database dependency and use binary files instead. Now I’m using two files to store the log messages:

LogMessages file

This file stores all log messages in a binary format. I use RandomAccessFile to read and write in it. RandomAccessFile lets me move to a certain byte position with its seek() method. Also it lets me know in which position I am, thanks to the getFilePointer() method.

Revisions file

This file indexes the previous file. It stores the offsets of each log message in the logMessages file. This is, from byte 0 to 7 it saves the offset of the first log message; from byte 8 to 15 it saves the offset of the second log message; and so on. That is, for the Nth revision the offset is stored at position (N-1)*8.

If I need to read the log message of a certain revision this is more or less what happens:*8);
long offset = revisions.readLong();;
logMessage = readNext();

Very simple, very fast 🙂

With this file it’s also very simple to know which is the latest revision number stored:

long latestRevision = revisions.length() / 8;

Now it takes about 3 seconds in my computer to build a graph from a repository with 4,000 revisions. The later implementation took 8 seconds. Also the database size is lesser and no initialization is required (the embedded database took about 10 seconds to initializate).

My plan now is to improve the graph output (maybe using Zest) and try to include merge information.

Leave a Comment

Cache implementation reloaded

It’s several days since I don’t talk about my task on Google Summer of Code. I’ve been having a great discussion in the development mailing list with Stefan Fuhrmann. Stefan has developed the revision graph for TortoiseSVN. Thanks to him I noticed my implementation didn’t scale. The algorithm wasn’t lineal, it had and exponential performance. These were very bad news.

The problem was with deleting or copying directories. When I got a “D /trunk/subfolder” I was marking all subfolders and subfiles as deleted too in that revision. Clearly the performance degrades when the number of files grows. And obviously this is not acceptable since copying a directory is very common: is what is done when you create a branch!

During this week I’ve rewritten most of the code. Now the cache follows these steps:

  1. Store all log messages without processing them and using batch updates. Since it doesn’t process the messages (i.e. mark any calculation about subfiles or subfolders) and it uses batch updates this step has a lineal perfomance and is faster than in the older implementation.
  2. Given a path and revision number it finds the first revision and path of that file (i.e. the root branch). It finds the first “A” action that is not a copy.
  3. From the root branch it iterates over the next revisions finding branches.

Now the performance meets my expectations. Steps 2 and 3 take about 8 seconds in a repository with 4,000 revisions. Both steps just make two queries against the database. They don’t execute any write statement so they are very fast. All the graph is calculated in step 3. The graph information is holded in memory and is serializable so it gives me an opportunity to save it in a file for a future use. That’s currently not implemented but I’m discussing it in the development list. Also I’m thinking a way to avoid the use of the embedded database since the current queries and insertions are very simple and probably I could use just files to store the information. I would like to remove the database dependency because it takes about 10 seconds to be initalizated (i.e. making the getConnection() call).

Leave a Comment