Posts Tagged svn

Cache implementation reloaded (II)

Last week I talked about many changes in the cache implementation. Well, yesterday I made more changes.

My proposal on Summer of Code was based on using an embedded database. But I changed the algorithm that builds the graph and the queries against the database also changed. These queries now are very simple so I’ve decided to remove the database dependency and use binary files instead. Now I’m using two files to store the log messages:

LogMessages file

This file stores all log messages in a binary format. I use RandomAccessFile to read and write in it. RandomAccessFile lets me move to a certain byte position with its seek() method. Also it lets me know in which position I am, thanks to the getFilePointer() method.

Revisions file

This file indexes the previous file. It stores the offsets of each log message in the logMessages file. This is, from byte 0 to 7 it saves the offset of the first log message; from byte 8 to 15 it saves the offset of the second log message; and so on. That is, for the Nth revision the offset is stored at position (N-1)*8.

If I need to read the log message of a certain revision this is more or less what happens:

revisions.seek((revision-1)*8);
long offset = revisions.readLong();
logMessages.seek(offset);
logMessage = readNext();

Very simple, very fast 🙂

With this file it’s also very simple to know which is the latest revision number stored:

long latestRevision = revisions.length() / 8;

Now it takes about 3 seconds in my computer to build a graph from a repository with 4,000 revisions. The later implementation took 8 seconds. Also the database size is lesser and no initialization is required (the embedded database took about 10 seconds to initializate).

My plan now is to improve the graph output (maybe using Zest) and try to include merge information.

Leave a Comment

Cache implementation reloaded

It’s several days since I don’t talk about my task on Google Summer of Code. I’ve been having a great discussion in the development mailing list with Stefan Fuhrmann. Stefan has developed the revision graph for TortoiseSVN. Thanks to him I noticed my implementation didn’t scale. The algorithm wasn’t lineal, it had and exponential performance. These were very bad news.

The problem was with deleting or copying directories. When I got a “D /trunk/subfolder” I was marking all subfolders and subfiles as deleted too in that revision. Clearly the performance degrades when the number of files grows. And obviously this is not acceptable since copying a directory is very common: is what is done when you create a branch!

During this week I’ve rewritten most of the code. Now the cache follows these steps:

  1. Store all log messages without processing them and using batch updates. Since it doesn’t process the messages (i.e. mark any calculation about subfiles or subfolders) and it uses batch updates this step has a lineal perfomance and is faster than in the older implementation.
  2. Given a path and revision number it finds the first revision and path of that file (i.e. the root branch). It finds the first “A” action that is not a copy.
  3. From the root branch it iterates over the next revisions finding branches.

Now the performance meets my expectations. Steps 2 and 3 take about 8 seconds in a repository with 4,000 revisions. Both steps just make two queries against the database. They don’t execute any write statement so they are very fast. All the graph is calculated in step 3. The graph information is holded in memory and is serializable so it gives me an opportunity to save it in a file for a future use. That’s currently not implemented but I’m discussing it in the development list. Also I’m thinking a way to avoid the use of the embedded database since the current queries and insertions are very simple and probably I could use just files to store the information. I would like to remove the database dependency because it takes about 10 seconds to be initalizated (i.e. making the getConnection() call).

Leave a Comment

Integrating the graph into Eclipse

Until now the graph was shown in a sepparate window. Now I’ve integrated it as an Eclipse View.

Also, now the code runs as a background task. Thanks to Steve Elsemore 🙂

Leave a Comment

The cache design

The biggest challange in my GSoC project is the design and implementation of a cache system. Why a cache system is needed? For two reasons:

  1. To show the revision graph of a file I need a lot of information. If the file is at revision 123 I need to ask the repository for its history from the base revision to the 123th revision. Next time, suppose the file is at revision 234, so I’ll need to ask the repository for its history from the base revision to the 234th revision. But, I’ve asked twice for the information from the base revision to the 123th revision. This is, the information I asked will never change. I should only ask for updates.
  2. The second reason is the most important one. There is no way to ask SVN for the whole revision history of a file. I mean, its history and the history of its branches. If you ask for /trunk/Foo.java, SVN won’t tell you anything about its branches. If you ask for /branches/2.x/Foo.java SVN won’t tell you anything about other branches or changes in /trunk/Foo.java since the 2.x branch was created.

So what to do? The solution appears to be building a cache system. The cache will store information previously asked and also will store the information about branches in order to query easily for the whole history of a given file.

My cache design is based in an embedded database (Apache Derby) with the following structure.

The “files” table

I’ve decided to give a unique ID to each file. The same ID is shared with all its branches. So /trunk/Foo.java and /branches/2.x/Foo.java have the same ID. This table tells you for example:

/trunk/Foo.java was added in revision xxx, deleted in revision yyy and has the ID 123

With this table you can ask for exmple

Which ID has the file /foo/bar.c in revision 123?

And when you get the ID of a file you can ask for all its branches and when they were created.

The other tables

There are other two tables: revisions and change_paths. These tables just store all the log information: author, revision number, date, action,…

The algorithm

When someone wants to see the graph of a file. I follow these steps.

  1. Fetch all the log information from the repository. Or only the updates from the last revision that was stored in the cache.
  2. Store the fetched information in the cache. Calculating IDs and branches.
  3. Query the cache for the *whole* history of a file. This is a simple query: Give me all the log messages of the file with ID xxx.

The problem is that the first two steps can be high time comsumption tasks. But in my opinion the big goal is to make the third step as fast as possible. As I learnt in database design the queries are the most important. However I’ve thought about possible changes to improve performance.

Possible alternatives

To only use the “files” table. Storing just the branching information will make the cache smaller. With the information of this table you know when files were added, deleted and branched. The information about authors, comments, dates and M (modify) actions will be fetched on demand. This approach could have two steps. First building a simple graph with the branches and then the information about authors, comments,… could be loaded asynchronously.

To only store the log messages of selected files. This is, not to store all log messages but only the log messages of those files that are required to show the graph. This is the same than the previous approach but storing the log messeges once they have been fetched from the repository.

These approaches make the cache smaller. But they need to connect to the repository several times. I will only change my main approach to one of these if queries are slow.

My focus now is to resolve some bugs and improve the graphical part. This is, I want to integrate the graph into an eclipse view / editor and make it work with small but real-life repositories. I also want to show a progress bar while the cache is fetching information from the repository or while is calculating the IDs and branches.

Leave a Comment

First commit

After the initial import, the first commit comes. Before the weekend I did my code to work with a sample repository. Now I’m trying to make it work with a bigger repository: the subclipse repository, with near to 4.000 revisions. I found some bugs and I’m fixing them. For example the replace (“R”) action wasn’t handled at all.

I’m going to test my code on larger repositories. I want to know how long it takes to process all the data with the current implementation. I have several ideas about improving the performance. But all depends on the performance of each step:

  1. Fetching the information from the repository.
  2. Calculating branches and storing the information in the cache.
  3. Querying the information once the cache is full of information.

Depending on how long it takes each step I will make some improvements or others. I think the most important is the third step because the first and second ones will be big time consumption tasks only for the first time. In advance the first step will only ask for udpates, so the information will be much less compared with the first time. I will post some metrics and will discuss my thoughts in the sublclipse-devel mail list.

In the last post, as Mark noticed, I made a big mistake: posting a screenshot from TortoiseSVN instead of making a screenshot from subclipse 😛 I use TortoiseSVN very much because I use developing environments others than Eclipse when developing in programming languages such as Python or PHP, so I don’t always use subclipse 😛

So here is a screenshot of my code from Eclipse with Subclipse.

Leave a Comment

Initial import

Two days ago I got access to the subclipse repository and I have just imported my current code.

Thank you very much Mark!

Comments (3)

First screenshot

This is the first screenshot about my progress in the Google Summer of Code.

Of course there is a lot of work to do. This is just the first visual thing I can show you. I have been working with the Draw2d API and I have tried to do something similar to the RevtreePlugin. Each column is a branch and each change is ordered by revision number from the top to the bottom. At the top you can see the names of the different branches.

My project at Google Summer of Code can be divided into two tasks.

  • The first task is the “cache”. Since I need a lot of information from the repository I need a cache system. I am developing a cache using an embedded database writen in pure Java: Apache Derby. By the moment the cache works, but at the moment is very simple. I will need to do some perfomance enhancements to be able to work with real-life repositories. I will talk about this in a future post.
  • The other task is the graph. Once we have taken the information about the revisions of a given file I need to show it. This is what I’m showing you. I have spent some time reading this article and playing with the Draw2d API and that is the first result. You can see it like the first version of the Wikipedia (something that will be good in the future, but not at this moment 😉 ). As you can see the graph is shown in a separated window. However it should open inside an Eclipse View. So this is work in progress :P.

Leave a Comment

Setting up the environment

These are the steps I took to have my environment ready for coding.

  1. At first you need to checkout the subclipse code. Here you’ll find instructions about accesing the source code repository. Use ‘guest‘ for username and password.
  2. Make a clean installation of eclipse. Since sublicpse is a plugin for Eclipse I downloaded Eclipse Europa for RCP/Plug-in Developers.
  3. Sublicpse itself is also needed.  So install it.
  4. Import the projects under trunk/subclipse with File -> Import… -> General / Existing Projects into Workspace. Then choose the trunk/subclipse folder.
  5. The build of the project org.tigris.subversion.clientadapter.svnkit will fail, but it doesn’t matter. SVNKit is pure Java library for accessing SVN repositories. By default (at least in Windows) JavaHL is used instead. You can change this in the preferences page: Window -> Preferences -> Team / SVN.

Once I set up the environment I made some changes in the source code and I tested it. I think the easiest change is adding a component in the preferences page. The source code for the preferences page can be found in the org.tigris.subversion.subclipse.ui.preferences.SVNPreferencesPage class. I added the second line in this example code:

showCompareRevisionInDialog = createCheckBox(composite, Policy.bind("SVNPreferencePage.showCompareMergeInSync"));
createCheckBox(composite, "foo bar");

Now you need to test that this code works. To do this open the plugin descriptor file (plugin.xml) for the org.tigris.subversion.subclipse.ui project. You’ll see an editor pane with many tabs at the bottom. In the default tab (Overview) there are two options for Testing: “Launch an Eclipse application” and “Launch an Eclipse application in Debug mode”. Click one of them. A new Eclipse application will be launched with your code changes. Now open the SVN Prefrences Page to see the changes.

Leave a Comment