Posts Tagged performance

Cache implementation reloaded

It’s several days since I don’t talk about my task on Google Summer of Code. I’ve been having a great discussion in the development mailing list with Stefan Fuhrmann. Stefan has developed the revision graph for TortoiseSVN. Thanks to him I noticed my implementation didn’t scale. The algorithm wasn’t lineal, it had and exponential performance. These were very bad news.

The problem was with deleting or copying directories. When I got a “D /trunk/subfolder” I was marking all subfolders and subfiles as deleted too in that revision. Clearly the performance degrades when the number of files grows. And obviously this is not acceptable since copying a directory is very common: is what is done when you create a branch!

During this week I’ve rewritten most of the code. Now the cache follows these steps:

  1. Store all log messages without processing them and using batch updates. Since it doesn’t process the messages (i.e. mark any calculation about subfiles or subfolders) and it uses batch updates this step has a lineal perfomance and is faster than in the older implementation.
  2. Given a path and revision number it finds the first revision and path of that file (i.e. the root branch). It finds the first “A” action that is not a copy.
  3. From the root branch it iterates over the next revisions finding branches.

Now the performance meets my expectations. Steps 2 and 3 take about 8 seconds in a repository with 4,000 revisions. Both steps just make two queries against the database. They don’t execute any write statement so they are very fast. All the graph is calculated in step 3. The graph information is holded in memory and is serializable so it gives me an opportunity to save it in a file for a future use. That’s currently not implemented but I’m discussing it in the development list. Also I’m thinking a way to avoid the use of the embedded database since the current queries and insertions are very simple and probably I could use just files to store the information. I would like to remove the database dependency because it takes about 10 seconds to be initalizated (i.e. making the getConnection() call).

Leave a Comment

First commit

After the initial import, the first commit comes. Before the weekend I did my code to work with a sample repository. Now I’m trying to make it work with a bigger repository: the subclipse repository, with near to 4.000 revisions. I found some bugs and I’m fixing them. For example the replace (“R”) action wasn’t handled at all.

I’m going to test my code on larger repositories. I want to know how long it takes to process all the data with the current implementation. I have several ideas about improving the performance. But all depends on the performance of each step:

  1. Fetching the information from the repository.
  2. Calculating branches and storing the information in the cache.
  3. Querying the information once the cache is full of information.

Depending on how long it takes each step I will make some improvements or others. I think the most important is the third step because the first and second ones will be big time consumption tasks only for the first time. In advance the first step will only ask for udpates, so the information will be much less compared with the first time. I will post some metrics and will discuss my thoughts in the sublclipse-devel mail list.

In the last post, as Mark noticed, I made a big mistake: posting a screenshot from TortoiseSVN instead of making a screenshot from subclipse 😛 I use TortoiseSVN very much because I use developing environments others than Eclipse when developing in programming languages such as Python or PHP, so I don’t always use subclipse 😛

So here is a screenshot of my code from Eclipse with Subclipse.

Leave a Comment