Darin Keever

 

DiffSync

DiffSync was created for transferring large quantities of files across a bandwidth-limited medium.  It is FTP on steroids.

 

Why

Where I work, we have a huge database of files for our buildtree (10,200 files that comprise 450 MB of storage).  For the most part, it is easiest to build with every file at its most recent version.  This would usually mean scrapping the previous tree and starting from the beginning.  Over VPN, this could mean at least 4 hours just downloading the files.

 

Actually, I did a little experiment.  Using my LAN, I put all the files on one machine, then FTP’d them across to another machine – 2 hours.  I then tried zipping the files up first, sending the archive across, and unzipping it – 25 minutes.  Wow!  The real problem is the folder-by-folder “get” that FTP must do.  Just moving to each folder must account for most of the time.

 

The more I thought about it, the more I thought this must be the situation for most developers in a good build framework.  If not, then we seriously need to get ours open-sourced!  :)

 

When

Currently, the first implementation is under construction.  It is truncating files, and skipping others… so it’s far from finished.  But I thought I should find a good place to start and get some suggestions.

 

How

The basic algorithm is:

 

  1. Server starts
  2. Server adds appropriate shares
  3. Client connects to server
  4. Client queries server for its repository (the files it wants to share)
  5. Server sends an XML representation of the files it’s sharing (the FileInfoRepository)
  6. The client diffs this FileInfoRepository with some folder (maybe it already has an older version of some files).  This produces a new FileInfoRepository of all files on the server that are newer than the client’s current version.  This keeps the client from having to get files that it already has (or has a newer version of).
  7. The client sends this FileInfoRepository to the server.
  8. The server compresses the requested Files and sends them back to the client.
  9. The client unzips the files into the appropriate place.
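
Steps 5 through 7 above could be sketched roughly as follows. This is only an illustration in Python, not DiffSync’s actual code: the names `build_manifest` and `diff_manifests` are hypothetical, a plain dictionary stands in for the XML FileInfoRepository, and “newer” is assumed to mean a later modification time.

```python
import os

def build_manifest(root):
    """Map each file path (relative to root, '/' separated) to its modification time."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            manifest[rel] = os.path.getmtime(full)
    return manifest

def diff_manifests(server, client):
    """Return the server entries that the client lacks or holds in an older version."""
    return {
        path: mtime
        for path, mtime in server.items()
        if path not in client or client[path] < mtime
    }

# Hypothetical example data: path -> modification time (seconds).
server = {"src/main.c": 200.0, "src/util.c": 100.0, "README": 50.0}
client = {"src/main.c": 150.0, "src/util.c": 120.0}
wanted = diff_manifests(server, client)
# wanted holds "src/main.c" (client copy is older) and "README" (missing on the client);
# "src/util.c" is skipped because the client already has a newer version.
```

The result is what the client would send back to the server in step 7, so only the files it actually needs get compressed and transferred.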

 

From a higher level:

  1. User starts the first DiffSync instance, which becomes the server.  (Although each client is also a server, the first one must be just the server, as it cannot connect to anyone else.)  This is usually a machine that everyone can get to (either not behind a firewall, or accessible through one).  In my case, this is my home machine.
  2. User then sets the appropriate shares, if necessary.
  3. The user can start instances of DiffSync and connect to the server (these are my work repositories). 
    1. (Not yet implemented.)  Further, it can initiate a request through an instance.  In my case, let’s say I’m on my other home machine (not the server).  I can start DiffSync on the 2nd machine and connect to the server.  Then I can query the server for a list of its connections.  A better analogy:  A is connected to B.  C connects to A.  C queries A for its connections.  C can then try to connect to B.
    2. (Not yet implemented.)  Upon disconnect, it will poll the main server (the first server that it attached to) once every 5 minutes for 24 hours.  Not sure how this will affect authentication…
      1. No authentication until first “get?”

In my case, now I can get all my files from work I need… basically from anywhere.   Using another DiffSync, I can connect to the “server” (my home machine) and get to any work repository.
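
The reconnect polling described above (once every 5 minutes for 24 hours) is not implemented yet, but the loop could look something like this Python sketch. The function name `poll_until_connected` and the `try_connect` callback are assumptions made for illustration; the `sleep` and `clock` parameters exist only so the loop can be exercised without really waiting.

```python
import time

def poll_until_connected(try_connect, interval_s=300, max_wait_s=24 * 60 * 60,
                         sleep=time.sleep, clock=time.monotonic):
    """Call try_connect() every interval_s seconds until it returns True.

    Gives up and returns False once the next attempt would fall after max_wait_s.
    """
    deadline = clock() + max_wait_s
    while True:
        if try_connect():
            return True
        if clock() + interval_s > deadline:
            return False
        sleep(interval_s)
```

Where authentication fits into this loop is still an open question, as noted in the list above.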

 

It also has some optimizations. 

  1. During the “GetFiles” process, the client begins 3 threads.  The idea was that one would be sending data, while one receives data, and one determines the next “chunk” of the FileInfoRepository.  The threads are synchronized so that no two are in the same stage at the same time.
  2. The “chunk” is limited so that the process isn’t “stuck” in one of the above states.  This allows the server to begin sending before all the files are compressed.  Also, it keeps the client/server from taking up too many resources!
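
One way to limit the “chunk” is by total byte size. The Python sketch below is an assumption about how that could work, not a description of what DiffSync does today: the name `chunk_by_size`, the `(path, size)` pairs, and the byte budget are all hypothetical.

```python
def chunk_by_size(files, max_bytes):
    """Yield lists of (path, size) pairs whose total size stays within max_bytes.

    A single file larger than max_bytes is yielded as a chunk of its own.
    """
    chunk, total = [], 0
    for path, size in files:
        # Close the current chunk if adding this file would exceed the budget.
        if chunk and total + size > max_bytes:
            yield chunk
            chunk, total = [], 0
        chunk.append((path, size))
        total += size
    if chunk:
        yield chunk
```

Because this is a generator, the first chunk is available as soon as it fills up, which matches the goal of letting the server start sending before everything has been compressed.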

 

What’s Next?

  1. Security
    1. Right now, it’s mostly a public read-only folder!  There should be some FTP “rights” added.  Include integration with Server’s users’ accounts?
    2. Authentication must be encrypted.
    3. The files should probably be encrypted as well as compressed during transfer.
  2. Ability for servers to share more than one directory (although it already shares the entire directory tree)
    1. This is a limitation for ease of use… I need one folder, so it uses one folder!
  3. Add interfaces to Source Management Software (version control)
    1. This should be implemented to interface with all those crappy source management frameworks, like SourceSafe, CVS, etc.  Getting a buildtree from one of those takes forever!
  4. Add WinDiff like interface so that you only get the files that you want.
    1. Maybe after the diff, you get two trees… the new and the old?  Maybe it could be just like WinDiff and tree it out.  I dunno…   that’s where the suggestions part comes in.
  5. Add file restrictions to the GetLatest
    1. Like… *.c, *.cpp, etc., so you only get the files you are concerned about.  Not sure how that would work with the WinDiff stuff.
  6. Add more ease-of-use features
    1. On first use, it should ask for
      1. Port number of server
      2. If client apps can be downloadable
        1. I envision the app accepting HTTP requests and returning the pre-configured app (it will automatically connect to the server from which it was downloaded)
      3. What kind of security it will use

  7. Ability to initiate a connection from another server
    1. See item 3.1 under “From a higher level” above
    2. Further, maybe bridge connections?  A connects to B.  Determines B is connected to C.  So, A tries to connect to C but it can’t connect because both A & C are behind VPN.  So B can become the bridge between A & C.  Just an idea.
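
The file restrictions idea in item 5 (patterns like *.c and *.cpp) could be prototyped with glob matching on the file names. This Python sketch is only a suggestion: `filter_manifest` is a hypothetical name, and the dictionary again stands in for the FileInfoRepository.

```python
from fnmatch import fnmatch

def filter_manifest(manifest, patterns):
    """Keep only the entries whose file name matches at least one glob pattern."""
    return {
        path: info
        for path, info in manifest.items()
        # Match against the file name only, not the folders leading up to it.
        if any(fnmatch(path.rsplit("/", 1)[-1], pattern) for pattern in patterns)
    }

# Hypothetical example data: path -> modification time (seconds).
manifest = {"src/a.c": 1.0, "src/b.cpp": 2.0, "docs/readme.txt": 3.0}
sources_only = filter_manifest(manifest, ["*.c", "*.cpp"])
# sources_only keeps "src/a.c" and "src/b.cpp" and drops "docs/readme.txt".
```

Applying the filter after the diff would keep the request sent to the server as small as possible.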

 

Things to do before I put the source on SourceForge

  1. Move VB source to C#
    1. I started using VB as my UI, and C# as my object code.  It really is unnecessary to have two (meta) languages.
  2. Create User Manual
  3. Fix sendingConnection/listeningConnection defect
    1. Right now I open two connections to send and receive.  This was definitely not the optimal method for solving this problem.  I will combine these two objects and create a better thread-remoting interface.