Darin Keever
DiffSync
DiffSync was created to transfer large quantities of files across a bandwidth-limited medium.
It is FTP on steroids.
Why
Where I work, we have a huge database of files for our buildtree (10,200 files that comprise 450 MB of storage). For the most part, it is easiest to build with every file at its most recent version. This would usually mean scrapping the previous tree and starting from the beginning. Over VPN, this could mean at least 4 hours just downloading the files.
Actually, I did a little experiment. Using my LAN, I put all the files on one machine, then FTP’d them across to another machine: 2 hours. I then tried zipping the files up first, sending them across, and unzipping them: 25 minutes. Wow!
The real problem is the folder-by-folder “get” that FTP must do. Just moving from folder to folder must take most of the time.
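The zip-first half of that experiment can be sketched roughly as below. This is Python used purely for illustration, not DiffSync code; the function names `bundle_tree` and `unpack_tree` are made up for this example, and the actual transfer step between the two calls is left out.

```python
import zipfile
from pathlib import Path


def bundle_tree(root: Path, archive: Path) -> int:
    """Pack every file under root into one zip archive; return the file count.

    The archive should live outside root, otherwise it would try to pack itself.
    """
    count = 0
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(root.rglob("*")):
            if path.is_file():
                # Store paths relative to root so the tree can be rebuilt anywhere.
                zf.write(path, path.relative_to(root).as_posix())
                count += 1
    return count


def unpack_tree(archive: Path, dest: Path) -> None:
    """Recreate the bundled folder tree under dest."""
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)
```

The point is that the whole tree travels as a single file, so the per-folder round trips disappear.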
The more I thought about it, the more I thought this must be the situation for most developers in a good build framework. If not, then we seriously need to get ours open-sourced! :)
When
Currently, the first implementation is under construction. It is truncating some files and skipping others… so it’s far from finished. But I thought I should find a good place to start and get some suggestions.
How
The basic algorithm is:
- Server starts.
- Server adds the appropriate shares.
- Client connects to the server.
- Client queries the server for its repository (the files it wants to share).
- Server sends an XML representation of the files it’s sharing (the FileInfoRepository).
- The client diffs this FileInfoRepository against some folder (maybe it already has an older version of some files). This produces a new FileInfoRepository of all files that are newer than the client’s current version. This keeps the client from having to get files that it already has (or has a newer version of).
- The client sends this FileInfoRepository to the server.
- The server compresses the requested files and sends them back to the client.
- The client unzips the files into the appropriate place.
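The listing and diff steps above might look something like the following. This is a Python sketch under stated assumptions, not the actual DiffSync code: the XML element and attribute names (`fileInfoRepository`, `file`, `path`, `modified`) are invented here, and “newer” is assumed to mean a later modification time, which the description above does not spell out.

```python
import xml.etree.ElementTree as ET


def to_xml(files: dict[str, float]) -> str:
    """Serialize {relative path: modification time} into a small XML listing."""
    root = ET.Element("fileInfoRepository")  # hypothetical element name
    for path, mtime in sorted(files.items()):
        ET.SubElement(root, "file", path=path, modified=repr(mtime))
    return ET.tostring(root, encoding="unicode")


def from_xml(text: str) -> dict[str, float]:
    """Parse the listing produced by to_xml back into a dictionary."""
    root = ET.fromstring(text)
    return {f.get("path"): float(f.get("modified")) for f in root.findall("file")}


def diff_repository(server: dict[str, float], client: dict[str, float]) -> dict[str, float]:
    """Keep only the files the client lacks or holds in an older version."""
    return {
        path: mtime
        for path, mtime in server.items()
        if path not in client or client[path] < mtime
    }
```

For example, with a server listing of `{"src/main.c": 200.0, "README": 50.0}` and a client holding `{"src/main.c": 150.0, "README": 75.0}`, only `src/main.c` would be requested, since the client's `README` is already newer.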
From a higher level:
- User starts DiffSync, which becomes the server. (Although each client is also a server, the first one must be just the server, as it cannot connect to anyone else.) This is usually a machine that everyone can get to (not behind a firewall, or accessible through the firewall). In my case, this is my home machine.
- User then sets the appropriate shares, if necessary.
- The user can start instances of DiffSync and connect to the server (these are my work repositories).
  - (not yet implemented) Further, it can initiate a request through an instance. In my case, let’s say I’m on my other home machine (not the server). I can start DiffSync on the 2nd machine and connect to the server. Then I can query the server for a list of its connections. A better analogy: A is connected to B. C connects to A. C queries A for its connections. C can then try to connect to B.
  - (not yet implemented) Upon disconnect, it will try to poll the main server (the first server that it attached to) once every 5 minutes for 24 hours. Not sure how this will affect authentication…
    - No authentication until first “get”?
In my case, I can now get all the files I need from work… basically from anywhere. Using another DiffSync, I can connect to the “server” (my home machine) and get to any work repository.
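The planned reconnect polling from the list above (once every 5 minutes for 24 hours) could be sketched along these lines. Since the feature is not implemented yet, everything here is an assumption for illustration: the function name, the idea of passing in the connect and sleep callables, and the choice to give up silently when the window runs out.

```python
from typing import Callable


def poll_until_reconnected(
    try_connect: Callable[[], bool],
    sleep: Callable[[float], None],
    interval_s: float = 5 * 60,
    window_s: float = 24 * 60 * 60,
) -> bool:
    """Retry try_connect every interval_s seconds until it succeeds or window_s is used up."""
    attempts = int(window_s // interval_s)  # 288 attempts for 5 minutes over 24 hours
    for attempt in range(attempts):
        if try_connect():
            return True
        if attempt < attempts - 1:
            sleep(interval_s)  # wait before the next attempt
    return False
```

In real use `sleep` would be `time.sleep` and `try_connect` would wrap whatever opens the connection to the main server; passing them in keeps the loop testable without waiting a day.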
It also has some optimizations.
- During the “GetFiles” process, the client starts 3 threads. The idea is that one sends data while one receives data and one determines the next “chunk” of the FileInfoRepository. The threads are synchronized so that no two are in the same state at the same time.
- The “chunk” is limited so that the process isn’t “stuck” in one of the above states. This allows the server to begin sending before all the files are compressed. It also keeps the client and server from taking up too many resources!
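One way to picture the “chunk” limit is a generator that splits the requested files into batches. This is a Python sketch, not the DiffSync implementation: the text above does not say whether a chunk is capped by file count or by bytes, so a byte budget (`max_bytes`) is assumed here, and `chunk_files` is a made-up name.

```python
from typing import Iterable, Iterator


def chunk_files(files: Iterable[tuple[str, int]], max_bytes: int) -> Iterator[list[str]]:
    """Group (path, size) pairs into chunks whose total size stays within max_bytes.

    A single file larger than max_bytes still gets a chunk of its own.
    """
    chunk: list[str] = []
    used = 0
    for path, size in files:
        if chunk and used + size > max_bytes:
            # The current chunk is full: hand it off and start a new one.
            yield chunk
            chunk, used = [], 0
        chunk.append(path)
        used += size
    if chunk:
        yield chunk
```

Because it is a generator, the first chunk can be compressed and sent while later chunks are still being worked out, which is the overlap the three threads are after.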
What’s Next?
- Security
  - Right now, it’s mostly a public read-only folder! There should be some FTP-style “rights” added. Include integration with the server’s user accounts?
  - Authentication must be encrypted.
  - The files should probably be encrypted as well as compressed during transfer.
- Ability for servers to share more than one directory (although it will share the entire directory tree)
  - This is a limitation for ease of use… I need one folder, so it uses one folder!
- Add interfaces to source management software (version control)
  - This should be implemented to interface with all those crappy source management frameworks, like SourceSafe, CVS, etc. Getting a buildtree from one of those takes forever!
- Add a WinDiff-like interface so that you only get the files that you want.
  - Maybe after the diff, you get two trees… the new and the old? Maybe it should be just like WinDiff and tree it out. I dunno… that’s where the suggestions part comes in.
- Add file restrictions to the GetLatest
  - Like… *.c, *.cpp, etc., so you only get the files you are concerned about. Not sure how that would work with the WinDiff stuff.
- Add more ease-of-use features
  - On first use, it should ask for
    i. Port number of the server
    ii. Whether client apps can be downloaded
      1. I envision the app accepting HTTP requests and returning the pre-configured app (it will automatically connect to the server from which it was downloaded)
    iii. What kind of security it will use
- Ability to initiate a connection from another server
  - See 3a above (the “not yet implemented” item about initiating a request through an instance)
  - Further, maybe bridge connections? A connects to B and determines that B is connected to C. So A tries to connect to C, but it can’t because both A and C are behind VPN. So B can become the bridge between A and C. Just an idea.
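The file restrictions idea from the list above (*.c, *.cpp, and so on) could be as simple as a glob filter over the requested paths. This is a Python sketch of one possible approach, with a made-up function name; matching on the file name only, and case-sensitively, are assumptions rather than anything decided.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def filter_by_patterns(paths: list[str], patterns: list[str]) -> list[str]:
    """Keep the paths whose file name matches at least one glob pattern."""
    return [
        p for p in paths
        if any(fnmatchcase(PurePosixPath(p).name, pat) for pat in patterns)
    ]
```

Applied to the diffed list before it is sent to the server, a filter like this would trim the request down to just the file types of interest.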
Things TBD before I put the source on SourceForge
- Move VB source to C#
  - I started using VB for my UI and C# for my object code. It really is unnecessary to have two (meta) languages.
- Create User Manual
- Fix sendingConnection/listeningConnection defect
  - Right now I open two connections to send and receive. This was definitely not the optimal method for solving this problem. I will combine these two objects and create a better thread-remoting interface.