Sunday, December 19, 2010

For a Random String, Don't Hash!

When you're creating an application for more than just your team, you reach a point where you need to generate unique, random strings, usually for file uploads. The problem is generating these random strings in a quick manner while keeping the names short. If there are only a few users and you aren't concerned about someone accessing the files directly, then you can use an incrementing counter and give them numbers of 1, 2, 3, etc.

Most of the time you'll have a lot of people and you'll want to name your files so people can't guess them, hence we need something better than a simple counter. Enter the hash.

A hash is a function that takes any size string and outputs a fixed-length string. MD5 is a basic hash function that produces a 32 character hexidecimal result, so "RandomString" becomes 927ba77eb97e3d92ea12e710e0f5da1b. There are lots of hash functions out there, some outputting up to 160 characters.

So a basic file upload program could take the file and save it as a random filename by generating a hash of something, say the original filename, the file itself, or a "random" number. This makes it hard for the user to guess the filename, and provides us with a random string that shouldn't be the same as anything else on the server.

The problem is collisions. Let's create an imaginary hash function (called IHF) which gives us the last three digits of any number, so IHF(31415) is 415 and IHF(20) is 20. It's easy to see that IHF(27182) and IHF(8182) will both give 182 as a result, and this duplication happens for all hash functions. We can put anything into a hash function, so if you put in all the possible results plus one more thing, you've input more things than you can output; the function has to repeat. When two strings give the same hash, they're said to collide.

A better solution is to do a base conversion on the upload time. When the file is being saved, get the system time with as much precision as you can (in PHP you should use microtime(true)) and multiply the number by either 1,000 or 1,000,000 so that it's no longer a float. Now you have a unique number for that exact millisecond or microsecond (it varies between programming language and computer). Then use a base conversion function to change the number from base 10 to base 36. In PHP the code is as follows:
base_convert(microtime(true) * 1000000, 36)
which gives "cpy9ocjrjk" for a microtime of 1291932672.593456

Why should you choose base conversion over hashing?
  1. It gives a smaller result (only ten characters compared to 32 for MD5, or 25 characters if you convert the hash from hexadecimal to base 36)
  2. It doesn't take up any more space to code and it's just as fast.
  3. There's no chance of collisions unless two files are uploaded at the exact same millisecond or closer.
  4. The file names stay hard to guess just like the old hash system.
  5. If you're a risk-taker, you can forgo checking the filename and simply assume it's unique; we've never had a collision!
Further Reading:

No comments:

Post a Comment