Sunday, December 19, 2010

For a Random String, Don't Hash!

When you're creating an application for more than just your team, you reach a point where you need to generate unique, random strings, usually for file uploads. The problem is generating these random strings in a quick manner while keeping the names short. If there are only a few users and you aren't concerned about someone accessing the files directly, then you can use an incrementing counter and give them numbers of 1, 2, 3, etc.

Most of the time you'll have a lot of people and you'll want to name your files so people can't guess them, hence we need something better than a simple counter. Enter the hash.

A hash is a function that takes any size string and outputs a fixed-length string. MD5 is a basic hash function that produces a 32 character hexidecimal result, so "RandomString" becomes 927ba77eb97e3d92ea12e710e0f5da1b. There are lots of hash functions out there, some outputting up to 160 characters.

So a basic file upload program could take the file and save it as a random filename by generating a hash of something, say the original filename, the file itself, or a "random" number. This makes it hard for the user to guess the filename, and provides us with a random string that shouldn't be the same as anything else on the server.

The problem is collisions. Let's create an imaginary hash function (called IHF) which gives us the last three digits of any number, so IHF(31415) is 415 and IHF(20) is 20. It's easy to see that IHF(27182) and IHF(8182) will both give 182 as a result, and this duplication happens for all hash functions. We can put anything into a hash function, so if you put in all the possible results plus one more thing, you've input more things than you can output; the function has to repeat. When two strings give the same hash, they're said to collide.

A better solution is to do a base conversion on the upload time. When the file is being saved, get the system time with as much precision as you can (in PHP you should use microtime(true)) and multiply the number by either 1,000 or 1,000,000 so that it's no longer a float. Now you have a unique number for that exact millisecond or microsecond (it varies between programming language and computer). Then use a base conversion function to change the number from base 10 to base 36. In PHP the code is as follows:
base_convert(microtime(true) * 1000000, 36)
which gives "cpy9ocjrjk" for a microtime of 1291932672.593456

Why should you choose base conversion over hashing?
  1. It gives a smaller result (only ten characters compared to 32 for MD5, or 25 characters if you convert the hash from hexadecimal to base 36)
  2. It doesn't take up any more space to code and it's just as fast.
  3. There's no chance of collisions unless two files are uploaded at the exact same millisecond or closer.
  4. The file names stay hard to guess just like the old hash system.
  5. If you're a risk-taker, you can forgo checking the filename and simply assume it's unique; we've never had a collision!
Further Reading:

Thursday, December 16, 2010

Developing a product within a service company

For this week's Lean Coffee Toronto, we discussed the ins and outs of developing a product inside of a service company.

A few interesting discussion points from my notes:
  • Products often come from satisfying an internal need
  • Working on products have many intangible benefits, including:
    • Increased motivation
    • "Bragging rights" for developers
  • How do you know when to "unbind" the product from the service company? No clear answers:
    • When the product needs investment in marketing, support, enhancements, etc.?
    • When the market is established?
  • How do you know when to "kill" the idea?
    • Treat it the same way as you would any investment
    • If it is being used and generating value internally, this question isn't relevant
  • Ask yourself: Is the product a hobby, or a business?

I found this discussion extremely valuable as we begin to productize our own internal system: http://softwarepms.com/

Thanks to everyone who attended, and especially to Andrew for facilitating, and Jet Cooper for hosting.

Thursday, December 9, 2010

Lean and Agile Terminology

As part of Lean Coffee Toronto #13, we discussed defining some of the terminology behind Lean and Agile.

This is a great way to "speak the same language" about things that we might already do without even knowing what it's called.

Here is a quick list of some of my favorites:
  • WGMGD: What gets measured gets done
  • Six Sigma: A method that focuses on increasing process performance and decreasing process variation through a variety of tools.
  • Anti-Pattern: Repeated practices that appear initially to be beneficial, but ultimately result in bad consequences that outweigh the hoped-for advantages.
  • Refactor: The disciplined technique for restructuring an existing body of code, altering its internal structure without changing its external behavior. Usually referred to when fixing bad code.
  • Iteration: (or "Sprint") It is a period of time where an agile team works to complete a set of Stories. This is typically from 1 to 4 weeks.
  • User Story: (or "Story") A feature described in the language of the customer.
  • Backlog: The set of stories that are not yet done.
  • Stakeholder: Any party that has an interest in the product/service produced by an organization's value stream.
Sources: