A major predictor in many of our models (enrollment, retention, etc.) is “distance from campus…” When it was decided that we would include this variable in our models, he response from on high was along the lines of, “Distance from campus? That can’t *possibly* be calculated by ordinary mortals. Let’s throw money at the problem!” Subsequently, and unbeknown to me, Rock City U drops a ton o’ bling to get this data every year.
Granted, this is not an easy problem, per se, but it sure ain’t hard. It just takes some effort and digging. I can promise you, the amount of time I spent (considering my annual salary) solving this problem cost RCU a hell of a lot less money than buying numbers from some data warehouse somewhere. And because of that (and, hell, since I already spent the time doing this, why should you?)… I’m going to show you how I figured out this “impossible” problem.
First, the data we bought was connected to zipcodes. Seeing as that was the *only* bit of data we were sending to the data warehouse folks, I figured all their calculations must somehow be based on these zips. Hm, I thought… I wonder if there’s an equation somewhere that calculates the distance between to points on a map using, what… maybe latitude and longitude? Hit 2 on my Google search turned up this gem of a website:
There it is, a really convoluted and complicated (but accurate) equation for calculating the distance between two points. (The link above is excellent and it very much gets into all of the math involved, but for the purposes of this post, I’m going to just skip over the details and let you take a look if you’re curious).
Now all we need is a file that converts zipcodes to lat/long and we’re good to go. After a very small amount of Googling, I found a handful of websites with zip info in databases, .csv files, lookups, batch queries, etc., etc. Just by way of thank you, here are a few good ones:
Or, do what I did and put your tax dollars to work: http://www.census.gov/tiger/tms/gazetteer/zips.txt
For the most part, I used the US Government’s info but, being as it’s from 1990, zips have been added so I had to do some digging here and there (and what I include below is not likely a *complete* dataset, so you might have to dig for a few stray numbers… but, with over 40K entries, it’s gotta’ be close (see step 2 below to download my final database).
So now what?
1. Open up whatever file needs distances added and make sure your variable with zipcode information is *numeric* and named “Zip.”
2. Download and open this Excel file and import into SPSS.
3. Sort both files in ascending order by “Zip.”
4. Merge Lat/Long variables into the file that needs distances added.
5. Find your zipcode in the above Excel file, copy the equation below into an SPSS syntax file, insert your corresponding latitude and longitude where indicated (in bold):
compute step1 = (sin(YOURLAT/57.2958) * sin(lat/57.2958)) + (cos(YOURLAT/57.2958) * cos(lat/57.2958) * cos(long/57.2958 – (YOURLONG)/57.2958)).
compute distance = 3963*artan(sqrt(1-(step1*step1))/step1).
6. Run SPSS code.
7. Behold the awesome.
Does it work? You ask. Of course it does. It’s science.
That said, here’s my little disclaimer: All distances are based on “as the crow flies” distances between somewhat arbitrary “midpoints” of each zipcode. That said, all of my distances were within 20 miles of the distances we bought from the data warehouse, and the average distance between my calculations and the distances we bought was < 2 miles.