Visualizing Achievement Gaps with Tableau (part 1)

It’s been a minute! I’m not sure how this will last, but I’m hoping to put together semi-regular blog posts again. Ideally, someone out there will find *something* I have to say useful! As always, what I’m sharing with you today is based on work I’ve done at my own institution–but the visualizations in this post have been created using fabricated data.

Today’s topic? Visualizing achievement gaps in students. I plan to break this topic into two parts. In part one, I’ll be building a scatterplot looking at failure rates on a course-by-course basis. This scatterplot will then be used as a filter for part two; part two will lay out differences in failure rate by gender.

The final product is a dashboard with the scatterplot at the top. Each dot represents a course; the further to the right the dot falls, the higher the D, F, or W (withdraw) rate for that course. The higher it is place on the chart, the larger the enrollment for the course. The size and color of each dot represents the raw number of students who failed that course.

The bottom of the dashboard shows the achievement gap between men and women on a course-by-course basis. For the purpose of this visualization, we are defining “achievement gap” as the difference in pass rates by gender. Here is a look at the end result:

Database

I built the database for this project out of our demographic and enrollment census files. The demographic files are unique at the level of [Student x Term], so each student is only included once per term. The enrollment files are unique at the level of [Student x Term x Course], so each individual student may have multiple rows in a given term, assuming they took more than one course. I included fields indicating Term, Course Department, College Name, Course Level (Grad, Undergrad), Course Subject (e.g., PSY, PHL), Catalog Number, Course Description, and Course Grade and Gender. I also included a field we call “Component” which indicates the *type* of course (e.g., Lab, Lecture, Fieldwork, etc). It’s kinda’ tiny, but here’s what the data looks like:

2018-05-11_16-25-01

Now the fun part. I’m making the assumption that you, dear reader, have some basic Tableau knowledge so I’m not going to step through the initial setup for the Tableau workbook; except to say that I fired up Tableau and connected to the data source. My first order of business was  to turn the Course Grade field into a binary. I coded all “failing” grades as 1 and all passing grades as 0. This way, when you average the binary you get a percentage of failing grades. As a point of clarification: for this project, we are considering grades of F, D, W [withdraw], and U [unsatisfactory] to be failing). This new field was named “Fail.”

Tableau Visualization

Once I recoded the grades I started building out the first chart. I placed the “Fail” field in the columns shelf and changed the measure type from “Sum” to “Average.” Next, I placed the sum of the “Number of Records” field in the rows shelf”

step1Shelves

I then placed the “Course Description” field into the marks shelf as a “detail” and selected the mark type of “Circle” from the dropdown menu. This converts the chart to something that much more clearly resembles a scatterplot:

step2Chart

Finally, I added the sum of the “Fail” field to the marks shelf as two attributes–Color and Size–and changed the color palette to *reversed* “red-green diverging.” This allows the end user to better differentiate the higher-fail courses from the lower-fail courses:

step3Colors

At this point, I had a functional visualization so the rest is mostly cosmetic. I added a more descriptive title, labeled the axes, converted the horizontal to display percentages, and added a few filters to give end users a little more ability to drill into the information based on Term, College Name, and Department. I also added a sliding filter to allow users the ability to filter out courses by the total number of students enrolled in a given class. To do this, I calculated enrollment at the level of Term and Course Description using the “fixed” formula:

step4FixedComman

I also made some edits to the tooltip to provide users with additional information when they rollover circles on the chart. Here is how that looked:

step5tooltip

The end result for part one of this visualization is a worksheet that looks like this:

Distance from Campus

12819896984_979da1112c_o

A major predictor in many of our models (enrollment, retention, etc.) is “distance from campus…” When it was decided that we would include this variable in our models, he response from on high was along the lines of, “Distance from campus? That can’t *possibly* be calculated by ordinary mortals. Let’s throw money at the problem!” Subsequently, and unbeknown to me,  Rock City U drops a ton o’ bling to get this data every year.

Cringe.

Granted, this is not an easy problem, per se, but it sure ain’t hard. It just takes some effort and digging. I can promise you, the amount of time I spent (considering my annual salary) solving this problem cost RCU a hell of a lot less money than buying numbers from some data warehouse somewhere. And because of that (and, hell, since I already spent the time doing this, why should you?)… I’m going to show you how I figured out this “impossible” problem.

First, the data we bought was connected to zipcodes. Seeing as that was the *only* bit of data we were sending to the data warehouse folks, I figured all their calculations must somehow be based on these zips. Hm, I thought… I wonder if there’s an equation somewhere that calculates the distance between to points on a map using, what… maybe latitude and longitude? Hit 2 on my Google search turned up this gem of a website:

http://www.meridianworlddata.com/Distance-Calculation.asp

There it is, a really convoluted and complicated (but accurate) equation for calculating the distance between two points. (The link above is excellent and it very much gets into all of the math involved, but for the purposes of this post, I’m going to just skip over the details and let you take a look if you’re curious).

Now all we need is a file that converts zipcodes to lat/long and we’re good to go. After a very small amount of Googling, I found a handful of websites with zip info in databases, .csv files, lookups, batch queries, etc., etc. Just by way of thank you, here are a few good ones:

http://eweboptions.com/zipcode_radius/data/ZipCodes.txt
http://stevemorse.org/jcal/latlonbatch.html?direction=forward

Or, do what I did and put your tax dollars to work: http://www.census.gov/tiger/tms/gazetteer/zips.txt

For the most part, I used the US Government’s info but, being as it’s from 1990, zips have been added so I had to do some digging here and there (and what I include below is not likely a *complete* dataset, so you might have to dig for a few stray numbers… but, with over 40K entries, it’s gotta’ be close (see step 2 below to download my final database).

So now what?

1. Open up whatever file needs distances added and make sure your variable with zipcode information is *numeric* and named “Zip.”

2. Download and open this Excel file and import into SPSS.

3. Sort both files in ascending order by “Zip.”

4. Merge Lat/Long variables into the file that needs distances added.

5. Find your zipcode in the above Excel file, copy the equation below into an SPSS syntax file, insert your corresponding latitude and longitude where indicated (in bold):

compute step1 = (sin(YOURLAT/57.2958) * sin(lat/57.2958)) + (cos(YOURLAT/57.2958) * cos(lat/57.2958) * cos(long/57.2958 – (YOURLONG)/57.2958)).
execute.
compute distance = 3963*artan(sqrt(1-(step1*step1))/step1).
execute.

6. Run SPSS code.

7. Behold the awesome.

Does it work? You ask. Of course it does. It’s science.

That said, here’s my little disclaimer: All distances are based on “as the crow flies” distances between somewhat arbitrary “midpoints” of each zipcode. That said, all of my distances were within 20 miles of the distances we bought from the data warehouse, and the average distance between my calculations and the distances we bought was < 2 miles.

Wecome to Rock City Institutional Research

My name is Tom and I’m your host for the time being.

A little history: I’m the Director of Institutional Research at a large state university in the Midwest and I absolutely love my job. I’ve noticed something, though. I spend hours upon hours upon hours reinventing the wheel. Some days I spend more time searching the web for tips and tricks than I do actually conducting research… In that time, though, I have learned a lot.

My initial hope, with this blog, is to share some of that knowledge with the Institutional Research community at large. Ultimately, I hope this humble piece of web real estate develops into a community of like-minded researchers sharing their experiences, successes, challenges, and… perhaps… some of their own secrets. Happy reading!