Visualizing Achievement Gaps with Tableau (part 2)

Next, we’ll build out the chart at the bottom of the visualization I shared in part 1. This visualization is a little trickier than the last as we’ll be creating a dual-axis chart.

The first step will be to set up the shelves such that our fail rates display as columns and our course descriptions display as the rows. You do this by dragging the “Fail” field into the columns shelf *twice*:

The default measure for the “Fail” is SUM, which displays the raw number of students who failed the class. We want to see the percentage of students who failed, so we need to convert the measures for the “Fail” fields to averages:  

Now, the goal of this visualization is to show a mark for each gender with a line between the two. The line between the two marks isn’t critical, per se, I just personally feel that it better illustrates the difference between the two genders. To do this, we need to have Tableau create a dual axis chart. This is done by right-clicking on one of the “Fail” fields and selecting “Dual Axis”:


Thus far, we’ve simply created a dual axis chart with all of the exact same information, so essentially, the marks from one axis exactly overlap the marks from the other axis. We need to go into the “Marks” shelves to start making modifications to this default. Because we’re interested in drawing a line between the two genders, we go to the “Marks” shelf for the first “Fail” field, change the dropdown for the mark type to “Line” (see red box ) then drag the “Gender” field to the Color box and again to the Path box. I have also added the SUM of the “Number of Records” field and the SUM of the “Fail” field to the Marks shelf as Tooltips. We’ll circle back to this, but this technique allows you to add additional information to the visualization through the Tooltip rollover feature.

The next step would be do make the marks for gender by adding fields to the “Marks” shelf for the second “Fail” field. The problem is that if we use the same “Gender” field, we need to use the same colors to designate the difference between male and female students. This means that the line between the two genders will display as a gradient that transitions from one gender’s color to the other gender’s color. The end result is something like this:

Depending on your audience, this gradient effect may be just fine. Should you want to show a solid color between the two marks, however, you’ll need to duplicate the “Gender” field and tinker with the colors a bit (more on that in a moment). First, to duplicate a field, head over to the list of variables, right-click on the one you want to copy and select “Duplicate”:

Next, head over to the “Marks” shelf for the second “Fail” field and be sure you have the dropdown for the type of mark set to “Shape” (see the top red box below). Next, put your new “Gender” field (the default name will be “Gender (copy)”) into the Marks shelf and set it as the Color. Finally, in order to make sure the Tooltip works correctly on both types of mark (Line and Shape) you’ll need to drag the “Number of Records” and “Fail” fields into the “Marks” shelf and set them as Tooltips:

This should give you a chart that looks something like this:

Obviously, this looks a little weird. Aside from the gradient issue mentioned above, the shapes and lines are all out of whack; here’s why: when you create a dual axis chart in Tableau, the top and bottom axes aren’t synchronized by default. You have to right-click on one axis or the other and select “Synchronize Axis”:

Finally, you’ll want to set the colors for the Line shape to bet the same; I chose grey but any color is fine. Do this by clicking on the “Color” box in the “Marks” shelf for the first “Fail” field. Next, click on “Edit Colors” and then you can set each gender’s color manually:

Once this is done, you should end up with a  chart that looks something like this:


Now, this essentially produces a massive, scrollable list of all courses offered in the Fall of 2015 and Spring of 2016. Alone, this isn’t the most useful chart. My plan is to combine the scatterplot from Part 1 to this gap chart and put them all into one visualization. Then, I’ll  use the same filters from Part 1 and set them to apply to all of the charts on the dashboard. If you choose not to pair the two into one visualization, you will likely want to add some filters to help your end users dig into the data a little bit better.

Finally, I created a new dashboard, added the two charts we’ve created in parts 1 and 2, and set the filters to apply to all worksheets using the underlying data source:

For the purposes of the blog post, I used a custom sized dashboard and brought everything in as “floating” objects rather than “tiled.” I find that this method gives a little more flexibility for design and, when done well, can help squeeze more onto a given dashboard. That said, it does require a little more tinkering to be sure everything is laid out in a way your end users find useful. I’ve chosen not to go into too much detail on that process here, rather, I’ll just embed the final product here:

Finally, my typical disclaimer: Though these dashboards are based on work I’ve done in my “day job,” the particular examples above use fabricated data.

Visualizing Achievement Gaps with Tableau (part 1)

It’s been a minute! I’m not sure how this will last, but I’m hoping to put together semi-regular blog posts again. Ideally, someone out there will find *something* I have to say useful! As always, what I’m sharing with you today is based on work I’ve done at my own institution–but the visualizations in this post have been created using fabricated data.

Today’s topic? Visualizing achievement gaps in students. I plan to break this topic into two parts. In part one, I’ll be building a scatterplot looking at failure rates on a course-by-course basis. This scatterplot will then be used as a filter for part two; part two will lay out differences in failure rate by gender.

The final product is a dashboard with the scatterplot at the top. Each dot represents a course; the further to the right the dot falls, the higher the D, F, or W (withdraw) rate for that course. The higher it is place on the chart, the larger the enrollment for the course. The size and color of each dot represents the raw number of students who failed that course.

The bottom of the dashboard shows the achievement gap between men and women on a course-by-course basis. For the purpose of this visualization, we are defining “achievement gap” as the difference in pass rates by gender. Here is a look at the end result:


I built the database for this project out of our demographic and enrollment census files. The demographic files are unique at the level of [Student x Term], so each student is only included once per term. The enrollment files are unique at the level of [Student x Term x Course], so each individual student may have multiple rows in a given term, assuming they took more than one course. I included fields indicating Term, Course Department, College Name, Course Level (Grad, Undergrad), Course Subject (e.g., PSY, PHL), Catalog Number, Course Description, and Course Grade and Gender. I also included a field we call “Component” which indicates the *type* of course (e.g., Lab, Lecture, Fieldwork, etc). It’s kinda’ tiny, but here’s what the data looks like:


Now the fun part. I’m making the assumption that you, dear reader, have some basic Tableau knowledge so I’m not going to step through the initial setup for the Tableau workbook; except to say that I fired up Tableau and connected to the data source. My first order of business was  to turn the Course Grade field into a binary. I coded all “failing” grades as 1 and all passing grades as 0. This way, when you average the binary you get a percentage of failing grades. As a point of clarification: for this project, we are considering grades of F, D, W [withdraw], and U [unsatisfactory] to be failing). This new field was named “Fail.”

Tableau Visualization

Once I recoded the grades I started building out the first chart. I placed the “Fail” field in the columns shelf and changed the measure type from “Sum” to “Average.” Next, I placed the sum of the “Number of Records” field in the rows shelf”


I then placed the “Course Description” field into the marks shelf as a “detail” and selected the mark type of “Circle” from the dropdown menu. This converts the chart to something that much more clearly resembles a scatterplot:


Finally, I added the sum of the “Fail” field to the marks shelf as two attributes–Color and Size–and changed the color palette to *reversed* “red-green diverging.” This allows the end user to better differentiate the higher-fail courses from the lower-fail courses:


At this point, I had a functional visualization so the rest is mostly cosmetic. I added a more descriptive title, labeled the axes, converted the horizontal to display percentages, and added a few filters to give end users a little more ability to drill into the information based on Term, College Name, and Department. I also added a sliding filter to allow users the ability to filter out courses by the total number of students enrolled in a given class. To do this, I calculated enrollment at the level of Term and Course Description using the “fixed” formula:


I also made some edits to the tooltip to provide users with additional information when they rollover circles on the chart. Here is how that looked:


The end result for part one of this visualization is a worksheet that looks like this:

Distance from Campus


A major predictor in many of our models (enrollment, retention, etc.) is “distance from campus…” When it was decided that we would include this variable in our models, he response from on high was along the lines of, “Distance from campus? That can’t *possibly* be calculated by ordinary mortals. Let’s throw money at the problem!” Subsequently, and unbeknown to me,  Rock City U drops a ton o’ bling to get this data every year.


Granted, this is not an easy problem, per se, but it sure ain’t hard. It just takes some effort and digging. I can promise you, the amount of time I spent (considering my annual salary) solving this problem cost RCU a hell of a lot less money than buying numbers from some data warehouse somewhere. And because of that (and, hell, since I already spent the time doing this, why should you?)… I’m going to show you how I figured out this “impossible” problem.

First, the data we bought was connected to zipcodes. Seeing as that was the *only* bit of data we were sending to the data warehouse folks, I figured all their calculations must somehow be based on these zips. Hm, I thought… I wonder if there’s an equation somewhere that calculates the distance between to points on a map using, what… maybe latitude and longitude? Hit 2 on my Google search turned up this gem of a website:

There it is, a really convoluted and complicated (but accurate) equation for calculating the distance between two points. (The link above is excellent and it very much gets into all of the math involved, but for the purposes of this post, I’m going to just skip over the details and let you take a look if you’re curious).

Now all we need is a file that converts zipcodes to lat/long and we’re good to go. After a very small amount of Googling, I found a handful of websites with zip info in databases, .csv files, lookups, batch queries, etc., etc. Just by way of thank you, here are a few good ones:

Or, do what I did and put your tax dollars to work:

For the most part, I used the US Government’s info but, being as it’s from 1990, zips have been added so I had to do some digging here and there (and what I include below is not likely a *complete* dataset, so you might have to dig for a few stray numbers… but, with over 40K entries, it’s gotta’ be close (see step 2 below to download my final database).

So now what?

1. Open up whatever file needs distances added and make sure your variable with zipcode information is *numeric* and named “Zip.”

2. Download and open this Excel file and import into SPSS.

3. Sort both files in ascending order by “Zip.”

4. Merge Lat/Long variables into the file that needs distances added.

5. Find your zipcode in the above Excel file, copy the equation below into an SPSS syntax file, insert your corresponding latitude and longitude where indicated (in bold):

compute step1 = (sin(YOURLAT/57.2958) * sin(lat/57.2958)) + (cos(YOURLAT/57.2958) * cos(lat/57.2958) * cos(long/57.2958 – (YOURLONG)/57.2958)).
compute distance = 3963*artan(sqrt(1-(step1*step1))/step1).

6. Run SPSS code.

7. Behold the awesome.

Does it work? You ask. Of course it does. It’s science.

That said, here’s my little disclaimer: All distances are based on “as the crow flies” distances between somewhat arbitrary “midpoints” of each zipcode. That said, all of my distances were within 20 miles of the distances we bought from the data warehouse, and the average distance between my calculations and the distances we bought was < 2 miles.

FAFSFA Position As Predictor of Enrollment

pexels-photo-259191.jpegIf you’re IR office is anything like Rock City’s, you’re always being asked to provide insight into the ways students will make decisions about your university. Perhaps you’re building a predictive model of enrollment… or maybe you’re being asked to look at factors contributing to retention or persistence to graduation. You’ve also, no doubt, learned that what predicts these things nationally (or “in the literature”) do not necessarily provide predictive utility at your own institution.

Something we’ve found to be predictive of whether or not an admitted student enrolls is where they place Rock City on their FAFSA application. That simple. The sooner they list us, the more likely they will enroll. This trick might not work for you, as it requires that  a large percentage of your applicants, a) apply for aid; and, b) list you on their FAFSA application. Obviously, this technique also makes the assumption that students are listing schools in order of preference and not by some other method (alphabetically, numerically by FAFSA code, etc).

No doubt, these are some pretty big assumptions. This may ultimately prove to be a waste of time for your school/university… but it certainly has worked for us. Great, you say, I’m down to give anything a try once. Now what?

Well, it all requires data. Here’s how ours is set up:


Not the greatest screenshot, I’ll admit… but basically what we have here is a student ID number (for merging into whatever database you end up merging this into), and FAFSA code numbers in whatever position they were entered by the student (these are the CAMPUS1-CAMPUS6 variables). Because some FAFSA codes have letters, these are coded as string variables.

So the next step is to make this data usable. I’ve done this in two ways… First, I’ve put together syntax that makes a dummy code out of each of the 6 variables (1 = “My institution,” 0 = “Another institution”). I’ve written this code for Wright State (not, incidentally, Rock City’s main campus). You’ll have to change ‘003078’ in the following syntax to your own FAFSA code:


RECODE WRIGHT7 (1 = 0) (0 = 1) INTO WRIGHT7.

We found this coding to be the most useful and predictive method of using FAFSA codes. There seemed to be a fairly linear relationship between positions 1, 2, and 3, but less so among positions 4, 5 and 6. As such, entering them as dummy variables seemed to work best for us. In case you were wondering, WRIGHT7 for students who do not list your university anywhere on the FAFSA.

NOTE: Typically, you would leave one variable out of your regression analysis to serve as a reference group. If every student in your sample applied for aid, you will still need to do this. If, however, you have a group of students who did not apply for aid, when you merge this data back into your original dataset, you can recode missing values on these 7 variables as 0. When you then enter the 7 variables into your equation, the reference category will be students who did not apply for aid.

If you think there’s a more linear relationship among positions, you could then use the following code to make a continuous variable out of your  position:

Again, a 7 indicates a student applied for aid but did not list your institution on the application. If you have students who did not apply for aid, you will need to account for them when you merge this data back into your main dataset.

NOTE: All of this assumes that the FAFSA data is a separate dataset that will be merged into some sort of larger database.  The syntax I have written does *not* account for students who did not apply for aid. This is an important group that you will need to account for in your final models. As I mentioned, when you merge this transformed data into your larger dataset, those with missing data on these variables are the students who did not apply for aid.

Additional thoughts: Rock City found that we were able to improve the predictive utility of our model with a variable that coded whether or not a student listed a top rival institution on their FAFSA. Sadly, those who did were about 2 times less likely to enroll at Rock City U… but it certainly helped improve our model.

TOEFL Scores – Conversion Syntax

Internationalization. Diversity. Inclusion.

Let’s face it, if your institution is anything like Rock City U’s, you’re starting to see an increased internationalization and diversification of your student body. This is good… but if your IR office is anything like ours, you’re going to be asked to study these students in detail. One of our recent tasks was to see if scores on the Test of English as a Foreign Language (TOEFL) test were related to retention, GPA, satisfaction, etc., etc. Easy, right? Throw together some correlations, a t-test or two, maybe even a logistical regression if you’re feeling fancy…

Oh but if it were that simple. If you’ve worked with TOEFL scores in the past, you know that ETS has created a wonderfully convoluted exam. Not only do they provide students with the option of taking the test in one of three distinct modes of administration… they score each mode on a separate scale. And what glorious scales: 0-120 for your internet test, 0-300 for the computer test, and… wait for it… 310 to 677 for the old-skool paper version. Beautiful.

If you search around the ETS website, you can find a comparison table that will help you figure out how to convert these three scales to one distinct measure. Or you could click here.

Thankfully, my administration was only interested in looking at the overall total score. You can find this comparison table on page 6. One thing should come clear pretty quickly: The computer-based test is the only one that never has a range as a converted value. For this reason, and this reason alone, I chose to convert all scores to their computer-based equivalent.

Of course, it wouldn’t be IR without an additional wrinkle. Rock City’s home institution, instead of creating a separate variable for each type of test, lumps them all together in our admissions database under the variable TOEFL. Yup. All in the same variable. Soooooo…. I have to make some assumptions. They are this:

  • If a student’s TOEFL score fell between 0 and 120, they took the internet-based test and needed to be upconverted to the computer-based range.
  • If a student’s TOEFL score fell between 120 and 300, they took the computer-based test and did not need conversion.
  • If a student’s TOEFL score was greater than 300, they took the paper-based test and needed to be downcoverted to the computer-based range.

Obvious problems here, right? It’s totally possible that a student did a *really* crappy job on the computer-based test (say, got a 110) and, in this conversion scheme, end up looking quite good. Unfortunately, that’s the type of error imperfect data introduces into an analysis. Hopefully your institution makes it clear which test the student took so you don’t have to make this kind of assumption.

So… that’s the background of this little piece of SPSS code. I’ve recoded all the way down to a 49 on the internet test, and all the way down to a 463 on the paper test. At the very least, this should give you a good start. If, like me, your school puts everything under one variable, you can rename that variable to “testscore” and you should be able to run this as-is. You’ll get your converted scores in a variable called “testscore_R”. Enjoy… and please let me know if you know a more streamlined way to do this (or, heaven forbid, if you find any errors).


p.s. – WordPress won’t let me upload a .sps file, so you’ll have to cut and past this one.

Wecome to Rock City Institutional Research

My name is Tom and I’m your host for the time being.

A little history: I’m the Director of Institutional Research at a large state university in the Midwest and I absolutely love my job. I’ve noticed something, though. I spend hours upon hours upon hours reinventing the wheel. Some days I spend more time searching the web for tips and tricks than I do actually conducting research… In that time, though, I have learned a lot.

My initial hope, with this blog, is to share some of that knowledge with the Institutional Research community at large. Ultimately, I hope this humble piece of web real estate develops into a community of like-minded researchers sharing their experiences, successes, challenges, and… perhaps… some of their own secrets. Happy reading!