Starting this year, many employers will be required to report to the Equal Employment Opportunity Commission (EEOC) their employees’ pay by sex, race, and ethnicity within 12 specified pay bins. This reporting will help the EEOC improve enforcement of pay discrimination laws and may provide some insight into the persistence of wage gaps. This week’s WAPPP seminar featured Paul von Hippel, Assistant Professor of Public Affairs at the Lyndon B. Johnson School of Public Affairs, University of Texas at Austin. While many methods have been used to analyze binned incomes, little work has been done to evaluate these methods. Professor von Hippel described three statistical methods for analyzing binned data and their relative accuracy in estimating underlying income differences.
According to Professor von Hippel, the limitations of binned data are widely misunderstood. The general impression is that because pay bins can be $10,000-$15,000 wide, it is impossible to estimate pay differences that are smaller than the bin width. While it is difficult to measure differences in individual pay within a single pay band, it is possible to discern differences in average or median pay with much greater precision. Even with a bin width of $10,000, it is possible to estimate pay differences to within $1,500. Contrary to the common assumption, bin width is not an obstacle to this sort of analysis.
Depending upon the desired estimate, be it average income, median income, or an index of inequality like the Gini coefficient, certain types of statistical analysis will be more precise than others. Using Census data that include both binned incomes and statistics on the underlying non-binned incomes, Professor von Hippel compared three statistical methods to see how closely each one, working only from the binned data, reproduces the statistics of the underlying non-binned incomes.
Robust Pareto Midpoint Estimator (RPME): This method, the simplest, sets each household income to the midpoint of its bin. While it is not the most sophisticated method, it works about as well as more complex analyses, particularly if the bins aren’t too wide. As the number of bins increases, the estimates get better and better. The only “trouble spot” with this type of analysis is the top income bin, which doesn’t have an upper bound and therefore has no midpoint. To solve this issue, the method fits a Pareto distribution to the top two income bins and plugs in the mean of that distribution for households in the top bin. The traditional formula, which uses the arithmetic mean, runs into trouble when there are a large number of high-income households, but using a harmonic mean instead tends to resolve this difficulty. This method is very quick and can run on thousands of employers within a minute or so. The downside is that it is unrealistic to assume that every household within a given pay bin has the same pay value, and the analysis may lose something by treating each household in a bin the same.
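The midpoint idea can be sketched in a few lines of Python. This is a minimal illustration, not the Stata or R implementation: the function names and the example bins are hypothetical, and it assumes the Pareto value is applied to the open-ended top bin only. The shape parameter comes from the counts in the top two bins, and the top-bin value uses the standard Pareto arithmetic mean (defined only when the shape exceeds 1) or the harmonic mean.

```python
import math

def pareto_shape(lowers, counts):
    """Pareto shape implied by the counts in the top two bins.

    lowers: ascending lower bounds of the bins (the last bin is open-ended).
    Assumes the top bin is non-empty and the second-highest lower bound is > 0.
    """
    n_prev, n_top = counts[-2], counts[-1]
    l_prev, l_top = lowers[-2], lowers[-1]
    # For a Pareto tail, P(X > l_top | X > l_prev) = (l_prev / l_top) ** alpha.
    return math.log((n_prev + n_top) / n_top) / math.log(l_top / l_prev)

def midpoint_mean(lowers, counts, robust=True):
    """Mean income with bounded bins at their midpoints and a Pareto top bin."""
    alpha = pareto_shape(lowers, counts)
    l_top = lowers[-1]
    if robust:
        # Harmonic mean of a Pareto distribution: finite for every alpha > 0.
        top_value = l_top * (1 + 1 / alpha)
    else:
        if alpha <= 1:
            raise ValueError("arithmetic Pareto mean is undefined for alpha <= 1")
        top_value = l_top * alpha / (alpha - 1)
    midpoints = [(lo + hi) / 2 for lo, hi in zip(lowers[:-1], lowers[1:])]
    values = midpoints + [top_value]
    return sum(v * c for v, c in zip(values, counts)) / sum(counts)

# Hypothetical example: four bins, the last one open-ended at $30,000 and up.
lowers = [0, 10000, 20000, 30000]
counts = [10, 20, 10, 10]
print(midpoint_mean(lowers, counts))
```

The arithmetic mean grows without bound as the estimated shape approaches 1, which is one way to see why a heavy top bin can cause trouble and why the harmonic mean is the more stable choice.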
Multi-Model Generalized Beta Estimator (MGBE): This method involves fitting continuous distributions to the binned income distribution. Working with county-level Census data, Professor von Hippel fit each of 10 different continuous distributions to the income distribution for a county, took the distribution that best fit that county’s data, and then used the selected distributions to estimate average income, median income, and the Gini coefficient. One positive aspect of this approach is that it treats incomes as continuous, not discrete. On the negative side, even the best-fitting distribution may not fit that well. This is particularly true if income is bimodal.
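The fit-several-then-pick-the-best idea can be illustrated with a small Python sketch. Everything here is an assumption for illustration: only two candidate families are used (log-normal and log-logistic) rather than the ten from the seminar, which the text does not list, and a crude pattern search stands in for a proper optimizer. Each family is fit to the bin counts by maximum likelihood, and the one with the higher likelihood supplies the median estimate.

```python
import math

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def logistic_cdf(z):
    return 0.5 * (1.0 + math.tanh(z / 2.0))  # overflow-safe logistic function

# Both families have the form F(x) = G((ln x - m) / s), with median exp(m).
CANDIDATES = {"lognormal": normal_cdf, "loglogistic": logistic_cdf}

def binned_loglik(cdf, m, s, uppers, counts):
    """Multinomial log-likelihood of the bin counts.

    uppers: finite upper edges of every bin except the open-ended top bin,
    so len(counts) == len(uppers) + 1.
    """
    cum = [0.0] + [cdf((math.log(u) - m) / s) for u in uppers] + [1.0]
    return sum(c * math.log(max(hi - lo, 1e-300))
               for c, lo, hi in zip(counts, cum[:-1], cum[1:]))

def fit_family(cdf, uppers, counts, rounds=300):
    """Crude pattern search over (m, s); a stand-in for a real optimizer."""
    m, s = math.log(uppers[len(uppers) // 2]), 1.0
    step_m, step_s = 1.0, 0.5
    best = binned_loglik(cdf, m, s, uppers, counts)
    for _ in range(rounds):
        improved = False
        for dm, ds in ((step_m, 0), (-step_m, 0), (0, step_s), (0, -step_s)):
            cand_m, cand_s = m + dm, s + ds
            if cand_s < 0.05:  # keep the scale parameter away from zero
                continue
            ll = binned_loglik(cdf, cand_m, cand_s, uppers, counts)
            if ll > best:
                best, m, s, improved = ll, cand_m, cand_s, True
        if not improved:  # no move helped: shrink the steps and try again
            step_m, step_s = step_m / 2, step_s / 2
    return m, s, best

def best_fit(uppers, counts):
    """Return the best-fitting family's name and its estimated median."""
    fits = {name: fit_family(cdf, uppers, counts)
            for name, cdf in CANDIDATES.items()}
    name = max(fits, key=lambda k: fits[k][2])
    return name, math.exp(fits[name][0])
```

Because both candidates here have two parameters, comparing likelihoods directly is equivalent to comparing them by AIC; with families of different sizes a penalized criterion would be the safer choice.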
Spline CDF Estimator: This method uses nonparametric bin smoothing. The simplest version, a step function that spreads incomes evenly within each bin, works nicely, but the method works even better if the bins are divided recursively or if a cubic spline is fit to smooth over the step function and model the distribution of incomes. Professor von Hippel credits David Hunter and McKalie Drown for their work on this method, which combines the best aspects of the other two: the Spline CDF Estimator models income as continuous and reproduces the bin counts exactly.
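The step-function version of this idea is easy to sketch in Python. This is only an illustration under simplifying assumptions, not the binsmooth implementation: the function name and the counts are hypothetical, every bin (including the top one) is given a finite upper edge, and incomes are spread uniformly within each bin, so the CDF is linear between bin edges and any quantile can be read off by interpolation.

```python
def quantile_from_bins(edges, counts, q=0.5):
    """Quantile of binned data, assuming a uniform density within each bin.

    edges: len(counts) + 1 ascending, finite bin edges.
    """
    total = sum(counts)
    target = q * total
    cum = 0
    for lo, hi, c in zip(edges[:-1], edges[1:], counts):
        if c > 0 and cum + c >= target:
            # Linear interpolation of the CDF inside the bin holding the quantile.
            return lo + (target - cum) / c * (hi - lo)
        cum += c
    return edges[-1]

# Hypothetical pay bins, each $10,000 wide, for two groups of 40 employees.
edges = [0, 10000, 20000, 30000]
group_a = [10, 20, 10]
group_b = [12, 20, 8]

gap = quantile_from_bins(edges, group_a) - quantile_from_bins(edges, group_b)
print(gap)
```

In this made-up example the two medians fall in the same bin, yet the estimated gap between them is far smaller than the bin width, because the counts in the neighboring bins shift where the median sits inside the bin.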
Of these three techniques, the Spline CDF Estimator is the most accurate. Of the other two, RPME is much faster and works about as well as MGBE for some estimates, particularly average income. It is less accurate for median income, and inequality indices are harder still to estimate. There are still some inaccuracies with these methods, particularly when trying to estimate trends in inequality over time, but this analysis is good news for researchers studying wage gaps. With the binned data soon to be available from the EEOC, it will be possible to estimate income differences much more precisely than the bins would seem to indicate, and with good data comes good policy! RPME and MGBE are available both in Stata and in R’s binequality package, and the Spline CDF Estimator is available in R’s binsmooth package.