Summary & Context
This project demonstrates (1) visualization using Tableau and (2) preprocessing and analysis of a large public dataset.
I chose this project to practice Tableau and to represent my background in the social sciences. It is also an effort to celebrate publicly available data and an attempt to make that data even more accessible. The General Social Survey, which receives its primary funding from the National Science Foundation, has been collecting data on American political opinions and social behaviors since 1972. While the analyses here tell a story of Americans’ declining trust in institutions alongside rising political polarization, current administration efforts to slash public spending put funding for projects like the GSS in the crosshairs.
The General Social Survey Dashboard
The General Social Survey consists of 34 waves of representative surveys. The entire dataset contains over 72,000 rows and over 6,600 columns. See the GSS website for further details.
My purpose here is not to provide extensive analysis of any one measure; rather, users can interactively explore basic relationships through visualization. I constructed a series of measures at my discretion, including:
- Education
- Health
- Happiness
- Social Attitudes
- Social Relationships
- Confidence in Institutions
- Confidence in Government
- Confidence in Media
- Work-Life Balance
- Quality of Life
- Religiosity
- Polarization
Click on a measure button to see its description, distribution, and trends. Toggle a demographic dimension. Select a specific year or all years.
Unfortunately, I have difficulty right-sizing Tableau’s dashboards for GitHub Pages. Automatic scaling in Tableau alters objects in the view, and GitHub’s default page width does not match my desktop. Visit my Tableau Public Profile for a more responsive dashboard: 50 Years of the General Social Survey
Details for each measure are shared below. Briefly, most measures are individual-level indices (except Education and Health, which are raw measures, and Polarization, which is an aggregation). All measures except Polarization have been standardized for better comparability. Global trend line data include imputed data for the sake of drawing a continuous trend line - but those imputed data do not carry over to other charts.
Python coding scripts are available in my GitHub repository. Examples are shown below.
Measure and Dimension details
Variable Name | Variable Type | Description | GSS Variables |
---|---|---|---|
Education | Single Measure | Years of school, standardized. | EDUC |
Health | Single Measure | Self-reported Health, scale 1 (poor health) - 4 (excellent health), standardized. | HEALTH |
Happiness | Index Measure | Average happiness, life excitement, and happy relationships, standardized. | HAPPY, LIFE, HAPMAR, HAPCOHAB |
Religiosity | Index Measure | Average religious service attendance, prayer, importance of religion, and strength of beliefs, standardized. | ATTEND, PRAY, RELITEN, GOD, BIBLE |
Social Relationships | Index Measure | Average time spent with family, friends, neighbors, and out at bars, standardized. | SOCREL, SOCFREND, SOCOMMUN, SOCBAR |
Social Attitudes | Index Measure | Average trust and feelings about others being helpful and fair, standardized. | TRUST, HELPFUL, FAIR |
Work-Life Balance | Index Measure | Product of job satisfaction and reversed hours/week worked for respondents with at least 10 hr/week, standardized. | SATJOB, HRS1 |
Quality of Life | Index Measure | Average standardized Education, Health, Social Relationships, and Work-Life Balance. | EDUC, HEALTH, SOCREL, SOCFREND, SOCOMMUN, SOCBAR, SATJOB, HRS1 |
Confidence in Institutions | Index Measure | Average confidence in 13 institutions. | CONEDUC, CONFED, CONMEDIC, CONARMY, CONBUS, CONCLERG, CONFINAN, CONJUDGE, CONLABOR, CONLEGIS, CONPRESS, CONSCI, CONTV |
Confidence in Government | Index Measure | Average confidence in congress and executive branch. | CONFED, CONLEGIS |
Confidence in Media | Index Measure | Average confidence in television and the press. | CONPRESS, CONTV |
Age Group | Dimension | Respondent’s age categorized for comparability. | AGE |
Degree | Dimension | Respondent’s education level (note correlation with Education, above). | DEGREE |
Race | Dimension | Respondent’s race. | RACE |
Socioeconomic Status | Index Dimension | Standardized education (years of school), occupational prestige, and total family income, split 30% (low) / 40% (mid) / 30% (high); see the sketch below the table. | EDUC, PRESTG10, REALINC |
Gender | Dimension | Respondent’s gender. | SEX |
Race * Gender | Dimension | Respondent’s Race/Gender. | RACE, SEX |
Race * SES | Dimension | Respondent’s Race/Socioeconomic status. | RACE, EDUC, REALINC, PRESTG10 |
Place | Dimension | Type of Social Environment, recoded using SIZE and XNORCSIZ. | SIZE, XNORCSIZ |
Region | Dimension | Region of interview. All nine categories are used in the map. Collapsed to four categories for dimension comparison. | REGION |
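For illustration, here is a minimal sketch of how the Socioeconomic Status split described in the table could be computed with pandas. It assumes the trends DataFrame selected in the code further below; the ses_group column name and the percentile-cut approach are my own illustration, and the exact logic in the repository scripts may differ.

# A sketch of the Socioeconomic Status dimension: standardize the three components,
# average them, and split at the 30th and 70th percentiles (30/40/30).
ses_parts = trends[["EDUC", "PRESTG10", "REALINC"]]
ses_z = (ses_parts - ses_parts.mean()) / ses_parts.std()       # standardize each component
ses_score = ses_z.mean(axis = 1, skipna = True)                 # average the z-scores
low_cut, high_cut = ses_score.quantile([0.30, 0.70])            # 30% / 40% / 30% split points
trends["ses_group"] = pd.cut(ses_score,
                             bins = [-np.inf, low_cut, high_cut, np.inf],
                             labels = ["Low SES", "Mid SES", "High SES"])  # hypothetical column name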
Sample Processing, Recoding & Index Construction
In the code below, I will
- Select variables from GSS for the dashboard.
- Demonstrate key variable recodes.
- Construct indices, demonstrating reliability analysis, factor analysis, and a minimum variable threshold.
- Impute measures for continuous trend lines.
Select Variables
# import required packages and libraries
import os
import pyreadstat # reads sas files
import pandas as pd
import numpy as np
import pingouin as pg
from factor_analyzer import FactorAnalyzer
from sklearn.preprocessing import StandardScaler # I'll also write my own standardization function.
import matplotlib.pyplot as plt
os.chdir("C:/Dat_Sci/Data Projects/GSS/dashboard project")
df, meta = pyreadstat.read_sas7bdat('C:/Dat_Sci/Datasets/GSS_sas/gss7222_r3.sas7bdat')
df.shape
Out[5]: (72390, 6691)
6,691 is a lot of columns. To identify relevant columns, I used the GSS website’s variable explorer dashboard, which lets you search variables by topic area or by module, including the “core” module (measures collected at every wave) and many others. The following variables were selected after extensive digging, research, and iteratively realizing what I was interested in.
trends = df[["YEAR", "SIZE", "XNORCSIZ", "AGE", "SEX", "EDUC", "PRESTG10", "REALINC",
"DEGREE", "RACE", "HAPPY", "TRUST", "HELPFUL", "FAIR", "HEALTH", "LIFE",
"HAPMAR", "HAPCOHAB", "RELITEN", "GOD", "BIBLE", "REGION", "ATTEND",
"PRAY", "HRS1", "SATJOB", "MOBILE16", "POLVIEWS", "PARTYID", "SOCREL",
"SOCOMMUN", "SOCBAR", "SOCFREND", "CONEDUC", "CONFED", "CONMEDIC", "CONARMY",
"CONBUS", "CONCLERG", "CONFINAN", "CONJUDGE", "CONLABOR",
"CONLEGIS", "CONPRESS", "CONSCI", "CONTV"]].copy()
Key Variable Recodes
I demonstrate recodes and index construction for two key measures: Confidence in Institutions & Work-Life Balance
Confidence in Institutions
The GSS provides 13 items about confidence in institutions. The questionnaire script reads: “I am going to name some institutions in this country. As far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them?”
And the GSS codes these responses: 1: “a great deal” 2: “only some” 3: “hardly any”
To create an index out of these, I’ll first recode the questions in a positive direction so that outcomes are easily interpretable: high numbers = high confidence. I’ll then analyze how internally reliable the set of 13 questions is using an alpha analysis, and I’ll use factor analysis to see whether and how respondents cohere around these questions. Because 13 is a lot of items, we’ll also want a minimum-item cutoff for rows that contain missing values: we want to minimize the number of index values built from only a few answered items while maximizing our total responses (n). Finally, this index, along with all the others, will be standardized for easy comparison.
# use a list to recode all items at once
conf_vars = ["CONEDUC", "CONFED", "CONMEDIC", "CONARMY", "CONBUS",
"CONCLERG", "CONFINAN", "CONJUDGE", "CONLABOR",
"CONLEGIS", "CONPRESS", "CONSCI", "CONTV"]
# I change the case to keep the variable name while indicating it has been recoded.
for var in conf_vars:
    trends[var.lower()] = 4 - trends[var] # Subtracting the original value from 4 reverses the order (1 <-> 3).
# And then drop the original.
trends.drop(conf_vars, axis = 1, inplace=True)
# Now construct the Index
conf_vars = [var.lower() for var in conf_vars] # First I want the same list in lower case because I renamed the variables to lower case.
trends["conf_index"] = trends[conf_vars].mean(axis = 1, skipna = True) # take the mean, instead of the sum, so that rows with missing values don't bring down the totals. I'll deal with missing data below.
trends["conf_index"].describe()
Out[12]:
count 48957.000000
mean 2.031674
std 0.352033
min 1.000000
25% 1.800000
50% 2.000000
75% 2.230769
max 3.000000
Name: conf_index, dtype: float64
# Now we have an index, but how well do the component items cohere, and how do respondents' answers cohere?
## A. Alpha Analysis for inter-item reliability
confidence_columns = trends[conf_vars].copy() # set the component columns apart
reliability = pg.cronbach_alpha(data = confidence_columns)
print(reliability) # Alpha = .79
# An alpha of .79 indicates good inter-item reliability. The items are all getting at one underlying construct, confidence in institutions.
## B. Factor Analysis
fa = FactorAnalyzer(n_factors=3, rotation="varimax") # Adjust n_factors as needed
fa.fit(confidence_columns)
# Get eigenvalues to determine the optimal number of factors
ev, v = fa.get_eigenvalues()
print("Eigenvalues:", ev)
Eigenvalues: [3.54629544 1.41743876 1.00764509 0.96894189 0.86242839 0.80770145
0.75664672 0.70908479 0.67070858 0.61520566 0.58425706 0.55188097
0.50176519]
# Eigenvalues > 1 indicate a useful factor. In this case three factors exceed 1, though the third only barely.
# Print factor loadings to observe which items cohere in each factor.
loadings = fa.loadings_
print("Factor Loadings:\n", loadings)
Factor Loadings:
[[ 0.36118129 0.24410226 0.23097058]
[ 0.19306445 0.59500797 0.1643183 ]
[ 0.62462896 -0.00825083 0.17924858]
[ 0.45896996 0.17788212 -0.03423817]
[ 0.41959102 0.25217145 0.09374704]
[ 0.28677819 0.22292182 0.11038094]
[ 0.38402004 0.25974974 0.17509192]
[ 0.45309956 0.3305685 0.10799279]
[ 0.06857161 0.2834985 0.27623394]
[ 0.16888011 0.6249543 0.29079604]
[ 0.09311359 0.15853253 0.66356791]
[ 0.52642256 0.03028766 0.04476395]
[ 0.1125664 0.16761342 0.54758137]]
# factor loadings > .4 indicate a useful inclusion. The rows represent the variables (in the order in which I listed them), the columns represent the factors.
# next step: separate and name three domains
factor1 = ["conmedic", "conarmy", "conbus", "confinan", "conjudge"] # generic confidence in institutions
factor2 = ["confed", "conlegis"] # confidence in the government
factor3 = ["conpress", "contv"] # confidence in the media
# subsequent reliability analysis does not show an improvement of factor1 (a subset of 5) over the total index measuring confidence in 13 institutions.
# For the dashboard, I'll keep the Confidence in Institutions Index, and I'll add separate indices for Confidence in Government and Confidence in Media.
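For reference, the subset-versus-full reliability comparison mentioned in the comment above could be run like this (a sketch, not the exact repository code), reusing the factor1 and conf_vars lists defined earlier:

# Compare Cronbach's alpha for the 5-item factor1 subset against the full 13-item set.
alpha_subset, _ = pg.cronbach_alpha(data = trends[factor1])
alpha_full, _ = pg.cronbach_alpha(data = trends[conf_vars])
print("factor1 (5 items) alpha:", round(alpha_subset, 2))
print("full index (13 items) alpha:", round(alpha_full, 2))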
Number of Items Sensitivity Test
Earlier I mentioned that taking the mean of the component measures is better than summing them to get the index, because missing values in some items would artificially lower a respondent’s total index score. Averaging any number of variables on a 1-3 scale still yields a range of 1-3, which doesn’t feel like a composite index, but we would standardize in the end anyway, which (ideally) yields a zero-centered range of roughly -3 to 3 either way. When we do take the mean, though, we should be careful that the output represents the same thing for all respondents. If one respondent answered a 3 for one item and then refused to continue, that is very different from a respondent who answered all 13 carefully. We can count the number of items each respondent answered and compare means, distributions, and n for respondents who answered k, k-1, k-2, etc.
trends["num_conf_vars"] = trends[conf_vars].notna().sum(axis=1) # count the number of missing items per row.
thresholds = [6, 8, 10, 11, 12] # I'll compare distributions for each of these minimum questions answered thresholds.
results = {}
for threshold in thresholds:
    temp_df = trends.copy() # Create a copy of the dataset to avoid modifying the original
    temp_df["conf_index"] = temp_df[conf_vars].mean(axis=1, skipna=True) # Recalculate the confidence index
    temp_df.loc[temp_df["num_conf_vars"] < threshold, "conf_index"] = np.nan # Set the index to NaN for rows with fewer answered items than the threshold.
    # Summarize the results
    results[threshold] = temp_df["conf_index"].describe()
# Display the summaries for each threshold
for threshold, summary in results.items():
print(f"Threshold: {threshold}")
print(summary)
print("\n" + "="*50 + "\n")
Threshold: 6
count 48665.000000
mean 2.031716
std 0.350796
min 1.000000
25% 1.800000
50% 2.000000
75% 2.230769
max 3.000000
Name: conf_index, dtype: float64
==================================================
Threshold: 8
count 48290.000000
mean 2.031468
std 0.350350
min 1.000000
25% 1.800000
50% 2.000000
75% 2.230769
max 3.000000
Name: conf_index, dtype: float64
==================================================
Threshold: 10
count 47510.000000
mean 2.031037
std 0.349782
min 1.000000
25% 1.800000
50% 2.000000
75% 2.230769
max 3.000000
Name: conf_index, dtype: float64
==================================================
Threshold: 11
count 46674.000000
mean 2.030616
std 0.349311
min 1.000000
25% 1.818182
50% 2.000000
75% 2.230769
max 3.000000
Name: conf_index, dtype: float64
==================================================
Threshold: 12
count 44891.000000
mean 2.029900
std 0.348355
min 1.000000
25% 1.769231
50% 2.000000
75% 2.230769
max 3.000000
Name: conf_index, dtype: float64
==================================================
# Fortunately, there is very little change in n, and almost no change in the mean. Using a minimum of 8 variables for the index provides the best trade-off.
# Use minimum 8 variables
# trends["conf_index"] = trends[confidence_vars].mean(axis=1, skipna=True) # index construction again
trends.loc[trends["num_conf_vars"] < 8, "conf_index"] = np.nan # calling all rows with less than 8 answers "nan."
Work-Life Balance
A less obvious measure is Work-Life Balance. Rather than providing a single subjective measure, the GSS offers several items relevant to this construct. Past researchers have constructed this index in myriad ways, often including domestic responsibilities (especially for working women who double as primary caretakers) among the factors that make it up. However, work-life balance is not the primary focus here, so I’ll keep it simple and define it as a function of job satisfaction and number of hours worked. As far as trends go, I’m simply interested to see whether the average American is working longer and harder hours.
- Job Satisfaction (SATJOB): The GSS asks: “On the whole, how satisfied are you with the work you do–would you say you are very satisfied, moderately satisfied, a little dissatisfied, or very dissatisfied?”
GSS Codes: 1: “Very satisfied” 2: “Moderately satisfied” 3: “A little dissatisfied” 4: “Very dissatisfied”
- Work Hours (HRS1): The GSS asks: “A. IF WORKING, FULL OR PART TIME: How many hours did you work last week, at all jobs?”
GSS Codes: Continuous numeric, Range 0-89.
To construct this index I’ll first recode these variables so that they are numerically interpretable within the construct. I’ll reverse-code job satisfaction so that high numbers = high satisfaction = good for work-life balance, and reverse work hours so that low hours worked = good for work-life balance. Even though the GSS only asks work hours of those with a full-time or part-time job, I’ll exclude respondents working less than 10 hours to ensure they are actually working. And because a number of respondents have apparently worked 89+ hours, I identified outliers (>80 hrs) and removed them from the distribution, for a practical range of 10-80 hours per week. Multiplying reversed hours by job satisfaction then yields a raw index on a much wider, arbitrary scale than the component items. Because these component variables are on completely different scales, alpha analysis and factor analysis do not apply, and because only two variables make up the construct, both must be included in the measure. Like Confidence in Institutions, Work-Life Balance will be standardized so it is comparable with other measures on the same scale.
# recode job satisfaction
trends["satjob"] = 5 - trends["SATJOB"] # reverse order so "very satisfied" is highest.
trends["hrs_trim"] = np.where(trends["HRS1"] < 10, np.nan, trends["HRS1"]) # subset minimum hours to ensure people are actually part of the work force
# trim outliers at 3 std * mean
outlier_threshold = trends["HRS1"].mean() + (3 * trends["HRS1"].std()) # Find the threshold. 3 standard deviations above the mean is one way.
trends.loc[trends["HRS1"] > outlier_threshold, "hrs_trim"] = np.nan # setting outliers and <10 hrs to np.nan to preserve rows.
# reverse direction of hours for index; so high hours worked = poor work/life balance
max_hours = trends["hrs_trim"].max(skipna = True)
trends["hrs_rev"] = (max_hours + 1) - trends["hrs_trim"] # max_hours + 1 means that there is not a zero point.
# print(trends["hrs_rev"].describe())
# print(trends["satjob"].describe())
# construct index as satjob * hrs_rev
trends["wlb_raw"] = trends["satjob"] * trends["hrs_rev"]
# trends["wlb_raw"].describe()
trends.drop(["hrs_rev", "hrs_trim"], axis = 1, inplace = True)
Standardize
Because I’ll end up with a dozen or so index measures, I will standardize for easy comparison. That is, I’ll convert raw index scores (which in some cases range from 1 to 3 and in others span hundreds of points) to z-scores, giving each a range of roughly -3 to 3. Each individual measure is then interpretable as the number of standard deviations from the mean. While z-scores don’t mean anything specific in terms of confidence in institutions or work-life balance, for example, they do reveal which scores are relatively high or low. The convenience of being able to put standardized measures on the same graphs because their scales align is a good trade-off.
One way to standardize is to use scikit-learn’s StandardScaler(); another is to simply subtract the mean and divide by the standard deviation. Because I’m not pipelining any data, and because StandardScaler() and ordinary math end up treating missing values the same, I’ll use a simple function here.
def standardize(x):
return (x - x.mean()) / x.std()
dashboard_measures = ["education", "health", "religiosity", "social_attitudes", "social_relationships", "work_life_balance", "quality_of_life", "conf_index", "conf_gov", "conf_media"]
for measure in dashboard_measures:
trends[measure + "_z"] = standardize(trends[measure]) # the z indicates that it is now standardized.
Data Storytelling vs. Comprehensiveness vs. Interactivity in Tableau
Tableau is a powerful tool because it allows almost limitless ways to visualize information, which gives the author incredible creative power to tell stories with data. Its interactive features allow users to assume some of that power as well. However, I find that the more information is crammed onto a dashboard, the harder it becomes to make a data story cohere. A simple bivariate relationship can be given a well-designed graph, exciting visuals, KPI-style output text, and accompanying description or analytic detail. And it is often these simple dashboards that make Tableau’s viz-of-the-day list. But my purpose was more a celebration of the GSS’s awesome breadth. It was hard to choose what NOT to include. I make no apologies that I crammed a lot in; in the future I would like to add more. Unfortunately, this means there is not a singular storyline in the dashboard but multiple, and the user is given the ability to uncover what narrative they will.
In any case, the key to creating an appealing dashboard has as much to do with design decisions as with what data are available. I chose to center the trend lines and include “50 years” in the title to emphasize the theme of change over time. I created index buttons in different colors with thumbnail trend lines attached to draw the user’s attention to the several different measures, while emphasizing that change is still the theme.

Unfortunately, I was not able to reconcile Tableau’s filter and pages shelves to let users toggle between years seamlessly. I love that the pages feature lets users fast-forward through the years, dynamically updating the bar chart, map, and political distribution graphs. But the pages shelf does not allow an “all years” page, which is frustrating. I could include both a year filter and a pages button, but then there are two “Year” switches: when the Year filter is set to “all years” and a user toggles the pages, the filter parameter does not change, so it still reads “all years” even as the pages advance. A user could stop the pages, but they would then have to unselect “all years” from the filters shelf and reselect it to get back to an “all years” view, which is not intuitive, so I decided it was a bad design feature. Despite extensive research, I have not yet been able to figure this out. I have not seen another dashboard that accomplishes it, I found no results in a Google search, and ChatGPT o3 could not figure it out either. I ended up with just a filter switch for the years, with the default set to “all years.” Still, it is not immediately apparent that selecting a specific year will automatically update the bar charts, political distributions, and map details.
What did work well were the several parameter controls and calculated fields that control which data are being visualized. Because I included all of my measures and all of the dimensions in each chart, I used the same parameters and calculated fields (pointing the parameter to the data) for each.
I’m excited to continue to use this awesome tool in future projects, with an eye to keeping measures and outcomes simple while allowing for more impactful design.