The Trends in International Mathematics and Science Study (TIMSS) is a series of international assessments of the mathematics and science knowledge of students around the world. It is administered jointly by the International Association for the Evaluation of Educational Achievement (IEA) and Boston College (BC), who first conducted the assessments in 1995 and have re-administered them every 4 years since (1999, 2003, 2007, 2011, and 2015). Different versions of the assessments are given to 4th graders and 8th graders of participating countries. Each question in the assessment tests the students’ grasp of a particular cognitive domain and content domain. Every assessment contains both multiple-choice and free-response questions. After the test is given, each student is given a performance score ranging from 0 as the lowest to 1000 as the highest, with 500 being the intended international mean. Each participating country, in turn, is given a performance score which is the average of all of its students’ scores. The study uses the scores 625, 550, 475, and 400 to represent the advanced, high, intermediate, and low international benchmarks respectively. That is, a student with a performance score greater than 625 is considered to have an advanced grasp of either math or science, while a student with a performance score less than 400 is considered to have a below-low grasp of either math or science.
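The benchmark cut points described above can be sketched in base R. The helper name `benchmark_level` and the sample scores are hypothetical, and treating a score that falls exactly on a cut point as reaching that benchmark is an assumption rather than something stated here.

```r
# Hypothetical helper: map TIMSS scale scores to the international benchmarks.
# Assumption: a score exactly on a cut point counts as reaching that benchmark.
benchmark_level <- function(score) {
  cut(score,
      breaks = c(-Inf, 400, 475, 550, 625, Inf),
      labels = c("Below Low", "Low", "Intermediate", "High", "Advanced"),
      right = FALSE)  # intervals are [lower, upper)
}

# Made-up scores, for illustration only
scores <- c(380, 416, 502, 613, 640)
levels_found <- as.character(benchmark_level(scores))
print(levels_found)
```

Using `cut()` keeps the cut points in one place, so changing a benchmark only means editing the `breaks` vector.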
The primary purpose of this article is to show different methods for visualizing question-level data for the TIMSS Math Assessment. This will be done by analyzing the 2011 math test given to 8th grade students of the Republic of Korea, Lithuania, the United States of America, and Chile. Question-level data is the typical response provided by students of each country to each question asked. Visualizing question-level data will hopefully lead to better analysis of a country’s educational needs. Korea was chosen because it performed the best of all 45 participating countries with a performance score of 613. Lithuania was chosen because it performed just above the TIMSS scale centerpoint with a score of 502. Both the USA and Chile were chosen because they are countries of personal interest, with scores of 509 and 416 respectively.1
It is worth noting that the IEA and BC consider countries with more than 15% of students scoring less than 400 as countries that cannot be reliably assessed. This is because such a high rate of below-low performers suggests an increased probability of random guessing. If we eliminate all the countries that could not be reliably assessed, then the country with the lowest performance score is Chile, one of the two countries of personal interest.
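The 15% rule can be expressed as a short check. This is only a sketch of the rule as stated above: the function names and the two score vectors are made up, and the real TIMSS procedure may involve sampling weights that are ignored here.

```r
# Hypothetical helper: share of students scoring below the low benchmark (400).
share_below_low <- function(scores, cutoff = 400) {
  mean(scores < cutoff, na.rm = TRUE)
}

# A country counts as reliably assessed when that share is at most 15%.
reliably_assessed <- function(scores, max_share = 0.15) {
  share_below_low(scores) <= max_share
}

# Made-up scores for two imaginary countries
country_a <- c(350, 420, 480, 510, 530, 560, 600, 610, 450, 470)  # 1 of 10 below 400
country_b <- c(320, 350, 390, 410, 430, 450, 380, 500, 520, 395)  # 5 of 10 below 400

reliably_assessed(country_a)
reliably_assessed(country_b)
```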
The secondary purpose of this article is to tidy the datasets provided by IEA. The new tidy datasets will facilitate the visualization of data in this article as well as the analysis that will be done in future articles. There are few articles online that discuss cleaning and analyzing the TIMSS dataset in R. Therefore, the Data Munging section is meant as a resource for researchers who would like to analyze the TIMSS dataset in R. For everyone else, the section can be summarized as a repetition of the following code:
dataframe %>%
group_by(column_of_interest) %>%
nest() %>%
mutate(
new_column = lapply(.$data, anonymous_function) %>%
unlist()) %>%
unnest()
Feel free to skip directly to the data visualization section if data tidying does not fill you up with joy…
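For readers who want to see the nest-and-summarise pattern run end to end, here is a toy sketch. The data and column names are invented, `map_dbl()` from purrr stands in for the `lapply() %>% unlist()` step, and `unnest(cols = data)` is used because tidyr 1.0.0 and later ask for the column to be named explicitly.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Toy data: five observations in two groups (values made up)
toy <- tibble(
  group = c("a", "a", "a", "b", "b"),
  value = c(1, 2, 3, 10, 20)
)

toy_result <- toy %>%
  group_by(group) %>%
  nest() %>%                       # one row per group; other columns move into `data`
  ungroup() %>%
  mutate(group_mean = map_dbl(data, function(d) mean(d$value))) %>%
  unnest(cols = data)              # back to one row per original observation

print(toy_result)
```

Each original row comes back with the group-level summary attached, which is the shape the cleaning functions in this article rely on.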
According to Hadley Wickham, tidy data is “a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types. In tidy data: each variable forms a column, each observation forms a row, and each type of observational unit forms a table.”
This article will attempt to tidy the TIMSS dataset as much as possible. However, the nesting capabilities of the tidyr package make it very tempting to keep all observations in a single table. If all information is kept in one table, then there is a greater chance that multiple observations will be contained in a single row.
The TIMSS dataset that this article works on can be divided into 3 main types of observational units. These observational units are measured at the Student, Book, and Question levels. The focus of this article is question-level data; however, rich question-level data cannot be derived without first organizing the TIMSS dataset at the student and book levels. With tidyr, every observational unit can be stored in a single dataframe. A tidy table for any one observational unit can be derived from this single dataframe with three simple commands:
dataframe %>%
group_by(column_representing_observational_unit) %>%
nest()
Therefore the single dataframe this article creates will only be three commands away from a truly tidy dataset.
IEA has a data repository where it publishes study data. It can be accessed at http://rms.iea-dpc.org/. The sequence of clicks is: SEARCH > TIMSS > Grade 8 > 2011 > Chile::Student Test Responses > Korea, Rep. of::Student Test Responses > Lithuania::Student Test Responses > United States::Student Test Responses > SPSS > Codebooks > Download Name:::Whatever > Add To Basket > View Basket > (disk/save icon).
The student achievement books contain student-level information such as student responses to each question, the students’ gender, and the students’ school. Under the IEA naming convention for data files, bsachlm5 represents the 2011 8th grade student achievement test for Chile and bsatmsm5 represents the 2011 8th grade student achievement codebook.
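A minimal sketch of splitting such a file name into its parts follows. It assumes that names are always eight characters, laid out as a three-character file type, a three-character country or codebook code, and a two-character study cycle. That layout is inferred from the two examples above rather than taken from IEA documentation, and the function name is hypothetical.

```r
# Hypothetical helper: split a TIMSS data file name into assumed components.
parse_timss_name <- function(file_name) {
  stem <- tolower(sub("\\.[^.]*$", "", basename(file_name)))  # drop folder and extension
  list(
    file_type = substr(stem, 1, 3),  # "bsa" in both examples above
    code      = substr(stem, 4, 6),  # "chl" for Chile, "tms" for the codebook
    cycle     = substr(stem, 7, 8)   # "m5" in both examples above
  )
}

parsed <- parse_timss_name("bsachlm5.sav")
str(parsed)
```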
The principal packages used in this article are tidyverse and stringr. The tidyverse package is important because it includes the dplyr (cleaning), tidyr (cleaning), and ggplot2 (visualizing) packages as well as the magrittr pipe (%>%). The stringr package is important because of its text/string manipulation capabilities. Other packages used are haven, readxl, and gridExtra.
The TIMSS dataset must be loaded before any cleaning, analysis, or visualization can be done.
library(tidyverse)
library(stringr)
library(haven)
chl_achievement_11 <- read_spss('bsachlm5.sav')
kor_achievement_11 <- read_spss('bsakorm5.sav')
ltu_achievement_11 <- read_spss('bsaltum5.sav')
usa_achievement_11 <- read_spss('bsausam5.sav')
detach(package:haven)
The student achievement books are in wide format. Each row represents one student and each column represents either a question or student information. The books include both math and science questions as well as student performance scores, school and student IDs, gender, and test information.
dim(chl_achievement_11)
## [1] 5835 581
names(chl_achievement_11) %>%
head(20)
## [1] "IDCNTRY" "IDBOOK" "IDSCHOOL" "IDCLASS" "IDSTUD" "M032166"
## [7] "M032721" "M032757" "M032760A" "M032760B" "M032760C" "M032761"
## [13] "M032692" "M032626" "M032595" "M032673" "M052216" "M052231"
## [19] "M052061" "M052228"
The student achievement codebook must be loaded as well. The codebook contains information specific to each question asked. However, we do not want all the information in the codebook, so we will pare it down to only the columns which provide important information. These columns are: FIELD_NAME, FIELD_LABL, MEAS_CLASS, and COMMENT1. FIELD_NAME provides the id for each question asked, FIELD_LABL is a very brief summary of what each question asks, MEAS_CLASS provides the answer of each multiple choice question (‘M1’ for ‘A’, ‘M2’ for ‘B’, ‘M3’ for ‘C’, and ‘M4’ for ‘D’) or ‘SA’ for each free response question, and COMMENT1 provides the content and cognitive domain for each question. Student answers to each multiple choice question in the student achievement books are simply numbers (‘1’, ‘2’, ‘3’, ‘4’), so the ‘M’ should be removed from MEAS_CLASS so that the codebook’s answer key can be compared with the student answers once the two tables are joined. The COMMENT1 column contains two separate variables (‘content domain’ and ‘cognitive domain’) and should be split into two separate columns.
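The two string operations just described can be tried on a toy table before running them on the real codebook. This base R sketch uses made-up question ids and values, and it assumes the two domains in COMMENT1 are separated by a single backslash, which is what the regular expression in the code that follows splits on.

```r
# Toy codebook rows; ids and values are made up but follow the layout described above
toy_codebook <- data.frame(
  FIELD_NAME = c("M000001", "M000002"),
  MEAS_CLASS = c("M3", "SA"),
  COMMENT1   = c("Algebra\\Knowing", "Number\\Reasoning"),  # "\\" is one backslash
  stringsAsFactors = FALSE
)

# Drop the leading 'M' so that 'M3' becomes '3', matching the student answers
toy_codebook$MEAS_CLASS <- sub("^M", "", toy_codebook$MEAS_CLASS)

# Split COMMENT1 on the literal backslash into its two variables
parts <- strsplit(toy_codebook$COMMENT1, "\\", fixed = TRUE)
toy_codebook$content_domain   <- vapply(parts, `[`, character(1), 1)
toy_codebook$cognitive_domain <- vapply(parts, `[`, character(1), 2)
toy_codebook$COMMENT1 <- NULL

print(toy_codebook)
```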
library(readxl)
achievement_codebook_11 <- read_excel('bsatmsm5.xls') %>%
select(FIELD_NAME, FIELD_LABL, MEAS_CLASS, COMMENT1) %>%
mutate(MEAS_CLASS = str_replace(MEAS_CLASS, '^M', '')) %>%
separate(COMMENT1, into = c('content_domain', 'cognitive_domain'), sep = '\\\\') %>%
mutate(cognitive_domain = str_extract(cognitive_domain, '\\w+')) %>%
mutate(question_type = sapply(MEAS_CLASS, function(x){
if(str_detect(x, '\\d')){
'Multiple Choice'
} else if (str_detect(x, 'SA|DPC_D')){
'Free Response'
} else {
'Other'
}
}))
detach(package:readxl)
The student achievement books, in dataframe format, must be manipulated and cleaned before question level visualization can be conducted. The cleaning/manipulation process can be performed in 5 steps, represented below in 4 separate functions. These functions either reshape the dataframe or add new information derived from existing information.
First, only the math questions, the math benchmark scores, and the student, book, school, and gender identifiers are kept. These columns can be selected with dplyr’s non-standard evaluation and data manipulation capabilities as well as stringr’s regex capabilities. The questions should be combined into a single column and their values should be combined into another column so that each row has a single IDSTUD|question|answer combination - a process often referred to as going ‘from wide to long format’. This can be done by using tidyr’s reshaping capabilities. Finally, we want to get rid of all NA values in the answer column. NA values represent either questions that the student wasn’t given or questions that the student wasn’t able to respond to. We will see later that removing questions that a student was given, but did not answer, will not affect the analysis.
grab_math_questions <- function(df){
df %>%
#^M grabs all the math questions
#^BSM grabs all the benchmark scores
select_(.dots = names(.)[str_detect(names(.), '^M|IDSTUD|IDBOOK|IDSCHOOL|ITSEX|^BSM')]) %>%
gather_('question', 'answer', names(.)[str_detect(names(.), '^M')]) %>%
filter(!is.na(answer))}
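The underscore verbs `select_()` and `gather_()` used above are deprecated in current releases of dplyr and tidyr. The sketch below shows the same wide-to-long step on a toy table with `select(matches())` and `pivot_longer()`, assuming tidyr 1.0.0 or later. The question ids and answers are invented, and this is an alternative formulation rather than the code the article was run with.

```r
library(dplyr)
library(tidyr)

# Toy wide table: one row per student, one column per question (made-up ids)
toy_wide <- tibble(
  IDSTUD  = c(101, 102),
  IDBOOK  = c(1, 2),
  M000001 = c(3, NA),   # student 102 has no response recorded
  M000002 = c(1, 2),
  S000001 = c(4, 4)     # a science question, to be dropped
)

toy_long <- toy_wide %>%
  select(matches("^M|IDSTUD|IDBOOK", ignore.case = FALSE)) %>%  # ids and math questions
  pivot_longer(cols = matches("^M", ignore.case = FALSE),       # wide to long
               names_to = "question",
               values_to = "answer",
               values_drop_na = TRUE)   # same effect as filter(!is.na(answer))

print(toy_long)
```

`ignore.case = FALSE` is set because `matches()` is case-insensitive by default, while the `str_detect()` calls above are case-sensitive.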
The previous function removed all NA values, potentially deleting an IDSTUD|question|answer combination for a student who was given a question, but was unable to answer that question. The total number of students who were given a question can be derived by filtering for unique question|IDBOOK combinations, which in turn can be combined with unique IDSTUD|IDBOOK combinations to determine which questions were provided to which students. This can be done by using the powerful nest() and lapply() combination (the functions below use sapply(), which simplifies the result to a vector). After group_by(), nest() creates a list-column named data; each element of data is a dataframe holding the remaining columns for one unique value of the grouping column or columns. This data column, once created, can be iterated over with an anonymous function passed to lapply().
get_book_info <- function(df){
df %>%
group_by(IDBOOK) %>%
nest() %>%
mutate(
students_per_book = sapply(seq_along(.$data), function(i){
.$data[[i]] %>%
group_by(IDSTUD) %>%
count() %>%
nrow()
})) %>%
mutate(
students_per_book_female = sapply(seq_along(.$data), function(i){
.$data[[i]] %>%
group_by(IDSTUD, ITSEX) %>%
count() %>%
filter(ITSEX == 1) %>%
nrow()
})) %>%
mutate(
students_per_book_male = sapply(seq_along(.$data), function(i){
.$data[[i]] %>%
group_by(IDSTUD, ITSEX) %>%
count() %>%
filter(ITSEX == 2) %>%
nrow()
})) %>%
unnest()}
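The same per-book counts can also be computed without nesting. This is a hedged alternative sketch on invented data, using `n_distinct()` inside a grouped `mutate()`; it follows the function above in treating ITSEX 1 as female and 2 as male.

```r
library(dplyr)

# Toy long table: one row per student-question response (values made up)
toy_long <- tibble(
  IDBOOK = c(1, 1, 1, 1, 2, 2),
  IDSTUD = c(101, 101, 102, 103, 201, 202),
  ITSEX  = c(1, 1, 2, 1, 2, 2)    # 1 = female, 2 = male, as in the function above
)

toy_books <- toy_long %>%
  group_by(IDBOOK) %>%
  mutate(
    students_per_book        = n_distinct(IDSTUD),
    students_per_book_female = n_distinct(IDSTUD[ITSEX == 1]),
    students_per_book_male   = n_distinct(IDSTUD[ITSEX == 2])
  ) %>%
  ungroup()

print(toy_books)
```

Counting distinct student ids rather than rows matters because each student appears once per question in the long table.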
The correct answer to each multiple choice question is stored in the MEAS_CLASS column within the achievement_codebook_11 dataframe created earlier. The two dataframes can be combined by using dplyr’s table joining capabilities. New information joined into the dataframe should be cleaned for analysis.
combine_datasets <- function(df, cdbook){
df %>%
left_join(cdbook, by = c('question' = 'FIELD_NAME')) %>%
mutate(
student_gave_correct_answer = sapply(seq_along(.$MEAS_CLASS), function(i){
if(.$MEAS_CLASS[i] %in% 1:4){
.$MEAS_CLASS[i] == .$answer[i]
} else {
str_detect(.$answer[i], '^[12]')
}})) %>%
mutate(
correct_answer_derived_from_labl = FIELD_LABL %>%
str_extract( '\\(\\d\\)|\\(\\w\\)')
) %>%
mutate(
FIELD_LABL = FIELD_LABL %>%
str_replace(' \\(\\d\\)|\\(\\w\\)', '')
)}
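The scoring rule inside `combine_datasets()` can be isolated on a toy table. In this base R sketch the data frame and its values are invented; the rule that a free-response code beginning with 1 or 2 counts as correct is copied from the function above and has not been checked against the TIMSS scoring guide.

```r
# Toy responses joined to their answer keys (values made up).
# MEAS_CLASS holds '1'-'4' for multiple choice keys and 'SA' for free response.
toy_joined <- data.frame(
  MEAS_CLASS = c("3", "3", "SA", "SA", "SA"),
  answer     = c(3, 1, 10, 21, 79),
  stringsAsFactors = FALSE
)

toy_joined$student_gave_correct_answer <- ifelse(
  toy_joined$MEAS_CLASS %in% as.character(1:4),
  toy_joined$MEAS_CLASS == as.character(toy_joined$answer),  # multiple choice: match the key
  grepl("^[12]", as.character(toy_joined$answer))            # free response: code starts with 1 or 2
)

print(toy_joined)
```

Because `ifelse()` is vectorized, the whole column is scored in one call instead of looping over row indices.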
Question-level counts and correct ratios can be derived with the same nest() and lapply() combination used in function #2.
get_question_info <- function(df){
df %>%
group_by(question) %>%
nest() %>%
mutate(
students_per_question = sapply(seq_along(.$data), function(i){
.$data[[i]] %>%
group_by(IDBOOK, students_per_book) %>%
count() %>%
.$students_per_book %>%
sum()
})) %>%
mutate(
students_per_question_female = sapply(seq_along(.$data), function(i){
.$data[[i]] %>%
group_by(IDBOOK, students_per_book_female) %>%
count() %>%
.$students_per_book_female %>%
sum()
})) %>%
mutate(
students_per_question_male = sapply(seq_along(.$data), function(i){
.$data[[i]] %>%
group_by(IDBOOK, students_per_book_male) %>%
count() %>%
.$students_per_book_male %>%
sum()
})) %>%
mutate(
correct_ratio_per_question = sapply(seq_along(.$data), function(i){
tot_students = .$students_per_question[i]
.$data[[i]] %>%
filter(student_gave_correct_answer) %>%
nrow()/tot_students
})) %>%
mutate(
correct_ratio_per_question_female = sapply(seq_along(.$data), function(i){
tot_female = .$students_per_question_female[i]
.$data[[i]] %>%
filter(student_gave_correct_answer, ITSEX == 1) %>%
nrow()/ tot_female
})) %>%
mutate(
correct_ratio_per_question_male = sapply(seq_along(.$data), function(i){
tot_male = .$students_per_question_male[i]
.$data[[i]] %>%
filter(student_gave_correct_answer, ITSEX == 2) %>%
nrow()/ tot_male
})) %>%
unnest()}
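The arithmetic behind `correct_ratio_per_question` is worth spelling out, because the denominator is what makes the earlier removal of unanswered questions harmless. In this sketch all numbers are invented: a question appears in two books, and every student who was handed either book counts toward the denominator whether or not an answer was recorded.

```r
# Toy example: a question that appears in two books (all numbers made up)
students_per_book <- c(book_1 = 30, book_2 = 28)   # students handed each book
correct_responses <- 35                            # correct answers across both books

# Denominator: every student who was given the question, answered or not
students_per_question <- sum(students_per_book)
correct_ratio <- correct_responses / students_per_question

round(correct_ratio, 3)
```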
Now that the functions used to clean the data are created, they can be used to actually clean the data. To prevent needless typing, all 4 functions can be contained in a wrapper function.
clean_math <- function(df, cdbook){
df %>%
grab_math_questions() %>%
get_book_info() %>%
combine_datasets(cdbook) %>%
get_question_info()}
chl_timss_math_11 <- clean_math(chl_achievement_11, achievement_codebook_11)
ltu_timss_math_11 <- clean_math(ltu_achievement_11, achievement_codebook_11)
kor_timss_math_11 <- clean_math(kor_achievement_11, achievement_codebook_11)
usa_timss_math_11 <- clean_math(usa_achievement_11, achievement_codebook_11)
The cleaned data can be used in further projects and should be stored in separate files.
write_csv(chl_timss_math_11, '2011_TIMSS_CHILE.csv')
write_csv(ltu_timss_math_11, '2011_TIMSS_LITHUANIA.csv')
write_csv(kor_timss_math_11, '2011_TIMSS_KOREA.csv')
write_csv(usa_timss_math_11, '2011_TIMSS_USA.csv')
A separate smaller dataframe will be created specifically for graphing. It will contain only data useful for graphing with ggplot(). The goal of this article is to visualize the typical response provided by students of each country to each question asked. This means that each graphic should show:
A typical measurement will be the correct ratio which is the number of students in a country who correctly responded to a question divided by the total number of students. Gender variations of the correct ratio will also be used.
Another typical measurement will be the question rank which is a ranking of questions by how many students in a country were able to successfully answer the question. A question rank of 1 will represent the question with the highest correct ratio.
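The question rank can be illustrated with base R’s `rank()`. The question ids and ratios below are invented, and ties are broken by order of appearance, which is an assumption for this sketch.

```r
# Toy correct ratios for four questions in one country (ids and values made up)
toy_ratios <- data.frame(
  question = c("M000001", "M000002", "M000003", "M000004"),
  correct_ratio_per_question = c(0.42, 0.81, 0.15, 0.67),
  stringsAsFactors = FALSE
)

# Rank 1 goes to the question with the highest correct ratio
toy_ratios$question_rank <- rank(-toy_ratios$correct_ratio_per_question,
                                 ties.method = "first")

print(toy_ratios[order(toy_ratios$question_rank), ])
```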
basic_graph_setup <- function(math_df, country){
math_df %>%
group_by(question, students_per_question, correct_ratio_per_question, correct_ratio_per_question_female,
correct_ratio_per_question_male, FIELD_LABL, content_domain, cognitive_domain, question_type) %>%
nest() %>%
arrange(desc(correct_ratio_per_question)) %>%
mutate(question_rank = 1:nrow(.)) %>%
mutate(country = country) %>%
mutate(diff_male_female = correct_ratio_per_question_male - correct_ratio_per_question_female) %>%
mutate(dominant_gender = sapply(.$diff_male_female, function(x){
if(x >0) {
'Male'
} else if(x<0){
'Female'
} else {
'Tie'
}
}))
}
chl_basic_graph_info <- basic_graph_setup(chl_timss_math_11, 'Chile')
ltu_basic_graph_info <- basic_graph_setup(ltu_timss_math_11, 'Lithuania')
kor_basic_graph_info <- basic_graph_setup(kor_timss_math_11, 'Korea')
usa_basic_graph_info <- basic_graph_setup(usa_timss_math_11, 'USA')
international_basic_graph_info <- rbind(chl_basic_graph_info, ltu_basic_graph_info, kor_basic_graph_info, usa_basic_graph_info)
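The if/else inside `basic_graph_setup()` labels each question by the sign of the male-minus-female difference. A vectorized base R sketch of the same labelling, on made-up differences, looks like this:

```r
# Made-up differences: male correct ratio minus female correct ratio
diff_male_female <- c(0.08, -0.03, 0)

# sign() returns -1, 0 or 1; adding 2 turns that into an index of 1, 2 or 3
dominant_gender <- c("Female", "Tie", "Male")[sign(diff_male_female) + 2]

print(dominant_gender)
```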
A standard graph is the bar plot. A bar plot that compares the male correct ratio to the female correct ratio can be made with geom_col().
ggplot(international_basic_graph_info, aes(question_rank, diff_male_female)) +
geom_col(aes(fill = dominant_gender)) +
scale_y_continuous(limits = c(-.15, .15)) +
facet_wrap(~country) +
labs(x = 'Questions Ranked From Easiest to Hardest According to Country',
y = 'Males Correct Ratio - Females Correct Ratio',
fill = 'Gender',
title = 'Gender Comparison Within Countries',
subtitle = 'Bar Plot')
It seems that males in Chile have a lot more success than females in math, while females in Lithuania have a lot more success than males in math. Males in both Korea and the USA have more success than females in math, but the difference isn’t as distinguishable as it is in Chile.
Another standard graph is the scatter plot. A scatter plot that compares the male correct ratio to the female correct ratio can be made with geom_point(); however, a more interesting version of the scatter plot can be made with geom_text(). A text plot is the same as a scatter plot, but with the points replaced with text. In the following example, points are replaced with each question’s id.
ggplot(international_basic_graph_info, aes(correct_ratio_per_question, diff_male_female)) +
geom_text(aes(label = question, color = dominant_gender)) +
facet_wrap(~country) +
labs(x = 'n Students Who Correctly Answered Question / Total Students',
y = 'Males Correct Ratio - Females Correct Ratio',
color = 'Gender',
title = 'Gender Comparison Within Countries',
subtitle = 'Scatter Plot')
Unfortunately, the graph is overly cluttered and there is not much gained by plotting the actual question id. However, the scatter plot format does show potentially interesting differences between countries.
It is worth exploring the scatter plot format further, and it may even be worth combining the scatter plot with a density plot (if a bar plot shows a normal distribution, then it can also be represented as a density plot).
The gridExtra package provides the ability to combine multiple graphs into one. A combination of scatter plot and density plots might be useful.3
get_legend<-function(myggplot){
tmp <- ggplot_gtable(ggplot_build(myggplot))
leg <- which(sapply(tmp$grobs, function(x) x$name) == "guide-box")
legend <- tmp$grobs[[leg]]
return(legend)
}
combomain <- ggplot(international_basic_graph_info, aes(correct_ratio_per_question, diff_male_female)) +
geom_hline(yintercept = 0) +
geom_vline(xintercept = .5) +
geom_point(aes(color = country), size = 3, alpha = 1/3) +
labs(x = 'n Students Who Correctly Answered Question / Total Students',
y = 'Males Correct Ratio - Females Correct Ratio',
color = NULL)
combolegend <- get_legend(combomain)
combomain <- combomain +
theme(legend.position = 'none')
blank_graph <- list(
labs(x = NULL, y = NULL, color = NULL),
theme(legend.position = 'none',
axis.title.x = element_blank(),
axis.title.y = element_blank(),
axis.text.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks = element_blank(),
axis.line = element_blank(),
plot.background = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
panel.background = element_blank())
)
combotop = ggplot(international_basic_graph_info, aes(correct_ratio_per_question, color = country, fill = country)) +
geom_density(alpha = .25) +
labs(title = 'Gender and Correct Ratio Comparison of All Countries',
subtitle = 'Scatter and Density Combination Plot') +
blank_graph
comboright = ggplot(international_basic_graph_info, aes(diff_male_female, color = country, fill = country)) +
geom_density(alpha = .25) +
coord_flip() +
blank_graph
library(gridExtra, warn.conflicts = F)
grid.arrange(combotop, combolegend, combomain, comboright,
ncol=2, nrow=2, widths=c(4, 1.4), heights=c(1.4, 4))
detach(package:gridExtra)
The plot shows stark differences between Korea and Chile with regard to correct ratio, while Lithuania clearly stands out as female dominant with regard to the difference in gendered correct ratios.
The static scatter plot is nice, but it would be better if we could compare questions on an individual level. This graph will be made into an interactive graph to display question specific information.
Question domain information can also be shown in bar plot format.
ggplot(international_basic_graph_info, aes(question_rank, correct_ratio_per_question)) +
geom_col(aes(fill = content_domain)) +
scale_y_continuous(limits = c(0,1 )) +
facet_wrap(~country) +
labs(x = 'Questions Ranked From Easiest to Hardest According to Country',
y = 'n Students Who Correctly Answered Question / Total Students',
fill = 'Content Domain',
title = 'Content Domain Comparison of Countries',
subtitle = 'Barplot')
The bar plot does not show any clear pattern with regards to content domain.
ggplot(international_basic_graph_info, aes(question_rank, correct_ratio_per_question)) +
geom_col(aes(fill = cognitive_domain)) +
scale_y_continuous(limits = c(0,1)) +
facet_wrap(~country) +
labs(x = 'Questions Ranked From Easiest to Hardest According to Country',
y = 'n Students Who Correctly Answered Question / Total Students',
fill = 'Cognitive Domain',
title = 'Cognitive Domain Comparison of Countries',
subtitle = 'Barplot')
The bar plot suggests that most countries succeed in the Knowing cognitive domain. The bar plot also suggests that Korea, while still successful relative to other countries in all domains, is less successful in the Reasoning cognitive domain.
However, sometimes a simple box plot is the most informative.
international_domain <- international_basic_graph_info %>%
group_by(question, cognitive_domain, content_domain) %>%
nest() %>%
gather(type, domain, -c(question, data)) %>%
unnest()
ggplot(international_domain, aes(domain, correct_ratio_per_question)) +
geom_boxplot(aes(color = country)) +
geom_hline(yintercept = 0.5) +
labs(x = 'Domain',
y = 'n Students Who Correctly Answered Question / Total Students',
color = 'Country',
title ='TIMSS: Correct Response Rate Per Domain',
subtitle = 'Boxplot') +
theme(axis.text.x = element_text(angle = -45, hjust = 0))
It turns out every country’s worst domain is Reasoning and every country favors the Data and Chance and Knowing domains.
A gendered version of these box plots is also worth a look.
ggplot(international_domain, aes(domain, diff_male_female)) +
geom_boxplot(aes(color = country)) +
geom_hline(yintercept = 0) +
labs(x = 'Domain',
y = 'Males Correct Ratio - Females Correct Ratio',
color = 'Country',
title = 'TIMSS: Comparison of Male and Female Response Per Domain',
subtitle = 'Boxplot') +
theme(axis.text.x = element_text(angle = -45, hjust = 0))
Females seem to excel in the Algebra and Knowing domains while males seem to excel in the Data and Chance and Number domains.
A scatter plot seems to be the most effective at showing information about individual questions.
ggplot(international_basic_graph_info, aes(correct_ratio_per_question, diff_male_female)) +
geom_point(aes(color = content_domain, shape = cognitive_domain), position = 'jitter') +
facet_wrap(~country) +
labs(x = 'n Students Who Correctly Answered Question / Total Students',
y = 'Males Correct Ratio - Females Correct Ratio',
color = 'Content Domain',
shape = 'Cognitive Domain',
title = 'Gender Comparison Within Countries',
subtitle = 'Scatter Plot')
Korea performs uncharacteristically poorly on three questions, all of which relate to the Data and Chance domain.
The static scatter plot is nice, but it would be better if we could compare questions on an individual level. This graph will be made into an interactive graph to display question specific information.
The problem with trying to graph TIMSS at the question level is that more than 200 questions are given to each of 4 countries, meaning that more than 800 points must be plotted. Comparing how each country performed on a specific question is nearly impossible because the points in the plot would need a different representation for each of the more than 200 questions. Text doesn’t work because plotted text becomes illegible when more than two labels overlap.
The best solution for comparing individual questions across countries is a network graph. A network graph connects nodes of similar information to each other. If the nodes of each country are ordered by rank, then the network graph can help show the differences in country responses to any particular question. A package optimized for network graphs is igraph; however, it requires learning data structures specific to network graphs. The network graph below is a simple one that can be made using ggplot() without learning any new data structures.4
network_graph_setup <- function(basic_graph, str_abr, link1, link2, from_top){
basic_graph %>%
select(-data) %>%
mutate(id = str_c(str_abr, '_', question)) %>%
mutate(link1 = link1) %>%
mutate(link2 = link2) %>%
mutate(order = from_top)
}
chl_network <- network_graph_setup(chl_basic_graph_info, 'chl', 'chl_ltu', 'bottom', 4)
ltu_network <- network_graph_setup(ltu_basic_graph_info, 'ltu', 'ltu_usa', 'chl_ltu', 3)
usa_network <- network_graph_setup(usa_basic_graph_info, 'usa', 'usa_kor', 'ltu_usa', 2)
kor_network <- network_graph_setup(kor_basic_graph_info, 'kor', 'top', 'usa_kor', 1)
international_nodes <- rbind(chl_network, ltu_network, usa_network, kor_network) %>%
group_by(country, link1, link2) %>%
nest() %>%
gather(link_type, link, -c(data, country)) %>%
unnest() %>%
mutate(link = str_c(question, '_', link)) %>%
arrange(order, question_rank) %>%
mutate(country = factor(country, levels = unique(country), ordered = T)) %>%
mutate(FIELD_LABL = str_to_title(FIELD_LABL)) %>%
mutate(content_domain = str_to_title(content_domain)) %>%
mutate(cognitive_domain = str_to_title(cognitive_domain))
ggplot(international_nodes, aes(question_rank, country, color = correct_ratio_per_question)) +
geom_line(aes(group = as.factor(link)), color = 'grey80') +
geom_point(size = 2) +
scale_color_gradient2(high = 'forestgreen', mid = 'yellow', low = 'saddlebrown', midpoint = .5) +
labs(x = 'Question Rank: From easiest to hardest',
y = 'Country',
color = 'Question Correct Ratio',
title = 'TIMSS: Comparison of Country Correct Response Per Question',
subtitle = 'Bi-Partite Network Graph')
This network graph shows the correct ratio and question rank of each question, and links the same question across the different countries.
The network graph is nice, but it is difficult to follow the links between nodes. This graph will be made into an interactive graph to display question specific information.
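Because the links are hard to follow by eye, the comparison they encode can also be made numerically. The sketch below pairs up the same question in two countries and computes how far its rank shifts. The data frames, question ids, and ranks are invented for illustration and are not taken from the TIMSS results.

```r
# Toy question ranks for two countries (ids and ranks made up)
kor <- data.frame(question = c("M000001", "M000002", "M000003"),
                  question_rank = c(1, 2, 3), stringsAsFactors = FALSE)
chl <- data.frame(question = c("M000001", "M000002", "M000003"),
                  question_rank = c(3, 1, 2), stringsAsFactors = FALSE)

# Pair up the same question in both countries, as the grey links do in the graph
links <- merge(kor, chl, by = "question", suffixes = c("_kor", "_chl"))

# A large absolute shift marks a question that is relatively easy in one
# country and relatively hard in the other
links$rank_shift <- links$question_rank_chl - links$question_rank_kor

print(links)
```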
The principal purpose of data visualization is to help people understand data. TIMSS is an assessment that provides a rich dataset. It is difficult to understand such a rich dataset by looking at numbers alone. This article only analyzed a small portion of the data available from TIMSS. The only dataset analyzed was the student achievement dataset. TIMSS offers other datasets that can combine the student-level data provided by the student achievement dataset with student background information, school background information, and even teacher background information. The already rich dataset can be made richer, meaning that even more information can be derived at the question level. If more information can be derived at the question level, and better visualizations are created, then policy makers and educators can make better decisions with regard to education reform.
Further work will be dedicated to enriching the question-level data and visualization with student, school, and teacher background information.
Admittedly, the static graphs provided in this article leave much to be desired. We can see differences in how countries and genders perform on individual questions; however, we cannot see information about the actual question. That is to say, we can see question rank, question domain, and how a country performs on a question, but we cannot actually see the question id or the question summary. The attempted text-scatter plot failed because there was too much overlap.
A better alternative to these static graphs is a set of interactive graphs. This article identified three static graphs that make ideal candidates for informative interactive graphs: the two scatter plots and the network graph. Creation of these three interactive graphs will hopefully confirm that the TIMSS data can be effectively visualized and analyzed at the question level.
Learn more about tidy data by reading Wickham’s paper “Tidy Data”.
The code below was inspired by the STHDA article “ggplot2 - Easy way to mix multiple graphs on the same page - R software and data visualization”.
An awesome resource for learning igraph is Katherine Ognyanova’s tutorial “Network Analysis and Visualization with R and igraph”.