Gathering many variables at once in R using a for loop

The set up:
  1. Many of us use the R package tidyverse for our data cleaning.
  2. Many of us also run within-subjects studies where people potentially rate each stimulus on a large number of things.
  3. The gather function only lets you gather one type of rating at a time, across stimuli.
  4. What if you have a bajillion ratings that all need to be gathered?
The set up again but using pictures:
I run a study where people rate 4 faces, but there are 3 ratings per face.
The CSV looks like this:
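
To make that wide layout concrete, here is a minimal sketch that builds a made-up stand-in dataset. It assumes three faces (to match the code that follows) and a particular column order: partID first, then the three ratings for face1, then face2, and so on. The values are random numbers, not data from the study.

```r
library(tibble)

# Hypothetical stimulus and rating names, following the DVname_stimuliIndex pattern
faces   <- paste0("face", 1:3)
ratings <- paste0("rating", 1:3)

# rating1_face1, rating2_face1, rating3_face1, rating1_face2, ...
wide_names <- paste(rep(ratings, times = length(faces)),
                    rep(faces, each = length(ratings)),
                    sep = "_")

# 4 made-up participants with random 1-7 responses
set.seed(42)
scores <- matrix(sample(1:7, size = 4 * length(wide_names), replace = TRUE),
                 nrow = 4, dimnames = list(NULL, wide_names))

# One row per participant, one column per rating-by-face combination
data <- as_tibble(cbind(partID = 1:4, scores))
print(data)
```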

I would typically resolve this in this manner using tidyverse gather.
In this case, the data is first loaded into a tibble, data.
We select the column of the rating we want to gather, and we gather it and save it into a temporary dataframe.
Once we've finished, we then inner join all these tiny temporary dataframes into one.

data1 <- data %>%
  select(rating1_face1, rating1_face2, rating1_face3, partID) %>%
  gather(., key = "face", value = "rating1", rating1_face1, rating1_face2, rating1_face3, -partID) %>%
  mutate(face = gsub("rating1_", "", face))

data2 <- data %>%
  select(rating2_face1, rating2_face2, rating2_face3, partID) %>%
  gather(key = face, value = rating2, rating2_face1, rating2_face2, rating2_face3, -partID) %>%
  mutate(face = gsub("rating2_", "", face))

data3 <- data %>%
  select(rating3_face1, rating3_face2, rating3_face3, partID) %>%
  gather(key = face, value = rating3, rating3_face1, rating3_face2, rating3_face3, -partID) %>%
  mutate(face = gsub("rating3_", "", face))

data_long <- data1 %>%
  inner_join(., data2) %>%
  inner_join(., data3)

The final looks like this:

With 3 ratings, this is not too bad to do, but recently I had 46 ratings for each of 4 stimulus objects.

To solve this, I wrote a for loop (read my notes!).
This really does require that your column names follow a very specific pattern: DVname_stimuliIndex.

So in this case, I have rating1_face1. You could have attractive_stim1, attractive_stim2, attractive_stim3, etc. Please make sure your variables are named consistently in this pattern before trying it this way.

#I make an empty dataset with the 2 variables I know I will be joining with, participant ID and facetype
##facenames are just my names of my stimuli, but you could substitute whatever stimuli you have
facenames <- c("face1", "face2", "face3")
##this is just the participant IDs
numbers <- data$partID
#I am making a blank dataset with the participant ID repeated 3 times b/c I happen to have 3 stimuli, you might have more or less so modify the 3 to be however many stimuli you have!
#the second variable I'm creating below, face, should just be the same for you (you should not have to modify that)
data_long <- tibble(partID = as.numeric(rep(numbers, 3)),
                    face = as.character(rep(facenames, each = nrow(data))))

As the comments say, this creates a skeleton dataset, data_long, with a column of participant IDs and a column of the within-subjects variable, so that every participant ID is paired with every stimulus.
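
If you'd rather not hard-code the number of stimuli, here is a small sketch of two ways to build the same skeleton. The participant IDs are made up, and expand_grid (from tidyr) is swapped in as an alternative to rep; it is not what the code above uses.

```r
library(tibble)
library(tidyr)

# Hypothetical stand-ins for data$partID and the stimulus names
numbers   <- c(101, 102, 103, 104)
facenames <- c("face1", "face2", "face3")

# Same idea as the skeleton above, but the repeat counts are computed
skeleton_rep <- tibble(
  partID = as.numeric(rep(numbers, times = length(facenames))),
  face   = rep(facenames, each = length(numbers))
)

# expand_grid() builds every face x partID combination directly;
# the first argument varies slowest, so the row order matches skeleton_rep
skeleton_grid <- expand_grid(face = facenames, partID = numbers)

print(skeleton_rep)
print(skeleton_grid)
```

Either way you should get one row per participant per stimulus (here 4 x 3 = 12 rows).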

Looking at data_long, this is what it looks like for now, just 2 columns:

Then, we store a list of variable names. To fit this to your data, you would change two things. First, use the column indexes that you're interested in (here I'm using columns 2 through 4). Second, the pattern in the gsub here is "_face1", which matches my naming pattern, but you probably have a different stimulus index, so you'd substitute that (e.g., "_stim1", "_couch1", "_vignette1"). What we end up with is a set of generic variable names (I'll show you the output of printing this list below the code).

#create a list of variable names
#the reason I do it this way below is that you can update the column indexes to be the range of the variable names, just for face1, so you don't manually write every variable out.
variablelist <- gsub("_face1", "", colnames(data[2:4]))


If we print variablelist, here's what we end up with. You can imagine that if your variables are named "attractive_face1", "healthy_face1", and "young_face1", your variablelist would be "attractive", "healthy", and "young".
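
As a quick check of that step, here is a small sketch using those hypothetical column names. It uses sub with a "$" anchor so the pattern only matches at the end of a name; that is a slight tightening for illustration, not what the gsub line above does.

```r
# Hypothetical column names for the first stimulus only
face1_columns <- c("attractive_face1", "healthy_face1", "young_face1")

# Strip the stimulus suffix; "$" anchors the match to the end of each name
variablelist <- sub("_face1$", "", face1_columns)

print(variablelist)
```

Printing it should list "attractive", "healthy", and "young".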

Here's the actual for loop:
It iterates through each variable in variablelist and selects all the columns whose names contain that variable. This means that in the first loop, variable is "rating1", so "rating1_face1", "rating1_face2", and "rating1_face3" are all selected, along with participant ID. Then, gather is called to put these variables in long form. This is saved into a mini temporary dataset called temp.

temp will have 3 columns: a participant ID column; a "key" column that tells you which stimulus each rating is for (I've named it "face", but you can name it "stim" if you'd like by modifying the key argument); and a value column holding the ratings, which gets renamed to the variable's name further down in the loop.

#for loop will iterate through each variable
for (variable in variablelist){
  #automatically selects that variable using the name of that variable, for every face
  #and saves the gathered result to a temporary dataset, temp
  temp <- data %>%
    select(., which(grepl(variable, colnames(data))==TRUE), partID) %>%
    #gathers that variable, the key I've named face but you can decide based on your stimuli type
    gather(., key = "face", value = variable, 1:ncol(.), -partID, na.rm = TRUE)

  #here I'm just fixing my within subject column name to remove the part that is like "rating1_" so we are just left with the stimuli index (face1, face2, face3)
  name <- paste0(variable, "_")
  temp <- temp %>%
    mutate(face = gsub(name,"", temp$face))

  #here I'm just renaming the ratings column
  colnames(temp)[3] <- variable

  #join temp with the empty data.frame you created earlier
  data_long <- full_join(temp, data_long, by = c("partID", "face"))

  rm(temp)
}
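
One thing to watch for with numbered variable names: grepl matches substrings, so "rating1" also matches "rating10_face1". Here is a minimal sketch of the difference, using made-up columns. The starts_with helper with the underscore included is a suggested tightening, not part of the loop above.

```r
library(dplyr)

# Hypothetical columns: once ratings are numbered past 9, "rating1" is
# also a substring of "rating10"
data <- tibble(
  partID = c(1, 2),
  rating1_face1 = c(5, 3),
  rating1_face2 = c(6, 2),
  rating10_face1 = c(1, 1),
  rating10_face2 = c(7, 4)
)

variable <- "rating1"

# Substring match, as in the loop: picks up the rating10 columns too
loose <- colnames(data)[grepl(variable, colnames(data))]

# Prefix match that includes the underscore: only the rating1 columns
strict <- data %>%
  select(starts_with(paste0(variable, "_"))) %>%
  colnames()

print(loose)
print(strict)
```

If your variable names are words like "attractive" and "healthy" and none is contained in another, the substring match is fine as written.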

This is what the finished thing looks like, the same as the other method, right?

Here's the entire code from the beginning:

#I make an empty dataset with the 2 variables I know I will be joining with, participant ID and facetype

#facenames are just my names of my stimuli, but you could substitute whatever stimuli you have
facenames <- c("face1", "face2", "face3")

#this is just the participant IDs
numbers <- data$partID

#I am making a blank dataset with the participant ID repeated 3 times b/c I happen to have 3 stimuli, you might have more or less so modify the 3 to be however many stimuli you have!

#the second variable I'm creating below, face, should just be the same for you (you should not have to modify that)
data_long <- tibble(partID = as.numeric(rep(numbers, 3)),
                    face = as.character(rep(facenames, each = nrow(data))))

#create a list of variable names
#the reason I do it this way below is that you can update the column indexes to be the range of the variable names, just for face1, so you don't manually write every variable out.
variablelist <- gsub("_face1", "", colnames(data[2:4]))

#this will return you a list of "rating1", "rating2", "rating3"

#for loop will iterate through each variable
for (variable in variablelist){
  #automatically selects that variable using the name of that variable, for every face
  #and saves the gathered result to a temporary dataset, temp
  temp <- data %>%
    select(., which(grepl(variable, colnames(data))==TRUE), partID) %>%
    #gathers that variable, the key I've named face but you can decide based on your stimuli type
    gather(., key = "face", value = variable, 1:ncol(.), -partID, na.rm = TRUE)

  #here I'm just fixing my within subject column name to remove the part that is like "rating1_" so we are just left with the stimuli index (face1, face2, face3)
  name <- paste0(variable, "_")
  temp <- temp %>%
    mutate(face = gsub(name,"", temp$face))

  #here I'm just renaming the ratings column
  colnames(temp)[3] <- variable

  #join temp with the empty data.frame you created earlier
  data_long <- full_join(temp, data_long, by = c("partID", "face"))

  rm(temp)
}
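
For what it's worth, newer versions of tidyr (1.0.0 and later) include pivot_longer, which can do this whole reshape in one call when the names follow the DVname_stimuliIndex pattern. This is a different technique from the for loop above, shown only as a sketch: the toy data is made up, and it assumes each column name contains exactly one underscore.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: 2 participants, 2 ratings, 2 faces
data <- tibble(
  partID = c(1, 2),
  rating1_face1 = c(5, 3), rating2_face1 = c(4, 4),
  rating1_face2 = c(6, 2), rating2_face2 = c(1, 7)
)

# ".value" means the part before the underscore becomes the output column
# name (rating1, rating2); the part after it goes into the face column
data_long <- data %>%
  pivot_longer(
    cols = -partID,
    names_to = c(".value", "face"),
    names_sep = "_"
  )

print(data_long)
```

The result has one row per participant per face, with one column for each rating.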

Comments

  1. Hi Iris -- this is such a common coding puzzle and I'm so glad you posted!

    Here's another solution!

    # packages
    library(tidyverse)

    # data
    # set random number seed
    set.seed(1223)

    # sample responses ranging from 1-7
    data1 <- matrix(sample(1:7, size = 1200, replace = TRUE), nrow = 100, ncol = 12)

    # 4 faces, 3 ratings
    colnames(data1) <- paste0(rep("rate", 3), rep(1:3, 4), "_", rep("face", 4), rep(1:4, each = 3))

    # convert to tibble
    data1 <- as_tibble(data1)

    # add id and put it in front (i.e., first column); print the tibble
    data1$id <- factor(1:100)
    (data1 <- select(data1, id, everything()))

    # restructure
    # idea here is to create a new rating column that identifies the rating used in the variable
    # spread using that rating column
    ldata1 <- gather(data1, key = variable, value = response, rate1_face1:rate3_face4) %>%
      mutate(ratingid = parse_number(str_sub(variable, start = 1, end = 7)),
             faceid = parse_number(str_sub(variable, start = nchar(variable) - 3, end = nchar(variable))),
             rating = paste0("rate", ratingid),
             face = paste0("face", faceid)) %>%
      select(-c(variable, ratingid)) %>%
      spread(key = rating, value = response)

    # print result
    ldata1
