Automatically coding your text files for concreteness

The rundown:
I recently had to code a decent number of .txt files for concreteness. My partner, John, and I wrote this R code to do it.

Is it the most beautiful thing in the world? Nope. It's the result of a collaboration between a computer science person who only knows base R (my partner) and me, who loves the tidyverse and all it does. Yes, in some places it could be more tidy :)

Does it get the job done? Well, it worked for me!

Will it get your job done? Maybe, but you will likely need to tweak it to do what you need and to solve the problems that come up for you. For instance, I've set words that don't appear in the lexicon to 0s, but you may choose NAs. You will likely need a different function/package combo if you have Word files. And if all your text is in one dataframe with each row as a participant, you will likely need to rewrite the for loop to go by row instead of by file (as it is written now).
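If your text is already in one dataframe with a row per participant, a rough sketch of the by-row version might look like this. The toy `lexicon` and `responses` dataframes, the helper name `score_text`, and the scores are all made up for illustration (they are not real norms); the `missing` argument is where you'd pick 0 or NA for words that aren't in the lexicon.

```r
# Toy stand-ins, made up for illustration (NOT real concreteness norms)
lexicon <- data.frame(word = c("lamb", "snow", "little"),
                      Conc.M = c(4.9, 4.8, 2.5),
                      stringsAsFactors = FALSE)
responses <- data.frame(subjid = c("p01", "p02"),
                        text = c("a little lamb", "white as snow"),
                        stringsAsFactors = FALSE)

# Hypothetical helper: score one string of text.
# Words not found in the lexicon get `missing` (NA here, 0 if you prefer).
score_text <- function(text, lexicon, missing = NA_real_) {
  words <- unlist(strsplit(text, "\\s+"))
  words <- words[words != ""]
  scores <- lexicon$Conc.M[match(words, lexicon$word)]
  scores[is.na(scores)] <- missing
  scores
}

# One mean per row (participant) instead of one per file
responses$meanscore <- vapply(responses$text, function(txt) {
  mean(score_text(txt, lexicon), na.rm = TRUE)
}, numeric(1), USE.NAMES = FALSE)
```

With `missing = NA_real_` the mean only covers words that were found; with `missing = 0` it matches what the script below does.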

Note: I've not removed the apostrophes b/c you may run into issues w/ contractions that look like words after removing apostrophes (e.g., she'll becomes shell, which you don't want to score). Isn't English stupid?
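To see the problem concretely, here's a tiny base R demo that compares stripping all punctuation with the apostrophe-keeping pattern used in the script. The example vector `x` is just for illustration.

```r
x <- c("she'll", "lamb,", "snow;")

# Strip every punctuation character, apostrophes included
stripped_all <- gsub("[[:punct:]]", "", x)
# "she'll" has turned into "shell", which would then get scored as a word

# Strip punctuation but keep apostrophes
# (the (?!') negative lookahead needs perl = TRUE)
kept_apostrophes <- gsub("(?!')[[:punct:]]", "", x, perl = TRUE)
# "she'll" is left alone; the comma and semicolon are gone
```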

You will need:
1. A concreteness dictionary. This has the actual concreteness scores of 40,000 words (Brysbaert, M., Warriner, A.B., & Kuperman, V., 2014).

2. A folder where all your text files are saved (a working directory)

3. Before running my code: I also recommend capitalizing proper nouns. And if for some reason your contractions don't have apostrophes, you'll have to use human eyes to check the context in which each word is used and add the apostrophes back.
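If you suspect your text has lost its apostrophes, a small helper can at least point your human eyes at the right lines. This is only a sketch: the function name `flag_lines` is hypothetical, and the `ambiguous` list is a short, incomplete illustration that you would need to extend yourself.

```r
# Hypothetical, incomplete list of real words that are also
# contractions minus the apostrophe (she'll, we'll, he'll, I'll, we'd)
ambiguous <- c("shell", "well", "hell", "ill", "wed")

# Return the lines that contain one of those words, for manual checking
flag_lines <- function(lines, ambiguous) {
  # lowercase and drop punctuation except apostrophes, so "she'll" is NOT flagged
  cleaned <- gsub("(?!')[[:punct:]]", "", tolower(lines), perl = TRUE)
  hits <- lapply(strsplit(cleaned, "\\s+"), function(w) intersect(w, ambiguous))
  keep <- lengths(hits) > 0
  data.frame(line = which(keep),
             text = lines[keep],
             candidates = vapply(hits[keep], paste, character(1), collapse = ", "),
             stringsAsFactors = FALSE)
}

example_lines <- c("Shell be coming round the mountain,",
                   "Mary had a little lamb,",
                   "She'll be fine.")
to_check <- flag_lines(example_lines, ambiguous)
```

Only the first line is flagged here. You'd still have to read it yourself to decide whether "Shell" is a seashell or a missing apostrophe.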

Here's an example:
Mary had a little lamb,
Its fleece was white as snow;
And everywhere that Mary went,
The lamb was sure to go.

Here's what the resulting csv looks like. For my purposes, I wanted: word; concreteness = the concreteness score; meanscore = the mean concreteness of all the words, which honestly you may want to re-compute after you re-check the words scored as 0 (in this case the mean will not appear to be right b/c I actually coded the entire Mary had a little lamb poem); length = the # of words; and subjid = the participant id.
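Re-computing the mean after you've dealt with the 0s could look something like this. It's a minimal sketch: the scores in `scored` are made up for illustration, and it assumes a 0 always means "not found in the dictionary".

```r
# Made-up scores for illustration; 0 marks a word that was not found
scored <- data.frame(word = c("Mary", "had", "a", "little", "lamb"),
                     concreteness = c(0, 2, 1.5, 2.5, 5),
                     stringsAsFactors = FALSE)

# Mean the way the script computes it: unmatched words count as 0
mean_with_zeros <- mean(scored$concreteness)

# Mean over matched words only: turn the 0s into NA first
matched <- ifelse(scored$concreteness == 0, NA, scored$concreteness)
mean_matched_only <- mean(matched, na.rm = TRUE)
```

The two means differ (2.2 vs. 2.75 in this toy example), so it's worth deciding which one you actually want to report.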



Here's the R code that will do it:

library(readxl)
library(tidyverse)
#set a working directory
setwd("C:/Users/imxwang/Documents/Blog/concrete")

#read your abstractness/concreteness dictionary, I've named the file this way, but you can name it whatever you want
wordData <- read_xlsx("words_concrete.xlsx")

#load your participant text file paths (in my case, my files are all .txt therefore the pattern argument is written this way)
#note: pattern is a regular expression, not a wildcard, so "\\.txt$" means "ends in .txt"
textfiles <- list.files(full.names = FALSE, pattern = "\\.txt$", recursive = TRUE)

#we're looping through each file
for (i in textfiles){
#in my case, my files are all .txt. If yours are .docx, you'll likely need the function readtext()
inputFile <- readLines(i)

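#in my case, my subject id is coming from the file name, you'll need a different solution if yours is different.
#"\\.txt$" only strips a literal ".txt" at the end of the name (a bare "." would match any character)
subjid <- sub("\\.txt$", "", i)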

#this is naming the abstractness dictionary's first column "word"
colnames(wordData)[1] <- "word"

#we're splitting the text into words here
words <- strsplit(inputFile, "\\s+")

words <- unlist(words)

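#remove punctuation except for apostrophes (need to hand code apostrophes because of she'll, we'll issues)
words <- gsub("(?!')[[:punct:]]", "", words, perl=TRUE)
#drop empty strings left over from leading spaces or tokens that were only punctuation
words <- words[words != ""]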

#we're creating an empty dataframe, with "word" and "concreteness" as columns
#(concreteness is numeric, not integer, b/c the scores have decimals)
outputDf <- data.frame(word = character(),
                       concreteness = numeric())

#for each word in the text file, we're looking up its concreteness score in our dictionary.
#the match is case-sensitive, so "The" and "the" count as different words.
#seq_along() (rather than 1:length(words)) keeps the loop from breaking on an empty file
for (wordIdx in seq_along(words)) {
  currentWord <- words[[wordIdx]]
  currentVal <- wordData[wordData$word == currentWord,]$Conc.M
  if (length(currentVal) == 0) {
    #if the word isn't in the dictionary, default to 0
    currentVal <- 0
  }
  #add a new row for the current word
  outputDf <- rbind(outputDf, data.frame(word = currentWord, concreteness = currentVal))
}
#this is for my own curiosity: add a column with the participant's mean concreteness score
#(words coded as 0 are included in the mean), a column with the # of words, and a subject id column
outputDf <- outputDf %>%
  dplyr::mutate(meanscore = mean(concreteness, na.rm=TRUE), length = nrow(.), subjid = subjid)

#this just helps us save each file into a separate csv per participant w/ the subject id as the name of the file. 
name <- paste("df", subjid, sep = "")
assign(name, outputDf)
write.csv(outputDf, paste0(subjid, ".csv"), row.names = TRUE)
}
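
If the word-by-word loop gets slow on long texts, one possible alternative is a vectorized lookup with base R's match(). This is a sketch rather than a drop-in replacement: the toy `lexicon` and its scores are made up for illustration, and it assumes the same `word` and `Conc.M` columns as the dictionary above.

```r
# Toy stand-in for the dictionary (made-up scores, NOT real norms)
lexicon <- data.frame(word = c("lamb", "snow", "white"),
                      Conc.M = c(4.9, 4.8, 4.5),
                      stringsAsFactors = FALSE)
words <- c("Its", "fleece", "was", "white", "as", "snow")

# match() gives the row of the first matching dictionary entry, or NA if none
idx <- match(words, lexicon$word)

# Build the whole dataframe in one go; words that aren't found get 0
outputDf <- data.frame(word = words,
                       concreteness = ifelse(is.na(idx), 0, lexicon$Conc.M[idx]),
                       stringsAsFactors = FALSE)
```

One difference to be aware of: match() only uses the first matching entry, whereas the loop adds one row per matching entry if a word shows up in the dictionary more than once.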

After running my code: use your human eyes to look at the words coded as 0, so you can see what you should do with those.
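
To save your eyes some work, you could pull every distinct word coded as 0 out of the output csvs into one list. This is a minimal sketch: the function name `zero_words` is hypothetical, and the demo writes a throwaway csv (with made-up scores) to a temporary folder so it doesn't touch your real files.

```r
# Hypothetical helper: collect every distinct word scored 0 across output csvs
zero_words <- function(csv_files) {
  all_rows <- do.call(rbind, lapply(csv_files, read.csv, stringsAsFactors = FALSE))
  sort(unique(all_rows$word[all_rows$concreteness == 0]))
}

# Demo with a throwaway csv in a temporary folder (scores are made up)
demo_file <- file.path(tempdir(), "demo_subject.csv")
write.csv(data.frame(word = c("Mary", "lamb", "Its", "Mary"),
                     concreteness = c(0, 4.9, 0, 0)),
          demo_file, row.names = TRUE)
to_review <- zero_words(demo_file)
```

On your own data you would pass in something like `list.files(pattern = "\\.csv$")` from the folder where the script saved its output, assuming no other csvs live there.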

Please contact me at imxwang@umich.edu if you have problems or want help modifying this for your own use!
