Language is perhaps the most human of endeavors: through language, we fundamentally express who we are. Precisely because we express ourselves through language, we can use it to infer information about the authors of texts. This property makes text a fantastic resource for research into the complexity of the human mind, from the social sciences to the humanities. The recent introduction of large-scale statistical models has made this research even easier and more powerful.
However, it is exactly this human property of text that also creates ethical problems. While we can exploit the fact that texts reflect their authors' biases, those same biases can have unintended consequences for our analysis, and statistical models magnify them. If our data is not representative of the population we want to study, or if we do not pay attention to the biases enshrined in language, we can easily draw the wrong conclusions and create disadvantages for our subjects.
In this talk, I will discuss four types of bias that affect the statistical analysis of text, their sources, and potential countermeasures. First, I will cover biases stemming from the data: selection bias (when our texts do not adequately reflect the population we want to study) and label bias (when the labels we use are skewed).
Over the last few years, though, a growing body of work has not only uncovered such biases, but also shown various ways to address and counteract them, ranging from simple labeling considerations to new types of models.
I hope to leave the audience with a better, more nuanced understanding of the possible pitfalls of working with text, but also with a sense of how effectively these biases can be addressed with a little forethought.