by Jonathan Y. H. Sim
You’ve probably heard the saying, “If you assume, you make an ass of u and me.” You’ve also probably heard your teachers constantly telling you not to make assumptions as they can lead to bad consequences. Yet, no other discipline makes more assumptions than the social sciences. Social scientists make a lot of assumptions about the things they study, not because they are bad scholars, but because they have to: they are studying what is essentially incredibly difficult to study: us!
Studying humans is one of the hardest things to do. The moment you know that every action is being monitored and examined, you will change your behaviour, either presenting the best side of you, or biasing your actions if you think your behaviour will help (or not help) the researcher. The moment a policy is enacted, people will change their behaviours in a myriad of ways. Some will game the system to their advantage, while others will do what they can to avoid being negatively impacted by the new policy. But more importantly, there are just so many confounders present in the daily interactions of humans that it is almost impossible to determine whether something caused a certain social phenomenon.
Imagine this scenario. To encourage more people to use less electricity at home, the government is considering tripling the price of electricity. There are reasons why this might work, and reasons why it won’t. But let’s assume that this policy was implemented on 1 December in Singapore. We might observe a drop in electricity consumption. But we can’t be so sure that it was caused by the tripling of electricity prices. December happens to be a really cold and rainy period, away from the regular scorching heat in Singapore. Households typically do not turn on air-conditioning during that period, so the government will observe a big drop in domestic electricity consumption anyway. December is also the period when a lot of families go on long holidays. So they’d switch off everything at home. Again, this will contribute to the observed drop.
Just because we observe a policy put into effect and a drop in electrical consumption does not necessarily mean that one caused the other. It would be committing the inductive fallacy of confusing correlation for causation (mentioned in the previous chapter).
We can take this a step further and consider: could the policy backfire? Instead of discouraging electricity consumption, is it possible that it might encourage more consumption? There are two ways this might be possible. First, if the tripling makes electricity sufficiently expensive, people may begin to perceive electricity more as a luxury good, and feel entitled to its use: “I worked very hard this week, and I can afford it, so I shall turn on air conditioning throughout the entire weekend.” Second, people will try to game the system by searching for cheaper (or free) sources of electricity. When stocking up, people tend to take more than they need for fear that they didn’t take enough. So one phenomenon that will occur is that people will charge batteries in their offices or public places to shave off a few dollars at home. A similar situation, by the way, is already happening in Singapore. With the doubling of water prices, people are now seeking out sources of free water and filling up bottles and buckets of water for use at home. Some of us may think that it’s not worth the trouble, but these people don’t regard it as a hassle. In fact, some get a kick out of gaming the system, so their incentive structure is different from what we have assumed.
As demonstrated, humans are really messy creatures. It is almost impossible to study humans. There are too many confounders and people may behave in ways we did not expect.
One key reason why studying humans is incredibly difficult is that we lack access to the counterfactual, or what might have been the case if something didn’t happen. It’s very easy to find a counterfactual in the sciences. In laboratory conditions, we can create a control experiment and alter one variable. But not so with humans. We change every step of the way. I can’t just conduct an experiment on you, erase your memory (or go back in time for that matter), and repeat the experiment with a different variable. This is why assumptions are hugely important. Whether we are trying to study humans with models or with empirical research, we must make assumptions about what we are studying. And that is really the ART of being a good social scientist.
In this chapter, I want to focus our attention on the assumptions that social scientists make, the difficulties in postulating assumptions, how social scientists can be unaware of some of the assumptions they are making, and what we can do about it.
1. Programme Evaluation
Here’s an example. Amy Chua, author of the book, “Battle Hymn of the Tiger Mother,” advocates that the Chinese style of parenting is the most effective way: be ultra strict and harsh with your kids and they will become good children. Her kids grew up fine and well, as contributing members of society, according to her. Was it her parenting that caused it? It’s hard to say. Why? Because there are other factors/confounders like the school environment, the culture, the lifestyle, the types of friends the kids make, etc.
Is there a way to determine whether tiger mum parenting actually causes good children?
We can do it through something called “Programme Evaluation.” We need to identify a control group and a treatment group. These two groups should be as similar as possible. The treatment group will receive the “treatment.” In this case, they will receive tiger mum parenting. If we observe a change in behaviour after the treatment, we can say that a change in behaviour correlates with the treatment. And since the only factor that is different (supposedly) is the presence of the treatment (i.e. the tiger mum parenting), we may then infer that the treatment causes the observed effect (in this case, we may infer that tiger mum parenting causes good children, but if and only if the only difference between the two groups is the addition of tiger mum treatment). This means that a lot of the hard work involves trying to identify the control and treatment groups to reduce the confounders as much as possible.
Now some of you may say: isn’t that confusing correlation for causation? Yes and no. You will be confusing correlation for causation if your treatment and control groups are not similar enough to make a decent comparison. But if you have done your best to reduce the confounders, then yes, it may be valid to infer causation from the correlation.
The experiment I proposed above may sound easy, but there are a lot of problems that we need to consider. First, if we are not careful with our selection of parents, we might have parents in the control group all practising the same parenting style, which might be better or worse than tiger mum parenting. This will skew our results.
Second, a study like this may attract parents that are not representative of the larger population of parents. Only parents who are far more concerned about their child’s upbringing than the average parents may come forward. Also, the ones who can participate in such a study may well be parents who have more time and energy. These tend to be parents from wealthier backgrounds, so their economic backgrounds may already be factors that are very conducive to a child’s growth, development and upbringing. These children may have the luxury of a domestic helper or a nanny, go to good pre-schools and nurseries, have access to a variety of educational toys, have more quality time to spend with their parents, etc.
A bad researcher would not be aware of such problems, and we should be very careful when reading social science research papers. What assumptions did the researcher fail to make explicit? Such questions can help us to recognise the bias and blindness that prevented the researcher from carrying out better research on the subject.
A good researcher may well be aware of such problems, and limit the findings of these results in his/her publication. But as I mentioned in the previous chapter, there is a wide prevalence of misreported or misunderstood science. Even if we read social science papers like this, most of us tend to overlook these considerations in the paper, and assume that the findings could be extended to all parents and children. So again, it is important for us to be careful with what we read. To reiterate, the value of science is not in the answers, but in the process by which the answers are sought.
Here, I’d like to discuss three common empirical methods social scientists use to study humans. These methods were all initially employed in the physical sciences, but were later imported into the social sciences as a way of improving the way researchers study humans.
1.1 Randomised Control Trials
Randomised control trials (RCT) are the gold standard in all scientific and social scientific research, since this allows you to minimise your confounders as much as possible, almost like a laboratory condition. How it works is that you invite a large number of participants to your study. And each participant is randomly assigned either into the control group or the treatment group.
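The random assignment step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the participant IDs, the function name `assign_groups` and the fixed seed are all made up for the example, and a real study would also need to handle recruitment, consent and blinding.

```python
import random


def assign_groups(participants, seed=None):
    """Randomly split participants into control and treatment groups.

    Hypothetical helper for illustration: shuffles a copy of the list
    and cuts it in half, so neither the researcher nor the participant
    chooses the group.
    """
    rng = random.Random(seed)      # a seed makes the example repeatable
    shuffled = list(participants)  # copy, so the input is left untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"control": shuffled[:half], "treatment": shuffled[half:]}


# Made-up participant IDs P001 ... P100
participants = [f"P{i:03d}" for i in range(1, 101)]
groups = assign_groups(participants, seed=42)
print(len(groups["control"]), len(groups["treatment"]))
```

Because the split depends only on the shuffle, any trait a participant has (income, motivation, free time) is, on average, equally likely to land in either group. That is the sense in which randomisation "averages out" confounders.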
What’s important is that they cannot know which group they’re in. It’s possible that sometimes, the person doing the assignment may accidentally give out clues, like body language or choice of words. So to avoid that, researchers sometimes employ what’s known as a “double-blind test,” where even the people involved in assigning the groups don’t know which group is which.
The idea is that the sheer size of the study will be representative of the population (or demographic) and it will help to reduce (or average out) certain confounders that cannot be eliminated. Yet, much as RCTs aspire to be representative owing to their large number of participants, they can fall prey to unrepresentative sampling and survival bias (mentioned in the previous chapter). What do I mean? Let’s think about this together. Who are the people who are free enough to spend several weekday mornings or afternoons to participate in a study? Definitely not working adults. And since many of these studies are based in universities, the majority (if not all) of participants will be university students.
A study by Henrich et al (2010) found that in a sample of hundreds of studies from leading psychology journals, 68% of these studies originated from the United States (96% from leading Western nations). Of that 68% of papers from the US, 67% of the test subjects (participants) were specifically undergraduates majoring in psychology. Those are very biased samples!
Oftentimes, researchers are not careful enough and fail to consider that simple things like timing can itself be a process of selection that filters out particular demographics. So the RCT findings are skewed because of the biased sample of participants involved.
However, it is often not feasible to implement an RCT study. Cost is often the reason, due to the sheer number of participants required. But the other reason is that an RCT often requires you to arrange people in ways that are not conventional. Organisations and agencies will often turn down requests to carry out an RCT. A school principal, for example, will not allow you to re-organise students into control and treatment classes as it would disrupt school activities and curriculum.
The last reason, and this is important, is ethical. If you want to study whether something is good for people, you could run into problems, especially if this is a school involving children. Here, the ones to object are often parents: “If it’s good, then all the children should get it! Why should some lucky ones get it? It’s not fair.” Conversely, if you want to study if something is bad for students, their parents might object, saying: “If it’s bad, why are you testing it on our kids?”
So, what alternatives do we have?
We have two other solutions that are cheaper, and they certainly won’t incur the wrath of others for ethical or logistical reasons. We have to bear in mind that these are not as good as RCTs in eliminating confounders. Much of the work involves trying to minimise as many confounders as you can, and in making reasonable assumptions about group similarity. What makes social science more an art than a science lies in the fact that identifying the control and treatment groups for comparison is an art: how do you justify that two different groups of people are similar enough to make a valid comparison?
1.2 Difference-in-Difference
Difference-in-difference is used when the effect you are studying varies because of other factors beyond your control.
For example, people generally shower more when the weather is hot, and they will shower less when the weather is cool. How then can you figure out whether a particular water-conservation programme works? People might be using less water because the weather is cold and not because of the conservation programme.
Here’s where difference-in-difference comes in. In the first place, we must ensure that the control and treatment groups are similar enough. They should have similar water usage requirements (families with children will use water very differently from homes with just one individual), and similar water usage habits (e.g. in Singapore, people bathe twice a day on average; as opposed to some European countries where some people bathe once every few days). One should also consider whether other demographic properties are relevant. Does the household income matter? Does education level matter?
Just as important for this kind of study, the two groups should be exposed to similar environmental or background effects. For this study, the weather conditions of the control and treatment groups should be similar enough, since we know that people typically shower more in hot weather, and less in colder weather.
Having identified our treatment and control groups, and having assigned whatever treatment was planned, we can collect data on their water consumption before and after the treatment and measure the effect by comparing how much each group changed: the change in the treatment group minus the change in the control group. This will help to remove much of the effect of background conditions that may cause people to use more or less water.
Here’s a more difficult scenario. Let’s say a researcher wants to examine the efficacy of the death penalty in Singapore. We may consider the treatment as the death penalty, and Singapore as the treatment group. But what could we choose for our control group (that is supposed to be similar enough to represent the counterfactual of a Singapore without the death penalty)?
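One common way to carry out this calculation is to look at how much each group changed from before the programme to after it, and then subtract the control group’s change from the treatment group’s change. Here is a minimal sketch in Python. The household figures are entirely made up for illustration, and the function name `difference_in_difference` is my own; a real study would use many more households and a proper statistical test.

```python
from statistics import mean


def difference_in_difference(treat_before, treat_after,
                             control_before, control_after):
    """Change in the treatment group minus change in the control group."""
    treat_change = mean(treat_after) - mean(treat_before)
    control_change = mean(control_after) - mean(control_before)
    return treat_change - control_change


# Made-up monthly water use (cubic metres) for four households per group.
treat_before = [20.0, 22.0, 18.0, 20.0]    # before the conservation programme
treat_after = [15.0, 17.0, 14.0, 14.0]     # after the programme
control_before = [21.0, 19.0, 20.0, 20.0]  # same months, no programme
control_after = [18.0, 17.0, 19.0, 18.0]

effect = difference_in_difference(treat_before, treat_after,
                                  control_before, control_after)
print(effect)  # -3.0
```

In this toy data, the treatment group’s usage fell by 5 while the control group’s fell by 2 (say, because the weather turned cooler for everyone). Only the remaining drop of 3 is attributed to the programme, and even that rests on the assumption that the two groups would otherwise have changed by the same amount.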
Probably, the best scenario to study that effect would be to split Singapore into two halves. Let’s have the death penalty in the Western side of Singapore (treatment), and abolish the death penalty on the Eastern side of Singapore (control). If we assume that crime is largely influenced by economic factors (i.e. more crime when the economy is bad, and less crime when the economy is good), then this will be ideal since both the Eastern and Western sides of Singapore are exposed to the exact same economic factors.
Sounds good? Chances are, at the end of our observation (let’s pretend we studied this over five years), we’d see that the control group has high crime rates while the treatment group has low crime rates. Does this mean that the death penalty is effective? Absolutely not! Don’t forget that humans adapt and change strategies, often to game the system. So, if we abolish the death penalty on one half of Singapore, what could result is a situation where criminals who face the potential of the death penalty may move to the East of Singapore. So the Western side of Singapore (with the death penalty) will only have petty crimes not enough to warrant the death penalty.
This definitely won’t work. What we should do is to find another country (or state) that’s like Singapore as much as possible but without the death penalty. Again, if we assume that economic factors are contributing causes to crime, then what we need to be on the lookout for is a country (or state) that is exposed to similar economic conditions as our background environment. That is, if global economic affairs affect Singapore positively, they should also affect that country (or state) positively (to about a similar extent), and if negatively, both countries should be similarly affected.
Let’s pretend that we have found another country, e.g. Country X, that is similar enough economically. Now we have to justify why Country X, as our control, is sufficiently similar despite its differences. What about the overall education levels of Country X? Is education an important contributing factor to crime? If so, how much? If it’s negligible, then yes, we could compare the two countries to measure the effects of the death penalty on crime.
But what if you just can’t find another group that is similar enough to make a reasonable comparison using difference-in-difference?
1.3 Regression Discontinuity
Another method is known as regression discontinuity. Unlike difference-in-difference, the key feature of regression discontinuity is that it relies on a cutoff to differentiate the control and treatment groups. It could be examination results (e.g. the passing mark as a cutoff), or it could be a particular criterion for admission into a state welfare scheme or some subsidy. But what we’re more interested in are the people around the cutoff, those slightly above and those slightly below. It is reasonable to assume that these people are similar in many ways, and it was just a matter of luck or some other minor differences that led to one group being above the mark, and another group being below the mark.
We’re not interested in people far beyond the cutoff. In the context of examinations, students who score A+ are very very different from those around the passing mark, and certainly a world apart from those who got an F. Maybe the A+ people already know the content, have lots of extra tuition, have high IQs, drink a lot of herbal drinks that help to boost the brain, etc. Maybe the people who got an F are those who play all the time, or maybe they have had very dysfunctional families that made it impossible for them to study, or maybe they just have very low IQs or poor study habits.
On the other hand, students who passed by 1 mark and students who failed by 1 mark are not very different in terms of their abilities and calibre. What sets them apart is probably one careless mistake, or not enough time to complete the last question. It is reasonable to assume that they are quite similar in many ways.
If we want to see if a particular remedial programme is effective, we could put all the students who failed their mid-year examinations into that remedial programme. The remedial programme is thus conceived as the treatment. Our treatment group may involve the students who just failed the exam, 5% below the pass mark. And our control group will be those students who barely passed the exam, say 5% above the pass mark. They are not required to take remedial classes.
But why 5% above and below the cutoff mark? Shouldn’t 1% be better, since those people will be more alike? That’s true, but there is a trade-off that we must consider. If we reduce our spread (i.e. 1% above/below the cutoff rather than 5%), we risk having a sample size that’s too small to make any valid generalisations (i.e. a hasty generalisation). But in order to get a decent sample size, we will have to increase our spread. That runs the risk of including more dissimilar people, so we have to justify why 5% (or 10% or 20%) is still a reasonable spread. 5% could be justifiable if 5% is the average weight of each examination question. So 5% is a difference of having one answer correct or wrong. (In which case, having 10% might still be reasonable, as the difference in answering two questions right or wrong may not reflect significant differences in the students.)
We can then measure the efficacy of the remedial programme by looking at their final year examination results. Did the students in the treatment group perform better than students in the control group? If so, by how much? This could give us a measure of just how good the remedial programme might be.
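The comparison around the cutoff, and the trade-off between spread and sample size, can be sketched as follows. Everything here is an assumption for illustration: the pass mark of 50, the made-up (mid-year, final) scores, and the helper name `compare_around_cutoff`. It is also a simplification: it only compares average final scores on either side of the cutoff, whereas a fuller regression discontinuity analysis would usually fit regression lines on each side.

```python
from statistics import mean

PASS_MARK = 50  # assumed cutoff for the mid-year examination


def compare_around_cutoff(students, bandwidth):
    """Compare final scores of students just below and just above the cutoff.

    students: list of (mid_year_score, final_score) pairs.
    Those within `bandwidth` marks below the pass mark took the remedial
    programme (treatment); those within `bandwidth` marks above did not.
    """
    treatment = [final for mid, final in students
                 if PASS_MARK - bandwidth <= mid < PASS_MARK]
    control = [final for mid, final in students
               if PASS_MARK <= mid <= PASS_MARK + bandwidth]
    if not treatment or not control:
        return None  # nobody near the cutoff on one side
    return {
        "n_treatment": len(treatment),
        "n_control": len(control),
        "difference": mean(treatment) - mean(control),
    }


# Made-up (mid-year, final) scores for eleven students.
students = [(30, 35), (44, 50), (46, 58), (48, 60), (49, 62), (50, 55),
            (51, 57), (53, 56), (55, 60), (70, 75), (90, 92)]

wide = compare_around_cutoff(students, bandwidth=5)
narrow = compare_around_cutoff(students, bandwidth=1)
print(wide)    # 3 treated vs 4 control students
print(narrow)  # only 1 treated vs 2 control students
```

Narrowing the spread from 5 marks to 1 keeps only the most similar students, but leaves just three of them in total, which is far too few to generalise from. Students far from the cutoff, such as those who scored 30 or 90, are ignored in both cases.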
Sounds good? On paper, yes. But in reality, many things could take place to confound our research. For starters, the parents of those students who failed (our treatment group), could be worried that their children might not pass their final year exams, thereby enrolling their children for extra tuition classes at the best tuition centres. That might explain why their final year exam results are so much greater than the control group.
Another potential scenario is that if the school assured parents that this remedial programme is very good, then it might be possible that the students in the treatment group might not go for extra tuition classes outside. But, this does not stop the parents of students in the control group (who barely passed their exam) from enrolling them for extra tuition classes. Thus, if the remedial programme was measured to be barely effective, it could be due to the extra help that students in the control group received.
You can’t stop these kids from going for tuition classes. And it would be highly unethical to administer placebo tuition classes for the control group. So, either we accept the limitations of this study, or we’d have to find a better way to refine our experiment to reduce these confounders.
One issue that we need to be aware of when conducting such research is that in some cases, the treatment may be regarded as highly valuable (or to be avoided as much as possible). For example, if the treatment is some kind of monetary aid to people above the cutoff, people might be inclined to game the system so as to achieve the cutoff requirements. This can skew the research findings as these people might have done whatever they could to superficially attain the result, and thus are not the people that we actually want in our treatment group. The same can be said for treatments that are undesirable, e.g. the burden of time in having to attend classes or counselling sessions. In which case, people will do whatever they can to be disqualified from the treatment just to avoid it. So what we have in the treatment group will be people who, for various reasons, have no means of getting themselves disqualified even if they wanted to. So these are issues that we have to be aware of as well.
Allow me to present a slightly more challenging scenario, that is similar to our remedial class scenario: The management of a university suspects that full attendance at lectures might contribute to higher scores in examinations, and so they are considering making attendance at lectures compulsory.
If we’re going to use regression discontinuity, we’ll need some way to have a cutoff to identify our control and treatment groups. We could administer a mid-term examination in a particular module, and use that to determine the cutoff. What should be our cutoff? A-? B+? Or if you prefer percentile, the 90th percentile? 75th percentile? These suggestions might be too high. For students in this range of grades, it is quite difficult to determine if a student is scoring high marks just because of lecture attendance. These students could have higher intellectual abilities, or have a deep understanding of the subject matter on their own. Or the examiner had set a really tough question with a heavy weightage to separate the top students from the mediocre students.
So maybe we need to aim a bit lower. Perhaps the 60th percentile could be our cutoff. Students around this range are similar enough in that they have some mastery of the content, and do not exhibit the same high intelligence or calibre as the top scoring students. Also, there will be more students available so that our treatment and control groups could have a size sufficient for sampling.
What should our spread be? 1%? 5%? 10%? 20%? If we have enough students in the 1% spread, then yes, we could just identify the students 1% below the cutoff as our treatment group, and the 1% above as our control group. If the sample size is not large enough, then we’ll have to increase our spread. Whichever spread we choose, the treatment group will be required to attend all lectures.
Again, there can be problems with our study. Students in the control group who perceived that they did well for the mid-term exams may decide to skip lectures. If lectures are indeed crucial to their grades, this will affect their performance when we measure the efficacy of compulsory attendance at the final exams. That module might have a fantastic or terrible lecturer. Or maybe the one to have a real effect on the students is not the lecturer but the tutor of the module. Students may not have the same tutor, and some might have a tutor who makes very useful notes and summaries that greatly aided their revision for the mid-term and final examinations.
Nonetheless, whatever the findings may be, the study is very specific to this module and its unique factors. The content could be really abstract and difficult to learn, or the lecturer could be one of the best in the university that no other lecturers could match. So what is true of this module is not necessarily true of other modules. We cannot simply extend the findings from one module and generalise it for the entire university.
1.4 Internal and External Validity
It is thus very important for us to consider whether the findings of social science research are internally and/or externally valid.
What is true of only one module in the university is not necessarily true of other modules in the same university, or other universities for that matter. We say that this study is only “internally valid,” and NOT “externally valid.” We have to be extra careful with exporting the conclusion of this study. We could potentially use the conclusion of the study on one module as the basis for discussing other modules. But there is a limit. What is true of this module is not necessarily true of other modules, unless other modules share similarities with the module that we studied (e.g. same lecturer, same group of students, etc.).
What can we do to achieve external validity? We could extend our study to something on a much broader scale to cover many other modules, by looking at the Grade Point Average (GPA) of all undergraduate students. We could set the cutoff as 2.50 and set a policy requiring compulsory attendance at lectures for students with GPA 2.50 and below. Students in the 5% above the cutoff will be our control group, while students in the 5% below will be in our treatment group, receiving the treatment of compulsory attendance. At the end of one semester (or one academic year), we can compare to see if the treatment group performs significantly better than the control.
A test like this would allow us to study the effects of full attendance across the entire university. Sounds great, doesn’t it? But there is a problem. Such a study would be “externally valid,” but NOT “internally valid.” The claim is true for the entire university population, but not for individual modules. By doing such a large scale study that cuts across every module and discipline, we have eliminated confounders in specific modules, like the difficulty of the module/content, the teaching efficacy of the lecturer, the types of students enrolled in the module, etc. For this reason, we cannot assume that the conclusion about the general population of the university is true for a specific module. We may, however, export the conclusion from such a study to other universities, insofar as the other universities are similar to the university we studied.
Could you have a study that is both internally valid and externally valid? Yes, you can, but it is too expensive and requires too much time and effort, so most researchers can’t and won’t do it (even if they could). To use the example above, it involves studying many individual modules, not just in one university, but also in other universities. If we can have a large collection of studies showing that full attendance across different modules and universities leads to better examination results, we can then have both internal and external validity.
Sounds great! But we don’t have enough studies that are like that, and this has a lot to do with the sociology of research. There is greater incentive to publish new findings, than to repeat the research of other scholars to establish a finding that is both internally and externally valid.
So what we have is a lot of social science research out there that is either true of a specific group, or true of a general population, and we cannot simply export these findings to other groups. We must be very careful when we encounter people who extend the claims of a particular research to groups not mentioned in the study (especially if the group being studied is different from the group that it has been extended to).
In other words, take it with a pinch of salt, and always ask: “How different is this group of people (mentioned in the study) as compared to the group of people (that person X is extending the claim to)?”
In order to justify such an extension, we need to look at the assumptions that person is making. Are those assumptions reasonable? How might the two groups be different that the conclusion wouldn’t necessarily apply?
2. The Normative Dimension of Measures
One issue that I’ve alluded to in the previous section is that attempts at measuring humans can alter human behaviour. If there are monetary incentives, people will game the system to achieve it. But it doesn’t always have to be monetary incentives. In many cases, a mere descriptive measure can take on a normative (good/bad, right/wrong) dimension.
For example, if your weight is beyond a healthy acceptable range, you might be considered obese. But culturally, obesity is more than just being beyond a healthy range. It can be a cultural stigma, and one that translates into: fat and ugly.
Similarly, the grade point average of university students is an indicator of their academic performance. And while these scores indicate whether a student is a good student in the sense of one who’s good academically, the sense of academic goodness can get conflated with other senses of “good.” In many cultures, this feeds into the identity of many students. Having a good GPA is not just an academic achievement, it is a measure of one’s self-worth and identity. Having a low GPA doesn’t simply mean that one isn’t very good academically. Some infer it to also mean that one is a failure in life.
Jean-Francois Lyotard (1984) refers to such measures as a “terror of performativity.” Especially in the context where measures are in place to determine how well we are performing, these measures come to “encapsulate or represent the worth, quality or value of an individual or organisation within a field of judgement.” (Ball, 2003)
These measures can dominate our cognitive structures, reducing our identity to revolve around the fulfilment of performance measures, thereby losing sight of other broader and more important goals and projects in life.
Do allow me to share an anecdote to emphasise this point. At the start of every semester, I would send out e-mails introducing myself to my students, and I would ask for their introductions, so that I can get to know them better. In one semester, I had to teach in the residential halls. For students to continue staying in a residential hall, they must be actively participating in hall life. This is measured through a point system. They get points for the number of clubs they join, whether they are in office, and for various projects that they participate in. As it is highly competitive, they tend to take on between 4 and 6 of these intense activities each semester. One thing that really stood out was how these students introduced themselves. They introduced themselves in terms of the activities they were doing. This is quite unlike the introductions I got from students not residing in residential halls: theirs were typically about their hobbies and plans for the future.
Measures have a powerful role to play in affecting our identity, what’s important to us, and our behaviour, and it’s worthwhile for us to question and think about the way we set measures for ourselves and for others (especially when we take on senior roles with people under our care).
A good example of this is the academic system. Not too long ago, some scholars thought that it was a good idea to measure scholars based on the number of publications and the impact that they have on the world. It seems like a good idea, since we do see a correlation between good scholars and highly cited publications (and you see a lot of scholars describing their identity in that way too).
However, fast-forward a few decades to where we are today, and we have a “publish or perish” culture. Since the measure of academic success lies in the number of publications and their impact, everything else takes a back seat, including teaching. And there are many scholars who have since learnt how to game the system, typically by publishing small bits of research. Few these days would dare to undertake big, ambitious and lengthy projects with the potential of making a great impact. (https://www.chronicle.com/article/We-Must-Stop-the-Avalanche-of/65890) So not only has the quality of research dropped, but a ridiculous proportion of the papers that get published are never read. Within the social sciences alone, 90% of published papers in the field have never been read by anyone. (http://blogs.lse.ac.uk/impactofsocialsciences/2014/04/23/academic-papers-citation-rates-remler/)
But perhaps one big tragedy in academia is that everyone’s so busy trying to publish something new in their field that no one’s trying to synthesise all this disparate information. Synthesis is not part of a scholar’s measure of success.
This is something we should be really careful with. We tend to impose performance measures (among many other kinds of measures) on people for the sake of trying to measure and determine the effect of a particular phenomenon, without much thought on how these measures can have negative consequences. By the time those consequences become obvious, we can’t easily undo the damage.
3. Models are Great – But They’re All Wrong!
If you’ve ever done online shopping for clothes, you’d notice that no one sells clothing online by showing photos of clothes lying flat on a table. Either a human person or a mannequin wears it to model the clothes, so that we have an idea of how it’ll look when worn. More importantly, we use these models to help us get a sense of whether the cutting of the clothes might potentially fit our body frame. If I see a model wearing a tight shirt where it curves inwards at the waist, I immediately know that I can’t wear it as my belly curves outwards. Some of us may reference the length of the model’s arms with respect to the model’s height to get a gauge of whether the dress or long sleeves might be too long or short.
That’s essentially what models do. They come in all shapes and forms. Some of them are even human! And we use models all the time without realising it. And like fashion models, models help to simplify complex phenomena and structure data in a way that we can understand without having to resort to any empirical work. Really good models are powerful as they can predict outcomes with incredible accuracy.
This can lead some people to mistake models for the truth, and approach the use of these models very uncritically. So, I’d like to quote the famous statistician, George Box, who said (and this is very important):
“All models are wrong!”
It might be rather strange to hear this coming from the very guy who used models throughout his entire professional career. What Box means to say is that models are wrong because they are merely simplifications of complex phenomena in this world. Simplifications are made possible when we restrict the model to a fixed set of assumptions and conditions about the world.
One model we’re probably quite familiar with is the physical model representing Newton’s 2nd Law of Motion, Force = Mass x Acceleration. As it is usually first taught, this model assumes an ideal condition where there is no friction, so that the force we apply is the only force acting on the object. This gives us a simple understanding of how force varies with the mass and acceleration of an object. In the real world, this simple version doesn’t quite apply because of factors like wind resistance and friction of the object against the surface.
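To get a feel for how much the ideal condition matters, here is a minimal sketch in Python comparing the frictionless prediction with one that subtracts sliding friction. The function name, the friction coefficient of 0.3 and the other numbers are illustrative assumptions, not measurements, and the friction term is itself another simplified model.

```python
def acceleration(applied_force, mass, friction_coefficient=0.0, g=9.81):
    """Acceleration (m/s^2) of a block pushed along a flat surface.

    With friction_coefficient=0 this is the idealised F = m * a.
    Otherwise sliding friction (coefficient * mass * g) opposes the push.
    """
    friction = friction_coefficient * mass * g
    # Friction can cancel the push, but it cannot drag the block backwards.
    net_force = max(applied_force - friction, 0.0)
    return net_force / mass


ideal = acceleration(applied_force=20.0, mass=4.0)
real = acceleration(applied_force=20.0, mass=4.0, friction_coefficient=0.3)
print(f"ideal: {ideal:.3f} m/s^2, with friction: {real:.3f} m/s^2")
```

The same push on the same object gives a noticeably smaller acceleration once friction is included, which is exactly the kind of gap the idealised model stays silent about.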
All models make assumptions and restrict their use to fixed conditions. It’s for this reason that we cannot expect a model to tell us everything, or every aspect of a particular phenomenon.
3.1 What are Models not Saying?
As simplifications, models work well to tell us a limited number of things, but we can get the mistaken idea that they tell us everything. One way to become more aware of the limits of models is to question just what a particular model is not telling us.
I will illustrate this using the Gini coefficient. It is a model commonly used to determine how equal or unequal the distribution of wealth is in a particular society. A score of 0 means that the society has an equal distribution of wealth, while a score of 1 means that society’s wealth is most unequally distributed. The formula is a little bit complicated, so I’d rather not get into that. Instead, I’ll use the Lorenz curve often used to visually depict the Gini coefficient. To plot such a graph, one must first sort the wealth of individuals in the population from poorest to richest. Each point on the graph then shows, for a given share of the population counted from the poorest upwards, the share of the society’s total wealth that they hold together.
Let’s say the richest person has $1 billion. He is the last person to be counted, so his point sits at the top-right corner of the graph, where 100% of the population holds 100% of the wealth (a ratio of 1.0).
If the second richest person has $0.7 billion, his point sits just before that corner: the ratio plotted there is the society’s total wealth minus the richest person’s $1 billion, divided by the total.
Suppose the poorest person has a wealth of $100. He is counted first, so his point is $100 divided by the society’s total wealth, a ratio very close to 0.
The closer the Lorenz curve is to the straight diagonal line, the more equal the distribution of wealth. The further away it is, the more unequal the distribution.
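For readers who would like to see the arithmetic, here is a minimal sketch in Python of the standard textbook construction: each Lorenz point pairs a cumulative share of the population with the cumulative share of total wealth they hold, and the Gini coefficient is derived from the area under that curve. The function names and the sample figures are my own illustrative assumptions.

```python
def lorenz_points(wealths):
    """Cumulative (population share, wealth share) pairs, poorest first."""
    ordered = sorted(wealths)
    total = sum(ordered)
    points = [(0.0, 0.0)]
    running = 0
    for count, wealth in enumerate(ordered, start=1):
        running += wealth
        points.append((count / len(ordered), running / total))
    return points


def gini(wealths):
    """Gini = 1 - 2 * (area under the Lorenz curve), using trapezoids."""
    pts = lorenz_points(wealths)
    area = sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))
    return 1 - 2 * area


print(lorenz_points([1, 3]))      # two people: one has 1, the other has 3
print(gini([5, 5, 5, 5]))         # everyone has the same wealth
print(gini([0, 0, 0, 100]))       # one person holds everything
```

With only four people, the most unequal case gives 0.75 rather than 1; the score approaches 1 only as the population grows.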
If all you have is a graph of the Lorenz curve and a Gini coefficient value, what is this model not telling us? (Or unable to tell us?) I’ll give you a clue. The results are ratios that are dependent on the richest and poorest people in society.
Have you figured out what the model doesn’t say?
Ratios don’t tell us very much about how rich or poor a society is. I could have a society where the poorest person can own a small unit in an apartment, an iPad and several other things that the poor in other societies cannot afford. In other words, I could have a society that is very wealthy, where the poorest person lives a much better life than other people who are actually in poverty. However, this society could have several billionaires, and that would vastly skew the inequality levels, giving me a picture that this society has a high level of inequality. Conversely, I could have a society where everyone is in poverty, but the Gini coefficient says it has a very equal distribution. What it fails to reveal is that everyone is equally poor.
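This blind spot is easy to check with numbers. The sketch below, in Python, uses a common equivalent formula for the Gini coefficient (the average gap between every pair of people, divided by twice the mean) and applies it to made-up societies. The function name and all the wealth figures are assumptions for illustration only.

```python
def gini(wealths):
    """Mean absolute difference between all pairs, divided by twice the mean."""
    n = len(wealths)
    mean = sum(wealths) / n
    total_gap = sum(abs(a - b) for a in wealths for b in wealths)
    return total_gap / (2 * n * n * mean)


poor_society = [100, 200, 300, 400]
rich_society = [w * 1000 for w in poor_society]   # everyone 1000 times richer

equally_poor = [100, 100, 100, 100]
equally_rich = [1_000_000] * 4

print(gini(poor_society), gini(rich_society))
print(gini(equally_poor), gini(equally_rich))
```

Both pairs of societies receive the same score, even though life in them would be very different. The coefficient only sees how the wealth is shared, not how much of it there is.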
Of course, these scenarios rarely happen. But it’s because they rarely happen that we tend to forget about the limitations of this model. What is more common is the presence of super wealthy people in a society, and that skews the Gini coefficient to depict that society to be more unequal than it really is. This is quite dangerous because there’s a lot of talk about poverty and unequal distribution of wealth, and people frequently rely on the Gini coefficient as if it says everything about how bad the situation is. No, it’s not the complete picture. Like all models, it’s wrong because it tells us only one thing.
The next time you encounter a model, it’s worthwhile to try to figure out what it doesn’t tell you. There’s so much you can learn from what a model is silent about.
3.2 All Models Have Encoded Biases
The other reason why all models are wrong is that all models – including models used in the physical sciences and engineering – have biases coded into them. Many people who use models tend to be blissfully unaware of the biases in the models they use. But this is a false sense of security that people tend to have, in the belief that as long as models are grounded in facts and figures, there can be little or no biases. Yet that is far from the truth about models.
At the very least, the bias comes in the form of how we expect things (or systems) to behave. It’s not so bad in the physical sciences and in engineering as there are methods for rooting out incorrect biases. However, the problem is more devious in computer models and especially so for social science models. Not only do these models encode the biases of how we expect people to behave, but they also include certain biases about our perception of humans, or biases about what we think people value over other things, or biases about the way certain social interactions will play out. It is far too easy to be under the illusion that you are unbiased and objective when you are working with facts and figures in your models.
True, numbers don’t lie. But the concern lies in the selection of the figures, and in the selection of what goes into the model. “Why did you pick X instead of Y?” Questions like these often go unanswered and are left hanging in many papers. Sometimes, researchers choose something because it seemed “obvious” that it is relevant to their study. But what is obvious is a subjective matter. What’s obvious to you is not obvious to another. And for that matter, what is obviously relevant to you is very subjective and not obviously relevant to others. It is based on what you know (or think you know). That is a problem. We can only model based on what little we know, and what we think is relevant. But we could be mistaken about what’s relevant, and fail to put something important into the model. And who’s to say that what we left out is not relevant to the model?
4. A Case Study of the Game Theory Model
Thus far, I’ve been talking about models in very abstract terms just to highlight the ways in which they can go wrong. Here, I’d like to discuss the Game Theory model, best known through its classic scenario, the Prisoner’s Dilemma. I’ll talk about how it’s such a powerful model for prediction, but I’d also like to talk about the encoded biases and limits of this model, and what we can do when real world situations deviate from models.
Here’s the scenario. Two robbers were caught by the police and confined in separate prison cells, where they have no way to communicate with each other. The police present each robber the following options:
(1) If you both confess, you will both serve 2 years in prison.
(2) If you confess, but the other stays silent, you will go free, while the other has to serve 3 years in prison.
(3) If you stay silent, but the other confesses, you will serve 3 years in prison, while the other goes free.
(4) If you both stay silent, both will serve 1 year in prison.
You have some time to decide, so you don’t have to make a split-second decision. Think about it. What would you do? Will you confess or stay silent? Why?
What happens if you stay silent? You might get 1 year in prison, but you also run the risk of the other robber defecting and confessing. Since you can’t talk to the other robber (and you don’t know him that well), you don’t know if he will play you out.
This is known as the free-rider problem. We don’t like it when people take advantage of our good will, or benefit from our losses. If there’s a good chance that people are not going to cooperate because it is more advantageous to defect and take advantage of those who cooperate, then why should you cooperate and be on the losing end?
So, even though Option 4 (above) is the best outcome possible for the two of you together (1 year in prison each), you’re not sure, in this state of uncertainty, whether you can trust the other enough to stay silent too. The other robber could just confess in the hope that you stay silent, thereby taking advantage of your good will. Or the other robber could confess simply because he doesn’t trust that you will stay silent. Maybe he thinks that you will take advantage of his good will.
So, no matter how you think about it, staying silent doesn’t seem like a good idea. Sure, you might get 1 year in prison, but you might also get 3 years in prison.
If you confess, you either get to go free, or get 2 years in prison. At the very least, you know you won’t get taken advantage of by the other if he confesses. So, no matter how you look at it, confessing seems like the better choice.
In fact, confessing seems like the better choice for the other robber too. Thus, according to Game Theory, if the two robbers are rational and self-interested in the way they make decisions, both robbers are quite inclined to confess, and so Option 1 (above) is the predicted outcome. It is also referred to as the Nash Equilibrium, as it is the outcome where neither robber can do better by changing his own strategy while the other keeps his. No matter how we assess the situation, confessing seems to be the better strategy.
In some ways, Game Theory can also explain why we can’t have all the good things in the world. Given the free rider problem where none of us wants to be played out by others, the best thing to do is not to cooperate, since you don’t know if the other may cheat or back out of the deal.
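The reasoning above can be checked mechanically. Here is a minimal sketch in Python that encodes Options 1 to 4 as years in prison, works out each robber’s best response, and searches for outcomes where neither would switch on his own. The names used (YEARS, best_response, nash_equilibria) are my own, chosen for illustration.

```python
from itertools import product

CONFESS, SILENT = "confess", "silent"
CHOICES = (CONFESS, SILENT)

# Years in prison for (you, other), taken from Options 1 to 4.
YEARS = {
    (CONFESS, CONFESS): (2, 2),
    (CONFESS, SILENT): (0, 3),
    (SILENT, CONFESS): (3, 0),
    (SILENT, SILENT): (1, 1),
}


def best_response(other_choice):
    """The choice that minimises your own years, given the other's choice."""
    return min(CHOICES, key=lambda mine: YEARS[(mine, other_choice)][0])


def nash_equilibria():
    """Outcomes where neither robber gains by changing only his own choice."""
    found = []
    for a, b in product(CHOICES, repeat=2):
        a_stays = all(YEARS[(a, b)][0] <= YEARS[(alt, b)][0] for alt in CHOICES)
        b_stays = all(YEARS[(a, b)][1] <= YEARS[(a, alt)][1] for alt in CHOICES)
        if a_stays and b_stays:
            found.append((a, b))
    return found


print(best_response(CONFESS), best_response(SILENT))  # confess confess
print(nash_equilibria())  # [('confess', 'confess')]
```

Whatever the other robber does, confessing costs you fewer years, which is why both confessing is the only outcome that survives the check.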
If you’ve been paying attention, people have been queueing earlier and earlier for the new iPhone. At the beginning, people used to line up about one or two hours before the launch of a new model. In subsequent years, people began queueing overnight. In recent years, people have been queueing for a week! We could all have nice things by agreeing not to queue like this. But those who got played out by coming a little bit late have learnt that they need to queue much earlier than the rest. Everyone else is inclined to think the same way: queue earlier than the others to beat the crowd. That’s our Nash Equilibrium, and so each year the queues begin forming earlier and earlier. That’s Game Theory in action.
We can also use Game Theory to explain why it’s hard for countries or companies to go green. Again, it’s the free rider problem in action. It’s too expensive, and no one wants to be the first (or only one) to do it so that others can benefit at their expense, so the Nash Equilibrium is to not go green. It also explains why it’s hard to stop corruption in government. If you decide to be clean and play by the rules, everyone will take advantage of you (or worse, get you fired), so the Nash Equilibrium is to maintain the status quo.
4.1 The Danger of Game Theory
It’s amazing how well Game Theory can be used to explain a variety of situations. However, the danger of learning this model just like that is that we can go away with the idea that everyone is selfish, that we cannot trust anyone, that the best thing to do is to always act for our own interest and benefit. I’ve taught Game Theory over many semesters and noticed how students walk away from class with this “realisation” about how messed up humans are.
That’s the danger of learning and using models uncritically. Models are supposed to help us understand reality. But if we’re not critical about the models we use, these models can instead shape our perception and create realities.
There is an incredible number of research papers and articles about just how selfish and greedy economics majors are. Part of it stems from such uncritical use of models: the model has been used to create a false reality where we cannot trust or cooperate with others, and where the better thing to do is to act for our own self-interest. That’s a huge mistake!
We need to be aware of how the model is wrong, how it fails to reflect reality all the time.
4.2 Model Deviations and Deficiencies are Learning Points
While the Prisoner’s Dilemma is a scenario that best exemplifies Game Theory, in reality, researchers have found that members of the mafia don’t behave that way. You could put two mafia members in separate interrogation rooms, and both will choose to stay silent, contrary to what the Game Theory model predicts.
Does this mean that the mafia have falsified Game Theory, and that we can now throw the model away? Well, not quite. Members of the mafia are quite unlike our two robbers in the Prisoner’s Dilemma. So we need to be aware of the assumptions and background conditions of Game Theory and how the mafia situation deviates from the model.
First, the one explicit condition of this model is that the two parties cannot communicate during the decision process. This is to model, in a two-person scenario, how it’s impossible for us to communicate and know what everyone else is thinking (or doing) in a society (or large group of people) when we’re trying to make our own decisions.
Second, in this model we assume that each robber behaves rationally and is solely interested in his own benefit and welfare, thus acting to create the best outcome for himself. Again, this is to model how in a society, even the most altruistic person cannot possibly act for the benefit of everyone, and can at best act for the benefit of oneself and one’s immediate circle.
Third, in this model, we are assuming that the two robbers do not have high levels of trust, almost as if they are strangers. This is to model the exact level of trust that we have with all the strangers we meet in society, given how impossible it is for us to meet and communicate with everyone.
Coming back to the mafia, we’ve had cases of mafia members who were interrogated separately and chose to cooperate with each other by staying silent. So the first point (no communication during the decision process) is not so relevant for considering how that scenario is different. Given the levels of trust needed to enter and stay in the mafia, they have far more trust than the robbers in our Prisoner’s Dilemma. Moreover, given such trust and strong bonds forged over the years in their mafia community, it is likely that they do not operate on cold rational self-interested calculations, but instead make decisions that include the other (and the broader mafia community).
The key difference here is trust. If we can achieve high levels of trust like the mafia members, then it is possible for us to cooperate – even if we are unable to communicate at that point in time – and get a better outcome. The Nash Equilibrium in cases of high trust is not Option 1 (both confess), but Option 4 (both stay silent) in this case.
In fact, in the real world, we do see a lot more trust and cooperation than what is presented to us in Game Theory. Where people are allowed to communicate and form bonds of trust, we get many associations and cooperatives. Groups of people can come together to maintain a community garden. People can cooperate and enter into business with one another. Even criminals have so much trust for each other (probably more so than most groups) that they can operate such wide syndicates and get a lot done together.
But before we get too pleased with this happy and positive point, we should consider whether we are missing other things in our analysis of the difference. Could there be other social mechanisms in place that will ensure that staying silent is the preferred choice?
Think of all the mafia movies you’ve watched. What happens if a mafia member betrays the mafia? Things don’t end well for him, right? The mafia wouldn’t just go after him, they would go after his entire family as well. That is what we call social sanctions (or penalties) for betraying trust.
That’s missing from our Prisoner’s Dilemma model. Our Game Theory model is deficient in discussing that. It’s not a flaw of Game Theory. Remember, Game Theory is supposed to be a model to explain social phenomena, and in many of those phenomena there are no such sanctions. If I were to queue one night before all the other iPhone-crazed people came to queue for a brand new iPhone, there is no social sanction, other than possibly angry stares from those who can’t be the first in line. If several car manufacturers decide to go green, but I decide that I won’t so that I can reap the greatest profits, the other manufacturers would be very annoyed, but there’s little they can do. Social sanctions are weak or missing in these scenarios.
In other scenarios, we have strong penalties and sanctions. We pay taxes to support the maintenance, construction, and delivery of public goods and services. It’s not so easy to be a free rider and defect, because there are harsh penalties in the law to deter people from behaving as such.
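To see how sanctions can change the predicted outcome, here is a minimal sketch in Python that adds a penalty for confessing to the original prison terms and then looks for the Nash Equilibrium again. Treating the sanction as extra “year-equivalents”, and the value of 5 used below, are purely my assumptions for illustration.

```python
from itertools import product

CHOICES = ("confess", "silent")

# Years in prison for (you, other), from Options 1 to 4 of the scenario.
BASE_YEARS = {
    ("confess", "confess"): (2, 2),
    ("confess", "silent"): (0, 3),
    ("silent", "confess"): (3, 0),
    ("silent", "silent"): (1, 1),
}


def with_sanction(sanction):
    """Add a cost (in assumed year-equivalents) to whoever confesses."""
    costs = {}
    for (a, b), (years_a, years_b) in BASE_YEARS.items():
        extra_a = sanction if a == "confess" else 0
        extra_b = sanction if b == "confess" else 0
        costs[(a, b)] = (years_a + extra_a, years_b + extra_b)
    return costs


def nash_equilibria(costs):
    """Outcomes where neither party gains by changing only his own choice."""
    found = []
    for a, b in product(CHOICES, repeat=2):
        a_stays = all(costs[(a, b)][0] <= costs[(alt, b)][0] for alt in CHOICES)
        b_stays = all(costs[(a, b)][1] <= costs[(a, alt)][1] for alt in CHOICES)
        if a_stays and b_stays:
            found.append((a, b))
    return found


print(nash_equilibria(with_sanction(0)))  # [('confess', 'confess')]
print(nash_equilibria(with_sanction(5)))  # [('silent', 'silent')]
```

With no sanction, the model predicts two confessions as before. Once betrayal carries a heavy enough cost, staying silent becomes the stable choice for both, without any change in how self-interested the two parties are.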
This has been a sampling of the various ways in which the Game Theory model deviates from other circumstances, and how a model set up for fixed situations can be deficient in explaining other situations. There are many other ways in which Game Theory is deficient and/or deviates from other real world cases. The good news is that we don’t have to throw away these models when they deviate or are deficient. We can use the model as a base for learning about the key differences, to gain insights into how the differences allow for such different outcomes.
Just as important is that we realise the limits of models. Game Theory can’t be applied perfectly to every social phenomenon, but only to some situations where there’s a lack of communication and trust. This leads us to a more interesting and constructive question. If trust is so important, how can we overcome situations that Game Theory explains and predicts quite well, to restore that trust and achieve a better outcome? Real world situations that deviate from this model can provide us with a lot of clues for this answer.
I’ve spent quite a lot of time discussing the limits of models, and how we can still use models even if they don’t match real world situations very well. As I mentioned before, if we are not careful, if we use them unquestioningly, models can create realities rather than help us understand reality.
For some time now, policy-makers and economists have been using models, not only to understand their societies, but also to organise and sort them to yield better outcomes. As technology becomes more tightly intertwined with human societies, we are now able to realise these social models and outcomes with greater speed. Given how models have encoded biases and assume fixed, steady conditions, we should approach the use of our models with great care.
A case study would be China. It has implemented the Sesame Credit System, which relies on a model of how trustworthy their citizens are. It is facilitated by an algorithm that pulls in information about each citizen, so that it can rate them according to that model. Your trustworthiness score will determine the types of loans available to you, how large a loan you can take, the jobs you can apply for, and more. For now, people in the bad citizen category are not allowed to travel on trains and planes.
Imagine China ten years from now. As the Sesame Credit System continues, it would generate a caste system of good citizens and bad citizens. Are people in the bad citizen category because they are actually bad, untrustworthy citizens? Or do they behave badly because they have been condemned and confined to that category of “bad citizens”?
Did the model help us understand reality, or did it create a reality?