Statistics
From GSLISWiki
This page should contain some information about using statistics properly. Unfortunately, I am not the best person to fill in information in this page. Here's something I submitted at one point to the CMC BB, that can exist for now until somebody improves it.
What's the point of using statistics?
One possible answer can be found in:
- Headrick, Daniel R. (2000). When Information Came of Age: Technologies of Knowledge in the Age of Reason and Revolution, 1700-1850. Oxford University Press: New York.
Headrick states that statistics are a "...means of transforming information. [emphasis Headrick's] Turning an accumulation of facts or anecdotes into numbers not only compresses them but also allows patterns to emerge that would remain hidden in narratives; thus, to the early practitioners of 'political arithmetick,' mortality tables (the numbers of deaths by parish every week) revealed the incidence of various diseases. From this small beginning arose the idea of demographic surveys, of censuses, and of sociological inquiries." (Headrick, pg. vi)
One obvious benefit of statistics, therefore, is that they allow patterns to emerge and make the data easier to read and communicate. This means that statistics are supplemental to anecdotes and narratives, not superior to them, and that they are primarily useful when they reveal information that would not otherwise be clear.
The common defence of the use of statistics, to my understanding, is that it provides more reliable results. This justification is often used blindly, such as when people use statistical analysis to process survey results, and because they used statistics, think that their survey must therefore be rigorous. Coming from a cognitive science background, I find anything that blindly depends on people's oral reports laughably unreliable.
So why do I find oral reports laughably unreliable? Well, here's an oversimplified answer that will do for the purposes of this post:
- Pinker, Steven (1997). How the Mind Works. W. W. Norton & Company: New York. (Pinker is a mulleted god of insults)
"The neuroscientist Michael Gazzaniga has shown that the brain blithely weaves false explanations about its motives. Split-brain patients have had their cerebral hemispheres surgically disconnected as a treatment for epilepsy. Language circuitry is in the left hemisphere, and the left half of the visual field is registered in the isolated right hemisphere, so the part of the split-brain person that can talk is unaware of the left half of his world. The right hemisphere is still active, though, and can carry out simple commands presented in the left visual field, like 'Walk' or 'Laugh'. When the patient (actually the patient's left hemisphere) is asked why he walked out (which we know was a response to the command presented to the right hemisphere), he ingenuously replies, 'To get a Coke.' When asked why he is laughing, he says, 'You guys come up and test us every month. What a way to make a living!'" (Pinker, page 422)
[Concept of self-deception also on pages: 272, 305, 421-424, 445, 448, 543, 654; Gazzaniga sources pg 600 or ask me to post/email them]
Another example, which also demonstrates how much oral reports can be manipulated, is the set of numerous videotaped cases (anecdotal evidence) of suspects intimidated by police into confessing who were later proved innocent (if I were to be rigorous, I would cite a source here, but I'm working from memory right now). This is why I found the following note in Haythornthwaite's article interesting:
- Haythornthwaite, Caroline (2002). Strong, Weak, and Latent Ties and the Impact of New Media. The Information Society, 18:385-401.
Note 2 on page 397: "These figures show reported interactions, not interactions captured from electronic logs. Thus, they show perceptions of interconnection with other class members rather than actual connections. Such kinds of reports have been found to be reliable for comparisons across reports of connections, and in whether a connection occurred or not, although not accurate in matching exactly how often an interaction occurred (Hartley et al., 1977; Rice & Shook, 1990b). Thus, these figures should be read for comparison across media and time rather than as measures of exact numbers of interactions."
This quote demonstrates that for any survey you do, you need to have an idea of what answers to the survey are reliable, and what answers are not, and what answers you don't have a clue about. That allows you to know whether you can base your conclusions on the results or not. And ascertaining the reliability of your results needs to be done *before* you start using statistical analysis to extract patterns upon which you base your conclusions.
How do you use statistics properly?
So how do you use statistics properly? Well, I wasn't sure, despite having 3 statistics (or similar) classes as an undergraduate, so, when in doubt, ask David Dubin. Here's what I gleaned from our conversation (and I hope I understood everything he said correctly).
When you apply any statistical test, you are making several assumptions about the data. Therefore, when you are ready to start using statistics, you must first be clear about what it is you are trying to measure. I am mainly referring to inferential statistics. There are several important questions you must ask yourself.
- What variables am I trying to compare?
- What am I trying to measure?
- Am I interested in comparing central tendency? Variance? Why?
- What does a difference between samples signify about my data?
- What is the distribution of these variables across the total population (if you could measure everybody)?
- Is it a normal distribution (bell curve), or some other type of distribution?
- Does the distribution of my sample data reflect this distribution, and how closely?
- There are (mathematical) tests you can perform on your data to determine how closely it fits to any particular type of distribution (for more information about these tests, talk to David)
- Parametric tests assume some theoretical distribution. Frequently it's the normal distribution, but not always.
- What kind of data do I have? Is it qualitative or quantitative?
- What level of measurement? Nominal, ordinal, interval, ratio?
- These are important questions, because each type of statistical test is only appropriate for certain kinds of data.
- What is the probability that my data is simply the result of typical statistical variance within the total population, based on my sample size, and sample selection techniques?
- Am I satisfied with my results if 1% of the time they could be simply the result of typical sample variance? 5% of the time? 0.1% of the time?
- This is important because it determines the cutoff point for deciding whether the results are statistically significant or not. Essentially, it is the significance level: the rate of false positives you are willing to accept.
- Is any difference significant, or, due to my hypothesis (i.e., what I am trying to test/what I expect to find due to the nature of my data), is it only significant if A is larger than B?
- How many data sets am I comparing? Two, three, twenty-seven?
- This is important because each time you run a test such as the t-test, which compares central tendency, there is a probability that the result is simply due to typical variance. Every additional test increases the likelihood that at least one of the results is due to typical variance, which means that if you have 18 data sets, you are making 18 choose 2 = 153 pairwise comparisons, and the likelihood that at least one of the results is due to variance is, as a result, very high. Therefore, for many data sets it is often more prudent to use an analysis of variance, which tests all the groups at once, rather than many pairwise comparisons of central tendency, though some of those tests are particularly sensitive to noise (such as outliers), so again, you must be careful and know what you are doing.
- Am I measuring two different variables? Am I looking for a correlation (association) between the two? Can I determine whether the correlation reflects a real relationship? (The relationship can be spurious.) What relationship exists, and can I predict it? (regression analysis; this portion of my notes is more readable, but I may have gotten some of it wrong)
- Even a real correlation is not a measure of causality.
- Correlation only points to a relationship among variables: it never tells you which (if any) of the variables is a cause.
- etc.
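The multiple-comparisons point in the list above can be made concrete with a little arithmetic. The following is only a rough sketch: the function names are made up for illustration, the 5% cutoff is just one of the choices mentioned above, and the error-rate formula assumes the tests are independent, which pairwise tests on shared data sets are not. The Bonferroni adjustment at the end is a common, conservative correction that I am adding as an illustration; it is not something from the conversation summarized here.

```python
from math import comb


def pairwise_comparisons(n_groups: int) -> int:
    """Number of distinct pairs among n_groups data sets (n choose 2)."""
    return comb(n_groups, 2)


def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across n_tests tests.

    Assumes the tests are independent, so treat the result as a rough
    guide rather than an exact figure.
    """
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_alpha(n_tests: int, alpha: float = 0.05) -> float:
    """Stricter per-test cutoff that keeps the overall rate near alpha."""
    return alpha / n_tests


n_tests = pairwise_comparisons(18)  # 18 data sets -> 153 pairs
print("pairwise tests:", n_tests)
print("chance of at least one false positive:", familywise_error_rate(n_tests))
print("per-test cutoff after Bonferroni:", bonferroni_alpha(n_tests))
```

With 18 data sets and a 5% cutoff per test, the chance of at least one spurious "significant" result is well above 99% under these assumptions, which is why running every pairwise t-test is a bad idea.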
Once you have asked these questions, then you are in a position to choose a test that will give you meaningful results when applied to your data, and you should understand exactly what the results signify. The questioning process is a bit more complex than what I have summarized above, but as I am not an expert, I need someone who is to refine the list and add what I have left out.
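One of the questions above asks how likely it is that a difference between samples is just the result of typical sample variance. A permutation test is one way to see what that question means without assuming any particular distribution. The sketch below is a minimal illustration, not a recommendation of this test over any other: the data values, the function name, and the 5% cutoff are all invented for the example.

```python
import random
from statistics import mean


def permutation_p_value(sample_a, sample_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in means.

    Shuffles the pooled values many times and counts how often a random
    split gives a difference at least as large as the observed one.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    observed = abs(mean(sample_a) - mean(sample_b))
    pooled = list(sample_a) + list(sample_b)
    n_a = len(sample_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    # Add one to both counts so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Hypothetical measurements for two groups (made up for this example).
group_a = [5.1, 4.9, 5.6, 5.8, 6.0, 5.4, 5.2, 5.7]
group_b = [4.2, 4.5, 4.1, 4.8, 4.4, 4.0, 4.6, 4.3]

alpha = 0.05  # the cutoff you chose *before* looking at the data
p = permutation_p_value(group_a, group_b)
print("p-value:", p, "significant at 5%:", p < alpha)
```

The test is two-sided because it counts differences in either direction; if your hypothesis only cares whether A is larger than B, you would compare the signed difference instead of the absolute one.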
Independent vs. Dependent Variables
Many people are confused by the difference between independent and dependent variables. The simple answer is that the independent variable is the one you change, and the dependent variables are the ones you watch to see if your changes to the independent variable have any effect. All other variables should be controlled/kept constant. How this principle is applied depends heavily on the field in which it is applied, however.
- Note 1: independent and dependent variables are important when you are interpreting the results of a statistical test. They have no effect on the mathematics of the test itself, and whether the test demonstrates a correlation between two variables or not.
- Note 2: The absence of association/correlation is not enough to prove the absence of causation. Similarly, an association/correlation is not enough to prove causation. And a measured correlation may prove to be spurious. Which raises the question, "How can statistics be useful?" The short answer is that an unexpected correlation or lack of correlation will indicate an area where more research needs to be done in order to better understand the mechanisms at work.
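To illustrate Note 2, here is a small simulation in which two variables are strongly correlated even though neither has any effect on the other. Everything in it is invented for the example: a hidden third variable drives both x and y, and the noise levels and sample size are arbitrary choices. The Pearson coefficient is computed by hand so the sketch needs nothing beyond the standard library.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


rng = random.Random(42)  # fixed seed so the run is repeatable

# A hidden variable that the researcher never measured.
hidden = [rng.gauss(0, 1) for _ in range(2000)]

# x and y each depend on the hidden variable, but not on each other.
x = [h + rng.gauss(0, 0.5) for h in hidden]
y = [h + rng.gauss(0, 0.5) for h in hidden]

# A variable with no connection to anything else, for comparison.
unrelated = [rng.gauss(0, 1) for _ in range(2000)]

r_xy = pearson_r(x, y)
r_x_unrelated = pearson_r(x, unrelated)
print("correlation of x and y:", r_xy)
print("correlation of x and unrelated:", r_x_unrelated)
```

The first coefficient comes out large and the second close to zero, yet changing x would do nothing to y. The correlation only tells you that something is worth investigating.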
Science
In the sciences, the above description is taken pretty literally. In each trial or in each experimental group, the independent variable is given a different value/is changed. Experiments are performed in highly controlled environments, where as many variables as possible are kept constant, and randomization is used to control for the rest. This allows the scientist to be as sure as possible that the changes (or lack of changes) one observes in the dependent variables are only due to the changes made in the independent variable.
Example 1: Feather in a Vacuum
If a scientist wanted to demonstrate that Newton's laws still held by testing whether a feather fell at the same rate as a ball in a vacuum, he or she would take two identical vacuum tubes (same size, shape, material, pumping mechanism, manufacturer, ideally bought at the same time, etc.), place the ball in a compartment at the top of one of them, the feather in a compartment at the top of the other one, and release a trap-door to the compartments at the same time. The independent variable would be the shape of the objects. The dependent variable is the time it takes for the objects to fall from the moment the trap-door is released to the moment they hit the bottom of the vacuum tube.
- Note 1: The variation in mass, density, material, etc. could also be controlled for if the scientist was not sure if they might have an effect on the experiment. Since the experiment has been conducted so many times, and it has been demonstrated so conclusively that these variables have no effect on the dependent variable in this experiment, they do not need to be controlled. However, if this was a novel experiment, these variables would have to be controlled; otherwise you could not rule them out as possible causes for the observed behavior of the objects. In cases where it is known that not all variables are controlled, the scientist cannot be sure to have discovered a causal relationship. Often, such relationships are assumed until they are later proven wrong, but this is poor scientific practice.
- Note 2: No matter how hard a scientist tries to control for all the variables, there are inevitably variables he or she will inadvertently fail to control for, either (A) because he or she does not know that they exist, (B) he or she fails to recognize how the variable might impact the experiment, or (C) he or she is unable to devise a method to control the variable, or lacks resources to control a variable. This is one of the reasons why science can never claim to discover the "Truth", but simply is a collection of predictions and rules that work, that allow people to build things or plan for things successfully. All science can demonstrate for sure is things that don't work, things that are not true.
Social Science
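For Example 1, the prediction being tested can be written down directly. Assuming an ideal vacuum and constant gravitational acceleration, an object dropped from rest falls a height h in time t = sqrt(2h / g), and nothing about the object itself appears in the formula. The sketch below is only an illustration of that prediction: the tube height and the two masses are made-up values, and mass stands in here for whatever property of the object is varied.

```python
from math import sqrt

G = 9.81  # gravitational acceleration in m/s^2 (assumed constant)


def fall_time_in_vacuum(height_m: float, mass_kg: float) -> float:
    """Time in seconds to fall height_m from rest with no air resistance.

    mass_kg is accepted only to make the point that it is not used:
    it is the independent variable, and the prediction is that it has
    no effect on the dependent variable (the fall time).
    """
    return sqrt(2 * height_m / G)


tube_height_m = 1.5  # hypothetical tube height, same for both objects

ball_time = fall_time_in_vacuum(tube_height_m, mass_kg=0.1)
feather_time = fall_time_in_vacuum(tube_height_m, mass_kg=0.001)
print("ball:", ball_time, "feather:", feather_time)
```

The experiment then checks whether the measured times agree with this prediction; a consistent difference between the two would point to a variable that was not controlled.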
In the social sciences the researcher usually has less control over the studies he or she is conducting. In studies where statistics are relevant because some measured or categorized aspects of two or more groups are being compared, the researcher is often not manipulating any variable, but simply applying statistical tests to existing data. In the case where the researcher still directly manipulates a variable, that variable is the independent variable, and the variables observed to see if they change are the dependent variables. In the case where the researcher did not manipulate any variable, the decision as to which variable to call independent becomes murky. In the end, it is up to the researcher to make his or her best guess as to which variable is causal, and which variables are affected by the causal variable: the causal variable will then be designated the independent variable, and the others the dependent variables.
In either case, since the environment in the social sciences is impossible to control because there are simply too many variables, the best that statistical tests can demonstrate is a correlation between two variables. To determine cause one must have much more information. And, indeed, it may be impossible to demonstrate any causal relationship.
Conclusions
So it boils down to: what do you want to find out? If you want to know if a phenomenon exists, you need to find a couple of verified anecdotes (multiple sources, I think, is a standard method of verification). If you want to demonstrate that it exists, find some particularly vivid (verified) anecdotes. If you want to know how common it is, do a (careful) survey and use the appropriate statistics to analyze the results.
Learning More About Statistics
Classes:
- Intermediate Level
- SOC 485 Intermediate Social Statistics with Gray Swicegood, usually offered in the Fall
- SOC 486 Intermediate Social Statistics with Gray Swicegood, usually offered in the Spring
- SOC 485 is Part I, SOC 486 is Part II.
- The course descriptions on the web still list them incorrectly (both are listed as SOC 485).
- These classes come highly recommended by any GSLIS student who has taken them.
- Professor Swicegood is an excellent instructor. I really enjoyed his course and highly recommend it. EPSY 580 is also a very good introductory statistics course, if you want to learn the basics. -- Gina 00:45, 23 Dec 2004 (EST)
- EPSY 580 Statistical Inference in Education
- This class is best taught by A. Klein. It is highly recommended that you do not take it with C. Anderson.
- PSYC 406 Statistical Methods I
- PSYC 407 Statistical Methods II
- Nobody has taken either of these classes yet.
- Advanced Level
- 590DA Data Analysis for LIS with David Dubin, expected to be offered in Spring 2006
- This class presupposes that you have recently taken a graduate level statistics course, and he suggests you take it in Sociology, Ed Psych, Psych, or a similar field so that the emphasis of your statistics course is on when it is appropriate to apply the different tests (in math the emphasis is more on deriving formulas to demonstrate mathematical rigor). Other Doc. students in the department highly recommend SOC 485 with Professor Swicegood.
- I think a strong argument can be made that the contents of David's course are something every doctoral student should know, because they will invariably run across journal articles in their academic careers which utilize statistical methods, and they should have the literacy to be able to critically understand and analyze how the results were processed. -Ingbert
Statistics Related Links:
- Statistics Glossary - This source is quite good at explaining statistical terms, though it's not very intuitive to navigate.

