For those of us reading a lot of studies, if you are asked for a Christmas wish, there is a book that I can recommend very much: Stuart Ritchie's "Science Fictions: Exposing Fraud, Bias, Negligence and Hype in Science". I knew quite a few things that can go wrong with studies, but this book taught me a few new tricks -- and things to look for. And unfortunately something in that book fits this study: there is something fundamentally wrong if the sample sizes (number of patients) are as small as in this study, which means the study will either show false results or greatly exaggerate real ones.
Mind you, I haven't read the full text of this study, and I can't say whether its results are correct or not. But I do see a few red flags just from reading the abstract. I just thought this study would make a good example to share some of what I have learned and to explain the generic issues with small numbers.
The researchers took RLS sufferers and had them fill out a questionnaire. Then they looked for statistical trends and reported the ones that reached statistical significance. And you should never, ever do that. For one, questionnaires are notoriously unreliable. (Do all patients have the same understanding? One person may have a totally different concept of what severe pain is than the next person. Recall bias also matters, which is well known from nutritional science: could you say how many eggs you ate in the last year? Do you count eggs used in bread/cake/convenience foods, and how many of those did you have? Note that if you think eggs are unhealthy, for whatever reason, you will probably vastly underestimate the number of eggs you ate.) But let's assume for the rest of this post that all patients filled out the questionnaire perfectly. The reason why whatever result you see will be either wrong or exaggerated is purely mathematical, and it lies in the small numbers.
The problem is best explained with an example. Consider an ordinary two-sided coin. If you do 1000 flips and heads comes up, say, 60% of the time, we can be very sure that the coin isn't fair. But if we do 10 flips and heads comes up 6 times, there is a strong possibility that this is just random chance and the coin is fair. Studies use the concepts of the p-value and statistical significance to express this. The quest in studies is to reach "statistical significance", which is defined as a p-value of at most 0.05 -- and you have a much better chance of getting your paper published with a p-value of 0.049 than with 0.051.
The p-value, roughly speaking, is the probability of seeing data at least this extreme purely by chance, assuming there is no real effect. If you flip a coin 4 times and heads comes up every time, you might assume that it's "loaded" and not a fair coin. However, the chance of getting 4 heads in a row with a fair coin is 1 in 16 (6.25%), so if you wrote a research paper about it, you'd have to write "this is not a fair coin, heads is more likely to come up than tails, p-value 0.0625 (not statistically significant)". The "not statistically significant" means that a fair coin would produce this result more than 5% of the time -- the effect may not really be there and you just happened to see an unusual run. On the other hand, do 5 flips, and if heads comes up every time, the p-value is 0.03125 -- statistically significant.
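To make these coin numbers concrete, here is a small Python sketch (my own illustration, nothing from the study) that computes the one-sided p-value -- the chance that a coin with a given heads probability shows at least that many heads just by luck:

[code]
from math import comb

def p_at_least(heads, flips, p_heads=0.5):
    """Probability of seeing >= `heads` heads in `flips` flips of a coin
    that lands heads with probability `p_heads` (one-sided p-value)."""
    return sum(comb(flips, k) * p_heads**k * (1 - p_heads)**(flips - k)
               for k in range(heads, flips + 1))

print(p_at_least(4, 4))        # 0.0625   -> 4 heads in 4 flips, not significant
print(p_at_least(5, 5))        # 0.03125  -> 5 heads in 5 flips, "significant"
print(p_at_least(6, 10))       # ~0.377   -> 6 heads in 10 flips, could easily be chance
print(p_at_least(600, 1000))   # tiny     -> 600 heads in 1000 flips, almost certainly not fair
[/code]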
(1) Take 1000 fair coins and flip each of them 5 times. (Think of 1000 different questions in your questionnaire.) On average, about 31 of these coins will show heads 5 times (1 in 32) and about 31 will show tails 5 times. By your protocol, you would report that you have found 62 unfair coins (after all, the p-value for each of them is 0.03125) -- even though all coins were fair!
(2) Now we do a second experiment. Take an unfair coin that has a 60% probability of showing heads and 40% tails. The probability of getting heads 5 times in 5 flips is then 7.78%. Now flip 1000 such coins 5 times each. You will find that (on average) about 78 coins show heads 5 times, so you would conclude that these 78 coins are unfair (correctly) and have a 100% chance of coming up heads (dead wrong), again with p=0.03125 (a fair coin would still have a 3.125% chance of giving this result). You would also see "no statistically significant result" for the remaining 922 coins (again dead wrong).
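If you want to see both effects for yourself, here is a quick Python simulation of the two thought experiments above (again my own sketch, not anything from the study):

[code]
import random

random.seed(1)  # make the run reproducible
N_COINS, N_FLIPS = 1000, 5

def count_all_heads(p_heads):
    """Flip N_COINS coins N_FLIPS times each; return how many show heads every time."""
    return sum(
        all(random.random() < p_heads for _ in range(N_FLIPS))
        for _ in range(N_COINS)
    )

# Experiment (1): 1000 perfectly fair coins.
fair_hits = count_all_heads(0.5)
print(f"fair coins flagged as 'always heads': {fair_hits} (expected ~31)")

# Experiment (2): 1000 coins biased to 60% heads.
biased_hits = count_all_heads(0.6)
print(f"biased coins flagged as 'always heads': {biased_hits} (expected ~78)")
print(f"biased coins we failed to flag at all: {N_COINS - biased_hits} (expected ~922)")
[/code]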
And this illustrates the problem with small numbers. As in experiment (1), some associations will show up simply due to random chance even though they aren't really there. But even worse, as you see from experiment (2), you will miss most real associations (we couldn't show anything for 922 coins), and the associations you do find will be vastly exaggerated (we thought that 78 coins had a 100% chance of coming up heads).
This is not a theoretical example. We had exactly this in genetics when researchers looked for "critical genes", genes that would determine some major trait (intelligence, height or whatever). Researchers did small studies and identified a lot of "candidate" genes, claiming strong associations with personality traits -- and all of them were wrong and couldn't be verified in follow-up studies with larger numbers.
The problem, of course, is the method. A good study lists all the effects it wants to check before the study is started, and then reports all results (regardless of the p-value). Looking for effects in data you have already collected is the equivalent of the "Texas sharpshooter fallacy" (https://en.wikipedia.org/wiki/Texas_sharpshooter_fallacy): shoot 30 bullets randomly at a barn and you will find some bullet holes grouped close together. Then draw a bullseye around these bullet holes and claim that something attracted the bullets to this spot -- there has to be a relation there!
Now, you *can* see good results with small numbers, but then the p-value will be much lower than 0.05. Take smoking, for example, which increases the risk of lung cancer by a factor of 10 to 30 (depending on the study) -- if you take 100 lung cancer patients, then 90 or more of them will have been smokers. However, if we talk about RLS, there is literally no hope of finding such a clear-cut association.
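As a rough sketch of why a huge effect survives even a small sample, here is the same tail-probability calculation applied to the smoking example; the 20% background smoking rate below is just an assumed number for illustration, not a figure from any study:

[code]
from math import comb

def p_at_least(hits, n, p):
    """One-sided p-value: probability of >= `hits` successes in `n` trials."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(hits, n + 1))

# Hypothetical numbers: if only 20% of the general population smoked, how likely
# is it that 90 out of 100 lung cancer patients are smokers purely by chance?
print(p_at_least(90, 100, 0.20))   # astronomically small, far below 0.05
[/code]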
Anyway, do get the book. It's a fascinating read.