Aprotinin: propensity for confusion
So, once again, aprotinin is bad.
In the New England Journal of Medicine, the Perioperative Ischaemia Research Group just published the results of a large observational study encouraging readers not to use aprotinin after cardiac surgery. One of the main reasons cited in the paper was a doubling in the risk of renal dysfunction and renal failure.
The study is to be commended on its large sample size, an impressive 4374 patients in 69 institutions across the world. Whilst we, as readers and reviewers, are often wary of small studies that conclude no difference when a true difference might exist, why don’t we display a similar scepticism about large studies that conclude a difference when none might exist?
Big is not always better. The authors state that in this setting a randomised trial would be ideal but would be difficult “if not impossible” to conduct, and that observational studies, when sufficiently large, may offer critical insights even in light of recognized limitations.
Unfortunately they are missing the point. A small randomised trial will give you an estimate that is, on average, closer to the truth, because randomisation balances baseline differences between the groups. The result may be imprecise (a wide confidence interval), but that imprecision is progressively narrowed by increasing the sample size.
However, a large observational study may give you a very precise estimate that is far from the truth (influenced by bias arising from differences in the groups compared), and that bias can never be corrected by increasing the sample size.
Precision of the estimate is never more important than a correct estimate!
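To make the contrast concrete, here is a minimal simulation sketch. Everything in it is an assumption chosen for illustration (the 30% share of high-risk patients, the event rates, the treatment probabilities, and the helper name `run_study`); none of it is taken from the NEJM paper. The simulated drug has no effect at all, so the true risk difference is zero.

```python
import math
import random
import statistics


def run_study(n, randomised, rng):
    """Simulate one study of a drug with NO true effect.

    Returns (risk difference, half-width of an approximate 95% CI).
    """
    treated, control = [], []
    for _ in range(n):
        high_risk = rng.random() < 0.3          # assumed 30% high-risk patients
        if randomised:
            gets_drug = rng.random() < 0.5      # coin toss ignores baseline risk
        else:
            # assumed: clinicians preferentially treat high-risk patients
            gets_drug = rng.random() < (0.8 if high_risk else 0.2)
        # the outcome depends on baseline risk only, never on the drug
        p_event = 0.20 if high_risk else 0.05
        event = 1 if rng.random() < p_event else 0
        (treated if gets_drug else control).append(event)
    p1, p0 = statistics.mean(treated), statistics.mean(control)
    se = math.sqrt(p1 * (1 - p1) / len(treated) + p0 * (1 - p0) / len(control))
    return p1 - p0, 1.96 * se


rng = random.Random(1)

rct_estimate, rct_halfwidth = run_study(400, True, rng)
obs_estimate, obs_halfwidth = run_study(100_000, False, rng)
print(f"small RCT:           {rct_estimate:+.3f} +/- {rct_halfwidth:.3f}")
print(f"large observational: {obs_estimate:+.3f} +/- {obs_halfwidth:.3f}")

# Averaged over many repeats, the small trials centre on the true value (zero).
rct_average = statistics.mean([run_study(400, True, rng)[0] for _ in range(200)])
print(f"mean of 200 small RCTs: {rct_average:+.3f}")
```

Under these assumptions the large observational study should report a narrow interval that sits well away from zero, while the small trial reports a wide interval whose centre, over repeated trials, averages out at about zero. Making the observational study larger only tightens the interval around the wrong answer.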
The aprotinin arm had a greater proportion of patients with a history of renal disease, 5 times more patients with previous CABG and twice the number of patients with CABG and valve surgery. Patients with higher surgical risk are also at increased risk of renal failure after surgery.
The researchers used propensity score adjustment to account for differences in measured covariates. There are 3 main ways in which a propensity score can be used to ‘balance’ measured differences. The most common is to use the propensity score to derive a one-to-one match, discarding any unmatched patients before comparing the two groups. The next is to divide the patients into 5 ranges of propensity score and compare the treatments within each range. Finally, the propensity score can be included as a covariate (which is what was done in this paper), so that the estimates are adjusted for the measured differences.
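The three approaches can be sketched on a toy cohort. This is an illustration under stated assumptions, not the paper's analysis: the cohort is simulated, the drug has no effect, and the propensity score is taken as known, whereas in practice it is estimated from the measured covariates (usually by logistic regression). The function names, the 0.01 caliper and the simple linear model in the third method are all hypothetical choices.

```python
import bisect
import random
import statistics

rng = random.Random(1)

# Hypothetical cohort. One measured risk score drives both the choice of
# treatment and the outcome; the drug itself has NO effect on the outcome.
patients = []  # (propensity score, treated?, event 0/1)
for _ in range(20_000):
    risk = rng.random()
    ps = 0.1 + 0.8 * risk                  # assumed known, not estimated
    treated = rng.random() < ps
    event = int(rng.random() < 0.02 + 0.30 * risk)
    patients.append((ps, treated, event))


def risk_diff(rows):
    """Event rate in treated minus event rate in untreated."""
    t = [e for _, tr, e in rows if tr]
    c = [e for _, tr, e in rows if not tr]
    return statistics.mean(t) - statistics.mean(c)


def matched(rows, caliper=0.01):
    """1. One-to-one nearest-neighbour match; unmatched patients are dropped."""
    controls = sorted((ps, e) for ps, tr, e in rows if not tr)
    pair_diffs = []
    for ps, tr, e in rows:
        if not tr or not controls:
            continue
        i = bisect.bisect_left(controls, (ps,))
        # the nearest unused control is on one side of the insertion point
        nearest = min((j for j in (i - 1, i) if 0 <= j < len(controls)),
                      key=lambda j: abs(controls[j][0] - ps))
        if abs(controls[nearest][0] - ps) <= caliper:
            pair_diffs.append(e - controls.pop(nearest)[1])
    return statistics.mean(pair_diffs)


def stratified(rows, n_strata=5):
    """2. Compare within 5 ranges of the score, then average the ranges."""
    ordered = sorted(rows)
    size = len(ordered) // n_strata
    strata = [ordered[k * size:(k + 1) * size] for k in range(n_strata)]
    return statistics.mean([risk_diff(s) for s in strata])


def _residuals(y, x):
    """Residuals of y after a simple least-squares regression on x."""
    mx, my = statistics.mean(x), statistics.mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return [(b - my) - slope * (a - mx) for a, b in zip(x, y)]


def covariate_adjusted(rows):
    """3. Treatment coefficient from a linear model: event ~ treated + score."""
    ps = [r[0] for r in rows]
    tr = _residuals([float(r[1]) for r in rows], ps)
    ev = _residuals([float(r[2]) for r in rows], ps)
    return sum(a * b for a, b in zip(tr, ev)) / sum(a * a for a in tr)


crude = risk_diff(patients)
by_matching = matched(patients)
by_strata = stratified(patients)
by_covariate = covariate_adjusted(patients)
print(f"crude {crude:+.3f}  matched {by_matching:+.3f}  "
      f"stratified {by_strata:+.3f}  covariate {by_covariate:+.3f}")
```

In this toy setting the crude comparison should show a clear excess of events in the treated group, and all three adjusted estimates should fall close to zero. That works here only because the single confounder is measured and captured by the score.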
All well and good. Unfortunately, the problem does not lie in the propensity score balancing of measured variables; it lies in the unmeasured and unknown variables. For example, institutions that used aprotinin in high risk patients might also have had a very tight policy on postoperative fluid restriction that predisposed patients to renal impairment independent of the aprotinin.
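The fluid restriction scenario can be simulated in a few lines. Again this is a sketch with invented numbers: `strict_fluids` is a hypothetical unrecorded institutional policy, the probabilities are arbitrary, and the drug has no effect on renal outcome. With a single binary measured covariate, stratifying on that covariate is the same as stratifying on a propensity score estimated from it.

```python
import random
import statistics

rng = random.Random(7)

patients = []
for _ in range(100_000):
    high_risk = rng.random() < 0.3       # measured, so it can be adjusted for
    strict_fluids = rng.random() < 0.5   # institutional policy, NOT recorded
    p_drug = 0.1 + 0.4 * high_risk + 0.4 * strict_fluids
    p_renal = 0.03 + 0.05 * high_risk + 0.06 * strict_fluids  # no drug term
    patients.append({
        "high_risk": high_risk,
        "strict_fluids": strict_fluids,
        "drug": rng.random() < p_drug,
        "renal": int(rng.random() < p_renal),
    })


def adjusted_risk_diff(rows, key):
    """Risk difference within each stratum, weighted by stratum size."""
    strata = {}
    for row in rows:
        strata.setdefault(key(row), []).append(row)
    weighted = 0.0
    for members in strata.values():
        treated = [r["renal"] for r in members if r["drug"]]
        control = [r["renal"] for r in members if not r["drug"]]
        weighted += len(members) * (statistics.mean(treated)
                                    - statistics.mean(control))
    return weighted / len(rows)


# What the analyst can do: adjust for the measured covariate only.
measured_only = adjusted_risk_diff(patients, lambda r: r["high_risk"])
# What would be needed: adjust for the hidden policy as well.
with_hidden = adjusted_risk_diff(
    patients, lambda r: (r["high_risk"], r["strict_fluids"]))

print(f"adjusted for measured risk only: {measured_only:+.3f}")
print(f"adjusted for hidden policy too:  {with_hidden:+.3f}")
```

Adjusting for the measured covariate alone should still leave an apparent excess of renal events with the drug, and that excess only disappears once the hidden policy is included, which a real analyst could not do because it was never recorded.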
Statistics can prove association but not causation. A good point made in this study is the demonstration of a dose response to aprotinin, which may be suggestive of causation, but a strong argument can only arise from a series of (previous or future) randomised trials.