Combining multiple gravitational-wave observations allows for stringent tests of general relativity, targeting effects that would otherwise be undetectable in single-event analyses. We highlight how the finite size of the observed catalog introduces a significant source of variance. If this variance is not appropriately accounted for, general relativity can be excluded with arbitrarily large credibility even if it is the underlying theory of gravity. This effect is generic and entirely analogous to the so-called ``cosmic variance'' of cosmology: in essence, we only have one catalog that contains all the events. We show that the cosmic variance persists for arbitrarily large catalogs and cannot be suppressed by selecting ``golden'' observations with large signal-to-noise ratios. We present a mitigation strategy based on bootstrapping (i.e.~resampling with replacement) that allows one to assign uncertainties to the credibility of the targeted test. We demonstrate our findings using both toy models and real gravitational-wave data. In particular, we quantify the impact of the cosmic variance on the ringdown properties of black holes using the latest LIGO/Virgo catalog.