Below, we examine two common pitfalls of density plots made using kernel density estimates (KDEs). Overall, KDEs have satisfactory performance in summarising observations of continuous random variables, but, as highlighted below, when the variable is bounded, or contains point masses, the density estimate can misrepresent the data without any issues being immediately apparent from the density plot.
Included is also a comparison between the density estimates provided by ggplot and ggdist. By default, ggdist attempts to automatically detect he bounds of the observed data.
Simple continuous example
Sample \(x_1,\dots,x_N\), from Laplace distribution with mean, \(\mu = 0\), and scale, \(b=1\), that is, the density function of \(x\) is \(f(x) = \frac{1}{2}\exp\left(-|x|\right)\).
We see a difference in the tails of the density plots, when the boundedness is taken into account in the KDE computations.
Code
p2 <-ggplot(mapping =aes(x = x2)) +stat_density(aes(colour ="Bounded"), bw ="SJ", geom ="line", bounds =c(-1.5, .5)) +stat_density(aes(colour ="Naive"), bw ="SJ", geom ="line") +stat_slab(aes(colour ="ggdist"), fill =NA, normalize ="none", scale =1) +labs(title ="Bounded observation", subtitle ="KDE computed with and without the bounds",x =TeX("$x$"), y ="KDE", colour ="Method") +scale_colour_vibrant(limits =c("ggdist", "Naive", "Bounded", "2"), breaks =c("ggdist", "Naive", "Bounded")) +scale_y_continuous(expand =c(0,NA))p2
The ECDF difference plot also suggests calibration issues with the unbounded KDE. Both the bounded ggplot density and the automatic bound detection by ggdist look well calibrated.
Simulate an example, where measurements under a certain threshold are reported as zeros. Here \(x\sim\exp(x)\) with all values sampled below \(0.1\) reported as zero.
Code
x3 <-rexp(1000, 1)x3[x3 < .1] <-0
If the measurement threshold is known, the data can be visualized as a mixture of a discrete point mass and a continuous distribution. The “naive” KDE seems to under estimate the density near \(0.1\), and can’t show the spike at zero.
Below, the density of the continuous mixture component is scaled by \(P(x \neq 0)\) to make the area of the bar and the KDE together add to one.
Code
p3 <-ggplot() +geom_rect(aes(xmin =0, xmax = .1, ymin =0, ymax =10*sum(x3 ==0) /length(x3), colour ="Mixture", fill ="Mixture"),alpha = .3) +stat_density(aes(x = x3[x3 >0], y = (1-sum(x3 ==0) /length(x3)) *after_stat(density), colour ="Mixture", fill ="Naive"), bw ="SJ", alpha =0, bounds =c(.1, Inf), outline.type ="upper") +stat_density(aes(x = x3, colour ="Naive", fill ="Naive"), bw ="SJ", alpha =0, linewidth =1, outline.type ="upper") +stat_slab(aes(x = x3, colour ="ggdist - full", fill ="ggdist - full"),slab_fill =NA, normalize ="none", scale =1) +labs(title ="Observation with point mass",subtitle ="KDE for full observation and only continuous part",x =TeX("x"), y ="", colour ="Method", fill ="Method") +scale_colour_vibrant(limits =c("ggdist - full", "Naive", "Mixture", "2"), breaks =c("ggdist - full", "Naive", "Mixture")) +scale_fill_vibrant(limits =c("ggdist - full", "Naive", "Mixture", "2"),breaks =c("ggdist - full", "Naive", "Mixture")) +scale_y_continuous(expand =c(0,NA))p3
The calibration of the KDE for the continuous part should be inspected separately, as the scaled density above would show miscalibration due to missing part of the observation distribution.
Again, the ECDF difference plot suggest issues with the naive approach and thus could be used to warn user if the naive density plot is drawn.
Here the automatic bound detection by ggdist produces a KDE that is well calibrated after the over saturation of PIT values of zero and the sudden drop where no observations are covered by the density estimate. When restricted to the continuous part of the mixture, both the ggplot and ggdist methods give calibrated estimates.
---title: "Pitfalls of density plots"author: "Teemu Säilynoja"date: "2023-05-15"date-modified: "`r format(Sys.Date(), '%Y-%m-%d')`"format: html: page-layout: full toc: true code-fold: true code-tools: true code-line-numbers: true---```{r}#| label: imports#| message: falselibrary(ggplot2)library(ggdist)library(bayesplot)library(khroma)library(latex2exp)library(sfsmisc)source("../../../code/helpers.R")source("../../../code/kde_tests.R")set.seed(37645624)good_theme <- bayesplot::theme_default(base_family ="Sans", base_size =14) +theme(axis.text =element_text(colour ="#666666", size =12),axis.ticks =element_line(colour ="#666666"),title =element_text(colour ="#666666", size =16),plot.subtitle =element_text(colour ="#666666", size =14),legend.text =element_text(colour ="#666666", size =12),legend.title =element_text(colour ="#666666", size =14),axis.line =element_line(colour ="#666666"))theme_set(good_theme)``````{r}ecdf_difference_limits <-ecdf_confidence_intervals(gamma =adjust_gamma_optimize(1000,1000, .95),N =1000, K =1000)x0 <-0:1000/1000p0 <-ggplot(mapping =aes(x = x0)) +geom_step(aes(y = (ecdf_difference_limits$lower -0:1000) /1000)) +geom_step(aes(y = (ecdf_difference_limits$upper -0:1000) /1000)) +labs(title ="ECDF Difference",subtitle ="95% outer simultaneous confidense intervals",x ="PIT",y ="ECDF - Difference",colour ="Method")```## IntroBelow, we examine two common pitfalls of density plots made using kernel density estimates (KDEs).Overall, KDEs have satisfactory performance in summarising observations of continuous random variables,but, as highlighted below, when the variable is bounded, or contains point masses, the density estimatecan misrepresent the data without any issues being immediately apparent from the density plot.Included is also a comparison between the density estimates provided by `ggplot` and `ggdist`. By default, `ggdist` attempts to automatically detect he bounds of the observed data.## Simple continuous exampleSample $x_1,\dots,x_N$, from Laplace distribution with mean, $\mu = 0$, and scale, $b=1$, that is,the density function of $x$ is $f(x) = \frac{1}{2}\exp\left(-|x|\right)$.```{r}#| label: sample 1x1 <-sample(c(-1,1), 1000, replace = T) *rexp(1000)``````{r}p1 <-ggplot(mapping =aes(x = x1)) +stat_density(aes(colour ="ggplot"), bw ="SJ", geom ="line") +stat_slab(aes(colour ="ggdist"), fill =NA, normalize ="none", scale =1) +labs("Laplace distribution",subtitle ="N = 1000\nx ~ 1/2 exp(-|x|)",x =TeX("$x$"), y ="KDE", colour ="Method") +scale_y_continuous(expand =c(0,NA)) +coord_cartesian(ylim =c(0,1.1*max(density(x1)$y)))p1```The ECDF difference plot doesn't suggest any issues with the KDE```{r}#| message: falsep0 +geom_step(aes(y =ecdf(pit_from_densityplot(p1, 1, x1))(x0) - x0,colour ="ggplot")) +geom_step(aes(y =ecdf(pit_from_densityplot(p1, 2, x1, T))(x0) - x0,colour ="ggdist")) +scale_colour_vibrant(limits =c("ggdist", "ggplot", "1", "2"), breaks =c("ggdist", "ggplot"))```## Bounded continuous dataSample as above, but the distribution is truncated to the interval $[-1.5,0.5]$.```{r}#| label: sample 2x2 <- x1while (!all(x2 >-1.5, x2 < .5)) { mask <- (x2 <-1.5| x2 > .5) x2[mask] =sample(c(-1,1), sum(mask), replace = T) *rexp(sum(mask))}```We see a difference in the tails of the density plots, when the boundedness is taken into account in the KDE computations.```{r}p2 <-ggplot(mapping =aes(x = x2)) +stat_density(aes(colour ="Bounded"), bw ="SJ", geom ="line", bounds =c(-1.5, .5)) +stat_density(aes(colour ="Naive"), bw ="SJ", geom ="line") +stat_slab(aes(colour ="ggdist"), fill =NA, normalize ="none", scale =1) +labs(title ="Bounded observation", subtitle ="KDE computed with and without the bounds",x =TeX("$x$"), y ="KDE", colour ="Method") +scale_colour_vibrant(limits =c("ggdist", "Naive", "Bounded", "2"), breaks =c("ggdist", "Naive", "Bounded")) +scale_y_continuous(expand =c(0,NA))p2```The ECDF difference plot also suggests calibration issues with the unbounded KDE. Both the bounded `ggplot` density and the automatic bound detection by `ggdist` look well calibrated.```{r}#| label: ecdf 2#| message: falsep0 +geom_step(aes(y =ecdf(pit_from_densityplot(p2, 2, x2))(x0) - x0,colour ="Naive")) +geom_step(aes(y =ecdf(pit_from_densityplot(p2, 1, x2))(x0) - x0, colour ="Bounded")) +geom_step(aes(y =ecdf(pit_from_densityplot(p2, 3, x2, T))(x0) - x0,colour ="ggdist")) +labs(colour ="Method") +scale_colour_vibrant(limits =c("ggdist", "Naive", "Bounded", "2"), breaks =c("ggdist", "Naive", "Bounded"))```## Continuous data with point massesSimulate an example, where measurements under a certain threshold are reported as zeros.Here $x\sim\exp(x)$ with all values sampled below $0.1$ reported as zero. ```{r}#| label: sample 3x3 <-rexp(1000, 1)x3[x3 < .1] <-0```If the measurement threshold is known, the data can be visualized as a mixture ofa discrete point mass and a continuous distribution. The "naive" KDE seems to under estimate the density near $0.1$, and can't show the spike at zero. Below, the density of the continuous mixture component is scaled by $P(x \neq 0)$to make the area of the bar and the KDE together add to one.```{r}p3 <-ggplot() +geom_rect(aes(xmin =0, xmax = .1, ymin =0, ymax =10*sum(x3 ==0) /length(x3), colour ="Mixture", fill ="Mixture"),alpha = .3) +stat_density(aes(x = x3[x3 >0], y = (1-sum(x3 ==0) /length(x3)) *after_stat(density), colour ="Mixture", fill ="Naive"), bw ="SJ", alpha =0, bounds =c(.1, Inf), outline.type ="upper") +stat_density(aes(x = x3, colour ="Naive", fill ="Naive"), bw ="SJ", alpha =0, linewidth =1, outline.type ="upper") +stat_slab(aes(x = x3, colour ="ggdist - full", fill ="ggdist - full"),slab_fill =NA, normalize ="none", scale =1) +labs(title ="Observation with point mass",subtitle ="KDE for full observation and only continuous part",x =TeX("x"), y ="", colour ="Method", fill ="Method") +scale_colour_vibrant(limits =c("ggdist - full", "Naive", "Mixture", "2"), breaks =c("ggdist - full", "Naive", "Mixture")) +scale_fill_vibrant(limits =c("ggdist - full", "Naive", "Mixture", "2"),breaks =c("ggdist - full", "Naive", "Mixture")) +scale_y_continuous(expand =c(0,NA))p3```The calibration of the KDE for the continuous part should be inspected separately,as the scaled density above would show miscalibration due to missing part of theobservation distribution. ```{r}p4 <-ggplot() +stat_density(aes(x = x3[x3 >0], colour ="Mixture"), bw ="SJ", bounds =c(.1, Inf),geom ="line") +stat_slab(aes(x = x3[x3 >0], colour ="ggdist - cont."),slab_fill =NA, normalize ="none", scale =1, fill =NA) +labs(colour ="Method") +scale_colour_vibrant(limits =c("ggdist", "Naive", "Mixture - cont.", "ggdist - cont."), breaks =c("Mixture", "ggdist - cont."))p4```Again, the ECDF difference plot suggest issues with the naive approach and thus could be used to warn user if the naive density plot is drawn.Here the automatic bound detection by `ggdist` produces a KDE that is well calibrated after the over saturation of PIT values of zero and the sudden drop where no observations are covered by the density estimate.When restricted to the continuous part of the mixture, both the `ggplot` and `ggdist` methods give calibrated estimates.```{r}#| label: ecdf 3#| message: falsep0 +geom_step(aes(y =ecdf(pit_from_densityplot(p3, 3, x3))(x0) - x0, colour ="Naive")) +geom_step(aes(y =ecdf(pit_from_densityplot(p4, 1, x3[x3 >0]))(x0) - x0, colour ="Mixture")) +geom_step(aes(y =ecdf(pit_from_densityplot(p3, 4, x3, T))(x0) - x0,colour ="ggdist - full")) +geom_step(aes(y =ecdf(pit_from_densityplot(p4, 2, x3[x3 >0], T))(x0) - x0,colour ="ggdist - cont.")) +labs(colour ="Method") +scale_colour_vibrant(limits =c("ggdist - full", "Naive", "Mixture - cont.", "ggdist - cont."), breaks =c("ggdist - full", "Naive", "Mixture - cont.", "ggdist - cont."))```