a, Schematic of the variational autoencoder used to extract sequence features of promoter and enhancer cCREs. After the model is trained, the latent space is sampled for each cCRE and then projected onto two dimensions. b, Density scatterplots of the UMAP projection of the trained variational autoencoder’s latent space for each cCRE, stratified by cCRE class. Red indicates high density, and dark blue indicates low density. c, Density scatterplot depicting the correlation between the first visualized dimension in a and b and the GC content of the input cCRE sequence (PCC, R = −0.93). d, Density plots showing the distribution of dimension 1 values (sequence scores) for promoters (top), stratified by gene class (protein coding: red, lncRNA: orange), and distal enhancers (bottom), stratified by overlap with experimentally derived transcription start sites (long-read RNA: pink, PRO-cap: green, neither: yellow). All distributions are significantly different (pairwise two-sided Wilcoxon tests with FDR correction, p < 1.0 × 10⁻⁶).
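The comparison in panel c can be illustrated with a minimal sketch that computes the GC content of each input sequence and its Pearson correlation with a per-sequence score. The function names `gc_content` and `pearson_r`, the choice to ignore ambiguous bases, and the toy sequences and scores are assumptions made for illustration only; they are not taken from the study's code or data.

```python
from math import sqrt


def gc_content(seq: str) -> float:
    """Fraction of G/C among unambiguous bases (A, C, G, T); N and others are ignored."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0


def pearson_r(x, y) -> float:
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Toy data (hypothetical): sequences of increasing GC content paired with
# made-up dimension-1 values that decrease as GC content rises.
toy_seqs = ["ATATATAT", "ATGCATAT", "GCGCATAT", "GCGCGCAT", "GCGCGCGC"]
toy_scores = [2.0, 1.0, 0.0, -1.0, -2.0]

toy_gc = [gc_content(s) for s in toy_seqs]
r = pearson_r(toy_scores, toy_gc)
print(f"GC content: {toy_gc}")
print(f"Pearson R: {r:.2f}")
```

In this toy case the scores are an exact decreasing linear function of GC content, so the coefficient sits at the negative extreme; real cCRE scores would yield a weaker value such as the one reported in the caption.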