Introduction

The following analysis workflow is a companion to the manuscript “Revealing the impact of structural variants in multiple myeloma” by Rustad et al. We applied the Piecewise Constant Fitting (PCF) algorithm for structural variant (SV) hotspot discovery in multiple myeloma, taking advantage of a large dataset of 752 patients with long-insert, low-coverage (4-8X) whole genome sequencing. This approach was first described by Glodzik et al. (Nature Genetics, 2017), and we refer to that publication for methodological details. The initial implementation has since been generalized for use with different datasets, with utility functions for every step of the analysis, from background model generation to data visualization.

Glodzik et al. initially performed SV hotspot analysis for tandem duplications associated with BRCA-deficient breast cancer, after first identifying two rearrangement signatures (RS) in these patients, characterized by short (RS1) and long (RS3) non-clustered tandem duplications. Compared with multiple myeloma, SVs are much more frequent in breast cancer (median 93 SVs per patient in breast cancer vs. 16 in multiple myeloma), particularly in BRCA-deficient patients. Consequently, we had to make several adjustments when adapting this approach to multiple myeloma.

Given the considerably lower number of SVs in multiple myeloma, we found that it was not feasible to perform hotspot analysis separately for each individual SV class. Combining SV classes has the advantage that different classes with similar downstream effects may produce a stronger combined signal. Although the optimal background model for each SV class would differ somewhat, there were enough similarities that a combined background model performed significantly better than random. Training separate background models for each SV class and combining them to predict overall SV burden did not outperform the combined model.

The main challenge when combining different SV classes is dealing with highly clustered events such as chromothripsis. These events behave very differently from single SVs and would dominate the results. On the other hand, excluding them altogether would discard valuable information. We therefore performed two parallel analyses on subsets of the data and combined the results.

Our two complementary analyses were: 1) non-clustered SVs only and 2) all SV classes with downsampling.

In the non-clustered SV analysis, we included all breakpoints of the single deletion, inversion, translocation and tandem duplication classes. In addition, we included templated insertion breakpoints. Although many templated insertions involve multiple chromosomes and are technically classified as “complex”, their local footprints usually contain only a few breakpoints. We therefore found it worthwhile to include the information from this SV class here.

In the downsampling analysis, we included all SV classes, but applied a downsampling function so that only one breakpoint per 500 kb per patient was included in the PCF analysis. In this way, highly clustered events did not disproportionately affect the analysis. Although we used binning for the purpose of downsampling, the PCF algorithm itself was run with standard settings, without binning.

The use of downsampling introduced an element of randomness to the analysis. Although the number of SV breakpoints in each 500 kb bin always remained the same (0 or 1), randomly selecting one out of several breakpoints affects the breakpoint distribution within the region, which is crucial in defining hotspots. We therefore repeated the downsampling procedure followed by PCF 1000 times, requiring hotspots to be reproduced in >95% of iterations.
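
To make the procedure concrete, the core of the downsampling step can be sketched as follows. This is an illustration only: the analysis itself uses downsampleRearrBins() from utils/, and the grouping-based random selection shown here is an assumption about how that function works.

# minimal sketch of per-patient downsampling to one breakpoint per 500 kb bin
# (illustration only; the analysis calls downsampleRearrBins() from utils/)
downsampleSketch <- function(bps, binSize = 500000, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  bps %>%
    dplyr::mutate(bin = floor(position / binSize)) %>%  # assign each breakpoint to a 500 kb bin
    dplyr::group_by(sample, chr, bin) %>%
    dplyr::slice_sample(n = 1) %>%                      # keep one random breakpoint per patient per bin
    dplyr::ungroup() %>%
    dplyr::select(-bin)
}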

Below we go through the analysis workflow in three steps: 1) background model generation, 2) background model validation and 3) hotspot analysis.

options("scipen"=999, "digits"=4)

# PACKAGES
packages <- c("MASS", "dplyr", "stringr", "GenomicRanges", "BSgenome.Hsapiens.UCSC.hg19", "boot", "ggplot2",
              "ggpubr", "RColorBrewer", "reshape2", "mosaic", "pscl", "gridExtra", "stringi", "valr", "Hmisc")

invisible(suppressWarnings(suppressMessages(lapply(packages, library, character.only = TRUE))))

# UTILITIES
source('utils/plotCoefficients.R')
source('utils/prepareBinData.R')
source('utils/eval.bgModel.R')
source('utils/drawRandomRearrs.R')
source('utils/runLightPcf.R')
source('utils/drawSegmentation.R')
source('utils/fastPCF.R')
source('utils/extract.kat.regions2.R')
source('utils/annotateBgModel.R')
source('utils/calcIntermutDist.R')
source('utils/calc.gw.mut.rate.R')
source('utils/downsampleRearrBins.R')
source("utils/count.sv.breakpoints.R")
source("utils/plotHotspots.R")

# where data is stored and output written
fp <- '../data/'
out <- '../data/results/'

Background model generation

The underlying assumption of this hotspot analysis was that a large proportion of SVs are passenger events, which may nonetheless cluster in loci whose features make them prone to acquire (passenger) SVs. We assumed that hotspots can be defined with higher sensitivity and specificity by adjusting the threshold for calling a hotspot to the local background SV rate. We determined the background SV rate empirically by training a negative binomial regression model to predict the SV rate from a number of genomic features. We excluded the immunoglobulin loci and canonical translocation partner regions to reduce the signal from events under positive selection. Although most SV hotspots remained in the dataset when determining the background SV rate (since, by definition, hotspots had not yet been defined at this stage), the signal from a limited number of hotspots would be diluted by the vast stretches of the genome that are not under selection. For the purpose of background model generation, we separated the genome into 500 kb bins.
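
Concretely, for each SV subset we fit a negative binomial regression with glm.nb() from the MASS package (the full fitting loop appears further below). The schematic below, using an arbitrary subset of the features for brevity, shows the form of the model; because of the log link, exp(coefficient) can be read as the fold-change in expected breakpoint count per standard deviation of a normalized feature.

# schematic of the background model fit (the real loop over all features and SV subsets is below);
# to.bg.model holds one row per 500 kb bin with normalized features and observed SV counts
fit <- glm.nb(sv.count ~ medianRepTime.norm + meanCn.norm + dnase.norm, data = to.bg.model)
exp(coef(fit)) # fold-change in expected SV count per SD of each feature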

Analysis setup

# z-score normalization; if a reference vector refv is supplied, its mean and SD are used instead
normaliseVector <- function(v, refv=NULL) {
  if (is.null(refv)) {
    r <- (v - mean(v, na.rm=TRUE))/sd(v, na.rm=TRUE)
  } else {
    r <- (v - mean(refv, na.rm=TRUE))/sd(refv, na.rm=TRUE)
  }
  return(r)
}

# Null model variable setup
model.variables <- c('noNbases', 'medianRepTime', 'meanCn', 'dnase', 'highExpGenes', 'lowExpGenes', 
                     'staining', 'fragile', 'counts.alu', 'counts.rep','segDup', 'H3K27ac.primary')

model.variable.names <- c('N bases', 'early replication (GM06990)', 'copy number', 'DNAse (GM12878)', 
                          'highly expressed genes', 'lowly expressed genes', 'chromatin staining', 
                          'fragile sites', 'ALU elements', 'repeats', 'segmental duplications', 'H3K27ac (Primary MM)')

names(model.variable.names) <- paste0(model.variables , '.norm')

# bin size for background model
bin.size <- 500000

# list of file paths for background model

bedList <- c(
  '../data/mmData/mm.genes.high.expr.tsv',
  '../data/mmData/mm.genes.low.expr.tsv',
  '../data/mmData/wgEncodeOpenChromDnaseGm12878Pk.narrowPeak',
  '../data/mmData/chromBandsPos',
  '../data/mmData/broadFragile37Chrom.txt',
  '../data/mmData/alu.repeatMasker',
  '../data/mmData/repeatMasker1000',
  '../data/mmData/GRCh37GenomicSuperDup.nohead.tab',
  '../data/mmData/Merged.MM.primary.K27ac.remove2.5k.bed'
)

names(bedList) <- c('highExpGenes', 'lowExpGenes', 'dnase', 'staining', 'fragile', 'counts.alu', 'counts.rep','segDup', 'H3K27ac.primary')

# path to replication time reference -- the following columns are required: c('chr', 'chromStart', 'chromEnd', 'timing')
repTimingBed <- paste0(fp, 'mmData/lymph_cell_line_06990_replication_time2.bedgraph')

# prepare the copy number data
# data frame with columns Chromosome chromStart  chromEnd  total.copy.number.inTumour
cnv.df <- read.csv(paste0(fp, 'mmData/cnv.mm.csv'))

# read all SVs

sv <- read.delim(paste0(fp, 'mmData/200505_sv_final.txt'), stringsAsFactors = F)

# read bed file with recurrent SV regions

recurrent <- read.table(paste0(fp, 'mmData/recurrent.tsv'), stringsAsFactors = F, sep = " ", header = F)
names(recurrent) <- c("chrom", "start", "end", "recurrent")

# prepare the breakpoints -- immunoglobulin loci and canonical partners annotated (i.e., CCND1, CCND2, MMSET, MAF, MAFA, MAFB)

sv.bp.raw <- rbind(sv %>% dplyr::select(sample, chrom1, pos1, SVTYPE, sv_pcawg, key) %>% dplyr::rename(chr = chrom1, position = pos1),
                   sv %>% dplyr::select(sample, chrom2, pos2, SVTYPE, sv_pcawg, key) %>% dplyr::rename(chr = chrom2, position = pos2))

recurrent_keys <- sv.bp.raw %>%
  filter(SVTYPE == "TRA") %>%
  mutate(chrom = paste0("chr", chr),
         start = position,
         end = position) %>%
  bed_intersect(recurrent, suffix = c(".sv", "")) %>%
  distinct(key.sv, recurrent) %>%
  dplyr::rename(key = key.sv)

bps.all <- sv.bp.raw %>%
  left_join(recurrent_keys, by = "key")

# setup SV for hotspot analysis
# pf: integer code for the SV class (DEL = 2, DUP = 4, INV = 1, all other types = 32)
sv <- sv %>%
  mutate(pf = ifelse(SVTYPE == 'DEL', 2,
                     ifelse(SVTYPE == 'DUP', 4,
                            ifelse(SVTYPE == 'INV', 1, 32)))) %>%
  dplyr::select(sample, chrom1, pos1, chrom2, pos2, pf, SVTYPE, sv_pcawg)

names(sv) <- c('sample', 'Chromosome.1', 'pos.1', 'Chromosome.2', 'pos.2', 'pf', 'svtype', 'sv_pcawg')

# cytoband reference
cyto.bed <- read.delim(paste0(fp, 'mmData/cytoband_b37.bed'), stringsAsFactors = F)
colnames(cyto.bed)=c("chrom","start","end","band","direction")

# chromosome borders for plotting
chrom_borders <- rbind(cyto.bed %>%
                              filter(chrom != "Y") %>%
                              group_by(chrom) %>%
                              summarise(start = max(end)-1,
                                        end = max(end)),
                       data.frame(chrom = c(1:22, "X"),
                                  start = 0,
                                  end = 1)) %>%
      as.data.frame()

# input/output file names for model data
binDataPath <- paste0(fp, 'data-bin-', bin.size, '.tsv')

if(file.exists(binDataPath)){
  # if data frame of background features per bin has already been generated, import previously stored data
  binData <- read.delim(binDataPath, stringsAsFactors = F)
  
} else {
  # if no background data exists, generate new model
  # generate data frame of genomic features per bin
  binData <- prepareBinData(bedList=bedList, 
                            cnDf=cnv.df, 
                            repTimingBed=repTimingBed
  )
  
  # replace missing values: replication timing NA -> 0; copy number NA/NaN -> 2 (diploid)
  binData <- binData %>%
    mutate(medianRepTime = ifelse(is.na(medianRepTime), 0, medianRepTime),
           meanCn = ifelse(is.nan(meanCn) | is.na(meanCn), 2, meanCn))
  
  # normalize features across bins
  already.normalized <- c('medianRepTime') # vector of variables already normalized in the input data, to avoid double normalization
  
  for (vi in 1:length(model.variables)) {
    if(model.variables[vi] %in% already.normalized) {
      vn <- paste0(model.variables[vi], '.norm')
      binData[[vn]] <- binData[[model.variables[vi]]]
    } else {
      normalisedV <-  normaliseVector(binData[,as.character(model.variables[vi])])
      vn <- paste0(model.variables[vi], '.norm')
      binData[[vn]] <- normalisedV
    }
  }
  
  # write output
  write.table(binData, file = binDataPath, sep = "\t", quote = F, row.names = F)
}

Generate combined and class-specific empirical background models

# set up list with different SV subsets
pcawg_classes <- c("DEL", "chromothripsis", "INV", "complex", "TRA", "chromoplexy", "DUP", "templated_insertion")
non_clustered_classes <- c("DEL", "INV", "TRA", "DUP", "templated_insertion")

# count SVs per bin for different classes
bpListBg <- list()
bpListBg$sv.all <- filter(bps.all, is.na(recurrent))
bpListBg$downsampling <- downsampleRearrBins(filter(bps.all, is.na(recurrent)), binSize = bin.size, seed = 42)
bpListBg$non.clustered <- filter(bps.all, sv_pcawg %in% non_clustered_classes, is.na(recurrent))

for(i in 1:length(pcawg_classes)){
  bpListBg[[i+3]] <- filter(bps.all, sv_pcawg == pcawg_classes[i], is.na(recurrent))
  names(bpListBg)[[i+3]] <- pcawg_classes[i]
}
sv.bp.per.bin <- countSVs(bpListBg)
bg.model.names <- names(bpListBg)

# generate final data frame with genomic features and SV counts
to.bg.model <- left_join(binData, sv.bp.per.bin, by = c("chr", "chromStart", "chromEnd"))

# fit background models and calculate SV rate if data does not already exist

bgModelDataPath <- paste0(fp, 'bgModel-bin-', bin.size, '.Rdata')
rateDataPath <- paste0(fp, 'rates-bin-', bin.size, '.Rdata')
bgPredictPath <- paste0(fp, 'predicted-bin-', bin.size, '.txt')

if(file.exists(bgModelDataPath) & file.exists(rateDataPath) & file.exists(bgPredictPath)){
  # load existing data if possible
  load(bgModelDataPath) # object name: bg.models.list
  load(rateDataPath) # object name: sv.rate.list
  bg.predict <- read.delim(bgPredictPath, stringsAsFactors = F)

}  else {
  # if files do not exist, generate new data and save results
  
  bg.models.list <- list()
  sv.rate.list <- list()
  bg.predict <- to.bg.model[c("chr", "chromStart", "chromEnd")]
  
  
  for(i in 1:length(bg.model.names)){
    sv.feature <- bg.model.names[i]
    
    # negative binomial regression 
    formula.nb <- as.formula(paste(sv.feature, ' ~ ' , paste(paste0(model.variables,'.norm'), collapse=' + ')))
    bg.models.list[[i]] <- glm.nb(formula.nb, data = to.bg.model) # negative binomial fit
    names(bg.models.list)[i] <- sv.feature
    
    # calculate SV rate
    sv.rate.list[[i]] <- calc.gw.mut.rate(bpListBg[[sv.feature]])
    names(sv.rate.list)[i] <- sv.feature
    
    # predict SV burden per bin
    bg.predict$pred.count <- exp(predict(bg.models.list[[i]], to.bg.model, type = "link", se.fit=TRUE)$fit)
    names(bg.predict)[i+3] <- sv.feature
  }
  
  write.table(bg.predict, file = bgPredictPath, row.names = F, quote = F, sep = "\t")
  save(bg.models.list, file = bgModelDataPath)
  save(sv.rate.list, file = rateDataPath)
  
}

# get background model coefficients
bg.coefs.list <- list()
for(i in 1:length(bg.model.names)){
  nb.fit <- bg.models.list[[i]]
  coeff.df <- data.frame(coeff=nb.fit$coefficients, se=coef(summary(nb.fit))[, "Std. Error"])
  coeff.df <- coeff.df[2:nrow(coeff.df),] # drop the intercept row
  coeff.order <- order(coeff.df$coeff)
  coeff.df<- coeff.df[coeff.order, ]
  coeff.df$labels <- model.variable.names[as.character(rownames(coeff.df))]
  coeff.df$model <- bg.model.names[i]
  
  bg.coefs.list[[i]] <- coeff.df 
}

bg.coefs <- bind_rows(bg.coefs.list) %>%
  mutate(exp_coeff = exp(coeff),
         exp_se_lower = exp(coeff-se),
         exp_se_upper = exp(coeff+se)
         )

Plotting the background model coefficients for non-clustered SVs

# plot non-clustered background model

bg.coefs %>%
  filter(model %in% c("non.clustered")) %>%
  mutate(model = gsub("non.clustered", "Non-Clustered SVs", model),
         model = capitalize(model)) %>%
  ggplot(aes(reorder(labels, exp_coeff), exp_coeff))+
    geom_col(fill = "lightblue")+
    geom_hline(yintercept = 1, size = 0.5) +
    geom_errorbar(aes(ymin = exp_se_lower, ymax = exp_se_upper), size = 1, width = 0.5, col = "darkgray")+
    coord_flip()+
    labs(y = "exp(coeff)",
         x = "Model features")+
    scale_y_continuous(expand = c(0,0.01), breaks = c(0,0.5,1,1.3), limits = c(0,1.4))+
    theme_bw()+
    theme(axis.text = element_text(color = "black"),
          text = element_text(size = 11),
          strip.background = element_blank(),
          strip.text = element_text(size = 12),
          panel.grid = element_blank(),
          axis.title.y = element_blank()
          )

Validation of empirical background models

Models for all SVs together and individual SV classes

To validate the background models, we compared their prediction accuracy with random permutation in a series of bootstrapping experiments, using 70% of the bins for training and the remaining 30% as a test dataset. Prediction accuracy, measured as mean squared error (MSE), is shown for the test dataset.
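
The logic of a single bootstrap iteration is sketched below. The actual implementation, including the exact permutation scheme, lives in utils/eval.bgModel.R; the shuffling of observed counts shown here is one plausible baseline, not necessarily the one used.

# one illustrative bootstrap iteration: 70/30 train/test split, negative binomial fit,
# and test-set MSE compared against a permutation baseline (illustrative sketch only)
oneBootIteration <- function(bgData, formula.nb, propTrain = 0.7) {
  idx <- sample(nrow(bgData), size = floor(propTrain * nrow(bgData)))
  train <- bgData[idx, ]
  test <- bgData[-idx, ]
  fit <- glm.nb(formula.nb, data = train)
  pred <- predict(fit, newdata = test, type = "response")
  mse.model <- mean((test$sv.count - pred)^2)
  mse.perm <- mean((test$sv.count - sample(test$sv.count))^2) # shuffled-count baseline
  c(model = mse.model, permutation = mse.perm)
}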

# generate list of evaluation data for each background model

evalDataList <- list()
formula.nb <- as.formula(paste("sv.count", ' ~ ' , paste(paste0(model.variables,'.norm'), collapse=' + ')))
iter <- 1000 # number of bootstrap iterations

evalDataPath <- paste0(fp, 'eval-data-bin-', bin.size, '.Rdata')

if(file.exists(evalDataPath)){
  load(evalDataPath) # object name evalDataList
} else{
  
  for(i in 1:length(bg.model.names)){
    sv.feature <- bg.model.names[i]
    
    # background model evaluation parameters
    evalDataList[[i]] <- suppressWarnings(evaluateBgModel(bgData = dplyr::rename(to.bg.model, sv.count = toString(sv.feature)), 
                                                          formula.nb = formula.nb, 
                                                          iterBoot = iter, 
                                                          propTrain = 0.7))
    names(evalDataList)[i] <- sv.feature
  }
  
  save(evalDataList, file = evalDataPath)
}

Prediction accuracy of the background model on test data compared with random permutation.

# model summary for all models
eval.summary.list <- list()
for(i in 1:length(bg.model.names)){
  temp <- evalDataList[[bg.model.names[i]]]$MSEsummary
  temp$sv <- bg.model.names[i]
  eval.summary.list[[i]] <- temp
}

eval.summary <- bind_rows(eval.summary.list) 

# permutation background summary
permutation_bg <- eval.summary %>%
  filter(model == "Permutation") %>%
  mutate(variable = paste0(variable, "_permutation")) %>%
  dcast(sv ~ variable, value.var = "MSE")

# estimate fold-change of MSE for median predicted versus the permutation background model
# values > 1 indicate improvement of MSE with the prediction model compared with permutation
# median predicted value is compared with median and 95% CI for the permutation, similar to how P-values were estimated

eval.summary_wide <- eval.summary  %>%
    filter(model == "Null-model" & variable == "median") %>%
    dcast(sv + p.val ~ variable, value.var = "MSE") %>%
    left_join(permutation_bg, by = "sv") %>%
    mutate(fold_change_median = median_permutation/median,
           fold_change_CI2.5 = CI2.5_permutation/median,
           fold_change_CI97.5 = CI97.5_permutation/median,
           fold_change_CI97.5 = ifelse(fold_change_CI97.5 > 5, 5, fold_change_CI97.5),
           p.val.str = ifelse(p.val < 1/iter, paste0('p<', as.character(1/iter)), paste0('p=', round(p.val,3))))

Comparison of the background model and the permutation background for non-clustered SVs.

# plot model error compared with the permutation background
plotCompareModels(evalDataList$non.clustered)

Summary of background model versus permutation for each SV class separately

# plot model summary for single-class models
eval.summary_wide %>%
    filter(sv %in% pcawg_classes) %>% 
    mutate(fold_change_CI97.5 = ifelse(fold_change_CI97.5 > 3, 3, fold_change_CI97.5),
           sv = gsub("templated_insertion", "Templated ins", sv),
           sv = capitalize(sv)) %>%
    ggplot(aes(reorder(sv, fold_change_median), fold_change_median))+
      geom_col(fill = "lightblue")+
      geom_hline(yintercept = 1, size = 0.5) +
      geom_errorbar(aes(x = reorder(sv, fold_change_median), 
                        ymin = fold_change_CI2.5, ymax = fold_change_CI97.5), size = 1, width = 0.5, col = "darkgray")+
      labs(y = "Fold-improvement (Mean Squared Error)\nPrediction model vs. permutation")+
      scale_y_continuous(expand = c(0,0.05), breaks = c(0,0.5,1,1.5,2))+
      theme_bw()+
      theme(axis.text = element_text(color = "black"),
            text = element_text(size = 12),
            strip.background = element_blank(),
            strip.text = element_text(size = 12),
            panel.grid = element_blank(),
            axis.title.y = element_blank()
            )+
          coord_flip()

Correlation between observed and predicted SV counts in the test data (left) and density plots showing observed and predicted SV breakpoint counts per bin (right).

# observations versus predictions on test data
eval.obs.vs.pred(evalDataList$non.clustered$testData)

Hotspot analysis by the Piecewise Constant Fitting (PCF) algorithm

The PCF algorithm is a segmentation method for sequential data. We used PCF to find genomic segments with a much higher SV breakpoint density than neighboring regions and higher than expected from the background model. The algorithm takes three parameters: gamma, a smoothing parameter that determines how sensitive the segmentation is to slight local differences in breakpoint density; imd.factor, the required fold-change above the background rate for a segment to be considered a hotspot; and kmin.my, the minimum number of samples that must contribute breakpoints to a hotspot. We refer to the original publication by Glodzik et al. for a detailed description and validation of the PCF parameters. We used the same values for imd.factor and kmin.my as in the original publication, but reduced the gamma parameter to account for the lower number of SVs in our dataset.
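
The input to PCF is the sequence of inter-breakpoint distances along each chromosome (log-transformed when logScale=T below); a run of consistently short distances relative to the expected background becomes a hotspot candidate. A minimal sketch of this preprocessing step is shown here; the version used by the workflow is calcIntermutDist() from utils/.

# sketch of the inter-breakpoint distance computation that PCF segments
# (illustration only; the workflow uses calcIntermutDist() from utils/)
intermutDistSketch <- function(bp) {
  bp %>%
    dplyr::arrange(chr, position) %>%
    dplyr::group_by(chr) %>%
    dplyr::mutate(imd = position - dplyr::lag(position)) %>% # distance to previous breakpoint
    dplyr::ungroup() %>%
    dplyr::filter(!is.na(imd)) %>%
    dplyr::mutate(log.imd = log10(imd + 1)) # log scale, per logScale = TRUE
}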

Hotspots of non-clustered SVs

# parameters of the PCF algorithm
gamma <- 5
imd.factor <- 2 # significance threshold
kmin.my <- 8 # minimum number of samples

# run hotspot analysis
hotspots.out.nonClustered <- runLightPcf(rearr.df=filter(sv, sv_pcawg %in% non_clustered_classes), # rearrangements 
                                         kmin=kmin.my,# kmin parameter for PCF
                                         gamma=gamma, # gamma parameter for PCF
                                         bg.rate=sv.rate.list$non.clustered$gw.rate, # genome-wide rate of breakpoints for given class
                                         logScale=T, # log-scale transformation of inter-mutation distances
                                         doMerging=T, # merge adjacent hotspots
                                         obs.exp.thresh=imd.factor, # required observed/expected fold-change (PCF parameter)
                                         allBins = bg.predict, # expected breakpoints/bin
                                         exp.col='non.clustered', # column from allBins to use as background rate per bin
                                         binSize = 500000,
                                         downsampleBins = FALSE,
                                         seed = NULL
                                         )
## [1] "Total number of breakpoints: 19564"
# assign an ID to each hotspot
hotspots.out.nonClustered$hotspot.id <- paste0('nonClustered_chr', 
                                               hotspots.out.nonClustered$chr,'_', 
                                               round(hotspots.out.nonClustered$start.bp/1000000,1), 'Mb' ) 

# Save hotspot figures
drawSegmentation(rearr.df = filter(sv, sv_pcawg %in% non_clustered_classes), 
                 hotspots.df = hotspots.out.nonClustered, 
                 fn = paste0(out, 'nonClustered.segmentation.i=', imd.factor , '-g=', gamma, '-n=', kmin.my, '.pdf'), 
                 main.text = paste0('Hotspot analysis: all nonClustered'))
## quartz_off_screen 
##                 2

Downsampling hotspot analysis

# number of hotspot analyses
iterations <- 1000

# parameters of the PCF algorithm
gamma <- 5
imd.factor <- 2 # significance threshold
kmin.my <- 8 # minimum number of samples

# downsampling data path
downHotDataPath <- paste0(out, 'down.hot.iter=', iterations, '-i=', imd.factor, '-g=',gamma, '-n=', kmin.my, '.Rdata')

# run hotspot analysis
if(file.exists(downHotDataPath)){
      load(downHotDataPath) #object name: hot.down.list
} else {
      hot.down.list <- list()
      
      for(i in 1:iterations){
            hot.down.list[[i]] <- runLightPcf(rearr.df=sv, # rearrangements 
                                              kmin=kmin.my,# kmin parameter for PCF
                                              gamma=gamma, # gamma parameter for PCF
                                              bg.rate=sv.rate.list$downsampling$gw.rate, # genome-wide rate of breakpoints for given class
                                              logScale=T, # log-scale transformation of inter-mutation distances
                                              doMerging=T, # merge adjacent hotspots
                                              obs.exp.thresh=imd.factor, # required observed/expected fold-change (PCF parameter)
                                              allBins = bg.predict, # expected breakpoints/bin
                                              exp.col='downsampling', # column from allBins to use as background rate per bin
                                              binSize = 500000,
                                              downsampleBins = TRUE,
                                              seed = NULL
            )
            
            hot.down.list[[i]]$chr <- as.character(hot.down.list[[i]]$chr)
            
            # assign an ID to each hotspot
            hot.down.list[[i]]$hotspot.id <- paste0('downsampling_chr', 
                                                    hot.down.list[[i]]$chr,'_', 
                                                    round(hot.down.list[[i]]$start.bp/1000000,1), 'Mb' ) 
            # set iteration variable
            hot.down.list[[i]]$iter <- i
      }
      save(hot.down.list, file = downHotDataPath)
}

hot.down.df <- bind_rows(hot.down.list)

# generate bins

overlap.bins = 1000 # 1 kb

binList <- list()
for (c in c(as.character(1:22), 'X')) {
  chr.length <- seqlengths(Hsapiens)[[paste0('chr', c)]]
  binStarts <- seq(from=1, to=chr.length-overlap.bins, by=overlap.bins)
  binEnds <- binStarts + overlap.bins - 1
  binList[[c]] <- data.frame(chrom=c, start=binStarts, end=binEnds, stringsAsFactors = F)
}
bins.df <- do.call('rbind', binList)
bins.df$bin <- row.names(bins.df)
row.names(bins.df) <- NULL
bins.df$mid <- (bins.df$start + bins.df$end)/2

# number of hotspot iterations overlapping each bin
down.bins <- hot.down.df %>%
      dplyr::select(chr, start.bp, end.bp, iter) %>%
      dplyr::rename(chrom = chr, start = start.bp, end = end.bp) %>%
      bed_intersect(bins.df, suffix = c("", ".bin")) %>%
      distinct(chrom, iter, bin.bin) %>%
      dplyr::rename(bin = bin.bin) %>%
      group_by(bin) %>%
      summarise(hotspot_called = n()) %>%
      right_join(bins.df, by = "bin") %>%
      mutate(hotspot_called = ifelse(is.na(hotspot_called), 0, hotspot_called)) %>%
      as.data.frame() 

down.bins.toplot <- rbind(down.bins %>%
                                dplyr::select(-end, -mid),
                        chrom_borders %>%
                                mutate(hotspot_called = 0,
                                       bin = 1.1) %>%
                                dplyr::select(-end)) %>%
      mutate(hot_prop = hotspot_called/iterations)

ggplot()+
      geom_line(data = down.bins.toplot %>% filter(chrom == 1), aes(start/1000000, hot_prop), size = 0.7, col = "lightblue")+
      geom_ribbon(data = down.bins.toplot %>% filter(chrom == 1), aes(ymin = 0, ymax = hot_prop, x = start/1000000), 
                  fill = "lightblue")+
      geom_hline(yintercept = 0.95, size = 1, linetype = 2)+
      theme_bw()+
      guides(col = F)+
      theme(text = element_text(size = 12),
            axis.text = element_text(color = "black"),
            strip.background = element_rect(color = "white", fill="white"),
            legend.title = element_blank(),
            legend.position = "top",
            panel.spacing = unit(0.0005, "lines"),
            panel.border = element_rect(size = 0.5, colour = "#666666", fill = NA),
            panel.grid = element_blank(),
            plot.margin = margin(t = 0, r = 0, b = 0, l = 0, unit = "pt"))+
      labs(x = "Position (Mb)",
           y = "Called as hotspot (prop)")

Combined hotspots

# significant hotspots = called in >95 % of iterations

down_pcf <- down.bins %>%
  mutate(prop_called = hotspot_called/iterations) %>%
  filter(prop_called >= 0.95) %>%
  bed_merge(max_dist = 5) %>%
  as.data.frame() %>%
  mutate(hotspot.id = paste0("down_chr", chrom, "_", round(start/1000000, 1), "Mb"),
         size = end-start)

## preparing nonclustered hotspots

nonclustered_pcf <- hotspots.out.nonClustered %>%
  dplyr::rename(chrom = chr, start = start.bp, end = end.bp) %>%
  dplyr::select(hotspot.id, chrom, start, end) %>%
  mutate(size = end-start)

## merge hotspots

# Hotspots non-clustered
noclust.gr <- with(nonclustered_pcf, GRanges(chrom, 
                                      IRanges(start=start, 
                                              end=end)))

# Hotspots downsampling
down.gr <- with(down_pcf, GRanges(chrom, 
                                        IRanges(start=start, 
                                                end=end)))

# check overlap between hotspot approaches

ols_1 <- findOverlaps(noclust.gr, down.gr)

down_pcf$shared <- 'unique'
down_pcf$partner_size <- 0
down_pcf$shared[subjectHits(ols_1)] <- nonclustered_pcf$hotspot.id[queryHits(ols_1)]
down_pcf$partner_size[subjectHits(ols_1)] <- nonclustered_pcf$size[queryHits(ols_1)]


ols_2 <- findOverlaps(down.gr, noclust.gr)

nonclustered_pcf$shared <- 'unique'
nonclustered_pcf$partner_size <- 0
nonclustered_pcf$shared[subjectHits(ols_2)] <- down_pcf$hotspot.id[queryHits(ols_2)]
nonclustered_pcf$partner_size[subjectHits(ols_2)] <- down_pcf$size[queryHits(ols_2)]

# define which hotspots to keep: unique hotspots are always kept; where the two analyses overlap, keep the smaller (more focal) of the pair

nonclustered_pcf <- nonclustered_pcf %>%
  mutate(keep = ifelse(shared == "unique" | size < partner_size, 1, 0))

down_pcf <- down_pcf %>%
  mutate(keep = ifelse(shared == "unique" | size < partner_size, 1, 0))

## combine
combined_hotspots <- rbind(down_pcf %>%
                              filter(keep == 1) %>%
                              mutate(analysis = ifelse(shared == "unique", 'downsampling', "both")),
                           nonclustered_pcf %>%
                              filter(keep == 1) %>%
                               mutate(analysis = ifelse(shared == "unique", 'nonClustered', "both")))
head(combined_hotspots)
write.table(combined_hotspots, file = paste0(out, "sv_hotspots_combined.txt"), sep = "\t", quote = F, row.names = F)