dREG Gateway

Documentation

1)  Login:
The user needs to log in by clicking the 'login' link at the top-right corner of the page. Creating an account is free and easy, and provides a number of benefits.

dREG login

2)  Create a new project (optional)
Optionally, users can choose to make a new 'project' in the dREG gateway to archive a collection of dREG data from related experiments. This will allow a collection of experiments to be stored in close proximity to each other.

dREG project

3)  Start new dREG
Select the 'Start dREG' menu below the dREG logo to create a new analysis for your data, as shown in the following screenshot.

dREG experiment

4)  Select bigWig files
Select bigWig files representing PRO-seq, GRO-seq, or ChRO-seq signal on the plus and minus strands. Please note that two GPU resources are currently available; at present it is easier to obtain computing resources on Comet.sdsc.xsede.org than on Bridges.psc.edu.

dREG experiment create

5)  Submit the job
Click the 'save and launch' button. The bigWig files are transferred to the XSEDE server and a job is scheduled in a GPU queue to run dREG. After submitting, the user can check the status on the next web page, as shown below. Depending on the queue status, the job may wait a long time before prediction starts. Once started, it takes only 1-4 hours to complete.

6)  Check the status
The user can check the status of their 'experiment' by clicking the menu 'Saved dREG runs' below the dREG logo.

dREG experiment browse

7)  Check the results
Once a job is completed, the user can select 'Full results' in the drop-down list and then LEFT-click the 'Download' link on the experiment summary page to download a compressed file, whose contents are described in the 'output' sheet on this page. Alternatively, the user can download any single file from the drop-down list. On Linux, a downloaded file with the 'tar.gz' extension can be decompressed with the 'tar' command, and a file with the 'gz' extension with the 'gunzip' command. Please don't use RIGHT-click to open a new tab for downloading.
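For users who prefer a script to the 'tar' and 'gunzip' commands, the same two decompression steps can be sketched in Python with the standard library only. The function name `extract_results` and the example file names are illustrative assumptions, not part of the gateway.

```python
import gzip
import shutil
import tarfile
from pathlib import Path


def extract_results(archive, dest):
    """Unpack a '.tar.gz' archive into dest, then gunzip every '.gz' file found there.

    Hypothetical helper for illustration; only use it on archives you trust.
    """
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)

    # Step 1: equivalent of 'tar -xvzf archive -C dest'.
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest)

    # Step 2: equivalent of 'gunzip' on each compressed member.
    unpacked = []
    for gz_path in sorted(dest.rglob("*.gz")):
        if gz_path.name.endswith(".tar.gz"):
            continue  # leave nested archives alone
        out_path = gz_path.with_suffix("")  # drop the trailing '.gz'
        with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        unpacked.append(out_path)
    return unpacked
```

A call such as `extract_results("out.tar.gz", "dreg_results")` would leave both the compressed and the decompressed copies in the `dreg_results` folder.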

Downloading in Safari can be problematic because Safari tries to unzip the compressed results automatically using an incompatible decompression method. Please check this link to disable this feature.

dREG experiment summary

8)  Switch to Genome Browser
A convenient feature of the gateway is that the user can view the results in the Genome Browser by clicking the 'Switch to genome browser' link. The genome identifier must be specified in one of two ways: 1) select it from the drop-down list, or 2) type it into the text box. Please use LEFT-click to open a genome browser window.

dREG experiment summary

9)  Check the storage
The user can LEFT-click the 'Open Folder' link on the experiment summary page to check the storage for the current job, or click the 'Storage' menu under the dREG logo to check the folders and files for all jobs (experiments). The following figure shows the data files in a job's folder, including two bigWig files, one result in bedGraph format, and two output files from the job scheduler on the GPU nodes.

dREG experiment summary

10)  When you meet failure
Currently, two types of error can occur when you run dREG jobs. The first, called a system error, comes from the system, for example when no computing time is available on specific GPU nodes or when an internal error occurs in Apache Airavata. The second, called a bigWig error, is caused by the user's bigWig files; it can occur when read counts are normalized, when each read is mapped to a region rather than to a single base, or when read counts on the minus strand are positive values. The following figures show how to identify each error and how to handle it.

a)  System error
Soon after users submit the experiment, the failure is shown on the experiment summary page, as in Figure 10-S1 or 10-S2. The experiment status is "Failed" and many Java errors are listed under the "Errors" item. Users cannot solve this problem themselves and should report the error to the webmaster.

System error(1)
Figure 10-S1

System error(2)
Figure 10-S2

b)  Bigwig error
After the experiment is complete, no results can be downloaded and the job status shows a failure (see Figure 10-S3). Users can identify the problem from the dREG log file or the task log file. Enter the "storage directory" by clicking the "open" link. There, users can find the "ARCHIVE" folder, to which Apache Airavata copies back all files from the computing node. Check the dREG log file (out.dREG.log) for bigWig problems, or check the task log file ("slurm-tasknoxxx.out") to find out why the task was aborted. Figures 10-S4 and 10-S5 give two examples of this kind of error. If the bigWig has problems, please refer to this link to solve them.

Bigwig error
Figure 10-S3

This figure shows the bigWig problems in the dREG log file.

Bigwig error(1)
Figure 10-S4

This figure shows the task log file, which explains that the task was killed due to the time limit.

Bigwig error(2)
Figure 10-S5
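When the log files are long, a small script can help locate the relevant lines. The sketch below scans the lines of a downloaded log for a time-limit message and for generic error or warning lines. The search patterns are assumptions about what such logs typically contain (Slurm usually reports "DUE TO TIME LIMIT" when it cancels a job), and `scan_log` is a hypothetical helper, so adjust both to the messages in your own files.

```python
import re

# Assumed marker texts; they are not taken from the dREG gateway itself.
TIME_LIMIT = re.compile(r"DUE TO TIME LIMIT", re.IGNORECASE)
PROBLEM = re.compile(r"\b(error|warning)\b", re.IGNORECASE)


def scan_log(lines):
    """Return (line number, kind, text) for every suspicious line in a log."""
    findings = []
    for number, line in enumerate(lines, start=1):
        if TIME_LIMIT.search(line):
            findings.append((number, "time limit", line.strip()))
        elif PROBLEM.search(line):
            findings.append((number, "problem", line.strip()))
    return findings
```

For example, `scan_log(open("out.dREG.log"))` lists the flagged lines of the dREG log file after it has been downloaded from the "ARCHIVE" folder.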

The dREG gateway is a web service built on the Apache Airavata software framework and the XSEDE platform using the following software packages:

[1] dREG package: https://github.com/Danko-Lab/dREG.

The dREG package was developed to detect divergently oriented RNA polymerase in GRO-seq, PRO-seq, or ChRO-seq data using support vector machines (the e1071 or Rgtsvm package).

[2] dREG.HD package: https://github.com/Danko-Lab/dREG.HD.

The dREG.HD package refines the location of TREs obtained using dREG by imputing DNase-I hypersensitivity.

[3] Rgtsvm package: https://github.com/Danko-Lab/Rgtsvm.

Rgtsvm implements support vector classification and support vector regression on a GPU to accelerate the computational speed of training and predicting large-scale models.

[4] Airavata PHP Gateway: https://github.com/apache/airavata-php-gateway.git.

Airavata PHP Gateway provides an API to build web sites which interact with high performance computers that are part of XSEDE.

The input to dREG consists of two bigWig files which represent the position of RNA polymerase on the positive and negative strands. The sequence alignment and processing steps to make the input bigWig files are a major factor influencing how accurately dREG predicts TIRs. dREG makes several assumptions about data processing that are critical for success.

Critical elements of a bioinformatics pipeline that is compatible with dREG will include:

  • Representing RNA polymerase location using a single base.

    PRO-seq measures the location of the RNA polymerase active site, in many cases at nearly single nucleotide resolution. Therefore, it is logical to represent the coordinate of RNA polymerase using the genomic position that best represents the polymerase location, rather than representing the entire read. dREG assumes that each read is represented in the bigWig file by a single base. We have noted poor performance when reads are extended. It is critical that users pass in bigWig files that represent RNA polymerase using a single nucleotide.

  • Include a copy of the Pol I transcription unit in the reference genome.

    PRO-seq data resolves the location of all four RNA polymerases found in Metazoan cells (Pol I, II, III, and Mt). DNA encoding the Pol I transcription unit is highly repetitive, and is not included in most mammalian reference genomes. Nevertheless, the Pol I transcription unit is a substantial source of reads in a typical PRO-seq experiment (10-30%). Many of these reads will align spuriously to retrotransposed and non-functional copies of the Pol I transcription unit, which can create mapping artifacts. To solve this issue, we include a single copy of the repeating DNA that encodes the Pol I transcription unit in the reference genome used to map reads. We use GenBank ID# U13369.1. Including a copy of this transcription unit provides an alternative place for Pol I reads to map, preventing reads from accumulating in Pol I repeats.

  • Trim 3' adapters, but leave the fragments.

    Much of the signal for dREG comes from paused RNA polymerase. RNA polymerase pauses 30-60 bp downstream of the transcription start site. Due to this short RNA fragment length, paused reads in most PRO-seq libraries will sequence a substantial amount of adapter. This leads to poor mapping rates in full-length reads. Therefore, it is crucial to remove contaminating 3' adapters so that paused fragments will map to the reference genome properly.

  • Data represents unnormalized raw counts.

    dREG assumes that data represents the number of individual sequence tags that are located at each genomic position. For this reason, it is critical that input data is not normalized. The dREG server checks to ensure that input data is expressed as integers, and will return an error if this is not the case.
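Two of the requirements above can be checked before uploading. The sketch below works on plain lists of per-base values that are assumed to have been read from the plus-strand and minus-strand bigWig files already (for example with a bigWig reader of your choice). It tests that all values are integers and that minus-strand values are not positive. The function name is hypothetical, and this is not the check that the dREG server itself runs.

```python
def check_strand_values(plus_values, minus_values):
    """Return a list of problems found; an empty list means none were detected.

    Illustrative sketch only: it does not detect reads that were extended
    over a region instead of being represented by a single base.
    """
    problems = []
    for name, values in (("plus", plus_values), ("minus", minus_values)):
        # Normalized data usually shows up as fractional values.
        if any(not float(v).is_integer() for v in values):
            problems.append(f"{name} strand: non-integer values")
    # Minus-strand counts are expected to be zero or negative.
    if any(v > 0 for v in minus_values):
        problems.append("minus strand: positive values")
    return problems
```

An empty result, as in `check_strand_values([0, 2, 1], [0, -3, -1])`, means that neither of the two problems was found in the values supplied.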

Users can also use scripts generated in the Danko lab to create compatible bigWig files. Options for scripts at different starting points in the analysis are given below:

  • Convert raw fastq files into bigWig.

    Our pipeline produces bigWig files that are compatible with dREG, and can be found at the following URL: https://github.com/Danko-Lab/proseq_2.0. Our PRO-seq pipeline takes single-end or paired-end sequencing reads (fastq format) as input. The pipeline automates routine pre-processing and alignment steps, including removing adapter sequences, trimming reads based on base quality, and deduplicating reads if UMI barcodes are used. Sequencing reads are mapped to a reference genome using BWA. Aligned BAM files are converted into bigWig format in which each read is represented by a single base.

  • Convert mapped reads in BAM files into bigWigs.

    We provide a tool that converts mapped reads from a BAM file into bigWig files that are compatible with dREG. This tool is available here: https://github.com/Danko-Lab/RunOnBamToBigWig.
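To illustrate what "each read is represented by a single base" means, the sketch below collapses aligned fragments into per-base counts on each strand. It is not the Danko lab tool. It assumes that reads are given as 0-based, half-open intervals with the strand of the nascent RNA, that the 3' end of the RNA marks the polymerase position (which end applies depends on the library preparation), and that minus-strand counts are stored as negative integers.

```python
from collections import Counter


def single_base_counts(reads, which_end="3prime"):
    """Collapse (chrom, start, stop, strand) fragments into single-base counts.

    Hypothetical helper; coordinates are 0-based and half-open, and
    which_end selects the RNA end that stands for the polymerase position.
    """
    plus, minus = Counter(), Counter()
    for chrom, start, stop, strand in reads:
        if strand == "+":
            pos = stop - 1 if which_end == "3prime" else start
            plus[(chrom, pos)] += 1
        else:
            pos = start if which_end == "3prime" else stop - 1
            minus[(chrom, pos)] -= 1  # minus-strand counts are kept negative
    return plus, minus
```

The resulting counts are unnormalized integers, one position per read, which is the representation described in the requirements above.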

Other considerations:

    The quality and quantity of the experimental data are major factors in determining how sensitive dREG will be in detecting TREs. We have found that dREG has a reasonable statistical power for discovering TREs with as few as ~40M uniquely mappable reads, and saturates detection of TREs in well-studied ENCODE cell lines with >80M reads. To increase the number of reads available for TRE discovery, we encourage users to merge biological replicates in order to improve statistical power prior to running dREG.
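    Merging replicates amounts to adding the raw per-base counts position by position. The sketch below assumes each replicate is a mapping from (chromosome, position) to an integer count, which is an illustrative data layout rather than a dREG format, and compares the total number of reads with the approximate depths quoted above.

```python
from collections import Counter


def merge_replicates(replicates):
    """Sum per-base counts from several replicates (one strand at a time)."""
    merged = Counter()
    for counts in replicates:
        for position, n in counts.items():
            merged[position] += n
    return merged


def total_reads(*strands):
    """Total reads across strands; minus-strand counts may be negative."""
    return sum(abs(n) for counts in strands for n in counts.values())


def depth_note(n_reads):
    """Place a read total relative to the ~40M and 80M reads mentioned above."""
    if n_reads < 40_000_000:
        return "below ~40M reads"
    if n_reads <= 80_000_000:
        return "between ~40M and 80M reads"
    return "above 80M reads"
```

    The merged counts remain unnormalized integers, so they still satisfy the input requirements of dREG.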

    We have found that visualizing aligned data in a genome browser (e.g., IGV or UCSC) prior to downstream analysis is a useful way to catch any data quality or alignment issues.

1) A dREG run generates a compressed file containing the dREG results, as follows:

 

File name | Description
$PREFIX.dREG.infp.bed.gz | Informative positions with dREG scores predicted by the dREG model. Decompress it with 'gunzip' in Linux.
$PREFIX.dREG.peak.full.bed.gz | Significant peaks (FDR < 0.05) with dREG scores, p-values, and the center positions where the maximum dREG scores are located. Decompress it with 'gunzip' in Linux.
$PREFIX.dREG.peak.score.bed.gz | Significant peaks (FDR < 0.05) with dREG scores only. Decompress it with 'gunzip' in Linux.
$PREFIX.dREG.peak.prob.bed.gz | Significant peaks (FDR < 0.05) with p-values only. Decompress it with 'gunzip' in Linux.
$PREFIX.dREG.raw.peak.bed.gz | All raw peaks generated by dREG peak calling, including dREG scores, uncorrected p-values, the center positions where the maxima are located in the smoothed curves, the center positions where the maxima are located in the original curves, and the centroid. Only available in the Web storage.
$PREFIX.tar.gz | Archive containing the 5 files above; can be decompressed with 'tar -xvzf' in Linux.
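
As a minimal sketch of how the peak file could be loaded for further analysis, the function below reads a gzipped, tab-separated BED file with Python's standard library. The column order (chromosome, start, end, dREG score, p-value, center) is an assumption based on the description of $PREFIX.dREG.peak.full.bed.gz above, so please verify it against your own file before relying on it.

```python
import csv
import gzip


def read_peaks(path):
    """Read a gzipped peak BED file into a list of dictionaries.

    Assumed columns: chrom, start, end, dREG score, p-value, center.
    """
    peaks = []
    with gzip.open(path, "rt", newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if not row or row[0].startswith("#"):
                continue  # skip blank lines and comments
            chrom, start, end, score, p_value, center = row[:6]
            peaks.append({
                "chrom": chrom,
                "start": int(start),
                "end": int(end),
                "score": float(score),
                "p_value": float(p_value),
                "center": int(center),
            })
    return peaks
```

Because the file is read directly through `gzip.open`, there is no need to run 'gunzip' first.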
Box 1: Brief description of key terms

Informative position: Loci denoted as "informative positions" meet the following criteria: they contain more than 3 reads in a 100 bp interval on either strand, or more than 1 read in a 1 kbp interval on both strands. Informative positions are used to predict the dREG scores for TRE (Transcription Regulatory Element) identification.
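
The criteria for informative positions can be written as a short test. This sketch assumes that the 100 bp and 1 kbp intervals are centered on the candidate position and that each strand is given as a sorted list with one coordinate per read; the dREG package may define the windows differently, and the function names are hypothetical.

```python
from bisect import bisect_left


def count_reads(sorted_positions, lo, hi):
    """Number of reads with lo <= position < hi in a sorted list."""
    return bisect_left(sorted_positions, hi) - bisect_left(sorted_positions, lo)


def is_informative(pos, plus_reads, minus_reads):
    """Apply the two criteria, with windows centered on pos (an assumption)."""
    strands = (plus_reads, minus_reads)
    near = [count_reads(r, pos - 50, pos + 50) for r in strands]    # 100 bp
    wide = [count_reads(r, pos - 500, pos + 500) for r in strands]  # 1 kbp
    # More than 3 reads on either strand, or more than 1 read on both strands.
    return any(n > 3 for n in near) or all(n > 1 for n in wide)
```

With these assumptions, a position with four plus-strand reads within 50 bp is informative even if the minus strand is empty.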

dREG score: Training and prediction are done using a Support Vector Regression model in which a label of 1 indicates RNA polymerase II initiation or transcription through the informative position. The predicted values from the pre-trained model are called dREG scores. A dREG score close to 1 indicates that a position is likely a TRE.

Peak p-value: We test 5 dREG scores around each candidate peak center using the null hypothesis that each point within this peak is drawn from the non-TRE distribution. This test estimates the statistical confidence of each candidate dREG peak. In the final result, FDR correction is applied for multiple testing, and only the peaks with an adjusted p-value < 0.05 are reported.
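
To show how an FDR adjustment turns raw peak p-values into the adjusted values that are compared with 0.05, here is a sketch of the Benjamini-Hochberg procedure. The text above does not say which FDR method dREG uses, so Benjamini-Hochberg is an assumption chosen because it is the most common one.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Assumed FDR method for illustration; not confirmed as dREG's own.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Peaks would then be kept when their adjusted value is below 0.05, for instance with `[a < 0.05 for a in benjamini_hochberg(p_values)]`.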


2) In the Web storage folder there are some files required by the WashU genome browser:

 

File name | Description
$PREFIX.dREG.infp.bw | The bigWig file converted from the bed file of informative positions ($PREFIX.dREG.infp.bed.gz).
$PREFIX.dREG.peak.score.bw | The bigWig file converted from the significant peaks (FDR < 0.05) with dREG scores ($PREFIX.dREG.peak.score.bed.gz).
$PREFIX.dREG.peak.prob.bw | The bigWig file converted from the significant peaks (FDR < 0.05) with p-values ($PREFIX.dREG.peak.prob.bed.gz).
*.bed.gz.tbi | The index files generated from the corresponding bed files. Please ignore them if you download the results.

3) There are two log files in the Web storage folder:

 

File name | Description
$PREFIX.dREG.log | Summary information printed after peak calling. If the bigWigs don't meet the requirements of dREG, warnings are written to this file.
slurm-??????.out | The verbose logging output of the dREG package.