dREG Gateway

dTOX Documentation

1)  Login (same as dREG)
The user needs to log in by clicking the 'login' link at the top-right corner of the page. Creating an account is free and easy, and provides a number of benefits.

dREG login

2)  Create a new project (optional, same as dREG)
Optionally, users can create a new 'project' in the dREG/dTOX gateway to archive a collection of sequencing data from related experiments. This keeps related experiments grouped together.

dREG project

3)  Start new dTOX
Select the 'Start dREG/dTOX' menu below the dREG logo to create a new analysis for your data, as shown in the following screenshot. Be sure to select the "dTOX prediction" application.

dREG experiment

4)  Fill experiment form
Select the bigWig files representing PRO-seq, ATAC-seq, or DNase-I-seq signal on the plus and minus strands.

dREG experiment create

5)  Submit the job
Click the 'save and launch' button. The bigWig files are transferred to the XSEDE server and the job is scheduled on a GPU queue to run dTOX. After submitting, the user can check the status on the next web page, as shown below. Depending on the queue status, the job may wait for some time before prediction starts. Once started, it takes 6-10 hours to complete, depending on the genome used.

6)  Check the status
The user can check the status of their 'experiment' by clicking the 'Saved runs' button on the top menu.

dREG experiment browse

7)  Check the results
Once a job is completed, the user can select 'dTOX Bound Regions' in the drop-down list and then LEFT-click the 'Download' link on the experiment summary page to download a compressed file, which is described in the 'output' sheet on this page. The downloaded file has a 'gz' extension and can be decompressed with the 'gunzip' command in Linux. Please do not RIGHT-click to open a new tab for downloading. To extract bound motifs for one specific transcription factor, download our R script (here).

Downloading can be problematic in Safari, because Safari tries to unzip the compressed results automatically using an incompatible decompression method. Please check this link to disable this feature.

dREG experiment summary

8)  Switch to Genome Browser
A convenient feature of the gateway is that the user can view the results in the Genome Browser by clicking the 'Switch to genome browser' link. The genome identifier must be specified in one of two ways: 1) select it from the drop-down list, or 2) type the identifier in the textbox. Please use LEFT-click to open a genome browser window.

dREG experiment summary

9)  Check the storage
The user can LEFT-click the 'Open Folder' link on the experiment summary page to check the storage for the current job, or click the 'Storage' menu under the dREG logo to check the folders and files for all jobs (experiments). The following figure shows the data files in the job's folder, including two bigWig files, one result in bedGraph format, and two outputs of the job scheduler on the GPU nodes.

dREG experiment summary

10)  If your job fails
When you run dTOX, there are two main types of errors you may encounter. One comes from the system and is called a system error, such as no computing time being available on specific GPU nodes or an internal error in Apache Airavata. The other is caused by the user's bigWig files and is called a bigWig error; it can occur when read counts are normalized, when each read is mapped to a region, or when read counts on the minus strand are positive values. The following figures show how to identify the error and how to handle it.

a)  System error
When users submit the experiment, the failure is shown on the experiment summary page shortly afterwards, as in Figure 10-S1 or 10-S2. The experiment status is "Failed" and many Java errors are shown in the "Errors" item. Users cannot solve this problem themselves and should report the error to the webmaster.

System error(1)
Figure 10-S1

System error(2)
Figure 10-S2

b)  bigWig error
After the experiment is complete, no results can be downloaded and the job status shows a failure (see Figure 10-S3). Users can consult the dTOX log file or the task log file to identify the problem. Enter the "storage directory" by clicking the "open" link. There, users can find the "ARCHIVE" folder, to which Apache Airavata copies back all files from the computing node. Check the dTOX log file (run.dTOX.log) to see the bigWig problem, or check the task log file ("slurm-tasknoxxx.out") to find out why the task was aborted. Figures 10-S4 and 10-S5 give two examples of this kind of error. If the bigWig file has problems, please refer to the link for PRO-seq, the link for DNase-I-seq, or the link for ATAC-seq to solve them.

Bigwig error
Figure 10-S3

The following figure shows the bigWig problems reported in the dTOX log file.

Bigwig error(1)
Figure 10-S4

The following figure shows the task log file, which explains that the task was killed because it reached the time limit.

Bigwig error(2)
Figure 10-S5

The input to dTOX consists of two bigWig files which represent either the position of RNA polymerase on the positive and negative strands (PRO-seq) or the accessibility on the positive and negative strands (DNase-I-seq or ATAC-seq). The sequence alignment and processing steps to make the input bigWig files are a major factor influencing how accurately dTOX predicts transcription factor binding.

A key component of all datatypes is that data represents unnormalized raw counts. dTOX assumes that data represents the number of individual sequence tags that are located at each genomic position. For this reason, it is critical that input data is not normalized. The server checks to ensure that input data is expressed as integers, and will return an error if this is not the case.
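The two input checks described on this page (integer read counts, and no positive values on the minus strand) can be sketched in Python as follows. This is only an illustration under stated assumptions, not the check the server actually runs: the function name `check_counts` is hypothetical, and the signal values are assumed to have already been read from the bigWig file with a library of your choice.

```python
def check_counts(values, strand):
    """Return a list of problems found in the signal values of one bigWig file.

    `values` are per-position signal values that were already read from the
    file (reading the bigWig itself is not shown). `strand` is "plus" or
    "minus". An empty list means no problem was detected.
    """
    vals = [float(v) for v in values]
    problems = []
    # Raw read counts must be whole numbers; fractions suggest normalization.
    if any(not v.is_integer() for v in vals):
        problems.append("non-integer values found; the data may be normalized")
    # Minus-strand counts are expected to be zero or negative.
    if strand == "minus" and any(v > 0 for v in vals):
        problems.append("positive values found on the minus strand")
    return problems
```

For example, `check_counts([0, -1, -3], "minus")` returns an empty list, while a file containing a value such as 0.25 is flagged as possibly normalized.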

Users can also use scripts generated in the Danko lab to create compatible bigWig files. Options for scripts at different starting points in the analysis are given below:

Other considerations:

    The quality and quantity of the experimental data are major factors in determining how sensitive dTOX will be in detecting transcription factor binding. To increase the number of reads available for transcription factor binding detection, we encourage users to merge biological replicates in order to improve statistical power prior to running dTOX. Additionally, to compare binding predictions between conditions, we recommend comparing samples at similar sequencing depths or downsampling to create similar sequencing depths.

    We have found that visualizing aligned data in a genome browser (e.g., IGV or UCSC) prior to downstream analysis is a useful way to catch any data quality or alignment issues.
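As a rough illustration of downsampling to a similar sequencing depth, the sketch below randomly removes reads from a list of per-position counts until a target total is reached. It is a toy example, not part of dTOX or the Danko lab scripts: the function name `downsample_counts` is hypothetical, and in practice downsampling is usually done on the aligned reads before the bigWig files are made.

```python
import random


def downsample_counts(counts, target_total, seed=0):
    """Randomly drop reads so that the total read count equals target_total.

    `counts` holds non-negative integer read counts per position (use absolute
    values for the minus strand). If the data already has target_total reads
    or fewer, it is returned unchanged.
    """
    total = sum(counts)
    if target_total >= total:
        return list(counts)
    rng = random.Random(seed)  # a fixed seed keeps the result reproducible
    # One entry per read, labelled with the index of the position it came from.
    reads = [i for i, c in enumerate(counts) for _ in range(c)]
    kept = rng.sample(reads, target_total)  # sampling without replacement
    out = [0] * len(counts)
    for i in kept:
        out[i] += 1
    return out
```

Because reads are sampled without replacement, the output stays integer-valued, which matches the raw-count requirement described above.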

1) A dTOX run generates a compressed file including the following files:

 

File name Description
$PREFIX.dTOX.bound.bed.gz TFBS regions that are predicted as bound. The file includes chromosome, start, end, motif ID, RTFBSDB score, strand, dTOX score, and bound status. Decompress it with 'gunzip' in Linux.
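The column list above can be turned into a small Python reader. This is a sketch under assumptions that are not confirmed by this page: the columns are taken to be tab-separated, in the order listed, with no header line, and the field names (`motif_id`, `dtox_score`, and so on) are our own labels. The bound status is kept as text because its encoding is not described here.

```python
import gzip

# Assumed column order, following the file description above.
COLUMNS = ["chrom", "start", "end", "motif_id",
           "rtfbsdb_score", "strand", "dtox_score", "bound"]


def parse_bound_line(line):
    """Convert one tab-separated line of the bound-regions file into a dict."""
    fields = line.rstrip("\n").split("\t")
    rec = dict(zip(COLUMNS, fields))
    rec["start"] = int(rec["start"])
    rec["end"] = int(rec["end"])
    rec["rtfbsdb_score"] = float(rec["rtfbsdb_score"])
    rec["dtox_score"] = float(rec["dtox_score"])
    return rec


def read_bound_regions(path):
    """Read a gzip-compressed bound-regions file into a list of dicts."""
    with gzip.open(path, "rt") as handle:
        return [parse_bound_line(line) for line in handle
                if line.strip() and not line.startswith("#")]
```

With this reader, `read_bound_regions("out.dTOX.bound.bed.gz")` would return one dictionary per predicted region without decompressing the file first.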
Box 1: Brief description of key terms

Informative position: Loci denoted as "informative positions" meet the following criterion: they contain more than 1 read in a 400 bp interval on either strand. Informative positions are used to predict transcription factor binding.

dTOX decision value: Training and prediction are done using a Support Vector Regression model in which a label of 1 indicates transcription factor binding. The predicted values from the pre-trained model are called dTOX decision values. A dTOX decision value close to 1 indicates that a position is likely to be bound.
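The "informative position" criterion in Box 1 can be illustrated with a short Python sketch. It is one possible reading, not the dTOX implementation: we assume the 400 bp interval is centered on the position, that "either strand" means the plus and minus strands are counted separately, and that minus-strand counts may be stored as negative numbers. The function name `is_informative` is hypothetical.

```python
def is_informative(plus_counts, minus_counts, pos, window=400, min_reads=1):
    """Return True if `pos` has more than `min_reads` reads within `window` bp.

    `plus_counts` and `minus_counts` are per-base read counts along one
    chromosome. The window is centered on `pos` and clipped at the ends.
    """
    half = window // 2
    lo = max(0, pos - half)
    hi = min(len(plus_counts), pos + half)
    plus_total = sum(plus_counts[lo:hi])
    # Minus-strand counts may be negative, so compare their absolute values.
    minus_total = sum(abs(c) for c in minus_counts[lo:hi])
    return plus_total > min_reads or minus_total > min_reads
```

A position with a single read nearby is not informative under this reading, whereas two reads on the same strand within the window are enough.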


Box 2: Extracting bound motifs for a specific transcription factor.

The dTOX output file contains the binding status of our entire set of motifs with PWMs. To find the binding status of the motifs you are interested in, you can run our R script that extracts the Motif IDs that belong to a particular transcription factor. The script is located here. This script requires 3 arguments: the name of the file with the dTOX results, the transcription factor you want to extract, and an output file name. To run this script on Unix or Linux, you need to use the following command:

R --vanilla --slave --args out.dTOX.bound.bed.gz TF outputFile.bed.gz < extract-bound-TF.R


2) In the Web storage folder there are some files required by the WashU genome browser:

 

File name Description
$PREFIX.dTOX.bound.bw The bigWig file converted from bound motifs ($PREFIX.dTOX.bound.bed.gz).
*.bed.gz.tbi The index files generated from the corresponding bed files. Please ignore them if you download the results.

3) There is one log file in the Web storage folder:

 

File name Description
slurm-??????.out The verbose log output of the dTOX package.

The dREG Gateway is an online service that supports Web-based science through the execution of online computational experiments and the management of data. Below are frequently asked questions about the dREG Gateway and the dTOX program.

Q: How should I prepare bigWig files for use with dTOX?

A: Information about how to prepare files can be found on the Danko lab GitHub page here for PRO-seq, DNase, and ATAC-seq.

Q: What should I do when I encounter a computational failure in the dREG gateway?

A: There are two types of error you may encounter. We explain how to identify your error and how to handle it here.

Q: Which browser works well with the dREG gateway?

A: So far we have tested Firefox, Google Chrome, and Safari. With IE (version 10 or 11) and some versions of Safari, you may have trouble displaying sequence data in the WashU genome browser. Safari users, please read the next Q&A.

Q: What should the Safari users be aware of?

A: By default, Safari automatically unzips a compressed file when you download it. However, dTOX results are compressed with the 'bgzip' command, which is not compatible with Safari's method, so downloading dTOX results can fail. Please refer to this link to disable this feature in Safari, and then download the compressed results from the dREG gateway.
Secondly, when you click the genome browser link, please use LEFT-click; do not use the RIGHT-click menu and its "open a new tab" option.

Q: Will dTOX work with my data type?

A: dTOX was trained and tested on PRO-seq, ATAC-seq, and DNase-I-seq. dTOX will also work well with data collected by any run-on and sequencing method, including GRO-seq, PRO-seq, or ChRO-seq. Other methods that map the location of RNA polymerase genome wide using alternative tools (for example, NET-seq) will most likely work well, but are not officially supported.

Q: Will the pre-trained models work using data from my species?

A: Models are currently available only for mammalian organisms. The length and density of genes, which vary considerably between highly divergent species, affect the way that a transcribed promoter or enhancer looks. For this reason, the models can only be used in mammalian species. We are working to create models for widely used model organisms, including Drosophila and C. elegans.

Q: How deeply do I need to sequence PRO-seq libraries?

A: Sensitivity is reasonable at ~40 million mapped reads and saturates at ~100 million mapped reads. See our analysis here: supplementary figure 3 in the dREG paper.

Q: How long are my data and results kept in the dREG gateway?

A: One month.

Q: How do I cite dTOX?

A: Please cite our papers if you use dTOX results in your publication:
(1) Choate, L. A., Wang, Z., & Danko, C. G. (2018). Identification of transcription factor binding using genome-wide accessibility and transcription. bioRxiv.

Q: Do I have to create an account before using this service?

A: Yes. This system is supported by an NSF-funded supercomputing resource known as XSEDE, which regularly needs to report bulk usage statistics to NSF. Nevertheless, the data that you provide are completely safe.

Q: How do I know the status of the computational nodes?

A: Since we can't update this web site very often, the gateway status is updated here on the dREG page based on the notifications of the XSEDE community.

Q: Who do I thank for the computing power?

A: This web-based tool is powered by SciGaP and Apache Airavata, and the GPU servers are supported by XSEDE.

Q: I have another question that is not on this FAQ. How can I contact you?

A: Please contact us with any questions: Zhong (zw355 at cornell.edu) or Charles (cgd24 at cornell.edu).