dREG Gateway

BayesPrism Documentation

1   Login
To log in, click the 'Log in' link at the top-right corner of the page. Creating an account is free and easy, and provides a number of benefits.

dREG login
Figure 1: Login page

2   Create a new experiment
Select the BayesPrism application on the dashboard panel to create a new analysis for your data, as shown in the following screenshot (Figure 2).

dREG panel
Figure 2: dREG dashboard

3   Set experiment name
Edit the Experiment Name and, optionally, click "Add a description" to comment on the experimental setup. Choose the project that the experiment belongs to. By default, a "Default Project" is created and used.

BayesPrism experiment name
Figure 3: Start new BayesPrism experiment

4   Upload count matrix files
BayesPrism needs two types of count matrix file: the bulk RNA-seq count matrix and the reference count matrix. Data can currently be imported from several formats, such as tsv, xls, rds/dataframe, rds/seurat, and h5ad. For details, please check the Input tab of this page.
The gateway provides two ways to upload count matrix files: (1) click "Select files from storage" to choose existing files submitted for previous tasks, or (2) click "Drop files here or browse" to upload new files from the user's storage.
Note:
(1) Each row of a count matrix represents one unique gene ID, so the bulk and reference count matrices should contain the same gene set.
(2) Count matrices must not be normalized; raw counts are required.
(3) At least 50 reads per cell type are suggested.
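
Notes (1) and (2) can be checked locally before uploading. The following is a minimal sketch, not part of the gateway: it assumes both matrices are TSV files with sample names in the first row and gene IDs in the first column, and the function names `read_counts` and `check_matrices` are hypothetical.

```python
import csv
import io


def read_counts(handle):
    """Read a TSV count matrix into {gene_id: [counts...]}.

    Assumes the first row holds sample names and the first column holds gene IDs.
    """
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the sample-name row
    counts = {}
    for row in reader:
        if not row:
            continue
        gene, values = row[0], row[1:]
        if gene in counts:
            raise ValueError(f"duplicate gene id: {gene}")
        counts[gene] = [float(v) for v in values]
    return counts


def check_matrices(bulk, reference):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    missing_in_ref = sorted(set(bulk) - set(reference))
    missing_in_bulk = sorted(set(reference) - set(bulk))
    if missing_in_ref:
        problems.append(f"genes only in bulk: {missing_in_ref}")
    if missing_in_bulk:
        problems.append(f"genes only in reference: {missing_in_bulk}")
    for name, matrix in (("bulk", bulk), ("reference", reference)):
        for gene, values in matrix.items():
            # Raw counts are non-negative whole numbers; fractions suggest normalization.
            if any(v < 0 or v != int(v) for v in values):
                problems.append(f"{name}: non-integer or negative value for {gene}")
    return problems


# Tiny in-memory example; real files would be opened with open(path, newline="").
bulk = read_counts(io.StringIO("gene\ts1\ts2\nG1\t10\t0\nG2\t3\t7\n"))
ref = read_counts(io.StringIO("gene\tc1\tc2\nG1\t1\t2\nG2\t0.5\t4\n"))
for problem in check_matrices(bulk, ref):
    print(problem)
```

Here the reference matrix is flagged because gene G2 contains the fractional value 0.5, which would not occur in raw counts.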

Upload count matrix files
Figure 4: Upload count matrix files

5   Set computing parameters
(1) Specify the species so that ribosomal, mitochondrial, chrX, and chrY genes can be removed. For species other than human and mouse, users need to remove these genes manually.
(2) Specify the cell type and tumor state for each cell in the reference count matrix using a CSV file with three columns: cell_id, cell_type, and tumor_state. The tumor state should be 0 (non-tumor) or 1 (tumor).
(3) Specify the prefix of the output files. This helps distinguish results from multiple experiments.
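
The annotation CSV described in (2) can be validated before submission. The sketch below is illustrative only: it assumes the file has a header row naming the three columns exactly as cell_id, cell_type, tumor_state (the gateway's precise expectations may differ), and `validate_annotation` and the cell IDs are hypothetical.

```python
import csv
import io

REQUIRED_COLUMNS = ["cell_id", "cell_type", "tumor_state"]


def validate_annotation(handle):
    """Parse the annotation CSV and return {cell_id: (cell_type, tumor_state)}."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != REQUIRED_COLUMNS:
        raise ValueError(f"expected columns {REQUIRED_COLUMNS}, got {reader.fieldnames}")
    cells = {}
    for line_no, row in enumerate(reader, start=2):
        state = (row["tumor_state"] or "").strip()
        if state not in ("0", "1"):
            raise ValueError(f"line {line_no}: tumor_state must be 0 or 1, got {state!r}")
        if row["cell_id"] in cells:
            raise ValueError(f"line {line_no}: duplicate cell_id {row['cell_id']}")
        cells[row["cell_id"]] = (row["cell_type"], int(state))
    return cells


# Assumed layout: one header row, then one row per cell.
example = io.StringIO(
    "cell_id,cell_type,tumor_state\n"
    "cell_001,T cell,0\n"
    "cell_002,malignant,1\n"
)
annotation = validate_annotation(example)
print(annotation)
```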

BayesPrism parameters
Figure 5: Set computing parameters

6   Submit the job
Once steps 1-5 are finished, proceed to "save and launch". Input data and parameters will be submitted to a computing node of the XSEDE cluster via the dREG gateway server. Check the box next to "Receive email notification of experiment status" if needed. Upon launching, users are directed to the "Experiments" page, shown in Figure 6. A typical experiment finishes within 4 hours.

7   Check the status
Users may view the progress by logging in and clicking the "Experiments" button on the left control panel at the dashboard. All experiments submitted are listed on this page.

BayesPrism experiment browse
Figure 6: Check the experiment status

8   Check the results
Once a job is completed, click the BayesPrism experiment of interest and the website will jump to the Experiment Summary page, which lists all parameters used to set up the experiment. The output files of BayesPrism are stored in the ARCHIVE; click ARCHIVE to inspect any single result file. A compressed file containing the input count matrix files, two task log files, and all result files is also provided; click the Download Zip button to download it. On Linux, a downloaded file with the 'tar.gz' extension can be decompressed with the 'tar' command, and a file with the 'gz' extension with the 'gunzip' command.
Downloads can be problematic in Safari, because Safari tries to unzip the compressed results automatically using an incompatible method. Please check this link to disable this feature.
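
Where the 'gunzip' command is not available (for example on Windows), the Python standard library offers an equivalent. This is a minimal sketch rather than a gateway feature; the helper name `gunzip` and the file name are hypothetical.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def gunzip(src, dest=None):
    """Decompress a single '.gz' file, like 'gunzip -k' (the source is kept)."""
    src = Path(src)
    dest = Path(dest) if dest else src.with_suffix("")
    with gzip.open(src, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dest


# Self-contained demonstration with a throwaway file (hypothetical name).
workdir = Path(tempfile.mkdtemp())
packed = workdir / "example.log.gz"
with gzip.open(packed, "wb") as fh:
    fh.write(b"task finished\n")
unpacked = gunzip(packed)
print(unpacked.name, unpacked.read_bytes())
```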

bayesPrism experiment archive
Figure 7: BayesPrism Archive

The input to BayesPrism consists of two count matrices, which hold the read counts in the bulk samples and in the reference scRNA (or GEP) data. The count matrix of scRNA or GEP data can be exported from single-cell packages such as Seurat or CellRanger. We first explain the count matrix formats used by BayesPrism.

1  Count Matrices

Data Format | Used For | Description
TSV | Bulk, Reference | A tab-separated values file containing read counts for each gene (as row) in every sample (as column). BayesPrism requires a TSV with row names and column names.
XLS | Bulk, Reference | An Excel file containing read counts for each gene (as row) in every sample (as column). BayesPrism requires that the first row of the XLS gives all sample names and the first column gives all gene names or IDs.
RDS/dataframe | Bulk, Reference | An RDS file holding an R data frame that contains read counts for each gene (as row) in every sample (as column). BayesPrism requires that the data frame has row names (genes) and column names (samples).
RDS/sce | Reference | An RDS file containing a SingleCellExperiment object, which represents read counts for each gene (as row) in each sample (as column).
RDS/seurat | Reference | An RDS file containing a Seurat object, which represents single-cell expression data in R. Each Seurat object revolves around a set of cells and consists of one or more Assay objects.
h5ad | Reference | Hierarchical Data Format version 5 (HDF5) is used to store both the expression values and the associated annotations on genes and cells in Python. The H5AD format can be read into R as a SingleCellExperiment.

Note:

(1) The bulk matrix and the reference matrix should use the same gene annotation.

(2) All matrices must contain raw counts; normalized data are not allowed.

(3) In the reference matrix, at least 50 reads are required for each cell type.

2   Cell type and tumor state

If the reference count matrix doesn't contain the cell type and tumor state for each cell, the user must provide a CSV file that supplies them. The CSV should have 3 columns: cell id, cell type, and tumor state (values: 0 or 1).
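
Once each cell has a cell type, the per-cell-type read requirement can be checked before uploading. The sketch below is only one reading of that requirement: it assumes "reads for each cell type" means the total raw reads summed over all cells and genes of that type. The function names and toy data are hypothetical.

```python
from collections import defaultdict

MIN_READS = 50  # threshold taken from the note on reference matrices


def reads_per_cell_type(counts, cell_types):
    """Sum raw reads per cell type.

    counts:     {cell_id: [read count for each gene]}
    cell_types: {cell_id: cell_type}
    """
    totals = defaultdict(int)
    for cell_id, values in counts.items():
        # A cell missing from the annotation raises KeyError, which is intentional.
        totals[cell_types[cell_id]] += sum(values)
    return dict(totals)


def under_sampled(counts, cell_types, min_reads=MIN_READS):
    """Return the cell types whose total read count is below min_reads."""
    totals = reads_per_cell_type(counts, cell_types)
    return sorted(t for t, n in totals.items() if n < min_reads)


# Toy data: three cells, two genes each (all names are hypothetical).
counts = {"c1": [30, 15], "c2": [10, 5], "c3": [2, 1]}
cell_types = {"c1": "T cell", "c2": "T cell", "c3": "B cell"}
print(reads_per_cell_type(counts, cell_types))
print(under_sampled(counts, cell_types))
```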

3   Species

BayesPrism removes ribosomal, mitochondrial, chrX, and chrY genes before deconvolution. If the data are not from human or mouse, users have to remove these genes in advance.
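
For other species, the manual removal might look like the following sketch. It is an illustration under assumptions: the exclusion list must be compiled by the user from the species' own gene annotation, and the gene IDs and the helper `drop_genes` are hypothetical. The same list is applied to both matrices so that they keep an identical gene set.

```python
def drop_genes(matrix, excluded):
    """Return a copy of {gene_id: counts} without the excluded gene IDs."""
    excluded = set(excluded)
    return {gene: values for gene, values in matrix.items() if gene not in excluded}


# Hypothetical gene IDs; a real exclusion list would come from the species' annotation
# (ribosomal, mitochondrial, chrX and chrY genes).
bulk = {"geneA": [5, 9], "mito1": [100, 80], "ribo1": [40, 60], "geneB": [0, 3]}
reference = {"geneA": [1, 0], "mito1": [7, 9], "ribo1": [3, 2], "geneB": [4, 4]}
excluded = ["mito1", "ribo1"]

bulk_filtered = drop_genes(bulk, excluded)
reference_filtered = drop_genes(reference, excluded)
print(sorted(bulk_filtered), sorted(reference_filtered))
```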

4   scRNA or GEP

The reference count matrix can represent scRNA data or GEP (Gene Expression Profile) data. GEP supports only TSV, XLS, and RDS/dataframe.

1  BayesPrism output files

BayesPrism generates an RDATA file ($PREFIX.rdata) for R users and a compressed file ($PREFIX.tar.gz) for Python users.

R users can open the RDATA file with the "load" command. Python users need to extract multiple RDS files (see the following table) using the decompression command "tar -xvzf" on Linux.
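
The extraction step can also be done from Python itself with the standard `tarfile` module. The sketch below is a hedged illustration, not the gateway's own tooling: `extract_rds` is a hypothetical helper, and the archive it runs on is a throwaway stand-in built on the spot for $PREFIX.tar.gz.

```python
import io
import tarfile
import tempfile
from pathlib import Path


def extract_rds(archive, dest):
    """Extract only the '.rds' members of a .tar.gz archive into dest."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    written = []
    with tarfile.open(archive, "r:gz") as tar:
        for member in tar.getmembers():
            if not (member.isfile() and member.name.endswith(".rds")):
                continue
            # Keep only the base name so members cannot escape dest.
            target = dest / Path(member.name).name
            target.write_bytes(tar.extractfile(member).read())
            written.append(target)
    return sorted(written)


# Build a throwaway archive so the sketch runs without a real result file.
workdir = Path(tempfile.mkdtemp())
archive = workdir / "myprefix.tar.gz"  # stands in for $PREFIX.tar.gz
with tarfile.open(archive, "w:gz") as tar:
    for name in ("rted.res.phi_env.rds", "notes.txt"):
        data = b"placeholder"
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))

print([p.name for p in extract_rds(archive, workdir / "rds")])
```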

Note: All files below are stored in the "ARCHIVE" directory.

File name | Description
$PREFIX.rdata | This Rdata file contains the 'rted' object, which can be explored with the 'str' command.
$PREFIX.tar.gz | The compressed file contains multiple RDS files, which represent the items of the 'rted' object. The following table lists all RDS files.
$PREFIX.cor.pdf | The correlation heatmap, which shows the correlation between the samples in the bulk data.

2   Contents in $PREFIX.tar.gz

File name | Description | R code

Access cell type fractions
rted.res.first_gibbs_res.gibbs_theta.rds | Initial estimates of the fraction of each cell subtype in each bulk sample | rted$res$first.gibbs.res$gibbs.theta
rted.res.first_gibbs_res.theta_merged.rds | Initial estimates of the fraction of each cell type (merged across subtypes) in each bulk sample | rted$res$first.gibbs.res$theta.merged
rted.res.final_gibbs_theta.rds | The updated estimates of cell type fraction | rted$res$final.gibbs.theta

Access gene expression (raw read scale)
rted.res.first_gibbs_res.Znkg.rds | The estimates of the mean of posterior read count for each cell subtype in each bulk sample | rted$res$first.gibbs.res$Znkg
rted.res.first_gibbs_res.Znkg_merged.rds | The estimates of the mean of posterior read count for each cell type (merged across subtypes) in each bulk sample | rted$res$first.gibbs.res$Znkg_merged

Access gene expression (normalized read scale)
rted.res.first_gibbs_res.Zkg_tum.rds | The mean count of tumor expression in each bulk sample | rted$res$first.gibbs.res$Zkg.tum
rted.res.first_gibbs_res.Zkg_merge.rds | The mean count of tumor expression in each bulk sample | rted$res$first.gibbs.res$Zkg.merge
rted.res.first_gibbs_res.Zkg_tum_norm.rds | The depth-normalized count of tumor expression in each bulk sample | rted$res$first.gibbs.res$Zkg.tum.norm
rted.res.first_gibbs_res.Zkg_tum_vst.rds | The variance-stabilizing-transformed count of tumor expression in each bulk sample | rted$res$first.gibbs.res$Zkg.tum.vst
rted.res.phi_env.rds | The batch-corrected non-malignant cell expression | rted$res$phi.env

Others
rted.res.first_gibbs_res.cor_mat.rds | Correlation between the samples in the bulk matrix | rted$res$first.gibbs.res$cor.mat
rted.para.input_phi.rds | | rted$para$input.phi
rted.para.input_phi_prior.rds | | rted$para$input.phi.prior
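
The file names in the table above appear to follow a simple rule: each `$` in the R accessor becomes a `.`, and each `.` inside a component becomes a `_`. This is an observation drawn from the listed names, not a documented guarantee, and the helper `rds_name` below is a hypothetical convenience for Python users who want to find the file that matches an R accessor.

```python
def rds_name(r_accessor):
    """Map an R accessor such as 'rted$res$phi.env' to its RDS file name."""
    parts = r_accessor.split("$")
    return ".".join(part.replace(".", "_") for part in parts) + ".rds"


for accessor in (
    "rted$res$first.gibbs.res$gibbs.theta",
    "rted$res$phi.env",
    "rted$para$input.phi.prior",
):
    print(accessor, "->", rds_name(accessor))
```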

2  Read RDS results in Python.

Python users can use 'pyreadr' to read an RDS file (https://stackoverflow.com/questions/40996175/loading-a-rds-file-in-pandas).

Here we briefly show how to read it in Python.

import pyreadr

result = pyreadr.read_r('rted.res.first_gibbs_res.gibbs_theta.rds')

# Extract the pandas data frame. In the case of Rds there is only one object with None as key
df = result[None]
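
To read all extracted results at once, the same call can be applied to every RDS file in a directory. The sketch below is an illustration under assumptions: `load_rds_dir` is a hypothetical helper, the directory name is a placeholder, and the `reader` parameter exists only so the function can be exercised without pyreadr installed.

```python
from pathlib import Path


def load_rds_dir(directory, reader=None):
    """Read every '.rds' file in directory into {file stem: data frame}.

    reader defaults to pyreadr.read_r; it is a parameter only so the function
    can be exercised without pyreadr installed.
    """
    if reader is None:
        import pyreadr  # third-party; install with "pip install pyreadr"
        reader = pyreadr.read_r
    frames = {}
    for path in sorted(Path(directory).glob("*.rds")):
        # An RDS file holds a single object, stored under the key None.
        frames[path.stem] = reader(str(path))[None]
    return frames
```

With the archive extracted into a directory such as "rds", `load_rds_dir("rds")["rted.res.first_gibbs_res.gibbs_theta"]` would then return the same data frame as the single-file example.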

3   Correlation plot.

Correlation heatmap

dREG Gateway is an online service that supports Web-based science through the execution of online computational experiments and the management of data. The items below answer common questions from users.

Q: How should I prepare count matrix files for bayesPrism use with the dREG gateway?

A: .

Q: What should I do when a computation fails in the dREG gateway?

A: There are two types of error you may encounter. We explain how to identify your error and how to handle it here.

Q: Which browser works well with the dREG gateway?

A: So far we have tested Firefox, Google Chrome, and Safari. With IE (version 10 or 11) and some versions of Safari, you may have trouble showing sequence data in the WashU genome browser. Safari users, please read the next Q&A.

Q: What should the Safari users be aware of?

A: By default, Safari unzips a zip file automatically when you download it. However, dREG results are compressed with the 'bgzip' command, which is not compatible with Safari's method, so downloading dREG results can be problematic. Please refer to this link to disable this feature in Safari, and then download the compressed results from the dREG gateway.
Secondly, when you click the genome browser link, please use a left-click; do not use the right-click menu option "open a new tab".

Q: How long are my data and results kept in the dREG gateway?

A: One month.

Q: Do I have to create an account before using this service?

A: Yes. This system is supported by an NSF-funded supercomputing resource known as XSEDE, which regularly needs to report bulk usage statistics to NSF. Nevertheless, data that you provide are completely safe.

Q: How do I know the status of the computational nodes?

A: Since we can't update this web site very often, the gateway status is updated here on the dREG page based on the notifications of the XSEDE community.

Q: Who do I thank for the computing power?

A: This web-based tool is powered by SciGaP and Apache Airavata, and the GPU servers are supported by XSEDE.

Q: I have another question that is not on this FAQ. How can I contact you?

A: Yes, please contact us with any questions! Zhong(zw355 at cornell.edu). Charles(cgd24 at cornell.edu).