Setup¶

Preprocessing¶

RNA-Seq data will come to the computational biologist either as bcl or fastq files. The pipeline is set up to accommodate either as a starting point, so you can copy them straight to cloud storage.

If you receive bcl files, they are likely already in temporary storage in another Google bucket. We recommend copying them directly between buckets. The directory name in temporary storage will probably be unwieldy, so copy the individual objects rather than the entire directory (as shown below) into a path based on the experiment name you actually wish to use. Make sure that you copy them to a subdirectory named bcl within your experiment bucket; this is required for the bcl2fastq step to run correctly. We have had trouble in the past with the gsutil wildcard (*) in zsh, but it should work in bash:

# copy files between cloud storage buckets
gsutil -m cp -r gs://path/to/temporary/storage/* gs://genomics_xavier_bucket/<experiment_type>/<experiment_name>/bcl

If you receive fastq files, you will probably receive them on the Broad cluster. You can copy them to your Google bucket just as you would copy between Google buckets:

# copy files from Broad cluster to cloud storage
gsutil -m cp -r /path/to/fastq_files/on/broad/cluster/* gs://genomics_xavier_bucket/<experiment_type>/<experiment_name>/subdirectory_to_hold_fastq_files/

experiment_type and experiment_name are parameters that define the directory path (within genomics_xavier_bucket) and are explained in greater detail below.

SampleSheet.csv¶

If you are running the pipeline with input_data_type == bcl, then you will need to ensure that a SampleSheet is present at the top level of the bcl folder in your project’s Google bucket. A working SampleSheet.csv is shown below. The demultiplexing should work if you follow this format.

SampleSheet.csv¶

[Data]
Sample_ID,Sample_Name,Sample_Plate,Sample_Well,I7_Index_ID,index,I5_Index_ID,index2
1,WT1_Th0_1,,A1,N701,TAAGGCGA,S502,ATAGAGAG
2,WT1_Th0_2,,B1,N701,TAAGGCGA,S503,AGAGGATA
3,WT2_Th0_1,,C1,N701,TAAGGCGA,S504,TCTACTCT
4,WT2_Th0_2,,D1,N701,TAAGGCGA,S505,CTCCTTAC
5,WT3_Th0_1,,E1,N701,TAAGGCGA,S506,TATGCAGT
6,WT3_Th0_2,,F1,N701,TAAGGCGA,S507,TACTCCTT
7,KO1_Th0_1,,A2,N702,CGTACTAG,S502,ATAGAGAG
8,KO1_Th0_2,,B2,N702,CGTACTAG,S503,AGAGGATA
9,KO2_Th0_1,,C2,N702,CGTACTAG,S504,TCTACTCT
10,KO2_Th0_2,,D2,N702,CGTACTAG,S505,CTCCTTAC
11,KO3_Th0_1,,E2,N702,CGTACTAG,S506,TATGCAGT
12,KO3_Th0_2,,F2,N702,CGTACTAG,S507,TACTCCTT
13,WT1_Th1_1,,A3,N703,AGGCAGAA,S502,ATAGAGAG
14,WT1_Th1_2,,B3,N703,AGGCAGAA,S503,AGAGGATA
15,WT2_Th1_1,,C3,N703,AGGCAGAA,S504,TCTACTCT
16,WT2_Th1_2,,D3,N703,AGGCAGAA,S505,CTCCTTAC
17,WT3_Th1_1,,E3,N703,AGGCAGAA,S506,TATGCAGT
18,WT3_Th1_2,,F3,N703,AGGCAGAA,S507,TACTCCTT
19,KO1_Th1_1,,A4,N704,TCCTGAGC,S502,ATAGAGAG
20,KO1_Th1_2,,B4,N704,TCCTGAGC,S503,AGAGGATA
21,KO2_Th1_1,,C4,N704,TCCTGAGC,S504,TCTACTCT
22,KO2_Th1_2,,D4,N704,TCCTGAGC,S505,CTCCTTAC
23,KO3_Th1_1,,E4,N704,TCCTGAGC,S506,TATGCAGT
24,KO3_Th1_2,,F4,N704,TCCTGAGC,S507,TACTCCTT
25,WT1_Th17n_1,,A5,N705,GGACTCCT,S502,ATAGAGAG
26,WT1_Th17n_2,,B5,N705,GGACTCCT,S503,AGAGGATA
27,WT2_Th17n_1,,C5,N705,GGACTCCT,S504,TCTACTCT
28,WT2_Th17n_2,,D5,N705,GGACTCCT,S505,CTCCTTAC
29,WT3_Th17n_1,,E5,N705,GGACTCCT,S506,TATGCAGT
30,WT3_Th17n_2,,F5,N705,GGACTCCT,S507,TACTCCTT
31,KO1_Th17n_1,,A6,N706,TAGGCATG,S502,ATAGAGAG
32,KO1_Th17n_2,,B6,N706,TAGGCATG,S503,AGAGGATA
33,KO2_Th17n_1,,C6,N706,TAGGCATG,S504,TCTACTCT
34,KO2_Th17n_2,,D6,N706,TAGGCATG,S505,CTCCTTAC
35,KO3_Th17n_1,,E6,N706,TAGGCATG,S506,TATGCAGT
36,KO3_Th17n_2,,F6,N706,TAGGCATG,S507,TACTCCTT
37,WT1_Th17p_1,,A7,N707,CTCTCTAC,S502,ATAGAGAG
38,WT1_Th17p_2,,B7,N707,CTCTCTAC,S503,AGAGGATA
39,WT2_Th17p_1,,C7,N707,CTCTCTAC,S504,TCTACTCT
40,WT2_Th17p_2,,D7,N707,CTCTCTAC,S505,CTCCTTAC
41,WT3_Th17p_1,,E7,N707,CTCTCTAC,S506,TATGCAGT
42,WT3_Th17p_2,,F7,N707,CTCTCTAC,S507,TACTCCTT
43,KO1_Th17p_1,,A8,N708,CAGAGAGG,S502,ATAGAGAG
44,KO1_Th17p_2,,B8,N708,CAGAGAGG,S503,AGAGGATA
45,KO2_Th17p_1,,C8,N708,CAGAGAGG,S504,TCTACTCT
46,KO2_Th17p_2,,D8,N708,CAGAGAGG,S505,CTCCTTAC
47,KO3_Th17p_1,,E8,N708,CAGAGAGG,S506,TATGCAGT
48,KO3_Th17p_2,,F8,N708,CAGAGAGG,S507,TACTCCTT
49,WT1_Treg_1,,A9,N709,GCTACGCT,S502,ATAGAGAG
50,WT1_Treg_2,,B9,N709,GCTACGCT,S503,AGAGGATA
51,WT2_Treg_1,,C9,N709,GCTACGCT,S504,TCTACTCT
52,WT2_Treg_2,,D9,N709,GCTACGCT,S505,CTCCTTAC
53,WT3_Treg_1,,E9,N709,GCTACGCT,S506,TATGCAGT
54,WT3_Treg_2,,F9,N709,GCTACGCT,S507,TACTCCTT
55,KO1_Treg_1,,A10,N710,CGAGGCTG,S502,ATAGAGAG
56,KO1_Treg_2,,B10,N710,CGAGGCTG,S503,AGAGGATA
57,KO2_Treg_1,,C10,N710,CGAGGCTG,S504,TCTACTCT
58,KO2_Treg_2,,D10,N710,CGAGGCTG,S505,CTCCTTAC
59,KO3_Treg_1,,E10,N710,CGAGGCTG,S506,TATGCAGT
60,KO3_Treg_2,,F10,N710,CGAGGCTG,S507,TACTCCTT
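
Before uploading a sheet, it can be worth checking it for two easy-to-make mistakes: rows with the wrong number of columns, and two samples sharing the same index pair. The following is a minimal sketch rather than part of the pipeline. The function name check_samplesheet is hypothetical, and the check assumes the column layout shown above (index in column 6, index2 in column 8) and Unix line endings.

```shell
# Hypothetical helper, not part of the pipeline: sanity-check a SampleSheet.csv.
# Assumes the layout shown above: a [Data] line, then a header row, then one
# row per sample with index in column 6 and index2 in column 8.
check_samplesheet() {
  awk -F',' '
    BEGIN { bad = 0; in_data = 0; header_seen = 0 }
    /^\[Data\]/ { in_data = 1; next }
    in_data && !header_seen { header_seen = 1; ncol = NF; next }
    in_data && NF > 0 {
      if (NF != ncol) {
        printf "line %d: expected %d fields, got %d\n", NR, ncol, NF
        bad = 1
      }
      key = $6 "+" $8
      if (key in seen) {
        printf "line %d: index pair %s already used on line %d\n", NR, key, seen[key]
        bad = 1
      } else {
        seen[key] = NR
      }
    }
    END {
      if (!in_data) { print "no [Data] section found"; bad = 1 }
      exit bad
    }
  ' "$1"
}

# Example: a two-sample sheet that passes the check.
sheet=$(mktemp)
printf '%s\n' \
  '[Data]' \
  'Sample_ID,Sample_Name,Sample_Plate,Sample_Well,I7_Index_ID,index,I5_Index_ID,index2' \
  '1,WT1_Th0_1,,A1,N701,TAAGGCGA,S502,ATAGAGAG' \
  '2,WT1_Th0_2,,B1,N701,TAAGGCGA,S503,AGAGGATA' > "$sheet"
check_samplesheet "$sheet" && echo "SampleSheet looks consistent"
```

The function exits with a non-zero status and prints one message per problem, so it can be placed in front of the gsutil upload in a script.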

Configuration¶

experiment_name¶

Refers to the common Xavier Lab naming convention of <date_initials>. It determines the Google bucket location of the input and output files for the experiment (genomics_xavier_bucket/<experiment_type>/<experiment_name>).

experiment_type¶

Denotes the type of experiment that was run. It must correspond to one of the directories listed here, as this parameter is used to set the location for output.

input_data_type¶

Identifies the type of input files and determines whether or not the bcl2fastq step of the pipeline must be run. Must be a member of {bcl, fastq}.

organism¶

Identifies the type of organism from which experimental samples come and is used to map between sequences and genes. Must be a member of {human, mouse}.

reference_transcriptome¶

Path to reference transcriptome corresponding to organism listed above. For now, please use one of the 38 builds here, because the sequence/gene mapping schema are hard-coded into the R scripts that run differential expression and rely on the 38 build. It may be possible to generalize this at a later date.

Note that we always use the ensembl builds, which can be found here

cycle_number¶

Integer that specifies the cycle number of the sequencing for the experiment. Before performing pseudoalignment, kallisto builds a transcriptome index (essentially a dictionary of k-mer locations) for faster processing. cycle_number sets the k-mer length that kallisto uses when building the index. If the k-mer length is longer than your reads, kallisto will align 0% of them. As a general rule, we recommend keeping the default value of 31 if you have $38 \times 38$ reads. If your reads have some other length (say $29 \times 30$), we have had success setting cycle_number to the smaller of the two numbers (29 in this example). Note that kallisto cannot align any read shorter than cycle_number.
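
The rule of thumb above can be written as a small helper. This is only a sketch: the function name pick_cycle_number is hypothetical, and it additionally assumes what kallisto's own index options state, namely that the k-mer length must be odd and at most 31, so an even result is rounded down by one.

```shell
# Hypothetical helper: suggest a cycle_number from the two read lengths.
# Assumption (from kallisto's index options): k must be odd and at most 31.
pick_cycle_number() {
  k=$1
  if [ "$2" -lt "$k" ]; then k=$2; fi              # shorter of the two reads
  if [ "$k" -gt 31 ]; then k=31; fi                # cap at the default of 31
  if [ $((k % 2)) -eq 0 ]; then k=$((k - 1)); fi   # round even values down
  echo "$k"
}

pick_cycle_number 38 38   # prints 31
pick_cycle_number 29 30   # prints 29
```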

logCPM_threshold¶

Float that sets a logCPM threshold for the differential expression portion of the pipeline. Gene/comparison combinations with a logCPM below this threshold are included in the output differential expression tables, but are not considered for FDR calculations (they automatically receive FDR = 1). By default, the pipeline will set this threshold value to 1.

Note that the logCPM value included in the differential expression is not the default provided by edgeR. The edgeR default is to provide logCPM equal to the log of the average CPM of the gene across all samples. We instead find the log of the average CPM for each gene across the samples in each condition of our comparison (so a gene/comparison combination will have two associated logCPM values) and then report the higher of the two.
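
Written out, with notation that is ours rather than the pipeline's: for a gene $g$ and a comparison between conditions $A$ and $B$, let $\overline{\mathrm{CPM}}_{g,A}$ be the mean CPM of $g$ over the samples in $A$, and likewise for $B$. The base of the logarithm is not stated here, so it is left unspecified. The reported value is then

$$
\mathrm{logCPM}_{g}(A,B) = \max\bigl(\log \overline{\mathrm{CPM}}_{g,A},\ \log \overline{\mathrm{CPM}}_{g,B}\bigr),
\qquad
\overline{\mathrm{CPM}}_{g,A} = \frac{1}{|A|} \sum_{s \in A} \mathrm{CPM}_{g,s}
$$

and a gene/comparison combination with $\mathrm{logCPM}_{g}(A,B)$ below logCPM_threshold receives FDR = 1.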

read_pairs_file¶

This argument should be a path to a file in the Google bucket associated with your project. If input_data_type == fastq, you should create this file manually and use this argument to point to it. If input_data_type == bcl, the file is generated automatically and this argument specifies where the pipeline will write it.

Note

You will need to provide a filepath whether your input data type is bcl or fastq. The only difference is that the file itself need not already exist for bcl, whereas it must already exist for fastq.

Each line of this file should contain the names of the two files corresponding to the two ends of a paired-end sequencing experiment. Usually, paired reads will have the same prefix followed by R1 (for the left reads) and R2 (for the right reads). Each line of read_pairs.tsv should contain only the names of these two files, separated by a tab character. It is very important that this is a tsv file (with the .tsv extension) and that every line follows the same format, so that Cromwell’s read_pairs function properly interprets the file. One way to check this is to run the following line in bash:

cat -A path/to/read_pairs.tsv

This should print ^I (the marker for a tab) between the two reads of each read pair and $ at the end of each line. Make sure there is no extra $ on a line of its own at the end of the file, which would indicate a trailing blank line.

read_pairs_file¶

gs://genomics-xavier/<experiment_type>/<experiment_name>/fastq/B6L4_S1_R1.fastq^Igs://genomics-xavier/<experiment_type>/<experiment_name>/fastq/B6L4_S1_R2.fastq$
gs://genomics-xavier/<experiment_type>/<experiment_name>/fastq/B6L4_S2_R1.fastq^Igs://genomics-xavier/<experiment_type>/<experiment_name>/fastq/B6L4_S2_R2.fastq$
gs://genomics-xavier/<experiment_type>/<experiment_name>/fastq/B6L4_S3_R1.fastq^Igs://genomics-xavier/<experiment_type>/<experiment_name>/fastq/B6L4_S3_R2.fastq$
gs://genomics-xavier/<experiment_type>/<experiment_name>/fastq/B6L8_S1_R1.fastq^Igs://genomics-xavier/<experiment_type>/<experiment_name>/fastq/B6L8_S1_R2.fastq$
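
For more than a handful of samples, writing this file by hand is error-prone. The sketch below builds the file from a list of R1 paths and then checks its format. It is not part of the pipeline: the function names and the bucket gs://my-bucket are hypothetical, and it assumes that the two mates differ only in the last _R1 versus _R2 in their paths.

```shell
# Hypothetical helpers, not part of the pipeline.

# Read R1 paths on stdin; write "R1<TAB>R2" lines on stdout.
# Assumes the mates differ only in the last "_R1" / "_R2" of the path.
# Lines that do not contain "_R1" are dropped.
make_read_pairs() {
  tab=$(printf '\t')
  sed -n -E "s/^(.*)_R1(.*)\$/\1_R1\2${tab}\1_R2\2/p"
}

# Fail if any line is not exactly two non-empty tab-separated fields.
# A blank line (the "extra $" case) is reported as well.
check_read_pairs() {
  awk -F'\t' '
    BEGIN { bad = 0 }
    NF != 2 || $1 == "" || $2 == "" {
      printf "line %d: expected two tab-separated paths\n", NR
      bad = 1
    }
    END { exit bad }
  ' "$1"
}

# Example with placeholder paths.
pairs=$(mktemp)
printf '%s\n' \
  'gs://my-bucket/fastq/B6L4_S1_R1.fastq' \
  'gs://my-bucket/fastq/B6L4_S2_R1.fastq' | make_read_pairs > "$pairs"
check_read_pairs "$pairs" && echo "read_pairs.tsv looks well-formed"
```

In practice the list of R1 paths could come from a gsutil ls of the fastq subdirectory instead of the printf shown here; the check is still worth running on the result, since it does not confirm that the R2 files actually exist.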

samples_described_file¶

This argument should be a path to a file in the Google bucket associated with your project. This file is required only if you want to perform pairwise differential gene expression as part of the pipeline. Your dataset might contain multiple fastq files that all correspond to the same sample-type. It is therefore necessary to describe all the fastq files, so that the gene counts for these files can be combined when constructing the counts matrix. samples_described_file does this job. Each line of this file contains the sample-type and the filename prefix (as described for read_pairs_file above), separated by a tab character.

samples_described_file¶

 B6S4    B6S4_S1
 B6S4    B6S4_S2
 B6S4    B6S4_S3
 B6S8    B6S8_S1
 B6S8    B6S8_S2
 B6S8    B6S8_S3
 B6S10    B6S10_S1
 B6S10    B6S10_S2

The names in the left column can be anything you want. The names in the right column must correspond to the names of the fastq files (up to the point where the read number is specified). For example, a file named BC3_NS_2h_2_S18_R2_001.fastq.gz would be referred to as BC3_NS_2h_2_S18.
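
The right-hand column can be derived from the fastq file names rather than typed by hand. The following is a sketch under stated assumptions, not part of the pipeline: the function names are hypothetical, file names are assumed to end in _R1 or _R2, optionally followed by a numeric chunk such as _001, and then .fastq or .fastq.gz. The left-hand column is guessed by dropping a trailing _S<number>, which matches the example above but should be edited to whatever grouping you actually want.

```shell
# Hypothetical helper: strip the read number (and anything after it) from a
# fastq file name, e.g. BC3_NS_2h_2_S18_R2_001.fastq.gz -> BC3_NS_2h_2_S18.
fastq_prefix() {
  basename "$1" | sed -E 's/_R[12](_[0-9]+)?\.fastq(\.gz)?$//'
}

# Read fastq paths on stdin; write "sample-type<TAB>prefix" lines, one per
# unique prefix. Assumption: the sample-type is the prefix without its
# trailing _S<number>, as in the example above (B6S4_S1 -> B6S4).
make_samples_described() {
  while IFS= read -r fq; do
    prefix=$(fastq_prefix "$fq")
    stype=$(printf '%s\n' "$prefix" | sed -E 's/_S[0-9]+$//')
    printf '%s\t%s\n' "$stype" "$prefix"
  done | awk '!seen[$0]++'
}

fastq_prefix BC3_NS_2h_2_S18_R2_001.fastq.gz   # prints BC3_NS_2h_2_S18

printf '%s\n' \
  B6S4_S1_R1.fastq B6S4_S1_R2.fastq \
  B6S4_S2_R1.fastq B6S4_S2_R2.fastq \
  B6S8_S1_R1.fastq B6S8_S1_R2.fastq | make_samples_described
```

The last command prints three tab-separated lines, one each for B6S4_S1, B6S4_S2 and B6S8_S1, because the R1 and R2 files of a pair share a prefix and duplicates are removed.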

ensemble_mirror¶

This argument should be a host name from which gene names can be looked up. It is passed as the host parameter of the useMart function in the R package biomaRt; see that function's documentation for more details. Changing it is useful when the default server is down. Defaults to useast.ensembl.org.

samples_compared_file¶

This argument was removed in version 19. From that version onward, the pipeline automatically compares all sample groups in samples_described_file; use only the differential expression comparisons that are relevant to your study. The rest of this section applies to earlier versions only.

This argument should be a path to a file in the Google bucket associated with your project. This file is required only if you want to perform pairwise differential gene expression as a part of the pipeline. If you are performing this kind of an analysis, each line of this file should contain one comparison expression in the format condition1_vs_condition2. For example,

samples_compared_file¶

B6S4_vs_B6S8
B6S4_vs_B6S10

Note that these names correspond to the names in the first column of samples_described_file.

If you want the pipeline to automatically perform differential expression for every pairwise comparison of conditions, you can submit a samples_compared_file with a single line reading all_vs_all.
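
To see which comparisons an all-pairs run would involve, or to start from the full list and delete the lines you do not need, the conditions in samples_described_file can be enumerated directly. This is a sketch only: the function name list_comparisons is hypothetical, and the order of the two conditions in each pair (first appearance in the file) is our choice and not necessarily the order the pipeline uses for all_vs_all.

```shell
# Hypothetical helper: print condition1_vs_condition2 for every pair of
# distinct sample-types in the first column of a samples_described_file.
# Conditions are paired in order of first appearance in the file.
list_comparisons() {
  awk -F'\t' '
    NF && !($1 in seen) { seen[$1] = 1; cond[++n] = $1 }
    END {
      for (i = 1; i <= n; i++) {
        for (j = i + 1; j <= n; j++) {
          print cond[i] "_vs_" cond[j]
        }
      }
    }
  ' "$1"
}

# Example using the sample-types from the samples_described_file above.
described=$(mktemp)
printf 'B6S4\tB6S4_S1\nB6S4\tB6S4_S2\nB6S8\tB6S8_S1\nB6S10\tB6S10_S1\n' > "$described"
list_comparisons "$described"
# prints:
# B6S4_vs_B6S8
# B6S4_vs_B6S10
# B6S8_vs_B6S10
```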