Setup¶
Preprocessing¶
RNA-Seq data will come to the computationalist either as bcl or fastq files. The pipeline is set up to accommodate either as a starting point, so you can copy them straight to cloud storage.
If you receive bcl files, they are likely already in temporary storage in another Google bucket. We recommend copying them directly between buckets. The directory name in temporary storage will probably be unwieldy, so copy the individual objects (rather than the entire directory) into a bucket path based on the experiment name you actually wish to use, as shown below. Make sure that you copy them into a subdirectory named bcl within your experiment bucket; this is required for the bcl2fastq step to run correctly. We have had trouble in the past using the gsutil wildcard (*) within zsh, but it should work in bash:
# copy files between cloud storage buckets
gsutil -m cp -r gs://path/to/temporary/storage/* gs://genomics_xavier_bucket/<experiment_type>/<experiment_name>/bcl
If you receive fastq files, you will probably receive them on the Broad cluster. You can copy them to your Google bucket just as you would copy between Google buckets:
# copy files from Broad cluster to cloud storage
gsutil -m cp -r /path/to/fastq_files/on/broad/cluster/* gs://genomics_xavier_bucket/<experiment_type>/<experiment_name>/subdirectory_to_hold_fastq_files/
experiment_type and experiment_name are parameters that define the directory path (within genomics_xavier_bucket) and are explained in greater detail below.
Configuration¶
experiment_name¶
Refers to the common Xavier Lab naming convention of <date_initials>. Together with experiment_type, this sets the Google bucket location of the input and output files for the experiment: genomics_xavier_bucket/<experiment_type>/<experiment_name>.
experiment_type¶
Denotes the type of experiment that was run. It must correspond to one of the directories listed here, as this parameter is used to set the location for output.
input_data_type¶
Identifies the type of input files and determines whether or not the bcl2fastq step of the pipeline must be run. Must be a member of {bcl, fastq}.
organism¶
Identifies the type of organism from which experimental samples come and is used to map between sequences and genes. Must be a member of {human, mouse}.
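The restricted values of input_data_type and organism can be checked before a run is launched. The following is only a minimal sketch: validate_choice is a hypothetical helper that is not part of the pipeline, and the shell variables stand in for however you actually store your configuration.

```shell
#!/bin/sh
set -eu

# Hypothetical helper (not part of the pipeline).
# $1 = parameter name, $2 = value, remaining arguments = allowed values.
validate_choice() {
    vc_name="$1"
    vc_value="$2"
    shift 2
    for vc_allowed in "$@"; do
        if [ "$vc_value" = "$vc_allowed" ]; then
            return 0
        fi
    done
    echo "$vc_name must be one of: $* (got '$vc_value')" >&2
    return 1
}

# Placeholder values for illustration only.
input_data_type="fastq"
organism="mouse"

validate_choice input_data_type "$input_data_type" bcl fastq
validate_choice organism "$organism" human mouse
echo "configuration values look valid"
```

Because the script uses set -e, an invalid value stops it at the first failed check with a message on stderr.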
reference_transcriptome¶
Path to the reference transcriptome corresponding to the organism listed above. For now, please use one of the 38 builds here, because the sequence/gene mapping schemas are hard-coded into the R scripts that run differential expression and rely on the 38 build. It may be possible to generalize this at a later date.
Note that we always use the ensembl builds, which can be found here.
read_pairs_file¶
This argument should be a path to a file in the Google bucket associated with your project. If input_data_type == ‘fastq’, you should create this file manually and use this argument to point to it. If the input type is bcl, the file is generated automatically and this argument specifies where the pipeline will write it. Each line of the file should contain only the names of the two files corresponding to the two ends of a paired-end sequencing experiment, separated by a tab character. Usually, paired reads have the same prefix followed by R1 (for the left reads) and R2 (for the right reads). It is very important that this is a tsv file and that it is formatted uniformly, so that Cromwell’s read_pairs function interprets it properly. One way to check this is to run the following line in bash:
cat -A path/to/read_pairs.tsv
This should produce ^I (the marker for a tab) between the two file names of each read pair and $ at the end of each line.
read_pairs_file:
gs://genomics-xavier/<experiment_type>/<experiment_name>/fastq/B6L4_S1_R1.fastq	gs://genomics-xavier/<experiment_type>/<experiment_name>/fastq/B6L4_S1_R2.fastq
gs://genomics-xavier/<experiment_type>/<experiment_name>/fastq/B6L4_S2_R1.fastq	gs://genomics-xavier/<experiment_type>/<experiment_name>/fastq/B6L4_S2_R2.fastq
gs://genomics-xavier/<experiment_type>/<experiment_name>/fastq/B6L4_S3_R1.fastq	gs://genomics-xavier/<experiment_type>/<experiment_name>/fastq/B6L4_S3_R2.fastq
gs://genomics-xavier/<experiment_type>/<experiment_name>/fastq/B6L8_S1_R1.fastq	gs://genomics-xavier/<experiment_type>/<experiment_name>/fastq/B6L8_S1_R2.fastq
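When the input is fastq, the file has to be written by hand, so it can help to generate it from the file names and check the separators programmatically. The sketch below is one possible approach, not part of the pipeline: make_read_pairs and check_read_pairs are hypothetical helpers, the bucket path is a placeholder, and the files are assumed to be named <prefix>_R1.fastq and <prefix>_R2.fastq.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: for every <prefix>_R1.fastq in a local directory, print
# "<bucket>/<prefix>_R1.fastq<TAB><bucket>/<prefix>_R2.fastq".
# The body runs in a subshell "( ... )" so its variables stay local.
make_read_pairs() (
    fastq_dir="$1"
    bucket_prefix="$2"
    for r1 in "$fastq_dir"/*_R1.fastq; do
        [ -e "$r1" ] || continue    # the glob matched nothing
        prefix="$(basename "$r1" _R1.fastq)"
        if [ ! -e "$fastq_dir/${prefix}_R2.fastq" ]; then
            echo "no R2 mate found for ${prefix}_R1.fastq" >&2
            exit 1
        fi
        printf '%s/%s_R1.fastq\t%s/%s_R2.fastq\n' \
            "$bucket_prefix" "$prefix" "$bucket_prefix" "$prefix"
    done
)

# Hypothetical check: succeed only if every line has exactly two
# tab-separated fields.
check_read_pairs() {
    awk -F'\t' 'NF != 2 { bad = 1 } END { exit bad }' "$1"
}

# Demo with empty placeholder files in a temporary directory.
demo_dir="$(mktemp -d)"
touch "$demo_dir/B6L4_S1_R1.fastq" "$demo_dir/B6L4_S1_R2.fastq" \
      "$demo_dir/B6L4_S2_R1.fastq" "$demo_dir/B6L4_S2_R2.fastq"

# "gs://example-bucket/fastq" is a placeholder, not a real bucket.
make_read_pairs "$demo_dir" "gs://example-bucket/fastq" > "$demo_dir/read_pairs.tsv"

if check_read_pairs "$demo_dir/read_pairs.tsv"; then
    echo "read_pairs.tsv is tab-separated with two fields per line"
fi
```

The awk check is a portable alternative to inspecting the cat -A output by eye; a line that uses spaces instead of a tab is counted as a single field and fails the check.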
samples_described_file¶
This argument should be a path to a file in the Google bucket associated with your project. This file is required only if you want to perform pairwise differential gene expression (via Sleuth) as a part of the pipeline. Your dataset might contain multiple fastq files that all correspond to the same sample-type, so every fastq file must be described in order for the gene counts from these files to be combined when constructing the counts matrix. samples_described_file does this job. Each line of this file contains the sample-type and the filename prefix (as described under read_pairs_file above) separated by a tab character.
samples_described_file:
B6S4	B6S4_S1
B6S4	B6S4_S2
B6S4	B6S4_S3
B6S8	B6S8_S1
The names in the left column can be anything you want. The names in the right column must correspond to the names of the fastq files (up to the point where the read number is specified). For example, a file named BC3_NS_2h_2_S18_R2_001.fastq.gz would be referred to as BC3_NS_2h_2_S18.
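The prefix rule described above can be sketched in shell as follows. This is an illustration under assumptions rather than the pipeline's own code: fastq_prefix and describe_samples are hypothetical helpers, and taking the text before the first underscore as the sample-type is only an example rule, since the left column can be anything you want.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: strip the read-number suffix, i.e. everything from the
# last "_R1" or "_R2" onward.
fastq_prefix() {
    printf '%s\n' "${1%_R[12]*}"
}

# Hypothetical helper: print "<sample-type><TAB><prefix>" once per prefix.
# Assumption for illustration: the sample-type is the text before the first
# underscore of the prefix.
describe_samples() {
    for ds_file in "$@"; do
        ds_prefix="$(fastq_prefix "$(basename "$ds_file")")"
        printf '%s\t%s\n' "${ds_prefix%%_*}" "$ds_prefix"
    done | sort -u
}

# The file name from the text above yields BC3_NS_2h_2_S18.
fastq_prefix "BC3_NS_2h_2_S18_R2_001.fastq.gz"

# Both reads of a pair map to the same prefix, so sort -u keeps one line each.
describe_samples B6S4_S1_R1.fastq B6S4_S1_R2.fastq \
                 B6S4_S2_R1.fastq B6S4_S2_R2.fastq \
                 B6S8_S1_R1.fastq B6S8_S1_R2.fastq
```

Redirecting the output of describe_samples to a file gives a starting point for samples_described_file that can then be edited by hand.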
samples_compared_file¶
This argument should be a path to a file in the Google bucket associated with your project. This file is required only if you want to perform pairwise differential gene expression (via Sleuth) as a part of the pipeline. If you are performing this kind of analysis, each line of this file should contain one comparison expression in the format sample1_vs_sample2. For example,
samples_compared_file:
B6S4_vs_B6S8
Note that these names correspond to the names in the first column of samples_described_file.
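Since every name in a comparison must appear in the first column of samples_described_file, the two files can be cross-checked before the run. The following is a minimal sketch; check_comparisons is a hypothetical helper, and it assumes that sample-type names do not themselves contain the string _vs_.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: $1 = samples_described file, $2 = samples_compared file.
# Succeeds only if each line is "<a>_vs_<b>" and both names occur in the first
# column of the described file. Runs in a subshell to keep variables local.
check_comparisons() (
    described="$1"
    compared="$2"
    status=0
    while IFS= read -r line || [ -n "$line" ]; do
        [ -n "$line" ] || continue          # skip blank lines
        left="${line%%_vs_*}"               # text before the first "_vs_"
        right="${line#*_vs_}"               # text after the first "_vs_"
        if [ "$left" = "$line" ]; then
            echo "not in sample1_vs_sample2 format: $line" >&2
            status=1
            continue
        fi
        for name in "$left" "$right"; do
            if ! cut -f1 "$described" | grep -Fxq -- "$name"; then
                echo "unknown sample-type '$name' in: $line" >&2
                status=1
            fi
        done
    done < "$compared"
    exit "$status"
)

# Demo with small local copies of the two files.
demo_dir="$(mktemp -d)"
printf 'B6S4\tB6S4_S1\nB6S4\tB6S4_S2\nB6S8\tB6S8_S1\n' > "$demo_dir/samples_described.tsv"
printf 'B6S4_vs_B6S8\n' > "$demo_dir/samples_compared.tsv"

if check_comparisons "$demo_dir/samples_described.tsv" "$demo_dir/samples_compared.tsv"; then
    echo "all comparisons refer to described sample-types"
fi
```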
SampleSheet.csv¶
If you are running the pipeline with input_data_type == bcl, then you will need to ensure that a SampleSheet.csv is present at the top level of the bcl folder in your project’s Google bucket. A working SampleSheet.csv is shown below. The demultiplexing should work if you follow this format. Note that the only columns that are strictly necessary (to our knowledge) are Sample_ID, Sample_Name, and index.
SampleSheet.csv:
[Data],,,,,,,
Sample_ID,Sample_Name,Sample_Plate,Sample_Well,I7_Index_ID,index,I5_Index_ID,index2
1,WT1_Th0_1,,A1,N701,TAAGGCGA,S502,ATAGAGAG
2,WT1_Th0_2,,B1,N701,TAAGGCGA,S503,AGAGGATA
3,WT2_Th0_1,,C1,N701,TAAGGCGA,S504,TCTACTCT
4,WT2_Th0_2,,D1,N701,TAAGGCGA,S505,CTCCTTAC
5,WT3_Th0_1,,E1,N701,TAAGGCGA,S506,TATGCAGT
6,WT3_Th0_2,,F1,N701,TAAGGCGA,S507,TACTCCTT
7,KO1_Th0_1,,A2,N702,CGTACTAG,S502,ATAGAGAG
8,KO1_Th0_2,,B2,N702,CGTACTAG,S503,AGAGGATA
9,KO2_Th0_1,,C2,N702,CGTACTAG,S504,TCTACTCT
10,KO2_Th0_2,,D2,N702,CGTACTAG,S505,CTCCTTAC
11,KO3_Th0_1,,E2,N702,CGTACTAG,S506,TATGCAGT
12,KO3_Th0_2,,F2,N702,CGTACTAG,S507,TACTCCTT
13,WT1_Th1_1,,A3,N703,AGGCAGAA,S502,ATAGAGAG
14,WT1_Th1_2,,B3,N703,AGGCAGAA,S503,AGAGGATA
15,WT2_Th1_1,,C3,N703,AGGCAGAA,S504,TCTACTCT
16,WT2_Th1_2,,D3,N703,AGGCAGAA,S505,CTCCTTAC
17,WT3_Th1_1,,E3,N703,AGGCAGAA,S506,TATGCAGT
18,WT3_Th1_2,,F3,N703,AGGCAGAA,S507,TACTCCTT
19,KO1_Th1_1,,A4,N704,TCCTGAGC,S502,ATAGAGAG
20,KO1_Th1_2,,B4,N704,TCCTGAGC,S503,AGAGGATA
21,KO2_Th1_1,,C4,N704,TCCTGAGC,S504,TCTACTCT
22,KO2_Th1_2,,D4,N704,TCCTGAGC,S505,CTCCTTAC
23,KO3_Th1_1,,E4,N704,TCCTGAGC,S506,TATGCAGT
24,KO3_Th1_2,,F4,N704,TCCTGAGC,S507,TACTCCTT
25,WT1_Th17n_1,,A5,N705,GGACTCCT,S502,ATAGAGAG
26,WT1_Th17n_2,,B5,N705,GGACTCCT,S503,AGAGGATA
27,WT2_Th17n_1,,C5,N705,GGACTCCT,S504,TCTACTCT
28,WT2_Th17n_2,,D5,N705,GGACTCCT,S505,CTCCTTAC
29,WT3_Th17n_1,,E5,N705,GGACTCCT,S506,TATGCAGT
30,WT3_Th17n_2,,F5,N705,GGACTCCT,S507,TACTCCTT
31,KO1_Th17n_1,,A6,N706,TAGGCATG,S502,ATAGAGAG
32,KO1_Th17n_2,,B6,N706,TAGGCATG,S503,AGAGGATA
33,KO2_Th17n_1,,C6,N706,TAGGCATG,S504,TCTACTCT
34,KO2_Th17n_2,,D6,N706,TAGGCATG,S505,CTCCTTAC
35,KO3_Th17n_1,,E6,N706,TAGGCATG,S506,TATGCAGT
36,KO3_Th17n_2,,F6,N706,TAGGCATG,S507,TACTCCTT
37,WT1_Th17p_1,,A7,N707,CTCTCTAC,S502,ATAGAGAG
38,WT1_Th17p_2,,B7,N707,CTCTCTAC,S503,AGAGGATA
39,WT2_Th17p_1,,C7,N707,CTCTCTAC,S504,TCTACTCT
40,WT2_Th17p_2,,D7,N707,CTCTCTAC,S505,CTCCTTAC
41,WT3_Th17p_1,,E7,N707,CTCTCTAC,S506,TATGCAGT
42,WT3_Th17p_2,,F7,N707,CTCTCTAC,S507,TACTCCTT
43,KO1_Th17p_1,,A8,N708,CAGAGAGG,S502,ATAGAGAG
44,KO1_Th17p_2,,B8,N708,CAGAGAGG,S503,AGAGGATA
45,KO2_Th17p_1,,C8,N708,CAGAGAGG,S504,TCTACTCT
46,KO2_Th17p_2,,D8,N708,CAGAGAGG,S505,CTCCTTAC
47,KO3_Th17p_1,,E8,N708,CAGAGAGG,S506,TATGCAGT
48,KO3_Th17p_2,,F8,N708,CAGAGAGG,S507,TACTCCTT
49,WT1_Treg_1,,A9,N709,GCTACGCT,S502,ATAGAGAG
50,WT1_Treg_2,,B9,N709,GCTACGCT,S503,AGAGGATA
51,WT2_Treg_1,,C9,N709,GCTACGCT,S504,TCTACTCT
52,WT2_Treg_2,,D9,N709,GCTACGCT,S505,CTCCTTAC
53,WT3_Treg_1,,E9,N709,GCTACGCT,S506,TATGCAGT
54,WT3_Treg_2,,F9,N709,GCTACGCT,S507,TACTCCTT
55,KO1_Treg_1,,A10,N710,CGAGGCTG,S502,ATAGAGAG
56,KO1_Treg_2,,B10,N710,CGAGGCTG,S503,AGAGGATA
57,KO2_Treg_1,,C10,N710,CGAGGCTG,S504,TCTACTCT
58,KO2_Treg_2,,D10,N710,CGAGGCTG,S505,CTCCTTAC
59,KO3_Treg_1,,E10,N710,CGAGGCTG,S506,TATGCAGT
60,KO3_Treg_2,,F10,N710,CGAGGCTG,S507,TACTCCTT
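A quick sanity check on the sheet before uploading it is to confirm that the header row following [Data] contains the three columns noted above. This is a minimal sketch: check_samplesheet is a hypothetical helper, and it inspects only the column names, not the index sequences or anything else that bcl2fastq may require.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: succeed only if the first row after the [Data] line
# contains the Sample_ID, Sample_Name, and index columns.
check_samplesheet() {
    awk -F',' '
        { sub(/\r$/, "") }                    # tolerate Windows line endings
        in_data && !seen_header {
            seen_header = 1
            for (i = 1; i <= NF; i++) col[$i] = 1
            n = split("Sample_ID Sample_Name index", need, " ")
            for (j = 1; j <= n; j++) {
                if (!(need[j] in col)) {
                    print "missing column: " need[j] > "/dev/stderr"
                    bad = 1
                }
            }
        }
        /^\[Data\]/ { in_data = 1 }           # the header is the next line
        END { if (!seen_header) bad = 1; exit bad }
    ' "$1"
}

# Demo using the first rows of the sheet shown above.
demo_dir="$(mktemp -d)"
cat > "$demo_dir/SampleSheet.csv" <<'EOF'
[Data],,,,,,,
Sample_ID,Sample_Name,Sample_Plate,Sample_Well,I7_Index_ID,index,I5_Index_ID,index2
1,WT1_Th0_1,,A1,N701,TAAGGCGA,S502,ATAGAGAG
2,WT1_Th0_2,,B1,N701,TAAGGCGA,S503,AGAGGATA
EOF

if check_samplesheet "$demo_dir/SampleSheet.csv"; then
    echo "SampleSheet.csv has the required columns"
fi
```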