Setup
Preprocessing
The primary inputs to the CRISPR Screen pipeline are .fastq files located in genomics_xavier_bucket/CRISPR_Screen_Sequencing/<experiment_name>/fastq/. Unlike in many of our other analyses, these .fastq files should not be split by sample; they should be split into index files and read files. The shell script that runs bcl2fastq in this way (on the Broad server) can be found here. Before running the script, make sure it will ignore the SampleSheet.csv file that should be in the directory: either rename the SampleSheet or move it to a different location while the script is running. The resulting .fastq files will be in the Reads subdirectory of your experiment directory on the Broad server.
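The rename-and-restore step can be sketched as a small Python helper. This is only an illustration, not part of the pipeline: the function name sample_sheet_hidden and the ".hidden" suffix are hypothetical, and the demonstration runs on a throwaway directory instead of a real sequencing run.

```python
import contextlib
import tempfile
from pathlib import Path


@contextlib.contextmanager
def sample_sheet_hidden(run_dir, suffix=".hidden"):
    """Temporarily rename SampleSheet.csv inside run_dir (hypothetical helper)."""
    sheet = Path(run_dir) / "SampleSheet.csv"
    hidden = sheet.with_name(sheet.name + suffix)
    moved = sheet.exists()
    if moved:
        sheet.rename(hidden)
    try:
        yield hidden if moved else None
    finally:
        # Restore the original name even if the wrapped command fails.
        if moved and hidden.exists():
            hidden.rename(sheet)


# Demonstration on a temporary directory; in practice the body of the
# `with` block would launch the bcl2fastq shell script.
with tempfile.TemporaryDirectory() as demo_dir:
    demo_sheet = Path(demo_dir) / "SampleSheet.csv"
    demo_sheet.write_text("[Data]\n")
    with sample_sheet_hidden(demo_dir):
        visible_during_run = demo_sheet.exists()
    restored_after_run = demo_sheet.exists()
```

Using a context manager means the SampleSheet is put back under its original name even when the conversion step exits with an error.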
The CRISPR Screen pipeline is also somewhat non-traditional in that it has the capacity to incorporate data from multiple experiments into the same analysis, allowing you to treat runs as replicates. This affects how you will need to name the .fastq files you just generated. If you are not using multiple runs, your .fastq files will need to be named:
single run
all_I1.fastq.gz
all_R1.fastq.gz
If you do have multiple runs, you can specify names for them in your configuration file (described below). For example, if you had two runs you were calling “x7” and “x9”, your .fastq files would need to be named:
multiple runs
x7_I1.fastq.gz
x7_R1.fastq.gz
x9_I1.fastq.gz
x9_R1.fastq.gz
If you have multiple runs, every run must use the same sample names (as shown in the condition_file example below).
As a reminder, the .fastq files will need to be copied into the genomics_xavier_bucket/CRISPR_Screen_Sequencing/<experiment_name>/fastq/ Google bucket before running the pipeline.
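As a quick check before copying, the naming rules above can be written down as a short Python sketch. The function names are hypothetical; the bucket prefix and the I1/R1 naming come from the text above, with "all" as the prefix for a single run.

```python
BUCKET_ROOT = "genomics_xavier_bucket/CRISPR_Screen_Sequencing"


def expected_fastq_names(runs):
    """Return the index (I1) and read (R1) file names implied by each run prefix."""
    return [f"{run}_{kind}.fastq.gz" for run in runs for kind in ("I1", "R1")]


def expected_bucket_paths(experiment_name, runs):
    """Prefix each expected file name with the experiment's fastq/ location."""
    return [
        f"{BUCKET_ROOT}/{experiment_name}/fastq/{name}"
        for name in expected_fastq_names(runs)
    ]


single = expected_fastq_names(["all"])
multiple = expected_fastq_names(["x7", "x9"])
```

Comparing this list against the files actually present in the fastq/ directory is one way to catch a misnamed file before the pipeline is launched.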
Configuration
experiment_name
Refers to the common Xavier Lab naming convention of <date_initials>. This sets the Google bucket location of the input and output files for the experiment: genomics_xavier_bucket/CRISPR_Screen_Sequencing/<experiment_name>.
library_name
Refers to the chip and ref files required to map between perturbations and genes. The pipeline will search the crispr_reference_data bucket for an associated chip and ref file (each must be named <library_name>_<{chip,ref}>.tsv). If only one of the two exists, the pipeline will build the other from it. If neither exists, the pipeline cannot run.
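The lookup rule can be summarized in a few lines of Python. This is a sketch of the behavior described above, not the pipeline's own code: the function name library_status, the returned messages, and the library name "mylib" are all hypothetical.

```python
def library_status(library_name, bucket_files):
    """Describe what the lookup rule implies for a set of file names in the bucket."""
    chip = f"{library_name}_chip.tsv"
    ref = f"{library_name}_ref.tsv"
    has_chip = chip in bucket_files
    has_ref = ref in bucket_files
    if has_chip and has_ref:
        return "both files present"
    if has_chip:
        return f"build {ref} from {chip}"
    if has_ref:
        return f"build {chip} from {ref}"
    return "error: neither file found"


# "mylib" is a placeholder library name used only for illustration.
chip_only = library_status("mylib", {"mylib_chip.tsv"})
missing = library_status("mylib", {"otherlib_ref.tsv"})
```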
Users can find GPP library files here if they need to add a new chip or ref file to the Google bucket. We strongly recommend taking the chip file that has Preferred in its description on the Broad site linked above and letting the pipeline build the ref file. If you add a new library chip or ref file to the Google bucket, please make sure it is a .tsv file.
Additionally, the chip file should have only two columns: the barcode sequence and the gene symbol. Chip files from the Broad website may have three columns (the two listed previously plus gene ID), so you may need to remove the extraneous column before moving the file to Google Cloud.
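Removing the extra column can be done with a few lines of Python. The sketch below assumes the barcode and gene symbol are the first two columns and the gene ID is the third; that column order is an assumption, so check your file and adjust keep if it differs. The gene ID values in the demonstration are placeholders, not real identifiers.

```python
import csv
import io


def trim_chip(src, dst, keep=(0, 1)):
    """Copy a tab-separated chip file, keeping only the column positions in `keep`.

    Assumption: columns 0 and 1 hold the barcode and the gene symbol.
    A header row, if present, is copied through like any other row.
    """
    reader = csv.reader(src, delimiter="\t")
    writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
    for row in reader:
        if row:  # skip blank lines
            writer.writerow([row[i] for i in keep])


# In-memory demonstration; ID_A and ID_B stand in for the gene ID column.
three_col = io.StringIO(
    "CAAAACTCCTATTATCCACC\tAAK1\tID_A\n"
    "AGCACCGGAACCACGCCCGG\tAATK\tID_B\n"
)
trimmed = io.StringIO()
trim_chip(three_col, trimmed)
```

For real files, pass handles opened with open(path, newline="") in place of the StringIO objects.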
If you get a custom library file and you are not sure whether it is a ref or chip file, you can follow this rule of thumb – a ref file has unique values in the gene symbol column, whereas the chip file does not. The uniqueness in the ref file is enforced by adding an index at the end of the gene symbol. Below are examples of how the tops of a pair of ref and chip files might look:
ref_file
CAAAACTCCTATTATCCACC AAK1_1
CAAAAGGCCGGATATTTACC AAK1_2
GGGTGCAATTGAAGTCTCTG AAK1_3
TGGCAAAATCATCACTACGA AAK1_4
AGCACCGGAACCACGCCCGG AATK_5
CTCCTCAAGTCCACAGACGT AATK_6
GCATAGCAACCTGCTCGTCG AATK_7
GTGTCAGCCAACAACAACAG AATK_8
chip_file
CAAAACTCCTATTATCCACC AAK1
CAAAAGGCCGGATATTTACC AAK1
GGGTGCAATTGAAGTCTCTG AAK1
TGGCAAAATCATCACTACGA AAK1
AGCACCGGAACCACGCCCGG AATK
CTCCTCAAGTCCACAGACGT AATK
GCATAGCAACCTGCTCGTCG AATK
GTGTCAGCCAACAACAACAG AATK
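The relationship between the two files, and the rule of thumb for telling them apart, can be sketched in Python. The pipeline's own ref-building code is not shown here, so the indexing scheme below is inferred from the example (a running 1-based index that continues across genes); both function names are hypothetical.

```python
CHIP_ROWS = [
    ("CAAAACTCCTATTATCCACC", "AAK1"),
    ("CAAAAGGCCGGATATTTACC", "AAK1"),
    ("GGGTGCAATTGAAGTCTCTG", "AAK1"),
    ("TGGCAAAATCATCACTACGA", "AAK1"),
    ("AGCACCGGAACCACGCCCGG", "AATK"),
    ("CTCCTCAAGTCCACAGACGT", "AATK"),
    ("GCATAGCAACCTGCTCGTCG", "AATK"),
    ("GTGTCAGCCAACAACAACAG", "AATK"),
]


def build_ref_rows(chip_rows):
    """Append a running 1-based index to each gene symbol (scheme inferred from the example)."""
    return [
        (barcode, f"{gene}_{i}")
        for i, (barcode, gene) in enumerate(chip_rows, start=1)
    ]


def looks_like_ref(rows):
    """Rule of thumb: a ref file has no repeated values in its second column."""
    labels = [label for _, label in rows]
    return len(labels) == len(set(labels))


ref_rows = build_ref_rows(CHIP_ROWS)
```

Note that the check is only a heuristic: a chip file with a single barcode per gene would also pass it.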
runs
Array of names referring to multiple runs (if they exist). If there is only one run, you should set runs to ["all"]. In the multiple-run example from above, runs would be ["x7", "x9"]. As noted in the Preprocessing section above, the individual elements of the runs parameter are the prefixes of the .fastq files. Within the pipeline, they also become suffixes of the sample names in the gene counts file (as illustrated below in the samples_described_file).
condition_file
Path to the file matching sample names with barcodes (information that should be contained in the SampleSheet). These sample names (in the second column) should not contain runs values as suffixes. This should be saved as a .tsv file.
condition_file
TTGAACCG Pre_sorting_1
CCTCCAAT APC_high_1
TTAGACTA APC_low_1
GGTCACCG Pre_sorting_2
CCTCTGTA APC_high_2
TTGACAAT APC_low_2
samples_described_file
Path to the file matching sample names with their associated conditions. In the example below, each condition (Pre_sorting, APC_high, and APC_low) has two samples matched to it from each of the runs (x7 and x9). The sample names in the second column should contain runs values as suffixes. This should be saved as a .tsv file.
samples_described_file_multiple_runs
Pre_sorting Pre_sorting_1_x7
APC_high APC_high_1_x7
APC_low APC_low_1_x7
Pre_sorting Pre_sorting_2_x7
APC_high APC_high_2_x7
APC_low APC_low_2_x7
Pre_sorting Pre_sorting_1_x9
APC_high APC_high_1_x9
APC_low APC_low_1_x9
Pre_sorting Pre_sorting_2_x9
APC_high APC_high_2_x9
APC_low APC_low_2_x9
An example with only one run would look very similar:
samples_described_file_one_run
Pre_sorting Pre_sorting_1_all
APC_high APC_high_1_all
APC_low APC_low_1_all
Pre_sorting Pre_sorting_2_all
APC_high APC_high_2_all
APC_low APC_low_2_all
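Because the second column is just the condition_file sample name with a runs value appended, the rows above can be generated from the other two inputs. The Python sketch below is one way to do that. It assumes the condition name is the sample name with its trailing replicate number removed, which holds for these examples but is not a rule stated by the pipeline; the function name is hypothetical.

```python
import re

# Sample names from the second column of the condition_file example.
SAMPLES = [
    "Pre_sorting_1", "APC_high_1", "APC_low_1",
    "Pre_sorting_2", "APC_high_2", "APC_low_2",
]


def describe_samples(sample_names, runs):
    """Build (condition, sample_with_run_suffix) rows for every run."""
    rows = []
    for run in runs:
        for sample in sample_names:
            # Assumption: a trailing "_<number>" is the replicate index.
            condition = re.sub(r"_\d+$", "", sample)
            rows.append((condition, f"{sample}_{run}"))
    return rows


multi_run_rows = describe_samples(SAMPLES, ["x7", "x9"])
one_run_rows = describe_samples(SAMPLES, ["all"])
```

Writing each row as two tab-separated fields gives a file in the layout shown above.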
samples_compared_file
Path to the file specifying the pairs of conditions on which differential expression should be performed. The two conditions in each pair must match names in the first column of samples_described_file and be separated by "_vs_". This should be saved as a .tsv file.
samples_compared_file
Pre_sorting_vs_APC_high
Pre_sorting_vs_APC_low
APC_high_vs_APC_low
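A comparison file can be checked against the known conditions before the pipeline is run. The following Python sketch is a hypothetical validator, not part of the pipeline; it splits each line on the first "_vs_" and confirms that both sides are known condition names.

```python
CONDITIONS = {"Pre_sorting", "APC_high", "APC_low"}


def parse_comparison(line, conditions):
    """Split 'A_vs_B' into (A, B) and check both against the known conditions."""
    left, sep, right = line.strip().partition("_vs_")
    if not sep:
        raise ValueError(f"missing '_vs_' separator: {line!r}")
    for name in (left, right):
        if name not in conditions:
            raise ValueError(f"unknown condition {name!r} in {line!r}")
    return left, right


pairs = [
    parse_comparison(line, CONDITIONS)
    for line in ["Pre_sorting_vs_APC_high", "Pre_sorting_vs_APC_low", "APC_high_vs_APC_low"]
]
```

A misspelled condition, such as Pre-sorting written with a hyphen, raises a ValueError here instead of surfacing later in the analysis.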