Setup

Preprocessing

The primary inputs to the CRISPR Screen pipeline are .fastq files located in genomics_xavier_bucket/CRISPR_Screen_Sequencing/<experiment_name>/fastq/. Unlike many of our other analyses, these .fastq files should not be separated by sample, but rather into indices and reads. The shell script to run bcl2fastq in this way (on the Broad server) can be found here. Before running this script, make sure it will ignore the SampleSheet.csv file that should be in the directory: either rename the SampleSheet or move it to a different location while the script is running. The resulting .fastq files will be located in the Reads subdirectory of your experiment directory on the Broad server.

The CRISPR Screen pipeline is also somewhat non-traditional in that it has the capacity to incorporate data from multiple experiments into the same analysis, allowing you to treat runs as replicates. This affects how you will need to name the .fastq files you just generated. If you are not using multiple runs, your .fastq files will need to be named:

single run
all_I1.fastq.gz
all_R1.fastq.gz

If you do have multiple runs, you can specify names for them in your configuration file (described below). For example, if you had two runs you were calling “x7” and “x9”, your .fastq files would need to be named:

multiple runs
x7_I1.fastq.gz
x7_R1.fastq.gz
x9_I1.fastq.gz
x9_R1.fastq.gz

If you have multiple runs, every run must use the same sample names (as shown in the condition_file below).
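The naming rule above can be sketched as a small helper that lists the file names the pipeline would expect for a given set of run names. This is only an illustration of the convention; the function name expected_fastq_names is hypothetical and not part of the pipeline.

```python
def expected_fastq_names(runs):
    """Return the .fastq.gz file names implied by the naming convention above."""
    names = []
    for run in runs:
        # Each run needs one index file (I1) and one read file (R1).
        names.append(f"{run}_I1.fastq.gz")
        names.append(f"{run}_R1.fastq.gz")
    return names


# A single run uses the prefix "all"; multiple runs use their own names.
print(expected_fastq_names(["all"]))
print(expected_fastq_names(["x7", "x9"]))
```

Comparing this list against the contents of your fastq/ directory before uploading is a quick way to catch a misnamed file.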

As a reminder, the .fastq files will need to be copied into the genomics_xavier_bucket/CRISPR_Screen_Sequencing/<experiment_name>/fastq/ Google bucket before running the pipeline.

Configuration

experiment_name

Refers to the common Xavier Lab naming convention of <date_initials>. This sets the Google bucket location of the input and output files for the experiment: genomics_xavier_bucket/CRISPR_Screen_Sequencing/<experiment_name>.

library_name

Refers to the chip and ref files required to map between perturbations and genes. The pipeline will search the crispr_reference_data bucket for an associated chip and ref file (name must be <library_name>_<{chip,ref}>.tsv). If only one of those exists, the pipeline will build the other. If neither exists, the pipeline will not work.
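The lookup described above can be pictured with the following sketch. It is not the pipeline's actual code: the function library_action and its return strings are hypothetical, and the set of existing file names stands in for a listing of the crispr_reference_data bucket.

```python
def library_action(library_name, existing_files):
    """Decide what to do for a library, given the file names already in the bucket."""
    chip = f"{library_name}_chip.tsv"
    ref = f"{library_name}_ref.tsv"
    has_chip = chip in existing_files
    has_ref = ref in existing_files

    if has_chip and has_ref:
        return "use both files"
    if has_chip:
        return f"build {ref} from {chip}"
    if has_ref:
        return f"build {chip} from {ref}"
    # Neither file is present, so there is nothing to map barcodes to genes.
    raise FileNotFoundError(f"no chip or ref file found for library '{library_name}'")


# "mylib" is a made-up library name used only for illustration.
print(library_action("mylib", {"mylib_chip.tsv"}))
```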

Users can find GPP library files here if they need to add a new chip or ref file to the Google bucket. We strongly recommend loading the chip file with Preferred in the description from the Broad site linked above, then letting the pipeline build the ref file. If you add a new library chip or ref file to the Google bucket, please make sure it is a .tsv file.

Additionally, the chip file should have only two columns: the barcode sequence and the gene symbol. Chip files from the Broad website may have three columns (the two listed previously plus gene ID), so you may need to remove the extraneous column before moving the file to Google Cloud.
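One way to drop the extra column is sketched below. It assumes the file is tab-separated and that the barcode and gene symbol are the first two columns; check the downloaded file and adjust keep_columns if the order differs. The function name trim_chip_file and the file names in the usage comment are hypothetical, and any header row is copied through like a data row.

```python
import csv


def trim_chip_file(src_path, dst_path, keep_columns=(0, 1)):
    """Copy a tab-separated chip file, keeping only the chosen columns."""
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.reader(src, delimiter="\t")
        writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
        for row in reader:
            if not row:
                continue  # skip blank lines
            writer.writerow([row[i] for i in keep_columns])


# Example call (file names are placeholders):
# trim_chip_file("downloaded_chip.tsv", "mylib_chip.tsv")
```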

If you get a custom library file and you are not sure whether it is a ref or chip file, you can follow this rule of thumb: a ref file has unique values in the gene symbol column, whereas a chip file does not. The uniqueness in the ref file is enforced by adding an index at the end of the gene symbol. Below are examples of how the tops of a pair of ref and chip files might look:

ref_file
CAAAACTCCTATTATCCACC        AAK1_1
CAAAAGGCCGGATATTTACC        AAK1_2
GGGTGCAATTGAAGTCTCTG        AAK1_3
TGGCAAAATCATCACTACGA        AAK1_4
AGCACCGGAACCACGCCCGG        AATK_5
CTCCTCAAGTCCACAGACGT        AATK_6
GCATAGCAACCTGCTCGTCG        AATK_7
GTGTCAGCCAACAACAACAG        AATK_8

chip_file
CAAAACTCCTATTATCCACC        AAK1
CAAAAGGCCGGATATTTACC        AAK1
GGGTGCAATTGAAGTCTCTG        AAK1
TGGCAAAATCATCACTACGA        AAK1
AGCACCGGAACCACGCCCGG        AATK
CTCCTCAAGTCCACAGACGT        AATK
GCATAGCAACCTGCTCGTCG        AATK
GTGTCAGCCAACAACAACAG        AATK
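The rule of thumb for telling the two file types apart can be expressed as a short check. This is a sketch of the heuristic only, with a hypothetical function name; it takes (barcode, gene symbol) pairs that you have already read from the file.

```python
def looks_like_ref(rows):
    """Return True if every gene symbol (second item of each row) is unique."""
    symbols = [symbol for _barcode, symbol in rows]
    return len(symbols) == len(set(symbols))


chip_rows = [
    ("CAAAACTCCTATTATCCACC", "AAK1"),
    ("CAAAAGGCCGGATATTTACC", "AAK1"),
]
ref_rows = [
    ("CAAAACTCCTATTATCCACC", "AAK1_1"),
    ("CAAAAGGCCGGATATTTACC", "AAK1_2"),
]

print(looks_like_ref(chip_rows))  # False
print(looks_like_ref(ref_rows))   # True
```

Note that a chip file with only one barcode per gene would also pass this check, so treat the result as a hint rather than proof.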

runs

Array of names referring to multiple runs (if they exist). If there is only one run, you should set runs to ["all"]. In the multiple run example from above, runs would be ["x7", "x9"]. As noted in the Preprocessing section above, the individual elements of the runs parameter refer to the prefixes of the .fastq files. Within the pipeline, they also become suffixes of the sample names in the gene counts file (as illustrated below in the samples_described_file).

condition_file

Path to the file matching sample names with barcodes (information that should be contained in the SampleSheet). These sample names (in the second column) should not contain runs values as suffixes. This should be saved as a .tsv file.

condition_file
TTGAACCG    Pre_sorting_1
CCTCCAAT    APC_high_1
TTAGACTA    APC_low_1
GGTCACCG    Pre_sorting_2
CCTCTGTA    APC_high_2
TTGACAAT    APC_low_2

samples_described_file

Path to the file matching sample names with their associated conditions. In the example below, each condition (Pre_sorting, APC_high, and APC_low) has two samples matched to it from each of the runs (x7 and x9). The sample names in the second column should contain runs values as suffixes. This should be saved as a .tsv file.

samples_described_file_multiple_runs
Pre_sorting Pre_sorting_1_x7
APC_high    APC_high_1_x7
APC_low     APC_low_1_x7
Pre_sorting Pre_sorting_2_x7
APC_high    APC_high_2_x7
APC_low     APC_low_2_x7
Pre_sorting Pre_sorting_1_x9
APC_high    APC_high_1_x9
APC_low     APC_low_1_x9
Pre_sorting Pre_sorting_2_x9
APC_high    APC_high_2_x9
APC_low     APC_low_2_x9

An example with only one run would look very similar:

samples_described_file_one_run
Pre_sorting Pre_sorting_1_all
APC_high    APC_high_1_all
APC_low     APC_low_1_all
Pre_sorting Pre_sorting_2_all
APC_high    APC_high_2_all
APC_low     APC_low_2_all
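Because the second column is just each condition_file sample name with a run name appended, the rows can be generated rather than typed by hand. The sketch below assumes that the condition is the sample name with its trailing "_<replicate>" removed, which holds for the examples here but may not hold for your own naming; the function name described_rows is hypothetical.

```python
def described_rows(sample_names, runs):
    """Build (condition, sample_with_run_suffix) rows for samples_described_file."""
    rows = []
    for run in runs:
        for sample in sample_names:
            # Assumption: "Pre_sorting_1" belongs to condition "Pre_sorting".
            condition = sample.rsplit("_", 1)[0]
            rows.append((condition, f"{sample}_{run}"))
    return rows


samples = ["Pre_sorting_1", "APC_high_1", "APC_low_1"]
for condition, name in described_rows(samples, ["x7", "x9"]):
    print(f"{condition}\t{name}")
```

For a single run, pass ["all"] as the runs argument.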

samples_compared_file

Path to the file specifying the pairs of conditions on which we want differential expression performed. Conditions must match the names in the first column of samples_described_file and be separated by “_vs_”. This should be saved as a .tsv file.

samples_compared_file
Pre_sorting_vs_APC_high
Pre_sorting_vs_APC_low
APC_high_vs_APC_low
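A mismatch between this file and the first column of samples_described_file is easy to introduce (for example, Pre-sorting instead of Pre_sorting). The following sketch checks each comparison against a list of known conditions; the function name check_comparisons is hypothetical and the check is not part of the pipeline itself.

```python
def check_comparisons(comparisons, conditions):
    """Return a list of problems; an empty list means every line looks valid."""
    known = set(conditions)
    problems = []
    for line in comparisons:
        parts = line.split("_vs_")
        if len(parts) != 2:
            problems.append(f"{line}: expected exactly one '_vs_' separator")
            continue
        for name in parts:
            if name not in known:
                problems.append(f"{line}: unknown condition '{name}'")
    return problems


conditions = ["Pre_sorting", "APC_high", "APC_low"]
comparisons = ["Pre_sorting_vs_APC_high", "Pre_sorting_vs_APC_low", "APC_high_vs_APC_low"]
print(check_comparisons(comparisons, conditions))  # []
```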