Setup

Configuration

reference_genome

This is the reference genome (in FASTA format) for the organism your samples come from. We recommend checking the rnaseq_reference_data/genome/ Google bucket in the Genomics-Xavier workspace to see whether the reference you need already exists. If it does, we suggest using the existing genome in that location. If not, we suggest adding your new reference genome to that location. You need only provide the FASTA file: the pipeline searches the rnaseq_reference_data/bwa_indices/ bucket for the index files that BWA needs for alignment and creates them if they do not already exist.

Finally, we recommend (if possible) that you use genome builds from the UCSC Genome Browser. Other builds may name genomic regions differently, which could cause the regions argument not to work. For example, where the UCSC genome expects a region to be written as chr2:240630275-240630475, another genome may expect the same region to be written as 2:240630275-240630475.
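One quick way to see which naming convention a reference uses is to list its sequence names. The sketch below assumes you have a local copy of the FASTA file; fasta_contigs is a hypothetical helper name and is not part of the pipeline.

```shell
# List the sequence names in a FASTA file: the text after '>' on each
# header line, up to the first whitespace.
# Usage: fasta_contigs path/to/genome.fa
fasta_contigs() {
    awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$1"
}
```

If the names printed look like chr1, chr2, and so on, the genome follows the UCSC style; names such as 1, 2 indicate the other convention described above.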

read_info_file

Each line of this file should contain the group name followed by the names of the two files corresponding to the two ends of a paired-end sequencing experiment. Usually, paired reads share the same prefix followed by R1 (for the left reads) and R2 (for the right reads). Each line of read_info.tsv should contain only the group name and the names of these two files, separated by tab characters. It is very important that this is a TSV file (with the .tsv extension) and that it is uniformly formatted, so that Cromwell’s read_pairs function interprets the file properly. One way to check this is to run the following command in bash:

cat -A path/to/read_info.tsv

This should print ^I (the marker for a tab) between the fields of each line and $ at the end of each line. Make sure there is no extra $ on a line of its own at the end of the file, which would indicate a trailing blank line.

read_pairs_file
B6l4^Igs://genomics-xavier/<experiment_type>/<experiment_name>/fastq/B6L4_S1_R1.fastq^Igs://genomics-xavier/<experiment_type>/<experiment_name>/fastq/B6L4_S1_R2.fastq$
B6l4^Igs://genomics-xavier/<experiment_type>/<experiment_name>/fastq/B6L4_S2_R1.fastq^Igs://genomics-xavier/<experiment_type>/<experiment_name>/fastq/B6L4_S2_R2.fastq$
B6l4^Igs://genomics-xavier/<experiment_type>/<experiment_name>/fastq/B6L4_S3_R1.fastq^Igs://genomics-xavier/<experiment_type>/<experiment_name>/fastq/B6L4_S3_R2.fastq$
B6l8^Igs://genomics-xavier/<experiment_type>/<experiment_name>/fastq/B6L8_S1_R1.fastq^Igs://genomics-xavier/<experiment_type>/<experiment_name>/fastq/B6L8_S1_R2.fastq$
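
The visual check with cat -A can also be automated. The following is a minimal sketch, assuming a local copy of the file; check_read_info is a hypothetical helper name, not something the pipeline provides. It only checks the layout described above (three non-empty, tab-separated fields per line and no blank lines), not whether the listed files exist.

```shell
# Check that every line of a read info file has exactly three non-empty,
# tab-separated fields. Blank lines are reported as having 0 fields.
# Prints one message per problem line and returns non-zero if any were found.
# Usage: check_read_info path/to/read_info.tsv
check_read_info() {
    awk -F '\t' '
        NF != 3 {
            printf "line %d: expected 3 tab-separated fields, found %d\n", NR, NF
            bad = 1
            next
        }
        $1 == "" || $2 == "" || $3 == "" {
            printf "line %d: empty field\n", NR
            bad = 1
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}
```

A file that passes prints nothing. A line that uses spaces instead of tabs is reported as having 1 field, and a trailing blank line is reported as having 0 fields.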

regions

This parameter sets the genomic regions in which you would like to look for variants. If you want to search for variants in the entire genome, use the argument "full_genome". Otherwise, each region should be of the form chr<x>:start_index-end_index, for example chr2:240630275-240630475.

The argument takes the form of an array, even if you are looking in only one region. For example, a valid argument would be ['chr2:240630275-240630475'].
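
Before launching a run, it can help to sanity-check each region string. The sketch below is illustrative only: is_valid_region is a hypothetical helper name, and the requirement that the start not exceed the end is an assumption rather than a documented rule of the pipeline. The sequence name is deliberately not required to begin with chr, since other genome builds may omit the prefix.

```shell
# Return 0 if the argument is "full_genome" or has the form
# <name>:<start>-<end> with numeric coordinates and start <= end
# (the ordering check is an assumption); return 1 otherwise.
# Usage: is_valid_region chr2:240630275-240630475
is_valid_region() {
    if [ "$1" = "full_genome" ]; then
        return 0
    fi
    printf '%s\n' "$1" | awk '
        # name without colon or whitespace, colon, start, hyphen, end
        /^[^: \t]+:[0-9]+-[0-9]+$/ {
            split($0, parts, ":")
            split(parts[2], range, "-")
            if (range[1] + 0 <= range[2] + 0) ok = 1
        }
        END { exit (ok ? 0 : 1) }
    '
}
```

For example, is_valid_region chr2:240630275-240630475 succeeds, while is_valid_region chr2 fails because no coordinates are given.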

If you are having trouble with this argument, check that the region names you specified match the way your reference genome names its sequences (for example, chr2 versus 2).
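
A mismatch of this kind can be detected by comparing the name in the region with the header lines of the reference. This is a rough sketch that assumes a local copy of the FASTA file; region_in_fasta is a hypothetical helper name and is not part of the pipeline.

```shell
# Return 0 if the sequence name of a region (the part before ':') matches
# a sequence name in the FASTA file; print a hint and return 1 otherwise.
# Usage: region_in_fasta chr2:240630275-240630475 path/to/genome.fa
region_in_fasta() {
    name=${1%%:*}
    if awk -v name="$name" '
        # Appending "" forces a string comparison, so "02" does not match "2".
        /^>/ { sub(/^>/, "", $1); if ($1 "" == name "") found = 1 }
        END { exit (found ? 0 : 1) }
    ' "$2"; then
        return 0
    fi
    echo "sequence '$name' not found in $2" >&2
    return 1
}
```

If the check fails for chr2:240630275-240630475 but passes for 2:240630275-240630475, the reference uses names without the chr prefix and the region should be written accordingly.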

vcf_output_path

This parameter sets the output directory for the VCF files/data generated by the pipeline.