Introduction
The TreeVal pipeline has a few requirements before being able to run:
- The gene_alignment_data (this refers to the alignment.data_dir and alignment.geneset data noted in the yaml, which we will explain later) and synteny_data must follow a particular directory structure.
- HiC CRAM files must be pre-indexed in the same location as the CRAM file, e.g., samtools index {cram file}. A check and automated indexing of the cram file will be added in the future.
- Finally, the yaml file (described below in Full samplesheet) needs to contain all of the information related to the assembly for the pipeline to run.
Prior to running TreeVal
Please ensure you read the following sections on Directory Structure (gene_alignment_data, synteny, scripts), HiC data preparation and longread (PacBio) data preparation. Without these you may not be able to successfully run the TreeVal pipeline. If anything is unclear, please open an issue report.
We now also support (and encourage) using the nf-co2footprint plugin (on Nextflow versions >= 23.07), which generates statistics on how much energy your pipeline uses as well as the amount of CO2 it produces. As it is pre-release, you will need to compile this plugin yourself and store it in your $NXF_HOME/plugins directory, which you can find with echo $NXF_HOME. We have included the relevant config file, co2footprint.config, in this repo. The plugin can be used by including -plugins nf-co2footprint@{VERSION} -c co2footprint.config in your nextflow command. Please head to the NF-CO2FOOTPRINT website to find out more.
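For example, a run with the plugin enabled might look like the following. This is only a sketch that reuses the typical run command shown later in this document; {VERSION} stands in for whichever plugin release you compiled, and the other arguments are the usual placeholders:
nextflow run sanger-tol/treeval --input assets/treeval.yaml --outdir <OUTDIR> -profile singularity,sanger -plugins nf-co2footprint@{VERSION} -c co2footprint.config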
Local testing
Details
We provide a complete set of test data that can be used to test the pipeline locally.
First, choose a download location ${TREEVAL_TEST_DATA} and run these commands:
cd ${TREEVAL_TEST_DATA}
curl https://tolit.cog.sanger.ac.uk/test-data/resources/treeval/TreeValTinyData.tar.gz | tar xzf -
sed -i "s|/home/runner/work/treeval/treeval|${TREEVAL_TEST_DATA}|" TreeValTinyData/gene_alignment_data/fungi/csv_data/LaetiporusSulphureus.gfLaeSulp1-data.csv
sed -i "s|/home/runner/work/treeval/treeval|${TREEVAL_TEST_DATA}|" assets/github_testing/TreeValTinyTest.yaml
nextflow run main.nf -profile test_github,singularity {OTHER ARGUMENTS}
This downloads and decompresses the test data. The sed commands then rewrite references from GitHub locations to local locations in the gene_alignment csv file as well as the treeval yaml file.
You should now be able to run the pipeline as you see fit.
Directory Structure
Details
The working example in the Gene Alignment and Synteny Data and Directories section below will cover setting up the synteny and gene_alignment_data directories, as well as downloading some example data.
These two sub-workflows, for now, need the use of the variables defined_class, synteny_genome_path, data_dir and geneset. These variables are found inside the yaml (this is the file that tells TreeVal what everything is and where to find it). Currently, we don't use common_name, e.g., bee, wasp, moth, etc. However, we hope to make use of it in the future as our gene_alignment_data "database" grows and requires a higher degree of organisation.
First, you should set up a directory in our recommended structure:
treeval-resources
│
├─ gene_alignment_data/
│   └─ { defined_class }
│       ├─ csv_data
│       │   └─ { Organism.Accession }-data.csv    # Generated by our scripts
│       └─ { Organism }                           # Here and below is generated by our scripts
│           └─ { Organism.Accession }
│               ├─ cdna
│               │   └─ { Chunked fasta files }
│               ├─ rna
│               │   └─ { Chunked fasta files }
│               ├─ cds
│               │   └─ { Chunked fasta files }
│               └─ peps
│                   └─ { Chunked fasta files }
│
├─ gene_alignment_prep/
│   ├─ scripts/                                   # We supply these in this repo
│   ├─ raw_fasta/                                 # Storing your fasta downloaded from NCBI or Ensembl
│   └─ treeval-datasets.tsv                       # Organism, common_name, clade, family, group, link_to_data, notes
│
├─ synteny/
│   └─ { defined_class }
│
├─ treeval_yaml/                                  # Storage folder for your yaml files; it's useful to keep them
│
└─ treeval_stats/                                 # Storage for your treeval output stats files, e.g. for upload to our repo
The above naming will be further explained below. defined_class can be your own system of classification, as long as it is consistent; at Sanger we use clade names (such as bird, used in the walkthrough below), and we advise you do too. Again, the value entered into the yaml (the file we will use to tell TreeVal where everything is) is used to find the gene_alignment_data as well as the syntenic genomes.
Synteny
Details
For synteny, you should store the full genomic fasta file of any high quality genome you want to be compared against in the directory created above.
For bird, we recommend the Golden Eagle (Aquila chrysaetos) and the Zebrafinch (Taeniopygia guttata), which can be downloaded from NCBI. Rename these files to something more human readable and drop them into the synteny/bird/ folder. Any TreeVal run you now perform where the defined_class is bird will run a syntenic alignment against all genomes in that folder. It would be best to keep this to around three unless more are needed. Again, this is something we could expand on with the common_name field if requested in the future.
Gene Alignment and Synteny Data and Directories
Details
Seeing as this can be quite complicated to set up, here's a walkthrough.
Step 1 - Set up the directories
Let's set up the directory structure as if we want to run it on a bird genome.
mkdir -p gene_alignment_prep/scripts/
cp treeval/bin/treeval-dataprep/* gene_alignment_prep/scripts/
mkdir -p gene_alignment_prep/raw_fasta/
mkdir -p gene_alignment_data/bird/csv_data/
mkdir -p synteny/bird/
The naming of the bird folder here is important; keep this in mind.
So now we have this structure:
~/treeval-resources
│
├─ synteny/
│   └─ bird/
│
├─ gene_alignment_data/
│   └─ bird/
│       └─ csv_data/
│
└─ gene_alignment_prep/
    ├─ scripts/
    └─ raw_fasta/
Step 2 - Download some data
Now, let's download some data. The Zebrafinch (Taeniopygia guttata) makes a good syntenic reference for a bird assembly, and the Chicken (Gallus gallus) a good source of gene alignment data.
cd synteny/bird/
curl https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/003/957/565/GCA_003957565.4_bTaeGut1.4.pri/GCA_003957565.4_bTaeGut1.4.pri_genomic.fna.gz -o bTaeGut1_4.fasta.gz
gunzip bTaeGut1_4.fasta.gz
This leaves us with a file called bTaeGut1_4.fasta, the genomic assembly of bTaeGut1_4 (the Tree of Life ID for this species), also known as Taeniopygia guttata, the Australian Zebrafinch.
Now let's move into the raw_fasta folder and download some data; this may take some time.
cd ../../gene_alignment_prep/raw_fasta/
curl https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/016/699/485/GCF_016699485.2_bGalGal1.mat.broiler.GRCg7b/GCF_016699485.2_bGalGal1.mat.broiler.GRCg7b_cds_from_genomic.fna.gz -o GallusGallus-GRCg7b.cds.fasta.gz
curl https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/016/699/485/GCF_016699485.2_bGalGal1.mat.broiler.GRCg7b/GCF_016699485.2_bGalGal1.mat.broiler.GRCg7b_genomic.fna.gz -o GallusGallus-GRCg7b.cdna.fasta.gz
curl https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/016/699/485/GCF_016699485.2_bGalGal1.mat.broiler.GRCg7b/GCF_016699485.2_bGalGal1.mat.broiler.GRCg7b_protein.faa.gz -o GallusGallus-GRCg7b.pep.fasta.gz
curl https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/016/699/485/GCF_016699485.2_bGalGal1.mat.broiler.GRCg7b/GCF_016699485.2_bGalGal1.mat.broiler.GRCg7b_rna.fna.gz -o GallusGallus-GRCg7b.rna.fasta.gz
Keep note of how these files have been downloaded; the scripts used here need the file name to be in this format: GallusGallus-GRCg7b.rna.fasta.gz, or more generally {Organism}-{Accession}.{data-type}.fasta.gz.
Now that it's all downloaded, we need to prep it. At this point it is all still gzipped (the .gz on the end denotes that the file is compressed), and in this format we can't use it.
The code below will look through the current folder for files ending with .fasta.gz, decompress each one, and then run our python script, GA_data_prep.py, on it.
Command:
for i in *.fasta.gz; do
  gunzip $i;
  python3 ../scripts/GA_data_prep.py ${i/.gz} ncbi 10;
done
The Python script will clean the headers common in NCBI and Ensembl fasta files and then split each file into smaller chunks of header/sequence pairs.
A fasta file may be made up of anywhere between tens and many thousands of these pairs, so in the case of our cdna and pep files they need to be cut up to let TreeVal have a chance of reading them all in a small time frame. cds and rna files will be cut up into 1,000 header/sequence pairs per file; for these, the number given on the command line is ignored. pep and cdna files will be cut up by the number you give, or by 100 by default. This is because pep and cdna files are so much larger.
The smaller the chunk number, the smaller the files you produce, which also means you produce many more files, so there is a trade-off.
GA_data_prep.py will produce a large amount of output in your terminal, looking like:
python3 ../scripts/GA_data_prep.py GallusGallus-GRCg7b.cds.fasta ncbi 100
Your using at least Version 3.6, You are good to go...
os imported
argparse imported
regex imported
WORKING ON: cds--GallusGallus-GRCg7b
Records per file: 1000
Entryfunction called
GallusGallus-GRCg7b.cds.fasta
File found at %s GallusGallus-GRCg7b.cds.fasta
Renaming headers
Read_fasta called
File saved: -- ./GallusGallus/GallusGallus.GRCg7b/cds/GallusGallus1000cds.MOD.fa
File saved: -- ./GallusGallus/GallusGallus.GRCg7b/cds/GallusGallus2001cds.MOD.fa
File saved: -- ./GallusGallus/GallusGallus.GRCg7b/cds/GallusGallus3002cds.MOD.fa
File saved: -- ./GallusGallus/GallusGallus.GRCg7b/cds/GallusGallus4003cds.MOD.fa
File saved: -- ./GallusGallus/GallusGallus.GRCg7b/cds/GallusGallus5004cds.MOD.fa
File saved: -- ./GallusGallus/GallusGallus.GRCg7b/cds/GallusGallus6005cds.MOD.fa
File saved: -- ./GallusGallus/GallusGallus.GRCg7b/cds/GallusGallus7006cds.MOD.fa
File saved: -- ./GallusGallus/GallusGallus.GRCg7b/cds/GallusGallus8007cds.MOD.fa
File saved: -- ./GallusGallus/GallusGallus.GRCg7b/cds/GallusGallus9008cds.MOD.fa
This is pretty much telling us: "yes, you have given me a file, and for every 1,000 header/sequence pairs I have come across (I'm ignoring the number you gave me because this isn't a pep or cdna file) I have made a new file, found here." You'll notice that it has also generated a new set of folders; this is based on how we have named the file.
If you now type ls, you should see the files we have downloaded (GallusGallus-GRCg7b.cds.fasta) and the folder GallusGallus. This folder can now be moved to its permanent home.
mv GallusGallus/ ../../gene_alignment_data/bird/
Step 3 -- Generate the CSV
This file will act as an index of all files we have produced in the gene_alignment_data folder, and thankfully generating it is a very simple step.
cd ../
python3 scripts/GA_csv_gen.py /full/path/to/gene_alignment_data/
Running this will look like:
============> CorvusMon1.bCorMon1 -- bird
Generating CSV for: CorvusMon1.bCorMon1
Save Path: /gene_alignment_data/bird/csv_data/CorvusMon1.bCorMon1-data.csv
============> CorvusMoneduloides.bCorMon1 -- bird
Generating CSV for: CorvusMoneduloides.bCorMon1
Save Path: /gene_alignment_data/bird/csv_data/CorvusMoneduloides.bCorMon1-data.csv
============> Gallus_gallus.UW_022020 -- bird
Generating CSV for: Gallus_gallus.UW_022020
Save Path: /gene_alignment_data/bird/csv_data/Gallus_gallus.UW_022020-data.csv
============> Gallus_gallus.GRCg6a -- bird
Generating CSV for: Gallus_gallus.GRCg6a
Save Path: /gene_alignment_data/bird/csv_data/Gallus_gallus.GRCg6a-data.csv
============> GallusGallus.GRCg7b -- bird
Generating CSV for: GallusGallus.GRCg7b
Save Path: /gene_alignment_data/bird/csv_data/GallusGallus.GRCg7b-data.csv
So what is happening is that the script moves through the directory, identifies each unique folder and generates a CSV summarising the data found there, with the following information:
head -n 5 /gene_alignment_data/bird/csv_data/Gallus_gallus.GRCg6a-data.csv
org,type,data_file
Gallus_gallus.GRCg6a,cds,/gene_alignment_data/bird/Gallus_gallus/Gallus_gallus.GRCg6a/cds/Gallus_gallus9002cds.MOD.fa
Gallus_gallus.GRCg6a,cds,/gene_alignment_data/bird/Gallus_gallus/Gallus_gallus.GRCg6a/cds/Gallus_gallus28453cds.MOD.fa
Gallus_gallus.GRCg6a,cds,/gene_alignment_data/bird/Gallus_gallus/Gallus_gallus.GRCg6a/cds/Gallus_gallus18005cds.MOD.fa
Gallus_gallus.GRCg6a,cds,/gene_alignment_data/bird/Gallus_gallus/Gallus_gallus.GRCg6a/cds/Gallus_gallus6001cds.MOD.fa
This is all used by the pipeline, which generates job ids based on the org column, groups files by the org and type columns and then pulls data from the data_file paths.
Step 4 -- Understand where we are at
So we have now generated the directory structure for gene_alignment_data. Now let's use what we know to fill out the yaml.
The yaml is the file we use to tell the pipeline where everything is; an example can be found here.
Here we can see a number of fields that need to be filled out, the easiest being synteny_genome_path and data_dir. These refer to the directories we made earlier, so we can fill them in as such:
alignment:
data_dir: /FULL/PATH/TO/treeval-resources/gene_alignment_data/
synteny_genome_path: /FULL/PATH/TO/treeval-resources/synteny
I said earlier that the fact we called a folder bird was important; this is because it now becomes our defined_class:
defined_class: bird
During pipeline execution, this is appended onto the end of data_dir and synteny_genome_path in order to find the correct files to use. So now all of the files inside /FULL/PATH/TO/treeval-resources/synteny/bird/ will be used for syntenic alignments. Likewise with our alignment.data_dir: TreeVal will turn this into /FULL/PATH/TO/treeval-resources/gene_alignment_data/bird/ and then append csv_data/.
In Step 3, we generated some files which will be living in our /FULL/PATH/TO/treeval-resources/gene_alignment_data/bird/csv_data/ folder and look like GallusGallus.GRCg7b-data.csv. These (minus the -data.csv) will be what we enter into the geneset field in the yaml. The common_name is a field we don't currently use.
alignment:
data_dir: /FULL/PATH/TO/treeval-resources/gene_alignment_data/
common_name: "" # For future implementation (adding bee, wasp, ant etc)
geneset: "GallusGallus.GRCg7b"
However, what is cool about this field (geneset) is that you can add as many genesets as you want. So say you also have gene alignment data for the Zebrafinch saved in alignment.data_dir as TaeniopygiaGuttata.bTaeGut1_4. The geneset field then becomes: geneset: "GallusGallus.GRCg7b,TaeniopygiaGuttata.bTaeGut1_4"
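Putting that together, the alignment block of the yaml would then look like the following (same placeholder path as above; the Zebrafinch geneset is the hypothetical second dataset from the sentence above):
alignment:
  data_dir: /FULL/PATH/TO/treeval-resources/gene_alignment_data/
  common_name: "" # For future implementation (adding bee, wasp, ant etc)
  geneset: "GallusGallus.GRCg7b,TaeniopygiaGuttata.bTaeGut1_4"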
Hopefully this explains things a bit better and you understand how this sticks together!
HiC data Preparation
Details
Illumina HiC read files should be presented in unmapped CRAM format, and each must be accompanied by an index file (.crai) generated by samtools index. If your unmapped HiC reads are in FASTQ format, you should first convert them to CRAM format using samtools import. Examples are below:
Conversion of FASTQ to CRAM
samtools import -@8 -r ID:{prefix} -r CN:{hic-kit} -r PU:{prefix} -r SM:{sample_name} {prefix}_R1.fastq.gz {prefix}_R2.fastq.gz -o {prefix}.cram
Indexing of CRAM
samtools index {prefix}.cram
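If you have several read sets to convert, a small loop can save some typing. This is only a sketch, assuming paired files named {prefix}_R1.fastq.gz and {prefix}_R2.fastq.gz sit in the current folder; the {hic-kit} and {sample_name} read-group values are the same placeholders as above:
for R1 in *_R1.fastq.gz; do
  prefix=${R1%_R1.fastq.gz}
  # convert the pair to an unmapped CRAM with read-group information, then index it
  samtools import -@8 -r ID:${prefix} -r CN:{hic-kit} -r PU:${prefix} -r SM:{sample_name} ${prefix}_R1.fastq.gz ${prefix}_R2.fastq.gz -o ${prefix}.cram
  samtools index ${prefix}.cram
done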
Longread Data Preparation
Details
Before running the pipeline, longread data must be in fasta.gz format. Because of the software we use this data with, it must also be long-read, single-stranded data. This means you could use ONT too (except duplex reads).
The commands below should help you convert from a mapped bam to fasta.gz, or from fastq to fasta.
If your data isn't already in one of these formats, then let us know and we'll see how we can help.
BAM -> FASTQ
This command iterates through your bam files and converts them to fastq via samtools.
cd { TO FOLDER OF BAM FILES }
mkdir fastq
for i in *bam
do
echo $i
j=${i%.bam}
echo $j
samtools bam2fq ${i} > fastq/${j}.fq
done
FASTQ -> FASTA
This command creates a fasta folder (to store our fasta files), moves into the fastq folder and then converts fastq to fasta using seqtk seq.
mkdir fasta
cd fastq
for i in *fq; do
echo $i
j=${i%.fq}
echo $j
seqtk seq -a $i > ../fasta/${j}.fasta
done
FASTA -> FASTA.GZ
This simply gzips (compresses) the fasta files.
for i in *.fasta; do
echo $i
gzip $i
done
Or if you're a command line ninja
You can do it all in one line, bam -> fasta.gz:
samtools bam2fq {prefix}.bam | seqtk seq -a - | gzip - > {prefix}.fasta.gz
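If you have a whole folder of bams, the same one-liner can be wrapped in a loop. A minimal sketch, assuming it is run from the folder containing the bam files:
for i in *.bam; do
  prefix=${i%.bam}
  # bam -> fastq -> fasta -> fasta.gz in one pipe
  samtools bam2fq ${i} | seqtk seq -a - | gzip - > ${prefix}.fasta.gz
done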
Pretext Accessory File Ingestion
Details
Note: This will require you to install bigWigToBedGraph from the UCSC package. Instructions on downloading this can be found at EXAMPLE #3.
The accessory files generated by the pipeline are not automatically ingested into the pretext files. For this you must use the following code:
cd {outdir}/hic_files
bigWigToBedGraph {coverage.bigWig} /dev/stdout | PretextGraph -i { your.pretext } -n "coverage"
bigWigToBedGraph {repeat_density.bigWig} /dev/stdout | PretextGraph -i { your.pretext } -n "repeat_density"
cat {telomere.bedgraph} | awk -v OFS="\t" '{$4 = 1000; print}' | PretextGraph -i { your.pretext } -n "telomere"
cat {gap.bedgraph} | awk -v OFS="\t" '{$4 = 1000; print}' | PretextGraph -i { your.pretext } -n "gap"
This should now be automatically completed by the pipeline; however, I'll leave it here in case you need it.
Full samplesheet
Details
YAML ("YAML Ain't Markup Language") is a human-readable format that we use to tell TreeVal a number of things, including the assembly location, telomere motif, longread data files (in fasta.gz format) and HiC cram files. The full yaml is detailed below.
YAML contents
The following describes an example YAML file we have used during production, nxOscDF5033.yaml. It contains some annotations we believe to be helpful, plus information on the alignment, synteny, longread and hic data.
assembly
- assem_level: scaffold or contig level assembly (not used).
- assem_version: Used to complete sample_id.
- sample_id: ToLID of the sample.
- latin_name: Latin identification of the species.
- defined_class: Clade name (as used to group synteny sequences and to complete alignment/data_dir).
- project_id: Project id for the ticket (not used).

reference_file: Sample .fa file.

assem_reads
- read_type: { hifi | clr | ont | illumina } To be used in a future update.
- read_data: path (ending with /) to folder containing fasta.gz files.
- supplementary_data: Will be required in future development.

hic_data
- hic_cram: path (ending with /) to folder containing cram files.
- hic_aligner: choice between bwamem2 and minimap2.

alignment
- data_dir: Gene alignment data path (ending with /).
- common_name: For future implementation (adding bee, wasp, ant etc).
- geneset_id: a csv list of geneset data to be used.

kmer_profile
- kmer_length: length of kmer to be used in plotting.
- dir: directory containing an old plot to be regenerated, if applicable.

self_comp
- motif_len: Length of motif to be used in self-complementary sequence finding.
- mummer_chunk: Size of chunks used by the MUMMER module.

synteny
- synteny_genome_path: Path to syntenic genomes grouped by clade.

outdir: Will be required in future development.

intron
- size: base pair size of introns, default is 50k.

telomere
- teloseq: Telomeric motif.

busco
- lineages_path: path to folder above the lineages folder.
- lineage: Example is nematode_odb10.
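To make the layout concrete, here is a minimal sketch of how these fields fit together in a yaml, following the walkthrough's bird example. It is an illustration only: the nesting mirrors the groupings above, and every path, numeric value, { placeholder }, telomere motif and BUSCO lineage shown here is an assumption to be replaced with your own values (compare against the production example nxOscDF5033.yaml before use).
# Illustrative sketch only - replace every { placeholder } and path with your own values
assembly:
  assem_level: scaffold
  assem_version: 1
  sample_id: { ToLID }
  latin_name: { Genus_species }
  defined_class: bird
  project_id: { project_id }
reference_file: /FULL/PATH/TO/assembly.fa
assem_reads:
  read_type: hifi
  read_data: /FULL/PATH/TO/longread/fasta/
  supplementary_data: { path }
hic_data:
  hic_cram: /FULL/PATH/TO/hic/cram/
  hic_aligner: bwamem2
alignment:
  data_dir: /FULL/PATH/TO/treeval-resources/gene_alignment_data/
  common_name: "" # For future implementation (adding bee, wasp, ant etc)
  geneset_id: "GallusGallus.GRCg7b"
kmer_profile:
  kmer_length: 31
  dir: /FULL/PATH/TO/previous/kmer/plots/
self_comp:
  motif_len: { motif_length }
  mummer_chunk: { chunk_size }
synteny:
  synteny_genome_path: /FULL/PATH/TO/treeval-resources/synteny
intron:
  size: "50k"
telomere:
  teloseq: TTAGGG
busco:
  lineages_path: /FULL/PATH/TO/busco/v5/data # folder above the lineages folder
  lineage: aves_odb10
outdir: /FULL/PATH/TO/output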
Notes on using BUSCO
The pipeline requires the use of the BUSCO subworkflows. Create the database directory and move into the directory:
DATE=2023_03
BUSCO=/path/to/databases/busco_${DATE}
mkdir -p $BUSCO
cd $BUSCO
Download BUSCO data and lineages to allow BUSCO to run in offline mode.
wget -r -nH https://busco-data.ezlab.org/v5/data/
The trailing slash after data is important, otherwise wget doesn't download the subdirectories.
Tar gunzip (decompress) all folders that have been stored as tar.gz, in the same parent directories as where they were stored:
find v5/data -name "*.tar.gz" | while read -r TAR; do tar -C `dirname $TAR` -xzf $TAR; done
If you have GNU parallel installed, you can also use the command below which will run faster as it will run the decompression commands in parallel:
find v5/data -name "*.tar.gz" | parallel "cd {//}; tar -xzf {/}"
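The busco block of the yaml should then point at this download. A minimal sketch, assuming the wget command above left the lineage folders under $BUSCO/v5/data/lineages (the yaml wants the folder above lineages) and reusing the lineage example from the YAML contents section; check your own layout and lineage name:
busco:
  lineages_path: /path/to/databases/busco_2023_03/v5/data
  lineage: nematode_odb10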
Sub-workflows
- YAML_INPUT - Reads the input yaml and generates parameters used by other workflows.
- GENERATE_GENOME - Builds the genome description file of the reference genome.
- READ_COVERAGE - Produces read coverage based on the pacbio long read fasta file.
- GAP_FINDER - Identifies contig gaps in the input genome.
- REPEAT_DENSITY - Reports the intensity of regional repeats within an input assembly.
- HIC_MAPPING - Aligns illumina HiC short reads to the input genome and generates mapping files in three formats for visualisation: .pretext, .hic and .mcool.
- TELO_FINDER - Finds a user-given motif in the input genome.
- GENE_ALIGNMENT - Aligns the peptide and nuclear data from assemblies of related species to the input genome.
- INSILICO_DIGEST - Generates a map of enzymatic digests using 3 Bionano enzymes.
- SELFCOMP - Identifies regions of self-complementary sequence.
- SYNTENY - Generates syntenic alignments with other high quality genomes.
- BUSCO_ANALYSIS - Uses BUSCO to identify ancestral elements. Also used to identify ancestral Lepidopteran genes (Merian units).
- KMER - Generates kmer graphs of the assembly.
- KMER_READ_COVERAGE - Generates read coverage using kmers.
Running the pipeline
The typical command for running the pipeline is as follows (if you require the RAPID workflow, you can append -entry RAPID to the command):
nextflow run sanger-tol/treeval --input assets/treeval.yaml --outdir <OUTDIR> -profile singularity,sanger
Here, treeval.yaml contains the information from the YAML contents section above.
Updating the pipeline
When you run the above command, Nextflow automatically pulls the pipeline code from GitHub and stores it as a cached version. When running the pipeline after this, it will always use the cached version if available - even if the pipeline has been updated since. To make sure that you're running the latest version of the pipeline, make sure that you regularly update the cached version of the pipeline:
nextflow pull sanger-tol/treeval
Reproducibility
It is a good idea to specify a pipeline version when running the pipeline on your data. This ensures that a specific version of the pipeline code and software are used when you run your pipeline. If you keep using the same tag, you'll be running the same version of the pipeline, even if there have been changes to the code since.
First, go to the sanger-tol/treeval releases page and find the latest version number, numeric only (e.g. 1.3.1). Then specify this when running the pipeline with -r (one hyphen), e.g. -r 1.3.1.
This version number will be logged in reports when you run the pipeline, so that you'll know what you used when you look back in the future.
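For example, to pin a release when using the typical command shown above in Running the pipeline (the version number here is only an illustration):
nextflow run sanger-tol/treeval -r 1.3.1 --input assets/treeval.yaml --outdir <OUTDIR> -profile singularity,sanger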
Core Nextflow arguments
NB: These options are part of Nextflow and use a single hyphen (pipeline parameters use a double-hyphen).
-profile
Use this parameter to choose a configuration profile. Profiles can give configuration presets for different compute environments.
Several generic profiles are bundled with the pipeline which instruct the pipeline to use software packaged using different methods (Docker, Singularity, Podman, Shifter, Charliecloud, Conda) - see below. When using Biocontainers, most of these software packaging methods pull Docker containers from quay.io e.g FastQC except for Singularity which directly downloads Singularity images via https hosted by the Galaxy project and Conda which downloads and installs software locally from Bioconda.
We highly recommend the use of Docker or Singularity containers for full pipeline reproducibility, however when this is not possible, Conda is also supported.
The pipeline also dynamically loads configurations from https://github.com/nf-core/configs when it runs, making multiple config profiles for various institutional clusters available at run time. For more information and to see if your system is available in these configs please see the nf-core/configs documentation.
Note that multiple profiles can be loaded, for example: -profile test,docker - the order of arguments is important! They are loaded in sequence, so later profiles can overwrite earlier profiles.
If -profile is not specified, the pipeline will run locally and expect all software to be installed and available on the PATH. This is not recommended.
- docker - A generic configuration profile to be used with Docker
- singularity - A generic configuration profile to be used with Singularity
- podman - A generic configuration profile to be used with Podman
- shifter - A generic configuration profile to be used with Shifter
- charliecloud - A generic configuration profile to be used with Charliecloud
- conda - A generic configuration profile to be used with Conda. Please only use Conda as a last resort, i.e. when it's not possible to run the pipeline with Docker, Singularity, Podman, Shifter or Charliecloud.
- test - A profile with a complete configuration for automated testing. Includes links to test data so needs no other parameters.
-resume
Specify this when restarting a pipeline. Nextflow will use cached results from any pipeline steps where the inputs are the same, continuing from where it got to previously. For input to be considered the same, not only the names must be identical but the files' contents as well. For more info about this parameter, see this blog post.
You can also supply a run name to resume a specific run: -resume [run-name]. Use the nextflow log command to show previous run names.
-c
Specify the path to a specific config file (this is a core Nextflow command). See the nf-core website documentation for more information.
Custom configuration
Resource requests
Whilst the default requirements set within the pipeline will hopefully work for most people and with most input data, you may find that you want to customise the compute resources that the pipeline requests. Each step in the pipeline has a default set of requirements for number of CPUs, memory and time. For most of the steps in the pipeline, if the job exits with any of the error codes specified here it will automatically be resubmitted with higher requests (2 x original, then 3 x original). If it still fails after the third attempt then the pipeline execution is stopped.
To change the resource requests, please see the max resources and tuning workflow resources section of the nf-core website.
nf-core/configs
In most cases, you will only need to create a custom config as a one-off, but if you and others within your organisation are likely to be running nf-core pipelines regularly and need to use the same settings regularly, it may be a good idea to request that your custom config file is uploaded to the nf-core/configs git repository. Before you do this, test that the config file works with your pipeline of choice using the -c parameter. You can then create a pull request to the nf-core/configs repository with the addition of your config file, associated documentation file (see examples in nf-core/configs/docs), and amending nfcore_custom.config to include your custom profile.
See the main Nextflow documentation for more information about creating your own configuration files.
If you have any questions or issues please send us a message on Slack on the #configs channel.
Running in the background
Nextflow handles job submissions and supervises the running jobs. The Nextflow process must run until the pipeline is finished.
The Nextflow -bg flag launches Nextflow in the background, detached from your terminal so that the workflow does not stop if you log out of your session. The logs are saved to a file.
Alternatively, you can use screen / tmux or a similar tool to create a detached session which you can log back into at a later time.
Some HPC setups also allow you to run nextflow within a cluster job submitted to your job scheduler (from where it submits more jobs).
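For example, appending -bg to the typical run command launches it in the background (a sketch only, reusing the placeholder arguments from Running the pipeline above):
nextflow run sanger-tol/treeval --input assets/treeval.yaml --outdir <OUTDIR> -profile singularity,sanger -bg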
Nextflow memory requirements
In some cases, the Nextflow Java virtual machines can start to request a large amount of memory.
We recommend adding the following line to your environment to limit this (typically in ~/.bashrc or ~/.bash_profile):
NXF_OPTS='-Xms1g -Xmx4g'