Introduction
The TreeVal pipeline has a few requirements before it can run:
- The gene_alignment_data (this refers to the alignment.data_dir and alignment.geneset data noted in the yaml, which we will explain later) and the synteny data must follow a particular directory structure.
- HiC CRAM files must already be pre-indexed in the same location as the CRAM file, e.g., samtools index {cram file}. If it would be more helpful to the community for this to be an automated process, please submit an issue.
- Finally, the yaml file, which is described below in Full Samplesheet. This needs to contain all of the information related to the assembly for the pipeline to run.
Prior to running TreeVal
Please ensure you read the following sections on Directory Structure (gene_alignment_data, synteny, scripts), HiC data prep and PacBio data prep. Without these you may not be able to successfully run the TreeVal pipeline. If anything is unclear then leave an issue report.
Local testing
Details
We provide a complete set of test databases that can be used to test the pipeline locally.
First, choose a download location ${TREEVAL_TEST_DATA} and run this command:
cd ${TREEVAL_TEST_DATA}
curl https://tolit.cog.sanger.ac.uk/test-data/resources/treeval/TreeValTinyData.tar.gz | tar xzf -
sed -i "s|/home/runner/work/treeval/treeval|${TREEVAL_TEST_DATA}|" TreeValTinyData/gene_alignment_data/fungi/csv_data/LaetiporusSulphureus.gfLaeSulp1-data.csv
Then, modify the configuration file to point at that download location and off you go:
sed -i "s|/home/runner/work/treeval/treeval|${TREEVAL_TEST_DATA}|" assets/github_testing/TreeValTinyTest.yaml
nextflow run . -profile test_github,singularity
Directory Structure
Details
A working example can be found below in the Gene Alignment and Synteny section; it covers setting up the synteny and gene_alignment_data directories as well as downloading some example data.
These two sub-workflows, for now, require the variables classT, synteny_genome_path, data_dir and geneset. These variables are found inside the yaml (the file that tells TreeVal what everything is and where to find it). Currently, we don't use common_name, e.g., bee, wasp, moth, etc. However, we hope to make use of it in the future as our gene_alignment_data "database" grows.
First, you should set up a directory in our recommended structure:
treeval-resources
│
├─ gene_alignment_data/
│  └─ { classT }
│     ├─ csv_data
│     │  └─ { Name.Accession }-data.csv    # Generated by our scripts
│     └─ { Name }                          # Here and below is generated by our scripts
│        └─ { Name.Accession }
│           ├─ cdna
│           │  └─ { Chunked fasta files }
│           ├─ rna
│           │  └─ { Chunked fasta files }
│           ├─ cds
│           │  └─ { Chunked fasta files }
│           └─ pep
│              └─ { Chunked fasta files }
│
├─ gene_alignment_prep/
│  ├─ scripts/                             # We supply these in this repo
│  ├─ raw_fasta/                           # Storing your fasta files downloaded from NCBI or Ensembl
│  └─ treeval-datasets.tsv                 # Organism, common_name, clade, family, group, link_to_data, notes
│
├─ synteny/
│  └─ { classT }
│
├─ treeval_yaml/                           # Storage folder for your yaml files, it's useful to keep them
│
└─ treeval_stats/                          # Storage for your treeval output stats files, e.g. for upload to our repo
classT can be your own system of classification, as long as it is consistent. At Sanger we use a fixed set of clade names (such as bird, which is used in the walkthrough below), and we advise you to do the same. Again, this value, which is entered into the yaml (the file we will use to tell TreeVal where everything is), is used to find gene_alignment_data as well as syntenic genomes.

Synteny
Details
For synteny, underneath the classT directory you should store the full genomic fasta file of any high quality genome you want your assembly to be compared against.
For bird, we recommend the Golden Eagle (Aquila chrysaetos) and the Zebrafinch (Taeniopygia guttata), which can be downloaded from NCBI. Rename these files to something more human readable and drop them into the synteny/bird/ folder. Any TreeVal run you now perform where the classT is bird will run a syntenic alignment against all genomes in that folder, so it is best to keep this to around three. Again, this is something we could expand on with the common_name field if people want it in the future; submit a feature request.
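For example, a sketch of adding one such genome (the URL is a placeholder for the NCBI assembly you actually choose, and the human-readable filename is just a suggestion):
cd synteny/bird/
# {NCBI genomic fna.gz URL} is a placeholder - substitute the assembly you want to compare against
curl {NCBI genomic fna.gz URL for your chosen genome} -o AquilaChrysaetos.fasta.gz
gunzip AquilaChrysaetos.fasta.gz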
Gene Alignment and Synteny Data and Directories
Details
Seeing as this can be quite complicated to set up, here's a walkthrough.
Step 1 - Set up the directories
Let's set up the system as if we want to run it on a bird genome.
mkdir -p gene_alignment_prep/scripts/
cp treeval/bin/treeval-dataprep/* gene_alignment_prep/scripts/
mkdir -p gene_alignment_prep/raw_fasta/
mkdir -p gene_alignment_data/bird/csv_data/
mkdir -p synteny/bird/
The naming of the bird folder here is important; keep this in mind.
So now we have this structure:
~/treeval-resources
│
├─ synteny/
│  └─ bird/
│
├─ gene_alignment_data/
│  └─ bird/
│     └─ csv_data/
│
└─ gene_alignment_prep/
   ├─ scripts/
   └─ raw_fasta/
Step 2 - Download some data
First, let's download our syntenic alignment data. The Zebrafinch (Taeniopygia guttata) would be a good comparison against the Chicken (Gallus gallus).
cd synteny/bird/
curl https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/003/957/565/GCA_003957565.4_bTaeGut1.4.pri/GCA_003957565.4_bTaeGut1.4.pri_genomic.fna.gz -o bTaeGut1_4.fasta.gz
gunzip bTaeGut1_4.fasta.gz
This leaves us with a file called bTaeGut1_4.fasta, the genomic assembly of bTaeGut1_4 (the Tree of Life ID for this species), also known as Taeniopygia guttata, the Australian Zebrafinch.
Now let's move into the raw_fasta folder and download some data; this may take some time.
cd ../../gene_alignment_prep/raw_fasta/
curl https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/016/699/485/GCF_016699485.2_bGalGal1.mat.broiler.GRCg7b/GCF_016699485.2_bGalGal1.mat.broiler.GRCg7b_cds_from_genomic.fna.gz -o GallusGallus-GRCg7b.cds.fasta.gz
curl https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/016/699/485/GCF_016699485.2_bGalGal1.mat.broiler.GRCg7b/GCF_016699485.2_bGalGal1.mat.broiler.GRCg7b_genomic.fna.gz -o GallusGallus-GRCg7b.cdna.fasta.gz
curl https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/016/699/485/GCF_016699485.2_bGalGal1.mat.broiler.GRCg7b/GCF_016699485.2_bGalGal1.mat.broiler.GRCg7b_protein.faa.gz -o GallusGallus-GRCg7b.pep.fasta.gz
curl https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/016/699/485/GCF_016699485.2_bGalGal1.mat.broiler.GRCg7b/GCF_016699485.2_bGalGal1.mat.broiler.GRCg7b_rna.fna.gz -o GallusGallus-GRCg7b.rna.fasta.gz
Now that's all downloaded, we need to prep it. At this point it is all still gzipped (the .gz on the end denotes that the file is compressed), and in this format we can't use it. So let's use some bash magic.
The following is a for loop written in bash. It looks through the current folder for files ending with .fasta.gz, gunzips them (uncompressing each file so it is usable) and then runs our python script, GA_data_prep.py, on the result.
for i in *.fasta.gz; do
  gunzip $i;
  python3 ../scripts/GA_data_prep.py ${i/.gz} ncbi 10;
done
In plain English, this python script command means: take the uncompressed file that I downloaded from NCBI and cut it into pieces. A fasta file looks something like this, with headers (lines starting with `>`) and sequence (the usual ATGCs):
>SCAFFOLD_1_DATA_ON_METHODS
ATGCGCATGCATGCATCGACTTCGAGCATCGTAG
>SCAFFOLD_2_DATA_ON_METHODS
ACCAGTGCTAGCTAGCTACGTGTGGGTTTGCCCCGTTT
The headers will be trimmed to only the essential data you need in order to find the sequence in your database of choice.
A fasta file may be made up of anywhere between tens and many thousands of these header-sequence pairs. So, in the case of our cdna and pep files, they need to be cut up to give TreeVal a chance of reading them all in a small time frame. cds and rna files will be cut into 1,000 header-sequence pairs per file (the number given on the command line is ignored). pep and cdna files will be cut up by the number you give, or by 100; this is because pep and cdna files are so much larger.
The smaller the chunk number, the smaller the files you produce, but you will also produce many more of them, so there is a trade-off.
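To illustrate what this chunking means in practice, here is a rough sketch of the idea only; it is not the actual GA_data_prep.py script (which also renames the headers and builds the output folder structure). It splits a fasta into files of 1,000 records each using awk, with chunk_*.fa as assumed output names:
# Sketch: start a new output file every 1,000 fasta records (headers start with ">")
awk 'BEGIN { n = 0; f = "chunk_1.fa" }
     /^>/  { n++; if (n > 1 && (n - 1) % 1000 == 0) { close(f); f = "chunk_" n ".fa" } }
     { print > f }' GallusGallus-GRCg7b.cds.fasta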
Running the GA_data_prep.py command will produce a large amount of output in your terminal, looking like this:
python3 ../scripts/GA_data_prep.py GallusGallus-GRCg7b.cds.fasta ncbi 100
Your using at least Version 3.6, You are good to go...
os imported
argparse imported
regex imported
WORKING ON: cds--GallusGallus-GRCg7b
Records per file: 1000
Entryfunction called
GallusGallus-GRCg7b.cds.fasta
File found at %s GallusGallus-GRCg7b.cds.fasta
Renaming headers
Read_fasta called
File saved: -- ./GallusGallus/GallusGallus.GRCg7b/cds/GallusGallus1000cds.MOD.fa
File saved: -- ./GallusGallus/GallusGallus.GRCg7b/cds/GallusGallus2001cds.MOD.fa
File saved: -- ./GallusGallus/GallusGallus.GRCg7b/cds/GallusGallus3002cds.MOD.fa
File saved: -- ./GallusGallus/GallusGallus.GRCg7b/cds/GallusGallus4003cds.MOD.fa
File saved: -- ./GallusGallus/GallusGallus.GRCg7b/cds/GallusGallus5004cds.MOD.fa
File saved: -- ./GallusGallus/GallusGallus.GRCg7b/cds/GallusGallus6005cds.MOD.fa
File saved: -- ./GallusGallus/GallusGallus.GRCg7b/cds/GallusGallus7006cds.MOD.fa
File saved: -- ./GallusGallus/GallusGallus.GRCg7b/cds/GallusGallus8007cds.MOD.fa
File saved: -- ./GallusGallus/GallusGallus.GRCg7b/cds/GallusGallus9008cds.MOD.fa
This is pretty much telling us: yes, you have given me a file, and for every 1,000 header-sequence pairs I have come across (I'm ignoring the number you gave me because this isn't a pep or cdna file) I have made a new file, found here. You'll notice that it has also generated a new set of folders; this is based off of how we have named the file.
If you now type ls you should see the file we downloaded (GallusGallus-GRCg7b.cds.fasta) and the folder GallusGallus. We can now move this folder to its permanent home.
mv GallusGallus/ ../../gene_alignment_data/bird/
Step 3 -- Generate the CSV
This file acts as an index of all the files we have produced in the gene_alignment_data folder, and generating it is thankfully a very simple step.
cd ../
python3 scripts/GA_csv_gen.py /path/to/gene_alignment_data/
Running this will look like:
============> CorvusMon1.bCorMon1 -- bird
Generating CSV for: CorvusMon1.bCorMon1
Save Path: /gene_alignment_data/bird/csv_data/CorvusMon1.bCorMon1-data.csv
============> CorvusMoneduloides.bCorMon1 -- bird
Generating CSV for: CorvusMoneduloides.bCorMon1
Save Path: /gene_alignment_data/bird/csv_data/CorvusMoneduloides.bCorMon1-data.csv
============> Gallus_gallus.UW_022020 -- bird
Generating CSV for: Gallus_gallus.UW_022020
Save Path: /gene_alignment_data/bird/csv_data/Gallus_gallus.UW_022020-data.csv
============> Gallus_gallus.GRCg6a -- bird
Generating CSV for: Gallus_gallus.GRCg6a
Save Path: /gene_alignment_data/bird/csv_data/Gallus_gallus.GRCg6a-data.csv
============> GallusGallus.GRCg7b -- bird
Generating CSV for: GallusGallus.GRCg7b
Save Path: /gene_alignment_data/bird/csv_data/GallusGallus.GRCg7b-data.csv
What is happening here is that the script moves through the directory, identifies each unique assembly folder and generates a CSV summarising the data found in it. Each CSV contains the following information and looks like this:
head -n 5 /gene_alignment_data/bird/csv_data/Gallus_gallus.GRCg6a-data.csv
org,type,data_file
Gallus_gallus.GRCg6a,cds,/gene_alignment_data/bird/Gallus_gallus/Gallus_gallus.GRCg6a/cds/Gallus_gallus9002cds.MOD.fa
Gallus_gallus.GRCg6a,cds,/gene_alignment_data/bird/Gallus_gallus/Gallus_gallus.GRCg6a/cds/Gallus_gallus28453cds.MOD.fa
Gallus_gallus.GRCg6a,cds,/gene_alignment_data/bird/Gallus_gallus/Gallus_gallus.GRCg6a/cds/Gallus_gallus18005cds.MOD.fa
Gallus_gallus.GRCg6a,cds,/gene_alignment_data/bird/Gallus_gallus/Gallus_gallus.GRCg6a/cds/Gallus_gallus6001cds.MOD.fa
This is all used by the pipeline, which generates job IDs based on the org column, groups files by the org and type columns, and then pulls the data from the path in the data_file column.
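As an optional sanity check (not part of the pipeline itself), you can count how many chunked files were indexed per data type; column 2 of the CSV is the type:
# Skip the header row, take the type column and tally it
tail -n +2 /gene_alignment_data/bird/csv_data/Gallus_gallus.GRCg6a-data.csv | cut -d',' -f2 | sort | uniq -c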
Step 4 -- Understand where we are at
We have now generated the directory structure for gene_alignment_data, so let's use what we know to fill out the yaml.
The yaml is a file that we need in order to tell the pipeline where everything is; an example can be found here.
Here we can see a number of fields that need to be filled out, the easiest being synteny_genome_path and data_dir. These refer to the directories we made earlier, so we can fill them in as such:
alignment:
  data_dir: /FULL/PATH/TO/treeval-resources/gene_alignment_data/

synteny:
  synteny_genome_path: /FULL/PATH/TO/treeval-resources/synteny
I said earlier that the fact we called a folder bird was important; this is because it now becomes our classT:
classT: bird
During the running of the pipeline, this value is appended onto the end of data_dir and synteny_genome_path in order to find the correct files to use. So now all of the files inside /FULL/PATH/TO/treeval-resources/synteny/bird/ will be used for syntenic alignments. Likewise with our alignment.data_dir: TreeVal will turn it into /FULL/PATH/TO/treeval-resources/gene_alignment_data/bird/ and then append csv_data.
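If you want to check that TreeVal will find your data at those resolved locations, a quick optional look is enough:
# These should list your syntenic genomes and your generated geneset CSVs respectively
ls /FULL/PATH/TO/treeval-resources/synteny/bird/
ls /FULL/PATH/TO/treeval-resources/gene_alignment_data/bird/csv_data/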
In Step 3, we generated some files which will be living in our /FULL/PATH/TO/treeval-resources/gene_alignment_data/bird/csv_data/ folder and look like GallusGallus.GRCg7b-data.csv. These (minus the -data.csv) will be what we enter into the geneset field in the yaml. The common_name is a field we don't currently use.
alignment:
  data_dir: /FULL/PATH/TO/treeval-resources/gene_alignment_data/
  common_name: "" # For future implementation (adding bee, wasp, ant etc)
  geneset: "GallusGallus.GRCg7b"
However, what is cool about this field is that you can add as many genesets as you want. So say you also have gene alignment data for the Zebrafinch saved in alignment.data_dir as TaeniopygiaGuttata.bTaeGut1_4; the geneset field then becomes: geneset: "GallusGallus.GRCg7b,TaeniopygiaGuttata.bTaeGut1_4"
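Put together, the alignment block would then look like the following (a sketch only; the Zebrafinch geneset is assumed to have been prepared the same way as the Chicken one above):
alignment:
  data_dir: /FULL/PATH/TO/treeval-resources/gene_alignment_data/
  common_name: "" # For future implementation (adding bee, wasp, ant etc)
  geneset: "GallusGallus.GRCg7b,TaeniopygiaGuttata.bTaeGut1_4"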
Hopefully this explains things a bit better and you understand how this sticks together!
HiC data Preparation
Details
Illumina HiC read files should be presented in an unmapped CRAM format; each must be accompanied by an index file (.crai) generated by samtools index. If your unmapped HiC reads are in FASTQ format, you should first convert them to CRAM format using samtools import. Examples are below:
Conversion of FASTQ to CRAM
samtools import -@8 -r ID:{prefix} -r CN:{hic-kit} -r PU:{prefix} -r SM:{sample_name} {prefix}_R1.fastq.gz {prefix}_R2.fastq.gz -o {prefix}.cram
Indexing of CRAM
samtools index {prefix}.cram
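If you have several read pairs to convert, a sketch of looping over them (the read-group values {hic-kit} and {sample_name} are placeholders you should replace with your own metadata, exactly as in the single-file command above):
for R1 in *_R1.fastq.gz; do
  prefix=${R1%_R1.fastq.gz}
  # Placeholder read-group values - substitute your own kit and sample names
  samtools import -@8 -r ID:${prefix} -r CN:{hic-kit} -r PU:${prefix} -r SM:{sample_name} \
    ${prefix}_R1.fastq.gz ${prefix}_R2.fastq.gz -o ${prefix}.cram
  samtools index ${prefix}.cram
done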
PacBio Data Preparation
Details
Before running the pipeline, the PacBio data has to be in fasta.gz format. Because of the software we use this data with, it must also be long-read, single-stranded data. This means you could use ONT data too (except duplex reads).
The below commands should help you convert from mapped bam to fasta.gz, or from fastq to fasta.
If your data isn't already in these formats, then let us know and we'll see how we can help.
BAM -> FASTQ
This command iterates through your bam files and converts them to fastq via samtools.
cd { TO FOLDER OF BAM FILES }
mkdir fastq
for i in *bam
do
  echo $i
  j=${i%.bam}
  echo $j
  samtools bam2fq ${i} > fastq/${j}.fq
done
FASTQ -> FASTA
This command creates a fasta folder (to store our fasta files), moves into the fastq folder and then converts fastq to fasta using seqtk seq.
mkdir fasta
cd fastq
for i in *fq; do
  echo $i
  j=${i%.fq}
  echo $j
  seqtk seq -a $i > ../fasta/${j}.fasta
done
FASTA -> FASTA.GZ
This simply gzips the fasta files.
for i in *.fasta; do
  echo $i
  gzip $i
done
Or, if you're a command line ninja:
samtools bam2fq {prefix}.bam | seqtk seq -a - | gzip - > {prefix}.fasta.gz
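As a sketch combining the loop and the one-liner above, converting every BAM in the current folder straight to fasta.gz:
for i in *.bam; do
  samtools bam2fq ${i} | seqtk seq -a - | gzip - > ${i%.bam}.fasta.gz
done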
Pretext Accessory File Ingestion
Details
Note: This will require you to install bigWigToBedGraph from the UCSC tools package. Instructions on downloading this can be found at EXAMPLE #3.
The accessory files generated by the pipeline are not automatically ingested into the Pretext files. For this you must use the following code:
cd {outdir}/hic_files
bigWigToBedGraph {coverage.bigWig} /dev/stdout | PretextGraph -i { your.pretext } -n "coverage"
bigWigToBedGraph {repeat_density.bigWig} /dev/stdout | PretextGraph -i { your.pretext } -n "repeat_density"
cat {telomere.bedgraph} | awk -v OFS="\t" '{$4 = 1000; print}'|PretextGraph -i { your.pretext } -n "telomere"
cat {gap.bedgraph} | awk -v OFS="\t" '{$4= 1000; print}'| PretextGraph -i { your.pretext } -n "gap"
Full samplesheet
YAML is "Yet Another Markdown Language", it is a human-readable format that we use to tell TreeVal a number of things. This includes; assembly location, telomere motif, pacbio data files (in fasta.gz format) and HiC cram files. The full Yaml is detailed below.
YAML contents
The following is an example YAML file we have used during production, nxOscDF5033.yaml, and its fields are described below. It contains some annotations we believe to be helpful; information on the alignment, synteny, pacbio and hic fields is explained in the PacBio, Gene Alignment and Synteny sections above.
- assembly
  - sample_id: ToLID of the sample.
  - latin_name: Latin identification of the species.
  - classT: Clade name (as used to group synteny sequences and to complete alignment/data_dir).
  - TicketType:
- reference_file: Sample .fa file.
- assem_reads
  - pacbio: path to folder containing fasta.gz files.
  - hic: path to folder containing cram files.
  - supplementary: # Will be required in future development.
- alignment
  - data_dir: Gene alignment data path.
  - common_name: # For future implementation (adding bee, wasp, ant etc)
  - geneset: a csv list of geneset data to be used.
- self_comp
  - motif_len: Length of motif to be used in self-complementary sequence finding.
  - mummer_chunk: Size of chunks used by the MUMMER module.
- synteny
  - synteny_genome_path: Path to syntenic genomes grouped by clade.
- outdir: Will be required in future development.
- intron
  - size: base pair size of introns, default is 50k.
- telomere
  - teloseq: Telomeric motif.
- busco
  - lineages_path: path to the folder above the lineages folder.
  - lineage: Example is nematode_odb10.
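Putting those fields together, a minimal illustrative yaml might look like the following. This is a sketch only: the paths are placeholders, and values such as sample_id, TicketType, motif_len, mummer_chunk, teloseq and lineage are assumed examples based on the bird walkthrough above, not the production nxOscDF5033.yaml.
assembly:
  sample_id: bMySample1            # assumed example ToLID
  latin_name: Gallus_gallus
  classT: bird
  TicketType: example              # placeholder
reference_file: /FULL/PATH/TO/assembly/bMySample1.fa
assem_reads:
  pacbio: /FULL/PATH/TO/pacbio/fasta/
  hic: /FULL/PATH/TO/hic/cram/
  supplementary: ""                # Will be required in future development
alignment:
  data_dir: /FULL/PATH/TO/treeval-resources/gene_alignment_data/
  common_name: ""                  # For future implementation
  geneset: "GallusGallus.GRCg7b"
self_comp:
  motif_len: 0                     # placeholder motif length
  mummer_chunk: 10                 # placeholder chunk size for the MUMMER module
synteny:
  synteny_genome_path: /FULL/PATH/TO/treeval-resources/synteny
outdir: ""                         # Will be required in future development
intron:
  size: "50k"
telomere:
  teloseq: TTAGGG                  # vertebrate telomeric repeat, assumed example
busco:
  lineages_path: /FULL/PATH/TO/databases/busco_2023_03/v5/data   # folder above the lineages folder
  lineage: aves_odb10              # assumed example; the doc's example is nematode_odb10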
Notes on using BUSCO
The pipeline requires the use of the BUSCO subworkflows. Create the database directory and move into the directory:
DATE=2023_03
BUSCO=/path/to/databases/busco_${DATE}
mkdir -p $BUSCO
cd $BUSCO
Download BUSCO data and lineages to allow BUSCO to run in offline mode.
wget -r -nH https://busco-data.ezlab.org/v5/data/
The trailing slash after data is important; otherwise wget doesn't download the subdirectories that make up the database.
Decompress all the folders that have been stored as tar.gz, in the same parent directories where they were downloaded:
find v5/data -name "*.tar.gz" | while read -r TAR; do tar -C `dirname $TAR` -xzf $TAR; done
If you have GNU parallel installed, you can also use the command below which will run faster as it will run the decompression commands in parallel:
find v5/data -name "*.tar.gz" | parallel "cd {//}; tar -xzf {/}"
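Assuming the wget command above produces a v5/data directory containing a lineages folder (check your own download), the busco fields in the yaml would then point at the folder above lineages, for example:
busco:
  lineages_path: /path/to/databases/busco_2023_03/v5/data
  lineage: nematode_odb10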
Sub-workflows
- YAML_INPUT - Reads the input yaml and generates parameters used by other workflows.
- GENERATE_GENOME - Builds the genome description file of the reference genome.
- LONGREAD_COVERAGE - Produces read coverage based on the PacBio long read fasta file.
- GAP_FINDER - Identifies contig gaps in the input genome.
- REPEAT_DENSITY - Reports the intensity of regional repeats within an input assembly.
- HIC_MAPPING - Aligns Illumina HiC short reads to the input genome and generates mapping files in three formats for visualisation: .pretext, .hic and .mcool.
- TELO_FINDER - Finds a user-given motif in the input genome.
- GENE_ALIGNMENT - Aligns the peptide and nuclear data from assemblies of related species to the input genome.
- INSILICO_DIGEST - Generates a map of enzymatic digests using 3 Bionano enzymes.
- SELFCOMP - Identifies regions of self-complementary sequence.
- SYNTENY - Generates syntenic alignments between other high quality genomes.
- BUSCO_ANALYSIS - Uses BUSCO to identify ancestral elements. Also used to identify ancestral Lepidopteran genes (Merian units).
Running the pipeline
The typical command for running the pipeline is as follows:
nextflow run sanger-tol/treeval --input assets/treeval.yaml --outdir <OUTDIR> -profile singularity,sanger
with the treeval.yaml containing the information from the YAML Contents section above.
Updating the pipeline
When you run the above command, Nextflow automatically pulls the pipeline code from GitHub and stores it as a cached version. When running the pipeline after this, it will always use the cached version if available - even if the pipeline has been updated since. To make sure that you're running the latest version of the pipeline, make sure that you regularly update the cached version of the pipeline:
nextflow pull sanger-tol/treeval
Reproducibility
It is a good idea to specify a pipeline version when running the pipeline on your data. This ensures that a specific version of the pipeline code and software are used when you run your pipeline. If you keep using the same tag, you'll be running the same version of the pipeline, even if there have been changes to the code since.
First, go to the sanger-tol/treeval releases page and find the latest version number - numeric only (eg. 1.3.1). Then specify this when running the pipeline with -r (one hyphen) - eg. -r 1.3.1.
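For example, assuming 1.3.1 is the release you picked, the run command from above becomes:
nextflow run sanger-tol/treeval -r 1.3.1 --input assets/treeval.yaml --outdir <OUTDIR> -profile singularity,sanger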
This version number will be logged in reports when you run the pipeline, so that you'll know what you used when you look back in the future.
Core Nextflow arguments
NB: These options are part of Nextflow and use a single hyphen (pipeline parameters use a double-hyphen).
-profile
Use this parameter to choose a configuration profile. Profiles can give configuration presets for different compute environments.
Several generic profiles are bundled with the pipeline which instruct the pipeline to use software packaged using different methods (Docker, Singularity, Podman, Shifter, Charliecloud, Conda) - see below. When using Biocontainers, most of these software packaging methods pull Docker containers from quay.io (e.g. FastQC), except for Singularity, which directly downloads Singularity images via https hosted by the Galaxy project, and Conda, which downloads and installs software locally from Bioconda.
We highly recommend the use of Docker or Singularity containers for full pipeline reproducibility, however when this is not possible, Conda is also supported.
The pipeline also dynamically loads configurations from https://github.com/nf-core/configs when it runs, making multiple config profiles for various institutional clusters available at run time. For more information and to see if your system is available in these configs please see the nf-core/configs documentation.
Note that multiple profiles can be loaded, for example: -profile test,docker - the order of arguments is important! They are loaded in sequence, so later profiles can overwrite earlier profiles.
If -profile is not specified, the pipeline will run locally and expect all software to be installed and available on the PATH. This is not recommended.
- docker - A generic configuration profile to be used with Docker
- singularity - A generic configuration profile to be used with Singularity
- podman - A generic configuration profile to be used with Podman
- shifter - A generic configuration profile to be used with Shifter
- charliecloud - A generic configuration profile to be used with Charliecloud
- conda - A generic configuration profile to be used with Conda. Please only use Conda as a last resort, i.e. when it's not possible to run the pipeline with Docker, Singularity, Podman, Shifter or Charliecloud.
- test - A profile with a complete configuration for automated testing. Includes links to test data so needs no other parameters.
-resume
Specify this when restarting a pipeline. Nextflow will use cached results from any pipeline steps where the inputs are the same, continuing from where it got to previously. For input to be considered the same, not only the names must be identical but the files' contents as well. For more info about this parameter, see this blog post.
You can also supply a run name to resume a specific run: -resume [run-name]. Use the nextflow log command to show previous run names.
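For instance, re-launching the earlier run command unchanged but with -resume (a sketch; the yaml and outdir are whatever you used before):
nextflow run sanger-tol/treeval --input assets/treeval.yaml --outdir <OUTDIR> -profile singularity,sanger -resume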
-c
Specify the path to a specific config file (this is a core Nextflow command). See the nf-core website documentation for more information.
Custom configuration
Resource requests
Whilst the default requirements set within the pipeline will hopefully work for most people and with most input data, you may find that you want to customise the compute resources that the pipeline requests. Each step in the pipeline has a default set of requirements for number of CPUs, memory and time. For most of the steps in the pipeline, if the job exits with any of the error codes specified here it will automatically be resubmitted with higher requests (2 x original, then 3 x original). If it still fails after the third attempt then the pipeline execution is stopped.
To change the resource requests, please see the max resources and tuning workflow resources section of the nf-core website.
nf-core/configs
In most cases, you will only need to create a custom config as a one-off, but if you and others within your organisation are likely to be running nf-core pipelines regularly and need to use the same settings, it may be a good idea to request that your custom config file is uploaded to the nf-core/configs git repository. Before you do this, please test that the config file works with your pipeline of choice using the -c parameter. You can then create a pull request to the nf-core/configs repository with the addition of your config file, the associated documentation file (see examples in nf-core/configs/docs), and an amendment to nfcore_custom.config to include your custom profile.
See the main Nextflow documentation for more information about creating your own configuration files.
If you have any questions or issues please send us a message on Slack on the #configs channel.
Running in the background
Nextflow handles job submissions and supervises the running jobs. The Nextflow process must run until the pipeline is finished.
The Nextflow -bg flag launches Nextflow in the background, detached from your terminal so that the workflow does not stop if you log out of your session. The logs are saved to a file.
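For example (a sketch; it assumes the same run command as before and sends the output to a log file of your choice):
nextflow run sanger-tol/treeval --input assets/treeval.yaml --outdir <OUTDIR> -profile singularity,sanger -bg > treeval_run.log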
Alternatively, you can use screen / tmux or a similar tool to create a detached session which you can log back into at a later time.
Some HPC setups also allow you to run nextflow within a cluster job submitted to your job scheduler (from where it submits more jobs).
Nextflow memory requirements
In some cases, the Nextflow Java virtual machines can start to request a large amount of memory.
We recommend adding the following line to your environment to limit this (typically in ~/.bashrc or ~/.bash_profile):
NXF_OPTS='-Xms1g -Xmx4g'