fastreeR is a hybrid toolkit combining a high-performance Java backend (BioInfoJava-Utils, a modular Java library for bioinformatics pipelines) with flexible and user-friendly interfaces across multiple platforms and environments, enabling seamless integration into a variety of genomic workflows. It enables fast computation of distance matrices and phylogenetic trees from genetic variant data in VCF or genomic sequences in FASTA format.
Integration and Accessibility
fastreeR is accessible through the following interfaces:
- 🆕 Java Backend (v2.7.0) introduces windowed / streaming VCF distance and tree output. Emit one distance matrix (or Newick tree) per genomic window of N base pairs (--window-bp) or per N consecutive variants (--window-variants) for VCF2DIST and VCF2TREE, with optional long-form TSV output (--long). Windows never straddle chromosomes.
- Java Backend (v2.5.0) introduces embedding-based distance calculation for VCF files. Provide pre-computed variant embeddings (from genomic language models like BioFM, DNA-BERT, Nucleotide Transformer, etc.) to weight variant contributions during distance computation.
- Java Backend (v2.3.0) supports reading gzip (.gz), bzip2 (.bz2) and xz compressed VCF files.
- Java Backend (v2.2.0) implements streaming bootstrap: obtain a Newick tree with encoded bootstrap support values directly from a VCF file.
- Java Backend (v2.0.0) is about 100x faster and needs only a couple hundred MB of RAM. Java 11+ is recommended.
- Bioconda: install with conda install -c bioconda fastreer (recipe)
- Docker: available on DockerHub and GHCR for containerized execution
- PyPI: install with pip install fastreer (repository)
- Python CLI: through a lightweight Python wrapper that calls the Java backend
- R / Bioconductor: via rJava (package)
- Galaxy: available on Galaxy Toolshed.
- Pure Java API: developers can integrate this library directly in Java-based pipelines or software.
- fastreeR: Fast Tree Reconstruction Tools for Genomics (VCF/FASTA to Distance/Tree)
- Integration and Accessibility
- Key Features
- Requirements
- Installation and Usage
- Distances from VCF
- Embedding-Based Distance Calculation
- Windowed / Streaming Output
- CLI Interface
- Integration with Java Backend
- Integration with R
- Sample data
- Citation
- Author
- License
Key Features
- 📁 Input from standard VCF (gz, bzip2, xz compressed or uncompressed) and FASTA files.
- 🪟 Windowed / streaming output emits one distance matrix or Newick tree per genomic window (by base pairs or variant count) for VCF2DIST and VCF2TREE.
- 🧠 Embedding-based distance calculation using pre-computed variant embeddings from genomic language models.
- 🥾 Streaming bootstrap support from VCF to NEWICK.
- 🚀 Multithreaded concurrency model with minimal RAM usage: memory needs drop from GBs to just MBs.
- ⚡ Ultra-fast computation of sample-wise cosine distances from large VCF and D2S k-mer based distances from FASTA files.
- Generate phylogenetic trees directly from VCF or distance matrices using hierarchical clustering (single, complete, or average linkage; complete by default).
- Multithreaded execution for speed and scalability.
- Cluster distance matrices hierarchically with dynamic tree pruning.
- Clean Python CLI for scripting and pipeline integration
- Streamlined integration with R via rJava
- Available on Galaxy Toolshed
- Compatible with standard bioinformatics formats (PHYLIP, Newick)
Requirements
- Java 11+
- Python 3.7+
- Maven (if you want to build from the source)
- GNU/Linux, Windows or macOS
Memory requirements for VCF input
No more GBs of RAM! Only the distance matrix is kept in memory:
4 bytes x (#samples²) x #threads
- Example: 1000 samples with 32 threads → ~128MB RAM
VCF caching is minimal: Only 2 VCF lines per thread are pre-cached.
- In the simple diploid case (e.g., 0/1, 1|0), each genotype requires ~4 characters (8 bytes).
- For 1000 samples and 32 threads, this adds up to ~1MB RAM.
The JVM itself needs at least 64-128 MB to run efficiently.
Total memory footprint: just a few hundred MB, even for large datasets.
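The arithmetic above can be checked with a small Python sketch. The function names are hypothetical and not part of fastreeR; the constants (4 bytes per matrix cell, 2 cached VCF lines per thread, 8 bytes per diploid genotype) are taken from the figures quoted in this section.

```python
def estimate_matrix_bytes(n_samples: int, n_threads: int,
                          bytes_per_cell: int = 4) -> int:
    """Bytes for the distance matrix: 4 bytes x samples^2 x threads."""
    return bytes_per_cell * n_samples ** 2 * n_threads


def estimate_cache_bytes(n_samples: int, n_threads: int,
                         lines_per_thread: int = 2,
                         bytes_per_genotype: int = 8) -> int:
    """Bytes for the pre-cached VCF lines (simple diploid case)."""
    return lines_per_thread * n_threads * n_samples * bytes_per_genotype


if __name__ == "__main__":
    matrix = estimate_matrix_bytes(1000, 32)
    cache = estimate_cache_bytes(1000, 32)
    print(f"distance matrix: {matrix / 1e6:.0f} MB")
    print(f"VCF line cache:  {cache / 1e6:.2f} MB")
```

For 1000 samples and 32 threads this gives 128 MB for the matrix and about half a megabyte for the line cache, in line with the rounded figures above.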
It is not straightforward to define a strict minimum amount of RAM required for a given number of SNPs and samples, as JVM behavior can vary across different systems and configurations. From our own experiments, a rough estimate for the minimum usable memory is around 10 bytes per variant per sample. For example, a VCF file with 1 million variants and 1,000 samples would require at least 10 x 10⁶ x 10³ = 10 GB of allocated memory. However, running with this minimal allocation may result in frequent and prolonged garbage collection events, leading to significantly longer runtimes. For optimal execution, we recommend allocating 15-20 bytes per variant per sample (i.e., 15-20 GB for the same example), which reduces garbage collection overhead and ensures smoother performance.
In order to allocate RAM, a special parameter needs to be passed while JVM initializes. JVM parameters can be passed by setting java.parameters option. The -Xmx parameter, followed (without space) by an integer value and a letter, is used to tell JVM what is the maximum amount of heap RAM that it can use. The letter in the parameter (uppercase or lowercase), indicates RAM units. For example, parameters -Xmx1024m or -Xmx1024M or -Xmx1g or -Xmx1G, allocate 1 Gigabyte or 1024 Megabytes of maximum RAM for JVM.
In order to allocate 1024MB of RAM for the JVM, through R code, use:
options(java.parameters = "-Xmx1024M")
When using fastreeR as a CLI, the RAM allocation in MB can be set with the --mem MEM argument.
Installation and Usage
Via Docker
fastreeR is available as a lightweight, multithreaded, platform-independent Docker image hosted on both DockerHub and GHCR.
From DockerHub:
Or from GitHub Container Registry (GHCR):
To compute a tree directly from a VCF file:
docker run --rm -v $(pwd):/data gkanogiannis/fastreer:latest \
VCF2TREE -i /data/input.vcf -o /data/output.nwk --threads 4
This:
- Mounts your working directory $(pwd) inside the container
- Reads input.vcf and writes output.nwk relative to your host
- Uses 4 threads for faster computation
The Docker image includes:
- Java 21
- Python3
- All required .jar libraries
- The fastreeR.py CLI entry point
Example: FASTA to distance
docker run --rm -v $(pwd):/data gkanogiannis/fastreer \
FASTA2DIST -i /data/sequences.fasta -o /data/sequences.dist -k 4 -t 2
Memory tuning. Use the --mem option to control how much memory is allocated to the Java backend:
docker run --rm -v $(pwd):/data gkanogiannis/fastreer \
VCF2TREE -i /data/input.vcf -o /data/output.nwk --mem 128
Internally, this sets the Java heap to -Xmx128M, since --mem is given in MB.
As a PyPI Module
You can install the Python CLI directly from PyPI using:
This will install the fastreeR command-line tool (fastreer) and include the Java backend jars required for running all commands.
To check it installed correctly:
Via a Python CLI wrapper
Another easy method for using fastreeR is by its Python CLI:
Note: If you want to use a custom backend location, set the environment variable FASTREER_JAR_DIR.
As an R package
To install fastreeR as an R package:
if (!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
BiocManager::install("fastreeR")
You can install the development version of the fastreeR R package like so:
devtools::install_github("gkanogiannis/fastreeR")
With Galaxy
Search in Galaxy Tools for fastreer or ask your Galaxy Admin to install it from toolshed.
From Java backend source
To build the Java backend from source code:
git clone https://github.com/gkanogiannis/fastreeR.git
git clone https://github.com/gkanogiannis/BioInfoJava-Utils.git
pushd BioInfoJava-Utils
mvn clean initialize package && popd
Then copy the resulting .jar file(s) to the fastreeR/inst/java/ directory:
Finally run the tool from its Python CLI:
Distances from VCF
Calculates a cosine type dissimilarity measurement between the n samples of a VCF file.
Biallelic or multiallelic (maximum 7 alternate alleles) SNP and/or INDEL variants are considered, phased or not. Some VCF encoding examples are:
- heterozygous variants: 1/0 or 0/1 or 0/2 or 1|0 or 0|1 or 0|2
- homozygous to the reference allele variants: 0/0 or 0|0
- homozygous to the first alternate allele variants: 1/1 or 1|1
If there are n samples and m variants, an nxn zero-diagonal symmetric distance matrix is calculated. The calculated cosine type distance (1-cosine_similarity)/2 is in the range [0,1] where value 0 means completely identical samples (cosine is 1), value 0.5 means perpendicular samples (cosine is 0) and value 1 means completely opposite samples (cosine is -1).
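As an illustration of the formula, here is a minimal Python sketch of the (1 - cosine_similarity) / 2 measure applied to two numeric vectors. How fastreeR encodes genotypes into vectors is not specified in this section, so the input vectors below are made-up placeholders and the function name is hypothetical.

```python
import math
from typing import Sequence


def cosine_type_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return (1 - cosine_similarity) / 2, which lies in [0, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for an all-zero vector")
    return (1.0 - dot / (norm_a * norm_b)) / 2.0


if __name__ == "__main__":
    print(cosine_type_distance([1, 0], [1, 0]))    # identical direction
    print(cosine_type_distance([1, 0], [0, 1]))    # perpendicular
    print(cosine_type_distance([1, 0], [-1, 0]))   # opposite direction
```

The three calls correspond to the three reference points named above: identical, perpendicular and opposite samples.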
The calculation is performed by a Java back-end implementation, that supports multi-core CPU utilization and can be demanding in terms of memory resources.
The output is a PHYLIP-compatible file containing n+1 lines. The first line contains the number n of samples and the number m of variants, separated by a space. Each of the subsequent n lines contains n+1 values, separated by spaces. The first value of each line is a sample name and the remaining n values are the calculated distances from this sample to all the samples. Example output file with the distances of 3 samples calculated from 1000 variants:
| 3 1000 | | | |
|---|---|---|---|
| Sample1 | 0.0 | 0.5 | 0.2 |
| Sample2 | 0.5 | 0.0 | 0.9 |
| Sample3 | 0.2 | 0.9 | 0.0 |
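A file in this layout can be read back with a few lines of Python. The sketch below assumes whitespace-separated fields exactly as described above (header line with n and m, then n rows of name plus n distances); the function name is hypothetical.

```python
from typing import List, Tuple


def parse_distance_matrix(text: str) -> Tuple[int, int, List[str], List[List[float]]]:
    """Parse the n+1 line layout: 'n m' header, then 'name d1 ... dn' rows."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    n_samples, n_variants = int(header[0]), int(header[1])
    names: List[str] = []
    rows: List[List[float]] = []
    for ln in lines[1:1 + n_samples]:
        fields = ln.split()
        names.append(fields[0])
        rows.append([float(v) for v in fields[1:]])
    if len(rows) != n_samples or any(len(r) != n_samples for r in rows):
        raise ValueError("matrix shape does not match the header")
    return n_samples, n_variants, names, rows


if __name__ == "__main__":
    example = """3 1000
Sample1 0.0 0.5 0.2
Sample2 0.5 0.0 0.9
Sample3 0.2 0.9 0.0
"""
    n, m, names, matrix = parse_distance_matrix(example)
    print(n, m, names)
    print(matrix[0][1], matrix[1][2])
```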
Embedding-Based Distance Calculation
Version 2.5.0 of the Java backend introduces support for embedding-based distance calculation in VCF2DIST and VCF2TREE. This feature allows you to incorporate pre-computed variant embeddings (e.g., from genomic language models like BioFM, DNA-BERT, Nucleotide Transformer, or custom embeddings) to compute distances in embedding space rather than genotype space.
How It Works
Instead of computing cosine similarity directly from genotype vectors, the embedding mode:
- Projects each sample into embedding space: H_i = Σ_v dosage_i^v × e_v
- Computes cosine distance between sample embeddings
This captures functional relationships between variants - samples with alleles at functionally similar positions become more similar in embedding space.
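The two steps can be sketched in plain Python. This is an illustration of the projection formula only, not the backend implementation: the variant keys, dosages and embedding values are made up, the helper names are hypothetical, and the distance is assumed to use the same (1 - cosine_similarity) / 2 form described for genotype mode.

```python
import math
from typing import Dict, List


def project_sample(dosages: Dict[str, float],
                   embeddings: Dict[str, List[float]]) -> List[float]:
    """H_i = sum over variants v of dosage_i^v * e_v."""
    dim = len(next(iter(embeddings.values())))
    h = [0.0] * dim
    for variant, dosage in dosages.items():
        e_v = embeddings.get(variant)
        if e_v is None:
            continue  # variants without an embedding are skipped
        for k in range(dim):
            h[k] += dosage * e_v[k]
    return h


def cosine_type_distance(a: List[float], b: List[float]) -> float:
    """(1 - cosine_similarity) / 2; assumed to match the genotype-mode measure."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return (1.0 - dot / norm) / 2.0


if __name__ == "__main__":
    # Hypothetical 2-dimensional embeddings for two variants.
    emb = {"chr1:100:A:G": [1.0, 0.0], "chr1:200:C:T": [0.0, 1.0]}
    s1 = project_sample({"chr1:100:A:G": 2, "chr1:200:C:T": 0}, emb)
    s2 = project_sample({"chr1:100:A:G": 0, "chr1:200:C:T": 1}, emb)
    print(s1, s2, cosine_type_distance(s1, s2))
```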
Embedding File Formats
TSV Format:
#VARIANT_ID DIM_0 DIM_1 DIM_2 ...
chr1:12345:A:G 0.123 -0.456 0.789 ...
chr1:67890:C:T 0.567 0.123 -0.890 ...
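For reference, a TSV file in this layout can be loaded into a dictionary as sketched below. The sketch assumes tab-separated fields and that lines starting with # are headers; the function name and the example values are hypothetical.

```python
from typing import Dict, List


def load_tsv_embeddings(text: str) -> Dict[str, List[float]]:
    """Map each variant ID to its embedding vector."""
    table: Dict[str, List[float]] = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header line
        fields = line.split("\t")
        table[fields[0]] = [float(x) for x in fields[1:]]
    return table


if __name__ == "__main__":
    example = (
        "#VARIANT_ID\tDIM_0\tDIM_1\tDIM_2\n"
        "chr1:12345:A:G\t0.123\t-0.456\t0.789\n"
        "chr1:67890:C:T\t0.567\t0.123\t-0.890\n"
    )
    print(load_tsv_embeddings(example))
```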
HuggingFace JSON Format:
Embedding Command Line Options
| Option | Description |
|---|---|
| -e, --embeddings | Path to variant embeddings file |
| --embeddings-format | Format: TSV or HUGGINGFACE (auto-detected if not specified) |
| --variant-key | Variant key format: CHROM_POS, CHROM_POS_REF_ALT (default), or VCF_ID |
Embedding Examples
# Distance matrix with embeddings (TSV format, auto-detected)
python fastreeR.py VCF2DIST -i samples.vcf.gz -o distances.tsv -e variant_embeddings.tsv -t 4
# Tree with embeddings and bootstrap (HuggingFace format)
python fastreeR.py VCF2TREE -i samples.vcf.gz -o tree.nwk -e embeddings.json --embeddings-format HUGGINGFACE -b 100
# Standard mode (no embeddings) - existing behavior
python fastreeR.py VCF2DIST -i samples.vcf.gz -o distances.tsv
Variants without matching embeddings are automatically skipped, and the tool reports how many variants were used vs. skipped.
Windowed / Streaming Output
Version 2.7.0 of the Java backend introduces windowed output for VCF2DIST and VCF2TREE. Instead of producing a single genome-wide distance matrix or tree, the tools can stream one matrix (or Newick tree) per genomic window. This enables local-ancestry analyses, introgression scans, recombination-rate studies, and any workflow that needs sample relationships measured along the genome.
How Windowing Works
Variants are streamed in input order and grouped into windows defined either by base-pair span (--window-bp) or by consecutive variant count (--window-variants). When a window closes, all worker threads synchronize on a barrier, the per-window distance matrix is reduced from shared accumulators, the writer emits the window, and the accumulators are zeroed before the next window opens. Windows never straddle chromosomes; a contig change always closes the current window.
The non-windowed code path is unchanged and remains byte-identical to previous releases.
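The grouping rule can be sketched in Python as follows. This is a single-threaded illustration of tiled windows only, not the backend's barrier-synchronized implementation. It assumes half-open windows aligned to multiples of the window size (as the start=0 end=100000 headers in the output examples suggest); the exact boundary convention of the backend is an assumption here, and the function names are hypothetical.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

Variant = Tuple[str, int]  # (chromosome, position)


def tile_by_bp(variants: Iterable[Variant], window_bp: int,
               min_variants: int = 1) -> Iterator[Tuple[str, int, int, List[Variant]]]:
    """Yield (chrom, start, end, variants) for tiled base-pair windows."""
    key: Optional[Tuple[str, int]] = None
    bucket: List[Variant] = []
    for chrom, pos in variants:
        new_key = (chrom, pos // window_bp)
        # A new window index or a contig change closes the current window.
        if key is not None and new_key != key:
            if len(bucket) >= min_variants:
                yield key[0], key[1] * window_bp, (key[1] + 1) * window_bp, bucket
            bucket = []
        key = new_key
        bucket.append((chrom, pos))
    if key is not None and len(bucket) >= min_variants:
        yield key[0], key[1] * window_bp, (key[1] + 1) * window_bp, bucket


def tile_by_count(variants: Iterable[Variant], window_variants: int,
                  min_variants: int = 1) -> Iterator[List[Variant]]:
    """Yield lists of up to N consecutive variants, never mixing chromosomes."""
    bucket: List[Variant] = []
    for chrom, pos in variants:
        if bucket and (bucket[0][0] != chrom or len(bucket) == window_variants):
            if len(bucket) >= min_variants:
                yield bucket
            bucket = []
        bucket.append((chrom, pos))
    if len(bucket) >= min_variants:
        yield bucket


if __name__ == "__main__":
    demo = [("chr1", 10), ("chr1", 99_999), ("chr1", 100_000), ("chr2", 5)]
    for window in tile_by_bp(demo, 100_000):
        print(window)
    for window in tile_by_count(demo, 3):
        print(window)
```

In both functions a change of chromosome closes the current window, which mirrors the rule that windows never straddle chromosomes.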
Windowing Command Line Options
| Option | Description |
|---|---|
| --window-bp | Emit one matrix/tree per window of N base pairs (mutually exclusive with --window-variants) |
| --window-variants | Emit one matrix/tree per N consecutive variants (mutually exclusive with --window-bp) |
| --step | Window step. Defaults to window size (tiled). Sliding windows (step != size) are not yet implemented. |
| --min-variants | Minimum number of variants required to emit a window (default 1; smaller windows are skipped silently) |
| --long | (VCF2DIST only) Emit long-form TSV chrom, start, end, sample_i, sample_j, dist instead of matrices |
Output Formats
VCF2DIST default (concatenated matrices), one block per window:
# window chrom=chr1 start=0 end=100000 nvariants=842 nsamples=3
3 842
s1 0 0.4231 0.5102
s2 0.4231 0 0.3987
s3 0.5102 0.3987 0
# window chrom=chr1 start=100000 end=200000 nvariants=917 nsamples=3
...
VCF2DIST --long, single TSV with one row per sample pair per window:
chrom start end sample_i sample_j dist
chr1 0 100000 s1 s2 0.4231
chr1 0 100000 s1 s3 0.5102
chr1 0 100000 s2 s3 0.3987
...
VCF2TREE, one Newick tree per window, prefixed by a header comment:
# window chrom=chr1 start=0 end=100000 nvariants=842 nsamples=3
(s1:0.21,(s2:0.19,s3:0.18):0.05);
# window chrom=chr1 start=100000 end=200000 nvariants=917 nsamples=3
(s2:0.20,(s1:0.22,s3:0.17):0.04);
...
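Downstream scripts can split such output on the header comments. The Python sketch below parses the per-window Newick output shown above; it assumes each header is a single line of key=value pairs after "# window" and that each tree occupies one line. The function names are hypothetical.

```python
import re
from typing import Dict, Iterator, Tuple

_HEADER = re.compile(r"^#\s*window\s+(.*)$")


def parse_window_header(line: str) -> Dict[str, str]:
    """Turn '# window chrom=chr1 start=0 ...' into a dict of strings."""
    match = _HEADER.match(line.strip())
    if match is None:
        raise ValueError(f"not a window header: {line!r}")
    return dict(item.split("=", 1) for item in match.group(1).split())


def iter_window_trees(text: str) -> Iterator[Tuple[Dict[str, str], str]]:
    """Yield (header fields, Newick string) for every window."""
    meta = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            meta = parse_window_header(line)
        elif meta is not None:
            yield meta, line
            meta = None


if __name__ == "__main__":
    example = """# window chrom=chr1 start=0 end=100000 nvariants=842 nsamples=3
(s1:0.21,(s2:0.19,s3:0.18):0.05);
# window chrom=chr1 start=100000 end=200000 nvariants=917 nsamples=3
(s2:0.20,(s1:0.22,s3:0.17):0.04);
"""
    for meta, newick in iter_window_trees(example):
        print(meta["chrom"], meta["start"], meta["end"], newick)
```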
Windowing Examples
# Distance matrices in 100kb tiled windows
python fastreeR.py VCF2DIST -i samples.vcf.gz -o per_window.dist --window-bp 100000 -t 4
# Long-form TSV, one matrix per 500 consecutive variants
python fastreeR.py VCF2DIST -i samples.vcf.gz -o per_window.tsv --window-variants 500 --long -t 4
# Per-window phylogenetic trees (Newick)
python fastreeR.py VCF2TREE -i samples.vcf.gz -o per_window.nwk --window-bp 250000 -t 4
# Skip windows with fewer than 50 variants
python fastreeR.py VCF2DIST -i samples.vcf.gz -o per_window.dist --window-bp 100000 --min-variants 50
Windowing Limitations
- Sliding windows (--step different from window size) are reserved for a future release; passing such a step raises an error.
- Bootstrap (-b / --bootstrap) is rejected when combined with windowing.
- Embeddings (-e / --embeddings) are rejected when combined with windowing.
Windowed output from R
vcf2dist() and vcf2tree() accept the same windowing parameters (windowBp, windowVariants, windowStep, windowMinVariants, plus longFormat for vcf2dist). When any window parameter is set the return value changes to one of:
- vcf2dist(..., windowBp = 100000): named list of dist objects, one per window (names are "chrom:start-end").
- vcf2dist(..., windowVariants = 500, longFormat = TRUE): single long-form data.frame with columns chrom, start, end, sample_i, sample_j, dist.
- vcf2tree(..., windowBp = 250000): data.frame with columns chrom, start, end, nvariants, newick.
library(fastreeR)
vcf <- system.file("extdata", "samples.vcf.gz", package = "fastreeR")
# Per-window distance matrices (list of dist)
windows <- vcf2dist(vcf, windowBp = 100000)
length(windows); head(names(windows))
# Per-window trees as a data.frame
trees <- vcf2tree(vcf, windowVariants = 500)
trees[1, ]
CLI Interface
The Python CLI (fastreeR.py) interfaces with the Java backend via subprocess, providing a unified command-line interface for all supported tools.
Commands
General Syntax
| COMMAND | Description |
|---|---|
| VCF2DIST | Compute a cosine distance matrix from a VCF file (genome-wide or per window) |
| VCF2TREE | Compute a Newick hierarchical-clustering tree from a VCF (genome-wide or per window) |
| DIST2TREE | Compute a Newick hierarchical-clustering tree from a distance matrix |
| FASTA2DIST | Compute a D2S distance matrix from a FASTA file |
| VCF2EMB | Generate variant embeddings from a VCF using the BioFM language model |
Examples
Compute Newick tree directly from a VCF file.
You can also request bootstrap replicates directly from the VCF source. The Java backend will perform streaming bootstrap sampling and encode bootstrap support values at internal nodes of the returned Newick string. For example:
The generated Newick will contain node support values (percentage across replicates) which can be inspected with phylogenetic tools such as ape in R.
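As a quick way to inspect support values without a full tree parser, the Python sketch below extracts them with a regular expression. It assumes the common Newick convention in which a support value is a numeric internal-node label written directly after a closing parenthesis; whether the backend's output follows exactly this layout is an assumption, and the example tree is made up.

```python
import re
from typing import List

# Assumption: supports look like "(A:0.1,B:0.2)95:0.05", i.e. a number
# immediately after ")" and before the optional ":branch_length".
_SUPPORT = re.compile(r"\)(\d+(?:\.\d+)?)")


def bootstrap_supports(newick: str) -> List[float]:
    """Return internal-node support values in the order they appear."""
    return [float(value) for value in _SUPPORT.findall(newick)]


if __name__ == "__main__":
    tree = "((A:0.1,B:0.2)95:0.05,(C:0.3,D:0.1)60:0.02);"  # made-up example
    print(bootstrap_supports(tree))
```

For anything beyond a quick check, a dedicated parser such as ape in R remains the safer choice.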
Generate Variant Embeddings from VCF using BioFM
The VCF2EMB command uses the BioFM-265M genomic language model to generate embeddings for each variant in a VCF file. These embeddings can then be used with VCF2DIST or VCF2TREE for embedding-based distance calculation.
Supports gzipped input files: VCF (.vcf.gz), reference genome (.fa.gz, .fasta.gz, .fna.gz), and annotation (.gff.gz, .gff3.gz) files are automatically decompressed during processing.
Prerequisites:
- Python 3.11 environment (required by biofm-eval):
- Install PyTorch:
- Install biofm-eval from source (not available on PyPI):
Download reference genome (GRCh38): NCBI
Download gene annotations (GENCODE v38): GENCODE
# Generate embeddings in TSV format (supports gzipped inputs)
python fastreeR.py VCF2EMB -i input.vcf.gz -o embeddings.tsv \
-r GRCh38.fna.gz -a gencode.v38.annotation.gff3.gz --verbose
# Generate embeddings in HuggingFace JSON format
python fastreeR.py VCF2EMB -i input.vcf.gz -o embeddings.json \
-r GRCh38.fna -a gencode.v38.annotation.gff3 -f HUGGINGFACE
# Use GPU for faster processing
python fastreeR.py VCF2EMB -i input.vcf.gz -o embeddings.tsv \
-r GRCh38.fna -a gencode.v38.annotation.gff3 --device cuda
# Process only first 1000 variants
python fastreeR.py VCF2EMB -i input.vcf.gz -o embeddings.tsv \
-r GRCh38.fna -a gencode.v38.annotation.gff3 --max-variants 1000
You can set default paths via environment variables:
Output Examples
- Distance matrices: PHYLIP-compatible text
- Trees: Newick format
- Output is streamed line-by-line (suitable for large datasets)
Options (common to all commands)
- -i, --input: Input file (VCF or distance matrix). Use - for stdin.
- -o, --output: Output file. If omitted, prints to stdout.
- -t, --threads: Number of threads (default: 1).
- --mem MEM: Max RAM for JVM in MB (default: 256).
- --lib LIB: Path to the folder containing backend JAR libraries (default: inst/java).
- --verbose: Print progress information to stderr.
- --pipe-stderr: Pipe stderr and forward from Python (default: direct passthrough to terminal).
- --version: Print version and citation information.
Embedding options (VCF2DIST and VCF2TREE only)
- -e, --embeddings: Path to variant embeddings file for embedding-based distance calculation.
- --embeddings-format: Embeddings file format: TSV or HUGGINGFACE (auto-detected if not specified).
- --variant-key: Variant key format for embedding lookup: CHROM_POS, CHROM_POS_REF_ALT (default), or VCF_ID.
Windowing options (VCF2DIST and VCF2TREE only)
- --window-bp N: Emit one matrix/tree per window of N base pairs (mutually exclusive with --window-variants).
- --window-variants N: Emit one matrix/tree per N consecutive variants (mutually exclusive with --window-bp).
- --step N: Window step (defaults to window size, i.e. tiled). Sliding windows are not yet implemented.
- --min-variants N: Minimum number of variants required to emit a window (default 1).
- --long: (VCF2DIST only) Emit long-form TSV chrom, start, end, sample_i, sample_j, dist instead of concatenated matrices.
VCF2EMB options (embedding generation)
- -i, --input: Input VCF file.
- -o, --output: Output embeddings file (default: stdout).
- -r, --reference: Path to reference genome FASTA file (or set BIOFM_REFERENCE_GENOME env var).
- -a, --annotation: Path to gene annotation GFF3 file (or set BIOFM_GENE_ANNOTATION env var).
- -m, --model: HuggingFace model name or local path (default: m42-health/BioFM-265M).
- -f, --format: Output format: TSV or HUGGINGFACE (default: TSV).
- --variant-key: Variant key format in output: CHROM_POS, CHROM_POS_REF_ALT (default), or VCF_ID.
- --max-variants: Maximum number of variants to process (default: all).
- --batch-size: Batch size for embedding extraction (default: 32).
- --device: Device for model inference: cuda or cpu (default: auto-detect).
Integration with Java Backend
The CLI wraps tools from the BioInfoJava-Utils project and dynamically builds the Java classpath from all .jar files located in inst/java/.
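The classpath construction can be pictured with the following Python sketch. It is not the code of fastreeR.py: the function name is hypothetical, the main class passed in is a placeholder (the real entry-point class is not named in this document), and the mapping of the memory value to -Xmx...M assumes that --mem is interpreted in MB as listed in the common options.

```python
import glob
import os
from typing import List, Sequence


def build_java_command(jar_dir: str, main_class: str,
                       tool_args: Sequence[str], mem_mb: int = 256) -> List[str]:
    """Assemble a java invocation whose classpath holds every .jar in jar_dir."""
    jars = sorted(glob.glob(os.path.join(jar_dir, "*.jar")))
    if not jars:
        raise FileNotFoundError(f"no .jar files found in {jar_dir!r}")
    # os.pathsep is ':' on Linux/macOS and ';' on Windows.
    classpath = os.pathsep.join(jars)
    return ["java", f"-Xmx{mem_mb}M", "-cp", classpath, main_class, *tool_args]
```

The returned list could then be handed to subprocess.run, for example build_java_command("inst/java", "example.Main", ["VCF2TREE", "-i", "input.vcf"]), where example.Main stands in for the real main class.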
Integration with R
All core functionality is available via the fastreeR R package (Bioconductor/devel):
See fastreeR R manual and fastreeR R vignette for usage in R.
Sample data
Toy VCF, FASTA and distance sample data files are provided in inst/extdata.
samples.vcf.gz
Sample VCF file of 100 individuals and 1000 variants on chromosome 22, from the 1000 Genomes Project. Original file available at http://hgdownload.cse.ucsc.edu/gbdb/hg19/1000Genomes/phase3/
vcfFile <- system.file("extdata", "samples.vcf.gz", package = "fastreeR")
samples.vcf.dist.gz
Distances from the previous sample VCF
vcfDist <- system.file("extdata", "samples.vcf.dist.gz", package = "fastreeR")
samples.vcf.istats
Individual statistics from the previous sample VCF
vcfIstats <- system.file("extdata", "samples.vcf.istats", package = "fastreeR")
samples.fasta.gz
Sample FASTA file of 48 random bacteria RefSeq from ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/.
fastaFile <- system.file("extdata", "samples.fasta.gz", package = "fastreeR")
samples.fasta.dist.gz
Distances from the previous sample FASTA
fastaDist <- system.file("extdata", "samples.fasta.dist.gz", package = "fastreeR")
Citation
If you use fastreeR in your research, please cite:
Anestis Gkanogiannis (2016) A scalable assembly-free variable selection algorithm for biomarker discovery from metagenomes
BMC Bioinformatics 17, 311.
https://doi.org/10.1186/s12859-016-1186-3
https://github.com/gkanogiannis/fastreeR
Author
Anestis Gkanogiannis
Bioinformatics/ML Scientist
Linkedin: https://www.linkedin.com/in/anestis-gkanogiannis/
Website: https://github.com/gkanogiannis
ORCID: 0000-0002-6441-0688
License
fastreeR is licensed under the GNU General Public License v3.0.
See the LICENSE file for details.