Welcome to the Phylo2Vec demo! Here, we will quickly visit the main functions of phylo2vec, including:
- A primer on computational phylogenetics, running popular software such as rapidNJ and IQ-TREE for inferring trees from molecular sequences
- Visualising trees using ete3 or Biopython
- Using basic phylo2vec functions: sam trees, converting trees to the phylo2vec format, performing operations on trees
- Using modern tools and infrastructure present in the phylo2vec package: pixi, pytest, and pytest-benchmark
# Constants
data_dir = "../data"
tree_dir = "../trees"What is phylo2vec?¶
phylo2vec is a library for encoding and manipulating binary (phylogenetic) trees under a compact vector format. In its current version, the library is useful for:
- Sampling random trees
- Fast comparison of trees
- Compressing trees or files with many trees
The current version of Phylo2Vec (1.x) relies on a core written in Rust, with bindings to Python and R. That means that you do not need to know Rust to use the package, Python or R are sufficient! To become more familiar with Rust, we recommend this interactive book.
Before we get started, we will quickly introduce pixi, the package manger we use to orchestrate dependencies in this workshop (and in the phylo2vec package).
A minimal pixi cheatsheet¶
Pixi is a past package manager built on top of the conda ecosystem. By default, it resolves packages from conda-forge, a community-driven channel with the most complete, up-to-date collection of scientific and data packages.
For the workshop (19.09.2025), installations are done for you via GitHub codespaces! However, if you want to run this notebook independently, you may either:
- Install the dependencies separately (not recommended)
- Use pixi to manage everything in a reproducible and isolated way
We will here briefly go through pixi basics: installing it, creating a pixi.toml, managing dependencies, and running scripts
Installing pixi¶
curl -fsSL https://pixi.sh/install.sh | bashCreating a new project¶
pixi initAdding dependencies¶
pixi add numpy pandas matplotlib
pixi remove matplotlib
pixi add bioconda::rapidnj # Channel specification for conda packages
pixi add r-ggplot2 # Adding an R packageSetting up a pixi.toml for a project¶
Here is a minimal file to create a custom project
[project]
name = "pixi-workshop"
version = "0.1.0"
description = "Minimal pixi project for the 19.09.2025 workshop"
channels = ["conda-forge", "bioconda", "r"]
[dependencies]
python = ">=3.11,<3.13"
numpy = "*"
pandas = "*"
bioconda::rapidnj = "*"
r-ggplot2 = "*"Running scripts¶
Define scripts in your pixi.toml, e.g.:
[scripts]
start = "python main.py"
notebook = "jupyter lab"and run them with:
pixi run start
pixi run notebookYou may also run executables via pixi run, e.g., pixi run rapidnj
A short primer on phylogenetics¶
The overarching goal of phylogenetics is to study the evolutionary history between different entities (e.g., animal species, viral strains, languages...) using observable data.
The main artefact of phylogenetics is a phylogenetic tree, a diagram which describe the evolutionary relationships between the entities of interest. A phylogenetic tree shares many similarities with the notion of trees defined in graph theory/computer science:
- It connects vertices (or nodes) with edges (or branches).
- It is acyclic
- Any two nodes in a tree are connected by a unique path.
- Terminal nodes, or leaves correspond to observable species (with observable data)
- Internal nodes, or ancestral nodes correspond to unobserved species. They are inferred using inference algorithms
- Edges, or branches connect nodes and may have variable lengths (branch lengths). In phylogenetics, branch lengths may represent time, or the amount of evolution between a parent node and its child.
In the tree below, different species of coronaviruses are mapped, highlighting the different genera (Sarbecovirus, Hibecovirus, Embecovirus, etc.). Tree tips correspond to different viral strains, and internal nodes correspond to unobserved ancestors, inferred from the data at hand.

Most modern phylogenetic analyses rely on molecular data, as it provides a precise and objective measure of evolutionary relationships. Thus phylogeneticists compare DNA, RNA, or protein sequences across entities (or taxa, in biology) to infer common ancestry and divergence times.
A simple genetic analysis requires at least three steps:
- Collecting sequences
- Aligning sequences
- Inferring a phylogenetic tree
Sequence alignment is necessary to compare homologous positions in sequences as genes, and genomes more generally, vary in length.
- Within a gene, some entities might present length-altering mutations such as insertions or deletions, or other structural changes.
- Across genomes, different organisms may not have the same set of genes, making alignment essential for comparing homologous regions.
For the purpose of this tutorial, we will start with pre-aligned sequences. If interested, see chapter 3 of the book https://
Phylogenetic inference is the task of inferring plausible phylogenetic trees from a molecular sequence alignment (MSA). Several approaches exist:
- The simplest is to use distance-based methods, such as neighbor-joining.
- Example tool: rapidNJ
- A state-of-the-art method is maximum likelihood, which evaluates how well different trees explain the observed data under a chosen model of sequence evolution.
- Bayesian inference goes further by estimating the probability distribution of trees, incorporating prior information and quantifying uncertainty.
- Example tool: BEAST.
Processing genetic sequences in Python¶
A useful library for manipulating genetic sequence sets in Python is Biopython. We will quickly visualise the first dataset of the workshop, which comprises mitochondrial DNA from (mostly) primates. The sequence is in the FASTA format, whose basic structure looks like this:
>species1
atcgatcgatcg
>species2
tacgtacgtacg
>spescies3
atgcatggatgg(nucleotides or amino acids may also be uppercase)
from Bio import AlignIO, SeqIO
from collections import Counter
import pandas as pd
def summary(file, format="fasta"):
aln = AlignIO.read(file, format)
n_seq = len(aln)
aln_length = aln.get_alignment_length()
# Count gaps per column
gap_fractions = []
all_counts = []
conserved_cols = 0
for i in range(aln_length):
col = aln[:, i]
counts = Counter(col)
_, count = counts.most_common(1)[0]
if count == n_seq: # fully conserved
conserved_cols += 1
gap_fractions.append(counts.get("-", 0))
all_counts.append(counts)
nucleotide_counts = pd.DataFrame(all_counts).fillna(0).astype(int).sum(0).to_dict()
return {
"n_sequences": n_seq,
"alignment_length": aln_length,
"percent_conserved_sites": conserved_cols / aln_length * 100,
"percent_gap": sum(gap_fractions) / (n_seq * aln_length) * 100,
"nucleotide_counts": nucleotide_counts,
}
summary("../data/primates.fa"){'n_sequences': 14,
'alignment_length': 232,
'percent_conserved_sites': 5.172413793103448,
'percent_gap': 0.03078817733990148,
'nucleotide_counts': {'a': 1217, 'c': 1304, 't': 599, 'g': 127, '-': 1}}Neighbor-joining with rapidNJ¶
Under the neighbour joining framework, we build a tree by iteratively joining pairs of taxa with the smallest evolutionary distance.
rapidNJ is an example tool for very fast neighbour joining. Using pixi we run pixi run rapidnj to run the tool:
!pixi run rapidnj -hRapid neighbour-joining. An implementation of the canonical neighbour-joining method which utilize a fast search heuristic to reduce the running time. RapidNJ can be used to reconstruct large trees using a very small amount of memory by utilizing the HDD as storage.
USAGE: rapidnj INPUT [OPTIONS]
The INPUT can be a distance matrix in phylip (.phylip) format or a multiple alignment in stockholm (.sth) or phylip format (.phylip).
OPTIONS:
-h, --help display this help message and exit.
-v, --verbose turn on verbose output.
-i, --input-format ARG Specifies the type of input. pd = distance
matrix in phylip format, sth = multiple alignment in (single line) stockholm format.
fa = multiple alignment in (single line) FASTA format.
-o, --output-format ARG Specifies the type of output. t = phylogenetic tree in newick format
(default), m = distance matrix.
-a, --evolution-model ARG Specifies which sequence evolution method to use when computing
distance estimates from multiple alignments. jc = juke cantor,
kim = Kimura's distance (default).
-m, --memory-size The maximum amount of memory which rapidNJ is allowed to use (in MB).
Default is 90% of all available memory.
-k, --rapidnj-mem ARG Force RapidNJ to use a memory efficient version of rapidNJ. The 'arg'
specifies the percentage of a sorted distance matrix which should be
stored in memory (arg=10 means 10%).
-d, --rapidnj-disk ARG Force RapidNJ to use HDD caching where 'arg' is the directory used to
store cached files.
-c, --cores ARG Number of cores to use for computating distance matrices from multiple
alignments. All available cores are used by default.
-b --bootstrap ARG Compute bootstrap values using ARG samples. The output tree will be
annotated with the bootstrap values.
-t, --alignment-type ARG Force the input alignment to be treated as: p = protein alignment,
d = DNA alignment.
-n --no-negative-length Adjust for negative branch lengths.
-x --output-file ARG Output the result to this file instead of stdout.
To get a tree (the default option), we can run:
primate_rapidnj_file = f"{tree_dir}/primates.rapidnj"
!pixi run rapidnj -v $data_dir/primates.fa -x $primate_rapidnj_fileRapidNJ v. 2.2.2ating environment
64 bit system detected.
Using 4 core(s) for distance estimation
Input format determined as FASTA
Reading data...
Input type determined as DNA.
Number of sequences: 14
Sequence length: 232
Matrix size: 14
15996 MB of memory is available
Using RapidNJ
Using 0.000747681 MB for distance matrix
Using 0.00149536 MB for sortedMatrix
Total memory consumption is 0.00224304 MB
Computing distance matrix...
Fastdist is enabled
Using Kimura algorithm to calculate distances
Computing phylogetic tree...
100.00%
We get here an output in the so-called Newick format.
This format represents taxa using nested parentheses, for example: ((Taxon1:BranchLength1, Taxon2:BranchLength2):BranchLength3);. Subtrees are enclosed recursively in parentheses, and every tree ends with a semicolon (;).
Although not immediately easy to read, we can observe that:
- ✅ Great apes (Humans, Chimpanzees, Gorillas, Orangutans) form a separate clade, as expected.
- Macaques branch off later (Rhesus Macaque, Japanese Macaque, Crab-eating Macaque).
- ❌ Bovine appears as the most basal lineage relative to this primate set, which does not reflect the accepted mammalian phylogeny
- ❌ The placement of the Squirrel Monkey deep within the tree is also unusual.
Maximum likelihood with IQ-TREE¶
Under the maximum likelihood framework, we build a tree by finding the topology and branch lengths that maximize the probability of observing the given sequence alignment under a specified model of sequence evolution. The optimal tree is identified by exploring possible tree topologies using many tree rearrangements and selecting the one with the highest likelihood score.
IQ-TREE is an example tool for fast maximum likelihood estimation. For a detailed tutorial on IQ-TREE see this tutorial. Using pixi we run pixi run iqtree to run the tool:
!pixi run iqtree -h | head -n 50IQ-TREE version 3.0.1 for Linux x86 64-bit built Jul 9 2025
Developed by Bui Quang Minh, Thomas Wong, Nhan Ly-Trong, Huaiyan Ren
Contributed by Lam-Tung Nguyen, Dominik Schrempf, Chris Bielow,
Olga Chernomor, Michael Woodhams, Diep Thi Hoang, Heiko Schmidt
Usage: iqtree [-s ALIGNMENT] [-p PARTITION] [-m MODEL] [-t TREE] ...
GENERAL OPTIONS:
-h, --help Print (more) help usages
-s FILE[,...,FILE] PHYLIP/FASTA/NEXUS/CLUSTAL/MSF alignment file(s)
-s DIR Directory of alignment files
--seqtype STRING BIN, DNA, AA, NT2AA, CODON, MORPH (default: auto-detect)
-t FILE|PARS|RAND Starting tree (default: 99 parsimony and BIONJ)
-o TAX[,...,TAX] Outgroup taxon (list) for writing .treefile
--prefix STRING Prefix for all output files (default: aln/partition)
--seed NUM Random seed number, normally used for debugging purpose
--safe Safe likelihood kernel to avoid numerical underflow
--mem NUM[G|M|%] Maximal RAM usage in GB | MB | %
--runs NUM Number of indepedent runs (default: 1)
-v, --verbose Verbose mode, printing more messages to screen
-V, --version Display version number
--quiet Quiet mode, suppress printing to screen (stdout)
-fconst f1,...,fN Add constant patterns into alignment (N=no. states)
--epsilon NUM Likelihood epsilon for parameter estimate (default 0.01)
-T NUM|AUTO No. cores/threads or AUTO-detect (default: 1)
--threads-max NUM Max number of threads for -T AUTO (default: all cores)
CHECKPOINT:
--redo Redo both ModelFinder and tree search
--redo-tree Restore ModelFinder and only redo tree search
--undo Revoke finished run, used when changing some options
--cptime NUM Minimum checkpoint interval (default: 60 sec and adapt)
PARTITION MODEL:
-p FILE|DIR NEXUS/RAxML partition file or directory with alignments
Edge-linked proportional partition model
-q FILE|DIR Like -p but edge-linked equal partition model
-Q FILE|DIR Like -p but edge-unlinked partition model
-S FILE|DIR Like -p but separate tree inference
--subsample NUM Randomly sub-sample partitions (negative for complement)
--subsample-seed NUM Random number seed for --subsample
LIKELIHOOD/QUARTET MAPPING:
--lmap NUM Number of quartets for likelihood mapping analysis
--lmclust FILE NEXUS file containing clusters for likelihood mapping
--quartetlh Print quartet log-likelihoods to .quartetlh file
TREE SEARCH ALGORITHM:
--ninit NUM Number of initial parsimony trees (default: 100)
--ntop NUM Number of top initial trees (default: 20)
primate_iqtree_prefix = f"{tree_dir}/primates"
!pixi run iqtree -s $data_dir/primates.fa -m GTR+G --prefix $primate_iqtree_prefix --redo
primate_iqtree_file = f"{primate_iqtree_prefix}.treefile"IQ-TREE version 3.0.1 for Linux x86 64-bit built Jul 9 2025
Developed by Bui Quang Minh, Thomas Wong, Nhan Ly-Trong, Huaiyan Ren
Contributed by Lam-Tung Nguyen, Dominik Schrempf, Chris Bielow,
Olga Chernomor, Michael Woodhams, Diep Thi Hoang, Heiko Schmidt
Host: codespaces-a75d16 (AVX2, FMA3, 15 GB RAM)
Command: /workspaces/phylo2vec-workshop/.pixi/envs/default/bin/iqtree -s ../data/primates.fa -m GTR+G --prefix ../trees/primates --redo
Seed: 254513 (Using SPRNG - Scalable Parallel Random Number Generator)
Time: Tue Sep 16 21:41:00 2025
Kernel: AVX+FMA - 1 threads (4 CPU cores detected)
HINT: Use -nt option to specify number of threads because your CPU has 4 cores!
HINT: -nt AUTO will automatically determine the best number of threads to use.
Reading alignment file ../data/primates.fa ... Fasta format detected
Reading fasta file: done in 7.9869e-05 secs using 72.62% CPU
Alignment most likely contains DNA/RNA sequences
Constructing alignment: done in 0.000244175 secs using 74.54% CPU
Alignment has 14 sequences with 232 columns, 217 distinct patterns
191 parsimony-informative, 29 singleton sites, 12 constant sites
Gap/Ambiguity Composition p-value
Analyzing sequences: done in 1.2944e-05 secs using 69.53% CPU
1 Mouse 0.00% failed 3.42%
2 Bovine 0.00% passed 10.27%
3 Lemur 0.00% passed 17.69%
4 Tarsier 0.00% failed 3.53%
5 SquirMonk 0.43% passed 8.42%
6 JpnMacaq 0.00% passed 85.47%
7 RhesusMac 0.00% passed 96.42%
8 CrabEMac 0.00% passed 46.18%
9 BarbMacaq 0.00% passed 75.38%
10 Gibbon 0.00% passed 6.16%
11 Orang 0.00% failed 0.83%
12 Gorilla 0.00% passed 20.22%
13 Chimp 0.00% passed 38.37%
14 Human 0.00% passed 17.55%
**** TOTAL 0.03% 3 sequences failed composition chi2 test (p-value<5%; df=3)
Checking for duplicate sequences: done in 3.9935e-05 secs
Create initial parsimony tree by phylogenetic likelihood library (PLL)... 0.000 seconds
NOTE: 0 MB RAM (0 GB) is required!
Estimate model parameters (epsilon = 0.100)
1. Initial log-likelihood: -3001.337
2. Current log-likelihood: -2659.176
3. Current log-likelihood: -2632.929
4. Current log-likelihood: -2626.155
5. Current log-likelihood: -2623.111
6. Current log-likelihood: -2621.219
7. Current log-likelihood: -2619.644
8. Current log-likelihood: -2618.263
9. Current log-likelihood: -2617.070
10. Current log-likelihood: -2616.045
11. Current log-likelihood: -2615.173
12. Current log-likelihood: -2614.445
13. Current log-likelihood: -2613.812
14. Current log-likelihood: -2613.296
15. Current log-likelihood: -2612.851
16. Current log-likelihood: -2612.498
17. Current log-likelihood: -2612.180
18. Current log-likelihood: -2611.917
19. Current log-likelihood: -2611.687
20. Current log-likelihood: -2611.505
21. Current log-likelihood: -2611.354
22. Current log-likelihood: -2611.228
23. Current log-likelihood: -2611.114
Optimal log-likelihood: -2610.990
Rate parameters: A-C: 0.12392 A-G: 8.40821 A-T: 0.04733 C-G: 0.00010 C-T: 3.90865 G-T: 1.00000
Warning! Some parameters hit the boundaries
Base frequencies: A: 0.375 C: 0.402 G: 0.039 T: 0.184
Gamma shape alpha: 2.813
Parameters optimization took 23 rounds (0.130 sec)
Wrote distance file to...
Computing ML distances based on estimated model parameters...
Calculating distance matrix: done in 0.00153696 secs using 99.29% CPU
Computing ML distances took 0.001618 sec (of wall-clock time) 0.001609 sec (of CPU time)
WARNING: Some pairwise ML distances are too long (saturated)
Setting up auxiliary I and S matrices: done in 4.77e-05 secs using 98.53% CPU
Constructing RapidNJ tree: done in 0.000117439 secs using 267.4% CPU
Computing RapidNJ tree took 0.000228 sec (of wall-clock time) 0.000417 sec (of CPU time)
Log-likelihood of RapidNJ tree: -2610.865
--------------------------------------------------------------------
| INITIALIZING CANDIDATE TREE SET |
--------------------------------------------------------------------
Generating 98 parsimony trees... 0.045 second
Computing log-likelihood of 96 initial trees ... 0.090 seconds
Current best score: -2610.865
Do NNI search on 20 best initial trees
Optimizing NNI: done in 0.00641386 secs using 99.97% CPU
Estimate model parameters (epsilon = 0.100)
BETTER TREE FOUND at iteration 1: -2610.032
Optimizing NNI: done in 0.00667933 secs using 99.61% CPU
Optimizing NNI: done in 0.0096443 secs using 99.99% CPU
Optimizing NNI: done in 0.00732248 secs using 99.93% CPU
Optimizing NNI: done in 0.0167583 secs using 94.04% CPU
UPDATE BEST LOG-LIKELIHOOD: -2610.030
Optimizing NNI: done in 0.014474 secs using 93.12% CPU
Optimizing NNI: done in 0.0124607 secs using 68.11% CPU
Optimizing NNI: done in 0.0100009 secs using 99.93% CPU
Optimizing NNI: done in 0.0160094 secs using 99.98% CPU
UPDATE BEST LOG-LIKELIHOOD: -2610.030
Optimizing NNI: done in 0.0152949 secs using 93.49% CPU
UPDATE BEST LOG-LIKELIHOOD: -2610.030
Iteration 10 / LogL: -2610.030 / Time: 0h:0m:0s
Optimizing NNI: done in 0.0111653 secs using 73.57% CPU
Optimizing NNI: done in 0.00961478 secs using 99.59% CPU
Optimizing NNI: done in 0.0111453 secs using 99.99% CPU
Optimizing NNI: done in 0.00945041 secs using 89.73% CPU
Optimizing NNI: done in 0.0162751 secs using 87.62% CPU
Optimizing NNI: done in 0.016639 secs using 88.02% CPU
Optimizing NNI: done in 0.00946432 secs using 79.25% CPU
Optimizing NNI: done in 0.0113147 secs using 82.15% CPU
Optimizing NNI: done in 0.0115496 secs using 91.43% CPU
Optimizing NNI: done in 0.0120946 secs using 91.69% CPU
Iteration 20 / LogL: -2610.521 / Time: 0h:0m:0s
Finish initializing candidate tree set (2)
Current best tree score: -2610.030 / CPU time: 0.380
Number of iterations: 20
--------------------------------------------------------------------
| OPTIMIZING CANDIDATE TREE SET |
--------------------------------------------------------------------
Optimizing NNI: done in 0.00817134 secs using 87.75% CPU
Optimizing NNI: done in 0.0087555 secs using 93.06% CPU
Optimizing NNI: done in 0.0104697 secs using 80.57% CPU
Optimizing NNI: done in 0.015367 secs using 93.62% CPU
Optimizing NNI: done in 0.0120921 secs using 91.73% CPU
Optimizing NNI: done in 0.0104406 secs using 90.41% CPU
Optimizing NNI: done in 0.0119219 secs using 83.64% CPU
Optimizing NNI: done in 0.0113996 secs using 90.96% CPU
Optimizing NNI: done in 0.01032 secs using 90.51% CPU
Optimizing NNI: done in 0.0081252 secs using 88.49% CPU
UPDATE BEST LOG-LIKELIHOOD: -2610.030
Iteration 30 / LogL: -2610.030 / Time: 0h:0m:0s (0h:0m:1s left)
Optimizing NNI: done in 0.0145707 secs using 79.34% CPU
Optimizing NNI: done in 0.0157809 secs using 99.99% CPU
Optimizing NNI: done in 0.0102798 secs using 90.17% CPU
Optimizing NNI: done in 0.00551568 secs using 99.97% CPU
Optimizing NNI: done in 0.0055107 secs using 99.99% CPU
Optimizing NNI: done in 0.0115302 secs using 91.59% CPU
Optimizing NNI: done in 0.00753178 secs using 73.93% CPU
Optimizing NNI: done in 0.0128843 secs using 99.99% CPU
Optimizing NNI: done in 0.00888996 secs using 77.59% CPU
Optimizing NNI: done in 0.0120788 secs using 91.81% CPU
Iteration 40 / LogL: -2610.522 / Time: 0h:0m:0s (0h:0m:1s left)
Optimizing NNI: done in 0.0159488 secs using 99.99% CPU
Optimizing NNI: done in 0.00930294 secs using 79.68% CPU
Optimizing NNI: done in 0.00856704 secs using 88.09% CPU
Optimizing NNI: done in 0.00823229 secs using 75.8% CPU
Optimizing NNI: done in 0.00688569 secs using 99.99% CPU
Optimizing NNI: done in 0.0101421 secs using 99.98% CPU
Optimizing NNI: done in 0.00964183 secs using 99.98% CPU
Optimizing NNI: done in 0.0352292 secs using 94.35% CPU
Optimizing NNI: done in 0.0115839 secs using 91.41% CPU
Optimizing NNI: done in 0.0143839 secs using 83.57% CPU
Iteration 50 / LogL: -2610.481 / Time: 0h:0m:0s (0h:0m:0s left)
Optimizing NNI: done in 0.00734714 secs using 99.97% CPU
Optimizing NNI: done in 0.00773934 secs using 99.98% CPU
Optimizing NNI: done in 0.00799488 secs using 99.88% CPU
Optimizing NNI: done in 0.00849806 secs using 88.67% CPU
Optimizing NNI: done in 0.0106682 secs using 90.85% CPU
Optimizing NNI: done in 0.0102659 secs using 90.23% CPU
Optimizing NNI: done in 0.0079836 secs using 99.95% CPU
Optimizing NNI: done in 0.0121343 secs using 91.71% CPU
Optimizing NNI: done in 0.00989302 secs using 79.81% CPU
Optimizing NNI: done in 0.0107249 secs using 91.08% CPU
Iteration 60 / LogL: -2610.120 / Time: 0h:0m:0s (0h:0m:0s left)
Optimizing NNI: done in 0.0132185 secs using 84.74% CPU
Optimizing NNI: done in 0.00958536 secs using 79.04% CPU
Optimizing NNI: done in 0.00764101 secs using 86.9% CPU
Optimizing NNI: done in 0.0136006 secs using 99.99% CPU
Optimizing NNI: done in 0.0162915 secs using 94% CPU
Optimizing NNI: done in 0.0085677 secs using 88.71% CPU
Optimizing NNI: done in 0.0148867 secs using 99.98% CPU
Optimizing NNI: done in 0.0118867 secs using 99.98% CPU
Optimizing NNI: done in 0.0132138 secs using 92.4% CPU
Optimizing NNI: done in 0.0106815 secs using 90.76% CPU
Iteration 70 / LogL: -2610.036 / Time: 0h:0m:1s (0h:0m:0s left)
Optimizing NNI: done in 0.0124541 secs using 91.91% CPU
Optimizing NNI: done in 0.0119708 secs using 92.06% CPU
Optimizing NNI: done in 0.0137887 secs using 92.52% CPU
Optimizing NNI: done in 0.0103076 secs using 90.78% CPU
Optimizing NNI: done in 0.011325 secs using 82.11% CPU
Optimizing NNI: done in 0.0116191 secs using 91.4% CPU
Optimizing NNI: done in 0.00770569 secs using 86.96% CPU
Optimizing NNI: done in 0.00797808 secs using 99.97% CPU
Optimizing NNI: done in 0.0108698 secs using 99.98% CPU
Optimizing NNI: done in 0.00966555 secs using 99.9% CPU
Iteration 80 / LogL: -2610.056 / Time: 0h:0m:1s (0h:0m:0s left)
Optimizing NNI: done in 0.0104278 secs using 99.98% CPU
Optimizing NNI: done in 0.00744252 secs using 99.98% CPU
Optimizing NNI: done in 0.0134421 secs using 92.69% CPU
Optimizing NNI: done in 0.00698514 secs using 71.38% CPU
Optimizing NNI: done in 0.0079144 secs using 87.55% CPU
Optimizing NNI: done in 0.0110533 secs using 99.85% CPU
Optimizing NNI: done in 0.00870961 secs using 88.55% CPU
Optimizing NNI: done in 0.0171796 secs using 99.99% CPU
Optimizing NNI: done in 0.0138803 secs using 92.84% CPU
Optimizing NNI: done in 0.00769526 secs using 99.98% CPU
Iteration 90 / LogL: -2610.284 / Time: 0h:0m:1s (0h:0m:0s left)
Optimizing NNI: done in 0.0133823 secs using 99.99% CPU
Optimizing NNI: done in 0.00925695 secs using 99.98% CPU
Optimizing NNI: done in 0.0102966 secs using 64.11% CPU
Optimizing NNI: done in 0.0100416 secs using 89.95% CPU
Optimizing NNI: done in 0.0158254 secs using 93.63% CPU
Optimizing NNI: done in 0.00764955 secs using 99.98% CPU
Optimizing NNI: done in 0.00399646 secs using 99.96% CPU
Optimizing NNI: done in 0.00832468 secs using 99.98% CPU
Optimizing NNI: done in 0.0129783 secs using 90.84% CPU
Optimizing NNI: done in 0.00867893 secs using 87.25% CPU
Iteration 100 / LogL: -2610.119 / Time: 0h:0m:1s (0h:0m:0s left)
Optimizing NNI: done in 0.0162701 secs using 92.32% CPU
Optimizing NNI: done in 0.012092 secs using 89.71% CPU
TREE SEARCH COMPLETED AFTER 102 ITERATIONS / Time: 0h:0m:1s
--------------------------------------------------------------------
| FINALIZING TREE SEARCH |
--------------------------------------------------------------------
Performs final model parameters optimization
Estimate model parameters (epsilon = 0.010)
1. Initial log-likelihood: -2610.030
2. Current log-likelihood: -2609.993
3. Current log-likelihood: -2609.942
4. Current log-likelihood: -2609.895
5. Current log-likelihood: -2609.856
6. Current log-likelihood: -2609.825
7. Current log-likelihood: -2609.800
8. Current log-likelihood: -2609.779
9. Current log-likelihood: -2609.761
Optimal log-likelihood: -2609.751
Rate parameters: A-C: 0.10467 A-G: 8.42079 A-T: 0.03801 C-G: 0.00010 C-T: 3.89573 G-T: 1.00000
Warning! Some parameters hit the boundaries
Base frequencies: A: 0.375 C: 0.402 G: 0.039 T: 0.184
Gamma shape alpha: 2.716
Parameters optimization took 9 rounds (0.051 sec)
BEST SCORE FOUND : -2609.751
Total tree length: 35.894
Total number of iterations: 102
CPU time used for tree search: 1.206 sec (0h:0m:1s)
Wall-clock time used for tree search: 1.307 sec (0h:0m:1s)
Total CPU time used: 1.404 sec (0h:0m:1s)
Total wall-clock time used: 1.506 sec (0h:0m:1s)
Analysis results written to:
IQ-TREE report: ../trees/primates.iqtree
Maximum-likelihood tree: ../trees/primates.treefile
Likelihood distances: ../trees/primates.mldist
Screen log file: ../trees/primates.log
Date and Time: Tue Sep 16 21:41:01 2025
!cat $tree_dir/primates.treefile(Mouse:8.3560490579,((Bovine:2.5963872339,Lemur:4.8261981617):1.1364082316,Tarsier:5.3615931116):0.8929613977,(SquirMonk:2.9419033743,((JpnMacaq:0.0000022159,(RhesusMac:0.0964670212,(CrabEMac:0.1339728562,BarbMacaq:0.5369976250):0.1170793922):0.0205802779):1.4487955312,(Gibbon:1.3183511202,(Orang:0.8632207952,(Gorilla:0.1681820186,(Chimp:0.2545515524,Human:0.1205087466):0.2193635791):0.2118152634):0.4130048706):1.0934560678):0.9378962832):1.8281724040);
This tree seems already more plausible:
- ✅ Mouse is the outgroup, far outside the other mammals, which makes sense given its divergence.
- ✅ Squirrel Monkey separates next, rather than clustering with other New World monkeys
- ✅ Old World monkeys (Japanese Macaque, Rhesus Macaque, Crab-eating Macaque, Barbary Macaque) form a nested clade
- ✅ Apes (Gibbon, Orangutan, Gorilla, Chimpanzee, Human) form a well-supported, nested clade
- Bovine and Lemur form a clade that branches off early: unusual (bovines are not primates)
- Tarsier is slightly more derived than the Bovine–Lemur clade, which is also somewhat unexpected.
Factors that could enhance the accuracy of the tree include the choice of evolutionary model and, most importantly, the quality of the data. Here, our dataset is minuscule (232 base pairs!), so more data, for example by including more genes, would greatly improve our ability to reconstruct a plausible evolutionary history.
Visualizing trees¶
Several libraries in python already exist to visualize trees. A popular library is ete.
To have an ASCII drawing of the tree (without branch lengths represented), simply use print
# import PyQt5
from ete3 import Tree
# Load a tree from a Newick file*
primate_iqtree_tree = Tree(primate_iqtree_file)
# Print the tree in ASCII format
print(primate_iqtree_tree)
/-Mouse
|
| /-Bovine
| /-|
|--| \-Lemur
--| |
| \-Tarsier
|
| /-SquirMonk
| |
| | /-JpnMacaq
\-| /-|
| | | /-RhesusMac
| | \-|
| | | /-CrabEMac
\-| \-|
| \-BarbMacaq
|
| /-Gibbon
\-|
| /-Orang
\-|
| /-Gorilla
\-|
| /-Chimp
\-|
\-Human
Interactive visualisations are also possible using primate_iqtree_tree.show(). They require PyQt5 (pixi add --pypi PyQt5) and may crash your notebook.
We provide here a small function to plot a tree from an ete3 Tree in matplotlib.
import itertools
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.collections import LineCollection
def plot_tree(
tree,
align_names=False,
name_offset=None,
fontsize=9,
cdict=None,
outgroup=None,
ax=None,
):
"""
Plots a ete3.Tree object using matploltib.
Adapted from: https://gist.github.com/jolespin/5d90deff552138d73de7ed4bdd9ac57a
Parameters
----------
tree : str
An ete3.Tree object
align_names: bool
If True names will be aligned vertically, by default False
name_offset : float, optional
Offset relative to tips to write leaf_names. In BL scale, by default None
fontsize : int, optional
Text font size, by default 9
ax : matplotlib.Axes object, optional
Object on which the tree will be plotted, by default None
Returns
-------
ax : matplotlib.Axes object
The matplotlib axis containing the drawn tree
"""
if outgroup is not None:
if isinstance(outgroup, list):
outgroup = tree.get_common_ancestor(outgroup)
tree.set_outgroup(outgroup)
if ax is None:
ax = plt.gca()
aligned_lines = []
max_x = max(n.get_distance(tree) for n in tree.iter_leaves())
if name_offset is None:
name_offset = max_x / 50.0
node_pos = {n2: i for i, n2 in enumerate(tree.get_leaves()[::-1])}
node_list = itertools.chain(tree.iter_descendants(strategy="postorder"), [tree])
# draw tree
for node in node_list:
# Parent style
# pstyle = node.img_style
x = sum(n2.dist for n2 in node.iter_ancestors()) + node.dist
if node.is_leaf():
y = node_pos[node]
if align_names:
x = max_x
aligned_lines.append(((x, y), (max_x + name_offset, y)))
else:
y = np.mean([node_pos[n2] for n2 in node.children])
node_pos[node] = y
# draw vertical line
ax.plot(
[x, x],
[node_pos[node.children[0]], node_pos[node.children[-1]]],
c="k",
)
# draw horizontal lines
for child in node.children:
# Child style
# cstyle = child.img_style
ax.plot(
[x, x + child.dist],
[node_pos[child], node_pos[child]],
c="k",
)
# Node label
ax.text(
x + name_offset,
y,
node.name,
va="center",
size=fontsize,
c="k",
)
# Node point
ax.scatter(
x,
y,
c=(cdict.get(node.name, "none") if cdict else "k"),
zorder=10,
)
ali_line_col = LineCollection(aligned_lines, colors="k")
ax.add_collection(ali_line_col)
# ax.set_axis_off()
ax.set_yticks([])
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
ax.spines["left"].set_visible(False)
ax.xaxis.set_tick_params(labelsize=fontsize)
return axThe x-axis consitute the branch lengths
fig, axs = plt.subplots(1, 2, figsize=(10, 6))
primate_rapidnj_tree = Tree(primate_rapidnj_file)
plot_tree(primate_rapidnj_tree, ax=axs[0])
plot_tree(primate_iqtree_tree, ax=axs[1])
plt.show()
Manipulating trees with phylo2vec¶
Before looking at how to use the produced trees with phylo2vec, we will quickly visit its main functions.
To import phylo2vec, simply run import phylo2vec
import phylo2vec as p2vSampling a random tree¶
The definition of phylo2vec makes it very convenient to rapidly sample large, or lots of, trees.
Use sample_vector to sample a random tree topology.
p2v.sample_vector?Signature: p2v.sample_vector(n_leaves: int, ordered: bool = False) -> numpy.ndarray
Docstring:
Sample a random tree via Phylo2Vec, in vector form.
Parameters
----------
n_leaves : int
Number of leaves (>= 2)
ordered : bool, optional
If True, sample an ordered tree, by default False
True:
v_i in {0, 1, ..., i} for i in (0, n_leaves-1)
False:
v_i in {0, 1, ..., 2*i} for i in (0, n_leaves-1)
Returns
-------
numpy.ndarray
Phylo2Vec vector
File: /workspaces/phylo2vec-workshop/.pixi/envs/default/lib/python3.11/site-packages/phylo2vec/utils/vector.py
Type: functionv5 = p2v.sample_vector(5)
v5array([0, 2, 2, 6])Use sample_matrix to sample a random tree (topology + branch lengths)
p2v.sample_matrix?Signature: p2v.sample_matrix(n_leaves: int, ordered: bool = False) -> numpy.ndarray
Docstring:
Sample a random tree with branch lengths via Phylo2Vec, in matrix form.
Parameters
----------
n_leaves : int
Number of leaves (>= 2)
ordered : bool, optional
If True, sample an ordered tree, by default False
Returns
-------
numpy.ndarray
Phylo2Vec matrix
Dimensions (n_leaves, 3)
1st column: Phylo2Vec vector
2nd and 3rd columns: branch lengths of cherry [i] in the ancestry matrix
File: /workspaces/phylo2vec-workshop/.pixi/envs/default/lib/python3.11/site-packages/phylo2vec/utils/matrix.py
Type: functionm6 = p2v.sample_matrix(6).round(3)
m6array([[0. , 0.384, 0.843],
[2. , 0.158, 0.59 ],
[1. , 0.952, 0.08 ],
[0. , 0.067, 0.585],
[6. , 0.056, 0.361]])Conversion: v <--> Newick¶
Use to_newick to convert a vector to a Newick string.
newick5 = p2v.to_newick(v5)
newick6 = p2v.to_newick(m6)
print(f"Newick from v5: {newick5}")
print(f"Newick from m6: {newick6}")Newick from v5: (((0,1)6,(2,3)5)7,4)8;
Newick from m6: (((0:0.384,4:0.843)6:0.067,((1:0.158,3:0.59)7:0.952,5:0.08)8:0.585)9:0.056,2:0.361)10;
Use from_newick to convert a Newick string to a vector.
v5_new = p2v.from_newick(newick5)
print(f"Is v5 equal to v5_new? {np.array_equal(v5, v5_new)}")
m6_new = p2v.from_newick(newick6)
print(f"Is m6 equal to m6_new? {np.array_equal(m6, m6_new)}")Is v5 equal to v5_new? True
Is m6 equal to m6_new? True
Conversion: v <--> Edge list¶
An intuitive way to represent a tree topology (or any graph) is a list of edges
p2v.to_edges?Signature: p2v.to_edges(v: numpy.ndarray) -> List[Tuple[int, int]]
Docstring:
Convert a Phylo2Vec vector to an edge list
Each edge is represented as a list of two nodes (child, parent)
Parameters
----------
v : numpy.ndarray
Phylo2Vec vector
Returns
-------
edges : List[Tuple[int, int]]
List of (child, parent) edges
File: /workspaces/phylo2vec-workshop/.pixi/envs/default/lib/python3.11/site-packages/phylo2vec/base/edges.py
Type: functionUse to_edges to convert a vector to a list of tree edges (node1, node2)
Use from_edges to convert a list of edges back to a vector
edges5 = p2v.to_edges(v5)
print(edges5)
v5_edges = p2v.from_edges(edges5)
print(f"Is v5 equal to v5_edges? {np.array_equal(v5, v5_edges)}")[(2, 5), (3, 5), (0, 6), (1, 6), (6, 7), (5, 7), (7, 8), (4, 8)]
Is v5 equal to v5_edges? True
Conversion: v <--> Ancestry matrix¶
An intermediate way between the compact phylo2vec format and the Newick string of representing tree topology is what we call the “ancestry” matrix, a matrix of triplets [child1, child2, parent].
Use to_ancestry to convert a vector to an ancestry matrix
anc5 = p2v.to_ancestry(v5)
anc6 = p2v.to_ancestry(m6[:, 0].astype(int))
print(f"Ancestry from v5:\n{anc5}")
print(f"Ancestry from m6:\n{anc6}")Ancestry from v5:
[[2 3 5]
[0 1 6]
[6 5 7]
[7 4 8]]
Ancestry from m6:
[[ 0 4 6]
[ 1 3 7]
[ 7 5 8]
[ 6 8 9]
[ 9 2 10]]
The ancestry matrix is also useful to understand the order of the branch lengths in the matrix format. For a row of m6, the branch lengths are those of the children of triplet .
plot_tree(Tree(newick6))<Axes: >
Back to our primate trees¶
phylo2vec works on rooted integer trees, while both rapidNJ and IQ-TREE produce unrooted trees. Before loading the produced trees into phylo2vec, we need to:
- Root the trees (we will here use
Mouse, as we assume that it is the most distantly related taxon). For that, we use theset_outgroupfunction of ete3 - Apply a mapping of string to integer to keep the same information while having an integer tree. For that, we use the
create_label_mappingfunction of phylo2vec
from phylo2vec.utils.newick import create_label_mapping
def load_tree(file, root=None, with_branch_lengths=True):
# Load into ete3 for potential re-rooting
tr = Tree(file)
# Reroot
if root:
tr.set_outgroup(root)
# Write the tree back to Newick
if with_branch_lengths:
newick_str = tr.write(format=1, dist_formatter="%.06f")
else:
newick_str = tr.write(format=9)
# Apply a mapping
newick_int, label_mapping = create_label_mapping(newick_str.replace("'", ""))
data = {
"newick_str": newick_str,
"newick_int": newick_int,
"label_mapping": dict(sorted(label_mapping.items())),
}
# Convert to phylo2vec object
p2v_obj = p2v.from_newick(newick_int)
if p2v_obj.ndim == 1:
# ndim == 1 means it's a vector
data["v"] = p2v_obj
elif p2v_obj.ndim == 2:
# ndim == 2 means it's a matrix
data["m"] = p2v_obj
data["v"] = p2v_obj[:, 0].astype(int)
else:
raise ValueError("Unexpected ndim in phylo2vec object")
return dataprimate_iqtree_data = load_tree(primate_iqtree_file, root="Mouse", with_branch_lengths=False)
primate_iqtree_data{'newick_str': '(Mouse,(((Bovine,Lemur),Tarsier),(SquirMonk,((JpnMacaq,(RhesusMac,(CrabEMac,BarbMacaq))),(Gibbon,(Orang,(Gorilla,(Chimp,Human))))))));',
'newick_int': '(0,(((1,2),3),(4,((5,(6,(7,8))),(9,(10,(11,(12,13))))))));',
'label_mapping': {0: 'Mouse',
1: 'Bovine',
2: 'Lemur',
3: 'Tarsier',
4: 'SquirMonk',
5: 'JpnMacaq',
6: 'RhesusMac',
7: 'CrabEMac',
8: 'BarbMacaq',
9: 'Gibbon',
10: 'Orang',
11: 'Gorilla',
12: 'Chimp',
13: 'Human'},
'v': array([ 0, 1, 3, 5, 4, 5, 6, 7, 11, 9, 10, 11, 12])}Here:
newick_stris the orginal tree (without branch lengths)newick_intis the same tree, but with using integers to represent leaveslabel_mappingis a dict of the integer-to-taxon mappingvis the phylo2vec vector extracted fromnewick_int
plot_tree(Tree(primate_iqtree_data["newick_int"]))<Axes: >
One of the core functions of phylo2vec is to convert several objects representing trees to this compact vector format, and vice versa.
import phylo2vec as p2v
v = primate_iqtree_data["v"]
print(f"Vector representation: {repr(v)}")
print(f"Original integer Newick: {primate_iqtree_data['newick_int']}")
print(f"Converted integer Newick: {p2v.to_newick(v)}")Vector representation: array([ 0, 1, 3, 5, 4, 5, 6, 7, 11, 9, 10, 11, 12])
Original integer Newick: (0,(((1,2),3),(4,((5,(6,(7,8))),(9,(10,(11,(12,13))))))));
Converted integer Newick: (0,(((1,2)23,3)24,(4,((5,(6,(7,8)18)19)20,(9,(10,(11,(12,13)14)15)16)17)21)22)25)26;
Another example is a list of graph edges of the form (child, parent)
edges = p2v.to_edges(v)
print(f"Edge list: {edges}")
new_v = p2v.from_edges(edges)
assert np.array_equal(v, new_v)Edge list: [(12, 14), (13, 14), (11, 15), (14, 15), (10, 16), (15, 16), (9, 17), (16, 17), (7, 18), (8, 18), (6, 19), (18, 19), (5, 20), (19, 20), (20, 21), (17, 21), (4, 22), (21, 22), (1, 23), (2, 23), (23, 24), (3, 24), (24, 25), (22, 25), (0, 26), (25, 26)]
primate_rapidnj_data = load_tree(primate_rapidnj_file, root="'Mouse'")
primate_rapidnj_data{'newick_str': "('Mouse':0.267050,((((((('Human':0.115680,'Chimp':0.178270):0.088192,'Gorilla':0.112280):0.066043,'Orang':0.201080):0.068184,'Gibbon':0.301130):0.045591,((('RhesusMac':0.066186,'JpnMacaq':0.038473):0.032833,'CrabEMac':0.172450):0.116060,'BarbMacaq':0.174200):0.145580):0.085270,'SquirMonk':0.386660):0.061420,(('Tarsier':0.420990,'Lemur':0.367210):0.101000,'Bovine':0.343560):0.017041):0.267050);",
'newick_int': '(0:0.267050,(((((((1:0.115680,2:0.178270):0.088192,3:0.112280):0.066043,4:0.201080):0.068184,5:0.301130):0.045591,(((6:0.066186,7:0.038473):0.032833,8:0.172450):0.116060,9:0.174200):0.145580):0.085270,10:0.386660):0.061420,((11:0.420990,12:0.367210):0.101000,13:0.343560):0.017041):0.267050);',
'label_mapping': {0: 'Mouse',
1: 'Human',
2: 'Chimp',
3: 'Gorilla',
4: 'Orang',
5: 'Gibbon',
6: 'RhesusMac',
7: 'JpnMacaq',
8: 'CrabEMac',
9: 'BarbMacaq',
10: 'SquirMonk',
11: 'Tarsier',
12: 'Lemur',
13: 'Bovine'},
'm': array([[0.0000e+00, 4.2099e-01, 3.6721e-01],
[1.0000e+00, 1.0100e-01, 3.4356e-01],
[3.0000e+00, 6.6186e-02, 3.8473e-02],
[5.0000e+00, 3.2833e-02, 1.7245e-01],
[7.0000e+00, 1.1606e-01, 1.7420e-01],
[9.0000e+00, 1.1568e-01, 1.7827e-01],
[6.0000e+00, 8.8192e-02, 1.1228e-01],
[8.0000e+00, 6.6043e-02, 2.0108e-01],
[1.0000e+01, 6.8184e-02, 3.0113e-01],
[1.7000e+01, 4.5591e-02, 1.4558e-01],
[1.9000e+01, 8.5270e-02, 3.8666e-01],
[1.1000e+01, 6.1420e-02, 1.7041e-02],
[1.3000e+01, 2.6705e-01, 2.6705e-01]]),
'v': array([ 0, 1, 3, 5, 7, 9, 6, 8, 10, 17, 19, 11, 13])}Advantages of phylo2vec package¶
Sampling speed¶
Example of a simple benchmark against ete3 and IQ-TREE.
We use timeit for a simple benchmark of python commands.
We use time inside of pixi to minimise overhead when calling iqtree. We use the quiet option to limit print messages. Note that this benchmark is an upper bound of the actual time due to I/O operations.
big_n = 10000
# Sample a topology
%timeit p2v.sample_vector(big_n)
%timeit tr = Tree(); tr.populate(big_n)
# Sample a topology with branch lengths
%timeit p2v.sample_matrix(big_n)
!pixi run -- time iqtree -ru 10000 10000.tree -redo --quiet641 μs ± 36.2 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
29.4 ms ± 1.48 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
674 μs ± 17 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
time: command not foundnvironment
Available tasks:
build
delete-release
download-latest-data
publish
release
set-default-repo
setup-data
start
unzip
update-release-data
zip 
Sampling speed of phylo2vec and other libraries. Author: Neil Scheidwasser. Source: Pandemic preparedness in a vector, PhD thesis (in preparation, 2025). License: CC BY 4.0.
Memory efficiency¶
Example of a memory benchmark against the Newick-formatted outputs from IQ-TREE and rapidNJ.
We use sys.getsizeof to evaluate the size of standard python objects, and the nbytes attribute to evaluate the size of NumPy array (the base for phylo2vec objects in Python).
import sys
print(f"Newick with branch lengths:\nNewick = {sys.getsizeof(primate_rapidnj_data['newick_str'])}, P2V = {primate_rapidnj_data['m'].nbytes}")
print(f"Newick without branch lengths:\nNewick = {sys.getsizeof(primate_iqtree_data['newick_str'])}, P2V = {primate_iqtree_data['v'].nbytes}")Newick with branch lengths:
Newick = 445, P2V = 312
Newick without branch lengths:
Newick = 183, P2V = 104

File size ratio 10,000 random tree topologies (resp. trees with branch lengths) with coronavirus taxa saved:
- as a plain-text file (reference)
- a hierarchical data format (HDF) file containing the trees as an array of phylo2vec vectors (resp. matrices) + and a mapping of taxon-to-integer labels
Author: Neil Scheidwasser. Source: Pandemic preparedness in a vector, PhD thesis (in preparation, 2025). License: CC BY 4.0.