Alignment and tree building using Nextstrain

Binder

Introduction

Nextstrain is an open-source project to harness the scientific and public health potential of pathogen genome data. Real time tracking of pathogen evolution data and visualization is provided by nextstrain.org. For example, the real time tracking of genomic data for COVID-19 pandemic can be found here: nextstrain.org/ncov/global. This notebook will focus on using the bioinformatics toolkits provided by Nextstrain and setting up a docker based local Nextstrain server for output visualization.

Configure notebook for run

[1]:
CPU_THREADS = 1

Load required python libraries

[2]:
import os,requests,json,time
import pandas as pd

Fetch genome sequences

[3]:
def fetch_genome_fasta_from_ncbi(refseq_id,output_path='.',file_format='fasta'):
  '''
  A function for fetching the genome fasta sequences from NCBI

  :param refseq_id: NCBI genome id
  :param output_path: Path to dump genome files, default '.'
  :param file_format: Output fileformat, default fasta, supported formats are 'fasta' and 'gb'
  :return: output_file
  '''
  try:
    url = \
      'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id={0}&rettype={1}'.\
        format(refseq_id,file_format)
    r = requests.get(url)
    if r.status_code != 200:
      raise ValueError(
              'Failed to download data from NCBI for id {0}, error code: {1}'.\
                format(refseq_id,r.status_code))
    fasta_data = r.content.decode('utf-8')
    output_file = \
      os.path.join(
        os.path.abspath(output_path),
        '{0}.{1}'.format(refseq_id,file_format))
    with open(output_file,'w') as fp:
      fp.write(fasta_data)
    print('Downloaded genome seq for {0}'.format(refseq_id))
    return output_file
  except Exception as e:
    print('Failed to download data for {0} from NCBI, error: {1}'.format(refseq_id,e))
[4]:
genome_list = [
    'MN988668.1',
    'MN938384.1',
    'MN975262.1',
    'MN985325.1',
    'MN988713.1',
    'MN994467.1',
    'MN994468.1',
    'MN997409.1'
]
[5]:
fasta_list = []
for id in genome_list:
  fasta_list.\
    append(
      fetch_genome_fasta_from_ncbi(
        refseq_id=id,
        output_path='/tmp'))
  time.sleep(2)
fasta_list
Downloaded genome seq for MN988668.1
Downloaded genome seq for MN938384.1
Downloaded genome seq for MN975262.1
Downloaded genome seq for MN985325.1
Downloaded genome seq for MN988713.1
Downloaded genome seq for MN994467.1
Downloaded genome seq for MN994468.1
Downloaded genome seq for MN997409.1
[5]:
['/tmp/MN988668.1.fasta',
 '/tmp/MN938384.1.fasta',
 '/tmp/MN975262.1.fasta',
 '/tmp/MN985325.1.fasta',
 '/tmp/MN988713.1.fasta',
 '/tmp/MN994467.1.fasta',
 '/tmp/MN994468.1.fasta',
 '/tmp/MN997409.1.fasta']
[6]:
reference_gb = \
  fetch_genome_fasta_from_ncbi(
    'NC_045512.2',
    file_format='gb',
    output_path='/tmp')
reference_gb
Downloaded genome seq for NC_045512.2
[6]:
'/tmp/NC_045512.2.gb'
[7]:
MERGED_FASTA = '/tmp/corona_merged.fasta'
!rm -f $MERGED_FASTA
for f in fasta_list:
  !cat $f >> $MERGED_FASTA

## count the genome sequences in the merged fasta file
!grep '>' $MERGED_FASTA|wc -l
8

Multiple sequence alignment and tree building

[8]:
ALIGNED_FASTA = '/tmp/corona_merged.afa'
!augur align \
  --sequences $MERGED_FASTA \
  --reference-sequence $reference_gb \
  --nthreads $CPU_THREADS \
  --output $ALIGNED_FASTA \
  --fill-gaps

using mafft to align via:
        mafft --reorder --anysymbol --nomemsave --adjustdirection --thread 1 /tmp/corona_merged.afa.to_align.fasta 1> /tmp/corona_merged.afa 2> /tmp/corona_merged.afa.log

        Katoh et al, Nucleic Acid Research, vol 30, issue 14
        https://doi.org/10.1093%2Fnar%2Fgkf436

No gaps in alignment to trim (with respect to the reference, NC_045512.2)
[9]:
RAW_TREE = '/tmp/corona_raw.tree'
!augur tree \
  --alignment $ALIGNED_FASTA \
  --method fasttree \
  --output $RAW_TREE \
  --nthreads $CPU_THREADS
Cannot specify model unless using IQTree. Model specification ignored.
Building a tree via:
        FastTreeMP -nosupport -nt /tmp/corona_merged.afa  1> /tmp/corona_raw.tree 2> /tmp/corona_raw.tree.log
        Price et al: FastTree 2 - Approximately Maximum-Likelihood Trees for Large Alignments.
        PLoS ONE 5(3): e9490. https://doi.org/10.1371/journal.pone.0009490


Building original tree took 1.8078341484069824 seconds

Metadata configs for Nextstrain

[10]:
metadata = [{
  'strain':'NC_045512.2',
  'virus':'SARS-CoV-2',
  'accession':'NC_045512',
  'date':'2020-03-30',
  'region':'China',
  'country':'China',
  'division':'China',
  'city':'Wuhan',
  'db':'genbank',
  'segment':'genome',
  'authors':'Baranov,P.V., Henderson,C.M., Anderson,C.B., Gesteland,R.F.,Atkins,J.F. and Howard,M.T.',
  'url':'https://www.ncbi.nlm.nih.gov/nuccore/NC_045512.2',
  'title':'Programmed ribosomal frameshifting in decoding the SARS-CoV genome',
  'journal':'Virology 332 (2), 498-510 (2005)',
  'paper_url':'https://www.ncbi.nlm.nih.gov/pubmed/15680415'
  },
  {
  'strain':'MN988668.1',
  'virus':'SARS-CoV-2',
  'accession':'MN988668',
  'date':'2020-03-11',
  'region':'China',
  'country':'China',
  'division':'China',
  'city':'Wuhan',
  'db':'genbank',
  'segment':'genome',
  'authors':'Chen,L., Liu,W., Zhang,Q., Xu,K., Ye,G., Wu,W., Sun,Z., Liu,F.,Wu,K., Mei,Y., Zhang,W., Chen,Y., Li,Y., Shi,M., Lan,K. and Liu,Y.',
  'url':'https://www.ncbi.nlm.nih.gov/nuccore/MN988668.1',
  'title':'RNA based mNGS approach identifies a novel human coronavirus from two individual pneumonia cases in 2019 Wuhan outbreak',
  'journal':'Emerg Microbes Infect (2019) In press',
  'paper_url':''
  },
  {
  'strain':'MN938384.1',
  'virus':'SARS-CoV-2',
  'accession':'MN938384',
  'date':'2020-02-11',
  'region':'China',
  'country':'China',
  'division':'China',
  'city':'Shenzhen',
  'db':'genbank',
  'segment':'genome',
  'authors':"""Chan,J.F.-W., Yuan,S., Kok,K.H., To,K.K.-W., Chu,H., Yang,J.,
            Xing,F., Liu,J., Yip,C.C.-Y., Poon,R.W.-S., Tsai,H.W., Lo,S.K.-F.,
            Chan,K.H., Poon,V.K.-M., Chan,W.M., Ip,J.D., Cai,J.P.,
            Cheng,V.C.-C., Chen,H., Hui,C.K.-M. and Yuen,K.Y.""",
  'url':'https://www.ncbi.nlm.nih.gov/nuccore/MN938384.1',
  'title':'A familial cluster of pneumonia associated with the 2019 novel coronavirus indicating person-to-person transmission: a study of a family cluster',
  'journal':'Lancet (2020) In press',
  'paper_url':''
  },
  {
  'strain':'MN975262.1',
  'virus':'SARS-CoV-2',
  'accession':'MN975262',
  'date':'2020-02-11',
  'region':'China',
  'country':'China',
  'division':'China',
  'city':'Shenzhen',
  'db':'genbank',
  'segment':'genome',
  'authors':"""Chan,J.F.-W., Yuan,S., Kok,K.H., To,K.K.-W., Chu,H., Yang,J.,
            Xing,F., Liu,J., Yip,C.C.-Y., Poon,R.W.-S., Tsai,H.W., Lo,S.K.-F.,
            Chan,K.H., Poon,V.K.-M., Chan,W.M., Ip,J.D., Cai,J.P.,
            Cheng,V.C.-C., Chen,H., Hui,C.K.-M. and Yuen,K.Y.""",
  'url':'https://www.ncbi.nlm.nih.gov/nuccore/MN975262.1',
  'title':"""A familial cluster of pneumonia associated with the 2019 novel
            coronavirus indicating person-to-person transmission: a study of a
            family cluster""",
  'journal':'Lancet (2020) In press',
  'paper_url':''
  },
  {
  'strain':'MN985325.1',
  'virus':'SARS-CoV-2',
  'accession':'MN985325',
  'date':'2020-03-27',
  'region':'USA',
  'country':'USA',
  'division':'USA',
  'city':'Atlanta',
  'db':'genbank',
  'segment':'genome',
  'authors':"""Harcourt,J., Tamin,A., Lu,X., Kamili,S., Sakthivel,S.K., Murray,J.,
            Queen,K., Tao,Y., Paden,C.R., Zhang,J., Li,Y., Uehara,A., Wang,H.,
            Goldsmith,C., Bullock,H.A., Wang,L., Whitaker,B., Lynch,B.,
            Gautam,R., Schindewolf,C., Lokugamage,K.G., Scharton,D.,
            Plante,J.A., Mirchandani,D., Widen,S.G., Narayanan,K., Makino,S.,
            Ksiazek,T.G., Plante,K.S., Weaver,S.C., Lindstrom,S., Tong,S.,
            Menachery,V.D. and Thornburg,N.J.""",
  'url':'https://www.ncbi.nlm.nih.gov/nuccore/MN985325.1',
  'title':"""Severe Acute Respiratory Syndrome Coronavirus 2 from Patient with
            2019 Novel Coronavirus Disease, United States""",
  'journal':'Emerging Infect. Dis. 26 (6) (2020) In press',
  'paper_url':''
  },
  {
  'strain':'MN988713.1',
  'virus':'SARS-CoV-2',
  'accession':'MN988713',
  'date':'2020-02-11',
  'region':'USA',
  'country':'USA',
  'division':'USA',
  'city':'Illinois',
  'db':'genbank',
  'segment':'genome',
  'authors':"""Tao,Y., Queen,K., Paden,C.R., Zhang,J., Li,Y., Uehara,A., Lu,X.,
            Lynch,B., Sakthivel,S.K.K., Whitaker,B.L., Kamili,S., Wang,L.,
            Murray,J.R., Gerber,S.I., Lindstrom,S. and Tong,S.""",
  'url':'https://www.ncbi.nlm.nih.gov/nuccore/MN988713.1',
  'title':'Novel betacoronavirus, Illinois',
  'journal':'',
  'paper_url':''
  },
  {
  'strain':'MN994467.1',
  'virus':'SARS-CoV-2',
  'accession':'MN994467',
  'date':'2020-02-11',
  'region':'USA',
  'country':'USA',
  'division':'USA',
  'city':'California',
  'db':'genbank',
  'segment':'genome',
  'authors':"""Uehara,A., Queen,K., Tao,Y., Li,Y., Paden,C.R., Zhang,J., Lu,X.,
            Lynch,B., Sakthivel,S.K.K., Whitaker,B.L., Kamili,S., Wang,L.,
            Murray,J.R., Gerber,S.I., Lindstrom,S. and Tong,S.""",
  'url':'https://www.ncbi.nlm.nih.gov/nuccore/MN994467.1',
  'title':'nCoV-2019 California case 1',
  'journal':'',
  'paper_url':''
  },
  {
  'strain':'MN994468.1',
  'virus':'SARS-CoV-2',
  'accession':'MN994468',
  'date':'2020-02-11',
  'region':'USA',
  'country':'USA',
  'division':'USA',
  'city':'California',
  'db':'genbank',
  'segment':'genome',
  'authors':"""Uehara,A., Queen,K., Tao,Y., Li,Y., Paden,C.R., Zhang,J., Lu,X.,
            Lynch,B., Sakthivel,S.K.K., Whitaker,B.L., Kamili,S., Wang,L.,
            Murray,J.R., Gerber,S.I., Lindstrom,S. and Tong,S.""",
  'url':'https://www.ncbi.nlm.nih.gov/nuccore/MN994468.1',
  'title':'nCoV-2019 California case 2',
  'journal':'',
  'paper_url':''
  },
  {
  'strain':'MN997409.1',
  'virus':'SARS-CoV-2',
  'accession':'MN997409',
  'date':'2020-02-11',
  'region':'USA',
  'country':'USA',
  'division':'USA',
  'city':'Arizona',
  'db':'genbank',
  'segment':'genome',
  'authors':"""Tao,Y., Paden,C.R., Queen,K., Uehara,A., Li,Y., Zhang,J., Lu,X.,
            Lynch,B., Sakthivel,S.K.K., Whitaker,B.L., Kamili,S., Wang,L.,
            Murray,J.R., Gerber,S.I., Lindstrom,S. and Tong,S.""",
  'url':'https://www.ncbi.nlm.nih.gov/nuccore/MN997409.1',
  'title':'nCoV-2019 sequence from Arizona case',
  'journal':'',
  'paper_url':''
  }
]

[11]:
## load metadata and write tsv files
METADATA_FILE = '/tmp/metadata.tsv'

df = pd.DataFrame(metadata)
df.to_csv(METADATA_FILE,sep='\t',index=False)
[12]:
REFINED_TREE = '/tmp/corona_refined.tree'
BRANCH_LENGTHS_JSON = '/tmp/corona_branch_lengths.json'
!augur refine \
  --tree $RAW_TREE \
  --alignment $ALIGNED_FASTA \
  --metadata $METADATA_FILE \
  --output-tree $REFINED_TREE \
  --output-node-data $BRANCH_LENGTHS_JSON \
  --timetree \
  --coalescent opt \
  --date-confidence \
  --date-inference marginal \
  --clock-filter-iqd 4

1.07    TreeTime.reroot: with method or node: least-squares

1.07    TreeTime.reroot: rerooting will ignore covariance and shared ancestry.

1.09    TreeTime.reroot: with method or node: least-squares

1.09    TreeTime.reroot: rerooting will ignore covariance and shared ancestry.

1.11    WARNING: Previous versions of TreeTime (<0.7.0) RECONSTRUCTED sequences of
        tips at positions with AMBIGUOUS bases. This resulted in unexpected
        behavior is some cases and is no longer done by default. If you want to
        replace those ambiguous sites with their most likely state, rerun with
        `reconstruct_tip_states=True` or `--reconstruct-tip-states`.

1.16    TreeTime.reroot: with method or node: least-squares

1.16    TreeTime.reroot: rerooting will account for covariance and shared ancestry.

1.26    ###TreeTime.run: INITIAL ROUND

1.69    TreeTime.reroot: with method or node: least-squares

1.70    TreeTime.reroot: rerooting will account for covariance and shared ancestry.

1.72    ###TreeTime.run: rerunning timetree after rerooting

2.17    ###TreeTime.run: ITERATION 1 out of 2 iterations

2.92    ###TreeTime.run: ITERATION 2 out of 2 iterations

3.75    ###TreeTime.run: CONVERGED

5.71    ###TreeTime.run: FINAL ROUND - confidence estimation via marginal
        reconstruction

Inferred a time resolved phylogeny using TreeTime:
        Sagulenko et al. TreeTime: Maximum-likelihood phylodynamic analysis
        Virus Evolution, vol 4, https://academic.oup.com/ve/article/4/1/vex042/4794731

updated tree written to /tmp/corona_refined.tree
node attributes written to /tmp/corona_branch_lengths.json
[13]:
## Annotate the Phylogeny
### Reconstruct Ancestral Traits
TRAITS_JSON = '/tmp/corona_traits.json'
!augur traits \
  --tree $REFINED_TREE \
  --metadata $METADATA_FILE \
  --output $TRAITS_JSON \
  --columns region country \
  --confidence
Assigned discrete traits to 9 out of 9 taxa.

NOTE: previous versions (<0.7.0) of this command made a 'short-branch
length assumption. TreeTime now optimizes the overall rate numerically
and thus allows for long branches along which multiple changes
accumulated. This is expected to affect estimates of the overall rate
while leaving the relative rates mostly unchanged.
Assigned discrete traits to 9 out of 9 taxa.

NOTE: previous versions (<0.7.0) of this command made a 'short-branch
length assumption. TreeTime now optimizes the overall rate numerically
and thus allows for long branches along which multiple changes
accumulated. This is expected to affect estimates of the overall rate
while leaving the relative rates mostly unchanged.

Inferred ancestral states of discrete character using TreeTime:
        Sagulenko et al. TreeTime: Maximum-likelihood phylodynamic analysis
        Virus Evolution, vol 4, https://academic.oup.com/ve/article/4/1/vex042/4794731

results written to /tmp/corona_traits.json
[14]:
### Infer Ancestral Sequences
NT_MUTS_JSON = '/tmp/corona_nt_muts.json'
!augur ancestral \
  --tree $REFINED_TREE \
  --alignment $ALIGNED_FASTA \
  --output-node-data $NT_MUTS_JSON \
  --inference joint

Inferred ancestral sequence states using TreeTime:
        Sagulenko et al. TreeTime: Maximum-likelihood phylodynamic analysis
        Virus Evolution, vol 4, https://academic.oup.com/ve/article/4/1/vex042/4794731

ancestral mutations written to /tmp/corona_nt_muts.json
[15]:
### Identify Amino-Acid Mutations
AA_MUTS_JSON = '/tmp/corona_aa_muts.json'
!augur translate \
  --tree $REFINED_TREE \
  --ancestral-sequences $NT_MUTS_JSON \
  --reference-sequence $reference_gb \
  --output $AA_MUTS_JSON
Read in 12 features from reference sequence file
amino acid mutations written to /tmp/corona_aa_muts.json
[16]:
%%file /tmp/lat_longs.tsv
country,china,31.0740838,107.7450643
country,usa,42.6584011,-80.2716978
region,china,31.0740838,107.7450643
region,usa,42.6584011,-80.2716978
Writing /tmp/lat_longs.tsv
[17]:
%%file /tmp/colors.tsv
country,china,#511EA8
country,usa,#cd5700
region,china,#511EA8
region,usa,#cd5700
Writing /tmp/colors.tsv
[18]:
!sed -i 's|,|\t|g' /tmp/lat_longs.tsv
[19]:
!sed -i 's|,|\t|g' /tmp/colors.tsv
[20]:
%%file /tmp/auspice_config.json
{
  "title": "Covid-19 5 samples",
  "maintainers": [
    {"name": "Your name", "url": "your url"}
  ],
  "build_url": "your build url",
  "colorings": [
    {
      "key": "gt",
      "title": "Genotype",
      "type": "categorical"
    },
    {
      "key": "num_date",
      "title": "Date",
      "type": "continuous"
    },
    {
      "key": "author",
      "title": "Author",
      "type": "categorical"
    },
    {
      "key": "country",
      "title": "Country",
      "type": "categorical"
    },
    {
      "key": "region",
      "title": "Region",
      "type": "categorical"
    }
  ],
  "geo_resolutions": [
    "country",
    "region"
  ],
  "panels": [
     "tree",
     "map"
  ],
  "display_defaults": {
    "map_triplicate": true
  },
  "filters": [
    "country",
    "region",
    "author"
  ]
}
Writing /tmp/auspice_config.json
[21]:
### Export the Results
NEXTSTRAIN_SERVER_JSON = 'corona.json'
!augur export v2 \
  --tree $REFINED_TREE \
  --metadata $METADATA_FILE \
  --node-data $BRANCH_LENGTHS_JSON \
              $TRAITS_JSON \
              $NT_MUTS_JSON \
              $AA_MUTS_JSON \
  --colors /tmp/colors.tsv \
  --lat-longs /tmp/lat_longs.tsv \
  --auspice-config /tmp/auspice_config.json \
  --output $NEXTSTRAIN_SERVER_JSON
Validating schema of '/tmp/corona_aa_muts.json'...
Validating config file /tmp/auspice_config.json against the JSON schema
Validating schema of '/tmp/auspice_config.json'...
Validating produced JSON
Validating schema of 'corona.json'...
Validating that the JSON is internally consistent...
Validation of 'corona.json' succeeded.

Nextstrain visualization

  • Download corona.json file to /local/path/data
  • Install docker on local PC
  • Get docker image: docker pull nextstrain/base
  • Run docker image: docker run -it -p4000:4000 -v /local/path:/nextstrain.org -e HOST=0.0.0.0 nextstrain/base auspice view --datasetDir /nextstrain.org/data
  • Access localhost:4000 in browser

References

  • Hadfield et al., Nextstrain: real-time tracking of pathogen evolution, Bioinformatics (2018)
[ ]: