2-5. File format guide

Editor's note (unresolved): Should the GenBank file section be removed? Is the .fa extension now accepted for FASTA? (It was not accepted previously.)

Introduction

This page describes the submission file formats currently supported by the KNA and gives submitters guidance on current and future file formats and policies for KNA submissions.

  • KNA accepts text formats such as FASTA (.fasta) for nucleotide sequences.

  • KNA also accepts GFF (.gff) files for annotation.

FASTA file (.fasta)

FASTA files are used to submit unannotated nucleotide sequences, which can represent contigs, scaffolds, or entire chromosomes. Each sequence record in a FASTA file consists of a header line (identifier) followed by one or more sequence lines.

File Structure

  1. Header line (identifier)

    • Starts with a > character.

    • Followed by a sequence identifier and optionally additional information (e.g., description, organism).

    • No line breaks are allowed within the header.

  2. Sequence line(s)

    • Contains the nucleotide sequence for that record.

    • Only valid nucleotide characters are allowed:

      • Standard bases: A, C, G, T (uppercase or lowercase)

      • Unknown bases: N or n

    • No spaces or tabs should appear within the sequence.

    • A sequence may be split across multiple lines for readability; the lines are concatenated and treated as a single record.

Example
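
The structure rules above can be illustrated with a minimal Python sketch that reads a small FASTA text and checks it against them. The sample records (`contig_1`, `contig_2`), their sequences, and the function name `read_fasta` are hypothetical and chosen only for illustration; this is not KNA's own validator, and skipping blank lines is an assumption because the guide does not address them.

```python
import io

# Characters this guide lists as valid: A, C, G, T and N, in either case.
VALID_BASES = set("ACGTNacgtn")

# Hypothetical sample: two records, the first split across two sequence lines.
SAMPLE_FASTA = """\
>contig_1 hypothetical example organism
ACGTACGTNNACGT
acgtacgt
>contig_2
GGCCTTAA
"""


def read_fasta(handle):
    """Return {identifier: concatenated sequence}; raise ValueError on a rule violation."""
    parts_by_id = {}
    current_id = None
    for line_no, raw in enumerate(handle, start=1):
        line = raw.rstrip("\r\n")
        if not line:
            continue  # assumption: blank lines are ignored
        if line.startswith(">"):
            fields = line[1:].split(None, 1)  # identifier, then optional description
            if not fields:
                raise ValueError(f"line {line_no}: header has no identifier")
            current_id = fields[0]
            parts_by_id[current_id] = []
        else:
            if current_id is None:
                raise ValueError(f"line {line_no}: sequence found before any header")
            invalid = set(line) - VALID_BASES
            if invalid:
                raise ValueError(f"line {line_no}: invalid characters {sorted(invalid)}")
            parts_by_id[current_id].append(line)
    # Multi-line sequences are joined into one sequence per record.
    return {name: "".join(parts) for name, parts in parts_by_id.items()}


records = read_fasta(io.StringIO(SAMPLE_FASTA))
for name, sequence in records.items():
    print(name, len(sequence))
# prints:
# contig_1 22
# contig_2 8
```

A line containing a space, such as `AC GT`, is rejected by this sketch because the space is not in `VALID_BASES`.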

GFF file (.gff)

A GFF (Generic Feature Format) file is a 9-column tab-delimited annotation file used to describe genomic features such as genes, transcripts, exons, CDS, and regulatory elements, along with their genomic coordinates and attributes. GFF files submitted to KNA should conform to GFF3 or GTF specifications.

File Structure

Each row represents a single genomic feature with the following 9 columns:

  1. seqid – Sequence or chromosome name

  2. source – Annotation source or software

  3. type – Feature type (e.g., gene, mRNA, exon, CDS)

  4. start – Start position of the feature on the sequence

  5. end – End position of the feature

  6. score – Numeric score (use . if not applicable)

  7. strand – Strand (+ or -)

  8. phase – Phase of a CDS feature (0, 1, or 2; use . if not applicable)

  9. attributes – Key-value pairs such as ID, Parent, Name, Note
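
As a rough illustration of the nine columns, the Python sketch below splits one tab-delimited row into named fields and applies a few basic checks. The names `GffFeature` and `parse_gff_line` and the sample row are hypothetical. The strand check follows the values listed in this guide (the GFF3 specification additionally permits `.` and `?`), and the coordinate check assumes the GFF3 convention of 1-based positions with start not greater than end.

```python
from dataclasses import dataclass

ALLOWED_STRANDS = {"+", "-"}           # values listed in this guide
ALLOWED_PHASES = {"0", "1", "2", "."}  # "." when phase does not apply


@dataclass
class GffFeature:
    seqid: str
    source: str
    type: str
    start: int
    end: int
    score: str
    strand: str
    phase: str
    attributes: str


def parse_gff_line(line: str) -> GffFeature:
    """Split one feature row into its nine columns and run basic checks."""
    columns = line.rstrip("\r\n").split("\t")
    if len(columns) != 9:
        raise ValueError(f"expected 9 tab-separated columns, found {len(columns)}")
    seqid, source, ftype, start, end, score, strand, phase, attributes = columns
    start_pos, end_pos = int(start), int(end)  # raises ValueError if not integers
    if start_pos < 1 or start_pos > end_pos:
        # Assumption from the GFF3 convention: 1-based coordinates, start <= end.
        raise ValueError(f"invalid coordinates: {start_pos}..{end_pos}")
    if score != ".":
        float(score)  # raises ValueError if the score is not numeric
    if strand not in ALLOWED_STRANDS:
        raise ValueError(f"unexpected strand: {strand!r}")
    if phase not in ALLOWED_PHASES:
        raise ValueError(f"unexpected phase: {phase!r}")
    return GffFeature(seqid, source, ftype, start_pos, end_pos,
                      score, strand, phase, attributes)


# Hypothetical row; the tabs are written out explicitly because they are invisible in print.
sample_row = "\t".join(
    ["chr1", "example", "gene", "1000", "5000", ".", "+", ".", "ID=gene1;Name=EXAMPLE1"]
)
feature = parse_gff_line(sample_row)
print(feature.type, feature.start, feature.end, feature.strand)
```

Rows that use spaces instead of tabs fail the column-count check, which is a common cause of rejected annotation files.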

Example (GFF3)
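
The following Python sketch builds a small, hypothetical GFF3 text (one gene with an mRNA, an exon, and a CDS) and reads the attributes column into key-value pairs. All identifiers (`gene1`, `mrna1`, and so on), the coordinates, and the helper name `parse_attributes` are illustrative assumptions, not real submission data. The sketch relies on two GFF3 conventions: the file begins with a `##gff-version 3` directive, and attributes are written as semicolon-separated `key=value` pairs with URL-style escaping.

```python
from urllib.parse import unquote

# Hypothetical features; columns are joined with tabs so the delimiter is explicit.
ROWS = [
    ["chr1", "example", "gene", "1000", "2000", ".", "+", ".", "ID=gene1;Name=EXAMPLE1"],
    ["chr1", "example", "mRNA", "1000", "2000", ".", "+", ".", "ID=mrna1;Parent=gene1"],
    ["chr1", "example", "exon", "1000", "1500", ".", "+", ".", "ID=exon1;Parent=mrna1"],
    ["chr1", "example", "CDS", "1100", "1400", ".", "+", "0", "ID=cds1;Parent=mrna1"],
]
SAMPLE_GFF3 = "\n".join(["##gff-version 3"] + ["\t".join(row) for row in ROWS])


def parse_attributes(column: str) -> dict:
    """Turn 'ID=gene1;Name=EXAMPLE1' into {'ID': 'gene1', 'Name': 'EXAMPLE1'}."""
    attributes = {}
    for pair in column.split(";"):
        if not pair:
            continue  # tolerate a trailing semicolon
        key, _, value = pair.partition("=")
        attributes[unquote(key)] = unquote(value)
    return attributes


features = []
for line in SAMPLE_GFF3.splitlines():
    if line.startswith("#"):
        continue  # directives and comments carry no feature data
    columns = line.split("\t")
    features.append((columns[2], parse_attributes(columns[8])))

# Check that every Parent value refers to an ID defined in the same file.
known_ids = {attrs["ID"] for _, attrs in features if "ID" in attrs}
orphans = [
    attrs.get("ID", "?")
    for _, attrs in features
    for parent in attrs.get("Parent", "").split(",")
    if parent and parent not in known_ids
]

for feature_type, attrs in features:
    print(feature_type, attrs)
print("features with a missing parent:", orphans)
```

In this sample every `Parent` value matches an `ID` defined elsewhere in the text, so the list of features with a missing parent is empty.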

