2-2. KNA submission type
Prokaryotic and Eukaryotic Genome
Whole Genome Shotgun (WGS) sequencing is a high-throughput sequencing method used to determine the complete genome sequence of an organism. The technique involves randomly breaking the genome into small fragments, sequencing them, and then assembling the overlapping sequences to reconstruct the entire genome. WGS projects are genome assemblies of incomplete genomes or incomplete chromosomes of prokaryotes or eukaryotes that are generally being sequenced by a whole genome shotgun strategy. WGS projects may be annotated, but annotation is not required.
Acceptable WGS data
In principle, KNA accepts appropriately assembled sequences (i.e., assemblies built from overlapping reads) and cannot accept redundant, unassembled reads (i.e., raw read sequences). If you wish to make raw read sequences public, we recommend you contact the Korea Sequence Read Archive (KRA).
The WGS entries are contigs (overlapping reads) and/or scaffolds (assembled contigs separated by gaps).
The WGS entries can contain consecutive "n"s to represent sequencing gaps.
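The relationship between scaffolds, contigs, and gap runs described above can be illustrated with a minimal Python sketch. It is not part of any KNA tool: the function name `split_scaffold` and the `min_gap` parameter are hypothetical, and the source does not state a minimum gap length, so the default of 1 is only an assumption.

```python
import re


def split_scaffold(sequence, min_gap=1):
    """Split a scaffold into contigs at runs of 'n'/'N'.

    Returns a list of (start, end, contig) tuples with 0-based,
    end-exclusive coordinates on the scaffold.
    min_gap is a hypothetical threshold: runs shorter than this
    are left inside the contig instead of being treated as gaps.
    """
    gap_pattern = re.compile(r"[nN]{%d,}" % min_gap)
    contigs = []
    pos = 0
    for match in gap_pattern.finditer(sequence):
        if match.start() > pos:
            contigs.append((pos, match.start(), sequence[pos:match.start()]))
        pos = match.end()
    # Trailing contig after the last gap, if any
    if pos < len(sequence):
        contigs.append((pos, len(sequence), sequence[pos:]))
    return contigs


# Toy scaffold: three contigs separated by gaps of 10 and 5 n's
scaffold = "ACGTACGT" + "n" * 10 + "GGCCTTAA" + "n" * 5 + "TTGA"
contigs = split_scaffold(scaffold)
for start, end, contig in contigs:
    print(start, end, contig)
```

Keeping the coordinates alongside each contig makes it possible to report where every gap sits on the scaffold, which is usually needed when describing an assembly.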
MAG (Metagenome-Assembled Genome)
Microorganisms comprise the majority of the planet's biological diversity. However, due to the varied environments and conditions in which these organisms reside, many of them cannot be cultured by standard techniques. Culture-independent methods are essential for understanding the genetic diversity, population structure, and ecological roles of these uncultured microorganisms.
Metagenomics is the culture-independent genomic analysis of a community of microorganisms. It provides a community-wide assessment of metabolic function and bypasses the need for the isolation and laboratory cultivation of individual species. The analysis of metagenomic data provides a way to identify new organisms and isolate complete genomes from unculturable species that are present within an environmental sample.
Metagenome projects may include raw sequence reads collected from an ecological or organismal source (submitted to the Korea Sequence Read Archive, KRA); assembled contigs and/or scaffolds derived from the raw sequence data, including partial genomes from taxonomically defined organisms; and, in some cases, supporting sequences such as 16S ribosomal RNAs or fosmids (submitted to the Korea Nucleotide sequence Archive, KNA).
Integration with BioSample MIMS Package
To enhance metadata standardization for environmental samples, MAG submissions can be associated with the BioSample MIMS (Minimum Information about a Metagenomic/Environmental Sample) package. This allows submitters to:
Specify detailed environmental parameters, such as habitat type, geographic location, collection method, and physicochemical properties of the sample.
Link genomic assemblies to corresponding BioSample records, ensuring that both sequence and environmental context are captured.
Support interoperability with international repositories by providing harmonized metadata, facilitating data reuse and comparative analyses.
By combining MAG assemblies with BioSample MIMS metadata, KNA submissions provide both genomic and environmental context, enabling comprehensive studies of microbial diversity in uncultured communities.
TLS (Targeted Locus Study)
TLS is a large-scale targeted sequencing project (>2,500 sequences) for either a single gene locus from multiple organisms or multiple conserved elements derived from a single organism. TLS studies can accommodate large submissions of 16S ribosomal RNA isolated from an environmental source or a single gene compared across a population. These can be derived from uncultured or cultured organisms. These studies can also be composed of conserved elements, such as ultraconserved elements (UCEs) isolated from a single species.
Since 2016, INSDC has accepted sequence data from 16S rRNA or other targeted loci, intended mainly for clustering into operational taxonomic units (OTUs), as the TLS data type.
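As a rough pre-submission check, the size criterion above (more than 2,500 sequences) can be tested by counting FASTA records. The sketch below is only an illustration: the helper names and the constant `TLS_MIN_SEQUENCES` are hypothetical, and it assumes the sequences are held in a FASTA file in which every record starts with a ">" header line.

```python
import io

# Threshold taken from the text above: a TLS project has >2,500 sequences
TLS_MIN_SEQUENCES = 2500


def count_fasta_records(handle):
    """Count records in a FASTA stream by counting '>' header lines."""
    return sum(1 for line in handle if line.startswith(">"))


def large_enough_for_tls(n_records):
    """Return True when the record count exceeds the TLS threshold."""
    return n_records > TLS_MIN_SEQUENCES


# Toy input with two records; a real file would be opened with open(path)
sample = io.StringIO(">seq1\nACGT\n>seq2\nGGCC\n")
n_records = count_fasta_records(sample)
print(n_records, large_enough_for_tls(n_records))
```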
TSA (Transcriptome Shotgun Assembly)
TSA is an archive of transcript sequences computationally assembled from primary data such as ESTs and reads from next-generation sequencing technologies. Unlike genome assembly, which reconstructs entire genomes, TSA focuses on assembling expressed genes (mRNA transcripts) to study gene expression and functional genomics. TSA is particularly useful for studying organisms with unknown or incomplete reference genomes, allowing researchers to analyze gene expression and identify novel transcripts.
The overlapping sequence reads from a complete transcriptome are assembled into transcripts by computational methods instead of by traditional cloning and sequencing of cloned cDNAs. The primary sequence data used in the assemblies must have been experimentally determined by the same submitter. TSA sequence records differ from other KNA records because there are no physical counterparts to the assemblies.
Nucleotide (Organelle, Plasmid, Virus, Phage, or other genomic features)
Nucleotide sequence data refers to the precise order of nucleotides (adenine [A], thymine [T], cytosine [C], and guanine [G] in DNA; or adenine [A], uracil [U], cytosine [C], and guanine [G] in RNA) in a given DNA or RNA molecule. This data is fundamental for understanding genetic information and conducting biological research. This section includes information on Organelle, Plasmid, Virus, Phage, or other genomic features.
Synthetic Construct
Synthetic Construct refers to an artificially designed and assembled genetic sequence used to modify or create new biological functions in living cells. These constructs are typically made using recombinant DNA technology and serve various purposes, such as gene expression, protein production, or metabolic pathway engineering.
A Synthetic Construct consists of multiple genetic elements that work together to regulate gene function:
Promoter: A DNA sequence that initiates the transcription of the target gene.
Coding Sequence (CDS): The portion of DNA that encodes the protein or functional RNA.
Terminator: A sequence that signals the end of transcription.
Selectable Marker: A gene that allows researchers to identify successfully modified cells (e.g., antibiotic resistance genes).
Vector (Plasmid or Viral Vector): A carrier DNA molecule that transports the construct into the host cell.
PCR Primer
A PCR (Polymerase Chain Reaction) primer is a short, single-stranded DNA sequence that serves as a starting point for DNA synthesis during PCR. Primers are essential for amplifying specific DNA regions, as they provide a binding site for DNA polymerase to extend the new DNA strand.
PCR primers are typically 18–30 nucleotides long and are designed to be complementary to the target DNA sequence. Two primers are required for PCR:
Forward primer: Anneals to the antisense (template) strand and is extended in the 5' to 3' direction along the target.
Reverse primer: Anneals to the complementary (sense) strand and is also extended 5' to 3', so DNA synthesis proceeds in the opposite direction, toward the forward primer.
Example of a PCR Primer Sequence
For amplifying a gene of interest (e.g., a GFP gene):
Forward primer: 5'-ATGGTGAGCAAGGGCGAGGAGC-3'
Reverse primer: 5'-TTACTTGTACAGCTCGTCCATGCC-3'
These primers target the start and end regions of the GFP gene, ensuring precise amplification.
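The primer pair above can be sanity-checked with a few lines of Python. This is a minimal sketch rather than a primer-design tool: the helper names `reverse_complement` and `length_ok` are hypothetical, and the only rule checked is the 18–30 nucleotide length range mentioned earlier. Taking the reverse complement of the reverse primer gives the sense-strand sequence it anneals opposite to, which should end in a stop codon if the primer covers the end of the coding sequence.

```python
# Translation table for complementing DNA bases (upper and lower case)
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def length_ok(primer, shortest=18, longest=30):
    """Check the typical primer length range quoted in the text."""
    return shortest <= len(primer) <= longest


forward = "ATGGTGAGCAAGGGCGAGGAGC"
reverse = "TTACTTGTACAGCTCGTCCATGCC"

# Sense-strand sequence at the 3' end of the target
reverse_site = reverse_complement(reverse)

print(len(forward), length_ok(forward))    # 22 True
print(len(reverse), length_ok(reverse))    # 24 True
print(forward[:3])                         # ATG (start codon)
print(reverse_site)                        # GGCATGGACGAGCTGTACAAGTAA
print(reverse_site[-3:])                   # TAA (stop codon)
```

The forward primer begins with the ATG start codon and the reverse primer's target site ends with the TAA stop codon, consistent with the primers spanning the start and end of the coding sequence.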
Data Not Accepted for Submission
The following types of sequences cannot be accepted for registration:
Protein sequences without corresponding nucleotide submissions
Sequences containing a mixture of genomic and mRNA regions
Consensus or predicted sequences without physical nucleic acid counterparts
Sequences shorter than 200 nucleotides, except for Synthetic Construct and PCR Primer submissions
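The length rule in the list above can be expressed as a small check. The sketch below covers only that one rule and is not KNA's validation logic: the function name, the constant names, and the submission-type strings are hypothetical labels chosen to mirror the headings in this section.

```python
# Minimum length from the rule above; shorter sequences are not accepted
MIN_LENGTH = 200

# Submission types exempt from the length rule (labels are illustrative)
EXEMPT_TYPES = {"Synthetic Construct", "PCR Primer"}


def meets_length_rule(sequence, submission_type):
    """Return True if the sequence satisfies the 200-nucleotide rule.

    Only the length criterion is checked; the other exclusions in the
    list (e.g., protein-only or mixed genomic/mRNA sequences) are not.
    """
    if submission_type in EXEMPT_TYPES:
        return True
    return len(sequence) >= MIN_LENGTH


print(meets_length_rule("A" * 199, "WGS"))                       # False
print(meets_length_rule("A" * 200, "WGS"))                       # True
print(meets_length_rule("ATGGTGAGCAAGGGCGAGGAGC", "PCR Primer"))  # True
```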