SNP

1. SNP VCF Header

Metadata Field Descriptions

The VCF file header should start with the following metadata:

  • fileformat: Current VCF version ID (e.g., VCFv4.1)

  • fileDate: Date the file was generated or updated (YYYYMMDD format, e.g., 20120201)

  • handle: Registered submitter handle

  • batch: Experiment ID received from the metadata

  • bioproject_id: Registered BioProject ID (if available)

  • biosample_id: Comma-separated list of registered BioSample IDs (if available)

  • reference: The RefSeq Assembly accession number from NCBI or the genome accession number from KBDS KNA on which the variation position is based (e.g., GCF_000001405.40).

Note: If you do not place the required metadata in the VCF file header, your submission files will be returned to you for correction and resubmission.

Metadata Header Example

Example of KVar Metadata in a VCF formatted file:

##fileformat=VCFv4.1
##fileDate=20250101
##handle=KVar
##batch=Exome_SNP_Discovery
##bioproject_id=KAP000001
##biosample_id=KAS00000001,KAS00000002,KAS00000003,KAS00000004,KAS00000005
##reference=GCF_000001405.40
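
As a rough pre-submission check, the metadata lines above can be verified with a short script. The sketch below is illustrative only and is not part of any KVar tooling: the function names are hypothetical, and treating bioproject_id and biosample_id as optional is an assumption based on the "(if available)" wording.

```python
import re

# Keys taken from the metadata list above. Treating bioproject_id and
# biosample_id as optional is an assumption ("if available" in the text).
REQUIRED_KEYS = ["fileformat", "fileDate", "handle", "batch", "reference"]


def parse_metadata(header_lines):
    """Collect ##key=value lines into a dict (first occurrence wins)."""
    meta = {}
    for line in header_lines:
        match = re.match(r"^##([^=]+)=(.*)$", line.strip())
        if match and match.group(1) not in meta:
            meta[match.group(1)] = match.group(2)
    return meta


def check_metadata(header_lines):
    """Return a list of problems; an empty list means none were found."""
    meta = parse_metadata(header_lines)
    problems = [f"missing ##{key}" for key in REQUIRED_KEYS if not meta.get(key)]
    file_date = meta.get("fileDate", "")
    if file_date and not re.fullmatch(r"\d{8}", file_date):
        problems.append("fileDate is not in YYYYMMDD format")
    return problems


example_header = [
    "##fileformat=VCFv4.1",
    "##fileDate=20250101",
    "##handle=KVar",
    "##batch=Exome_SNP_Discovery",
    "##bioproject_id=KAP000001",
    "##biosample_id=KAS00000001,KAS00000002",
    "##reference=GCF_000001405.40",
]
print(check_metadata(example_header))
```

The check only confirms that the keys are present and that fileDate has eight digits; it does not confirm that a handle or accession is actually registered.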

INFO Tag Descriptions

The VCF header continues with tag/value descriptions for required and optional KVar INFO tags. These descriptions should be placed in the header following the required metadata.

The INFO tag/value descriptions you provide in the VCF header define the data you place in the INFO column of the data table. They let users viewing your data in VCF format identify each tag in the INFO column and look up the meaning of its values. Without these descriptions, the data in the INFO column will be meaningless to some users.

INFO and FORMAT Field Definition Example

2. SNP VCF Data Table

Data Table Structure

Create a tab-delimited table to house your variations and variation data for your submission. The table header should include these eight fixed, mandatory columns (in order): #CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO.

These columns represent eight fixed fields that must be filled out for each submitted variant. If you do not have data for a particular field, use a dot (".") to represent the missing value.

VCF Data Table Examples

VCF Data Table Field Values

CHROM

This field contains the chromosome identifier from the reference genome where the variant is located. KVar accepts only the "chr" prefix format (e.g., chr1, chr2, chr3, chrX, chrY, chrM). Entries for a specific CHROM should form a contiguous block within the VCF file.

POS

This field contains the reference position of the variant, which is the first base of the variation event. Positions are sorted numerically within each reference sequence chromosome (CHROM) in increasing order. When there are multiple variation alleles starting at the same POS, submit one variation type (VRT) per line.

Note: For short, simple insertions and deletions in which the REF or one of the ALT alleles would otherwise be null/empty, the POS field must contain the coordinates of the base preceding the indel event. See the Submission Data Table Special Case Examples section of this document for instructions on reporting insertion/deletion POS values.
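
The CHROM and POS ordering rules above (chr prefix, one contiguous block per chromosome, increasing positions within a block) can be checked mechanically. The following is a minimal sketch under those stated rules; the function name and the (chrom, pos) tuple input are assumptions made for illustration.

```python
def check_chrom_pos_order(records):
    """records: iterable of (chrom, pos) tuples in file order.

    Returns a list of problems; an empty list means none were found.
    """
    problems = []
    finished = set()  # chromosomes whose block has already ended
    current = None
    last_pos = None
    for number, (chrom, pos) in enumerate(records, start=1):
        if not chrom.startswith("chr"):
            problems.append(f"record {number}: CHROM '{chrom}' lacks the chr prefix")
        if chrom != current:
            if chrom in finished:
                problems.append(f"record {number}: {chrom} is not a contiguous block")
            if current is not None:
                finished.add(current)
            current, last_pos = chrom, None
        # Equal positions are allowed: several rows may share one POS.
        if last_pos is not None and pos < last_pos:
            problems.append(f"record {number}: POS {pos} is below previous POS {last_pos}")
        last_pos = pos
    return problems


print(check_chrom_pos_order([("chr1", 100), ("chr1", 100), ("chr1", 250), ("chr2", 5)]))
```

Repeated positions are accepted on purpose, because multiple rows with different variation types may start at the same POS.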

ID

This field contains the unique local ID of the variant and is a required value (cannot be NULL). The ID provided here combined with the handle must be unique for a particular submitter. You can use an HGVS expression for the variant ID if you do not have a unique identifier of your own.
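
Because IDs must be present and unique within a submitter's handle, it can help to scan the ID column before submitting. This is a small sketch with hypothetical function names and made-up IDs, not a description of how KVar itself validates IDs.

```python
from collections import Counter


def duplicate_ids(ids):
    """Return the IDs that appear more than once, sorted alphabetically."""
    counts = Counter(ids)
    return sorted(value for value, count in counts.items() if count > 1)


def missing_id_rows(ids):
    """Return 1-based row numbers whose ID is empty or the '.' placeholder."""
    return [row for row, value in enumerate(ids, start=1) if value in ("", ".")]


ids = ["var_0001", "var_0002", "var_0001", "."]  # made-up local IDs
print(duplicate_ids(ids), missing_id_rows(ids))
```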

REF

This field contains the reference allele of the variant. The bases representing the reference allele can be any of the following: A, C, G, T, N (case insensitive).

Note:

  • In order for the variant to be included in KVar, the maximum length for the REF allele is 51bp.

  • For short, simple insertions and deletions in which the REF or one of the ALT alleles would otherwise be null/empty, the REF and ALT strings must include the base preceding the indel event.

ALT

This field contains a comma-separated list of alternate, non-reference alleles that you have called in at least one sample. You can use A, C, G, T or N (case insensitive). If there are no alternative alleles, put a dot (".") placeholder in the ALT column.

Note:

  • In order for the variant to be included in KVar, the maximum length of each ALT allele is 51bp.

  • For short, simple insertions and deletions in which the REF or one of the ALT alleles would otherwise be null/empty, the REF and ALT strings must include the base preceding the indel event.
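
The allele rules for REF and ALT (bases limited to A, C, G, T, N in either case, a 51 bp maximum, and "." when there is no alternate allele) can be sketched as a simple check. The function name is hypothetical, and rejecting "." in REF is an assumption of this sketch.

```python
import re

MAX_ALLELE_LENGTH = 51  # limit stated in the REF and ALT notes
ALLELE_PATTERN = re.compile(r"[ACGTN]+", re.IGNORECASE)


def check_alleles(ref, alt):
    """Return a list of problems for one row's REF and ALT values."""
    problems = []
    # ALT may be "." when there are no alternate alleles.
    alt_alleles = [] if alt == "." else alt.split(",")
    for label, allele in [("REF", ref)] + [("ALT", a) for a in alt_alleles]:
        if not ALLELE_PATTERN.fullmatch(allele):
            problems.append(f"{label} allele '{allele}' contains characters other than A, C, G, T, N")
        elif len(allele) > MAX_ALLELE_LENGTH:
            problems.append(f"{label} allele is longer than {MAX_ALLELE_LENGTH} bp")
    return problems


print(check_alleles("A", "G,t"))
print(check_alleles("A", "GX"))
```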

QUAL

This field contains the quality score for the assertion if available.

FILTER

This field contains the filter status if available.

INFO

This field contains additional information for the reported variation. INFO fields are encoded as a semicolon-separated series of short keys with optional values in the format: <key>=<data>[,data]
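
To illustrate the <key>=<data>[,data] encoding, here is a minimal parsing sketch. The function name is hypothetical and the tag values in the example string are made up; keys without a value are stored as True, which is a choice of this sketch rather than something the format description prescribes.

```python
def parse_info(info):
    """Split an INFO string into a dict of key -> list of values.

    Keys that carry no value are stored as True.
    """
    if info == ".":
        return {}
    parsed = {}
    for entry in info.split(";"):
        if not entry:
            continue
        key, separator, data = entry.partition("=")
        parsed[key] = data.split(",") if separator else True
    return parsed


# Made-up values, for illustration only.
print(parse_info("VRT=1;PMID=111,222;NIO=3"))
```

Note that this simple split would mis-handle free text that itself contains commas or semicolons.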

See the INFO Tag Descriptions and Examples section of this document for examples of the required and optional INFO Tags that KVar supports.

Submission Data Table Special Case Examples: Reporting POS, REF and ALT for insertion/deletion variants

For simple insertions and deletions where either the REF or one of the ALT alleles would otherwise be null/empty, include the base preceding the variation event (a "padding base") in the REF and ALT allele strings, and report the coordinates of this "padding base" in POS.

The "padding base" is not required for complex substitutions or other events where all alleles have at least one base represented in their strings.
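
The padding-base rule can be expressed as two small helpers. This is a sketch with hypothetical function names, and the positions and bases in the usage lines are made up for illustration; they are not the values from the examples in this document.

```python
def pad_insertion(preceding_pos, preceding_base, inserted_bases):
    """Return (POS, REF, ALT) for a simple insertion.

    preceding_pos is the 1-based coordinate of the reference base
    immediately before the inserted bases (the padding base).
    """
    return preceding_pos, preceding_base, preceding_base + inserted_bases


def pad_deletion(preceding_pos, preceding_base, deleted_bases):
    """Return (POS, REF, ALT) for a simple deletion.

    preceding_pos is the 1-based coordinate of the reference base
    immediately before the deleted bases (the padding base).
    """
    return preceding_pos, preceding_base + deleted_bases, preceding_base


# Made-up coordinates and bases.
print(pad_insertion(100, "T", "GCA"))
print(pad_deletion(200, "A", "CG"))
```

In both cases POS points at the padding base, so neither REF nor ALT is ever empty.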

Insertion Example

If the coordinates of the first base of the insertion event ("G" at position 43219) in the above sequence were used as the reference position (POS) of this event, the REF field would have no value, since the inserted bases are present only in the ALT allele. In such a case, report the coordinates of the base that precedes the insertion event (the "t" at position 43218) for POS and include this "padding base" in the REF and ALT strings:

Deletion Example

If the coordinates of the first base of this deletion event ("A" at position 701132) in the above sequence were used as the reference position (POS) of this variant, the ALT field would have no value, since the deleted bases are present only in the reference (REF) allele. In such a case, report the coordinates of the base that precedes the deletion event (the "a" at position 701131) for POS and include this "padding base" in the REF and ALT strings:

INFO Tag Descriptions and Examples

Required KVar VCF INFO Tags

Place the required tags in the INFO column of the data table and place the corresponding tag descriptions in the file header.

Variation Type (VRT) INFO Tag

The required "VRT" INFO tag allows you to define the kind of variation you are submitting to KVar. We use this information to verify the position and to check that the reported alleles are consistent with the reported variation type. Failure to include this required INFO tag will delay your submission.

Only one variation type (VRT) can be reported per row. For instance, if you have both a deletion and an SNV at the same location, report them as two separate rows, each with the corresponding VRT value.

VRT Tag/Value Description:

VRT Data Format Example:

VRT Value Descriptions:

  • 1 - SNV: Single nucleotide variation

  • 2 - DIV: Deletion/insertion variation

  • 3 - HETEROZYGOUS: Variable, but undefined at nucleotide level

  • 4 - STR: Short tandem repeat (microsatellite) variation

  • 5 - NAMED: Insertion/deletion variation of named repetitive element

  • 6 - NO VARIATION: Sequence scanned for variation, but none observed

  • 7 - MIXED: Cluster contains submissions from 2 or more allelic classes (not used)

  • 8 - MNV: Multiple nucleotide variation with alleles of common length greater than 1

  • 9 - Exception: Exception
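
The VRT codes above can be kept in a lookup table, and a rough allele-length check can flag rows whose alleles look inconsistent with the declared type. The rules below are assumptions made for this sketch (for example, that a DIV row always has alleles of differing length); KVar's actual consistency checks are not described in this document.

```python
# Codes and labels copied from the VRT value list above.
VRT_CODES = {
    1: "SNV", 2: "DIV", 3: "HETEROZYGOUS", 4: "STR", 5: "NAMED",
    6: "NO VARIATION", 7: "MIXED", 8: "MNV", 9: "Exception",
}


def alleles_match_vrt(vrt, ref, alts):
    """Rough length-based consistency check (assumed rules, not KVar's)."""
    lengths = {len(ref)} | {len(allele) for allele in alts}
    if vrt == 1:   # SNV: every allele is a single base
        return lengths == {1}
    if vrt == 8:   # MNV: alleles share one length greater than 1
        return len(lengths) == 1 and len(ref) > 1
    if vrt == 2:   # DIV: allele lengths differ (assumption)
        return len(lengths) > 1
    if vrt == 6:   # NO VARIATION: no alternate alleles
        return not alts
    return vrt in VRT_CODES  # other types are not checked in this sketch


print(alleles_match_vrt(1, "A", ["G"]))
print(alleles_match_vrt(2, "AT", ["A"]))
```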

Population ID and Population Frequency

Population frequency and SampleSet ID are required fields for all KVar submissions. Use the ##population_id= header field, but provide the SampleSet ID (obtained from the metadata) as the value. Place one ##population_id= line per SampleSet in the VCF header after the INFO tag/value descriptions and before your data table. The SampleSet IDs you provide must match the SampleSet IDs defined in your metadata submission.

Population frequency values are reported per SampleSet using additional columns in the data table. Add a FORMAT column that specifies the data types and order, then add one column per SampleSet (column header uses the SampleSet ID value from ##population_id). Under each SampleSet column, report the counts/frequencies per the FORMAT definition.
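
The relationship between ##population_id= header lines and the per-SampleSet columns can be checked with a short script. This is a sketch only: the function names are hypothetical, and the SampleSet IDs (SS_A, SS_B) are placeholders, since real IDs come from your metadata submission.

```python
PREFIX = "##population_id="


def sampleset_ids(header_lines):
    """Return the SampleSet IDs declared in ##population_id= lines."""
    return [line.strip()[len(PREFIX):] for line in header_lines if line.startswith(PREFIX)]


def check_sampleset_columns(header_lines, column_header):
    """Compare declared SampleSet IDs with the columns that follow FORMAT."""
    declared = sampleset_ids(header_lines)
    columns = column_header.strip().lstrip("#").split("\t")
    if "FORMAT" not in columns:
        return ["missing FORMAT column"]
    sample_columns = columns[columns.index("FORMAT") + 1:]
    problems = [f"no column for SampleSet {name}" for name in declared if name not in sample_columns]
    problems += [f"column {name} has no ##population_id line" for name in sample_columns if name not in declared]
    return problems


header = ["##population_id=SS_A", "##population_id=SS_B"]  # placeholder IDs
columns = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSS_A\tSS_B"
print(check_sampleset_columns(header, columns))
```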

Header Tag/Value Definitions:

Header Example:

Data Table Example:

Optional KVar VCF INFO Tags

The following INFO tags are optional and need only be used if they describe available data. If you want to include any of the following INFO tags with your submitted data, place the tag in the INFO column of the data table and place the corresponding tag description in the file header.

Optional VCF INFO tags for KVar submissions include:

  • Alternate Designations (AD)

  • Ancestral Allele (AA)

  • Free Text for Comment (CMT)

  • LinkOut (LKO)

  • Number of Independent Observations (NIO)

  • OMIM/OMIA Record

  • PubMed ID (PMID)

  • Variant Allele Origin (SAO)

  • Variant Suspect Reason (SSR)

Alternate Designations (AD) or Names

The optional "AD" INFO tag allows you to provide KVar with a (comma-separated) set of alternative names or common names used to describe the same submitted variant.

AD Tag/Value Description:

AD Tag/Value Example:

Ancestral Allele (AA)

The optional "AA" INFO tag allows you to provide KVar with the ancestral allele (if you know it) for a variant.

AA Tag/Value Description:

AA Tag/Value Example:

Free Text for Comment (CMT)

The optional "CMT" INFO tag allows you to provide KVar with text about any additional important information (e.g., phenotypic information) that cannot be described using the other available INFO tags.

CMT Tag/Value Description:

CMT Data Format Example:

LinkOut (LKO)

The optional "LKO" INFO tag allows you to point to this variant on your organization's web site or to other relevant online information about your submission.

LKO Tag/Value Description:

LKO Data Format Example:

Number of Independent Observations (NIO)

The optional "NIO" INFO tag allows you to provide KVar with the number of times you independently observed this variant in your experimental analysis.

NIO Tag/Value Description:

NIO Tag/Value Example:

OMIM and OMIA (OMIM/OMIA) Records

The optional "OMIM" and "OMIA" INFO tags allow you to provide KVar with any available OMIM or OMIA record and variant ID (if available) associated with a variant.

OMIM and OMIA Tag/Value Descriptions:

  • OMIM:

  • OMIA:

OMIM and OMIA Data Format Example:

PubMed ID (PMID) INFO Tag

The optional "PMID" INFO tag allows you to provide KVar with the PubMed ID (if available) for an original publication associated with a variant. If multiple PubMed IDs (PMID) are available for a single variant, report them using a comma-separated list.

PMID Tag/Value Description:

PMID Data Format Example:

Variant Allele Origin (SAO) INFO Tag

The optional "SAO" or "Variant Allele Origin" INFO tag allows you to provide KVar with the source of the sample from which the variant was derived.

Note: Although the name we use to refer to Allele Origin has changed from "SNP Allele Origin" (SAO) to "Variant Allele Origin" to emphasize that the KVar database contains both rare and polymorphic variants, the database itself still uses the acronym "SAO".

SAO Tag/Value Description:

SAO Data Format Example:

Note: If you are providing more than one allele origin value, place the allele origin values in a comma-separated list in the order that the alleles appear in the submission. List the value for the reference allele first, followed by the allele origin value for the first alternate allele, the second alternate allele, and so on.
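
The ordering rule for multiple allele origin values (reference allele first, then each alternate allele in order) can be sketched as a pairing function. The function name is hypothetical, the SAO codes in the usage line are placeholders rather than real SAO values, and requiring one value per allele is an assumption of this sketch.

```python
def pair_allele_origins(ref, alt, sao):
    """Pair each allele with its SAO value: REF first, then ALT alleles in order.

    Assumes one value per allele when several values are given.
    """
    alleles = [ref] + ([] if alt == "." else alt.split(","))
    values = sao.split(",")
    if len(values) != len(alleles):
        raise ValueError(
            f"expected {len(alleles)} allele origin values, found {len(values)}"
        )
    return list(zip(alleles, values))


# Placeholder codes, not real SAO values.
print(pair_allele_origins("A", "G,T", "1,2,2"))
```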

Variant Suspect Reason (SSR) INFO Tag

The optional "SSR" or "Variant Suspect Reason" INFO tag allows you to provide KVar with the reason you suspect that a variant is a false positive. Evidence for false positives can include information indicating the presence of a paralogous sequence in the genome or evidence of sequencing error or computation artifacts.

Note: Although the name we use to refer to the Suspect Reason code has changed from "SNP Suspect Reason" (SSR) to "Variant Suspect Reason" to emphasize that the KVar database contains both rare and polymorphic variants, the database itself still uses the acronym "SSR".

SSR Tag Description:

SSR Data Format Example:

3. Limitations and Notes

Variation Size and Data Submission Limitations

  • Submit studies with Study Variant Type = SV for variations larger than 50 bp.

  • Synthetic mutations are not accepted.

  • Variations ascertained from cross-species alignments and analysis are not accepted.

  • Personal human data cannot be accepted due to current NIH policy unless the participant is enrolled in a study with institutional oversight.

Appendix: Example of a VCF Formatted KVar Submission
