SV

1. SV VCF Header

Metadata Field Descriptions

  • fileformat: The current VCF version ID (e.g., VCFv4.1)

  • fileDate: The date the file was generated or last updated, in YYYYMMDD format (e.g., 20120201)

  • reference: The RefSeq Assembly accession number from NCBI or the genome accession number from KBDS KNA on which the variation position is based (e.g., GCF_000001405.40).

Note: If you do not place the required metadata in the VCF file header, your submission files will be returned to you for correction and resubmission.

Metadata Header Example

Example of KVar Required Metadata in a VCF formatted file:

##fileformat=VCFv4.1
##fileDate=20250101
##reference=GCF_000001405.40
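Before submitting, it can help to confirm that all three required metadata lines are present. The following is a minimal sketch, not an official KVar tool; the function name `missing_metadata` is hypothetical, and the check only looks for the keys, not for valid values.

```python
# Required metadata keys named in the field descriptions above.
REQUIRED_META = ("fileformat", "fileDate", "reference")

def missing_metadata(header_lines):
    """Return the required metadata keys absent from the '##key=value' header lines."""
    present = set()
    for line in header_lines:
        if line.startswith("##") and "=" in line:
            # Everything between '##' and the first '=' is the key.
            present.add(line[2:].split("=", 1)[0])
    return [key for key in REQUIRED_META if key not in present]

header = [
    "##fileformat=VCFv4.1",
    "##fileDate=20250101",
    "##reference=GCF_000001405.40",
]
print(missing_metadata(header))      # []
print(missing_metadata(header[:1]))  # ['fileDate', 'reference']
```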

ALT Tag Descriptions

Define the ALT symbolic alleles used for SVs in the header. Place ALT tag/value descriptions in the header following the required metadata; they will serve to define the data you place in the ALT column of the submission data table.

These descriptions let anyone viewing your data in VCF format look up a tag found in the ALT column and see what its values mean. Without them, the data in the ALT column will be meaningless to some users.

ALT Tag Definitions:

INFO Tag Descriptions

The VCF header continues with tag/value descriptions for required and optional KVar INFO tags. These descriptions should be placed in the header following the ALT tag descriptions.

The INFO tag/value descriptions in the VCF header define the data you place in the INFO column of the data table. They let anyone viewing your data in VCF format look up a tag found in the INFO column and see what its values mean. Without them, the data in the INFO column will be meaningless to some users.

Note: For detailed descriptions and examples of all available INFO tags, see the INFO Tag Descriptions and Examples section below.

2. SV VCF Data Table

Data Table Structure

Create a tab-delimited table to house the variations and variation data for your submission. The table header should include these eight fixed, mandatory columns, in order: CHROM, POS, ID, REF, ALT, QUAL, FILTER, and INFO.

These columns represent eight fixed fields that must be filled out for each submitted variant. If you do not have data for a particular field, use a dot (".") to represent the missing value.
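The missing-value rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper `format_record` is hypothetical, and the record values (ID, coordinates, INFO content) are invented, with the INFO string abbreviated rather than carrying every required tag.

```python
# The eight fixed columns, in the order the data table expects them.
COLUMNS = ("CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO")

def format_record(fields):
    """Build one tab-delimited data line; absent or None values become '.'."""
    values = []
    for name in COLUMNS:
        value = fields.get(name)
        values.append("." if value is None else str(value))
    return "\t".join(values)

# Hypothetical record with no QUAL or FILTER data available.
record = {
    "CHROM": "chr1", "POS": 1000, "ID": "example_del_1",
    "REF": "N", "ALT": "<DEL>", "INFO": "SVTYPE=DEL",
}
line = format_record(record)
print(line.split("\t"))
```

QUAL and FILTER are absent from the input, so both come out as "." in the formatted line.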

VCF Data Table Examples

VCF Data Table Field Values

CHROM

This field contains the chromosome identifier from the reference genome where the variant is located. KVar accepts only the "chr" prefix format (e.g., chr1, chr2, chr3, chrX, chrY, chrM). Entries for a specific CHROM should form a contiguous block within the VCF file.

POS

This field contains the reference position of the variant, which is the first base of the variation event. All coordinates should be 1-based. Sort positions numerically in increasing order within each reference sequence chromosome (CHROM). You may have multiple records of different structural variation types (SVTYPE) at the same POS; list the shortest variants first. Telomeres are indicated by using position 1 (p-arm) or the chromosome length (q-arm).
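The ordering rules for CHROM and POS can be checked mechanically. The sketch below is one possible check, assuming the records have already been reduced to (CHROM, POS) pairs in file order; `check_order` is a hypothetical helper, and it does not verify the "shortest variants first" rule for records sharing a POS.

```python
def check_order(records):
    """records: iterable of (chrom, pos) in file order. Returns a list of problems."""
    problems = []
    finished = set()   # chromosomes whose block has already ended
    current = None
    last_pos = 0
    for index, (chrom, pos) in enumerate(records, start=1):
        if chrom != current:
            if chrom in finished:
                problems.append(f"record {index}: {chrom} block is not contiguous")
            if current is not None:
                finished.add(current)
            current, last_pos = chrom, 0
        if pos < 1:
            problems.append(f"record {index}: POS must be 1-based")
        elif pos < last_pos:
            # Equal positions are allowed; only a decrease is an error.
            problems.append(f"record {index}: POS {pos} is out of order")
        last_pos = max(last_pos, pos)
    return problems

print(check_order([("chr1", 100), ("chr1", 100), ("chr2", 50)]))  # []
print(check_order([("chr1", 200), ("chr1", 100)]))
print(check_order([("chr1", 200), ("chr2", 50), ("chr1", 100)]))
```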

Note: Single nucleotide variants and small (≤ 50 bp) insertions and deletions must be submitted following the SNP page guidelines.

ID

This field contains a unique identifier (ID) for the variant and is required. The ID provided here combined with the handle must be unique for a particular submitter.

REF

This field contains the reference allele of the variant. The bases representing the reference allele can be any of the following: A, C, G, T, or N (case insensitive).

Although the standard VCF specification requires a literal sequence for the REF allele, this can be impractical because structural variants can be very large and their breakpoint locations are often ambiguous. It is therefore acceptable to list only the first base of the REF allele.

ALT

This field contains the alternate allele of the variant. Although the standard VCF specification requires a literal sequence for the ALT allele, this can be impractical because structural variants can be very large and their breakpoint locations are often ambiguous. It is therefore preferable to provide one of several ALT tags, surrounded by angle brackets, to indicate the nature of the variation at the ALT allele. If there are no alternative alleles, put a dot (".") placeholder in the ALT column.

Note: Although use of ALT tags is not required, it is strongly recommended when it is not possible to list the full sequence of the submitted structural variation. See the ALT Tag Definitions section above for the complete list of available ALT tags and their descriptions.

QUAL

This field contains the quality score for the assertion if available.

FILTER

This field contains the filter status if available.

INFO

This field contains additional information for the reported variation. INFO fields are encoded as a semicolon-separated series of short keys with optional values in the format: <key>=<data>[,data]

See the INFO Tag Descriptions and Examples section of this document for examples of the required and optional INFO Tags that KVar supports.
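To make the INFO encoding concrete, here is a minimal parsing sketch. The function `parse_info` is hypothetical, and the sample INFO string is invented for illustration. Keys without a value (flags) are mapped to True, and every value is kept as a list of strings, so a free-text value that itself contains commas would need separate handling.

```python
def parse_info(info):
    """Split an INFO string into a dict; flags (keys without '=') map to True."""
    result = {}
    if info in ("", "."):
        return result   # "." is the missing-value placeholder
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        result[key] = value.split(",") if sep else True
    return result

parsed = parse_info("SVTYPE=DEL;END=2599770;SVLEN=-387;VALIDATED")
print(parsed["SVTYPE"])     # ['DEL']
print(parsed["VALIDATED"])  # True
```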

INFO Tag Descriptions and Examples

Required KVar VCF INFO Tags

Place the required tags in the INFO column of the data table and place the corresponding tag descriptions in the file header.

Structural Variant Type (SVTYPE) INFO Tag

The required "SVTYPE" INFO tag defines the kind of structural variation you are submitting to KVar. Omitting this required INFO tag will delay your submission.

SVTYPE Tag/Value Description:

SVTYPE Data Format Example:

SVTYPE Value Descriptions:

  • DEL: Deletion relative to the reference

  • INS: Insertion of sequence relative to the reference

  • DUP: Region of elevated copy number relative to the reference

  • INV: Inversion of reference sequence

  • CNV: Copy number polymorphic region

  • BND: Breakend; used to represent complex rearrangements where breakpoints form novel adjacencies between different genomic locations

End Position (END) INFO Tag

The required "END" INFO tag specifies the end position of the variant described in this record.

END Tag/Value Description:

END Data Format Example:

Structural Variant Length (SVLEN) INFO Tag

The required "SVLEN" INFO tag specifies the difference in length between REF and ALT alleles.

SVLEN Tag/Value Description:

SVLEN Data Format Example:

Experiment ID (EXPERIMENT) INFO Tag

The required "EXPERIMENT" INFO tag specifies the Experiment ID from the metadata that generated this call.

EXPERIMENT Tag/Value Description:

EXPERIMENT Data Format Example:

SampleSet ID (SAMPLESET) INFO Tag

The required "SAMPLESET" INFO tag specifies the SampleSet ID from the metadata in which the variant was observed.

SAMPLESET Tag/Value Description:

SAMPLESET Data Format Example:

Imprecise Variant Coordinates

For imprecise structural variants whose breakpoints are not known to basepair resolution, KVar supports both standard VCF confidence intervals (CIPOS/CIEND) and dbVar-specific inner/outer coordinate ranges (POSrange/ENDrange), following the dbVar VCF submission format.

Note:

  • Use whichever pair of tags (CIPOS/CIEND or POSrange/ENDrange) is best supported by your underlying data.

  • One POSrange value must equal POS; one ENDrange value must equal END.
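The rule that ties POSrange to POS and ENDrange to END can be expressed as a small check. This sketch assumes each range has been parsed into a pair of integers, with None standing in for a bound that is not reported, and it reads "one value must equal" as "at least one"; both are assumptions, and `range_matches_anchor` is a hypothetical helper. The coordinates are invented.

```python
def range_matches_anchor(anchor, value_range):
    """True if at least one reported bound in value_range equals the anchor coordinate."""
    return any(bound == anchor for bound in value_range if bound is not None)

# Hypothetical imprecise variant: POS and END sit on one bound of each range.
pos, end = 1000, 5000
pos_range = (1000, 1200)
end_range = (4800, 5000)

print(range_matches_anchor(pos, pos_range))    # True
print(range_matches_anchor(end, end_range))    # True
print(range_matches_anchor(pos, (900, 1200)))  # False
```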

Optional KVar VCF INFO Tags

The following INFO tags are optional and need only be used if they describe available data. If you want to include any of the following INFO tags with your submitted data, place the tag in the INFO column of the data table and place the corresponding tag description in the file header.

Optional VCF INFO tags for KVar submissions include:

  • Mate Breakend ID (MATEID)

  • Description (DESC)

  • Genetic Origin (ORIGIN)

  • Phenotype (PHENO)

  • Links to External Databases (LINKS)

  • Validation Experiment (valEXPERIMENT)

  • Validated Flag (VALIDATED)

Mate Breakend ID (MATEID)

The optional "MATEID" INFO tag specifies the ID of mate breakends for complex rearrangements. MATEID is used when representing structural variants as breakends (SVTYPE=BND), where each breakend in a novel adjacency references its mate breakend. This tag is important for linking variant calls together, particularly when grouping variant calls into variant regions.

MATEID Tag/Value Description:

MATEID Data Format Example:

For breakend variants representing complex rearrangements:

For breakends with multiple mates:

Note: When a breakend has a single mate, provide one ID. When a breakend has multiple mates (e.g., due to breakend reuse or uncertainty in measurement), provide a comma-separated list of mate IDs. For more detailed examples of MATEID usage, refer to the 1000 Genomes Project VCF format documentation.
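Because each breakend references its mate, a quick consistency check is to confirm that every ID listed in a MATEID tag exists in the file and points back. The sketch below assumes the MATEID values have already been collected into a dictionary keyed by record ID; the helper name and the breakend IDs are hypothetical.

```python
def unreciprocated_mates(mates_by_id):
    """mates_by_id maps a breakend ID to the list of IDs in its MATEID tag.

    Returns (breakend, mate) pairs where the mate is missing from the file
    or does not list the breakend in its own MATEID tag.
    """
    problems = []
    for breakend, mates in mates_by_id.items():
        for mate in mates:
            if breakend not in mates_by_id.get(mate, []):
                problems.append((breakend, mate))
    return problems

paired = {"bnd_1": ["bnd_2"], "bnd_2": ["bnd_1"]}
broken = {"bnd_1": ["bnd_2", "bnd_3"], "bnd_2": ["bnd_1"]}
print(unreciprocated_mates(paired))  # []
print(unreciprocated_mates(broken))  # [('bnd_1', 'bnd_3')]
```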

Description (DESC)

The optional "DESC" INFO tag allows you to provide KVar with any additional information about this call that is not covered elsewhere.

DESC Tag/Value Description:

DESC Data Format Example:

Genetic Origin (ORIGIN)

The optional "ORIGIN" INFO tag allows you to provide KVar with the origin of the allele if known.

ORIGIN Tag/Value Description:

ORIGIN Data Format Example:

Phenotype (PHENO)

The optional "PHENO" INFO tag allows you to provide KVar with phenotype(s) associated with this call. Note: All clinical assertions should be submitted to ClinVar, not KVar.

PHENO Tag/Value Description:

PHENO Data Format Example:

Links to External Databases (LINKS)

The optional "LINKS" INFO tag allows you to point to this variant in external databases or to other relevant online information about your submission.

LINKS Tag/Value Description:

LINKS Data Format Example:

Validation Experiment (valEXPERIMENT)

The optional "valEXPERIMENT" INFO tag allows you to provide KVar with the Experiment ID(s) from metadata of the experiment(s) used to validate this call, followed by a colon and 'Pass' or 'Fail'.
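The "Experiment ID, colon, Pass or Fail" pattern can be validated with a short parser. This is a sketch under assumptions: it treats multiple entries as comma-separated, which this section does not state explicitly, and both the function name and the sample IDs are hypothetical.

```python
def parse_val_experiment(value):
    """Parse entries such as '2:Pass' into (experiment_id, passed) pairs."""
    entries = []
    for item in value.split(","):   # assumption: entries are comma-separated
        experiment_id, sep, outcome = item.partition(":")
        if not sep or outcome not in ("Pass", "Fail"):
            raise ValueError(
                f"expected '<ExperimentID>:Pass' or '<ExperimentID>:Fail', got {item!r}"
            )
        entries.append((experiment_id, outcome == "Pass"))
    return entries

print(parse_val_experiment("2:Pass,3:Fail"))  # [('2', True), ('3', False)]
```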

valEXPERIMENT Tag/Value Description:

valEXPERIMENT Data Format Example:

Validated Flag (VALIDATED)

The optional "VALIDATED" INFO tag is a flag indicating that the variant was validated by a follow-up experiment.

VALIDATED Tag/Value Description:

VALIDATED Data Format Example:

3. Examples of Structural Variation in KVar VCF Format

Insertion (precise)

An insertion of 981 base pairs immediately to the right of coordinate 14588694.

Insertion (imprecise)

An insertion of approximately 1500 base pairs, as determined by a paired-end mapping (PEM) experiment. The insertion took place somewhere between the coordinates corresponding to the mapped sequence reads.

Deletion (precise)

A deletion of 387 base pairs (the deleted bases are 2599384 through 2599770).
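The 387 bp figure follows from counting an inclusive, 1-based interval. The short sketch below works through the arithmetic; the negative sign on SVLEN is an assumption based on the general VCF convention that SVLEN is the ALT length minus the REF length, and the helper name is hypothetical.

```python
def deleted_length(first_deleted, last_deleted):
    """Number of bases in an inclusive, 1-based interval."""
    return last_deleted - first_deleted + 1

length = deleted_length(2599384, 2599770)
svlen = -length  # assumption: deletions carry a negative SVLEN, as in the VCF specification
print(length, svlen)  # 387 -387
```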

Deletion (imprecise)

A deletion of approximately 8000 base pairs, as determined by a paired-end mapping (PEM) experiment.

Inversion (precise)

An inversion of 863 base pairs.

Duplication (imprecise, inners and outers)

A duplication determined by oligo array CGH. The nature of arrays is such that breakpoints cannot be determined to base pair resolution, only to a range defined by probes on the array.

Inners only:

Outers only:

Inners and Outers:

4. Limitations and Notes

Variation Size and Data Submission Limitations

  • Submit variations > 50 bp in length to KVar using the SV format. Small variants (≤ 50 bp) must follow the SNP submission format.

  • Synthetic mutations are not accepted

  • Variations ascertained from cross-species alignments and analysis are not accepted

  • Personal human data cannot be accepted due to current policy unless the participant is enrolled in a study with institutional oversight
