2.2 File Format Guide
Introduction
This page reviews the submission file formats currently supported by the KRA, and gives guidance to submitters about current and future file formats and policies regarding KRA submissions.
Some things to keep in mind
The KRA is a raw data archive and requires per-base quality scores for all submitted data. FASTA and other sequence-only formats are therefore not sufficient for submission. FASTA can, however, be submitted as a reference sequence.
KRA accepts binary formats such as BAM, SFF, FAST5, and HDF5, and text formats such as FASTQ.
FASTQ files
A FASTQ record consists of a defline that contains a read identifier and possibly other information, nucleotide base calls, a second defline, and per-base quality scores, all in text form. There are many variations.
The following terms and formats are defined in general:
Identifier and other information: text string terminated by white space.
Bases: the FASTQ sequence should contain standard base calls (ACTGactg) or unknown bases (Nn) and can vary in length.
Qualities options:
Option                               Format
Decimal-encoding, space-delimited    [0-9]+ | <quality>\s[0-9]+
Phred-33 ASCII                       [\!\"\#\$\%\&\'\(\)\*\+,\-\.\/0-9:;<=>\?\@A-I]+
Phred-64 ASCII                       [\@A-Z\[\\\]\^_`a-h]+
Quality string length should be equal to sequence length.
In a limited set of cases, log odds or non-ASCII numerical quality values will succeed during a KRA submission.
Files from various platforms employing this format are acceptable:
@<identifier and expected information>
<sequence>
+<identifier and other information OR empty string>
<quality>
where each instance of Identifier, Bases, and Qualities is newline-separated. Extra information added beyond the <identifier and expected information> examples is likely to be discarded or ignored.
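The four-line layout above can be checked before upload. The following is a minimal sketch, not a KRA tool: the function name parse_fastq is hypothetical, and it assumes ASCII-encoded qualities (so the quality string has exactly one character per base) and uncompressed text input.

```python
import re

# Standard base calls plus unknown bases, as listed in the Bases definition.
BASES = re.compile(r"^[ACGTNacgtn]+$")


def parse_fastq(text):
    """Return (identifier, sequence, quality) tuples from 4-line FASTQ text.

    Hypothetical helper: raises ValueError on the first malformed record.
    """
    lines = text.splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ text must contain a multiple of four lines")
    records = []
    for i in range(0, len(lines), 4):
        defline, seq, plus, qual = lines[i:i + 4]
        if not defline.startswith("@"):
            raise ValueError(f"line {i + 1}: defline must start with '@'")
        fields = defline[1:].split()
        if not fields:
            raise ValueError(f"line {i + 1}: missing read identifier")
        if not BASES.match(seq):
            raise ValueError(f"line {i + 2}: unexpected characters in sequence")
        if not plus.startswith("+"):
            raise ValueError(f"line {i + 3}: second defline must start with '+'")
        if len(qual) != len(seq):
            raise ValueError(f"line {i + 4}: quality length differs from sequence length")
        # The identifier is the text up to the first white space.
        records.append((fields[0], seq, qual))
    return records
```

For example, parse_fastq("@r1 extra\nACGTN\n+\nIIIII\n") yields one record whose identifier is "r1"; the trailing "extra" text is dropped, mirroring the note that extra defline information is likely to be ignored.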
As indicated above, the Qualities string can be space-separated numeric Phred scores or an ASCII string of the Phred scores with the ASCII character value = Phred score plus an offset constant used to place the ASCII characters in the printable character range. There are 2 predominant offsets: 33 (0 = !) and 64 (0=@).
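The offset arithmetic can be illustrated with a short sketch. The function names are hypothetical; the only facts taken from the text are the two offsets (33 and 64) and the space-delimited decimal form.

```python
def decode_phred(quality, offset=33):
    """Convert an ASCII quality string to a list of Phred scores."""
    if offset not in (33, 64):
        raise ValueError("offset must be 33 or 64")
    scores = [ord(ch) - offset for ch in quality]
    if any(score < 0 for score in scores):
        raise ValueError("quality character below the chosen offset")
    return scores


def encode_phred(scores, offset=33):
    """Convert Phred scores to an ASCII quality string."""
    return "".join(chr(score + offset) for score in scores)


def decode_decimal(quality):
    """Convert space-delimited decimal qualities to a list of Phred scores."""
    return [int(token) for token in quality.split()]
```

With offset 33, "!" decodes to 0 and "I" to 40; with offset 64, "@" decodes to 0 and "h" to 40. A string that decodes to negative values under offset 64 was probably written with offset 33.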
Paired-end FASTQ
Paired reads are generally a forward read paired with a reverse read, although there are some instances where this is not the case.
Paired-end data submitted in FASTQ format should be submitted in one of two formats:
As separate files for forward and reverse reads, in which the reads are in the same order.
As interleaved, or "8-line", FASTQ, in which forward and reverse reads alternate in the file and are in order (i.e., read "1F", followed by read "1R", then read "2F", then "2R").
KRA supports the following forward/reverse read indicators: '/1' and '/2' at the end of the read name, or the newer Illumina style '1:Y:18:ATCACG' and '2:Y:18:ATCACG'.
Concatenated FASTQ (in which all forward reads are followed by all reverse reads) is not supported.
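The difference between interleaved and concatenated layouts can be checked from the deflines alone. This is a sketch under assumptions: the regular expressions are an illustrative reading of the two indicator styles named above (the Illumina pattern assumes the fields read number, Y or N filter flag, control number, and index), and the function names are hypothetical.

```python
import re

# '/1' or '/2' at the end of the read name.
SLASH_STYLE = re.compile(r"^(?P<name>\S+)/(?P<read>[12])(?:\s.*)?$")
# Newer Illumina style, e.g. 'name 1:Y:18:ATCACG'.
ILLUMINA_STYLE = re.compile(r"^(?P<name>\S+)\s+(?P<read>[12]):[YN]:\d+:\S*$")


def split_read_indicator(defline):
    """Return (read name, read number) for a FASTQ defline."""
    text = defline[1:] if defline.startswith("@") else defline
    for pattern in (SLASH_STYLE, ILLUMINA_STYLE):
        match = pattern.match(text)
        if match:
            return match.group("name"), int(match.group("read"))
    raise ValueError(f"no forward/reverse indicator in {defline!r}")


def is_interleaved(deflines):
    """True if deflines alternate read 1 and read 2 of the same name."""
    parsed = [split_read_indicator(d) for d in deflines]
    if len(parsed) % 2 != 0:
        return False
    for (name1, read1), (name2, read2) in zip(parsed[0::2], parsed[1::2]):
        if name1 != name2 or (read1, read2) != (1, 2):
            return False
    return True
```

A concatenated file, with all forward reads first, fails this check at the first pair because the second defline carries a different read name.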
Single cell FASTQ
For single-cell gene expression data, submit raw data to RUN and processed data to ANALYSIS in the KRA. Submit the raw data as FASTQ RUN files, and include the barcode sequences.
Submit demultiplexed (divided) sample and data files when there are only dozens of cells (samples). When there are more cells and demultiplexing the data would affect reproducibility, submit multiplexed (mixed) sample and data files (FASTQ files of both the cell multiplexing oligo sequencing data and the gene expression sequencing data).
When multiple FASTQ files for cell-level libraries were generated from a single sample by methods such as SMART-seq2, combine the files for Read1 and Read2 separately before submission.
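One way to combine per-cell files is plain byte concatenation of the gzip files, since the gzip format allows several compressed members in one file. The sketch below assumes gzip-compressed inputs; the function name and the file names in the usage note are hypothetical.

```python
import shutil


def concatenate_gzip(parts, destination):
    """Append gzip files byte-for-byte into one multi-member gzip file.

    Pass the Read1 parts and the Read2 parts in the same cell order,
    so that reads stay in the same order in both combined files.
    """
    with open(destination, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
```

For example, concatenate_gzip(["cellA_R1.fastq.gz", "cellB_R1.fastq.gz"], "sample_R1.fastq.gz") followed by the same call for the R2 files, with the cells listed in the same order.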
Example:
We recommend the following file name forms:
Kdata_S1_L001_I1_001.fastq.gz;Kdata_S1_L001_I2_001.fastq.gz;Kdata_S1_L001_R1_001.fastq.gz;Kdata_S1_L001_R2_001.fastq.gz
For multiplexed (mixed) samples, you need to add the cell multiplexing oligo data to a new EXPERIMENT and RUN, in addition to the raw data:
Gene Expression: Kdata_GEX_S1_L001_I1_001.fastq.gz;Kdata_S1_L001_I2_001.fastq.gz;Kdata_S1_L001_R1_001.fastq.gz;Kdata_S1_L001_R2_001.fastq.gz
CMO(Cell Multiplexing Oligo)/HTO(Hashtag Oligo) Tags: Kdata_Multiplexing_Capture_S1_L001_R1_001.fastq.gz;Kdata_Multiplexing_Capture_S1_L001_R2_001.fastq.gz
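The recommended names follow the pattern <sample>_S<n>_L<lane>_<read>_001.fastq.gz, so a file list can be grouped by sample and lane to confirm that the expected index (I1, I2) and read (R1, R2) files are all present. This is a sketch; the regular expression is inferred from the examples above and the function name is hypothetical.

```python
import re
from collections import defaultdict

FILE_NAME = re.compile(
    r"^(?P<sample>.+)_S(?P<index>\d+)_L(?P<lane>\d{3})"
    r"_(?P<read>[RI][12])_001\.fastq\.gz$"
)


def group_by_sample(filenames):
    """Map (sample, lane) to a {read type: file name} dictionary."""
    groups = defaultdict(dict)
    for name in filenames:
        match = FILE_NAME.match(name)
        if match is None:
            raise ValueError(f"unexpected file name: {name}")
        key = (match.group("sample"), match.group("lane"))
        groups[key][match.group("read")] = name
    return dict(groups)
```

Splitting one of the semicolon-separated lists above on ";" and passing the result to group_by_sample gives one entry per sample and lane, whose keys show which of I1, I2, R1, and R2 were supplied.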
Regarding 10x Genomics data files, please refer to "What format of 10x Genomics data should I submit to NCBI GEO/SRA?"
For 10x BAM files without barcode sequences, submit FASTQ instead. Please see "Generating FASTQs with cellranger mkfastq".
Library information
In both demultiplexed and multiplexed submissions, describe the methods and the name and version of the kit (e.g., SMART-seq2, 10x, Drop-seq) used for single-cell library construction in the Library Construction Protocol of the KRA Experiment. For 10x technology, describe the version of the 10x chemistry (e.g., v1, v2). Select “GENOMIC SINGLE CELL” or “TRANSCRIPTOMIC SINGLE CELL” in Library Source. Processed data for single-cell studies should be cell-level data.
BAM files
BAM (Binary Alignment/Map) is a binary format derived from the SAM (Sequence Alignment/Map) format, used to store aligned or unaligned reads efficiently.
An unaligned BAM file, typically generated by Pacific Biosciences (PacBio) instruments, contains raw read data without reference alignment. It includes read identifiers, quality scores, and metadata but lacks any positional alignment to a reference genome.
The KRA accepts unaligned BAM generated from PacBio/IonTorrent as a RUN file type.
Users who would like to submit aligned BAM should submit it along with the RUN file.
When submitting aligned BAM files to the KRA, you must also specify an assembly (the reference genome that your reads were aligned against).
You can identify your reference assembly by its name or accession from the NCBI genome dataset, UCSC and Ensembl.
If the assembly is not available from a public repository, you will need to submit your own assembly in FASTA format (reference_fasta) along with your aligned BAM file.
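A quick way to see whether a BAM file carries a reference dictionary is to read the number of reference sequences (n_ref) from its header. The sketch below relies on the BAM header layout in the SAM/BAM specification (magic bytes, header text length, header text, n_ref) and on BAM being gzip-compatible; the function name is hypothetical, and in practice a library such as pysam or the samtools command would normally be used instead.

```python
import gzip
import struct


def bam_reference_count(path):
    """Return n_ref, the number of reference sequences in a BAM header."""
    with gzip.open(path, "rb") as handle:
        if handle.read(4) != b"BAM\x01":
            raise ValueError("not a BAM file")
        (l_text,) = struct.unpack("<I", handle.read(4))
        handle.read(l_text)  # skip the plain-text SAM header
        (n_ref,) = struct.unpack("<i", handle.read(4))
    return n_ref
```

A result of 0 is typical of unaligned BAM. A positive result means the header lists reference sequences, which is a hint (not proof) that the reads are aligned and that an assembly must be specified.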
SFF files
SFF (Standard Flowgram Format) is the preferred input format for 454 Life Sciences (now part of Roche) data. SFF files store raw signal intensity data (flowgrams) from pyrosequencing reactions, as well as base calls and quality scores. The format is designed to handle the unique read data generated by 454 sequencing.
Submitters of SFF data should ensure that the data are demultiplexed (if barcoded). Barcoding is particularly common in pyrotag / 16S rRNA amplicon sequencing.
HDF5 and FAST5 files
HDF5 is a data model, library, and file format for storing and managing data. The KRA accepts bas.h5 and bax.h5 file submissions for PacBio-based submissions and .fast5 files for submissions related to MinION Oxford Nanopore.
PACBIO_SMRT
Submission of data from the RS II instrument requires one (1) bas.h5 file and three (3) bax.h5 files. Do not link more than one PacBio RS II SMRT Cell to a KRA run, and please do not change the bax.h5 file names from those indicated in the bas.h5 file.
Depending on the platform used for your PacBio sequencing project, the following data files with respective extensions are produced and required for KRA submission.
PacBio RS
xxxx.metadata.xml (optional but desirable)
xxxx.bas.h5
PacBio RS II
xxxx.metadata.xml (optional but desirable)
xxxx.bas.h5 (optional but desirable)
xxxx.1.bax.h5
xxxx.2.bax.h5
xxxx.3.bax.h5
Please be sure to list the files for each SMRT Cell in a separate Run or on a separate row of your kra_metadata sheet.
PacBio documentation on the bax.h5 / bas.h5 format: bas.h5ReferenceGuide.pdf.
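The RS II file set for one SMRT Cell can be checked before filling in the metadata sheet. This is a sketch with a hypothetical function name. It requires the three bax.h5 files and a shared prefix, and treats bas.h5 and metadata.xml as optional, following the file list above.

```python
def check_rs2_file_set(filenames):
    """Return a list of problems found in one SMRT Cell's RS II file set."""
    problems = []
    bax = [name for name in filenames if name.endswith(".bax.h5")]
    bas = [name for name in filenames if name.endswith(".bas.h5")]
    if len(bax) != 3:
        problems.append(f"expected 3 bax.h5 files, found {len(bax)}")
    if len(bas) > 1:
        problems.append(f"expected at most 1 bas.h5 file, found {len(bas)}")
    # 'xxxx.1.bax.h5' -> 'xxxx'; 'xxxx.bas.h5' -> 'xxxx'
    prefixes = {name.rsplit(".", 3)[0] for name in bax}
    prefixes.update(name[: -len(".bas.h5")] for name in bas)
    if len(prefixes) > 1:
        problems.append("files do not share one prefix: " + ", ".join(sorted(prefixes)))
    return problems
```

An empty list means the set looks complete for a single SMRT Cell. More than one prefix suggests that files from several SMRT Cells were mixed and should be split into separate Runs.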
MinION Oxford Nanopore
FAST5 is a legacy file type that is used to write out nanopore sequencing data, and can still be selected as an output type in MinKNOW. FAST5 is a type of HDF5 file, which is designed to contain all information needed for analysing nanopore sequencing data and tracking it back to its source. Read .fast5 files contain raw sequencing data for each read, with a default of 4000 reads per file.
Learn more about this platform at the Oxford Nanopore Technologies website.
The entire set of .fast5 files must be compressed into a tar.gz file for submission. Ensure that only the .fast5 files generated after base calling are included.
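The packaging step can be sketched with Python's standard tarfile module. The function name and directory layout are hypothetical. The sketch collects every .fast5 file under a directory, so point it at the directory that holds only the base-called output.

```python
import tarfile
from pathlib import Path


def archive_fast5(source_dir, archive_path):
    """Write all .fast5 files under source_dir into a tar.gz archive.

    Returns the archive member names, relative to source_dir.
    """
    files = sorted(Path(source_dir).rglob("*.fast5"))
    if not files:
        raise ValueError(f"no .fast5 files found under {source_dir}")
    names = []
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in files:
            arcname = path.relative_to(source_dir).as_posix()
            tar.add(str(path), arcname=arcname)
            names.append(arcname)
    return names
```

Files with other extensions in the same directory are left out of the archive, and the returned list can be compared against the expected read files before upload.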