2.2 File Format Guide

Introduction

This page reviews the submission file formats currently supported by the KRA, and gives guidance to submitters about current and future file formats and policies regarding KRA submissions.

FASTQ files

A FASTQ record consists of a defline that contains a read identifier and possibly other information, the nucleotide base calls, a second defline, and per-base quality scores, all in text form. There are many variations.

The following terms and formats are defined in general:

  • Identifier and other information: text string terminated by white space.

  • Bases: the FASTQ sequence should contain standard base calls (ACTGactg) or unknown bases (Nn) and can vary in length.

  • Qualities options:

    Option                              Format
    Decimal-encoding, space-delimited   [0-9]+ | <quality>\s[0-9]+
    Phred-33 ASCII                      [\!\"\#\$\%\&\'\(\)\*\+,\-\.\/0-9:;<=>\?\@A-I]+
    Phred-64 ASCII                      [\@A-Z\[\\\]\^_`a-h]+

    Quality string length should be equal to sequence length.

    In a limited set of cases, log odds or non-ASCII numerical quality values will succeed during a KRA submission.
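The quality options listed above can be tested against a quality string roughly as follows. This is a sketch, not a KRA tool: the function name `candidate_encodings` is hypothetical, the character classes are written as ASCII ranges equivalent to the listed patterns (`!` to `I` for Phred-33, `@` to `h` for Phred-64), and the recursive decimal pattern is flattened into a repetition. Because the patterns overlap, a string may match more than one option.

```python
import re

# Patterns equivalent to the quality options listed above.
DECIMAL = re.compile(r"[0-9]+(\s[0-9]+)*")  # space-delimited decimal scores
PHRED33 = re.compile(r"[!-I]+")             # ASCII 33..73
PHRED64 = re.compile(r"[@-h]+")             # ASCII 64..104


def candidate_encodings(quality: str) -> list[str]:
    """Return the names of every listed option the quality string matches."""
    names = []
    if DECIMAL.fullmatch(quality):
        names.append("decimal")
    if PHRED33.fullmatch(quality):
        names.append("phred33")
    if PHRED64.fullmatch(quality):
        names.append("phred64")
    return names
```

A string made only of characters in the shared range (`@` to `I`) matches both ASCII options, so a single read is often not enough to decide the offset.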

Files from various platforms employing this format are acceptable:

@<identifier and expected information>
<sequence>
+<identifier and other information OR empty string>
<quality>

Each instance of Identifier, Bases, and Qualities is newline-separated. Extra information added beyond the <identifier and expected information> examples is likely to be discarded or ignored.
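A minimal sketch of reading the four-line layout above and checking the rules stated in this section. It assumes ASCII-encoded qualities (so that quality length can be compared with sequence length) and a single line each for sequence and quality. The name `read_fastq` is hypothetical and does not refer to any KRA software.

```python
import re
from typing import Iterator, TextIO

# Standard base calls plus unknown bases, as described in the Bases item.
BASES = re.compile(r"[ACGTNacgtn]*")


def read_fastq(handle: TextIO) -> Iterator[tuple[str, str, str]]:
    """Yield (identifier, sequence, quality) from 4-line FASTQ text."""
    while True:
        defline = handle.readline()
        if defline == "":        # end of file
            return
        if not defline.strip():  # tolerate blank lines between records
            continue
        sequence = handle.readline().rstrip("\r\n")
        second_defline = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")

        fields = defline[1:].split()
        if not defline.startswith("@") or not fields:
            raise ValueError(f"bad defline: {defline!r}")
        if not second_defline.startswith("+"):
            raise ValueError(f"bad second defline for {fields[0]}")
        if not BASES.fullmatch(sequence):
            raise ValueError(f"unexpected base call in {fields[0]}")
        if len(quality) != len(sequence):
            raise ValueError(f"quality/sequence length mismatch in {fields[0]}")

        # The identifier is the text up to the first white space.
        yield fields[0], sequence, quality
```

The identifier is cut at the first white space, which mirrors the definition given earlier; anything after it is ignored by this sketch.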

As indicated above, the Qualities string can be either space-separated numeric Phred scores or an ASCII string in which each character's ASCII value equals the Phred score plus an offset constant that places the characters in the printable range. There are two predominant offsets: 33 (0 = !) and 64 (0 = @).
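The offset arithmetic can be illustrated with a short sketch. The helper names `decode_quality` and `encode_quality` are hypothetical; the only facts used are the two offsets given above.

```python
def decode_quality(quality: str, offset: int = 33) -> list[int]:
    """Convert an ASCII quality string to Phred scores (ASCII value - offset)."""
    scores = [ord(ch) - offset for ch in quality]
    if any(score < 0 for score in scores):
        raise ValueError("character below the chosen offset; wrong offset?")
    return scores


def encode_quality(scores: list[int], offset: int = 33) -> str:
    """Convert Phred scores to an ASCII quality string (score + offset)."""
    return "".join(chr(score + offset) for score in scores)
```

For example, `decode_quality("!I")` gives `[0, 40]` with offset 33, and `decode_quality("@h", offset=64)` gives the same scores with offset 64.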

Paired-end FASTQ

Paired reads are usually a forward read paired with a reverse read, although there are some instances where this is not the case.

Paired-end data submitted in FASTQ format should be submitted in one of two formats:

  1. As separate files for forward and reverse reads, in which the reads are in the same order.

  2. As interleaved, or "8-line", FASTQ, in which forward and reverse reads alternate in the file and are in order (i.e., read "1F", followed by read "1R", then read "2F", then "2R").

KRA supports the following forward/reverse read indicators: '/1' and '/2' at the end of the read name or newer Illumina style '1:Y:18:ATCACG' and '2:Y:18:ATCACG'.
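As a sketch of how the two indicator styles above could be recognized, and how the ordering of an interleaved file could be checked, consider the following. The function names are hypothetical. The pattern for the newer Illumina style assumes the comment has the form `<read>:<Y or N>:<number>:<index>`, inferred from the example above, and is not a statement of how KRA itself parses deflines.

```python
import re

SLASH_STYLE = re.compile(r"(?P<name>\S+)/(?P<mate>[12])")
ILLUMINA_STYLE = re.compile(r"(?P<mate>[12]):[YN]:\d+:\S*")  # assumed layout


def split_pair_tag(defline: str) -> tuple[str, int]:
    """Return (read name without indicator, mate number 1 or 2)."""
    fields = defline.removeprefix("@").split()
    if not fields:
        raise ValueError("empty defline")
    slash = SLASH_STYLE.fullmatch(fields[0])
    if slash:
        return slash["name"], int(slash["mate"])
    if len(fields) > 1:
        illumina = ILLUMINA_STYLE.fullmatch(fields[1])
        if illumina:
            return fields[0], int(illumina["mate"])
    raise ValueError(f"no forward/reverse indicator in {defline!r}")


def check_interleaved(deflines: list[str]) -> None:
    """Raise ValueError unless deflines alternate 1F, 1R, 2F, 2R, ..."""
    if len(deflines) % 2:
        raise ValueError("odd number of reads")
    for first, second in zip(deflines[0::2], deflines[1::2]):
        name1, mate1 = split_pair_tag(first)
        name2, mate2 = split_pair_tag(second)
        if name1 != name2 or mate1 != 1 or mate2 != 2:
            raise ValueError(f"reads out of order: {first!r}, {second!r}")
```

The same `split_pair_tag` helper could be used to compare two separate files read by read, since both files are expected to list reads in the same order.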

Single cell FASTQ

For single-cell gene expression data, submit raw data to RUN and processed data to ANALYSIS of KRA. Raw data should be submitted as FASTQ files in the RUN, and must include the barcode sequences.

If the data set contains only dozens of cells (samples), submit demultiplexed (divided) sample and data files. If it contains more cells and demultiplexing would affect reproducibility, submit multiplexed (mixed) sample and data files (FASTQ files of both the cell multiplexing oligo sequencing data and the gene expression sequencing data).

When multiple fastq files for cell-level libraries were generated from a single sample, by methods such as SMART-seq2, combine the files for Read1 and Read2 separately before submission.
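One way to combine per-cell files is sketched below, assuming the inputs are gzip-compressed FASTQ files. Gzip files can be concatenated byte for byte into a valid multi-member gzip file, so no recompression is needed. The function names are hypothetical, and the cell order is kept identical for Read1 and Read2 so that the combined files stay in the same read order.

```python
import shutil
from pathlib import Path


def concatenate(parts: list[Path], destination: Path) -> None:
    """Append the raw bytes of each part to destination, in order."""
    with open(destination, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)


def combine_reads(cells: list[tuple[Path, Path]],
                  out_r1: Path, out_r2: Path) -> None:
    """Combine (Read1, Read2) file pairs into one Read1 and one Read2 file."""
    concatenate([r1 for r1, _ in cells], out_r1)
    concatenate([r2 for _, r2 in cells], out_r2)
```

Passing the files as pairs, rather than as two independent lists, makes it harder to combine Read1 and Read2 in different cell orders by accident.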

Example:

We recommend the following file name forms:

Kdata_S1_L001_I1_001.fastq.gz;Kdata_S1_L001_I2_001.fastq.gz;Kdata_S1_L001_R1_001.fastq.gz;Kdata_S1_L001_R2_001.fastq.gz

For multiplexed (mixed) samples, you need to add the cell multiplexing oligo data to a new EXPERIMENT and RUN, in addition to the raw gene expression data:

  1. Gene Expression: Kdata_GEX_S1_L001_I1_001.fastq.gz;Kdata_S1_L001_I2_001.fastq.gz;Kdata_S1_L001_R1_001.fastq.gz;Kdata_S1_L001_R2_001.fastq.gz

  2. CMO (Cell Multiplexing Oligo) / HTO (Hashtag Oligo) Tags: Kdata_Multiplexing_Capture_S1_L001_R1_001.fastq.gz;Kdata_Multiplexing_Capture_S1_L001_R2_001.fastq.gz

For 10x Genomics data files, please refer to "What format of 10x Genomics data should I submit to NCBI GEO/SRA?".

For 10x BAM files without barcode sequences, submit FASTQ files instead. Please see "Generating FASTQs with cellranger mkfastq".

Library information

BAM files

BAM (Binary Alignment/Map) is a binary format derived from the SAM (Sequence Alignment/Map) format, used to store aligned or unaligned reads efficiently.

An unaligned BAM file, typically generated by Pacific Biosciences (PacBio) instruments, contains raw read data without reference alignment. It includes read identifiers, quality scores, and metadata but lacks any positional alignment to a reference genome.

SFF files

SFF (Standard Flowgram Format) is the preferred input format for 454 Life Sciences (now part of Roche) data. SFF files store raw signal intensity data (flowgrams) from pyrosequencing reactions, as well as base calls and quality scores. The format is designed to handle the unique read data generated by 454 sequencing.

HDF5 and FAST5 files

HDF5 is a data model, library, and file format for storing and managing data. The KRA accepts bas.h5 and bax.h5 file submissions for PacBio-based submission and .fast5 files for submissions related to MinION Oxford Nanopore.

PACBIO_SMRT

Submission of data from the RS II instrument requires one (1) bas.h5 file and three (3) bax.h5 files. Do not link more than one PacBio RS II to a KRA run, and please do not change the bax.h5 file names from those indicated in the bas.h5 file.

Depending on the platform used for your PacBio sequencing project, the following data files with respective extensions are produced and required for KRA submission.

PacBio RS Platform
Data Files Delivered

PacBio RS

  1. xxxx.metadata.xml (optional but desirable)

  2. xxxx.bas.h5

PacBio RS II

  1. xxxx.metadata.xml (optional but desirable)

  2. xxxx.bas.h5 (optional but desirable)

  3. xxxx.1.bax.h5

  4. xxxx.2.bax.h5

  5. xxxx.3.bax.h5

Please be sure to list the files for each SMRT Cell in a separate Run or on a separate row of your kra_metadata sheet.
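The RS II file list above can be turned into a simple pre-submission check for the files of one SMRT Cell. This is a sketch only: the function name `check_rs2_cell` is hypothetical, "xxxx" is taken to be a shared prefix, and the bas.h5 and metadata.xml files are treated as recommended rather than mandatory, following the "optional but desirable" notes in the list.

```python
import re

# Matches xxxx.1.bax.h5, xxxx.2.bax.h5 and xxxx.3.bax.h5.
BAX = re.compile(r"(?P<prefix>.+)\.(?P<part>[123])\.bax\.h5")


def check_rs2_cell(filenames: list[str]) -> list[str]:
    """Return a list of problems found in the files for one SMRT Cell."""
    parts: dict[str, set[str]] = {}
    for name in filenames:
        match = BAX.fullmatch(name)
        if match:
            parts.setdefault(match["prefix"], set()).add(match["part"])

    if len(parts) != 1:
        return [f"expected bax.h5 files from exactly one SMRT Cell, "
                f"found {len(parts)}"]

    prefix, found = next(iter(parts.items()))
    problems = [f"missing {prefix}.{part}.bax.h5"
                for part in sorted({"1", "2", "3"} - found)]
    for recommended in (f"{prefix}.bas.h5", f"{prefix}.metadata.xml"):
        if recommended not in filenames:
            problems.append(f"recommended file absent: {recommended}")
    return problems
```

An empty result means the set matches the list above; running the check once per Run or per metadata row also catches files from two SMRT Cells placed together.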

PacBio documentation on bax.h5 / bas.h5 format: bas.h5ReferenceGuide.pdf.

MinION Oxford Nanopore

FAST5 is a legacy file type that is used to write out nanopore sequencing data, and can still be selected as an output type in MinKNOW. FAST5 is a type of HDF5 file, which is designed to contain all information needed for analysing nanopore sequencing data and tracking it back to its source. Read .fast5 files contain raw sequencing data for each read, with a default of 4000 reads per file.

Learn more about this platform at the Oxford Nanopore Technologies website.

The entire set of .fast5 files must be compressed into a tar.gz file for submission. Ensure that only the .fast5 files generated after base calling are included.
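A minimal sketch of packing a directory of .fast5 files into a tar.gz archive with Python's standard `tarfile` module. The function name `pack_fast5` and the directory layout are assumptions; the sketch selects files by extension only and cannot tell whether a file was produced after base calling, so that still has to be verified separately.

```python
import tarfile
from pathlib import Path


def pack_fast5(source_dir: str, archive_path: str) -> int:
    """Write every .fast5 file under source_dir into a gzip-compressed tar.

    Returns the number of files added.
    """
    root = Path(source_dir)
    files = sorted(root.rglob("*.fast5"))
    if not files:
        raise ValueError(f"no .fast5 files found under {source_dir}")
    with tarfile.open(archive_path, "w:gz") as archive:
        for path in files:
            # Store paths relative to source_dir so the archive is portable.
            archive.add(path, arcname=path.relative_to(root).as_posix())
    return len(files)
```

Files with other extensions in the same directory are left out of the archive.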
