2413.01(g) The “Sequence Listing XML” Must Contain a Sequence Data Part [R-07.2022]

[Editor Note: This section is applicable to all applications filed on or after July 1, 2022, having disclosures of nucleotide and/or amino acid sequences as defined in 37 CFR 1.831(b). Formatting representations of XML (eXtensible Markup Language) elements in this section appear different than shown in Standard ST.26, which may be accessed at: www.wipo.int/export/sites/www/standards/en/pdf/03-26-01.pdf.]

37 CFR 1.833 Requirements for a “Sequence Listing XML” for nucleotide and/or amino acid sequences as part of a patent application filed on or after July 1, 2022.

  • *****

  • (b) The “Sequence Listing XML” presented in accordance with paragraph (a) of this section must further:
    • *****

    • (2) Comply with the requirements of WIPO Standard ST.26 to include:
      • *****

      • (v) A sequence data part that complies with the requirements of paragraphs 50–55, 57, 58, 60–69, 71–78, 80–87, 89–98, and 100, as applicable, of WIPO Standard ST.26 representing the nucleotide and/or amino acid sequences according to § 1.832.
  • *****

The sequence data part is the part of the “Sequence Listing XML” that associates all relevant biological sequence information for each individual nucleotide or amino acid sequence that meets the definition for inclusion in a “Sequence Listing XML.” WIPO Standard ST.26, paragraph 50, specifies that the sequence data part must be composed of one or more SequenceData elements, each element containing information about one sequence.

WIPO Standard ST.26, paragraph 51, specifies that each SequenceData element must have a mandatory attribute sequenceIDNumber, in which the sequence identifier (see MPEP § 2412.05(a)) for each sequence is contained.

WIPO Standard ST.26 specifies that the SequenceData element must contain a dependent element INSDSeq, consisting of further dependent elements as follows:

List of INSDSeq Dependent Elements
Element | Description | Sequences | Intentionally Skipped Sequences
INSDSeq_length | Length of the sequence | Mandatory | Mandatory with no value
INSDSeq_moltype | Molecule type | Mandatory | Mandatory with no value
INSDSeq_division | Indication that a sequence is related to a patent application | Mandatory with the value “PAT” | Mandatory with no value
INSDSeq_feature-table | List of annotations of the sequence | Mandatory | Must NOT be included
INSDSeq_sequence | Sequence | Mandatory | Mandatory with the value “000”

(reproduced from paragraph 52 of WIPO Standard ST.26)

WIPO Standard ST.26, paragraph 53, specifies that the element INSDSeq_length must disclose the number of nucleotides or amino acids of the sequence contained in the INSDSeq_sequence element.

WIPO Standard ST.26, paragraph 54, specifies that the element INSDSeq_moltype must disclose the type of molecule that is being represented. For nucleotide sequences, including nucleotide analogue sequences, the molecule type must be indicated as DNA or RNA. For amino acid sequences, the molecule type must be indicated as AA.

WIPO Standard ST.26, paragraph 55, specifies that for a nucleotide sequence that contains both DNA and RNA segments of one or more nucleotides, the molecule type must be indicated as DNA. The combined DNA/RNA molecule must be further described in the feature table, using the feature key “source” and the mandatory qualifier “organism” with the value “synthetic construct” and the mandatory qualifier “mol_type” with the value “other DNA”. Each DNA and RNA segment of the combined DNA/RNA molecule must be further described with the feature key “misc_feature” and the qualifier “note”, which indicates whether the segment is DNA or RNA.

WIPO Standard ST.26, paragraph 57, specifies that the element INSDSeq_sequence must disclose the sequence. Only the appropriate symbols set forth in Table 1: List of Nucleotides Symbols and Table 3: List of Amino Acids Symbols (reproduced in MPEP § 2412.03(a)) must be included in the sequence. The sequence must not include numbers, punctuation or whitespace characters.
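As a rough illustration of paragraphs 50 through 57, the sketch below assembles one SequenceData element using Python's standard xml.etree.ElementTree module. The helper name build_sequence_data and the sample sequence are hypothetical. The character check enforces only the rule against numbers, punctuation, and whitespace, not the full symbol tables, and the mandatory feature table is created but not populated.

```python
import re
import xml.etree.ElementTree as ET


def build_sequence_data(seq_id: int, moltype: str, residues: str) -> ET.Element:
    """Assemble a SequenceData element for one sequence (hypothetical helper)."""
    # Paragraph 54: the molecule type is DNA, RNA, or AA.
    if moltype not in {"DNA", "RNA", "AA"}:
        raise ValueError("INSDSeq_moltype must be DNA, RNA, or AA")
    # Paragraph 57: no numbers, punctuation, or whitespace in the sequence.
    # (The full symbol tables are not checked in this sketch.)
    if not re.fullmatch(r"[A-Za-z]+", residues):
        raise ValueError("sequence may contain only residue symbols")

    # Paragraph 51: the sequence identifier goes in sequenceIDNumber.
    seq_data = ET.Element("SequenceData", sequenceIDNumber=str(seq_id))
    insd = ET.SubElement(seq_data, "INSDSeq")
    # Paragraph 53: the length is the number of residues in INSDSeq_sequence.
    ET.SubElement(insd, "INSDSeq_length").text = str(len(residues))
    ET.SubElement(insd, "INSDSeq_moltype").text = moltype
    ET.SubElement(insd, "INSDSeq_division").text = "PAT"
    # Mandatory for a real sequence; left unpopulated in this sketch.
    ET.SubElement(insd, "INSDSeq_feature-table")
    ET.SubElement(insd, "INSDSeq_sequence").text = residues
    return seq_data


element = build_sequence_data(1, "DNA", "atgcatgc")
print(ET.tostring(element, encoding="unicode"))
```

Deriving INSDSeq_length from the residue string, rather than accepting it as a separate argument, keeps the two values from drifting apart.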

I. FEATURE TABLE

According to WIPO Standard ST.26, a “feature table” “contains information on the location and roles of various regions within a particular sequence. A feature table is required for every sequence, except for any intentionally skipped sequence, in which case it must not be included. The feature table is contained in the element INSDSeq_feature-table, which consists of one or more INSDFeature elements” (WIPO Standard ST.26, paragraph 60).

WIPO Standard ST.26 specifies that each INSDFeature element in the feature table describes one feature and consists of dependent elements as follows:

List of INSDFeature Dependent Elements
Element | Description | Mandatory/Optional
INSDFeature_key | A word or abbreviation indicating a feature | Mandatory
INSDFeature_location | Region of the sequence which corresponds to the feature | Mandatory
INSDFeature_quals | Qualifier containing auxiliary information about a feature | Mandatory where the feature key requires one or more qualifiers, e.g., source; otherwise, Optional

(Reproduced from paragraph 61 of WIPO Standard ST.26).

II. FEATURE KEYS

WIPO Standard ST.26, paragraph 62, specifies the exclusive listing of feature keys in Annex I that must be used when preparing and submitting a “Sequence Listing XML,” along with an exclusive listing of associated qualifiers and an indication as to whether those qualifiers are mandatory or optional. Section 5 of Annex I of WIPO Standard ST.26 provides the exclusive listing of feature keys for nucleotide sequences and Section 7 of Annex I of WIPO Standard ST.26 provides the exclusive listing of feature keys for amino acid sequences.

III. MANDATORY FEATURE KEYS

WIPO Standard ST.26, paragraph 63, specifies that the “source” feature key is mandatory for all nucleotide sequences and for all amino acid sequences, except for any intentionally skipped sequence. Each sequence must have a single “source” feature key spanning the entire sequence. Where a sequence originates from multiple sources, those sources may be further described in the feature table, using the feature key “misc_feature” and the qualifier “note” for nucleotide sequences, and the feature key “REGION” and the qualifier “note” for amino acid sequences.

IV. FEATURE LOCATION

WIPO Standard ST.26, paragraph 64, specifies that the mandatory element INSDFeature_location must contain at least one location descriptor, which defines a site or a region corresponding to a feature of the sequence in the INSDSeq_sequence element. Amino acid sequences must contain one and only one location descriptor in the mandatory INSDFeature_location element. Nucleotide sequences may have more than one location descriptor in the mandatory INSDFeature_location element when used in conjunction with one or more location operator(s) (more information about location descriptors is discussed below).

WIPO Standard ST.26, paragraph 65, specifies that the location descriptor can be a single residue number, a region delimiting a contiguous span of residue numbers, or a site or region that extends beyond the specified residue or span of residues. The location descriptor must not include numbering for residues beyond the range of the sequence in the INSDSeq_sequence element. For nucleotide sequences only, a location descriptor can be a site between two adjacent residue numbers. Multiple location descriptors must be used in conjunction with a location operator when a feature corresponds to discontinuous sites or regions of a nucleotide sequence (more information about location descriptors is discussed below).

WIPO Standard ST.26, paragraph 66, specifies that the syntax for each type of location descriptor is indicated in Table 10 below, where x and y are residue numbers, indicated as positive integers, not greater than the length of the sequence in the INSDSeq_sequence element, and x is less than y.

(a) Location descriptors for nucleotide and amino acid sequences:

Location Descriptors for Nucleotide and Amino Acid Sequences
Location descriptor type | Syntax | Description
Single residue number | x | Points to a single residue in a sequence.
Residue numbers delimiting a sequence span | x..y | Points to a continuous range of residues bounded by and including the starting and ending residues.
Residues before the first or beyond the last specified residue number | <x, >x, <x..y, x..>y, <x..>y | Points to a region including a specified residue or span of residues and extending beyond a specified residue. The ‘<’ and ‘>’ symbols may be used with a single residue or the starting and ending residue numbers of a span of residues to indicate that a feature extends beyond the specified residue number.

(Reproduced from paragraph 66 of WIPO Standard ST.26.)

(b) Location descriptors for nucleotide sequences only:

Location Descriptors for Nucleotide Sequence Only
Location descriptor type | Syntax | Description
A site between two adjoining nucleotides | x^y | Points to a site between two adjoining nucleotides, e.g., an endonucleolytic cleavage site. The position numbers for the adjacent nucleotides are separated by a caret (^). The permitted formats for this descriptor are x^x+1 (for example, 55^56), or, for circular nucleotides, x^1, where “x” is the full length of the molecule, e.g., 1000^1 for a circular molecule of length 1000.

(reproduced from paragraph 66 of WIPO Standard ST.26)

(c) Location descriptors for amino acid sequences only:

Location Descriptors for Amino Acid Sequences Only
Location descriptor type | Syntax | Description
Residue numbers joined by an intrachain cross-link | x..y | Points to amino acids joined by an intrachain linkage when used with a feature that indicates an intrachain cross-link, such as “CROSSLNK” or “DISULFID”.

(Reproduced from paragraph 66 of WIPO Standard ST.26).
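The descriptor syntaxes in the three tables above lend themselves to a mechanical check. The following is a minimal sketch, not an official validator: the function name is_valid_location and its circular flag are hypothetical, and it operates on the raw descriptor text before any XML entity replacement.

```python
import re

# x, <x, >x
_SINGLE = re.compile(r"^[<>]?(\d+)$")
# x..y, <x..y, x..>y, <x..>y
_SPAN = re.compile(r"^<?(\d+)\.\.>?(\d+)$")
# x^y (nucleotide sequences only)
_SITE = re.compile(r"^(\d+)\^(\d+)$")


def is_valid_location(loc: str, length: int, nucleotide: bool,
                      circular: bool = False) -> bool:
    """Check one location descriptor against the sequence length."""
    m = _SINGLE.match(loc)
    if m:
        return 1 <= int(m.group(1)) <= length
    m = _SPAN.match(loc)
    if m:
        x, y = int(m.group(1)), int(m.group(2))
        # x and y are positive, within the sequence, and x is less than y.
        return 1 <= x < y <= length
    m = _SITE.match(loc)
    if m and nucleotide:
        x, y = int(m.group(1)), int(m.group(2))
        if y == x + 1:
            return 1 <= x and y <= length
        # For circular molecules, the site between the last and first residue.
        return circular and x == length and y == 1
    return False


print(is_valid_location("<1..>100", 100, nucleotide=True))
print(is_valid_location("55^56", 100, nucleotide=False))
```

The sketch treats the amino acid cross-link form the same as an ordinary span, because the two share the x..y syntax and differ only in the feature key they accompany.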

WIPO Standard ST.26 specifies that the INSDFeature_location element of nucleotide sequences may contain one or more location operators. A location operator is a prefix to either one location descriptor or a combination of location descriptors corresponding to a single but discontinuous feature, and specifies where the location corresponding to the feature on the indicated sequence is found or how the feature is constructed. A list of location operators is provided in the table below with their descriptions. Location operators can be used for nucleotides only.

Location Operators
Location syntax | Location description
join(location,location,…,location) | The indicated locations are joined (placed end-to-end) to form one contiguous sequence.
order(location,location,…,location) | The elements are found in the specified order but nothing is implied about whether joining those elements is reasonable.
complement(location) | Indicates that the feature is located on the strand complementary to the sequence span specified by the location descriptor, when read in the 5’ to 3’ direction or in the direction that mimics 5’ to 3’ direction.

(Reproduced from paragraph 67 of WIPO Standard ST.26)

WIPO Standard ST.26, paragraph 68, specifies that the join and order location operators require that at least two comma-separated location descriptors be provided. Location descriptors involving sites between two adjacent residues, i.e. x^y, must not be used within a join or order location. Use of the join location operator implies that the residues described by the location descriptors are physically brought into contact by biological processes (for example, the exons that contribute to a coding region feature).

WIPO Standard ST.26, paragraph 69, specifies that the location operator “complement” can be used in combination with either “join” or “order” within the same location. Combinations of “join” and “order” within the same location must not be used.
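The operator rules of paragraphs 68 and 69 can be sketched as a small checker. The function names split_args and operator_problems are hypothetical, and the sketch assumes balanced parentheses and inspects only the first join and the first order in a location string.

```python
def split_args(inner: str) -> list:
    """Split a comma-separated argument list at parenthesis depth zero."""
    args, depth, start = [], 0, 0
    for i, ch in enumerate(inner):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            args.append(inner[start:i])
            start = i + 1
    args.append(inner[start:])
    return args


def operator_problems(location: str) -> list:
    """Return a list of rule violations found in a location string."""
    loc = location.replace(" ", "")
    problems = []
    # Paragraph 69: join and order must not be combined in one location.
    if "join(" in loc and "order(" in loc:
        problems.append("join and order combined in the same location")
    for op in ("join", "order"):
        idx = loc.find(op + "(")
        if idx == -1:
            continue
        # Find the text between this operator's parentheses.
        start = idx + len(op) + 1
        depth, end = 1, start
        while end < len(loc) and depth:
            if loc[end] == "(":
                depth += 1
            elif loc[end] == ")":
                depth -= 1
            end += 1
        args = split_args(loc[start:end - 1])
        # Paragraph 68: at least two comma-separated location descriptors.
        if len(args) < 2:
            problems.append(op + " needs at least two location descriptors")
        # Paragraph 68: x^y sites must not appear inside join or order.
        if any("^" in arg for arg in args):
            problems.append(op + " must not contain an x^y descriptor")
    return problems


print(operator_problems("complement(join(12..78,134..202))"))
print(operator_problems("join(12..78,100^101)"))
```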

WIPO Standard ST.26, paragraph 71, specifies that in an XML instance of a “Sequence Listing XML”, the characters “<” and “>” in a location descriptor must be replaced by the appropriate predefined entities (see MPEP § 2413.01(a) regarding the predefined entities).
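For illustration, Python's standard library performs this entity replacement in two common ways: explicitly through xml.sax.saxutils.escape, or implicitly when xml.etree.ElementTree serializes element text. The sample location value is hypothetical.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

raw_location = "<1..>200"  # hypothetical feature location

# Explicit replacement of "<" and ">" with the predefined entities.
escaped = escape(raw_location)
print(escaped)

# ElementTree applies the same replacement when it serializes text,
# so the raw value is assigned and must not be escaped a second time.
location = ET.Element("INSDFeature_location")
location.text = raw_location
serialized = ET.tostring(location, encoding="unicode")
print(serialized)
```

When an XML library does the serialization, pre-escaping the value by hand would produce a doubly escaped string such as &amp;lt;, so only one of the two approaches should be used for a given value.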

V. FEATURE QUALIFIERS

WIPO Standard ST.26, paragraph 72, specifies that qualifiers are used to supply information about features in addition to that conveyed by the feature key and feature location. There are three types of value formats to accommodate different types of information conveyed by qualifiers, namely:

  • (a) free text (see MPEP § 2413.01(g), subsection X, for more detail about “free text”);
  • (b) controlled vocabulary or enumerated values (e.g., a number or date); and
  • (c) sequences.

WIPO Standard ST.26, paragraph 73, specifies the exclusive listing of qualifiers and their specified value formats, if any, for each nucleotide sequence feature key in Section 6 of Annex I and in Section 8 of Annex I for each amino acid sequence feature key.

WIPO Standard ST.26, paragraph 74, specifies that any sequence encompassed by 37 CFR 1.831(b) (see MPEP § 2412.03) which is provided as a qualifier value must be separately included in the “Sequence Listing XML” and assigned its own sequence identifier as described in MPEP § 2412.05(a).

VI. MANDATORY FEATURE QUALIFIERS

WIPO Standard ST.26, paragraph 75, specifies that one mandatory feature key, i.e., “source” for nucleotide sequences and amino acid sequences, requires two mandatory qualifiers, “organism” and “mol_type”. Some optional feature keys also require mandatory qualifiers.

VII. QUALIFIER ELEMENTS

WIPO Standard ST.26 specifies that the element INSDFeature_quals contains one or more INSDQualifier elements. Each INSDQualifier element represents a single qualifier and consists of three dependent elements as shown below:

List of INSDQualifier Dependent Elements
Element | Description | Mandatory/Optional
INSDQualifier_name | Name of the qualifier (see Annex I, Sections 6 and 8) | Mandatory
INSDQualifier_value | Value of the qualifier, if any, in the specified format (see Annex I, Sections 6 and 8) and composed in the characters as set forth in paragraph 40(b). | Mandatory, when specified (see paragraph 87 and Annex I, Sections 6 and 8)
NonEnglishQualifier_value | Value of the qualifier, if any, in the specified format (see Annex I, Sections 6 and 8) and composed in the characters as set forth in paragraph 40(a). | Mandatory, when specified (see paragraph 87 and Annex I, Sections 6 and 8)

(Reproduced from paragraph 76 of WIPO Standard ST.26.)
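Putting the feature and qualifier elements together, the sketch below builds the mandatory “source” feature with its two mandatory qualifiers. The helper name build_source_feature and the sample values are hypothetical, and the sketch assumes the qualifier values are supplied in English, so only INSDQualifier_value is written.

```python
import xml.etree.ElementTree as ET


def build_source_feature(length: int, organism: str, mol_type: str) -> ET.Element:
    """Build a "source" INSDFeature spanning the entire sequence."""
    feature = ET.Element("INSDFeature")
    ET.SubElement(feature, "INSDFeature_key").text = "source"
    # The single "source" feature spans residues 1 through the last residue.
    ET.SubElement(feature, "INSDFeature_location").text = f"1..{length}"
    quals = ET.SubElement(feature, "INSDFeature_quals")
    # Each INSDQualifier element represents exactly one qualifier.
    for name, value in (("organism", organism), ("mol_type", mol_type)):
        qualifier = ET.SubElement(quals, "INSDQualifier")
        ET.SubElement(qualifier, "INSDQualifier_name").text = name
        ET.SubElement(qualifier, "INSDQualifier_value").text = value
    return feature


source = build_source_feature(8, "synthetic construct", "other DNA")
print(ET.tostring(source, encoding="unicode"))
```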

WIPO Standard ST.26, paragraph 77, specifies that the organism qualifier, i.e., “organism” for nucleotide sequences (see Table 5: List of Qualifier Values for Nucleotide Sequences with Language-Dependent Free-Text Values reproduced in MPEP § 2413.01(h)) and “organism” for amino acid sequences (see Table 6: List of Qualifiers for Amino Acid Sequences with Language-Dependent Free Text Values reproduced in MPEP § 2413.01(h)), must disclose the source, i.e., a single organism or origin, of the sequence. Organism designations should be selected from a taxonomy database.

WIPO Standard ST.26, paragraph 78, specifies that if the sequence is naturally occurring and the source organism has a Latin genus and species designation, that designation must be used as the qualifier value. The preferred English common name may be specified using the qualifier “note” for nucleotide sequences and amino acid sequences, but must not be used in the organism qualifier value.

VIII. SPECIFYING VALUES FOR INSDQUALIFIER_NAME AND INSDQUALIFIER_VALUE ELEMENTS

WIPO Standard ST.26, paragraph 80, specifies that if the sequence is naturally occurring and the source organism has a known Latin genus, but the species is unspecified or unidentified, then the organism qualifier value must indicate the Latin genus followed by “sp”.

WIPO Standard ST.26, paragraph 81, specifies that if the sequence is naturally occurring, but the Latin organism genus and species designation is unknown, then the organism qualifier value must be indicated as “unidentified”. Any known taxonomic information should be indicated in the qualifier “note” for nucleotide sequences and the qualifier “note” for amino acid sequences.

WIPO Standard ST.26, paragraph 82, specifies that if the sequence is naturally occurring and the source organism does not have a Latin genus and species designation, such as a virus, then another acceptable scientific name (e.g., “Canine adenovirus type 2”) must be used as the organism qualifier value.

WIPO Standard ST.26, paragraph 83, specifies that if the sequence is not naturally occurring, the organism qualifier value must be indicated as “synthetic construct”. Further information with respect to the way the sequence was generated may be specified using the qualifier “note” for nucleotide sequences and the qualifier “note” for amino acid sequences.
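The organism rules of paragraphs 78 and 80 through 83 amount to a short decision chain, sketched below. The function name organism_value and its parameters are hypothetical, and the genus-only suffix is copied as quoted in this section; the exact punctuation should be confirmed against the standard.

```python
# Suffix as quoted in this section for a genus with no identified species.
GENUS_ONLY_SUFFIX = "sp"


def organism_value(naturally_occurring: bool, genus: str = None,
                   species: str = None,
                   other_scientific_name: str = None) -> str:
    """Pick the "organism" qualifier value for a sequence."""
    if not naturally_occurring:
        return "synthetic construct"            # paragraph 83
    if genus and species:
        return f"{genus} {species}"             # paragraph 78
    if genus:
        return f"{genus} {GENUS_ONLY_SUFFIX}"   # paragraph 80
    if other_scientific_name:
        return other_scientific_name            # paragraph 82
    return "unidentified"                       # paragraph 81


print(organism_value(True, genus="Homo", species="sapiens"))
print(organism_value(True, other_scientific_name="Canine adenovirus type 2"))
print(organism_value(False))
```

Common names and further taxonomic detail are deliberately not returned here, since the text above directs them to the “note” qualifier rather than the organism value.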

IX. SPECIFYING MOL_TYPE

WIPO Standard ST.26, paragraph 84, specifies that the “mol_type” qualifier for nucleotide sequences and the “mol_type” qualifier for amino acid sequences must disclose the type of molecule represented in the sequence. These qualifiers are distinct from the element INSDSeq_moltype discussed above, which must be indicated as DNA or RNA for nucleotide sequences (including nucleotide analogue sequences) and as AA for amino acid sequences:

  • (1) For a nucleotide sequence, the “mol_type” qualifier value must be one of the following: “genomic DNA”, “genomic RNA”, “mRNA”, “tRNA”, “rRNA”, “other RNA”, “other DNA”, “transcribed RNA”, “viral cRNA”, “unassigned DNA”, or “unassigned RNA”. If the sequence is not naturally occurring, i.e. the value of the “organism” qualifier is “synthetic construct”, the “mol_type” qualifier value must be either “other RNA” or “other DNA”;
  • (2) For an amino acid sequence, the “mol_type” qualifier value is “protein”. (Reproduced, in part, from paragraph 84 of WIPO Standard ST.26.)
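The relationship between INSDSeq_moltype, the “organism” qualifier, and the “mol_type” qualifier can be expressed as a simple check. This is a minimal sketch with a hypothetical function name, mol_type_is_valid; the permitted values are copied from the list above.

```python
NUCLEOTIDE_MOL_TYPES = {
    "genomic DNA", "genomic RNA", "mRNA", "tRNA", "rRNA", "other RNA",
    "other DNA", "transcribed RNA", "viral cRNA", "unassigned DNA",
    "unassigned RNA",
}


def mol_type_is_valid(insdseq_moltype: str, organism: str, mol_type: str) -> bool:
    """Check a "mol_type" qualifier value against the rules listed above."""
    if insdseq_moltype == "AA":
        # Amino acid sequences take the single value "protein".
        return mol_type == "protein"
    if insdseq_moltype not in {"DNA", "RNA"}:
        return False
    if organism == "synthetic construct":
        # Sequences that are not naturally occurring are restricted further.
        return mol_type in {"other RNA", "other DNA"}
    return mol_type in NUCLEOTIDE_MOL_TYPES


print(mol_type_is_valid("DNA", "Homo sapiens", "genomic DNA"))
print(mol_type_is_valid("DNA", "synthetic construct", "genomic DNA"))
```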

X. FREE TEXT

WIPO Standard ST.26, paragraph 85, specifies that “free text” is a type of value format for certain qualifiers presented in the form of a descriptive text phrase or other specified format (see MPEP § 2413.01(h) for the definition of “free text” and see Annex I of WIPO Standard ST.26 for controlled vocabulary).

WIPO Standard ST.26, paragraph 86, specifies that the use of free text must be limited to a few short terms indispensable for the understanding of a characteristic of the sequence. For each qualifier, the free text must not exceed 1000 characters.

WIPO Standard ST.26, paragraph 87, specifies that language-dependent free text (see MPEP § 2413.01(d) for definition of language) is the free text value of certain qualifiers that is language-dependent in that it may require translation for national, regional or international procedures. Qualifiers for nucleotide sequences with a language-dependent free text value format are identified in Table 5: List of Qualifier Values for Nucleotide Sequences with Language-Dependent Free-Text Values (reproduced in MPEP § 2413.01(h)). Qualifiers for amino acid sequences with a language-dependent free text value format are identified in Table 6: List of Qualifiers for Amino Acid Sequences with Language-Dependent Free Text Values (reproduced in MPEP § 2413.01(h)).

XI. CODING SEQUENCES

WIPO Standard ST.26, paragraph 89, specifies that the “CDS” feature key may be used to identify coding sequences, i.e., sequences of nucleotides which correspond to the sequence of amino acids in a protein and the stop codon. The location of the “CDS” feature in the mandatory element INSDFeature_location must include the stop codon.
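A quick sanity check on the stop codon requirement might look like the sketch below. It rests on several assumptions that are not part of the standard's text: the location is a simple forward-strand span x..y, the Standard Code Table applies (stop codons taa, tag, and tga), and the function name cds_ends_with_stop is hypothetical. Locations using complement or join, partial coding sequences, and other translation tables would need more handling.

```python
# Stop codons under the standard genetic code (an assumption of this sketch).
STANDARD_STOP_CODONS = {"taa", "tag", "tga"}


def cds_ends_with_stop(sequence: str, start: int, end: int) -> bool:
    """Check that the span start..end (1-based, inclusive) ends in a stop codon."""
    cds = sequence[start - 1:end].lower()
    return len(cds) >= 3 and cds[-3:] in STANDARD_STOP_CODONS


sequence = "ccatggcttaagg"  # hypothetical 13-residue sequence
print(cds_ends_with_stop(sequence, 3, 11))  # span atggcttaa includes the stop
print(cds_ends_with_stop(sequence, 3, 8))   # span atggct stops short of it
```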

WIPO Standard ST.26, paragraph 90, specifies that the “transl_table” and “translation” qualifiers may be used with the “CDS” feature key (see Annex I of WIPO Standard ST.26). Where the “transl_table” qualifier is not used, the use of the Standard Code Table (see Annex I, Section 9, Table 7 of WIPO Standard ST.26) is assumed.

WIPO Standard ST.26, paragraph 91, specifies that the “transl_except” qualifier must be used with the “CDS” feature key and the “translation” qualifier to identify a codon that encodes either pyrrolysine or selenocysteine.

WIPO Standard ST.26, paragraph 92, specifies that an amino acid sequence encoded by the coding sequence and disclosed in a “translation” qualifier that is encompassed by the description of sequences found in MPEP § 2412.05(a) referencing paragraph 7 of WIPO Standard ST.26 must be included in the sequence listing and assigned its own sequence identifier. The sequence identifier assigned to the amino acid sequence must be provided as the value in the qualifier “protein_id” with the “CDS” feature key. The “organism” qualifier of the “source” feature key for the amino acid sequence must be identical to that of its coding sequence.

XII. VARIANTS

MPEP § 2412.05(c) provides information about representation and inclusion of variants.

WIPO Standard ST.26, paragraph 93, specifies that a primary sequence and any variant of that sequence, each disclosed by enumeration of its residues and encompassed by the description of sequences found in MPEP § 2412.05(a) referencing paragraph 7 of WIPO Standard ST.26, must each be included in the sequence listing and assigned their own sequence identifier.

WIPO Standard ST.26, paragraph 94, specifies that any variant sequence, disclosed as a single sequence with enumerated alternative residues at one or more positions, must be included in the sequence listing and should be represented by a single sequence, wherein the enumerated alternative residues are represented by the most restrictive ambiguity symbol.
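To make “most restrictive ambiguity symbol” concrete for nucleotides, the sketch below maps a set of enumerated alternatives to the narrowest symbol that covers exactly that set. It assumes that Table 1 follows the conventional IUPAC nucleotide ambiguity codes, which are not reproduced in this section; the function name most_restrictive_symbol is hypothetical.

```python
# Conventional IUPAC nucleotide ambiguity codes (assumed to match Table 1).
AMBIGUITY_SYMBOLS = {
    frozenset("ac"): "m", frozenset("ag"): "r", frozenset("at"): "w",
    frozenset("cg"): "s", frozenset("ct"): "y", frozenset("gt"): "k",
    frozenset("acg"): "v", frozenset("act"): "h", frozenset("agt"): "d",
    frozenset("cgt"): "b", frozenset("acgt"): "n",
}


def most_restrictive_symbol(alternatives) -> str:
    """Return the symbol covering exactly the given alternative nucleotides."""
    bases = frozenset(base.lower() for base in alternatives)
    if not bases or not bases <= frozenset("acgt"):
        raise ValueError("alternatives must be drawn from a, c, g, t")
    if len(bases) == 1:
        # A single residue needs no ambiguity symbol.
        return next(iter(bases))
    return AMBIGUITY_SYMBOLS[bases]


print(most_restrictive_symbol("ag"))   # purines only
print(most_restrictive_symbol("cgt"))  # anything except a
```

For example, a position disclosed as “a or g” would be represented by r rather than by n, because n would also cover c and t.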

WIPO Standard ST.26, paragraph 95, specifies that any variant sequence, disclosed only by reference to deletion(s), insertion(s), or substitution(s) in a primary sequence in the sequence listing, should be included in the sequence listing. Where included in the sequence listing, such a variant sequence:

  • (a) may be represented by annotation of the primary sequence, where it contains variation(s) at a single location or multiple distinct locations and the occurrences of those variations are independent;
  • (b) should be represented as a separate sequence and assigned its own sequence identifier, where it contains variations at multiple distinct locations and the occurrences of those variations are interdependent; and
  • (c) must be represented as a separate sequence and assigned its own sequence identifier, where it contains an inserted or substituted sequence that contains in excess of 1000 residues (see WIPO Standard ST.26, paragraph 86).
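The three cases of paragraph 95 can be read as a decision order, sketched below under the assumption that the length limit in item (c) takes precedence over the other two. The function name variant_representation and its parameters are hypothetical.

```python
FREE_TEXT_LIMIT = 1000  # qualifier values cannot hold longer sequences


def variant_representation(num_locations: int, interdependent: bool,
                           longest_inserted_or_substituted: int) -> str:
    """Suggest how a deletion/insertion/substitution variant is represented."""
    if longest_inserted_or_substituted > FREE_TEXT_LIMIT:
        # Item (c): too long to fit in a "replace" or "note" qualifier value.
        return "must be a separate sequence"
    if num_locations > 1 and interdependent:
        # Item (b): variations at distinct locations that occur together.
        return "should be a separate sequence"
    # Item (a): a single location, or independent variations.
    return "may be annotated on the primary sequence"


print(variant_representation(1, False, 12))
print(variant_representation(3, True, 12))
print(variant_representation(1, False, 1500))
```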

WIPO Standard ST.26, paragraph 96, specifies the proper use of feature keys and qualifiers for nucleic acid and amino acid sequence variants from the table List of Feature Keys and Qualifiers (reproduced in MPEP § 2412.05(c)).

WIPO Standard ST.26, paragraph 97, specifies that annotation of a sequence for a specific variant must include a feature key and qualifier, as indicated in the table above, and the feature location. The value for the “replace” qualifier must be only a single alternative nucleotide or nucleotide sequence using only the symbols set forth in Table 1: List of Nucleotides Symbols (reproduced in MPEP § 2413.01(g)), or empty. A listing of alternative residues may be provided as the value in the “note” qualifier. In particular, a listing of alternative amino acids must be provided as the value in the “note” qualifier where “X” is used in a sequence, and represents a value other than “any one of ‘A’, ‘R’, ‘N’, ‘D’, ‘C’, ‘Q’, ‘E’, ‘G’, ‘H’, ‘I’, ‘L’, ‘K’, ‘M’, ‘F’, ‘P’, ‘O’, ‘S’, ‘U’, ‘T’, ‘W’, ‘Y’, or ‘V.’” A deletion must be represented by an empty qualifier value for the “replace” qualifier or by an indication in the “note” qualifier that the residue may be deleted. An inserted or substituted residue(s) must be provided in the “replace” or “note” qualifier. The value format for the “replace” and “note” qualifiers is free text and must not exceed 1000 characters. See below for sequences that are provided as an insertion or a substitution in a qualifier value.

WIPO Standard ST.26, paragraph 98, specifies that the symbols set forth in Tables 1 to 3, reproduced in MPEP § 2412.03(a), MPEP § 2412.03(b), and MPEP § 2412.05(b), subsection IV, should be used to represent variant residues where appropriate. For the “note” qualifier, where the variant residue is a modified residue not set forth in Table 2, the complete unabbreviated name of the modified residue must be provided as the qualifier value. Modified residues must be further described in the feature table as described above in subsection I.

WIPO Standard ST.26, paragraph 100, specifies that a sequence encompassed by the description of sequences found in MPEP § 2412.05(a) referencing paragraph 7 of WIPO Standard ST.26 that is provided as an insertion or a substitution in a qualifier value for a primary sequence annotation must also be included in the sequence listing and assigned its own sequence identifier.