FlyBase Archive server: last updated November 2004
B.1.
Genes
The Genes
section of FlyBase is the major source of information on Drosophila genes. Data
from all species of the family Drosophilidae are included. The initial data
set was produced by merging the genes data in the text of Lindsley
and Zimm (1992) with the old LOCI table of Ashburner, and Merriam's Genevent
database. Information from all three sources has, however, been considerably
revised and reformatted. New gene and allele records are added through FlyBase's
curation of the literature and sequence databases. The curation of phenotypic
data, a particularly complex class of Genes data, is discussed in Phenotypic
Data in FlyBase, Drysdale (2001).
Some of the records in Genes
will be transient. As more data become available some gene records will merge
with others. Furthermore, some of these records are based on minimal data, for
example, the annotation to an EMBL
or GenBank
sequence record. Our policy is to include data wherever we can. As records merge
(or split) they will always be traceable by their secondary gene identifier
numbers and by their synonyms.
One of the major differences between
Lindsley and Zimm (1992) on the one
hand, and Lindsley and Grell (1968)
and Bridges and Brehme (1944), on
the other, is that the 1944 and 1968 books were very much catalogs of mutations,
rather than of genes. Bridges
and Brehme (1944) and Lindsley and
Grell (1968) were allele based, while Lindsley
and Zimm (1992) is largely, although not entirely, gene based. FlyBase is
a gene based database, and Genes
reflects this change. That said, the transition
is by no means complete in Genes. For the majority of genes, mutant phenotypes
are described in the respective allele records. In many cases where, as far
as we know, all mutant alleles have a similar phenotype, this description
will be found in the record for the first allele in Genes. Many genes in Lindsley
and Zimm (1992) had no alleles specified, although it is clear that these
genes were identified by one or more mutant alleles. In these cases we have
arbitrarily designated an allele with the superscript 1. (Likewise, where an
allele is referred to in text with a gene designation, we have regarded this
as implying allele 1, where this seems reasonable, and made the change to state
allele 1 explicitly). There remain, in Genes, many cases where phenotypic information
is to be found within the gene record itself. This is especially so for genes
for which there is a great amount of data.
Errors in Genes.
Genes data will not be free of errors,
typographical, of fact, or of interpretation. Please inform FlyBase when you
find any error in these data. It will then be corrected. E-mail to flybase-updates
at morgan.harvard.edu (reformat to standard e-mail address) or contact
a member of the FlyBase group, whose addresses and phone/fax numbers are given
in Reference Manual I: The FlyBase Project.
B.1.1.
General description of Genes data
The Genes file contains
a set of Drosophila gene records, the data of each record being organized into
many different fields. As far as possible, we have implemented controlled vocabularies
for the descriptions. These are indicated by [cv]. The controlled vocabularies
are to be found in controlled-vocabularies.txt.
This process is by no means complete, except for some of the simpler fields,
such as mutagen. For example all X ray induced alleles are described as 'X ray'
(without the quotes) in the allele origin field, never 'X rays', 'X-ray' or
'X-rays'.
The use of controlled vocabularies
will increase in the future. This will allow users to more easily search the
database and retrieve genes or alleles with particular properties.
Overall syntax: The maximum
line length is 255 characters; there are no blank lines; all lines begin with
either * or #; lines that begin with # have no other characters; lines that
begin with * have a letter in column 2, a space in column 3 and at least one
more character beginning in column 4. The character # appears nowhere else in
the file. The character * does, unfortunately, appear elsewhere, but never immediately
followed by a letter (i.e., the pattern *[A-Za-z] occurs only at the start of a line).
Record structure: The lines
that are just '#' mark the end of the record for a gene. All other lines hold
data for a gene. Each field is one or more lines that have the same character
in column 2. This character identifies the field and, sometimes, its position
within a record (see below).
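The syntax rules above are enough to split the file into records. The following is a minimal Python sketch, not FlyBase's own software. The function name parse_records and the sample lines are illustrative assumptions, and each physical line is kept as one (code, text) pair rather than merging consecutive lines of the same field.

```python
from typing import Iterable, Iterator, List, Tuple

Field = Tuple[str, str]  # (one-letter field code, field text)


def parse_records(lines: Iterable[str]) -> Iterator[List[Field]]:
    """Yield one list of (code, text) pairs per gene record."""
    record: List[Field] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line == "#":
            # A line that is just '#' closes the current record.
            yield record
            record = []
        elif (len(line) >= 4 and line[0] == "*"
              and line[1].isalpha() and line[2] == " "):
            # '*' in column 1, a letter in column 2, a space in column 3,
            # and the data starting in column 4.
            record.append((line[1], line[3:]))
        else:
            raise ValueError(f"line does not match the Genes syntax: {line!r}")
    if record:
        # Assumption: tolerate a final record that lacks its closing '#'.
        yield record


# Illustrative input built from examples used elsewhere in this section.
sample = ["*a bb", "*z FBgn0001234", "#", "*a Dhyd\\Minos", "#"]
records = list(parse_records(sample))
```

Raising on any non-conforming line is a deliberate choice in this sketch: because every line must begin with * or #, a failure usually signals a truncated or re-wrapped file.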
B.1.2.
List of Genes field descriptions
These are the current field designations
in alphabetical order:
*a gene symbol
*b genetic location
*c cytological location
*d biological role of gene product [cv]
*e full name of gene or allele
*f cellular compartment of which gene product is a component [cv]
*g nucleic acid sequence databank and other DNA accession number
*h polymorphism data
*i synonym(s)
*k phenotypic information on alleles
*l transposable element data
*m protein database accession number
*n aberrations causing position-effect variegation of gene [cv]
*o origin/mutagen [cv]
*p phenotypic information on genes
*q information concerning functional relationships between genes
*r information on wild-type biological role
*s molecular information for genes and alleles
*t class of gene [cv]
*u miscellaneous information on genes and alleles
*v information on availability
*w discoverer
*x reference(s)
*y secondary FlyBase identifier number(s)
*z primary FlyBase identifier number
*A allele symbol
*B alternative genetic location
*C comments on cytology associated with allele
*D comments on cytological location
*E a duplicate of a *x field, used to tie data to a reference
*F function of gene product [cv]
*G insertion chromosome associated with allele
*H date record entered or updated
*I transgene construct that carries allele
*J protein domain information
*K arguably most useful aneuploids for this gene
*L synonym for transgene construct symbol
*M probable ortholog in reference species of drosophilid
*N synonym for insertion symbol
*O progenitor allele or chromosome if relevant to allele
*P aberration causing the allele
*Q complementation information concerning alleles
*R comments on origin, including progenitor genotype if irrelevant to allele
*S genetic interaction information on alleles
*T recent review article that discusses this gene
*U nickname
*Y name of gene product
Field structure: The first
line of each record is the *a field. There is only one of these per record.
Other fields may appear in any order, and most can appear more than once, not
necessarily consecutively. All fields before the first *A field (if any)
refer to the gene. All fields between two *A fields (or between an *A field
and a #) refer to the immediately preceding allele. Thus, for example, *b fields
always appear before any *A fields, but *e fields can appear anywhere (e.g.,
"*e white" and "*e white-apricot"). Fields before the first *A are in a defined
order:
aHiezyCbcwBDdJUltrfvFghmnpqsuxE
In pretty outputs the *-codes are
replaced by a text term describing the field.
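The rule that fields before the first *A belong to the gene, and later fields belong to the most recent *A, can be sketched as follows. This is an illustration only: the function name and the sample record are assumptions, and the input is taken to be a list of (code, text) pairs for a single record.

```python
from typing import Dict, List, Tuple

Field = Tuple[str, str]  # (one-letter field code, field text)


def split_gene_and_alleles(
    record: List[Field],
) -> Tuple[List[Field], Dict[str, List[Field]]]:
    """Separate gene-level fields from the fields of each allele."""
    gene_fields: List[Field] = []
    alleles: Dict[str, List[Field]] = {}
    current = None  # symbol of the allele currently being read, if any
    for code, text in record:
        if code == "A":
            # Each *A field opens a new allele block.
            current = text
            alleles[current] = []
        elif current is None:
            # Nothing seen before the first *A refers to an allele.
            gene_fields.append((code, text))
        else:
            alleles[current].append((code, text))
    return gene_fields, alleles


# Hypothetical record: *e appears once for the gene and once for an allele.
record = [("a", "w"), ("e", "white"), ("A", "w[a]"), ("e", "white-apricot")]
gene_fields, alleles = split_gene_and_alleles(record)
```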
Special characters: There
are no special characters used in this file. Superscripts are enclosed between
square brackets []; subscripts between double square brackets [[]]. Greek letters
are written out, e.g. alpha, beta.
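A display layer might convert the bracket notation into markup. The sketch below assumes HTML output and non-nested brackets; the function name to_html and the example symbols are illustrative, not part of FlyBase.

```python
import re


def to_html(symbol: str) -> str:
    """Render [..] as superscript and [[..]] as subscript (HTML assumed)."""
    # Handle double brackets first, otherwise the single-bracket rule
    # would consume them.
    out = re.sub(r"\[\[(.+?)\]\]", r"<sub>\1</sub>", symbol)
    out = re.sub(r"\[(.+?)\]", r"<sup>\1</sup>", out)
    return out


superscript_example = to_html("w[a]")
subscript_example = to_html("x[[2]]")  # hypothetical symbol
```

The sketch does not escape other HTML-significant characters, which a real renderer would need to do.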
B.1.3.
Detailed description of the Genes fields
In this description the fields are
grouped logically, rather than alphabetically. Links in the list of field designations
in section B.1.2. above go to the relevant detailed field descriptions below.
- *H. Dating of
records and updates. All gene records have two date fields. The first, 'Date
entered', is the date a gene record was entered into the Sybase tables. The
second is 'Last updated', the date the record was last updated. When entered
the two dates will be the same. The 'zero' date of all records then
extant was 16 May 1994. FlyBase dates are represented as dd mmm yy, mmm being
the three-letter abbreviation of the month, and yy being the last two
digits of the year (e.g., 01 Jul 94).
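Two-digit years need a century rule before the dates can be compared. The sketch below assumes that, because the 'zero' date is 16 May 1994, no record carries an earlier date, so years 94 to 99 are read as 19yy and all others as 20yy. That pivot, the function name, and the English month abbreviations are assumptions rather than documented FlyBase behavior.

```python
from datetime import date

MONTHS = {
    "Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
    "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12,
}


def parse_flybase_date(text: str, pivot: int = 94) -> date:
    """Parse a date written as 'dd Mon yy', e.g. '01 Jul 94'."""
    day, month, yy = text.split()
    year = int(yy)
    # Assumption: nothing predates the 1994 zero date, so two-digit
    # years below the pivot belong to the 2000s.
    year += 1900 if year >= pivot else 2000
    return date(year, MONTHS[month], int(day))


entered = parse_flybase_date("01 Jul 94")
```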
- *z, *y. Each
gene and allele record in FlyBase has a unique identifier number (see section
F.1. of Reference Manual F: Links To and From FlyBase).
The primary identifier number is in the *z field, secondary identifier numbers
are in *y fields.
Syntax: *z FBgn_integer
e.g., *z FBgn0001234
- *a. This is the
standard abbreviation (gene symbol) for the name of the gene. In the Genes
file, gene records are sorted alphabetically. The order of precedence is:
all-Greek symbols (in alphabetical order); symbols that begin with a number
(in numerical order, secondarily sorted on suffix, e.g., 1, 2, 2a,
2b, 3); symbols that begin with a letter (lower case having precedence over
upper, and numerals precedence over letters, e.g., b, B, b1, ba).
Syntax: *a <Nnnn\>symbol
e.g., *a bb
*a Dhyd\Minos
Nnnn is an abbreviation for the species. The default species is D. melanogaster,
in which case there is no species abbreviation. If a gene is from another
species of drosophilid then this is indicated by Nnnn, where N is normally
the initial letter of the genus, and nnn are normally the first three letters
of the specific epithet. A list of species
abbreviations is in the Nomenclature
section of FlyBase.
Syntax: *e <Nnnn\>name
e.g., *e bobbed
*e Dhyd\Minos
Genes encoded by the mitochondrial genome all have the prefix Nnnn\mt:.
The D. melanogaster gene encoding the cytochrome oxidase subunit
II is, therefore, mt:CoII, the D. simulans gene encoding
the mitochondrial proline tRNA is Dsim\mt:tRNA:P. The record MT:DNA
is used for data concerning the mitochondrial genome and its products that
cannot be assigned to any single mitochondrial gene. The symbol mt:ori
is used for the non-coding A+T rich region of the mitochondrial origin of
replication.
FlyBase includes data on artificial gene constructs, for example fusions between
different genes. Fusion genes are named using the gene symbols of their components
separated by a double colon, e.g., Antp::Scr. The components are
listed in alphabetical order. When a component of a construct is from a species
other than D. melanogaster then its symbol is prefixed by Nnnn\
to indicate the species of origin. For example the lexA gene from E. coli
has the symbol Ecol\lexA. A list
of the species abbreviations used is to be found in the Nomenclature
section of FlyBase.
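The symbol conventions above (optional species prefix, mt: prefix, and :: between fusion components) can be unpacked mechanically. This is a rough sketch with assumed names. It treats anything before a backslash as a species abbreviation, so it cannot tell a species prefix from the transposable-element prefix used for genes carried by transposable elements.

```python
from typing import List, NamedTuple, Optional


class Symbol(NamedTuple):
    species: Optional[str]  # None means the default, D. melanogaster
    mitochondrial: bool
    name: str


def parse_symbol(symbol: str) -> Symbol:
    """Split one symbol into species prefix and name."""
    species = None
    name = symbol
    if "\\" in symbol:
        species, name = symbol.split("\\", 1)
    return Symbol(species, name.startswith("mt:"), name)


def parse_components(symbol: str) -> List[Symbol]:
    """Split a fusion gene symbol on '::' and parse each component."""
    return [parse_symbol(part) for part in symbol.split("::")]


trna = parse_symbol("Dsim\\mt:tRNA:P")
fusion = parse_components("Antp::Scr")
```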
- *e. This is the
full name of the gene or allele. FlyBase takes a minimalist definition of
a gene. As an example, Notch is regarded as a gene, but facet,
Confluens, split etc. are not. These phenotypically distinct
allelic forms that have, in the past, been named as if they were genetic loci
are included as gene synonyms.
FlyBase is not entirely consistent in the way directly duplicated genes are
handled: for example the five HSP70 encoding genes at Hsp70A and
Hsp70B and the five larval cuticle protein encoding genes at 44D
are all listed independently but the five major histone protein coding regions,
tandemly repeated at the base of 2L, are each listed as a separate
gene, but only once.
Some loci have only been identified by molecular methods, not having been
mapped. Such loci are included in Genes. Other "loci" included
in this file have not been genetically mapped or characterized but are assumed
to exist on the basis of, for example, a purified protein. Some loci have
been impossible to name in any logical way, due to a lack of data. As a temporary
expedient these are named as anon-*, where the * indicates a code.
These loci will be renamed as and when more data become available.
STS sequences identified by Drosophila genome projects appear in the nucleic
acid sequence data archive and in the NCBI's dbSTS database. These short
sequences are routinely matched
against the universe of public sequence data and often have 'significant'
matches to genes identified in species other than Drosophila. Such matches
are clues that similar genes may occur in D. melanogaster. For this
reason STS sequences with significant matches are identified as 'genes' in
this file, and have the temporary name ESTSn (for STS sequences from
the European project) or BSTSn (for those from Berkeley), where n is the code
used by the Genome Project (e.g., ESTS100F7T, BSTSDm0092). STS sequences that
match known Drosophila genes will be linked to the relevant gene record by
their accession numbers in the GenBank/EMBL/DDBJ and dbSTS data archives.
STS sequences that have no matches whatsoever are only linked to their parental
clone in the clones tables. All STSs with matches are similarly linked to
their parental clones in these tables.
- *b, *B. Genetic
map position. Given as Chromosome number-map position, e.g. 3-10. If a gene
has not been mapped within a chromosome, then only the chromosome is indicated
as, for example, 2-. This implies '2- (not located)'. Many genes have been
mapped cytogenetically but not genetically. Their map positions have been
estimated and are enclosed in []. (Not {} as in Lindsley
and Zimm (1992).) The published map positions of some genes are clearly
at variance with their cytogenetic positions. In such cases we have estimated
their genetic position and indicate this by enclosing the estimate in [].
*B is used to store comments on genetic map positions, including unresolved
differences between some genetic map positions in Lindsley
and Zimm (1992) and those in Ashburner's original files.
To estimate genetic map positions from cytogenetic positions we use a standard table
made by plotting all of the available data and then interpolating. Estimated
genetic positions are normally only made to the nearest whole number. The
exceptions to this rule are in regions of very low recombination relative
to the cytogenetic map. The table of cytogenetic
vs. genetic map positions used is available in the Maps
section of FlyBase.
Syntax: *b chromosome_symbol-number
e.g., *b 1-66.0
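A reader of *b values has to cope with unlocated genes ("2-") and bracketed estimates. The sketch below is a guess at a tolerant parser: the text does not say exactly where the square brackets sit, so the code simply treats any bracket as marking an estimate. The names and the bracketed example are assumptions.

```python
from typing import NamedTuple, Optional


class GeneticLocation(NamedTuple):
    chromosome: str
    position: Optional[float]  # None when the gene is not located
    estimated: bool


def parse_genetic_location(value: str) -> GeneticLocation:
    """Parse the value of a *b field, e.g. '1-66.0' or '2-'."""
    # Assumption: brackets anywhere in the value flag an estimate.
    estimated = "[" in value
    cleaned = value.replace("[", "").replace("]", "").strip()
    chromosome, _, pos = cleaned.partition("-")
    position = float(pos) if pos.strip() else None
    return GeneticLocation(chromosome.strip(), position, estimated)


located = parse_genetic_location("1-66.0")
unlocated = parse_genetic_location("2-")
```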
- *c. Cytogenetic
map positions. These are given as extreme left and right hand limits. In the
case where one of these limits is said to be a doublet, e.g., "35D1,2", then
only the outermost band (in this case 35D1 if this was the left-hand end of
the range) is given. The limits are separated by a double hyphen (--).
Syntax: *c left_hand_limit--right_hand_limit
e.g., *c 25C--25D
Many genes have been mapped genetically but not cytogenetically. Their map
positions have been estimated and are enclosed in [].
Following the cytogenetic range there may be a statement regarding how it
was established, e.g., by in situ hybridization. When a cytogenetic range
or a statement of how it was derived appears "unattributed", i.e., not in
a block headed "Data from ref. nnnn", it is computed from all available data
and the tightest deducible range is shown. In cases where different reports
give conflicting data, FlyBase has made a decision to mark one or more statements
as suspect by prefacing them with "???". Such statements are excluded from
the computations that give rise to CytoSearch
data. If you find that an error has been made in this process, please inform
us by email to flybase-updates at morgan.harvard.edu.
- *D. *D is used
to store comments on cytological map positions. This may include text giving,
for example, information that a weaker in situ signal was seen elsewhere.
- *K. Arguably
most useful aneuploids for this gene. This is the algorithm for identifying
the listed aberrations:
1) Admissible aberrations are ones that have no progenitor (too hard to work
out what's missing) and whose class is one of Deficiency, Deficient translocation,
Deficiency (first two listed breaks) plus Inversion, Tandem duplication or
the three insertional duplication classes, plus separable components of aberrations
that have no progenitor and whose class is one of the insertional transposition
classes (this may be extended to inversion recombinants and translocation
segregants in the future).
2) Aberrations are first prioritized into the following categories:
- those available at Bloomington
- those with a 2000 reference
- those with a 1990-9 ref not including L&Z
- those with a 1980-9 ref
- those with a 1970-9 ref
- those with a 1960-9 ref not including L&G
and then each category is sorted by distance between first two listed breaks
(number of bands, smallest aberration first, taking the minimum size). This
is the "league table" of aberrations.
3) The first aberration in the league table that is stated (in the aberration
record) to delete the relevant gene is listed as:
*K Deficiency: <Df symbol>
Similarly the first aberration in the league table that is stated (in the aberration
record) to be duplicated for the relevant gene is listed as:
*K Duplication: <Dp symbol>
4) The first aberration in the league table whose minimum deleted region extends
at least two bands either side of the gene's region of uncertainty is listed
as:
*K Deficiency: <Df symbol> (inferred from cytology)
but only if it appears earlier in the league table than the one (if any) listed
in step 3. Similarly for duplications, as:
*K Duplication: <Dp symbol> (inferred from cytology)
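Steps 2 and 3 of this procedure amount to a two-key sort followed by a first-match search. The Python sketch below illustrates only those two steps. Admissibility (step 1) and inference from cytology (step 4) are left out, and the class, the category numbering, and the aberration and gene symbols are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Aberration:
    symbol: str
    category: int       # 0 = available at Bloomington, 1 = 2000 ref, ... 5 = 1960-9 ref
    size_in_bands: int  # minimum distance between the first two listed breaks
    deletes: Set[str] = field(default_factory=set)
    duplicates: Set[str] = field(default_factory=set)


def league_table(aberrations: List[Aberration]) -> List[Aberration]:
    """Order by priority category, then smallest aberration first."""
    return sorted(aberrations, key=lambda ab: (ab.category, ab.size_in_bands))


def first_stated(table: List[Aberration], gene: str, kind: str) -> Optional[str]:
    """Return the *K line for the first aberration stated to affect the gene."""
    for ab in table:
        genes = ab.deletes if kind == "Deficiency" else ab.duplicates
        if gene in genes:
            return f"*K {kind}: {ab.symbol}"
    return None


# Hypothetical aberrations and genes, for illustration only.
table = league_table([
    Aberration("Df(2L)hyp1", 2, 3, deletes={"geneX"}),
    Aberration("Df(2L)hyp2", 0, 10, deletes={"geneX"}),
    Aberration("Df(2L)hyp3", 0, 4, deletes={"geneY"}),
    Aberration("Dp(2;1)hyp4", 1, 5, duplicates={"geneX"}),
])
deficiency_line = first_stated(table, "geneX", "Deficiency")
duplication_line = first_stated(table, "geneX", "Duplication")
```

In this example the larger Df(2L)hyp2 is chosen over the smaller Df(2L)hyp1 because category takes precedence over size.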
- *i. Synonyms.
As mentioned above FlyBase takes a very liberal view of synonyms, and the
table gene-synonyms.txt in the Genes
section is provided as a tool to allow the identification of the name, and
symbol, that FlyBase uses for each gene or allele. In Genes
these data are kept in the *i field, for both gene and allele synonyms.
Syntax: *i synonym_symbol: synonym name <text, e.g. a reference>
e.g., *i ho: heldout
- *U. Nickname.
Nicknames are valid alternative symbols for a gene or allele. Nicknames support
the use in Drosophila genotypes of foreign gene symbols sans the species identifier,
for example, lacZ rather than Ecol\lacZ. Nicknames are assigned
only to foreign genes that frequently appear in Drosophila transgene constructs.
- *Y. Name of
the gene product. This field is moderately controlled. The suffix '-like'
is used to indicate that a gene product has been named by similarity.
- *d. Biological
role of gene product. This field gives information concerning the biological
role(s) of the gene product. The terms used are from the process ontology
of the Gene Ontology Consortium
database and include the GO identifier number. The 'evidence' for an attribution
may follow the term as a 'pipe' (i.e., after the character |). Statements
of evidence are drawn from a small controlled vocabulary:
inferred from mutant phenotype
inferred from genetic interaction
inferred from physical interaction
inferred from sequence similarity
inferred from direct assay
inferred from expression pattern
inferred from electronic annotation
traceable author statement
non-traceable author statement
Note about 'inferred from mutant phenotype': The GO consortium regards alterations
of gene expression as 'phenotype' in the context of this evidence code. The
description of mutant phenotypes in the FlyBase Allele data (see section on
*k), however, is restricted to alterations of the anatomy or organismal function
of the mutant, and does not include expression pattern data. For more about
the GO evidence codes see http://www.geneontology.org/doc/GO.terms_and_ids.
- *F. Function
of gene product. This field gives information about the function(s) of the
gene product. The terms used are from the function ontology of the Gene
Ontology Consortium database and include the GO identifier number. GO
function terms also include cross-reference to the ENZYME
database. Statements of evidence are drawn from a small controlled vocabulary:
inferred from mutant phenotype
inferred from genetic interaction
inferred from physical interaction
inferred from sequence similarity
inferred from direct assay
inferred from expression pattern
inferred from electronic annotation
traceable author statement
non-traceable author statement
Note about 'inferred from mutant phenotype': The GO consortium regards alterations
of gene expression as 'phenotype' in the context of this evidence code. The
description of mutant phenotypes in the FlyBase Allele data (see section on
*k), however, is restricted to alterations of the anatomy or organismal function
of the mutant, and does not include expression pattern data. For more about
the GO evidence codes see http://www.geneontology.org/doc/GO.terms_and_ids.
- *J. Description
of the structural features of gene products. These data are not curated by
FlyBase but are from the InterPro
database. InterPro provides an integrated view of the commonly used protein
domain or signature databases. Release 3.1 (May 2001) was built from Pfam
6.0, PRINTS 30.0,
PROSITE 16.35, ProDom
2001.1, SMART 3.1 and the current SWISS-PROT
+ TrEMBL data.
Syntax for InterPro cross references:
*J InterPro_number == InterPro_accession_name
e.g., *J IPR000014 == PAS domain.
- *f. Cellular
compartment of which gene product is a component. This field gives information
about the cellular compartment(s) of which the gene product is a component.
These include not only the obvious parts of a cell (nucleus, mitochondrion),
but also all defined supra-molecular complexes (e.g., small ribosomal subunit,
proteasome). The terms used are from the cellular component ontology of the
Gene Ontology Consortium database
and include the GO identifier number. Statements of evidence are drawn from
a small controlled vocabulary:
inferred from mutant phenotype
inferred from genetic interaction
inferred from physical interaction
inferred from sequence similarity
inferred from direct assay
author said so
not available
- *g. Nucleic acid
sequences. In these fields FlyBase stores pointers to nucleic acid sequence
data, usually in the form of EMBL/GenBank/DDBJ/NCBI accession (AC) numbers.
If a sequence has been published but is not yet in one of these data banks
a brief journal reference is given instead (the full reference will be found
in References). Data from the three
nucleic acid sequence databases are received on a daily basis by FlyBase.
FlyBase is also cross-referenced to a number of other sequence databases.
These cross-references are stored in the *g line (if nucleic acid) or *m line
(if protein). These other databases and the database code used in FlyBase
to identify links to those databases are listed in Reference
Manual F.3. The accession numbers for all external sequence links are
listed in the file external-databases.txt.
The EMBL/NCBI/DDBJ sequence accession numbers have no database code prefix.
Syntax: *g <database_code/>accession_number
e.g., *g X12345
*g EPD/23023
If the nucleic acid sequence accession includes coding regions then each coding
region has a unique PID number. These are appended to the nucleic acid sequence
accession number, following a semi-colon, e.g.,
*g U42989; g1150983
Note that the number of PIDs attached to a sequence record may be more than
one for two reasons. The first is that the EBI and NCBI often assign PID numbers
independently to the same object; the other is that there is more than one
protein product from a single gene (as the result, for example, of alternative
splicing).
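The *g value can be taken apart into an optional database code, an accession number, and any PIDs. The sketch below assumes that multiple PIDs are each separated by a semi-colon, which the text does not state. The function name is illustrative, and journal references stored in place of an accession are not handled.

```python
from typing import Dict, List, Optional, Union

Parsed = Dict[str, Union[Optional[str], List[str]]]


def parse_g_field(value: str) -> Parsed:
    """Parse the value of a *g field, e.g. 'U42989; g1150983'."""
    accession_part, *pid_parts = [p.strip() for p in value.split(";")]
    # EMBL/NCBI/DDBJ accessions carry no database code, so the part
    # before '/' may be empty.
    database, _, accession = accession_part.rpartition("/")
    return {
        "database": database or None,
        "accession": accession,
        # Assumption: every further ';'-separated item is a PID.
        "pids": [p for p in pid_parts if p],
    }


plain = parse_g_field("X12345")
with_pid = parse_g_field("U42989; g1150983")
```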
- *r. The *r field
is used for information about the wild-type biological role of a gene. The
objective is for each gene record to have a *r field in which information
about the gene's biological role is summarized. The present situation, however,
is that for the majority of genes this information is still to be found in
the *p field of the gene record. FlyBase is systematically rewriting these
*p fields (historically derived from the 'Phenotype' field of Lindsley
and Zimm (1992)) so that the summary of wild-type function is moved to
the *r field.
- *n. Aberrations
causing position-effect variegation of gene. This is a controlled field to
indicate aberrations that cause position-effect variegation of a gene.
Syntax: *n recessive PEV in: <aberration_symbol>
*n dominant PEV in: <aberration_symbol>
*n no PEV in: <aberration_symbol>
- *m. Protein sequence
data. The *m field stores pointers to protein sequence data, usually in the
form of SWISS-PROT/TREMBL/PIR protein sequence databank accession (AC) numbers.
Because of potential clashes between the accession numbers of different databases,
the AC numbers are prefixed "SWP/", "TREMBL/" or "PIR/".
These fields are also used for cross-references between FlyBase and structural
data on Drosophila proteins held on PDB (Protein Data Bank, Brookhaven),
the NRL_3D databank and the G protein-coupled receptor database (GCRDb). These
records have the prefixes PDB/, NRL_3D/ and GCR/ respectively. Cross-references
to the 'factors' table of the TRANSFAC database (E. Wingender, J. Biotechnol.
35:273-280, 1994) have the prefix TF/.
Syntax: *m database_code/accession_number
e.g. *m SWP/P12428
- *M. Probable
ortholog in other species of drosophilid. The *M field is a pointer between
"orthologous" genes in different species of drosophilid. A single species (D.
melanogaster when possible) is treated as the "reference" for a given
gene, and links are made with *M fields between the gene of the reference
species and probable orthologs. No direct *M links are made between the non-reference
genes.
Links are only made where there is good genetic or phenotypic (including sequence)
evidence for homology of entire genes. It is not uncommon for a gene to be
present once in species a but twice (or more) in species b
(e.g., Adh in D. melanogaster vs. D. mulleri).
In such cases all possible pair-wise links are made via *M fields.
Syntax: *M <Nnnn>\gene_symbol
Although genes in different species of Drosophila characterized by
sequencing generally have the same gene symbol as the presumed homolog in
D. melanogaster this is by no means true for genes characterized
by mutations in these species. In these instances 'homology' is usually deduced
from mutant phenotype and linkage group. No attempt has (yet) been made to
impose homologies, over and above suggestions made in the literature.
- *p. Phenotype.
The *p field holds phenotypic information about a gene (or, as explained above,
about its mutant alleles in some cases). This field is free text and, by and
large, has not yet been standardized with respect to its vocabulary. One special
use of the *p field is to hold information on gene interactions. These are
expressed as follows:
*p Interacts genetically with: [gene_symbol]
- *u. The *u field
is for miscellaneous information concerning a gene, as free text. Notes concerning
the identification of the gene, or the derivation of the gene symbol/name
are stored following the corresponding 'Identification:' or 'Etymology:' prefix.
- *s. Molecular
data. These fields keep molecular data about genes and alleles. The *s field
at the gene level is subdivided into five categories. In addition to the free
text category there are four additional categories distinguished by a set
of controlled prefixes:
Gene order: Accommodates gene order/orientation data derived
by molecular, rather than genetic, means. The data will be presented in the
format 'Gene order: In direction of increasing cytology: Dredd- su(s)+' where
+ indicates 5'-3' proceeds with increasing cytological location, - the opposite,
and ? where the direction of transcription is not declared. Where orientation
with respect to the chromosome is not known, gene sequence is preceded by
the statement "Overall orientation not stated" and + and - simply reflect
orientation of the transcripts with respect to each other. Where a 'Gene order'
line begins or ends with an ellipsis (...) this indicates that the complete
gene order described in the publication is more extensive than this subset
reported for the gene in question. Gene reports for genes at either end of
the reported line will continue the molecular gene order over a greater extent.
Maps to clone: Accommodates positive relationships between
a gene and clones (P1, BAC, YAC) as used by large scale public genome projects.
Does not map to clone: Accommodates negative relationships
between a gene and clones (P1, BAC, YAC) as used by large scale public genome
projects.
Identified with: Accommodates relationships between a gene
and ESTs or STSs as generated by large scale public genome projects.
The *s field at the allele level is free text except for the following three
controlled prefixes:
Construct: Used to denote an 'allele' engineered in vitro by recombinant
DNA technology and assayed in the genome after germline transformation or
in transient assays in the whole organism or cell culture.
Amino acid replacement: prefixes a standard format statement about
the nature of the mutation. Format is 'letterNletter' where each letter refers
to the standard amino acid single letter code, and N is the residue of the
encoded protein that is altered. Thus C67Y denotes that the cysteine at position
67 is replaced by a tyrosine. Stop codons are represented by @. Question marks
? represent uncertainty or lack of information about the amino acid or position
in question.
Nucleotide substitution: prefixes a standard format statement about
the nature of the mutation. Format is 'letterNletter' where each letter refers
to the nucleotide, and N is the position of the affected nucleotide. Thus
C313T denotes that the C at position 313 is replaced by a T. Note that the
numbers in "Nucleotide substitution" data reflect author statement and do
not necessarily have any significance with respect to "Nucleotide substitution" statements from other authors.
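Both the 'Amino acid replacement' and 'Nucleotide substitution' statements share the 'letterNletter' shape, so one pattern can read either. The sketch below assumes that an unknown position is written as a single question mark; the names and the stop-codon example are illustrative.

```python
import re
from typing import NamedTuple, Optional


class Substitution(NamedTuple):
    original: str            # letter, '@' for stop, or '?' if unknown
    position: Optional[int]  # None when the position is given as '?'
    replacement: str


# One symbol, then digits (or '?'), then one symbol.
_PATTERN = re.compile(r"^([A-Za-z@?])(\d+|\?)([A-Za-z@?])$")


def parse_substitution(text: str) -> Substitution:
    """Parse a statement such as 'C67Y' or 'C313T'."""
    match = _PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not in letterNletter format: {text!r}")
    original, position, replacement = match.groups()
    return Substitution(
        original, None if position == "?" else int(position), replacement
    )


amino_acid = parse_substitution("C67Y")
nucleotide = parse_substitution("C313T")
```

The parser does not check that the letters are valid residue or nucleotide codes, since the same function serves both prefixes.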
- *q. The *q field
holds data about genes or groups of alleles that pertain to the relationship
between that gene and other genes. For example, statements that alleles of
gene A complement alleles of gene B, that, in addition to explicitly named
alleles of this locus, a further ten alleles had been isolated, or that the
gene may be the same as another, would be kept in this field. This field accommodates
data stored with several controlled prefixes:
"Source for merge: gene1 gene2" statements mark publications as containing
the evidence that the named gene1 and gene2, previously recorded as being
distinct, correspond to the same gene, giving rise to the merging of the two
gene records in FlyBase into one.
Other controlled prefixes for this field deal with functional complementation
relationships between the gene in question and genes of other species. Prefixes
are:
Functionally complemented by:
Does not functionally complement:
Is not functionally complemented by:
Partially functionally complements:
Partially functionally complemented by:
Gain of function effect when expressed in:
No gain of function effect when expressed in:
- *l. Information
about the nature and molecular characteristics of transposable elements is
contained in the *l field:
*l element type:
*l terminal repeat length in bp:
*l total length in bp:
*l target site duplication length in bp:
*l number of copies in genome:
*l component genes:
The allowed values of 'element type:' are:
LINE, LINE-like retrotransposons
SINE, SINE-like elements
LTR, retroviral-like elements with long terminal repeats
IR, elements with inverted repeat termini
FB, fold-back elements
- *h. Polymorphism
data. The *h fields store data from population studies. These data are subdivided
into categories.
variability: a (more or less) quantitative statement of variability
at the locus.
sampled from: the geographic locations of the populations sampled.
sample size: the number of populations/strains analyzed.
no. of KB assayed: the extent of the region assayed.
type of assay: method used to measure variability (see CV).
comments: comments on the results and conclusions of the analysis.
- *t. Class of
gene. This field holds information about the class of the genetic element.
The default is a protein-coding gene carried by the nuclear genome of a species
of drosophilid.
The following classes of nuclear non-protein-coding gene are recognized:
*t nuclear_non-protein-coding_RNA_gene: the parent class of the following:
*t cytosolic_tRNA_gene: for tRNA encoding genes.
*t cytosolic_ribosomal_RNA_gene: for rRNA encoding genes.
*t nuclear_small_nucleolar_RNA_gene: for snoRNA encoding genes.
*t nuclear_snRNA_gene: for small nuclear RNA (snRNA) encoding genes.
*t nuclear_untranslated_RNA_gene: for other nuclear chromosomal genes none
of whose transcripts encode a protein.
*t small_intermediate_RNA_encoding_gene: for genes reported to encode siRNAs.
*t microRNA_encoding_gene: for miRNA encoding genes.
Mitochondrial genes. Genes encoded by the mitochondrial genome have the symbol
prefix 'mt:' or 'Nnnn\mt:' if from a species other than D. melanogaster.
The following classes of mitochondrial_gene are recognized:
*t mitochondrial_gene: the parent class of the following and used only for
generic MT:DNA records and for the mitochondrial replication origin, mt:ori.
*t mitochondrial_protein-coding_gene: for protein coding genes of the mitochondrial
genome.
*t mitochondrial_non-protein-coding_gene: the parent class of the following:
*t mitochondrial_tRNA_gene: for mitochondrial
encoded tRNA genes.
*t mitochondrial_ribosomal_RNA_gene: for mitochondrial
encoded rRNA genes.
*t pseudogene: Nonfunctional loci with sequence identity to a functional gene.
*t microsatellite: Loci composed of tandem repeats of short (1 to 10 bp)
nucleotide sequences.
*t transposable_element. A natural transposable element of a drosophilid.
Information concerning the class of the element is held in the *l field.
*t transposable_element_gene. A gene carried by a natural transposable element
of a drosophilid. The symbol of this gene will be of the form 'N\m', where
'N' is the symbol of the transposable element and 'm' is the symbol of the
particular gene.
*t repetitive_element. A natural non-coding repetitive element of a drosophilid.
This is used for non-coding elements for which evidence that they are transposable
is lacking. Includes satellite DNA sequences (satDNA).
*t virus_symbiont_pathogen: Viruses, symbionts, parasites and pathogens of
Drosophila. Includes components of such entities.
*t safe_element: Structural and/or non-coding functional elements. Includes
telomeres, centromeres, DNA amplification sites, scaffold sites, and boundary
elements. Does not include non-coding elements of other classes, e.g., promoters,
enhancers, introns, which are considered to be components of the default class
of genes.
*t sire_element: Synthetic and/or isolated regulatory elements, restricted
to regulatory elements widely used in an isolated context, such as mobile
activating elements. Does not include regulatory elements used to drive reporter
genes. An example is the synthetic GMR (glass multimer reporter) element,
as used in transgene constructs designed to activate adjacent endogenous genes.
*t fusion_gene: Genes synthesized as a fusion of two, or more, coding regions,
at least one being a Drosophila gene. Each component of a fusion gene has
a single gene entry as either a normal gene, foreign_gene or a fusion_gene.
*t foreign_gene: A gene from a non-drosophilid.
*t foreign_fusion: A fusion gene, as defined above, that includes a coding
region from a foreign gene.
*t foreign_transposon: Used for foreign transposons brought into Drosophila
for the purposes of analysis or transgene generation.
*t foreign_transposable_element_gene: A gene carried by a transposable element
of a non-drosophilid.
*t safe_element.f: A structural and non-coding functional element from a species
other than D. melanogaster, frequently used in D. melanogaster
transgene constructs.
*t sire_element.f: A SIRE (see definition above) from another species.
*t uncertain: Many genes in FlyBase have information
that is only of historical interest, because they were identified by mutations
that are now lost, were never sequenced, etc. It is important that searches
of FlyBase genes not return an oppressive number of hits to such genes. Hence,
we have developed a complex criterion by which genes can be classified as "uncertain", and such genes are only included in search hits if this is specifically
requested on the Genes query form.
This criterion is purely rule-based, so the set of "uncertain" genes is recomputed
at each genes update. The rules that comprise the criterion may be modified
in the future, in the light of experience of how well they describe only the
appropriate genes. The current criterion is that a gene is marked uncertain
if and only if:
(it is a Drosophila melanogaster standard gene, not a virus, transposable
element, etc.)
AND ( (it appeared in a prior, but not the current, release of the genome)
OR ( (it has no references dated post-1989 except for Lindsley
and Zimm and/or FlyBase curation)
AND (it has no GO (*d, *f or *F) data)
AND (it has no DNA/RNA or protein sequence or gene
order data)
AND (it has no alleles in any stock lists held by
FlyBase, either held by public stock centers or the community)
AND (its most specific mutant phenotype is shared
by alleles of at least nine other genes)
AND ( ( (it has no complementation data against aberrations)
AND ( (it has no cytological or within-chromosome
meiotic mapping data)
OR ( (its cytological range
of uncertainty exceeds two lettered subdivisions)
AND (its most
recent reference is pre-1970) ) ) )
OR
( (its gene symbol is an anonymous lethal or sterile)
AND ( (its cytological range of
uncertainty exceeds two lettered subdivisions)
OR (its most recent reference is
pre-1970) ) ) ) ) )
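The nesting of the criterion above is easier to check when it is written as a boolean function. The following Python sketch is illustrative only: the GeneFacts class and all of its attribute names are hypothetical stand-ins for the facts named in each clause, not FlyBase field names, and the sketch is not FlyBase's own implementation.

```python
from dataclasses import dataclass


@dataclass
class GeneFacts:
    """Hypothetical per-gene flags, one per clause of the criterion."""
    is_dmel_standard_gene: bool
    in_prior_release_only: bool            # in a prior, but not the current, genome release
    has_post_1989_refs: bool               # ignoring Lindsley and Zimm and FlyBase curation
    has_go_data: bool
    has_sequence_or_gene_order_data: bool
    has_alleles_in_stock_lists: bool
    phenotype_shared_by_nine_other_genes: bool
    has_complementation_data: bool         # against aberrations
    has_cyto_or_meiotic_map_data: bool
    cyto_range_exceeds_two_subdivisions: bool
    latest_ref_pre_1970: bool
    is_anonymous_lethal_or_sterile: bool


def is_uncertain(g: GeneFacts) -> bool:
    # First alternative of the innermost OR: no complementation data and
    # either unmapped or only vaguely mapped by an old reference.
    poorly_mapped = not g.has_complementation_data and (
        not g.has_cyto_or_meiotic_map_data
        or (g.cyto_range_exceeds_two_subdivisions and g.latest_ref_pre_1970)
    )
    # Second alternative: anonymous lethal/sterile that is vague or old.
    vague_anonymous = g.is_anonymous_lethal_or_sterile and (
        g.cyto_range_exceeds_two_subdivisions or g.latest_ref_pre_1970
    )
    historical_only = (
        not g.has_post_1989_refs
        and not g.has_go_data
        and not g.has_sequence_or_gene_order_data
        and not g.has_alleles_in_stock_lists
        and g.phenotype_shared_by_nine_other_genes
        and (poorly_mapped or vague_anonymous)
    )
    return g.is_dmel_standard_gene and (g.in_prior_release_only or historical_only)
```

Naming the two innermost alternatives separately keeps the parenthesization of the prose rule visible, which helps if the rules are later modified.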
*t multicopy_xxx (where "xxx" is another *t). Some genes are present in the
Drosophila genome as clusters of genes, whose products are so similar that
they are traditionally referred to by a single name. This is true of various
RNA-encoding genes such as 5SRNA and bb, and also of the histones in 39D.
It is necessary in some circumstances to refer to individual members of such
clusters. Hence, the "gene" 5SRNA is given the gene class "multicopy_cytosolic_ribosomal_RNA_gene"
to indicate its composite nature, and individual members of the 5SRNA cluster
are given the gene class "cytosolic_ribosomal_RNA_gene". The individual genes,
as and when they are instantiated, are given symbols of the form "cluster_symbol:member_symbol",
e.g. "5SRNA:CG98765". The multicopy gene and its member genes are linked by
"relationship to other genes" data of the form "member genes: 5SRNA:CG98765,
..." and "member gene of: 5SRNA".
*t xxx_cassette (where "xxx" is another *t). There are various types of "composite
gene" which are defined as such not because all their members are virtually
identical, but because of some functional or structural relationship. The
simplest example is the encoding of two or more products on the same transcript,
such as Adh and Adhr. Here, it is necessary to call Adh a gene, but it is
also necessary to have a gene record to house the molecular information that
Adh and Adhr share, such as transcript length. Hence, a gene record exists
for CG32954 that has the gene class "gene_cassette", and the genes Adh and
Adhr have the default class "gene". As with the multi-copy genes, it is necessary
to link the cassette to its parts, and this is done with "relationship to
other genes" data of the form "encoded by: CG32954" and "encoded genes: Adh,
Adhr". Two other types of "cassette" are also defined: a cluster of closely
related genes with similar function and gene expression, currently used only
for the histone complex HIS-C, and a natural transposable element, whose component
genes are those that it carries. (In the case of transposable elements we
retain "transposable_element" as the gene class, as opposed to "transposable_element_gene_cassette").
Finally it should be noted that "multicopy_xxx" and "xxx_cassette" can be
combined. The existing cases of this are bb, Ybb and HIS-C. For example, bb
has *t multicopy_cytosolic_ribosomal_RNA_gene_cassette and links to the genes
18SRNA, 28SRNA and 5.8SRNA by "encoded genes" lines; both bb and its components
also -- potentially -- have member genes. Moreover, the RNAs are encoded genes
of Ybb as well as of bb.
- *A. Alleles.
Each allele record begins with a *A field with the gene and allele symbol.
*e and *i fields, for the full allele name and synonyms, are used as for
the gene records.
Syntax: *A gene_symbol<up>allele_symbol</up>
*e allele_name
e.g. *A bb<up>G2</up>
*e bobbed of Goldschmidt
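As a rough illustration of the *A syntax, the following Python sketch splits such a line into its gene and allele symbols. The function name is hypothetical, and the pattern assumes that the allele symbol is always wrapped in a single <up>...</up> pair, as in the example above.

```python
import re

# Matches "*A gene_symbol<up>allele_symbol</up>" as given in the syntax above.
_ALLELE_LINE = re.compile(r"^\*A\s+(?P<gene>.+?)<up>(?P<allele>.+)</up>\s*$")


def parse_allele_line(line):
    """Return (gene_symbol, allele_symbol), or None if the line is not a *A line."""
    match = _ALLELE_LINE.match(line.strip())
    if match is None:
        return None
    return match.group("gene"), match.group("allele")
```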
For some loci Lindsley and Zimm
(1992) gave only cross-references to Lindsley
and Grell (1968) or Bridges and
Brehme (1944) for lost alleles. FlyBase has included the data as published
in these earlier catalogs.
There is one class of 'allele' that FlyBase treats in a non-traditional way,
that of alleles named as a consequence of a variegating position effect. By
definition, these do not affect the structure of the gene, only its expression.
For this reason position effect alleles are not included in the genes file.
The aberration which gives rise to the position effect is, of course, in the
aberrations file and the fact that it causes a position effect (or not) is
noted in the *V lines of that file.
There are a few exceptions to this policy. A handful of alleles
may or may not be due to a position effect; the absence of any cytological
description of their chromosomes makes it impossible to tell. In these cases
their records will include a *k line as follows: *k may be due to position
effect variegation of normal allele.
- *v. Information
on availability. If a publication reports that an allele is lost, that information
is recorded in the *v field. Note that not all such reports in the literature
are authoritative.
- *o, *O, *R.
Origin of alleles. The *o field holds the data on the 'origin' of an allele,
usually the mutagen used to induce it, but the origin may well be 'natural
variant'. A controlled vocabulary is used in *o. This controlled
vocabulary includes the CAS
Registry Numbers of chemicals.
Syntax: *o mutagen
e.g. *o spontaneous
*o ethyl methane sulfonate
Where the value in *o begins 'in vitro construct' this field is bipartite,
reflecting the type of in vitro mutagenesis employed to create that allele:
*o in vitro construct | regulatory fusion
*o in vitro construct | site directed
The legal entries
for this field are listed in controlled-vocabularies.txt within the Documents
section, along with all other mutagen terms.
The *O field is for the chromosome on which the mutation was induced or the
progenitor allele name (e.g., for revertants). This field is only used if
the progenitor is relevant to the derivative. The values in this field will
be valid FlyBase allele or aberration or transposon insertion symbols. Where
a *O field houses more than one value, each followed by " \?", this signifies
that the progenitor chromosome is one of the named alternatives.
*R is miscellaneous data about an allele's origin, for example that it was
simultaneously induced with another mutation, or information about the genotype
of the progenitor which is irrelevant to the derivative. This is a formatted
free text field.
- *Q. Carries
miscellaneous inter-allele information as free text.
- *C. Cytology
of alleles. The *C field holds the information about the cytology of the allele,
either that the 'Polytene chromosomes are normal' or comments about possible
cytological abnormalities.
- *P. Associated
aberration. Holds the symbol of the aberration for those alleles caused by
an aberration break. If an allele is associated with but separable from an
aberration then that data will be in the *R field. If an allele was induced
in an aberrant chromosome, then that is indicated in the *O field.
- *G. Insertion
chromosome associated with allele. Transposons or transgene constructs thought
to be responsible for a mutation are recorded in the *G field. Transposons
and transgene constructs are named according to the rules set out in the FlyBase
nomenclature document.
For example, an unmarked P-element is named P{}, the lArB
transgene construct is P{lArB}, a copia element, copia{}.
Insertions of unidentified transposons have the symbol *{}. Following
the closing brace is the allele symbol (identical to the preceding *A field);
the complete symbol (e.g., P{lArB}wgNZ) is the designation
of the insertion chromosome.
- *N. Synonym
for insertion recorded in *G.
- *I. Transposon
or transgene construct that carries an allele. An allele being carried on
a transposon/transgene construct, as opposed to being caused by its insertion,
is denoted by the symbol of the transposon/transgene construct appearing in
a *I field under the allele, e.g., *I P{lArB} under Adh+t3.2.
- *L. Synonym
for transposon or transgene construct recorded in *I.
- *k. Mutant phenotype.
This holds the phenotypic description of the mutant allele. This description
is restricted to alterations of the anatomy and organismal function of the
mutant, and does not include gene expression pattern data. (This contrasts
with the use of 'phenotype' in the GO term evidence code 'inferred from mutant
phenotype' which does encompass expression pattern data - see *d, *F and *f).
The *k field is free text, except for the following classes of information:
*k Phenotypic class: This field can be multi-component, storing information
about the recessive/dominant, conditional and stage-specific aspects of the
allele in addition to the phenotypic class into which the allele falls. Vertical
bars separate the components:
*k Phenotypic class: lethal | embryonic | maternal effect | recessive
An allele can legitimately have multiple '*k Phenotypic class:' lines.
*k Phenotypic class: lethal | recessive
*k Phenotypic class: flightless | dominant
Where a genotype appears in curly brackets at the end of the line, that phenotypic
class of the allele is dependent on the {second site} genotype in the brackets.
*k Phenotypic class: visible | dominant { Scer\GAL4how-24B }
Where a '(with allele)' statement appears at the beginning of the line that
phenotypic class is particular to the allelic combination of the allele that
is the subject of the report and the allele (of the same gene) stated in the
'(with allele)' statement.
*k Phenotypic class: (with fafFO8) visible
*k Phenotype manifest in: This field describes the body part affected by the
mutant allele, using the body part terms as listed in the controlled vocabulary.
*k Phenotype manifest in: wing vein L5
Where a genotype appears in curly brackets at the end of the line, the phenotype
in that body part is dependent on the {second site} genotype in the brackets.
*k Phenotype manifest in: wing { Scer\GAL4dpp.blk1 }
The presence of a term in this field means simply that the named structure
can demonstrate a mutant phenotype as a consequence of the mutant allele.
Thus for maternal effect alleles, the embryo in which the named body part
is affected is not necessarily mutant for that allele in question, though
its mother was. Also, the phenotype need not be 100% penetrant and expressed
for the affected body part to be recorded in a 'Phenotype manifest in:' field.
Terms can be combined using an & symbol:
Phenotype manifest in: cuticle & procephalon
Phenotype manifest in: scutellum & macrochaetae
Where a '(with allele)' statement appears at the beginning of the line, the
affected body part is particular to the allelic combination of the allele which
is the subject of the report and the allele (of the same gene) stated in the
'(with allele)' statement.
*k Phenotype manifest in: (with fafFO8) eye
*k Mode of assay: This field is mandatory for all alleles that have '*o in
vitro construct'. The possible entries in this field are:
*k Mode of assay: In transgenic Drosophila
*k Mode of assay: Whole-organism transient assay
*k Mode of assay: Drosophila cell culture
*k Mode of assay: In transgenic Drosophila (allele of one drosophilid species in genome of another drosophilid)
*k Mode of assay: Whole-organism transient assay (allele from one drosophilid species assayed in another drosophilid)
*k Mode of assay: In transgenic Drosophila (allele of foreign species in genome of drosophilid)
*k Mode of assay: Whole-organism transient assay (allele of foreign species assayed in drosophilid)
The capture, storage and reporting of phenotypic data is discussed in Phenotypic
Data in FlyBase, Drysdale (2001).
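The structured *k lines described above ('Phenotypic class:' and 'Phenotype manifest in:') can be taken apart mechanically. The Python sketch below is one possible reading of the format, not a FlyBase tool: the function name and the returned keys are hypothetical, and it assumes that a '(with allele)' symbol contains no closing parenthesis and that a second-site genotype contains no closing brace.

```python
import re

_K_LINE = re.compile(
    r"^\*k (?P<label>Phenotypic class|Phenotype manifest in):\s*"
    r"(?:\(with (?P<with_allele>[^)]+)\)\s*)?"       # optional leading (with allele)
    r"(?P<body>[^{]*?)\s*"                           # the controlled terms
    r"(?:\{\s*(?P<second_site>[^}]*?)\s*\})?\s*$"    # optional trailing {genotype}
)


def parse_k_line(line):
    """Parse one structured *k line; return None for free-text or other *k lines."""
    # Collapse whitespace so that a line wrapped before its closing brace still parses.
    match = _K_LINE.match(" ".join(line.split()))
    if match is None:
        return None
    # Components of a phenotypic class are separated by '|', body parts by '&'.
    separator = "|" if match.group("label") == "Phenotypic class" else "&"
    terms = [t.strip() for t in match.group("body").split(separator) if t.strip()]
    return {
        "label": match.group("label"),
        "with_allele": match.group("with_allele"),
        "terms": terms,
        "second_site": match.group("second_site"),
    }
```

Lines such as '*k Mode of assay:' and free-text *k lines are deliberately left unparsed and yield None.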
- *S. Genetic
interaction information on alleles
*S Genetic interaction (class, effect):
*S Genetic interaction (anatomy, effect):
*S Genetic interaction (effect, class):
*S Genetic interaction (effect, anatomy):
*S Genetic interaction: free text
These 'Genetic interaction' fields store information about phenotypic class
and affected body parts for mutant combinations of genetically interacting
alleles. The interacting allele is indicated in the curly brackets {}. Phenotypic
class and Anatomical term values are as for *k fields.
*S Genetic interaction (class, effect): visible, enhanceable { ml[1] }
*S Genetic interaction (anatomy, effect): eye, enhanceable { ml[1] }
*S Genetic interaction (effect, class): enhancer, visible { S[1] }
*S Genetic interaction (effect, anatomy): enhancer, eye { S[1] }
The capture, storage and reporting of phenotypic data is discussed in Phenotypic
Data in FlyBase, Drysdale (2001).
- *x, *T, *E.
References. *x fields, in both gene and allele records, are references.
Syntax: *x FBrfnnnnnnn == abbreviated_reference
e.g., *x FBrf0036029 == Saigo et al., 1981, Cold Spring Harbor Symp. Quant.
Biol. 45:815--827
The FBrf number is the unique reference identifier number from references,
which also includes the full reference.
*T lists recent review(s). For each gene, this is the list of all the reviews
published in the last four years which were determined by FlyBase curators
as having that gene as a significant topic, except that the list is truncated
to more recent years when that still leaves at least three references (for
example, if there are two dated 1999, two dated 1998, two dated 1997 and two
dated 1996, then only the two from 1999 and the two from 1998 are listed).
The most recent are placed first.
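The truncation rule for *T can be read as: take whole years, newest first, and stop as soon as at least three reviews have been collected. A minimal Python sketch under that reading follows. The function name, the (year, reference) tuple layout and the interpretation of "the last four years" as the current calendar year plus the three before it are all assumptions made for illustration.

```python
from collections import defaultdict


def recent_reviews(reviews, current_year, window=4, minimum=3):
    """reviews: iterable of (year, reference_id) pairs; returns the *T selection."""
    by_year = defaultdict(list)
    for year, ref in reviews:
        # Keep only reviews from the last `window` calendar years (assumed reading).
        if current_year - window < year <= current_year:
            by_year[year].append(ref)

    selected = []
    for year in sorted(by_year, reverse=True):   # most recent year first
        selected.extend((year, ref) for ref in by_year[year])
        if len(selected) >= minimum:             # enough reviews: drop older years
            break
    return selected
```

With two reviews in each of 1999, 1998, 1997 and 1996, this keeps only the four from 1999 and 1998, matching the example in the text.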
The *E field is always a duplicate of a *x field within the same record. It
is a device to tie particular data to a particular reference. The data fields
then immediately follow the *E field.
The referenced block of fields is terminated by the next *E or *A field, or
the end of record line (#).
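The way *E ties data fields to a reference can be sketched as a small grouping routine. The Python below is an illustration under stated assumptions, not FlyBase code: the function names are hypothetical, each field is assumed to occupy one logical line, the FBrf identifier is assumed to have seven digits as in the syntax above, and a *E line is assumed to have the same layout as the *x line it duplicates.

```python
import re

_REF = re.compile(r"^\*[xE] (?P<fbrf>FBrf\d{7}) == (?P<citation>.+)$")


def parse_reference(line):
    """Return (FBrf_id, abbreviated_reference) for a *x or *E line, else None."""
    match = _REF.match(line.strip())
    if match is None:
        return None
    return match.group("fbrf"), match.group("citation")


def group_by_reference(record_lines):
    """Collect the fields that follow each *E line, as (FBrf_id, [fields]) pairs."""
    blocks = []
    current = None
    for raw in record_lines:
        line = raw.rstrip("\n")
        if line.startswith("#"):          # end of record line
            break
        if line.startswith("*E "):        # a new referenced block starts
            ref = parse_reference(line)
            current = (ref[0] if ref else line[3:].strip(), [])
            blocks.append(current)
        elif line.startswith("*A "):      # a new allele record ends the block
            current = None
        elif current is not None:
            current[1].append(line)
    return blocks
```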
- *w. Discoverer.
This field contains the name of the individual who identified the allele,
or the name of the leader of the group that identified the allele.
B.1.4.
Nontraditional alleles
In addition to 'alleles' in the traditional
sense, FlyBase now names and curates further classes of allele so that phenotypic
or expression pattern data can be captured for in vitro construct alleles and
alleles of reporter (e.g., Ecol\lacZ), effector (e.g., Scer\FLP)
or toxin (e.g., Rcom\DT-A) genes. Since these alleles have not historically
been named by researchers, and have been named by FlyBase, their presentation
in FlyBase requires some explanation:
B.1.4.1.
Alleles of reporter genes
Alleles of reporter genes currently
fall into two main classes, those resulting from enhancer trap experiments,
and those resulting from promoter (or other regulatory region) analysis, where
a fragment is used to drive the expression of a reporter gene. Ecol\lacZ
will be used for illustration.
Enhancer trap results:
- The enhancer trap construct causes
an allele of a gene and is expressed in a pattern consistent with insertion
in that gene. The resulting aberration will be described with the format P{A92}hL43a,
and the Ecol\lacZ allele symbol is of the format Ecol\lacZh-L43a.
- The reporter gene reflects the
expression of a gene without causing a mutant allele of that gene. The resulting
aberration will be described with the format P{PZ}P2023-44, where
P2023-44 reflects the insertion identifier, and the Ecol\lacZ
allele symbol is of the format Ecol\lacZhh-P2023-44.
- The reporter gene reflects the
expression of an undescribed gene/enhancer. The resulting aberration will
be described with the format P{lacW}1.28, and the Ecol\lacZ
allele symbol is of the format Ecol\lacZ1.28.
Promoter analysis results:
- Generally some fragment of a gene
promoter/intron/3'-region is fused to the reporter gene. In this case the
allele symbol is of the form 'gene symbol.fragment descriptor' e.g., Ecol\lacZeve.prox54.
The fragment descriptor reflects that used in the publication, even though
this may be long and cumbersome (this may not be strictly true for such alleles
curated early in the FlyBase project).
- Where a reporter gene is simply
described in a publication as being driven by, e.g., an arm promoter,
the symbol of the Ecol\lacZ allele is 'arm.PI', where I
is the first letter of the surname of the first author of the paper, e.g.,
Ecol\lacZarm.PV for 'Ecol\lacZ arm promoter construct
of Vincent'.
- For logistical reasons some promoter
fusions involving reporter genes such as Ecol\lacZ, though technically
protein fusions, are simply treated as alleles of the reporter gene. The symbol
for the additional gene(s) contributing to the fusion is indicated as part
of a superscript, e.g., Ecol\lacZP\T.A92. In these special
cases there is no distinction made between promoter fusions and protein fusions
in the gene name.
B.1.4.2.
Alleles of ectopically expressed Drosophila gene products
Products of genes may be ectopically
expressed due either to juxtaposition with different regulatory sequences in
the genome (as a result of being inserted into different-than-wild-type locations
by chromosome rearrangement or P element transposition) or due to in vitro construction
creating a different constellation of regulatory sequences than in wild type.
By analogy with alleles of Ecol\lacZ
for enhancer traps, P-element-borne insertions of genes e.g., w or
ve that have a qualitatively distinct _position-dependent_ mutant phenotype
will be curated as new alleles of e.g., w or ve, e.g., veStg
caused by a particular insertion of P{HS-rho}, P{HS-rho}Stg.
The 'in vitro construct' ectopic
expression alleles currently fall into two main classes, one component or two
component systems:
One component systems:
Gene A is expressed from a promoter of gene B. The allele is typically generated
by in vitro construction. In such cases the allele symbol is of the format 'gene-Agene-B.PI',
e.g., phylsev.PC or 'gene-Agene-B.fragment descriptor'
where the author includes a promoter fragment descriptor, e.g., phylninaE.GMR.
An occasional exception is made for
promoter fusions that are widely used to provide essentially wild-type gene
function; these alleles have the mini-gene '+m construct' designation
(see below) prepended to an, e.g., heat shock designation, e.g., w+mW.hs.
It is common that authors report
a construct where e.g., ftz is expressed under a 'heat shock' or Hsp70
promoter, while providing no further details about the nature of the promoter.
For these cases the allele symbol hs.PI is employed, e.g., Antphs.PZ
for 'Antp heat shock construct of Zeng'. An 'hs' designation should be reserved
for when the heat inducible, not just the minimal, promoter fragment is used.
Where the allele is both altered
in its coding region and being expressed from an ectopic promoter the sequence
'alteration.promoter' is used in the allele designation, e.g., tor13D.hs.sev
to denote the coding sequence of tor13D expressed from a
heat shock (undefined) promoter with a sev enhancer. An exception to
this rule is made for Tags, which appear as the last component of the allele
symbol (see below).
Two component systems:
- GAL4-UAS The allele symbol
for the gene whose expression is dependent upon Scer\GAL4 shall include
'Scer\UAS' and an identifier. The identifier should reflect the construct
as named by author e.g., l(1)scDeltaB.Scer\UAS. In the
absence of any other identifier '.cIa' is used, where 'c' stands for construct,
I for the first author's last name initial and 'a' for the first in the series
(subsequent ones will be b, c, etc.), e.g., ase Scer\UAS.cBa
for 'Scer\UAS construct a of Brand'.
- FLP-FRT Alleles of Scer\FLP
are named as outlined above for reporter genes, and allele symbols of genes
whose expression is dependent upon that of Scer\FLP include 'Scer\FRT'.
B.1.4.3.
Alleles of ectopically expressed non-Drosophila effector products
A note on ribozymes: FlyBase has
a foreign ribozyme gene, symbol LTSV\RBZ. Alleles of LTSV\RBZ
capture the different variants; e.g., a heat inducible ftz-targeted
ribozyme will be named LTSV\RBZhs.ftz (syntax 'promoter.target gene').
'+m' minigenes
The minigene allele designation is
used in its narrow sense, i.e., where the only difference between the allele
and the wild type is the removal of more or less non-essential sequences. Thus
the minigene allele symbol designation is reserved for those cases where the gene's
own promoter is driving its expression.
The minigene allele symbols begin
with 'm', for minigene, and are followed by the construct symbol used
in the publication. If no construct symbol has been used, the string 'mIa'
where 'm' stands for minigene, 'I' for the first author's
last name initial and 'a' for the first in the series is used. If the
function of the minigene is stated to be indistinguishable from that of the
wild type allele, the 'm' is preceded by a '+'.
Tags
Genes can be modified by the
addition of a tag allowing the product to be identified, purified, or targeted
to a particular subcellular distribution. Tagged alleles have the syntax 'gene-symbol
x.T:y', where x is an identifier and y is
the name of the tag (e.g., Hsap\MYC, T:Ivir\HA1, SV40\nls2),
as in CycBB1.T:Hsap\Myc. Where a tag is artificial, the
species prefix Zzzz is used, e.g. T:Zzzz\His6.
B.1.4.4.
Classical alleles engineered into transgene constructs, including rescue constructs
A class of alleles is named to capture
fragments of genomic DNA used in rescue constructs. The symbol for the rescuing
allele begins with '+t'. This is followed by the length as stated
by the authors, or by the construct symbol if the length is not given; if neither
is stated, the symbol is '+tIa', where
't' stands for transgene, 'I' for the first author's last
name initial and 'a' for the first in the series. When rescue is incomplete, the construct is
considered as carrying a mutant allele. The allele designator is then the construct symbol,
'length of genomic insert.tIa' if no symbol is given, or 'tIa'
where neither length nor construct symbol is stated.
When a classic allele, e.g., wa,
is put into a transgene construct it will get a new designation, e.g., wa.tIa,
to reflect its transgenic environment, where 't' stands for transgene,
'I' for the first author's last name initial and 'a' for the
first in the series.
FlyBase is, of course, happy to discuss
and advise on use of nomenclature of these non-traditional alleles.
B.1.5.
Protein and transcript symbols and exon naming
FlyBase strives to link curated information
to particular protein and transcript species. In order to maintain the data
in this way, it is necessary to assign different symbols to each gene product.
Proteins, transcripts and exons are symbolized as follows.
Protein symbols are of the form cact[+]P482
where the gene symbol and allele designation are followed by a capital P and
the size of the protein in amino acids. When the size in amino acids is not
known, the size in kiloDaltons is used, e.g. grh[+]P120kD. If no size is known,
the symbol is followed by a capital letter to distinguish products that are
known to be different, e.g. Sh[+]PA, Sh[+]PB. If multiple proteins of the same
size and divergent sequence are characterized, the symbols are followed by different
capital letters, e.g. abc[+]P345A, abc[+]P345B. A generic protein symbol, e.g.
cact[+]P, is used to capture properties that cannot be specifically attributed
to one protein product of a gene.
Transcripts are similarly named.
The gene symbol and allele designation are followed by a capital R and the size
in kb, e.g. cact[+]R2.2. Where possible the size as estimated by northern blot
is used. If not, the size of the longest cDNA is used and this is indicated
in the transcript table. For transcripts of unknown size, the symbol is followed
by a capital letter, e.g. grh[+]RA, grh[+]RB. For multiple transcripts of similar
size and divergent sequence, the symbols are followed by different capital letters,
e.g. abc[+]R1.7A, abc[+]R1.7B. A generic transcript symbol, e.g. cact[+]R, is
used to capture properties that cannot be specifically attributed to one particular
transcript of a gene.
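The protein and transcript naming rules above amount to simple string assembly. The following Python sketch reproduces the examples given in the text. The function and parameter names are hypothetical, sizes are passed as strings so that they appear exactly as written, and '+' is assumed as the default allele designation purely for convenience.

```python
from typing import Optional


def protein_symbol(gene: str, allele: str = "+", aa: Optional[str] = None,
                   kd: Optional[str] = None, letter: str = "") -> str:
    """Build a protein symbol: size in amino acids, else in kD, else none."""
    if aa is not None:
        size = aa
    elif kd is not None:
        size = f"{kd}kD"
    else:
        size = ""   # unknown size, or the generic protein symbol
    return f"{gene}[{allele}]P{size}{letter}"


def transcript_symbol(gene: str, allele: str = "+", kb: Optional[str] = None,
                      letter: str = "") -> str:
    """Build a transcript symbol: size in kb if known, plus an optional letter."""
    size = kb if kb is not None else ""
    return f"{gene}[{allele}]R{size}{letter}"
```

For example, protein_symbol("grh", kd="120") gives grh[+]P120kD and transcript_symbol("abc", kb="1.7", letter="A") gives abc[+]R1.7A; calling either function with only a gene symbol yields the generic symbol, e.g. cact[+]P.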
In general, all of the exons comprising
a gene are numbered consecutively from 5' to 3'. Where exons partially overlap,
they are given the same number with a suffix, e.g. 2a, 2b.
In some cases, it is not possible
to attribute a characteristic to an individual gene product. For example, expression
pattern data is often obtained with probes or antibodies that recognize more
than one product of a gene. It is not rigorously known where each individual
gene product is expressed. In addition, it is often not possible to determine
which transcript observed on a northern blot corresponds to a particular cDNA.
In these cases, the data is linked to a generic protein or transcript entity
for that gene.
B.1.6.
FlyBase Genes - Interactive Fly Cross Index
FlyBase has developed a hierarchical
view of the Interactive Fly entitled "Interactive Fly
Hierarchy: cross-index to FlyBase genes". This hierarchy is accessible from
both Allied Data and Genes.
The hierarchy provides an overview of the Interactive Fly with links
to the specific Interactive Fly pages, as well as gene lists with links
to the individual gene records in FlyBase and the Interactive Fly.
This permits searches for genes grouped according to developmental and cellular
pathways and functions.
B.1.7.
Differences and omissions from Lindsley and Zimm (1992)
All errors found in Lindsley
and Zimm (1992) have been corrected. A list
of these errors, sorted by page number, is in the file errors.txt in the
Redbook section of FlyBase Documents.
The material in the DELETION
MAP tables in the 'lethals' section of Lindsley
and Zimm (1992) is not included; these tables are available in the Redbook
section of Maps. The
tables of Lindsley and Zimm (1992)
have been broken down and the data incorporated into the text of the relevant
gene record. All references
within the body of a text entry of Lindsley
and Zimm (1992), i.e., not in the references: field, have
been duplicated into the references: field. With a very few
exceptions all references are to be found in the FlyBase Bibliography
and carry FlyBase reference ID numbers. The
molecular map figures in Lindsley
and Zimm (1992) are not included in genes, but are available in Redbook/Images
sections of Documents. Lindsley and Zimm often used introductory sections for groups of genes that are, in some way or other, related (see e.g. the record for ASC, page 50). This structure is not suitable for FlyBase, and this information has, in general, been repeated in each of the relevant individual gene records.
B.2.
Synonyms
FlyBase maintains a record of synonyms
for gene, allele, aberration, transposon and transgene construct symbols that
have appeared in the literature and stock center stock lists. Files with tables
of synonyms and their corresponding "valid" symbols are found in the relevant
sections of FlyBase.
Synonyms have several different causes.
Sometimes two workers give the same symbol to two different genes, requiring
one of these to be changed. Sometimes two workers, either by accident or design(1),
give two different symbols to the same gene; in that case the symbol that has priority should
be used. Many of the synonyms arise, however, as a consequence of minor variation
in the way a gene's or aberration's or transposon's or transgene construct's
symbol is written (e.g., with lower case or capital first letter), or by error,
either in the literature or in these tables. In some cases it has been difficult
to decide whether a name is a gene synonym or just an allele name (this is especially
so for lethals). We have taken a very liberal attitude to synonyms and, when
in doubt, have included a name as a synonym even when it may more correctly
be an allele name.
The files are:
- Genes/gene-synonyms
-- For genes and their alleles. This plain-text file contains a list of synonyms
and valid symbols as 'synonym-symbol > valid-symbol', one synonym per line.
There are often many synonyms per valid symbol. Superscripts are indicated
in the text by <up> (beginning of superscript) and </up> (end
of superscript). Greek
letters are also encoded in the text (for example, alpha appears as &agr;).
- Aberrations/aberration-synonyms
-- This plain-text file contains a list of synonyms and valid symbols as 'synonym-symbol
> valid-symbol', one synonym per line.
- Transgene-construct/transposon-synonyms
(not yet available)
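As a minimal sketch of how the synonym files described above might be read, the
following Python splits each 'synonym-symbol > valid-symbol' line and groups
synonyms under their valid symbol. It assumes the separator is written with a
space on either side, as in the description; the sample lines, the function
names and the bracketed rendering of superscripts are illustrative assumptions,
and only the &agr; code is taken from the text.

```python
import re

# Only &agr; (alpha) is documented above; other codes would need to be added.
GREEK = {"&agr;": "\u03b1"}


def parse_synonym_line(line):
    """Split 'synonym-symbol > valid-symbol' into a (synonym, valid) pair."""
    synonym, sep, valid = line.rstrip("\n").partition(" > ")
    if not sep:
        return None  # not a synonym line
    return synonym.strip(), valid.strip()


def to_display(symbol):
    """Render <up>...</up> as [..] and expand known Greek-letter codes."""
    symbol = re.sub(r"<up>(.*?)</up>", r"[\1]", symbol)
    for code, char in GREEK.items():
        symbol = symbol.replace(code, char)
    return symbol


def build_synonym_index(lines):
    """Map each valid symbol to the list of synonyms that point to it."""
    index = {}
    for line in lines:
        pair = parse_synonym_line(line)
        if pair:
            synonym, valid = pair
            index.setdefault(valid, []).append(synonym)
    return index


# Hypothetical sample lines, for illustration only.
sample = ["white > w", "W > w", "w<up>apricot</up> > w<up>a</up>"]
index = build_synonym_index(sample)
```

Grouping by valid symbol reflects the statement that there are often many
synonyms per valid symbol; a reverse lookup (synonym to valid symbol) could be
built in the same loop.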
1.
"Scientists would rather use each other's toothbrushes than each other's
nomenclature.", Keith Yamamoto.
B.3.
Species other than D. melanogaster
FlyBase includes data on all species
from the family Drosophilidae. The 'default' species is D. melanogaster
and all symbols and names of genes, alleles, aberrations and clones from other
species have a prefix of the form Nnnn\, where N is the initial
letter of the genus (e.g. D for species in the genus Drosophila)
and nnn is normally the first three letters of the specific epithet
(e.g., sim for simulans). In formal terms all symbols and
names from D. melanogaster have the prefix Dmel\, but this
is usually omitted.
Species prefixes are also used for
non-melanogaster genes introduced into D. melanogaster via a transgene
construct, including Ecol\lacZ, Scer\GAL4 and Avic\GFP.
In addition, genes carried by natural transposable elements have the transposon
symbol as a 'species' prefix, for example, P\T, the gene for P-element
transposase. To find genes such as these in a Genes search, change the 'Species'
option from the default 'Dmel' to 'All'.
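The prefix convention can be sketched in a few lines of Python. This is an
illustration only: the function names are hypothetical, and default_prefix
follows the "normal" rule stated above (genus initial plus the first three
letters of the specific epithet), which the text implies has exceptions.

```python
def default_prefix(genus, epithet):
    """Build the usual 'Nnnn' prefix from a genus and a specific epithet."""
    return genus[0].upper() + epithet[:3].lower()


def split_species_prefix(symbol, default="Dmel"):
    """Return (prefix, bare_symbol).

    Symbols without a backslash are treated as D. melanogaster, whose
    'Dmel\\' prefix is usually omitted.
    """
    prefix, sep, rest = symbol.partition("\\")
    if not sep:
        return default, symbol
    return prefix, rest


print(default_prefix("Drosophila", "simulans"))  # Dsim
print(split_species_prefix("Scer\\GAL4"))        # ('Scer', 'GAL4')
print(split_species_prefix("P\\T"))              # ('P', 'T')
```

Note that the same split works for transposon 'species' prefixes such as P\T,
since only the first backslash is used as the delimiter.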
A list of all of the names
and abbreviations used by FlyBase for species is included in the Nomenclature
section of FlyBase. The species-abbreviations.txt file has the syntax:
taxgroup | abbreviation | genus | species name | common name | comment
At present, five different 'taxgroups'
are recognized:
drosophilid (i.e., species in the
family Drosophilidae), non-drosophilid eukaryote, prokaryote, transposable element
and virus (including prokaryote viruses). The file is sorted in this order.
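A minimal sketch of reading a line in the six-field syntax given above. The
field names in the tuple mirror that syntax; the sample line, the handling of
empty fields and the function name are assumptions made for illustration, not
a description of the actual file contents.

```python
from typing import NamedTuple


class SpeciesEntry(NamedTuple):
    taxgroup: str
    abbreviation: str
    genus: str
    species: str
    common_name: str
    comment: str


def parse_species_line(line):
    """Split one pipe-delimited line into a SpeciesEntry, or None if malformed."""
    # maxsplit=5 keeps any further '|' characters inside the comment field.
    fields = [f.strip() for f in line.rstrip("\n").split("|", 5)]
    if len(fields) != 6:
        return None
    return SpeciesEntry(*fields)


# Hypothetical line; the common name and comment are left empty.
entry = parse_species_line("drosophilid | Dsim | Drosophila | simulans | | ")
print(entry.abbreviation, entry.genus, entry.species)
```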
We stress that identity of gene symbol
between two species cannot be used to conclude 'homology' of genes. Where known,
or strongly suspected, information concerning homologous genes within the family
is present in a *M field of the genes file.
FlyBase has made only limited efforts
to curate genes, alleles and aberrations from species other than D. melanogaster
for the period before 1989. We have back curated from D.I.S. and some
primary papers and reviews that have come to hand. For four species we have
incorporated the efforts of others:
- D. ananassae
- From a catalog of mutations and chromosome aberrations of Drosophila
ananassae provided to FlyBase by Y.N. Tobari. This was the text of Chapter
11 'Catalog of mutants' by D. Moriwaki and Y.N. Tobari in Y.N. Tobari (editor)
Drosophila ananassae: Genetical and biological aspects (Japan Scientific
Societies Press, Tokyo and Karger, Basel, 1993). We thank Professor Tobari
for his permission to make these data available in FlyBase and for providing
the data on disk.
- D. buzzatii
- From a catalog of the genes and mutations of Drosophila buzzatii
provided to FlyBase by J.S.F. Barker. This was based on Schafer, Fredline,
Knibb, Green and Barker (1993) Genetics and linkage mapping of Drosophila
buzzatii. J. Hered. 84:188--194. Where no phenotypic description
is given, it is similar to that for the mutant of the same name in D.
melanogaster, and is assumed homologous. Unless otherwise specified,
visible mutants were detected through inbreeding to F2 or F3 the progeny of
wild-caught females (Spencer, 1949). Most of the visible mutants are in the
collection of the Tucson Drosophila
Species Stock Center. FlyBase thanks Professor Barker for providing these
data on D. buzzatii.
- D. virilis
- From a list prepared for FlyBase by Professor H. Kress.
- D. subobscura
- From the lists in Krimbas (1993) 'Drosophila subobscura, Biology, Genetics
and Inversion Polymorphism'. Verlag Dr. Kovac, Hamburg.
We
would be happy to hear from colleagues who are able to review records from species
other than D. melanogaster. We thank Jerry Coyne for reviewing the
records for D. simulans, D. mauritiana and D. sechellia.
B.4.
Genetic objects from non-Drosophila species that are included in Drosophila
Sequences from many other organisms
are often included in artificial constructs introduced into the genome of Drosophila.
FlyBase calls these 'foreign genes' and they have symbols that indicate both
the species of origin and the nature of the element, e.g., Hsap\BMP4,
the BMP4 gene from humans. A list
of the species abbreviations used is to be found in the Nomenclature
section.
Just as two or more different Drosophila
genes can be engineered into a gene fusion so can two or more different foreign
gene coding regions. These are called 'foreign fusion' genes, e.g., Avic\GFP::Ecol\lacZ,
a coding fusion of Aequorea victoria GFP and the E. coli lacZ
gene.
Structural and non-coding elements
('SAFE elements', see B.1.3.) from non-Drosophila species are called foreign
SAFE elements. The most common group of foreign SAFE elements are short sequence
tags used to mark genes or their products (including epitope tags). These have
symbols that begin with 'T:', e.g., T:Hsap\MYC, the 'myc' epitope tag.
Artificial sequences are also classed as SAFE elements, e.g., T:Zzzz\His6
for a DNA sequence encoding a run of six histidine residues.
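The symbol conventions in this section ('T:' for tags, '::' between the parts
of a fusion, and a species prefix on each part) can be pulled apart with simple
string operations. The following is a sketch under those assumptions only; the
function name and the shape of the returned dictionary are hypothetical.

```python
def describe_foreign_symbol(symbol):
    """Break a symbol into tag flag, fusion flag and (prefix, name) components."""
    is_tag = symbol.startswith("T:")
    body = symbol[2:] if is_tag else symbol
    components = []
    for part in body.split("::"):
        prefix, sep, name = part.partition("\\")
        # A part without a backslash carries no explicit species prefix.
        components.append((prefix, name) if sep else (None, part))
    return {
        "is_tag": is_tag,
        "is_fusion": len(components) > 1,
        "components": components,
    }


print(describe_foreign_symbol("Avic\\GFP::Ecol\\lacZ"))
print(describe_foreign_symbol("T:Hsap\\MYC"))
```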
A limited class of regulatory elements
from foreign species are classified as foreign SIRE elements (synthetic and/or
isolated regulatory elements). This class is restricted to regulatory elements
widely used in an isolated context, for example as mobile activating elements.
Examples are the synthetic multiple UAS[[G]] elements, restricted to cases in
which they are used within transgene constructs designed to activate adjacent
endogenous genes.
The class of element is indicated
in a *t line, which, for the objects described in this section, can have the
following values:
- *t foreign_gene
- *t foreign_fusion
- *t safe_element.f
- *t sire_element.f
Each class, or any combination of
classes, can be extracted from the database by using the complex query form
in Genes with the "Class" option changed from the default "all" to one or more
(ctrl+click to add terms) of these categories.
For each class the origin of the
gene is described in star-coded format in a *u line with the following syntax:
*u Foreign sequence; species == <species_name>; gene|sequence|sequence tag|function tag|epitope tag == <gene symbol>; <database_abbreviation:database_id>.
Attempts are first made to cross-reference
to another genetic database (e.g., OMIM, GDB, MGD). If such a link cannot be
made then we attempt to establish a link with a protein or nucleic acid sequence
database. The database abbreviations used will be found in Reference
Manual F: Links To and from FlyBase. The gene name or symbol will be enclosed
with single quotation marks if no cross-reference to another genetic database
can be found. If no cross-reference can be established then a brief literature
reference to the object will be included within the 'comment' field. In the
case of epitope tags the comment field will normally include the 'name' of the
antibody recognizing the epitope and a literature reference.
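A *u line in the syntax given above can be parsed with a regular expression.
The sketch below is an assumption-laden illustration: it treats the
cross-reference as optional, treats a symbol in single quotation marks as the
"no genetic database cross-reference" case, and uses made-up example lines
(the accession 000000 is a placeholder, not a real identifier).

```python
import re

U_LINE = re.compile(
    r"^\*u Foreign sequence; "
    r"species == (?P<species>[^;]+); "
    r"(?P<kind>gene|sequence tag|function tag|epitope tag|sequence)"
    r" == (?P<symbol>[^;]+?)"
    r"(?:; (?P<xref>[^;]+?))?\.?$"
)


def parse_u_line(line):
    """Return the parts of a '*u Foreign sequence' line, or None if no match."""
    match = U_LINE.match(line.strip())
    if not match:
        return None
    symbol = match.group("symbol")
    quoted = symbol.startswith("'") and symbol.endswith("'")
    database = accession = None
    if match.group("xref"):
        database, _, accession = match.group("xref").partition(":")
    return {
        "species": match.group("species"),
        "kind": match.group("kind"),
        "symbol": symbol.strip("'"),
        "quoted": quoted,  # True when no genetic-database cross-reference exists
        "database": database,
        "accession": accession,
    }


# Hypothetical lines; the accession is a placeholder.
with_xref = parse_u_line(
    "*u Foreign sequence; species == Homo sapiens; gene == BMP4; OMIM:000000."
)
without_xref = parse_u_line(
    "*u Foreign sequence; species == Aequorea victoria; gene == 'GFP'."
)
```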
B.5.
Maps
The Maps
section of FlyBase contains map-based browsing and query tools and data. See
Reference Manual C: Using FlyBase on the Web for
further information on these tools.
FlyBase uses Bridges' revised maps
for the banding patterns of the polytene chromosomes. See:
Bridges, 1938, J. Hered. 29: 11--13
(X chromosome), Bridges and Bridges, 1939, J. Hered. 30: 475--476 (2R), Bridges,
1941, J. Hered. 32: 64--65 (3L), Bridges, 1941, J. Hered. 32: 299--300 (3R),
Bridges, 1942, J. Hered. 33: 403--408 (2L).
B.5.1.
Sequence-based Maps
Release
3.1 of the Drosophila melanogaster genomic sequence (January 2003)
includes new and reviewed annotations of genes (protein-coding and non-coding),
as well as real sequences and annotations of naturally occurring transposable
elements. FlyBase has developed several tools with which to view and manipulate
sequence annotations, sequence-aligned supporting data, and other sequence-aligned
data sets.
B.5.1.1.
Genome Browser, gbrowse
The FlyBase Genome Browser, gbrowse, provides a Web-based view of a specified region of the genome; the location of that region along the chromosome arm is indicated graphically. The region of interest can be specified by gene symbol, CG identifier, a mapped feature (such as a Drosophila Gene Collection cDNA clone, BAC genomic clone, P element insertion, or protein sequence accession in the SPTR database with BLASTX similarity to the genomic sequence), or a coordinate extent on a scaffold accession or chromosome arm. One can also input a sequence string using the Fly BLAST server and, from the BLAST results list, link to the alignment in the gbrowse view. The extent of the region (from 100 bp to 5 Mbp) can be controlled by the user using the zoom option. Adjacent regions can be viewed using the scroll option. Annotated genes, supporting data, and other sequence-aligned data (e.g., P-element insertion sites and Affymetrix oligos) are shown as color-coded features flanking the central sequence axis. Features can be identified by mousing over the relevant graphic and viewing the feature name in the status bar; when the view is zoomed in sufficiently, or the gene labelling option is selected, the gene annotations are labelled. Included below the gbrowse view of the region are BAC in situ images. The "Display Settings" panel can be used to control the subset of features displayed, the width of the image, and other display options. For example, one can choose to have gene symbols displayed or can choose to have an expanded view of the aligned data. The data behind the gbrowse view, including cytological locations and GO gene function descriptions, can be downloaded in various flat-file formats: tabulated, FASTA, GAME-XML or GFF.
B.5.1.2.
Drosophila Genome Overview
The FlyBase tool Drosophila Genome Overview is an extension of gbrowse that allows users to browse entire chromosome arms at once. The default view displays cytological numbered divisions, the tiling BAC genomic clones, and the annotated sequence scaffolds in GenBank. Clicking on the BAC or GenBank scaffolds takes users to the gbrowse view of the region. Users can also choose to display all of the genes along a chromosome arm, as well as cDNAs that align to the genomic sequence, P element insertions, transposable elements, and sequencing gaps. The width of the map can be adjusted, which is necessary when viewing these finer, optional features.
B.5.1.3.
Apollo
A more flexible and interactive view
of the same data provided in gbrowse is possible using the Apollo
genome browser and annotator. Use of this tool requires that the Apollo
software be downloaded and installed locally; data are then loaded via a Web
connection from the annotation database. Data can be saved locally in the form
of GAME-XML flat files and subsequently reloaded into Apollo. A detailed and
comprehensive user
guide for Apollo is available. This tool provides several options for viewing
annotations and features down to the sequence level, and allows searches for
specific genomic or amino acid sequence strings. Apollo also provides editing
options, including sequence-level modifications of exon extents, addition of
alternative transcripts, deletion of existing annotations, modifications involving
merging or splitting existing annotations, and addition of comments associated
with specific genes or transcripts. There are many options for customizing the
format of the view and the data sets; these may be saved as user preferences.
B.5.1.4.
Sequence Features from the Literature
Sequence features captured from the
literature have not yet been integrated with annotation data from the whole
genome. These gene annotation maps, when available for a given gene, are included
in the Abridged and Full gene report formats (but not in the default Synopsis
report format) in the Cytogenetic Maps section. Browseable
lists of Sequence Features Maps are also available from the Maps
section of FlyBase.
B.5.1.5.
Molecular Maps
Molecular Maps represent genomic
data presented in one publication and may contain information about gene structure,
aberration breakpoints, mutations, regulatory elements, rescue fragments, and
sites of transposon/transgene construct insertions. The coordinate scale used
in the publication is preserved. If known, the orientation of the map on the
polytene chromosomes is indicated as positive or negative with respect to the
direction of increasing cytological designation.
Map data are extracted from graphical
maps in figures, from text, and from DNA sequence (either from the paper itself
or from sequence database entries that cite the paper as a reference). Wherever
possible, exact sizes or positions of entities are used as extracted from sequence
data or author statements. In the absence of sequence based data, coordinates
are estimated from the graphical maps in figures. An indication of the source
of the data is provided in the comments.
Links to Molecular Maps are included
in the Abridged and Full gene report formats (but not directly from the default
Synopsis report format). If a Molecular Map is available for that region a link
titled Molecular Map will
be present toward the top of the gene report.
Using the Molecular Maps
The maps are presented with the coordinate
scale at the top. If no coordinate system was assigned by the author, maps are
displayed with 0 at the left end with coordinates increasing from left to right.
Mapped features are displayed with other features of a similar nature in labeled
sections beneath the coordinate scale. Thick lines represent the known extent
of a given entity and thin lines represent the range of uncertainty of the map
location (as indicated by the author). For deficiencies, thick lines represent
DNA that is deleted. Dashed lines extending off the map indicate that part of
the mapped feature is not contained within the map as drawn. Triangles or diamonds
represent points. Note that coding sequences are simply represented as an arrow
connecting the translation start and stop points and do not show the intervening
transcript structure.
Many of the features are hyperlinked
to reports which provide further details about the mapped entities. All of the
aberrations, mutations, and rescue fragments are linked to the appropriate aberration
or allele reports. Coding sequence (CDS), exon, and regulatory elements are
linked to a gene report. In the near future, the CDS and exon links will go
to the Protein and Transcript page of the gene. Note that all of the mapped
exons for a gene are indicated but the order in which they are combined to generate
transcripts is not indicated as yet on the maps. This information can be found
near the top of the transcript report in a table of transcript characteristics
in a column titled Exons.
The portion of the mapped DNA that
has been sequenced and is associated with a sequence database record is indicated
as a GenBank
record and is marked with the Accession number. If a sequence is presented as
a solid line, it represents genomic DNA and indicates that the entire indicated
region has been sequenced. If a sequence is drawn with dashed lines at each
end, then some subset of the indicated region has been sequenced. The latter
are most often cDNA sequences which correspond only to the exonic portions of
the map. The sequence entries are hyperlinked to the appropriate GenBank
record.
Below the map, a table lists the
actual coordinates in kb of each of the mapped features. Coordinates listed
as X--Y indicate entities that extend from point X to point Y. Note that for
entities that extend off the map, the table does not give an indication that
the entity extends beyond the map but rather lists the left-most or right-most
coordinate of the map. X~Y indicates the uncertainty in the location of a particular
entity or breakpoint. Coordinates with 4 places after the decimal point (e.g.,
0.1005) represent the locations of entities that map between bases on the wild
type map such as transposon and transgene construct insertions.
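The coordinate notations in the table (X--Y for an extent, X~Y for a range of
uncertainty, and four decimal places for a position between bases) can be
recognized mechanically. The following is a sketch only: it assumes each table
cell holds exactly one of these three forms, and the function name and result
layout are hypothetical.

```python
import re

NUM = r"-?\d+(?:\.\d+)?"
PATTERNS = [
    ("extent", re.compile(rf"^({NUM})--({NUM})$")),    # X--Y
    ("uncertain", re.compile(rf"^({NUM})~({NUM})$")),  # X~Y
    ("point", re.compile(rf"^({NUM})$")),              # single coordinate
]
FOUR_PLACES = re.compile(r"^-?\d+\.\d{4}$")


def parse_coordinate(text):
    """Classify a coordinate string (in kb) and return its numeric end points."""
    text = text.strip()
    for kind, pattern in PATTERNS:
        match = pattern.match(text)
        if match:
            raw = match.groups()
            values = [float(v) for v in raw]
            return {
                "kind": kind,
                "start": values[0],
                "end": values[-1],
                # Four decimal places mark a site between bases, e.g. an insertion.
                "between_bases": any(FOUR_PLACES.match(v) for v in raw),
            }
    return None


print(parse_coordinate("1.2--3.4"))
print(parse_coordinate("2.0~2.5"))
print(parse_coordinate("0.1005"))
```

Keep in mind that, as noted above, an extent that runs off the map is listed
with the map's own end coordinate, so the parsed values cannot by themselves
show that an entity continues beyond the map.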
B.5.2.
Cytological Maps
Cytological
Maps display both cytologically-mapped and sequence-mapped objects. Gene
data from both the literature and from full-genome annotation are included.
Using the Cytological Maps tool
From the Mainmap
click one of the object bars (genes, clones, etc.) over the map location of
interest. A map of that polytene chromosome division will be returned. Clicking
on the symbol of a map object will retrieve the FlyBase report for that object.
Other classes of objects can be added to or subtracted from the current map
using the check boxes provided, and other chromosome divisions can be displayed
using the navigation options provided.
Purple lines behind an object's symbol
indicate the range of uncertainty of its map position. These ranges may run
together in dense regions of the map, hiding the end points. Retrieve an object's
report to see the range of its cytological location.
In some cases you will see buttons
labeled More above the list of entity names. This is a compromise to
provide a full list of entities at a map position without expanding the size
of these maps too much. Clicking this button provides a full list of items within
that map section.
B.5.3.
Gene Order Maps
The Gene
Order Maps section contains maps that communicate both gene order and cytological
location. There are two formats: files whose names end in '.ps' are suitable for
downloading and printing on a PostScript printer, while those ending in '.txt' are
preferable for viewing in a web browser. Their format is documented in detail
in the file geneorder.doc in
the same folder.
Using the Gene Order Maps
The gene-order map communicates both
gene order and cytological location. Presenting this on a genome-wide map is
rather different from presenting it for a small, well-mapped region, so a novel
format has been adopted, which is documented here.
1. Cytological range
Each gene whose cytological location is known with a range of uncertainty less
than about two number divisions is written on a vertical line whose extent is
the range of uncertainty. Overlapping lines are staggered. To this extent, in
other words, the format is as in the EofD. A gene whose symbol exceeds nine
characters may cross more than one line; the line it is attached to always goes
through the second character of the symbol.
Bands are drawn with differing sizes,
but this is not in any way related to amount of DNA per band, as it is on the
EofD. It is only a function of how much data we need to place there.
2. "Limiting" genes
In addition, at either end of the line there is the symbol for a gene that is
known to lie to the indicated side of the gene in the middle of the line. Two
points must be emphasized about these "limiting" genes: they are not be