GENOME DATABASE Richard Durbin MRC Laboratory of Molecular CB2 2QH Ma


Richard Durbin, MRC Laboratory of Molecular CB2 2QH, Mathematique and CRBM, BP

Systematic genome mapping

An adapted form of this program, called PMAP, was made publicly available to the research community, together with regularly updated copies of the in house data. This rapid and complete access to the map, even in incomplete form, proved extremely popular and successful, soon becoming a standard resource when cloning genes. It therefore seemed sensible to extend the same approach to a single database that would hold sequence, physical and genetic map data, and references, that we could use in house and could distribute freely within the worm community. This led directly to the database program ACEDB, which is described in this chapter.

Rather than being limited to the specific data that we could envisage when we started, we decided to write a management system that allows frequent extension of the database schema. As a result it has been comparatively easy to adapt ACEDB to be used by other genome projects working with other organisms. At the time of writing (March) there are public databases for the plant Arabidopsis thaliana and for the bacteria that are the pathogens for leprosy and tuberculosis. Several other databases for public distribution are being prepared. ACEDB is also being used at several sites, for example for physical mapping results from human and Drosophila projects. It is being used as one of the core pieces of the IGD (Integrated Genomic Database) project (European Data Resource for Genome Research, Heidelberg), which plans to bring together all public human genome data in an integrated form. It is both being used as the primary graphics front end of IGD, and as one of the alternative data storage systems.

OVERVIEW OF FEATURES

In this section we will give a view of ACEDB as a biological user sees it. The program is very graphical. It works using a windowing system, and has different types of window corresponding to the different types of map. The maps and other windows are linked in a hypertext fashion, so that clicking on an object will display further information about that object in the appropriate sort of window. For example, clicking on a chromosome displays its map; clicking on a gene in the genetic map displays information about the gene's phenotype, references etc; clicking on a clone displays the physical map around it; clicking on a sequence displays the sequence.

The internal structures of the system, which are more general, and which contain the more novel features, will be described in later sections. There are interactive tools for these more general features available to the interface, but discussion of them will be delayed until later. In this section we describe the displays for the different classes in the database.

There is one window that is always present when running ACEDB: the main control window. It contains a list of the available classes of objects, such as Papers, Authors etc. To use the program in its most simple fashion one selects a class with the mouse and types the name of the object in the yellow text box, then hits the appropriate key, at which point the object will be displayed in a window. If a template is given using wildcard characters, then a list of all possible matching objects is given, from which one can select. There is also a menu accessible with the right mouse button that gives access to more complex features, including the query system discussed later.
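As an illustration only, the class-plus-name lookup of the main window could be sketched as below. This is not ACEDB code: the `DATABASE` table, its contents and the `lookup` helper are hypothetical, and the shell-style `*` and `?` wildcards provided by Python's `fnmatch` module are an assumption about the template syntax.

```python
from fnmatch import fnmatchcase

# Hypothetical in-memory stand-in for the database: class name -> object names.
DATABASE = {
    "Gene": ["ced-4", "eat-4", "unc-32", "unc-45", "unc-64"],
    "Paper": ["paper-1", "paper-2"],
}

def lookup(class_name, template):
    """Return the names in one class that match a template.

    A plain name yields at most one match; a template with wildcards may
    yield several, from which a user would then select.
    """
    names = DATABASE.get(class_name, [])
    return [name for name in names if fnmatchcase(name, template)]

print(lookup("Gene", "ced-4"))   # exact name
print(lookup("Gene", "unc-*"))   # wildcard template
```

An unknown class simply returns an empty list, which mirrors the idea that nothing is displayed when no object matches.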

Figure 1. Main window, Genetic window of one gene

GENETIC MAP

The genetic map display gives access to genes and rearrangements on the scale of the chromosome. The units are centimorgans. The map shares its layout with many ACEDB maps. On the left side, acting like an annotated scrollbar, is a representation of the whole chromosome with a green cursor region indicating the active region. This area is zoomed up to fill the rest of the display, which has various zones, including space for chromosomal rearrangements, an indication of physically mapped regions, a scale, and the genes themselves. The map can be easily scrolled and zoomed using either the scrollbar cursor, or buttons that can be pressed with the mouse. There are also buttons to allow genetic mapping data to be displayed, and the display has been extended to also allow calculations to be made on the map.

If one double clicks on any item a new window pops up with text information about it. This text information is displayed as a tree structure. The section below on "Organisation of data" describes this tree structure further; it is in fact the basic means of storing information in ACEDB.

The physical map is the primary display for looking at the clones within clone contigs. It is (currently) a horizontal map, based very much on the map display of an earlier program. At the bottom is an annotated scrollbar showing a wider region of the contig. There are a number of zones showing different types of clone, such as probes. Below these are spaces for items that have been attached to the map by localising them onto a clone, and for annotations, which can be freely attached.

Also in this region is a green horizontal indicator relating to the genetic map. Again, whenever an item is selected, all the related items are lit up; e.g. selecting a gene might show the clone that contains it, and any remarks attached. If an item is double clicked, then text information about the item is displayed.

HYBRIDISATION GRID

This window provides access to one of the forms of raw data used in making the map, which is hybridisation of probes to a grid of clones. The grid is displayed schematically, and the hybridisation pattern is entered by clicking on squares. Once this is done, the loci on the map are determined automatically. There are facilities for comparing one pattern with another, and for displaying the results of real and hypothetical pooled probings. This window was used to position 1100 nematode cDNAs on the C. elegans map, and is being used by a human mapping project for data acquisition.

The sequence display function in ACEDB is also extensive. As well as being able to display the actual nucleotide sequence and protein translations of it in all frames, there is a wide range of different schematic display options available. These allow features to be shown at different scales, with simplified results on a megabase scale at one extreme and details such as short tandem repeats at the other. Exactly which features are shown can be selected. As well as displaying annotations and precalculated information, the sequence window provides several types of analysis. In particular there is a program (Green, personal communication) for predicting gene structures from sequence, based on likelihood predictions of splice sites and codon usage. These are in fact used to annotate the nematode genomic sequence before submission. There are also site detection tools, and tools for extracting subsequences.

The schema is that of an object-oriented system. This type of organisation of data is much more intuitive than relational tables. Because of this we have been able to get direct input from working biologists in the construction and refinement of the data structure.

Each object has a unique identifier, its name, which is followed by attributes organised into a tree. The nodes at branchpoints of the tree are all named. The branches end in pointers to other objects, or in data, which are numerical values or character strings; a bare branch ending can indicate the presence of a property. There is also the possibility of constructed subobjects, similar to expanding a pointer in place recursively into a full object with branches, rather than maintaining merely a pointer to the object. Arbitrary comments can be attached freely at any point in the tree.

Objects are allocated to classes. Each class has a model that specifies the maximal set of branches, and the types of data or pointer permitted at each position. Individual objects, which are instances of the class, in general use only part of the branching pattern. This approach gives a triple advantage. Poorly studied objects, which are the most numerous, take little space in memory, which strongly improves performance. To extend the schema, which happens frequently, all that needs to be done is to add another branch to the model; of course none of the existing objects uses it, but they remain valid because there is no requirement for objects to contain any particular branch. Finally, the ability to add personal comments that are ignored by the algorithms allows annotation of sources and reservations without affecting the data.

Here is a part of the definition of the class Gene:

    Clone ?Clone XREF Gene
    Physical pMap UNIQUE XREF Gene UNIQUE
    Sequence ?Sequence XREF Gene
    Genetic gMap ?Chromosome XREF Gene UNIQUE
    Location ?Laboratory

An example object of this class is ced-4, whose listing begins with the branch Reference.
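The relation between a class model and its sparse objects can be sketched as follows. This is a minimal illustration under stated assumptions, not the ACEDB implementation: the flat dictionary layout, the `Lethal` and `Expression` tags, the placeholder object names and the helpers `unknown_tags` and `is_valid` are all hypothetical, and a real model is a nested tree rather than a flat table.

```python
# Hypothetical, much simplified model for the class Gene: each tag maps to
# the kind of value allowed on that branch. None marks a bare tag whose
# presence alone records a property.
GENE_MODEL = {
    "Reference": "Paper",        # pointer to an object of class Paper
    "Allele": "Allele",
    "Clone": "Clone",
    "Sequence": "Sequence",
    "Location": "Laboratory",
    "Lethal": None,              # invented example of a bare tag
}

def unknown_tags(obj, model):
    """Return the tags used by obj that the model does not define."""
    return sorted(tag for tag in obj if tag not in model)

def is_valid(obj, model):
    """An object is valid if it uses only branches the model allows.

    Nothing is required: an object may fill in any subset of the model.
    """
    return not unknown_tags(obj, model)

# Two objects of the same class with no branches in common.
well_studied = {"Reference": ["paper-1"], "Allele": ["allele-1"],
                "Location": ["Cambridge"]}
poorly_studied = {"Sequence": ["sequence-1"]}

# Extending the schema adds a branch; existing objects stay valid.
extended_model = dict(GENE_MODEL, Expression="Text")

for obj in (well_studied, poorly_studied):
    print(is_valid(obj, GENE_MODEL), is_valid(obj, extended_model))
print(unknown_tags({"Colour": ["green"]}, GENE_MODEL))
```

Because validity only checks that each used branch exists in the model, adding `Expression` to the model never invalidates an object written before the change.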

The example object ced-4 fills in branches such as Allele, Molecular-information (with a Clone), Location (Cambridge) and Mapping-data.

Note that each object belongs to one class. We have deliberately not chosen multiple inheritance. This concept is at the same time hard to implement efficiently and very difficult to use, since the inheritance graph easily becomes tangled, with potential conflicts amongst super-classes. Our alternative to multiple inheritance is to restrict the number of classes, but allow a wider variety of objects within each class. In this system, it is possible that two objects in the same class have few branches in common. For example consider two genes, the first studied by classical genetics and the second defined by similarity to a protein in another organism. These objects could be considered as archetypes of subclasses of the class Gene. But such simple cases are relatively rare: a gene could have data for some fields of one and some of the other, and one is rapidly led to a combinatorial explosion in the number of classes. Our approach lets us capture without difficulty all the intermediate cases, and we only need around fifty classes to hold nearly a hundred thousand heterogeneous objects.

As well as the classes of tree objects described above, we have type A classes, which contain general arrays of data, and which allow more rigid but efficient storage of data such as sequence. The schema itself is stored as objects within the database, which gives a simplified startup procedure and dynamic editing of the schema.

Although data are stored internally in a binary form as the trees discussed above, they are entered via simple ascii files known as ace files. Each paragraph in these files corresponds to one object, and paragraphs must be separated by one or more blank lines. The first line indicates the class and name of the object to be created or edited. Following lines start with the name of a branch, followed by numerical or text data, or names of other objects to point to; these are interpreted according to the model. Keywords specify actions to be taken, with the default action being to add the data into the database. Comments can also be included in the file. Let us define a sequence:

    Sequence ACT3
    elegans actin gene
    Library EMBL
    (with special reader)

Further paragraphs of the example carry the comments "here we change the name" and "here we change one of the authors of the paper", and show the deletion of an Author.

All the objects have a public unique ascii identifier, the doublet of class and name, so that these edit files can refer to precise objects. If the object is not known yet, it is created, else it is modified. If a delete or rename operation finds nothing to delete or rename, it moves on silently. If an instruction makes no sense according to the model, for example adding to a non-existent branch point, the user is warned and the instruction is not carried out. These properties also allow reading of the same file twice without changing the database contents, and transfer of information between databases that do not match exactly, something very hard with traditional database systems. They even allow transfer of commonly meaningful data between systems whose schemas differ.

As well as reading in data, one can also export any set of objects in ace file form. A program, acediff, takes as input two such ace files, and produces a third that would have the effect of transforming a database containing the data as in the first file into one containing the data in the second file. This can be used to generate update files for remote copies of a central database, and in fact this is how we distribute the nematode database. There are also facilities within ACEDB for certain types of specialised data.
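The ace file conventions and the idea behind acediff can be sketched in a few lines. This is a toy model and not the real parser or the real acediff: the functions `apply_ace` and `ace_diff`, the flat tag and value pairs, and the sample data are hypothetical, and the `-D` delete keyword and `//` comment marker are assumptions, since the transcript above does not preserve the actual keywords.

```python
import re

def apply_ace(db, text):
    """Apply ace-style text to db, a dict {(class, name): set of (tag, value)}.

    Paragraphs are separated by blank lines; the first line of each names
    the object, later lines add data unless prefixed by the delete keyword.
    """
    for paragraph in re.split(r"\n\s*\n", text.strip()):
        lines = [ln.split("//")[0].strip() for ln in paragraph.splitlines()]
        lines = [ln for ln in lines if ln]
        if not lines:
            continue
        class_name, _, obj_name = lines[0].partition(" ")
        entries = db.setdefault((class_name, obj_name.strip()), set())
        for line in lines[1:]:
            delete = line.startswith("-D ")
            if delete:
                line = line[3:].strip()
            tag, _, value = line.partition(" ")
            pair = (tag, value.strip())
            if delete:
                entries.discard(pair)   # nothing to delete: move on silently
            else:
                entries.add(pair)       # adding the same data twice changes nothing
    return db

def ace_diff(old, new):
    """Emit ace-style text that turns the data in old into the data in new.

    Objects missing entirely from new are not handled in this sketch.
    """
    out = []
    for key in sorted(set(old) | set(new)):
        removed = sorted(old.get(key, set()) - new.get(key, set()))
        added = sorted(new.get(key, set()) - old.get(key, set()))
        if removed or added:
            lines = ["%s %s" % key]
            lines += [("-D %s %s" % pair).rstrip() for pair in removed]
            lines += [("%s %s" % pair).rstrip() for pair in added]
            out.append("\n".join(lines))
    return "\n\n".join(out)

OLD = """
Sequence ACT3
Library EMBL
Title draft title   // hypothetical data

Paper paper-1
Author author-a
"""
NEW = """
Sequence ACT3
Library EMBL
Title actin gene

Paper paper-1
Author author-b
"""
old, new = apply_ace({}, OLD), apply_ace({}, NEW)
patch = ace_diff(old, new)
print(patch)
updated = apply_ace(apply_ace({}, OLD), patch)
print(updated == new)
```

Because additions are set insertions and deletions of absent data are ignored, applying the same patch a second time leaves the toy database unchanged, which is the property the text relies on for distributing updates.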

FASTA format is one of the specialised data formats handled.

ACEDB was designed to be efficient. The system is therefore built using C under Unix, and we ended up writing our own database management system. We also wrote a graphical library on top of X, which could be reimplemented using other underlying systems (such as the Apple Toolbox), and a number of utilities useful to us. ACEDB is therefore not tied to any particular machine, nor even (after a little work) to a particular windowing system. It contains, among other things, an internal help system and crash recovery (when possible). Some of these functions are described in this section.

The conceptual unit of transfer between disk and memory is the object. Since the objects are trees or arrays of arbitrary size, we wrote a relatively complex module to pack them into and out of fixed size blocks: small objects are brought in in groups that each fill a block, and large ones are split across several blocks. The system is speeded up by two levels of cache, the first acting on the fixed size blocks, the second on developed trees. These both work as last out queues. Since the class of an object is known at all levels, it is possible to selectively optimise the handling of certain classes. Within a particular session, all objects that are rewritten go to new disk locations, which allows us to store multiple versions of an object, and to recover from crashes to the last verified save state of the database.

Each object is identified externally by the doublet of class and name. Internally it is represented by a unique 32 bit key, 8 bits being for the class, and 24 for the location within the class. Linking these are hash tables that map the known names of each class to keys; there is then a set of indices containing the disk address of the object, its address in the cache if it is in memory, flags indicating its edit state and other properties, a pointer back to its name, etc. There are separate index and hash tables for each class. These are loaded into memory, and take about 30 bytes per object.

Only one user at a time is permitted write access. The set of changes made until write access is given up constitutes a session. When a session is saved, first the objects are written to disk, then the indices and tables for altered classes are written (as type A class objects, and also to new locations). Finally a pointer in the superblock is changed to point to the new index information. Once this is done the database will start up with the new data; if a crash occurs before this point, it starts up with the old indices, and the old objects from before the aborted session.

All simple access to data within the program uses a subroutine package that allows access via the names of terminal branchpoints, like with the ace files. There are two steps: the first recalls an object from disk and returns a handle; the second uses this handle to recover data. Since data may be missing, these routines all return a boolean value indicating success or failure. As an example, to find the date of a paper, one might write

    void paperDate (KEY was published %d\n", name(key), year)

As well as accessing data via the direct subroutine package there is a powerful and flexible query language that allows higher level manipulation of sets of objects, which are known as keysets. The basic operations are filtering operations on a set of objects, either on the basis of their names or the data they contain, and following pointers to retrieve other objects. These operations can be combined into complex query sentences. The resulting keysets can be used in various ways; for instance, single items can be looked at interactively.
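The 32 bit key layout and the keyset filtering operations described above can be illustrated with a short sketch. It is not ACEDB code: the choice of the high bits for the class, the class numbers, the `NAMES` table and all function names are assumptions made for the example, and name templates again use Python's `fnmatch` wildcards.

```python
from fnmatch import fnmatchcase

CLASS_BITS, INDEX_BITS = 8, 24            # 8 + 24 = 32 bits in total
INDEX_MASK = (1 << INDEX_BITS) - 1

def make_key(class_id, index):
    """Pack a class number and a within-class index into one 32 bit key."""
    if not 0 <= class_id < (1 << CLASS_BITS):
        raise ValueError("class id does not fit in 8 bits")
    if not 0 <= index <= INDEX_MASK:
        raise ValueError("index does not fit in 24 bits")
    return (class_id << INDEX_BITS) | index   # class in the high bits (assumed)

def class_of(key):
    return key >> INDEX_BITS

def index_of(key):
    return key & INDEX_MASK

# Hypothetical class numbers and name table (key -> name).
GENE, PAPER = 1, 2
NAMES = {
    make_key(GENE, 0): "ced-4",
    make_key(GENE, 1): "unc-32",
    make_key(PAPER, 0): "paper-1",
}

def filter_class(keyset, class_id):
    """Keep only the keys that belong to one class."""
    return {k for k in keyset if class_of(k) == class_id}

def filter_name(keyset, template):
    """Keep only the keys whose object name matches a template."""
    return {k for k in keyset if fnmatchcase(NAMES[k], template)}

everything = set(NAMES)
genes = filter_class(everything, GENE)
unc_genes = filter_name(genes, "unc-*")
mixed = unc_genes | filter_class(everything, PAPER)   # set union of two keysets

print(hex(make_key(GENE, 1)))
print(sorted(NAMES[k] for k in mixed))
```

With 24 bits for the location, each class in this layout can address up to 16,777,216 objects, and a keyset is simply a set of such integers, so the usual set operations apply directly.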

A whole keyset can also be a starting point for further queries, it can be dumped out as an ascii ace file (see above), or it can be saved in the database with a user-specified name. Boolean set operations can also be used to combine keysets.

An important feature is that keysets can contain objects from different classes. One way this is used is via another query operation, "text search". This performs a search on all text stored in the database, and returns a list of all objects that either have names matching the search string, or contain text that matches it. For example a search for "muscle" would find genes with muscle phenotypes, papers with "muscle" in the title, proteins, etc.

The query system is available both to the user, via an interactive interface that allows saving, recovery and reuse of queries, and to the programmer, via a set of subroutines. In fact the main control window uses a limited set of these. There is another facility for data presentation based on the query package, called the "Table Maker". This allows the user to construct tables of displayed information in a fashion similar to using a spreadsheet. The difference is that, in the table maker, new columns are derived from previous columns by queries rather than by calculations. The instructions for defining a table can be built up interactively, and stored in a file once they are correct.

server/client architecture

All the discussion so far concerns the graphical version that most users see. However there is also a text version of the program, textace, that can be run from the command line. This basically contains the kernel with a simple parsing interface. Although it can be useful for extracting data over a serial line, the main purpose of this version is to enable a client/server architecture for ACEDB. The way this is done is for the server to contain a full copy of the database, and for the client to start with an empty one. When a request is made in the client, the server resolves it and sends the result to the client as an ace file, which is parsed into the client database. In fact the server is acting in exactly the same fashion as an independent textace program, except that it is connected to the client by a pair of sockets, rather than the standard input and output. This type of structure is possible because ACEDB allows meaningful data transfer between non-identical databases, via ace files.

The client database can be allowed to accumulate data during the session, acting as a local cache, or it can be restricted so that all calls for data are resolved by passing them back to the server. Of course the former can become much more efficient, but it is susceptible to data becoming stale if it is edited by another process. It is clear that when editing data, the objects must be retrieved from the server and locked there, rather than updated based on a local copy.

ACEDB is publicly available in executable form and as source files. There are three ftp sites: one in England, directory pub/acedb; lirmm.lirmm.fr, directory genome/acedb; and one with the directory repository/acedb. In each case a file gives further instructions on retrieval. The data for the current version of the C. elegans database is available in the same directories.

By making the source available (with a licence restricting commercial exploitation), and by separating the kernel from specific application code, we hope that the community of groups using ACEDB can share a single database kernel. This can be done by coordination between groups that are doing development work, and by folding kernel changes into the official release described above. With this we believe that we can continue to support a community of genome database providers, cover