jPAP Manual > IV. Defining Metadata
Before one can define a genetic model, one must define the metadata that controls the interpretation of the observations data. The task of defining metadata is accomplished by using MetEd to create a metadata document, which contains a set of Observational Variables (OVs). Each OV definition involves two types of settings: properties and attributes. Properties appear in the property pane when the node for the OV is selected, as discussed in the interface tour. Attributes are available for assignment from the Edit Menu, depending on context.
The primary property of an OV is its type. Among other things, an OV's type places constraints on the values the OV may take on. The set of such values is referred to here as the type's domain. The table below shows the type-to-domain correspondences for the current type implementations. Note, however, that within jPAP an OV type is an abstraction and is not limited to the current set of implementations. The slate of available OV types, their data characteristics, and their interpretation in modeling and analysis are expected to evolve as development of jPAP proceeds, and in principle can receive additions from third parties. This is in keeping with jPAP's double role as analysis tool and tool framework, a recurrent theme in this manual.
| OV Type Branch | OV Type Leaf | OV Value Domain | OV Fields |
|---|---|---|---|
| Discrete Outcome | N-valued Outcome | 1..N, where N is set in MetEd. If the type represents disease status, then 1=Unaffected and 2..N represent ascending levels of disease severity. | |
| Discrete Outcome | Dichotomous Outcome | A special case of the above, having an implied N of 2. Many analysis modules assume that a discrete variable is 2-valued, hence this case has received an explicit type designation. If the type represents disease status, as is the typical usage, 1=Unaffected and 2=Affected. | |
| Discrete Outcome | Gender ID | A special case of the above, where 1=Male, 2=Female. | |
| Discrete Outcome | Age Interval | An N-valued Outcome value reserved for age differences. | |
| Continuous Outcome | Continuous Outcome | Real number. | |
| Continuous Outcome | Age | Real positive number. | |
| Marker | Autosomal Marker; X-linked Marker | 1..N, where N is the number of genotypes. N is determined by the program from another setting M, the number of alleles. The latter is a property of the marker's OV definition, and is set in MetEd. While the corresponding ObsDocument input data may fit this description directly, perhaps more commonly the marker data will be provided as two adjoining columns of allele values from the domain 1..M. When the marker has been assigned the Xlinked type and the subject is a male, the second column is ignored but should contain a missing value code. When allele values rather than genotype values are supplied in the ObsDocument as outlined above, assignment of a GTYPE() transformation to the target marker OV within MetEd suffices to specify that the relevant preprocessing be performed. | |
| Marker Set | Autosomal Marker; X-linked Marker | 1..N, where N is the number of alleles | Selection of available markers |
| Outcome Set | N-valued Outcome; Dichotomous Outcome; Continuous Outcome | None | Selection of available traits |
| QTL | IBD Formats: SOLAR, Merlin, jPAP | None | Selection of available chromosomes. IBD Directory: directory containing compressed files of IBD probabilities. Lower/Upper Location: chromosomal location [cM] range. Interval: interval to step through location range. Number of Digits to create GUIDs (N): for merging pedigree ID and individual ID into one GUID, $\mathrm{ID}_{\mathrm{New}} = \mathrm{ID}_{\mathrm{Family}} \times 10^{N} + \mathrm{ID}_{\mathrm{Old}}$ |
| Shared-environment ID | Shared-environment ID | Positive integer. This should be a code identifying a common-environment "bin" which contains the individual. That is, all subjects sharing a given environmental effect receive the same code. (For instance, there could be an encoding for shared households.) A missing value code should be assigned for individuals who do not fall into a bin containing at least two subjects. | |
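The Shared-environment ID rule in the last row of the table can be illustrated with a small preprocessing sketch. This is not part of jPAP; it only shows one way to prepare such a column before import. The function name, the `households` input, and the use of `None` as the missing value marker are assumptions made for illustration; the actual missing value code is whatever the metadata document specifies.

```python
from collections import Counter

# Hypothetical missing-value marker; substitute the code defined in your metadata.
MISSING = None


def shared_environment_codes(household_by_subject):
    """Return a positive-integer bin code per subject, or MISSING for singletons.

    household_by_subject maps a subject ID to any hashable, sortable
    household label (an assumption of this sketch).
    """
    counts = Counter(household_by_subject.values())
    # Only bins with at least two subjects receive a code, numbered from 1.
    shared = sorted(label for label, n in counts.items() if n >= 2)
    code_of = {label: i for i, label in enumerate(shared, start=1)}
    return {
        subject: code_of.get(label, MISSING)
        for subject, label in household_by_subject.items()
    }


households = {"s1": "A", "s2": "A", "s3": "B", "s4": "C", "s5": "C"}
codes = shared_environment_codes(households)
```

Here subjects `s1` and `s2` share one code, `s4` and `s5` share another, and `s3`, who is alone in household `B`, receives the missing value marker.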
Among the other important properties exposed in an OV's property pane are those which determine its input and code assignments. To get more information about a property, place the mouse cursor over its name.
The attributes assignable from the Edit Menu are of three types: (1) Data supplements, (2) Transformations, and (3) Cross-references. Data supplements typically contain population data, such as allele frequencies, prevalence values, and incidence rates. Before entering data for a multi-dimensional supplement, one first must assign cross-reference attributes to identify the other OVs to be used as axes for the data. In the case of some axial OVs, such as age measurements, one will also want to assign an intervalizing transformation to map the raw input. The metadata document for the examples included in the jPAP distribution illustrates these usages.
Of special note among the transformations assignable to Continuous Outcome OVs is the Maclean power function $y = \frac{r}{P}\left[\left(\frac{x}{r} + 1\right)^{P} - 1\right]$ [Maclean et al 1976], where $x$ represents the phenotype, $P$ represents the power, and $r = 6$. It can be useful to update $P$ from the parameter estimate produced through use of certain analysis modules, on a per-model basis. See section VI.7 for information on this technique.
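For readers who want to check transformed values by hand, the power function described above can be sketched in a few lines of Python. This is an independent illustration, not jPAP's own code. The function name is hypothetical, and the handling of P near zero (using the logarithmic limit of the formula) and the check that x/r + 1 is positive are assumptions added here for numerical safety.

```python
import math


def maclean_power(x, P, r=6.0):
    """Compute y = (r / P) * ((x / r + 1) ** P - 1) for phenotype x and power P."""
    base = x / r + 1.0
    if base <= 0:
        # Non-integer powers of a non-positive base are not real-valued.
        raise ValueError("x / r + 1 must be positive")
    if abs(P) < 1e-12:
        # Assumption of this sketch: use the P -> 0 limit, r * ln(x / r + 1).
        return r * math.log(base)
    return (r / P) * (base ** P - 1.0)


y_identity = maclean_power(3.0, 1.0)   # P = 1 leaves x unchanged
y_square = maclean_power(6.0, 2.0)     # (6 / 2) * (2 ** 2 - 1)
```

With P = 1 the transformation is the identity, and x = 0 maps to y = 0 for any P, which makes both convenient sanity checks.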
Marker and Outcome Sets
Multiple markers and traits may be assigned to a given model using the Set observational variable types, Marker Set and Outcome Set. A Marker Set may be designated as Autosomal or X-Linked, just like the Marker type. An Outcome Set may be designated as N-Valued, Dichotomous, or Continuous. Once the Marker Type or Outcome Type is chosen, all metadata variables that match that type become available to add to the Set. These variables are listed as checkboxes.
Data from LINKAGE format files may be imported into the metadata. Existing data will be overwritten. Up to 50 markers or traits may be imported.
Marker Sets and Outcome Sets are assigned to a model the same way Marker and Outcome variables are. When either the evaluation or maximization commands are executed, the model is run for each item defined in the Set.
QTL Sets
A QTL type works as a Set: models are run separately for each chromosome specified and for each valid chromosomal location in the specified range, stepping at the given interval.
The location range and interval are given by the Lower Location, Upper Location, and Interval metadata entries. Location ranges and intervals are applied to all selected chromosomes. Locations outside the range of possible values for a particular chromosome are ignored.
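The way a single range and interval expand into per-chromosome locations can be sketched as follows. This is an illustration of the behavior described above, not jPAP's implementation. The function name and the `chromosome_ranges` mapping are hypothetical, and treating both endpoints of the range as inclusive is an assumption.

```python
import math


def qtl_locations(lower, upper, interval, chromosome_ranges):
    """Expand one location range into the valid locations for each chromosome.

    chromosome_ranges maps a chromosome label to its (min_cM, max_cM) span,
    a hypothetical stand-in for what the IBD files make available.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    # Count whole steps up front to avoid accumulating floating-point error.
    steps = math.floor((upper - lower) / interval + 1e-9)
    grid = [lower + i * interval for i in range(steps + 1)]
    # The same grid applies to every chromosome; out-of-range points are dropped.
    return {
        chrom: [loc for loc in grid if lo <= loc <= hi]
        for chrom, (lo, hi) in chromosome_ranges.items()
    }


locations = qtl_locations(0, 20, 5, {"1": (0, 12), "2": (4, 300)})
```

In this example chromosome `1` keeps the locations 0, 5, and 10, while chromosome `2` keeps 5, 10, 15, and 20. One model run would correspond to each retained chromosome and location pair.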
SOLAR and Merlin IBD file formats may be used. When using SOLAR-formatted IBDs, the program assumes that the files pedindex.out and pedindex.cde exist in the directory with the IBDs, and expects the IBD files to be gzip-compressed and to follow the naming convention mibd.CHROMOSOME.LOCATION.gz. The system on which jPAP is run must have gzip installed in order to run variance components linkage models.
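Before pointing jPAP at a SOLAR IBD directory, it can help to verify the layout described above. The following sketch is an external helper, not a jPAP feature, and its function names are hypothetical. It assumes that CHROMOSOME contains no period and that LOCATION is a non-negative number, optionally with a decimal part; neither detail is stated in this manual.

```python
import re
from pathlib import Path

# Assumed shape of mibd.CHROMOSOME.LOCATION.gz (see the caveats in the text).
MIBD_NAME = re.compile(r"^mibd\.(?P<chrom>[^.]+)\.(?P<loc>\d+(?:\.\d+)?)\.gz$")


def parse_mibd_name(filename):
    """Return (chromosome, location) for a matching file name, else None."""
    match = MIBD_NAME.match(filename)
    if match is None:
        return None
    return match.group("chrom"), float(match.group("loc"))


def check_solar_directory(directory):
    """List missing pedindex files and the (chromosome, location) pairs found."""
    directory = Path(directory)
    missing = [
        name
        for name in ("pedindex.out", "pedindex.cde")
        if not (directory / name).is_file()
    ]
    parsed = (parse_mibd_name(entry.name) for entry in directory.iterdir())
    found = sorted(pair for pair in parsed if pair is not None)
    return missing, found
```

Calling `check_solar_directory` on the IBD directory returns the names of any missing pedindex files together with the chromosome and location pairs recovered from the file names.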
The program needs to know where to import IBD probabilities from and the format of the files that contain them. These settings are made in the property pane under IBD Format and Directory Containing Multipoint IBD Files. To set the directory, double-click the "Directory Containing Multipoint IBD Files" property in the property pane and choose the directory from the directory selection dialog. Once the directory is selected, the program searches it for valid files of the given format.
If the SOLAR or jPAP file format is chosen, the available chromosomes and the minimum and maximum chromosomal locations will appear. For Merlin files, only the available chromosomes are displayed.
For cases where the individual identifiers must be merged with family identifiers in order to be compatible with the pedigree and observational data, select the Number of Digits for Input IDs checkbox on the properties pane and set the integer value to the appropriate number of decimal digits. If this value is called N, the new GUID will take the form
$$\mathrm{ID}_{\mathrm{New}} = \mathrm{ID}_{\mathrm{Family}} \times 10^{N} + \mathrm{ID}_{\mathrm{Old}}$$
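As a worked example of the merge formula, the sketch below computes a GUID from a family ID and an individual ID. The function name is hypothetical. The check that the individual ID fits within N digits is an assumption added here, on the reasoning that a longer ID would spill into the family digits and could produce collisions; the manual does not say how jPAP treats that case.

```python
def merge_ids(family_id, individual_id, n_digits):
    """Compute ID_New = ID_Family * 10**N + ID_Old for non-negative integer IDs."""
    if n_digits < 1:
        raise ValueError("n_digits must be at least 1")
    if not 0 <= individual_id < 10 ** n_digits:
        # Assumption of this sketch: reject IDs that do not fit in N digits.
        raise ValueError("individual ID does not fit in n_digits digits")
    return family_id * 10 ** n_digits + individual_id


guid = merge_ids(12, 34, 3)  # 12 * 1000 + 34
```

With N = 3, family 12 and individual 34 merge to 12034, so the last three digits always carry the original individual ID.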