Data and Code for the Article

Z. Barutcuoglu, R. E. Schapire, O. G. Troyanskaya,
"Hierarchical Multi-label Prediction of Gene Function,"
Bioinformatics 2006 22(7):830-836.

ALL_inputs.mat.zip is the training data in a zipped Matlab data file.

The examples are the rows.

The first row and the first column are informational, so remove them before training.

The first column contains the gene indices from geneNameIndex.txt. The ORF names of the examples should be recoverable as "trainingORFs = geneNameIndex(ALL_inputs(2:end, 1))".
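
As a rough illustration of the index lookup above, here is a minimal Python sketch. It assumes the indices are 1-based, as in Matlab, and that geneNameIndex.txt holds one name per line; the names and indices below are made-up placeholders, not values from the real files.

```python
# Placeholder stand-ins; the real names come from geneNameIndex.txt.
gene_name_index = ["ORF_A", "ORF_B", "ORF_C", "ORF_D"]

# First column of ALL_inputs with the informational first row removed.
# Assumed to hold 1-based (Matlab-style) indices into gene_name_index.
first_column = [3, 1, 4]


def lookup_orfs(names, indices):
    """Map 1-based gene indices to ORF names."""
    return [names[int(i) - 1] for i in indices]


training_orfs = lookup_orfs(gene_name_index, first_column)
print(training_orfs)
```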

The first row holds each feature's index within the dataset it comes from, but we never actually needed to use these, so they are not verified to be meaningful. In each fold of training, our training scheme then filtered the columns to remove the features with fewer than three non-zero entries for that fold, reducing ~88000 features to ~5000.
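
The stripping and per-fold column filtering described above could look roughly like the following Python sketch. It is not the original Matlab code: the function names and the toy matrix are hypothetical, and it assumes the non-zero counts are taken over the rows of one training fold.

```python
def strip_info(matrix):
    """Drop the informational first row and first column."""
    return [row[1:] for row in matrix[1:]]


def keep_columns(fold_rows, min_nonzero=3):
    """Indices of columns with at least min_nonzero non-zero entries in this fold."""
    counts = [0] * len(fold_rows[0])
    for row in fold_rows:
        for j, value in enumerate(row):
            if value != 0:
                counts[j] += 1
    return [j for j, c in enumerate(counts) if c >= min_nonzero]


# Toy matrix: first row = feature indices, first column = gene indices.
raw = [
    [0, 11, 12, 13],
    [5, 1.0, 0.0, 2.0],
    [6, 0.5, 0.0, 0.0],
    [7, 2.0, 1.0, 0.0],
    [8, 0.0, 0.0, 3.0],
]

features = strip_info(raw)
kept = keep_columns(features)
filtered = [[row[j] for j in kept] for row in features]
print(kept, filtered)
```

Here only the first feature column has three non-zero entries, so it is the only one kept.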

For the experiments, we had 3465 yeast genes/proteins and 131 selected GO nodes, but some of those nodes were isolated, without parents or children in that subhierarchy. We reported hierarchical aggregation results over the remaining 105 GO nodes, though the labels and predictions are available over the initial selection of 131.

selectedGOIDNums.txt is the list of GO ID numbers of the selected 131 nodes.

connectedGOIDNums.txt is the list of GO ID numbers of the non-isolated 105 nodes.

selectedNodeParentsDirect.mat.txt is the hierarchy edge list of parent->child pairs. These include part_of relations as well as is_a relations in the GO.
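
To illustrate how the non-isolated nodes relate to the edge list, here is a small Python sketch. It assumes the edge list can be read as (parent, child) pairs of GO ID numbers; the exact file layout is not described here, and the IDs below are placeholders.

```python
def connected_nodes(selected, edges):
    """Selected nodes that have a parent or child inside the selection."""
    selected_set = set(selected)
    touched = set()
    for parent, child in edges:
        if parent in selected_set and child in selected_set:
            touched.add(parent)
            touched.add(child)
    # Preserve the order of the selected list.
    return [node for node in selected if node in touched]


# Placeholder IDs, not real GO ID numbers.
selected = [1, 2, 3, 4, 5]
edges = [(1, 2), (1, 3), (3, 4)]

connected = connected_nodes(selected, edges)
print(connected)
```

In this toy case node 5 has no parent or child in the selection, so it is the isolated one.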

trainingORFs.txt is the list of ORF names we used. The Nth line in the training data corresponds to the Nth line in trainingORFs.txt and to the Nth line in the labels matrix.

Labels.mat contains the binary labels we used at the time (April 2004 snapshot from the GO). NaN entries are descendants of direct positive annotations, which we did not use as negative examples.

This does not apply to implicit annotations. For example, if X and Y are children of Z and a gene is annotated directly to Y, it is an implicit positive in Z and a negative in X. If it were annotated directly to Z instead, it would be NaN for X and Y and therefore not used as a negative when training them. Also note that in the GO a gene can be annotated directly to both Y and Z; in that case X is still a descendant of a direct annotation, so the gene is not a negative for X.
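
The X, Y, Z example can be traced with the following Python sketch. It is a simplified reading of the rules above, not the logic in ontology.py: it assumes a positive label (direct or implicit) takes precedence over NaN, and the function names are hypothetical.

```python
import math

NAN = float("nan")


def descendants(node, children):
    """All nodes strictly below node in the hierarchy."""
    found, stack = set(), list(children.get(node, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(children.get(current, []))
    return found


def label_gene(direct, nodes, children):
    """Label each node 1.0 (positive), 0.0 (negative) or NaN (excluded)."""
    direct = set(direct)
    below_direct = set()
    for d in direct:
        below_direct |= descendants(d, children)
    labels = {}
    for node in nodes:
        if node in direct or descendants(node, children) & direct:
            labels[node] = 1.0   # direct or implicit positive
        elif node in below_direct:
            labels[node] = NAN   # descendant of a direct annotation
        else:
            labels[node] = 0.0   # usable as a negative
    return labels


children = {"Z": ["X", "Y"]}
nodes = ["X", "Y", "Z"]

to_y = label_gene({"Y"}, nodes, children)          # X negative, Z implicit positive
to_z = label_gene({"Z"}, nodes, children)          # X and Y become NaN
to_both = label_gene({"Y", "Z"}, nodes, children)  # X is still NaN
print(to_y, to_z, to_both)
```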

All labels available for the 3465 examples in ALL_inputs.mat and Labels.mat are used for training (in folds). NaN labels are excluded from the training of the individual class they belong to. Zero labels are used as negatives. Genes with no positive annotations for the 131 classes still have positive annotations elsewhere in the GO hierarchy, so they become negative examples (zero label) for a selected class unless they are annotated to an ancestor of it, in which case they have a NaN label and are excluded.
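
As a sketch of how one class's training examples follow from these rules, the Python snippet below splits a single label column into positives and negatives and skips NaN entries. The column values are toy data and the function name is hypothetical.

```python
import math


def training_set_for_class(label_column):
    """Split example row indices into positives and negatives, skipping NaN labels."""
    positives, negatives = [], []
    for i, label in enumerate(label_column):
        if math.isnan(label):
            continue  # excluded from this class's training
        if label == 1:
            positives.append(i)
        else:
            negatives.append(i)
    return positives, negatives


# Toy label column for one class (one entry per example).
column = [1.0, 0.0, float("nan"), 0.0, 1.0]
pos, neg = training_set_for_class(column)
print(pos, neg)
```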

If you'd like to verify or want more details on the creation of these input files, see the Python scripts in the code archive below. The hierarchy and the labels are handled in ontology.py. ALL_export.py is the starting point for generating the input data.

svml_lin_Cinf_OutputsHeldOutRaw_withExcluded.mat.zip is the matrix of our raw (unthresholded) linear-SVM predictions.

Predictions_Heldouts_SVMlightRaw_Bayes0.mat contains our predictions as marginal membership probabilities, to be thresholded as desired or used for ranking.
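
For illustration, thresholding and ranking the probabilities for one class might look like the Python sketch below. The 0.5 cutoff and the probability values are arbitrary placeholders, and the layout of the stored matrix is not assumed here.

```python
def threshold(probs, cutoff=0.5):
    """Binary predictions: 1 where the probability reaches the cutoff."""
    return [1 if p >= cutoff else 0 for p in probs]


def rank(probs):
    """Example indices ordered from most to least probable."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)


# Toy probabilities for one class.
probs = [0.10, 0.85, 0.40, 0.60]
binary = threshold(probs)
order = rank(probs)
print(binary, order)
```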

code.zip: All of our Matlab code for the project, plus the Python code that produced all Matlab data files. The code is provided "as is", is mostly undocumented, and includes obsolete files.