Introduction
Welcome to the Infobiotics PSP benchmarks repository. This site contains an adjustable real-world family of benchmarks suitable for testing the scalability of classification/regression methods. When we test a machine learning method, we usually choose a test suite containing datasets with a broad set of characteristics, because we are interested in checking how the learning method reacts to each different scenario. The Protein Structure Prediction (PSP) field provides us with a whole family of real-world classification/regression problems that can be adjusted almost arbitrarily in terms of number of variables, number of classes, class balance, etc. (click here for an in-depth explanation). This characteristic makes these datasets an ideal benchmark suite for classification/regression methods.
The repository
We have generated 140 versions of the same prediction problem: 120 as classification domains and 20 as regression domains. Some versions use discrete attributes, others continuous ones. The number of attributes ranges from 1 to 380, and the number of classes from 2 to 5. The datasets are partitioned into training/test sets using 10-fold cross-validation. Together, the datasets take up 4.3GB in compressed format. We have prepared different packages containing different subsets of the repository, to make it easier to download exactly the amount of data you need.
Access to the repository
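For readers unfamiliar with the partitioning scheme, the following Python sketch illustrates what a 10-fold cross-validation split looks like: every instance appears in exactly one test set, and each training set holds the remaining instances. It is only an illustration. The function name `k_fold_indices` is hypothetical, and the contiguous, unshuffled folds are an assumption, since the exact procedure used to build the training/test pairs in the repository is not described here. The datasets are already partitioned, so no such step is needed to use them.

```python
from typing import Iterator, List, Tuple


def k_fold_indices(n_instances: int, k: int = 10) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of k folds.

    Hypothetical helper: folds are contiguous and unshuffled, which is an
    assumption made for illustration, not the repository's documented method.
    """
    if not 2 <= k <= n_instances:
        raise ValueError("k must be between 2 and n_instances")
    # Spread the remainder over the first folds so fold sizes differ by at most one.
    base, extra = divmod(n_instances, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        # Everything outside the current test window is used for training.
        train = list(range(0, start)) + list(range(start + size, n_instances))
        yield train, test
        start += size


if __name__ == "__main__":
    # Toy example with 25 instances; the real datasets are far larger.
    for fold, (train, test) in enumerate(k_fold_indices(25, k=10)):
        print(fold, len(train), len(test))
```

A method evaluated this way is trained and tested ten times, once per training/test pair, and its results are typically averaged over the ten runs.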
Using the datasets
If you use the datasets in this repository, we ask you to cite them as follows:
@article{1358456,
author = {Michael Stout and Jaume Bacardit and Jonathan D. Hirst and Natalio Krasnogor},
title = {Prediction of recursive convex hull class assignments for protein residues},
journal = {Bioinformatics},
volume = {24},
number = {7},
year = {2008},
issn = {1367-4803},
pages = {916--923},
doi = {http://dx.doi.org/10.1093/bioinformatics/btn050},
publisher = {Oxford University Press},
address = {Oxford, UK},
}
@misc{PSPbenchmarks,
author = "Jaume Bacardit and Natalio Krasnogor",
title = "The Infobiotics PSP benchmarks repository",
note = "(http://www.infobiotic.net/PSPbenchmarks)",
year = "2008"
}
We would also like you to inform us of any published work in which these datasets are used, so that we can collect a list of the results obtained with them.