
Imbalanced Regression Dataset Repository

A curated collection of datasets for extreme value-aware regression tasks

This page provides access to 62 datasets, each with metadata on its features, target imbalance, extreme values, and missing data. The collection is intended for benchmarking regression models on imbalanced targets.

This repository has been constructed and used in the following work:
The data is available in two formats: CSV and ARFF.
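As a minimal sketch of how a CSV file from this collection might be read, the snippet below separates the target column from the features using the "Target Variable Index Position" listed for each dataset in the table. It assumes each CSV has a header row and a numeric target; the helper name `split_target` and the inline sample are hypothetical placeholders, not part of the repository.

```python
import csv
import io


def split_target(rows, target_index=0):
    """Split parsed CSV rows (header first) into features and target.

    Assumes the first row is a header and the target column is numeric.
    """
    header, *data = rows
    target_name = header[target_index]
    feature_names = header[:target_index] + header[target_index + 1:]
    # Feature values are kept as strings because columns may be nominal.
    X = [r[:target_index] + r[target_index + 1:] for r in data]
    y = [float(r[target_index]) for r in data]
    return feature_names, X, target_name, y


# Made-up two-row sample standing in for a downloaded CSV file.
sample = io.StringIO("target,f1,f2\n1.5,a,0.1\n2.0,b,0.2\n")
names, X, target, y = split_target(list(csv.reader(sample)), target_index=0)
print(target, y)
print(names, X)
```

For a real file, the `io.StringIO` object would be replaced by an opened file handle, and `target_index` by the value listed for that dataset.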

| Dataset | Description | Features | Nominal Features | Numeric Features | Instances | Missing Values | Type of Extreme | Relevance Threshold | # Rare | % Rare | Target Variable | Target Variable Index Position | Source |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| a1 | The data points are taken on an annual basis from various streams / rivers in Europe, compiling features aimed at predicting the concentrations of seven algae species. | 11 | 3 | 8 | 198 | No | Both | 0.8 | 0 | 0.00% | a1 | 0 | [1] |
| a2 | The data points are taken on an annual basis from various streams / rivers in Europe, compiling features aimed at predicting the concentrations of seven algae species. | 11 | 3 | 8 | 198 | No | Both | 0.8 | 0 | 0.00% | a2 | 0 | [1] |
| a3 | The data points are taken on an annual basis from various streams / rivers in Europe, compiling features aimed at predicting the concentrations of seven algae species. | 11 | 3 | 8 | 198 | No | Both | 0.8 | 0 | 0.00% | a3 | 0 | [1] |
| a4 | The data points are taken on an annual basis from various streams / rivers in Europe, compiling features aimed at predicting the concentrations of seven algae species. | 11 | 3 | 8 | 198 | No | Both | 0.8 | 0 | 0.00% | a4 | 0 | [1] |
| a6 | The data points are taken on an annual basis from various streams / rivers in Europe, compiling features aimed at predicting the concentrations of seven algae species. | 11 | 3 | 8 | 198 | No | Both | 0.8 | 0 | 0.00% | a6 | 0 | [1] |
| a7 | The data points are taken on an annual basis from various streams / rivers in Europe, compiling features aimed at predicting the concentrations of seven algae species. | 11 | 3 | 8 | 198 | No | High | 0.8 | 14 | 7.07% | a7 | 0 | [1] |
| abalone | Predict the age of abalone from physical measurements. | 8 | 1 | 7 | 4177 | No | Both | 0.8 | 1033 | 24.73% | Rings | 0 | [2] |
| acceleration | Dataset with acceleration statistics. | 14 | 3 | 11 | 1732 | No | Both | 0.8 | 158 | 9.12% | acceleration | 0 | [3] |
| ailerons | The attributes describe the status of the aeroplane, while the goal is to predict the control action on the ailerons of the aircraft. | 40 | 0 | 40 | 13515 | No | Both | 0.8 | 1622 | 11.80% | Goal | 0 | [4] |
| airfoil | NASA data set, obtained from a series of aerodynamic and acoustic tests of two and three-dimensional airfoil blade sections conducted in an anechoic wind tunnel. | 5 | 0 | 5 | 1503 | No | High | 0.8 | 80 | 5.32% | scaled-sound-pressure | 0 | [5] |
| anacalt | The data contains information about the decisions taken by a supreme court. | 7 | 0 | 7 | 4052 | No | Both | 0.8 | 0 | 0.00% | Log_exposure | 0 | [4] |
| appliances_energy | Experimental data used to create regression models of appliances energy use in a low energy building. | 27 | 0 | 27 | 19735 | No | Both | 0.8 | 8212 | 41.61% | Appliances | 0 | [6] |
| autoprices | Dataset with features leading to the prediction of an automobile's price. | 16 | 1 | 15 | 159 | No | Both | 0.8 | 0 | 0.00% | class | 0 | [7] |
| availablePower | Dataset with power statistics. | 15 | 7 | 8 | 1802 | No | Both | 0.8 | 305 | 16.93% | available.power | 0 | [8] |
| bank8FM | Part of a family of datasets synthetically generated from a simulation of how bank-customers choose their banks. | 8 | 0 | 8 | 4499 | No | Both | 0.8 | 0 | 0.00% | rej | 0 | [9] |
| baseball | This dataset contains the 1992 salaries of the set of Major League Baseball players who played at least one game in both the 1991 and 1992 seasons, excluding pitchers. | 16 | 0 | 16 | 337 | No | Both | 0.8 | 0 | 0.00% | Salary | 0 | [10] |
| californiaHousing | This data set contains information about all the block groups in California from the 1990 Census. | 8 | 0 | 8 | 20640 | No | Low | 0.8 | 1802 | 8.73% | MedianHouseValue | 0 | [11] |
| cocomo | Software Engineering Repository data set made publicly available in order to encourage repeatable, verifiable, refutable, and/or improvable predictive models of software engineering. | 16 | 1 | 1 | 60 | No | Low | 0.8 | 14 | 23.33% | ACT_EFFORT | 0 | [12] |
| concrete | Concrete Compressive Strength data set | 8 | 0 | 8 | 1030 | No | Low | 0.8 | 0 | 0.00% | ConcreteCompressiveStrength | 0 | [13] |
| cpuActiv | Computer activity data set | 21 | 0 | 21 | 8192 | No | Low | 0.8 | 371 | 4.53% | Usr | 0 | [9] |
| cpuSm | The Computer Activity databases are a collection of computer systems activity measures. The data was collected from a Sun Sparcstation 20/712 with 128 Mbytes of memory running in a multi-user university department. The final dataset is taken from both occasions with equal numbers of observations coming from each collection epoch. | 12 | 0 | 12 | 8192 | No | Low | 0.8 | 371 | 4.53% | usr | 0 | [9] |
| debutanizer | This dataset aims to predict the butane concentration on a Debutanizer column. | 7 | 0 | 7 | 2394 | No | High | 0.8 | 212 | 8.86% | y | 0 | [14] |
| deltaAirlerons | This data set is also obtained from the task of controlling the ailerons of an F16 aircraft. | 5 | 0 | 5 | 7129 | No | Both | 0.8 | 1206 | 16.92% | Sa | 0 | [4] |
| deltaElevators | This data set is also obtained from the task of controlling the elevators of an F16 aircraft. | 6 | 0 | 6 | 9517 | No | High | 0.8 | 4785 | 50.28% | Se | 0 | [4] |
| diabetes | This data set concerns the study of the factors affecting patterns of insulin-dependent diabetes mellitus in children. | 2 | 0 | 2 | 43 | No | High | 0.8 | 6 | 13.95% | C_peptide | 0 | [4] |
| ele-1 | Electrical Length data set | 2 | 0 | 2 | 495 | No | High | 0.8 | 21 | 4.24% | Length | 0 | [4] |
| ele-2 | Electrical-Maintenance data set | 4 | 0 | 4 | 1056 | No | Both | 0.8 | 0 | 0.00% | Y | 0 | [4] |
| elevators | The attributes describe the status of the aeroplane, while the goal is to predict the control action on the ailerons of the aircraft. | 18 | 0 | 18 | 16599 | No | Both | 0.8 | 4390 | 26.45% | Goal | 0 | [15] |
| forestFires | Forest Fires data set | 12 | 0 | 12 | 517 | No | High | 0.8 | 15 | 2.90% | Area | 0 | [16] |
| friedman | Friedman Benchmark Function data set | 5 | 0 | 5 | 1200 | No | Both | 0.8 | 0 | 0.00% | Output | 0 | [4] |
| fuelConsumption | The data contains information about car emissions and fuel consumption. | 37 | 12 | 25 | 1764 | No | Both | 0.8 | 167 | 9.47% | fuel.counsumption.country | 0 | [17] |
| geographical_origin_music | Instances in this dataset contain audio features extracted from 1059 wave files. The task associated with the data is to predict the geographical origin of music. | 117 | 0 | 117 | 1059 | No | Both | 0.8 | 104 | 9.82% | V100 | 0 | [18] |
| heat | Dataset with heating statistics. | 11 | 3 | 8 | 7400 | No | Both | 0.8 | 833 | 11.26% | heat | 0 | [8] |
| house16H | This database was designed on the basis of data provided by US Census Bureau. | 16 | 0 | 16 | 22784 | No | Both | 0.8 | 6098 | 26.76% | Price | 0 | [9] |
| housing | The Ames Housing Dataset is a well-known dataset in the field of machine learning and data analysis. It contains various features and attributes of residential homes in Ames, Iowa, USA. | 79 | 43 | 36 | 1460 | Yes (7829) | Both | 0.8 | 179 | 12.26% | SalePrice | 0 | [19] |
| housingBoston | This dataset contains information collected by the U.S. Census Service concerning housing in the area of Boston Mass. | 13 | 0 | 13 | 506 | No | Both | 0.8 | 105 | 20.75% | HousValue | 0 | [20] |
| kdd_coil_1 | This data set is from the 1999 Computational Intelligence and Learning (COIL) competition. The data contains measurements of river chemical concentrations and algae densities. | 11 | 3 | 8 | 316 | Yes (56) | Both | 0.8 | 0 | 0.00% | algae_1 | 0 | [21] |
| kinematics8nm | This data set is concerned with the forward kinematics of an 8 link robot arm. | 8 | 0 | 8 | 8192 | No | Both | 0.8 | 0 | 0.00% | y | 0 | [9] |
| laser | Laser generated data set | 4 | 0 | 4 | 993 | No | Both | 0.8 | 0 | 0.00% | Output | 0 | [4] |
| lungcancer-shedden | Prediction in Lung Adenocarcinoma | 23 | 3 | 20 | 442 | No | High | 0.8 | 12 | 2.71% | OS_years | 0 | [22] |
| machineCPU | Machine CPU Performance data set | 6 | 0 | 6 | 209 | No | Both | 0.8 | 46 | 22.01% | PRP | 0 | [4] |
| maxTorque | Dataset with torque statistics. | 32 | 13 | 19 | 1802 | No | Both | 0.8 | 235 | 13.04% | maximal.torque | 0 | [23] |
| meta | Meta-Data was used in order to give advice about which classification method is appropriate for a particular dataset. | 21 | 2 | 19 | 528 | Yes (504) | Both | 0.8 | 165 | 31.25% | class | 0 | [24] |
| mortgage | Mortgage data set | 15 | 0 | 15 | 1049 | No | Low | 0.8 | 133 | 12.68% | 30Y-CMortgageRate | 0 | [25] |
| pdgfr | This is one of 41 drug design datasets. | 320 | 0 | 320 | 79 | No | High | 0.8 | 15 | 18.99% | oz322 | 0 | [26] |
| pollen | This dataset is synthetic. It was generated by David Coleman at RCA Laboratories in Princeton, N.J. | 5 | 0 | 5 | 3848 | No | Both | 0.8 | 242 | 6.29% | DENSITY | 0 | [27] |
| puma32NH | This is a family of datasets synthetically generated from a realistic simulation of the dynamics of a Unimation Puma 560 robot arm. | 32 | 0 | 32 | 8192 | Yes (33) | Both | 0.8 | 0 | 0.00% | thetadd6 | 0 | [9] |
| puma8NH | This is a family of datasets synthetically generated from a realistic simulation of the dynamics of a Unimation Puma 560 robot arm. | 8 | 0 | 8 | 8192 | Yes (9) | Both | 0.8 | 0 | 0.00% | thetadd3 | 0 | [9] |
| quake | Quake data set | 3 | 0 | 3 | 2178 | No | Both | 0.8 | 0 | 0.00% | Richter | 0 | [4] |
| sensory | Data for the sensory evaluation experiment in Brien, C.J. and Payne, R.W. (1996) Tiers, structure formulae and the analysis of complicated experiments. | 11 | 0 | 11 | 576 | No | Both | 0.8 | 69 | 11.98% | Score | 0 | [28] |
| servo | This is an interesting collection of data provided by Karl Ulrich. It covers an extremely non-linear phenomenon - predicting the rise time of a servomechanism in terms of two (continuous) gain settings and two (discrete) choices of mechanical linkages. | 4 | 4 | 0 | 167 | No | Both | 0.8 | 59 | 35.33% | class | 0 | [15] |
| space_ga | The dataset contains 3,107 observations on U.S. county votes cast in the 1980 presidential election. | 6 | 0 | 6 | 3107 | No | Both | 0.8 | 182 | 5.86% | ln_votes_pop | 0 | [29] |
| stock | Stock Prices data set | 9 | 0 | 9 | 950 | No | Both | 0.8 | 0 | 0.00% | Company10 | 0 | [30] |
| strikes | The data consist of annual observations on the level of strike volume (days lost due to industrial disputes per 1000 wage salary earners), and their covariates in 18 OECD countries from 1951-1985. | 6 | 0 | 6 | 625 | No | High | 0.8 | 15 | 2.40% | strike_volume | 0 | [31] |
| sulfer_1 | The sulfur recovery unit (SRU) removes environmental pollutants from acid gas streams before they are released into the atmosphere. Furthermore, elemental sulfur is recovered as a valuable by-product. | 5 | 0 | 5 | 10081 | No | Both | 0.8 | 1117 | 11.08% | y1 | 0 | [32] |
| sulfer_2 | The sulfur recovery unit (SRU) removes environmental pollutants from acid gas streams before they are released into the atmosphere. Furthermore, elemental sulfur is recovered as a valuable by-product. | 5 | 0 | 5 | 10081 | No | Both | 0.8 | 1444 | 14.32% | y2 | 0 | [32] |
| supercondutivity | Two files contain data on 21263 superconductors and their relevant features. | 81 | 0 | 81 | 21263 | No | Both | 0.8 | 0 | 0.00% | critical_temp | 0 | [33] |
| treasury | This file contains the Economic data information of USA from 01/04/1980 to 02/04/2000 on a weekly basis. | 15 | 0 | 15 | 1049 | No | Low | 0.8 | 137 | 13.06% | 1MonthCDRate | 0 | [4] |
| triazines | A triazine dataset. The goal is to predict the inhibition of dihydrofolate reductase by triazines. | 60 | 0 | 60 | 186 | No | Both | 0.8 | 23 | 12.37% | activity | 0 | [34] |
| wankara | This file contains the weather information of Ankara from 01/01/1994 to 28/05/1998. | 9 | 0 | 9 | 1609 | No | Both | 0.8 | 0 | 0.00% | Mean_temperature | 0 | [4] |
| wine-quality | The two datasets are combined and related to red and white variants of the Portuguese "Vinho Verde" wine. | 12 | 1 | 11 | 6497 | No | High | 0.8 | 4113 | 63.31% | quality | 0 | [35] |
| yachtHydrodynamics | Delft data set, used to predict the hydrodynamic performance of sailing yachts from dimensions and velocity. | 6 | 0 | 6 | 308 | No | Both | 0.8 | 0 | 0.00% | Residuary_Resistance | 0 | [36] |
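The "% Rare" column appears to be the "# Rare" count divided by the number of instances. The page does not state this definition, so the sketch below treats it as an assumption and checks it against a few rows copied from the table; the function name `pct_rare` is a hypothetical helper.

```python
# (dataset, instances, n_rare, pct_rare_listed) copied from the table above.
ROWS = [
    ("a1", 198, 0, 0.00),
    ("abalone", 4177, 1033, 24.73),
    ("airfoil", 1503, 80, 5.32),
    ("wine-quality", 6497, 4113, 63.31),
]


def pct_rare(n_rare: int, instances: int) -> float:
    """Share of rare cases as a percentage, rounded to two decimals.

    Assumes % Rare = 100 * (# Rare) / Instances.
    """
    return round(100.0 * n_rare / instances, 2)


# Collect rows whose listed percentage differs from the recomputed one.
mismatches = []
for name, instances, n_rare, listed in ROWS:
    computed = pct_rare(n_rare, instances)
    if abs(computed - listed) > 0.005:
        mismatches.append(name)
    print(f"{name}: computed {computed:.2f}%, listed {listed:.2f}%")
```

A check like this can be extended to every row to spot entries where the listed count and percentage disagree.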

References