Abbott Analytics: Data Mining Consulting

Services:
- Data Mining Project Assessment
- Data Preparation for Data Mining
- Data Mining Model Development
- Data Mining Model Deployment
- Data Mining Course: Overview for Project Managers
- Data Mining Course: Overview for Practitioners
- Customized Data Mining Engagements

Abbott Insights

Insight 1: Find Correlated Variables Prior to Modeling (Topic: Data Understanding and Data Preparation; Sub-Topic: Feature Selection)
Insight 2: Beware of Outliers in Computing Correlations (Topic: Data Preparation; Sub-Topic: Outliers)
Insight 3: Create Three Sampled Data Sets, Not Two (Topic: Modeling; Sub-Topic: Sampling)
Insight 4: Use Priors to Balance Class Counts (Topic: Modeling; Sub-Topic: Decision Trees)
Insight 5: Beware of Automatic Handling of Categorical Variables (Topic: Data Understanding and Data Preparation; Sub-Topic: Feature Selection and Creation)
Insight 6: Gain Insights by Building Models from Several Algorithms (Topic: Modeling; Sub-Topic: Algorithm Selection)
Insight 7: Beware of Being Fooled with Model Performance (Topic: Data Evaluation; Sub-Topic: Model Performance)

Data Mining Clients

Client List and Case Studies

Courses and Seminars

Upcoming Data Mining Seminars
- A Practical Introduction to Data Mining (upcoming courses, nationwide)
- Data Mining Level II: A drill-down of the data mining process, techniques, and applications
- Data Mining Level III: A hands-on day of data mining using real data and real data mining software

Anytime Courses
- Overview for Project Managers: Train project managers on the data mining process.
- Overview for Practitioners: Train practitioners (data analysts, project managers, managers) on the data mining process.

Data Mining Resources

Data Mining Resources, Books, Websites, White Papers, Presentations, Tutorials

About Us

Mr. Abbott is a seasoned instructor, having taught a wide range of data mining tutorials and seminars for a decade to audiences of up to 400, including at DAMA, KDD, AAAI, and IEEE conferences. He is the instructor of well-regarded data mining courses, explaining concepts in language readily understood by a wide range of audiences, including analytics novices, data analysts, statisticians, and business professionals. Mr. Abbott has also taught applied data mining courses for major software vendors, including Clementine (SPSS), Affinium Model (Unica Corporation), and Model 1 (Group 1 Software), as well as hands-on courses using S-Plus and Insightful Miner (Insightful Corporation) and CART (Salford Systems).

Contact Us


Case Studies


Case Study: Air Force Research Laboratory

Project Title: Integrated Toxicity Assessment System (ITAS)

Sponsoring Company/Organization: USAF/AFMC Air Force Research Laboratory SBIR Contract F33615-02-C-6032
Contracting Organization: Intelligent Automation Corporation
Technical Lead (Modeling): Abbott Analytics

The Business Problem

The Air Force identified a need to integrate existing toxicological data into a comprehensive, open-architecture system to be called the Integrated Toxicity Assessment System (ITAS), something that had not yet been accomplished in a public setting. Additionally, the Air Force desired predictive models for a variety of routes (oral, inhalation, dermal), endpoints (acute), subjects (rat, mouse, human, etc.), and dose types (LD50, LC50, LDLo, etc.), wherever data were sufficient to build models. Such models would be essential when studies for compounds of interest had not yet been completed, and therefore no measured toxicity values existed.

The Analytics Problem

Intelligent Automation Corp. (IAC) integrated the RTECS and IUCLID databases into a SQL Server 2000 database. The log of LD50 values in the ITAS database for the rat oral acute data is shown in the figure below. Additionally, attributes for each compound were computed with the Dragon software (Milano Chemometrics). Dragon produced approximately 1,500 attributes, including 1-D, 2-D, and 3-D structure information. However, while the number of matches created a large dataset for predictive toxicity (1,995 rat oral acute compounds having known LD50 values), there were far too many attributes to build reliable models. Additionally, significant variations (contradictions) in toxicity values for the same compound were found in the database, leading to confusion for modeling algorithms.

[Figure: Log of LD50 values for rat oral acute data in the ITAS database]

The Approach

Sufficient data for modeling existed only for rat and mouse oral acute data (only rat results are shown here). First, after initial modeling results demonstrated the difficulty in predicting LD50 values, LD50 was split into four classes according to Environmental Protection Agency (EPA) guidelines. The split points are shown as red vertical lines in the figure above, and specific splits and class counts are shown in Table 1 below. These split points helped to significantly reduce the adverse effects of noise in the data's LD50 values.

Table 1: Counts for Each Toxicity Class

Toxicity Label      LD50 Range                    Rat Oral Acute Data Count
Extremely Toxic     LD50 < 0.25 mmol/kg           131
Highly Toxic        0.25 < LD50 < 2.5 mmol/kg     479
Moderately Toxic    2.5 < LD50 < 10 mmol/kg       627
Non-toxic           LD50 > 10 mmol/kg             758
Total Compounds                                   1,995
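The class assignment in Table 1 can be sketched as a simple binning function. This is an illustrative reconstruction, not the project's code: the split points (0.25, 2.5, and 10 mmol/kg) come from the EPA guidelines cited above, but the function and variable names are made up here, and the handling of values exactly on a boundary is an assumption (the table does not specify it).

```python
import numpy as np

# EPA split points (mmol/kg) and the four toxicity classes from Table 1.
# Boundary values are assigned to the less-toxic class (an assumption).
EPA_BINS = [0.25, 2.5, 10.0]
EPA_LABELS = ["Extremely Toxic", "Highly Toxic", "Moderately Toxic", "Non-toxic"]

def toxicity_class(ld50_mmol_per_kg):
    """Map a continuous LD50 value to one of the four EPA toxicity classes."""
    idx = np.searchsorted(EPA_BINS, ld50_mmol_per_kg, side="right")
    return EPA_LABELS[idx]
```

Recasting the noisy continuous target as these four classes is what let the modeling tolerate the contradictory LD50 values noted earlier.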

A multi-stage variable selection approach was taken to reduce the dimensionality of the data. First, constant-valued descriptors were removed because they contained no information for predictive modeling. Second, descriptors highly correlated with other descriptors were removed. These two steps reduced the number of descriptors from nearly 1,500 down to 411. To further reduce the number of descriptors, a study was performed that found 120 descriptors were sufficient to provide good accuracy without overfitting. The actual 120 descriptors were selected via the Kolmogorov-Smirnov (K-S) distance ranking method.
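The three-stage reduction above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: it assumes a binary target for the K-S ranking stage (the project predicted four classes, so its actual ranking setup likely differed), and the function names, correlation threshold, and K-S formulation are all illustrative.

```python
import numpy as np

def ks_distance(a, b):
    """Two-sample Kolmogorov-Smirnov distance: largest gap between the ECDFs."""
    grid = np.sort(np.concatenate([a, b]))
    cdf_a = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    cdf_b = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return np.max(np.abs(cdf_a - cdf_b))

def select_descriptors(X, y, corr_threshold=0.95, n_keep=120):
    """X: (n_samples, n_descriptors) array; y: binary class labels (0/1)."""
    # Stage 1: drop constant-valued descriptors (no information for modeling).
    X = X[:, [j for j in range(X.shape[1]) if np.ptp(X[:, j]) > 0]]
    # Stage 2: greedily drop descriptors highly correlated with one already kept.
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for j in range(X.shape[1]):
        if all(corr[j, k] < corr_threshold for k in kept):
            kept.append(j)
    X = X[:, kept]
    # Stage 3: rank survivors by the K-S distance between the descriptor's
    # distributions in the two classes; keep the top n_keep.
    ks = [ks_distance(X[y == 0, j], X[y == 1, j]) for j in range(X.shape[1])]
    order = np.argsort(ks)[::-1][:n_keep]
    return X[:, order]
```

The point of the staged design is cost: the cheap filters (constants, pairwise correlation) shrink the pool before the per-descriptor distributional ranking is computed.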

To overcome the problems caused by the relatively small sample size and high levels of noise, an ensemble of ten neural networks was created, each network built from a bootstrapped sample of the data. The entire process described thus far is summarized in the block diagram above.
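The bootstrap-ensemble step can be sketched as below. This is an illustrative reconstruction, not the project's code: scikit-learn's MLPClassifier stands in for whatever network implementation was used, and the network size, hyperparameters, vote-combination rule, and function names are all assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def fit_bootstrap_ensemble(X, y, n_models=10, seed=0):
    """Train n_models networks, each on a bootstrap resample of (X, y)."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X), size=len(X))  # sample with replacement
        net = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500,
                            random_state=int(rng.integers(1 << 31)))
        models.append(net.fit(X[idx], y[idx]))
    return models

def ensemble_predict(models, X):
    """Combine the networks by majority vote over the predicted classes."""
    votes = np.stack([m.predict(X) for m in models])  # (n_models, n_samples)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```

Because each network sees a different resample, their individual errors on noisy LD50 labels are partly independent, which is what the vote averages away.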

Results and Delivery

Modeling results showed moderate accuracy, though seemingly better than other results documented in the data mining literature (accuracy is difficult to assess when there is such high variation in acute toxicity values). In general, individual descriptors proved to be poor predictors on their own, and the best models used a combination of 1-D, 2-D, and 3-D descriptors, and tended to favor the 2-D and 3-D descriptors. Model ensembles were critical in providing accuracy and reducing risk of error from a single model, increasing model accuracy on hold-out data by 50% over average single-model performance, as shown in Table 2 below. Note that the classification accuracy would be only 25% if one guessed the toxicity class at random.

Table 2: Classification Accuracy by Model Type

Model Type                           Accuracy on Held-out Data
Minimum Single Neural Network        41%
Maximum Single Neural Network        44%
Average Neural Network               44.1%
Bootstrap Neural Network Ensemble    65.6%


Publications

Wroblewski, D., M. T. Green, J. Viloria, D. Abbott, and E. Wroblewska, "Computational Platform for Predictive Toxicology," poster presented at the ADMET 1 Conference, San Diego, CA, February 11-13, 2004.

Abbott, D., and D. Wroblewski, "Making Large Feature Sets Manageable for Prediction of LD50 from 3-D Chemical Structure," 7th Annual Insightful Users' Conference, Las Vegas, NV, October 8-10, 2003.