Case Studies
Sponsoring Company/Organization: USAF/AFMC Air Force Research Laboratory SBIR Contract F33615-02-C-6032
Contracting Organization: Intelligent Automation Corporation
Technical Lead (Modeling): Abbott Analytics
The Air Force identified a need to integrate existing toxicological data into a comprehensive, open-architecture system to be called the Integrated Toxicity Assessment System (ITAS), something that had not yet been accomplished in a public setting. Additionally, the Air Force desired predictive models for a variety of routes (oral, inhalation, dermal), endpoints (acute), subjects (rat, mouse, human, etc.), and dose types (LD50, LC50, LDLo, etc.), wherever data were sufficient to build models. Such models would be essential when studies for compounds of interest had not yet been completed, and therefore no measured toxicity values were available.
Intelligent Automation Corp. (IAC) integrated the RTECS and IUCLID databases into a SQL Server 2000 database. The log of LD50 values in the ITAS database for the rat oral acute data is shown in the figure below. Additionally, attributes for each compound were computed with the Dragon software (Milano Chemometrics), which provides approximately 1,500 attributes, including 1-D, 2-D, and 3-D structural information. However, while the number of matches created a large dataset for predictive toxicity (1,995 rat oral acute compounds with known LD50 values), there were far too many attributes to build reliable models. Additionally, significant variations (contradictions) in toxicity values for the same compound were found in the database, confounding the modeling algorithms.
Sufficient data for modeling existed only for rat and mouse oral acute data (only rat results are shown here). First, after initial modeling results demonstrated the difficulty in predicting LD50 values, LD50 was split into four classes according to Environmental Protection Agency (EPA) guidelines. The split points are shown as red vertical lines in the figure above, and specific splits and class counts are shown in Table 1 below. These split points helped to significantly reduce the adverse effects of noise in the data's LD50 values.
Table 1: Counts for Each Toxicity Class
Toxicity Label | LD50 Range | Rat Oral Acute Data Count |
---|---|---|
Extremely Toxic | LD50 < 0.25 mmol/kg | 131 |
Highly Toxic | 0.25 < LD50 < 2.5 mmol/kg | 479 |
Moderately Toxic | 2.5 < LD50 < 10 mmol/kg | 627 |
Non-toxic | LD50 > 10 mmol/kg | 758 |
Total Compounds | | 1,995 |
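The EPA-guideline binning in Table 1 can be sketched as follows. The compound values and the use of pandas are illustrative only, not part of the original ITAS pipeline:

```python
import numpy as np
import pandas as pd

# Hypothetical LD50 values (mmol/kg) for five illustrative compounds
ld50 = pd.Series([0.1, 1.7, 4.2, 55.0, 0.3])

# EPA-style split points from Table 1: 0.25, 2.5, and 10 mmol/kg
bins = [0, 0.25, 2.5, 10, np.inf]
labels = ["Extremely Toxic", "Highly Toxic", "Moderately Toxic", "Non-toxic"]

# pd.cut assigns each LD50 value to its toxicity class
toxicity_class = pd.cut(ld50, bins=bins, labels=labels)
print(toxicity_class.tolist())
# ['Extremely Toxic', 'Highly Toxic', 'Moderately Toxic', 'Non-toxic', 'Highly Toxic']
```

Working with four ordered classes instead of the raw (noisy) LD50 values is what reduced the impact of the contradictory toxicity measurements noted above.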
A multi-stage variable selection approach was taken to reduce the dimensionality of the data. First, constant-valued descriptors were removed because they contained no information for predictive modeling. Second, descriptors highly correlated with other descriptors were removed. These two steps reduced the number of descriptors from nearly 1,500 down to 411. To further reduce the number of descriptors, a study was performed that found 120 descriptors were sufficient to provide good accuracy without overfitting. The actual 120 descriptors were selected via the Kolmogorov-Smirnov (K-S) distance ranking method.
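A minimal sketch of the three-stage selection is shown below. The 0.95 correlation cutoff and the two-class (least- vs. most-toxic) K-S comparison are assumptions; the thresholds and class pairing actually used in the study are not documented here:

```python
import numpy as np
from scipy.stats import ks_2samp

def select_descriptors(X, y, corr_threshold=0.95, n_keep=120):
    """Three-stage descriptor reduction sketch:
    1. drop constant-valued columns (no predictive information),
    2. drop one of each highly correlated descriptor pair,
    3. rank survivors by K-S distance and keep the top n_keep."""
    # Stage 1: constant columns have zero standard deviation
    keep = [j for j in range(X.shape[1]) if X[:, j].std() > 0]

    # Stage 2: greedily drop descriptors highly correlated with
    # any descriptor already retained
    selected = []
    for j in keep:
        if all(abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) < corr_threshold
               for k in selected):
            selected.append(j)

    # Stage 3: rank by the K-S statistic between two toxicity classes
    # (assumed here: class 0 vs. class 3) and keep the top n_keep
    a, b = X[y == 0], X[y == 3]
    scores = [ks_2samp(a[:, j], b[:, j]).statistic for j in selected]
    order = np.argsort(scores)[::-1]
    return [selected[i] for i in order[:n_keep]]
```

The greedy correlation filter in stage 2 keeps whichever of two redundant descriptors appears first; the study's exact tie-breaking rule is not recorded here.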
To overcome the problems caused by relatively small sample sizes and high levels of noise, an ensemble of ten neural networks was created, each built from a bootstrapped sample of the training data. The entire process described thus far is summarized in the block diagram above.
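The bootstrap ensemble might look like the sketch below. The network architecture, solver, and majority-vote combination are assumptions, since the original configuration and combination scheme are not detailed here:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.utils import resample

def bootstrap_nn_ensemble(X, y, n_models=10, seed=0):
    """Train n_models neural networks, each on a bootstrap
    (sampled-with-replacement) copy of the training data."""
    models = []
    for i in range(n_models):
        Xb, yb = resample(X, y, random_state=seed + i)
        net = MLPClassifier(hidden_layer_sizes=(20,), solver="lbfgs",
                            max_iter=500, random_state=seed + i)
        models.append(net.fit(Xb, yb))
    return models

def ensemble_predict(models, X):
    """Combine member predictions by majority vote over class labels."""
    votes = np.stack([m.predict(X) for m in models])
    return np.apply_along_axis(
        lambda col: np.bincount(col).argmax(), 0, votes)
```

Because each network sees a different resampling of the noisy LD50 data, the vote averages out errors that any single network makes on the contradictory compounds.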
Modeling results showed moderate accuracy, though seemingly better than other results documented in the data mining literature (accuracy is difficult to assess when there is such high variation in acute toxicity values). In general, individual descriptors proved to be poor predictors on their own, and the best models used a combination of 1-D, 2-D, and 3-D descriptors, and tended to favor the 2-D and 3-D descriptors. Model ensembles were critical in providing accuracy and reducing risk of error from a single model, increasing model accuracy on hold-out data by 50% over average single-model performance, as shown in Table 2 below. Note that the classification accuracy would be only 25% if one guessed the toxicity class at random.
Table 2: Classification Accuracy by Model Type
Model Type | Accuracy on Held-out Data |
---|---|
Minimum Single Neural Network Accuracy | 41% |
Maximum Single Neural Network Accuracy | 44% |
Average Neural Network Accuracy | 44.1% |
Bootstrap Neural Network Ensemble | 65.6% |
Wroblewski, D., M.T. Green, J. Viloria, D. Abbott, and E. Wroblewska, Computational Platform for Predictive Toxicology, poster presented at the ADMET 1 Conference, San Diego, CA, February 11-13, 2004.
Abbott, D., and D. Wroblewski, Making Large Feature Sets Manageable for Prediction of LD50 from 3-D Chemical Structure, 7th Annual Insightful Users' Conference, Las Vegas, NV, October 8-10, 2003.