Integrated Toxicity Assessment System (ITAS)
Project Title: Integrated Toxicity Assessment System (ITAS)
Sponsoring Company/Organization: USAF/AFMC Air Force Research Laboratory SBIR Contract F33615-02-C-6032
Contracting Organization: Intelligent Automation Corporation
Technical Lead (Modeling): Abbott Analytics
The Business Problem
The Air Force identified a need to integrate existing toxicological data into a comprehensive, open architecture system to be called the Integrated Toxicity Assessment System (ITAS), something that had not yet been accomplished in a public setting. Additionally, the Air Force desired predictive models for a variety of routes (oral, inhalation, dermal), endpoints (acute), subjects (rat, mouse, human, etc.) and dose types (LD50, LC50, LDLo, etc.), wherever data were sufficient to build models. Such models would be essential when studies for compounds of interest had not yet been completed, and therefore no toxicity data were available.
The Analytics Problem
Intelligent Automation Corp. (IAC) integrated the RTECS and IUCLID databases into a SQL Server 2000 database. The log of the LD50 values in the ITAS database for the rat, oral, acute data is shown in the figure below. Additionally, attributes (descriptors) for each compound were computed with the Dragon software (Milano Chemometrics). Dragon provided approximately 1,500 attributes, including 1-D, 2-D, and 3-D structure information. However, while the number of matches created a large dataset for predictive toxicity (1,995 rat, oral, acute compounds having known LD50 values), there were far too many attributes to build reliable models. Additionally, significant variations (contradictions) in toxicity values for the same compound were found in the database, which introduced noise that confuses modeling algorithms.
The Approach
Sufficient data for modeling existed only for the rat and mouse oral acute cases (only rat results are shown here). After initial modeling results demonstrated the difficulty of predicting LD50 values directly, LD50 was split into four classes according to Environmental Protection Agency (EPA) guidelines, turning the regression problem into a four-class classification problem. The split points are shown as red vertical lines in the figure above, and specific splits and class counts are shown in Table 1 below. These split points helped to significantly reduce the adverse effects of noise in the data's LD50 values.
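The binning step can be illustrated with a minimal sketch. The cut points used here (50, 500, and 5,000 mg/kg) are an assumption based on the commonly cited EPA acute oral toxicity categories; the split points actually used in the project are those given in Table 1 and may differ. The function name toxicity_class is hypothetical.

```python
import bisect

# ASSUMPTION: cut points in mg/kg taken from the commonly cited EPA acute
# oral toxicity categories (I: <=50, II: >50-500, III: >500-5000, IV: >5000).
# The project's actual split points are listed in its Table 1.
EPA_ORAL_CUTS_MG_PER_KG = (50.0, 500.0, 5000.0)


def toxicity_class(ld50_mg_per_kg):
    """Map an LD50 value (mg/kg) to a class 1 (most toxic) .. 4 (least toxic)."""
    if ld50_mg_per_kg <= 0:
        raise ValueError("LD50 must be positive")
    # bisect_left places a value equal to a cut point in the lower (more toxic) class.
    return bisect.bisect_left(EPA_ORAL_CUTS_MG_PER_KG, ld50_mg_per_kg) + 1


if __name__ == "__main__":
    for dose in (12.0, 50.0, 320.0, 5000.0, 9000.0):
        print(dose, "->", toxicity_class(dose))
```

Because every compound inside a class receives the same label, two conflicting LD50 reports for one compound only disagree after binning if they fall on opposite sides of a cut point, which is one way to see why the classes dampen noise.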
A multi-stage variable selection approach was taken to reduce the dimensionality of the data. First, constant-valued descriptors were removed because they contained no information for predictive modeling. Second, descriptors highly correlated with other descriptors were removed. These two steps reduced the number of descriptors from nearly 1,500 down to 411. To further reduce the number of descriptors, a study was performed that found 120 descriptors were sufficient to provide good accuracy without overfitting. The actual 120 descriptors were selected via the Kolmogorov-Smirnov (K-S) distance ranking method.
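The three selection stages can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the correlation threshold of 0.95, the greedy keep-first strategy, and scoring a descriptor by its largest two-sample K-S distance over all pairs of toxicity classes are all assumptions, and every function name is hypothetical.

```python
from bisect import bisect_right
from itertools import combinations
from statistics import mean, pstdev


def drop_constant(columns):
    """Stage 1: keep only descriptors that take more than one distinct value."""
    return {name: vals for name, vals in columns.items() if len(set(vals)) > 1}


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (pstdev(x) * pstdev(y))


def drop_correlated(columns, threshold=0.95):
    """Stage 2 (assumed greedy rule): keep a descriptor only if its |r| with
    every descriptor kept so far is below the threshold."""
    kept = {}
    for name, vals in columns.items():
        if all(abs(pearson(vals, other)) < threshold for other in kept.values()):
            kept[name] = vals
    return kept


def ks_distance(a, b):
    """Two-sample K-S statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    points = sorted(set(a) | set(b))
    return max(abs(bisect_right(a, p) / len(a) - bisect_right(b, p) / len(b))
               for p in points)


def rank_by_ks(columns, labels, top_n):
    """Stage 3 (assumed scoring): rank descriptors by their largest K-S distance
    between any two classes; requires at least two classes in labels."""
    classes = sorted(set(labels))

    def score(vals):
        groups = {c: [v for v, lab in zip(vals, labels) if lab == c] for c in classes}
        return max(ks_distance(groups[c1], groups[c2])
                   for c1, c2 in combinations(classes, 2))

    return sorted(columns, key=lambda name: score(columns[name]), reverse=True)[:top_n]


if __name__ == "__main__":
    # Toy data: one constant, one duplicate-by-scaling, one uninformative descriptor.
    columns = {
        "const": [1, 1, 1, 1, 1, 1],
        "a": [1, 2, 3, 4, 5, 6],
        "a_scaled": [2, 4, 6, 8, 10, 12],
        "noise": [3, 1, 2, 3, 1, 2],
    }
    labels = [1, 1, 1, 2, 2, 2]
    reduced = drop_correlated(drop_constant(columns))
    print(sorted(reduced), rank_by_ks(reduced, labels, top_n=1))
```

In the project, the same sequence took the descriptor set from nearly 1,500 to 411 after the first two stages, and the K-S ranking then selected the final 120.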
To overcome the problems caused by relatively small sample sizes and high levels of noise, an ensemble of ten neural networks was created, with each network built from a bootstrapped sample of the data. The entire process described thus far is summarized in the block diagram above.
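The bootstrapped ensemble can be sketched as below. To keep the example short and dependency-free, a nearest-centroid classifier stands in for each neural network, and the ensemble combines its members by majority vote; both choices are assumptions, since the source does not state how the ten networks were trained or combined. All function names are hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(rows, labels, rng):
    """Draw len(rows) training cases with replacement."""
    idx = [rng.randrange(len(rows)) for _ in rows]
    return [rows[i] for i in idx], [labels[i] for i in idx]


def fit_nearest_centroid(rows, labels):
    """STAND-IN base learner (not a neural network): predicts the class whose
    mean descriptor vector is closest to the query row."""
    counts = Counter(labels)
    sums = {}
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
    centroids = {c: [s / counts[c] for s in sums[c]] for c in sums}

    def predict(row):
        return min(centroids,
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(row, centroids[c])))

    return predict


def fit_bagged_ensemble(rows, labels, n_models=10, seed=0, fit=fit_nearest_centroid):
    """Train n_models base learners, each on its own bootstrap sample."""
    rng = random.Random(seed)
    return [fit(*bootstrap_sample(rows, labels, rng)) for _ in range(n_models)]


def ensemble_predict(models, row):
    """ASSUMED combination rule: majority vote over the member predictions."""
    votes = Counter(model(row) for model in models)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Toy one-descriptor data with two well-separated toxicity classes.
    rows = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]]
    labels = [1, 1, 1, 4, 4, 4]
    models = fit_bagged_ensemble(rows, labels)
    print(ensemble_predict(models, [0.05]), ensemble_predict(models, [4.9]))
```

Any classifier with the same fit-then-predict shape could be passed as the fit argument, so a real neural network trainer would slot in without changing the bagging logic. Because each member sees a different resample, an error caused by a few noisy LD50 values in one sample tends to be outvoted by the other members.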
Results and Delivery
Modeling results showed moderate accuracy, though seemingly better than other results documented in the data mining literature (accuracy is difficult to assess when there is such high variation in acute toxicity values). In general, individual descriptors proved to be poor predictors on their own. The best models used a combination of 1-D, 2-D, and 3-D descriptors and tended to favor the 2-D and 3-D descriptors. Model ensembles were critical in providing accuracy and reducing the risk of error from a single model, increasing model accuracy on hold-out data by 50% over average single-model performance, as shown in Table 2 below. Note that with four classes, the classification accuracy would be only 25% if one guessed the toxicity class at random.
References
Wroblewski, D., M.T. Green, J. Viloria, D. Abbott, and E. Wroblewska, Computational Platform for Predictive Toxicology, A Poster Presented at The ADMET 1 Conference, San Diego, CA, February 11-13, 2004.
Abbott, D., D. Wroblewski, Making Large Feature Sets Manageable for Prediction of LD50 from 3-D Chemical Structure, 7th Annual Insightful Users' Conference, Las Vegas, NV, October 8-10, 2003.