Abbott Analytics: Data Mining Consulting

Services: Data Mining Project Assessment; Data Preparation for Data Mining; Data Mining Model Development; Data Mining Model Deployment; Data Mining Course: Overview for Project Managers; Data Mining Course: Overview for Practitioners; Customized Data Mining Engagements

Abbott Insights

Insight 1: Find Correlated Variables Prior to Modeling (Topic: Data Understanding and Data Preparation; Sub-Topic: Feature Selection)
Insight 2: Beware of Outliers in Computing Correlations (Topic: Data Preparation; Sub-Topic: Outliers)
Insight 3: Create Three Sampled Data Sets, not Two (Topic: Modeling; Sub-Topic: Sampling)
Insight 4: Use Priors to Balance Class Counts (Topic: Modeling; Sub-Topic: Decision Trees)
Insight 5: Beware of Automatic Handling of Categorical Variables (Topic: Data Understanding and Data Preparation; Sub-Topic: Feature Selection and Creation)
Insight 6: Gain Insights by Building Models from Several Algorithms (Topic: Modeling; Sub-Topic: Algorithm Selection)
Insight 7: Beware of Being Fooled with Model Performance (Topic: Data Evaluation; Sub-Topic: Model Performance)
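Insights 1 and 2 can be illustrated with a small sketch: flag highly correlated variable pairs before modeling, using a rank (Spearman) correlation that is less sensitive to outliers than Pearson. The function and threshold here are illustrative, not part of the Abbott courses.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.9, method="spearman"):
    """Return variable pairs whose |correlation| meets the threshold.

    Spearman (rank) correlation is used by default because a single
    extreme outlier can inflate or deflate a Pearson correlation.
    """
    corr = df.corr(method=method)
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if abs(corr.iloc[i, j]) >= threshold:
                pairs.append((cols[i], cols[j], corr.iloc[i, j]))
    return pairs

# Toy data: y is a near-copy of x, z is independent noise.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({
    "x": x,
    "y": x + rng.normal(scale=0.01, size=200),
    "z": rng.normal(size=200),
})
flagged = correlated_pairs(df, threshold=0.9)
```

One of each flagged pair can then be dropped (or the pair combined) before modeling, so that redundant inputs do not destabilize the model.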

Data Mining Clients

Client List and Case Studies

Courses and Seminars

Upcoming Data Mining Seminars
A Practical Introduction to Data Mining: upcoming courses (nationwide)
Data Mining Level II: A drill-down of the data mining process, techniques, and applications
Data Mining Level III: A hands-on day of data mining using real data and real data mining software
Anytime Courses
Overview for Project Managers: Train project managers on the data mining process.
Overview for Practitioners: Train practitioners (data analysts, project managers, managers) on the data mining process.

Data Mining Resources

Data Mining Resources, Books, Websites, White Papers, Presentations, Tutorials

About Us

Mr. Abbott is a seasoned instructor, having taught a wide range of data mining tutorials and seminars for a decade to audiences of up to 400, including at DAMA, KDD, AAAI, and IEEE conferences. He is the instructor of well-regarded data mining courses, explaining concepts in language readily understood by a wide range of audiences, including analytics novices, data analysts, statisticians, and business professionals. Mr. Abbott has also taught applied data mining courses for major software vendors, covering Clementine (SPSS), Affinium Model (Unica Corporation), and Model 1 (Group 1 Software), as well as hands-on courses using S-Plus and Insightful Miner (Insightful Corporation) and CART (Salford Systems).

Contact Us


Case Study: Defense Finance and Accounting Service

Project Title: Invoice Fraud Detection

Sponsoring Company/Organization: Defense Finance and Accounting Service (DFAS)
Contracting Organization: Federal Data Systems (now Logicon Fed Data), Elder Research, Abbott Analytics
Technical Lead: Abbott Analytics

The Business Problem

The Defense Finance and Accounting Service (DFAS) is responsible for disbursing nearly all Department of Defense (DoD) funds. In its effort to be an outstanding steward of these funds, DFAS set out to minimize fraud against DoD financial assets and chose data mining as one of its strategies to detect, and ultimately deter, fraud. The data mining models ranked each invoice by how likely it was to be suspicious, and therefore to warrant further investigation by a human examiner. False alarms also had to be minimized so that examiners were not unduly burdened with cases unlikely to be fraudulent.

The Analytics Problem

The data consisted of fields extracted from millions of invoices and stored in a database. Analysts then added a label to each record indicating whether the invoice was "fraudulent" or "not fraudulent" (also called "unlabelled"). Invoices were labeled fraudulent only after they had been investigated by examiners, prosecuted, and adjudicated as fraudulent. The vast majority of invoices, however, had never been examined and were assumed not to be fraudulent.

Several problems confronted the analysts. First, because so few available records were labeled "fraudulent," the analysts could not perform the standard split of the data into training, testing, and validation sets. Second, unlabelled invoices were not necessarily cleared of suspicion: although the vast majority were likely not fraudulent, some of them were, which would pollute the labels if a large sample of the data were taken. Third, the known fraudulent transactions did not all represent the same fraud scheme; they belonged to several different types of schemes, so labeling them all the same (as "fraud") could confuse the modeling algorithms.

The Approach

For the data mining process, DFAS extracted transactions for several adjudicated fraud cases, and used source documents to recreate other transactions as they would have appeared in the database. These recreated transactions were added to a moderate number of unlabelled transactions, the vast majority of which were not fraudulent. Because only thousands of these unlabelled transactions were chosen, the likelihood of selecting an unlabelled transaction and mistakenly calling it non-fraudulent was relatively small. This combined data set was split into 11 data subsets via cross-validation sampling. Finally, instead of trying to predict a mere "fraud"/"not fraud" outcome, examiners and analysts were interviewed to provide expert insight into how many types of fraud they had seen. As a result of the interviews, a five-level target variable was created containing four distinct fraud types and one non-fraud type.
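The 11-way split described above can be sketched as a random partition of row indices into roughly equal subsets. The fold count comes from the case study, but the exact sampling scheme used by the project is not described, so this function is only an assumed, minimal version.

```python
import numpy as np

def split_into_subsets(n_rows, n_subsets=11, seed=42):
    """Randomly partition row indices 0..n_rows-1 into n_subsets
    roughly equal, disjoint subsets (cross-validation-style sampling).
    """
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)
    return np.array_split(shuffled, n_subsets)

# Example: partition 10,000 invoice records into 11 subsets.
folds = split_into_subsets(10_000, n_subsets=11)
```

Each analyst can then be handed one or more subsets, so that the models in the eventual ensemble are trained on different samples of the data.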

Results and Delivery

There is strength in diversity. This adage is supported both by the ensemble literature and by the results presented here. In this project, ensemble diversity had two main sources: sample diversity and algorithm diversity. Eleven random samples of the data were used in building models, and each of the six independent analysts was given full freedom to use any algorithm on the random samples assigned to them. The best 11 models are shown in the table below, where "best" was defined as the models achieving the highest accuracy while keeping false alarms low.

Model #   Modeler   Data Subset   Algorithm Type
1         1         8             Neural Network
2         2         5             Decision Tree
3         3         7             Neural Network
4         4         9             Decision Tree
5         4         8             Decision Tree
6         4         11            Decision Tree
7         4         11            Rules
8         5         6             Neural Network
9         6         4             Neural Network
10        6         1             Neural Network
11        1         3             Rules

The figure below shows the relative performance of the 11 models and an additional "ensemble" model that combined the predictions of the other 11 into a final decision (the actual sensitivity and false alarm rates are not shown to protect the confidentiality of the models). The ensemble performed better than any of the individual models, and because of this we avoided the need to "handcraft" models via multiple iterations of the model-building process.

Graph of results
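The case study does not say how the 11 predictions were combined into a final decision; a simple majority vote across the models is one common combination rule, sketched here as an assumed example.

```python
import numpy as np

def ensemble_vote(predictions):
    """Combine per-model predictions (rows = models, columns = invoices)
    into a final decision by majority vote per invoice.
    """
    preds = np.asarray(predictions)
    final = []
    for column in preds.T:  # one column per invoice
        labels, counts = np.unique(column, return_counts=True)
        final.append(labels[np.argmax(counts)])  # most frequent label
    return np.array(final)

# Three toy models scoring four invoices as fraud (1) or non-fraud (0).
votes = [[1, 0, 1, 0],
         [1, 0, 0, 0],
         [0, 0, 1, 1]]
decision = ensemble_vote(votes)  # majority vote: [1, 0, 1, 0]
```

Weighted votes or averaged class probabilities are common alternatives when some models are known to be more reliable than others.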

In addition, the ensemble detected 97% of known fraud cases (in validation data), and as a result of the models, 1,217 payments were selected for further investigation by DFAS.