Sponsoring Company/Organization: Large, National Health Club
Contracting Organization: Seer Analytics, LLC (Tampa, FL)
Technical Lead: Abbott Analytics, Inc.
A national fitness club created a survey to assess how well it was meeting its core objectives by satisfying the desires and interests of its members. Survey questions measured members' ratings of the club in areas such as staff performance, quality and quantity of exercise equipment, and interactions between members of the club. Health club administrators wanted to gain insight into the relative importance of these issues without having to understand statistics, so results had to be presented in language and visualizations that were transparent to these decision makers. The core goals of the club were to retain existing members and to expand the total number of members in the club.
The data mining objective centered on finding which survey questions were most related to questions that reflected the core goals of the club. Unfortunately, no single question fully captured how member opinions related to these core goals. Additionally, as is common with surveys, responses to questions were highly intercorrelated, which could lead to a misleading interpretation of model results.
First, a composite "target" variable was created by combining responses to the three survey questions most related to the club's core goals: overall satisfaction, intent to renew membership, and willingness to recommend club membership to a friend. This variable became the output (dependent) variable for modeling. Second, factor analysis was used to identify groups of highly correlated variables and their relationship to the target variable. Third, the single survey question best representing each of the top five factors was included as an input (independent variable), along with three additional survey questions that had been consistently good predictors in preliminary data mining models. These eight survey questions were then used to build linear regression models to predict the composite satisfaction target variable. The key to the model was discovering which survey questions were most related to the target variable.
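The composite-target idea can be illustrated with a minimal sketch. It assumes the three goal-related questions are combined by averaging their standardized responses, which is only one plausible way to combine them; the case study does not say how the combination was done. It then ranks candidate questions by simple Pearson correlation with the target, a much cruder stand-in for the factor analysis and linear regression described above. All question names and ratings are made up for illustration.

```python
from statistics import mean, pstdev

def zscores(values):
    """Standardize responses to mean 0 and unit (population) variance."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def composite_target(*questions):
    """Average the standardized responses, member by member."""
    standardized = [zscores(q) for q in questions]
    return [mean(member) for member in zip(*standardized)]

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical 1-5 ratings from six members (not data from the study).
satisfaction = [5, 4, 2, 3, 5, 1]
renew = [5, 4, 1, 3, 4, 2]
recommend = [4, 5, 2, 3, 5, 1]
target = composite_target(satisfaction, renew, recommend)

# Hypothetical candidate input questions.
candidates = {
    "member_relationships": [5, 4, 2, 3, 5, 1],
    "parking": [3, 3, 4, 3, 2, 3],
}

# Rank candidates by strength of relationship to the composite target.
ranked = sorted(
    candidates,
    key=lambda q: abs(pearson(candidates[q], target)),
    reverse=True,
)
```

Standardizing before averaging keeps a question with a wider spread of answers from dominating the composite. A real analysis would also need to handle missing responses and the intercorrelation among inputs, which is what the factor analysis step addresses.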
After the final model was created, simulations were devised to assess the sensitivity of the model to changes in answers to the survey questions. In other words, if a member changed his/her answer to key survey questions, the simulation would determine the influence of the changes on the overall satisfaction rating.
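A sensitivity simulation of this kind can be sketched as follows. The intercept, coefficients, question names, and the 1 to 5 answer scale are all hypothetical placeholders, not values from the study; the sketch only shows the mechanics of changing one answer and measuring the change in the predicted target.

```python
# Hypothetical fitted linear model (illustrative values, not the study's).
INTERCEPT = 0.5
COEFFICIENTS = {
    "member_relationships": 0.40,
    "fitness_goals_met": 0.30,
    "value_for_money": 0.20,
}

def predict(answers):
    """Predicted composite target for one member's answers."""
    return INTERCEPT + sum(COEFFICIENTS[q] * answers[q] for q in COEFFICIENTS)

def sensitivity(answers, question, delta=1, scale=(1, 5)):
    """Change in prediction when one answer moves by delta, clipped to the scale."""
    low, high = scale
    changed = dict(answers)
    changed[question] = min(high, max(low, answers[question] + delta))
    return predict(changed) - predict(answers)

# One hypothetical member's current answers.
member = {"member_relationships": 3, "fitness_goals_met": 4, "value_for_money": 5}

# Effect of raising each answer by one point, one question at a time.
effects = {q: sensitivity(member, q) for q in COEFFICIENTS}
```

Clipping to the answer scale matters: a member who already gave the top rating cannot move higher, so the simulated effect for that question is zero regardless of its coefficient.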
The final presentation to the customer consisted of visuals representing the influence of key survey questions on the club's objectives. In a single picture, four dimensions were represented. First, the order of the balls (left to right) indicated the relative importance of the key survey questions to the organization's objectives. The most important questions, in descending order, were: relationships with other members, how well fitness goals were met, and value for the money. Second, the height of the ball represented the performance of that club relative to all clubs of the same type. Third, the color of the ball showed how well the club performed in the current year compared to the previous year. Fourth, the size of the ball showed how influential each survey question was in predicting the target variable for this particular club.
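The four-dimension encoding can be expressed as a small data transformation, sketched below without any plotting library. The field names, the input keys, and the red and gray colors are assumptions made for illustration; the case study only states that a green ball indicates improvement over the previous year.

```python
from dataclasses import dataclass

@dataclass
class Ball:
    question: str
    x: int        # left-to-right position: rank by importance to the organization
    y: float      # height: club score minus the peer-group score
    color: str    # year-over-year change for this club
    size: float   # influence on the target variable for this club

def build_balls(rows):
    """Map survey summaries to ball attributes.

    Each row is a dict with hypothetical keys: question, importance,
    club_score, peer_score, prev_score, influence.
    """
    ordered = sorted(rows, key=lambda r: r["importance"], reverse=True)
    balls = []
    for rank, r in enumerate(ordered):
        change = r["club_score"] - r["prev_score"]
        # Green for improvement follows the text; red and gray are assumed.
        color = "green" if change > 0 else "red" if change < 0 else "gray"
        height = r["club_score"] - r["peer_score"]
        balls.append(Ball(r["question"], rank, height, color, r["influence"]))
    return balls

# Made-up summary rows for two survey questions.
rows = [
    {"question": "value_for_money", "importance": 0.2, "club_score": 3.0,
     "peer_score": 3.5, "prev_score": 3.2, "influence": 0.1},
    {"question": "member_relationships", "importance": 0.5, "club_score": 4.2,
     "peer_score": 3.9, "prev_score": 4.0, "influence": 0.6},
]
balls = build_balls(rows)
```

Each resulting `Ball` could then be handed to any charting tool as one point in a bubble chart, with `x`, `y`, `color`, and `size` mapped directly to the plot's position, color, and marker-size arguments.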
In the figure above, this club performed better than its peers at promoting positive relationships between members, and at developing more competent (Staff 1) and caring (Staff 2) staff. This finding was particularly satisfying for the club shown here because the green ball shows that the club improved the caring nature of the staff from the previous year. On the other hand, members rated this club a worse value for the dollar than other clubs and rated its facilities poorly, and perceptions of both were getting worse. The small sizes of these two balls, however, indicate that neither was very important for predicting members' overall satisfaction with the club.