GUIDE FOR ASSIGNMENT 1
A ) Selecting a business goal
a . 1 Problem Statement
The problem statement is a concise description of the business issues that you are going to address with the chosen data set. A good problem statement should answer these questions:
1. What is the business problem?
2. Who has the problem or who is the client/customer? This should explain who would need the solution and who would decide the problem has been solved.
3.a What is the problem definition in customer terms?
3.b What is the problem definition in the company's terms?
Optional (depending on the problem):
4.a When in the process (at what step or station) does the problem occur?
4.b Where in the business is the problem seen?
4.c How many parts of the business are reported as involved?
4.d Can the problem be expressed in percentages, euros or pieces?
In addition to the primary problem statement, there are typically other related business problems that you would like to address. For example, the primary problem might be to keep current customers by predicting when they are prone to move to a competitor. Examples of related business questions are “How does the primary channel (e.g., ATM, branch visit, internet) a bank customer uses affect whether they stay or go?” or “Will lower ATM fees significantly reduce the number of high-value customers who leave?”
a . 2 Description
The description of the selection should include these 3 sections:
- Business Goal: Describe the customer’s primary objective, from a business perspective, as a problem statement. The problem statement is a concise description of the issues that you are going to address with the chosen data set. Example: Improve employee performance.
- Business Questions: Translate the goal into several questions that break the problem down into its major components. Each question is then refined into metrics. The same metric can be used to answer different questions under the same goal.
Examples: Is performance related to IQ, motivation or social skills? How can we increase the performance of an employee?
Hint: the business questions are questions that a person with no technical knowledge might ask about the business. This person usually has no way of knowing how to answer the question or which technique to use, so the questions are written purely in business terms.
- Business success criteria: Describe the criteria for a successful or useful outcome to the project from the business point of view. These might be quite specific and objectively measurable, such as a reduction of customer churn to a certain level, an improved number of detected deviations, an improved response rate of customers to a marketing campaign, or the percentage of correct patient diagnoses. They might also be general and subjective, such as “give useful insights into the relationships.” In the latter case, indicate who could make the subjective judgment.
Your description of success criteria should answer the following questions:
1. What does success look like?
2. How do I know I've completed the project?
3. How do I know I've done a great job?
4. How will all this be measured?
B ) Data analysis questions
b . 1 Determine data mining goals
A business goal states objectives in business terminology. A data mining goal states project objectives in technical terms. For example, the business goal might be “Increase catalog sales to existing customers.” A data mining goal might be “Predict how many widgets a customer will buy, given their purchases over the past three years, demographic information (age, salary, city, etc.) and the price of the item.” Similarly, a business goal might be "increase sales", while the data mining goal could be "determine customer properties with respect to their purchasing power". A business goal might be "prevent credit card fraud", while the data mining goal could be "find critical patterns for fraudulent card usage" or "build an accurate algorithm for automatic fraud detection".
The description should indicate which type of modeling technique (classification, clustering, association, …) is associated with each data mining goal.
b . 2 Data mining success criteria
Define the criteria for a successful outcome to the project in technical terms, for example a certain level of predictive accuracy or a propensity-to-purchase profile with a given degree of “lift.” As with business success criteria, it may be necessary to describe these in subjective terms, in which case the person or persons who could make the subjective judgment should be identified.
Your description of the data mining success criteria should answer the same questions as for the business success criteria.
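As an illustration of a technical success criterion, lift compares the response rate among the highest-scored customers with the overall response rate. The following is a minimal sketch, not a prescribed method: the function name `lift_at`, the scores and the outcomes are hypothetical and only serve the example.

```python
def lift_at(scores, outcomes, fraction=0.1):
    """Lift of the top `fraction` of records ranked by model score.

    scores   -- model scores, higher means more likely to respond
    outcomes -- 1 if the customer actually responded, else 0
    """
    if not scores or len(scores) != len(outcomes):
        raise ValueError("scores and outcomes must be non-empty and of equal length")
    overall_rate = sum(outcomes) / len(outcomes)
    if overall_rate == 0:
        raise ValueError("lift is undefined when nobody responded")
    # Rank records from highest to lowest score and keep the top slice.
    ranked = sorted(zip(scores, outcomes), key=lambda pair: pair[0], reverse=True)
    top_n = max(1, round(len(ranked) * fraction))
    top_rate = sum(outcome for _, outcome in ranked[:top_n]) / top_n
    return top_rate / overall_rate


# Hypothetical data: 10 customers, 3 of whom responded.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
outcomes = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
lift_top20 = lift_at(scores, outcomes, fraction=0.2)
```

A success criterion could then be phrased as, for instance, "a lift of at least 2 in the top 20% of scored customers", where the threshold itself is something to agree with the business.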
C ) Insights & recommended call to action
A call to action (CTA) is the next step a business should take, in this case based on the insight obtained. For example, if the insight concerns "customer properties with respect to their purchasing power", the CTA could be a marketing campaign aimed at retaining high-value buyers. As another example, if the data mining goal is “Predict how many widgets a customer will buy”, the CTA could be to launch a promotion for customers predicted to buy few widgets.
Task II : Data Understanding
Describe the data which has been selected and used, including: the format of the data, the quantity of data (for example, the number of records and fields in each table), the identities of the fields and any other surface features of the data which have been discovered. Explain why the selected data satisfy the relevant requirements.
Describe properties of the data obtained by visualization and simple statistics like:
- The distribution of key attributes, for example the target attribute of a prediction task;
- Relations or correlations between pairs or small numbers of attributes;
- Results of simple aggregations;
- Properties of significant sub-populations;
- Simple statistical analyses.
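The properties listed above can be obtained with very little code. The sketch below uses only the Python standard library; the records, the attribute names (`age`, `monthly_spend`, `churned`) and the helper `pearson` are hypothetical and stand in for whatever data set has been chosen.

```python
from collections import Counter
from statistics import mean, median, pstdev

# Hypothetical records: (age, monthly_spend, churned)
records = [
    (25, 40.0, "yes"),
    (32, 55.0, "no"),
    (47, 80.0, "no"),
    (51, 95.0, "no"),
    (23, 35.0, "yes"),
    (38, 60.0, "no"),
]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


ages = [r[0] for r in records]
spend = [r[1] for r in records]

# Distribution of the target attribute.
target_counts = Counter(r[2] for r in records)

# Simple statistics for one attribute.
spend_summary = {"mean": mean(spend), "median": median(spend), "std": pstdev(spend)}

# Correlation between a pair of attributes.
age_spend_corr = pearson(ages, spend)

# Simple aggregation per sub-population: mean spend by churn status.
spend_by_churn = {
    label: mean(r[1] for r in records if r[2] == label)
    for label in target_counts
}
```

In a report, each of these numbers would normally be accompanied by a plot (histogram, scatter plot, bar chart) and a sentence on what it implies for the business question.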
Task III : Data Preparation
Decide on the data to be used for analysis. Criteria include relevance to the data mining goals, quality, and technical constraints such as limits on data volume or data types. Note that data selection covers selection of attributes (columns) as well as selection of records (rows) in a table.
Describe the decisions and actions taken to address data quality problems.
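One common data quality action is filling in missing values. The sketch below shows median imputation as one possible choice, not a recommendation for every data set; the function name `impute_median` and the sample column are hypothetical.

```python
from statistics import median


def impute_median(values):
    """Replace missing values (None) with the median of the observed values.

    Returns the cleaned list and the number of values that were imputed.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = median(observed)
    cleaned = [fill if v is None else v for v in values]
    return cleaned, len(values) - len(observed)


# Hypothetical column with two missing entries.
ages = [34, None, 29, 41, None, 38]
ages_clean, n_imputed = impute_median(ages)
```

Reporting the number of imputed values alongside the chosen strategy makes the decision traceable in the write-up.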
Describe the new attributes that are constructed from one or more existing attributes in the same record. Example: area = length * width.
Describe the creation of completely new records. Example: Create records for customers who made no purchase during the past year. There was no reason to have such records in the raw data, but for modeling purposes it might make sense to explicitly represent the fact that certain customers made zero purchases.
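The zero-purchase example can be sketched as follows. The customer identifiers, the purchase log and the function name `purchase_counts` are hypothetical; the point is only that customers absent from the raw purchase data receive an explicit record with a count of zero.

```python
from collections import Counter

# Hypothetical inputs: the full customer list and last year's purchase log.
customers = ["c1", "c2", "c3", "c4"]
purchases = [("c1", 20.0), ("c3", 5.0), ("c1", 12.5)]  # (customer_id, amount)


def purchase_counts(customer_ids, purchase_log):
    """One record per customer, including customers with zero purchases."""
    counts = Counter(customer_id for customer_id, _ in purchase_log)
    # Counter returns 0 for missing keys, so absent customers get an explicit 0.
    return [{"customer_id": cid, "n_purchases": counts[cid]} for cid in customer_ids]


modeling_table = purchase_counts(customers, purchases)
```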
Task IV : Modeling
A ) Test design
Describe the intended plan for training, testing and evaluating the models. Describe the division of the available dataset into training data, test data and validation data. Describe the metrics and formulas used to measure the performance of the models.
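A test design of this kind might look like the sketch below. The 60/20/20 proportions, the random seed and the function names `split_dataset` and `classification_metrics` are assumptions made for illustration; the metrics shown apply to binary classification only, and other tasks need other measures.

```python
import random


def split_dataset(rows, train=0.6, validation=0.2, seed=42):
    """Shuffle rows and split them into training, validation and test sets.

    The 60/20/20 proportions and the seed are illustrative defaults.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


def classification_metrics(actual, predicted, positive=1):
    """Accuracy, precision and recall for a binary classifier.

    accuracy  = (TP + TN) / (TP + TN + FP + FN)
    precision = TP / (TP + FP)
    recall    = TP / (TP + FN)
    """
    pairs = list(zip(actual, predicted))
    if not pairs:
        raise ValueError("at least one prediction is required")
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


train_set, validation_set, test_set = split_dataset(range(100))
metrics = classification_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
```

Fixing the seed keeps the split reproducible, which makes it possible to compare several models on exactly the same test data.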
B ) Build model
Run the modeling tool on the prepared dataset to create one or more models. With any modeling tool, there are often a large number of parameters that can be adjusted. List the parameters and their chosen value, along with the rationale for the choice of parameter settings.
Describe the selected modeling technique. The description should include all mathematics and algorithms necessary to understand how the model works.
C ) Model description, evaluation and assessment
Describe the obtained results numerically and analytically. If possible, describe the meaning of the model’s parameters. Describe the results separately for the training set and the test set, and compare them. If you used any technique to avoid overfitting (cross-validation, bootstrapping) or to handle data imbalance, explain it.
Describe and evaluate the resulting model in terms of its accuracy and generality.
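Where cross-validation is used, the fold construction can be sketched as below. This is a plain k-fold split written for illustration, with a hypothetical function name `k_fold_indices`; it does not shuffle, so the rows are assumed to be in random order already.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) for each of k folds.

    Every sample appears in exactly one test fold.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(10, k=3))
```

Training on each `train` part and scoring on the matching `test` part gives k performance values; reporting their mean and spread, next to the training-set score, shows how well the model generalizes.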
Task V : Evaluation
Report on the interpretation of the model in light of domain and business knowledge, the data mining success criteria and the planned test design, and document any difficulties encountered in interpreting it.
If there is a verbal description of the generated model (e.g. via rules, trees, coefficients), provide an interpretation for each rule/branch/coefficient in terms of the domain and the business.
Assess the results: are they logical, are they feasible, are there too many or too few, do they make sense according to the problem?
Task VI : Plan deployment
This task takes the evaluation results and derives a strategy for deploying the data mining result(s) with respect to the business goal and success criteria. Summarize the deployment strategy, including the necessary steps and how to perform them:
- How will the results be deployed within the organization’s systems?
- How will its use be monitored and its benefits measured (where applicable)?
- For each distinct knowledge or information result, decide the following:
- How will the knowledge or information be propagated to its users?
- How will the use of the result be monitored or its benefits measured (where applicable)?
- Identify possible problems when deploying the data mining results (pitfalls of the deployment).