*Topic: Predicting Catalog Demand* | *Tool: Alteryx*

Description: The project is to predict how much money the company can expect to earn from sending out a catalog to new customers. Need to build a linear regression model and apply the result to the mail-order catalog business problem.

### Step 1: Business and Data Understanding

What decisions needs to be made?

Management from the company that manufactures and sells high-end home goods want to determine if they should send print catalog to the 250 new customers in the coming months. If the expected profit contribution from these new customers exceed $10,000, the company will send the catalog. Otherwise, the company won’t send the catalog.

What data is needed to inform those decisions?

- Predict the expected revenue from these 250 new customers.
- Use the historical customers data to build the linear regression model.
- Apply the model to the mailing list data to come up with the predicted Avg_Sale_Amount from the 250 new customers.
- Make sure to multiply Avg_Sale_Amount by Score_Yes to get predicted revenue.

- The costs of printing and distributing is $6.50 per catalog.
- The average gross margin on all products sold through the catalog is 50%.
- The expected profit = SUM(Predicted Avg_Sale_Amount * Score_Yes) * 50% – $6.50 * 250
- If the expected profit >= $10,000, the company will send the catalog. If the expected profit < $10,000, the company won’t send the catalog.

### Step 2: Analysis, Modeling, and Validation

*Provide a description of how you set up your linear regression model, what variables you used and why, and the results of the model. *

How and why did you select the predictor variables in your model?

First, just by exploring the data, I did not select Name, Customer_ID, Address, State, Responded_to_Last_Catalog. Name and Customer_ID do not affect the sales amount. Address is so unique to the customers that it’s rare that two customers have the same address. State is always CO in the dataset. Responded_to_Last_Catalog cannot be used in the linear regression model since it could not be applied to the mailing list data set.

I run the linear regression model in Alteryx by selecting the target variable as Avg_Sale_Amount, and the predictor variables as Customer_Segment, City, Zip, Store_Number, Avg_Num_Products_Purchased and #_Years_as_Customer.

Based on the result above, seems that **Customer_Segment**, **Zip** and **Avg_Num_Products_Purchased** have a significant coefficient with Avg_Sale_Amount.

Plot Zip: Apparently, there is no linear relationship with Avg_Sale_Amount.

Plot Avg_Num_Products_Purchased: there is a linear relationship between Avg_Num_Products_Purchased and Avg_Sale_Amount.

Customer_Segment is a categorical variable. The P-value is less than 0.05, so the relationship between Customer_Segment and Avg_Sale_Amount is considered to be statistically significant.

In conclusion, the target variable is Avg_Sale_Amount; the predictive variables are Customer_Segment and Avg_Num_Products_Purchased. Customer_Segment is a categorical variable and Avg_Num_Product_Purchased is a continuous variable.

Explain why you believe your linear model is a good model. You must justify your reasoning using the statistical results that your regression model created. For each variable you selected, please justify how each variable is a good fit for your model by using the p-values and R-squared values that your model produced.

Below please find the statistical results that the linear regression model created. As we can see, the P-value for each variable is less than 2.2e-16. Since the predictor variables have a p-value below 0.05, the relationship between it and the target variable Avg_Sale_Amount is considered to be statistically significant. Also, the R-square value is 0.8369, and the adjusted R-squared value is 0.8366. It is fairly high. Considering P-value and R-squared, the linear regression model is a good model.

What is the best linear regression equation based on the available data? Each coefficient should have no more than 2 digits after the decimal (ex: 1.28)

Avg_Sale_Amount = 303.46 + 66.98 * Avg_Num_Products_Purchased -149.36 (If Customer_Segment: Loyalty Club Only) + 281.84 (If Customer_Segment is Loyalty Club and Credit Card) – 245.42 (If Customer_Segment is Store Mailing List) + 0 (If Customer_Segment is Credit Card Only)

**Important: The regression equation should be in the form:**

*Y = Intercept + b1 * Variable_1 + b2 * Variable_2 + b3 * Variable_3……*

**For example:** Y = 482.24 + 28.83 * Loan_Status – 159 * Income + 49 (If Type: Credit Card) – 90 (If Type: Mortgage) + 0 (If Type: Cash)

Note that we **must **include the 0 coefficient for the type Cash.

### Step 3: Presentation/Visualization

What is your recommendation? Should the company send the catalog to these 250 customers?

My recommendation is that the company should send the catalog to these 250 customers. The expected profit is $21,987.44, which is higher than $10,000, so the company should send the catalog to these 250 new customers. (See more explanations in the following questions.)

How did you come up with your recommendation?

- First I came up with the formula using the linear regression model.
- Avg_Sale_Amount = 303.46 + 66.98 * Avg_Num_Products_Purchased -149.36 (If Customer_Segment: Loyalty Club Only) + 281.84 (If Customer_Segment is Loyalty Club and Credit Card) – 245.42 (If Customer_Segment is Store Mailing List) + 0 (If Customer_Segment is Credit Card Only)

- Apply the model to the mailing list data set to get the predicted Avg_Sale_Amount.
- Then multiply Avg_Sale_Amount by Score_Yes (which is the probability to buy) for each customer.
- Sum the amount above then multiply by 50% which is the gross margin,
- then subtract the catalog cost ($6.5 *250).

Below please find the formula in short:

**The expected profit = SUM(Predicted Avg_Sale_Amount * Score_Yes) * 50% – $6.50 * 250**

The entire process is done in Alteryx. See the process below:

What is the expected profit from the new catalog?

Using the formula and the process that I explained in the previous question, the expected profit from the new catalog is **$21,987.44.**

Here is how I come up with the result (it is done in Alteryx):

**The expected profit = SUM(Predicted Avg_Sale_Amount * Score_Yes) * 50% – $6.50 * 250**

The predicted Revenue = SUM(Predicted Avg_Sale_Amount * Score_Yes) = $47,224.87

The expected profit = $47,224.87 * 50% – $6.5 *250 = $21,987.44