For our final project in my machine learning course, we participated in a Kaggle competition to predict the concentration of dissolved inorganic carbon in water samples by using ocean chemistry data. This data comes from the California Cooperative Oceanic Fisheries Investigations (CalCOFI) program.
Objective
To predict dissolved inorganic carbon, we use a linear regression model in R. We train the model on the CalCOFI data and then use it to predict dissolved inorganic carbon concentrations in parts of the ocean that aren’t included in the training data.
The variables we are using as predictors in our model are:
NO2uM - Micromoles of Nitrite per liter of seawater
NO3uM - Micromoles of Nitrate per liter of seawater
NH3uM - Micromoles of Ammonia per liter of seawater
R_TEMP - Reported (Potential) Temperature (degrees Celsius)
R_Depth - Reported Depth from pressure (meters)
R_Sal - Reported Salinity (from Specific Volume Anomaly, m³ per kg)
R_DYNHT - Reported Dynamic Height (work per unit mass)
R_Nuts - Reported Ammonium concentration (micromoles per Liter)
R_Oxy_micromol.Kg - Reported Oxygen concentration (micromoles per kilogram)
PO4uM - Micromoles of Phosphate per liter of seawater
SiO3uM - Micromoles of Silicate per liter of seawater
TA1 - Total Alkalinity (micromoles per kilogram solution)
To train a machine learning model, we need training data for the model to learn from and test data to compare its predictions against. We split the training data further into two groups, a training set and a validation set (called evaluation_set in the code). The training set is used to fit the model, while the validation set is used to evaluate how well the model performs.
# Packages these calls come from (assumed; the post's setup chunk is not shown)
library(tidyverse)
library(tidymodels)
library(here)
library(janitor)

# Read in the data used to train the model
training_data <- read_csv(here("posts", "2023-03-21_calcofi_ml", "data", "train.csv")) %>%
  clean_names() %>%
  select(-x13) # remove this column since it's all NA

# Read in the data that will be used to test the model
testing_data <- read_csv(here("posts", "2023-03-21_calcofi_ml", "data", "test.csv")) %>%
  clean_names() %>%
  mutate(ta1_x = ta1)

# Split the training data into training and evaluation sets,
# stratified by dissolved inorganic carbon concentration
data_split <- initial_split(training_data, strata = dic)

# Extract the training and evaluation sets from the split
training_set <- training(data_split)
evaluation_set <- testing(data_split)

# Take a look at the training and evaluation data
head(training_data)
head(evaluation_set)
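To make the split concrete, here is a minimal sketch on made-up data. The `toy` data frame and its values are hypothetical, not CalCOFI measurements, and the sketch assumes the rsample package (part of tidymodels) is installed. By default `initial_split()` places three quarters of the rows in the training portion, and with a numeric `strata` column it bins that column by quantiles and samples within each bin, so both portions should cover a similar range of dissolved inorganic carbon values.

```r
library(rsample)

set.seed(711)  # set before splitting, because the split itself is random

# Hypothetical stand-in for the training data: 200 made-up dic values
toy <- data.frame(
  id  = 1:200,
  dic = rnorm(200, mean = 2150, sd = 100)
)

# Same call pattern as above, with the default proportion written out
toy_split <- initial_split(toy, prop = 0.75, strata = dic)
toy_train <- training(toy_split)
toy_eval  <- testing(toy_split)

# Every row lands in exactly one of the two sets
nrow(toy_train) + nrow(toy_eval) == nrow(toy)
length(intersect(toy_train$id, toy_eval$id)) == 0

# The two sets should show similar dic distributions
summary(toy_train$dic)
summary(toy_eval$dic)
```

Because the rows are drawn at random, calling `set.seed()` before `initial_split()` is what makes the same split come back on a rerun.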
Pre-Processing Data, Creating Recipe, Creating Models, and Creating Workflow
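To pre-process the data for our model, we begin by creating a recipe in which dissolved inorganic carbon concentration is the outcome and the remaining columns, including the variables listed above, are the predictors (the formula dic ~ .). The recipe one-hot encodes any nominal predictors and normalizes the numeric predictors. We then specify a linear regression model and bundle the recipe and the model into a workflow.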
# Set seed for reproducibility
set.seed(711)

# Create a recipe
bottle_recipe <- recipe(dic ~ ., data = training_set) %>%
  step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE) %>%
  step_normalize(all_numeric(), -all_outcomes()) %>%
  prep()

# Create a model specification for linear regression
bottle_model <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

# Bundle the recipe and model spec into a workflow
bottle_wf <- workflow() %>%
  add_recipe(bottle_recipe) %>%
  add_model(bottle_model)
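To illustrate what step_normalize() does to each numeric predictor, here is a small base R sketch. The temperature values and the names `train_temp` and `new_temp` are made up for illustration. The idea is that the mean and standard deviation are estimated from the training values only, and those same two numbers are then reused to rescale any new data.

```r
# Hypothetical temperature readings (degrees Celsius), not CalCOFI data
train_temp <- c(4.2, 8.5, 10.1, 12.7, 15.3)
new_temp   <- c(6.0, 14.0)

# Statistics are estimated from the training values only
temp_mean <- mean(train_temp)
temp_sd   <- sd(train_temp)

# Normalize: subtract the training mean, then divide by the training sd
train_scaled <- (train_temp - temp_mean) / temp_sd
new_scaled   <- (new_temp - temp_mean) / temp_sd

mean(train_scaled)  # approximately 0
sd(train_scaled)    # 1, up to floating-point error
new_scaled          # new values expressed in training-set standard deviations
```

For an ordinary least squares fit with an intercept, rescaling the predictors this way changes the size of the coefficients but not the predictions, so here the step mainly makes coefficients easier to compare across predictors.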
Fit Model to Training Data and Make Predictions
# Create and train a model on the training data
fit_bottle <- bottle_wf %>%
  fit(training_set)

# Use the model to make predictions on the validation data
bottle_results <- fit_bottle %>%
  predict(evaluation_set) %>%
  bind_cols(evaluation_set) %>%
  mutate(dic_prediction = .pred_res) %>%
  relocate(dic, .before = id) %>%
  relocate(dic_prediction, .before = id) %>%
  select(-.pred_res)
Warning in predict.lm(object = object$fit, newdata = new_data, type =
"response"): prediction from rank-deficient fit; attr(*, "non-estim") has
doubtful cases
Warning in predict.lm(object = object$fit, newdata = new_data, type =
"response"): prediction from rank-deficient fit; attr(*, "non-estim") has
doubtful cases
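The "rank-deficient fit" warning comes from lm() when some coefficients cannot be estimated, typically because one predictor is an exact linear combination of others. Which columns cause it in the CalCOFI fit is not shown here, so the base R sketch below only reproduces the situation on made-up data, where `x3` is deliberately built as the sum of `x1` and `x2`.

```r
set.seed(711)

# Hypothetical data: x3 is an exact linear combination of x1 and x2
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- x1 + x2
y  <- 2 * x1 - x2 + rnorm(n, sd = 0.1)
toy <- data.frame(y, x1, x2, x3)

fit <- lm(y ~ x1 + x2 + x3, data = toy)

# lm() cannot estimate a separate coefficient for x3, so it is reported as NA
coef(fit)

# The rank of the fit is lower than the number of coefficients
fit$rank
length(coef(fit))

# alias() reports which terms are linear combinations of the others
alias(fit)
```

Running the same checks (NA coefficients, `alias()`) on the underlying lm object of a fitted workflow is one way to find the redundant predictors, which could then be dropped from the recipe.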