Kiran Favre - Using Machine Learning for Ocean Chemistry Prediction

Background

For the final project in my machine learning course, we participated in a Kaggle competition to predict the concentration of dissolved inorganic carbon in water samples from ocean chemistry data. The data come from the California Cooperative Oceanic Fisheries Investigations (CalCOFI) program.

Objective

We will fit a linear regression model in R to predict dissolved inorganic carbon. The model is trained on the CalCOFI data and then used to predict dissolved inorganic carbon concentrations for ocean samples that are not included in the training data.

The variables we are using as predictors in our model are:

NO2uM - Micromoles of Nitrite per liter of seawater
NO3uM - Micromoles of Nitrate per liter of seawater
NH3uM - Micromoles of Ammonia per liter of seawater
R_TEMP - Reported (Potential) Temperature (degrees Celsius)
R_Depth - Reported Depth from pressure (meters)
R_Sal - Reported Salinity (from Specific Volume Anomaly, M³ per Kg)
R_DYNHT - Reported Dynamic Height (work per unit mass)
R_Nuts - Reported Ammonium concentration (micromoles per Liter)
R_Oxy_micromol.Kg - Reported Oxygen concentration (micromoles per kilogram)
PO4uM - Micromoles of Phosphate per liter of seawater
SiO3uM - Micromoles of Silicate per liter of seawater
TA1 - Total Alkalinity (micromoles per kilogram solution)
Salinity1 - Salinity (Practical Salinity Scale 1978)
Temperature_degC - Temperature (degrees Celsius)

Load and split data

To train a machine learning model, we need training data for the model to learn from and test data to compare its predictions against in order to evaluate model performance. We then split the training data further into two groups: a training set and a validation set. The training set is used to train the model, while the validation set is used to evaluate how well the model performs.

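#Load the packages used throughout this post
library(tidyverse)  #read_csv(), %>%, and dplyr verbs
library(here)       #here() for project-relative file paths
library(janitor)    #clean_names()
library(tidymodels) #initial_split(), recipe(), workflow(), metrics()

#Reading in data used to train model
training_data <- read_csv(here("posts",
                               "2023-03-21_calcofi_ml",
                               "data",
                               "train.csv")) %>%
  clean_names() %>%
  select(-x13) #remove this column since it's all NA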

#Reading in data that will be used to test model
testing_data <- read_csv(here("posts",
                               "2023-03-21_calcofi_ml",
                               "data",
                               "test.csv")) %>%
  clean_names() %>%
  mutate(ta1_x = ta1) #add a ta1_x column so the test data has the column name the model was trained on


#split the training data into training and evaluation sets, stratify by dissolved inorganic carbon concentration
data_split <- initial_split(training_data,
                            strata = dic)

#extract the training and evaluation sets from the split
training_set <- training(data_split)
evaluation_set <- testing(data_split)

#take a look at the training and evaluation sets
head(training_set)
head(evaluation_set)
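
To make the idea of a stratified split more concrete, here is a minimal base R sketch of what stratifying on a numeric outcome amounts to. It is not rsample's implementation. It assumes the outcome is binned into quartiles and 75% of each bin goes to the training set, which to our understanding matches the defaults of initial_split(). The `toy` data frame and its values are simulated for illustration only.

```r
# Toy data standing in for the CalCOFI training data (values are simulated)
set.seed(711)
toy <- data.frame(id = 1:1000,
                  dic = rnorm(1000, mean = 2150, sd = 110))

# Bin the outcome into quartiles so each bin can be sampled separately
toy$bin <- cut(toy$dic,
               breaks = quantile(toy$dic, probs = seq(0, 1, 0.25)),
               include.lowest = TRUE)

# Draw 75% of the row indices within every bin for the training set
train_rows <- unlist(lapply(split(seq_len(nrow(toy)), toy$bin),
                            function(rows) sample(rows, size = round(0.75 * length(rows)))))

toy_train <- toy[train_rows, ]
toy_eval  <- toy[-train_rows, ]

# Share of rows that each quartile bin contributes to each set
table(toy_train$bin) / nrow(toy_train)
table(toy_eval$bin) / nrow(toy_eval)
```

Because every bin is sampled at the same rate, the training and evaluation sets end up with similar distributions of dissolved inorganic carbon, which is the point of passing `strata = dic`.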

Pre-Processing Data, Creating Recipe, Creating Models, and Creating Workflow

To pre-process the data for our model, we begin by creating a recipe in which dissolved inorganic carbon concentration is the outcome and all of the variables listed above are the predictors. We then specify a linear regression model and bundle the recipe and the model into a workflow.

#set seed for reproducibility
set.seed(711)

#creating a recipe
bottle_recipe <- recipe(dic ~.,
                        data = training_set) %>% 
  step_dummy(all_nominal(),
             -all_outcomes(),
             one_hot = TRUE) %>% 
  step_normalize(all_numeric(),
                 -all_outcomes())
#prep() is not called here: the workflow preps the recipe when it is fit

#creating model specification of linear regression
bottle_model <- linear_reg() %>% 
  set_engine("lm") %>% 
  set_mode("regression")

#bundle recipe and model spec into a workflow
bottle_wf <- workflow() %>% 
  add_recipe(bottle_recipe) %>% 
  add_model(bottle_model)
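
As a rough illustration of what the normalization step in the recipe does, the base R sketch below centers and scales one predictor by hand. It is a simplified stand-in and not the recipes implementation. The temperature values are simulated, and `toy_train`, `toy_eval`, and `temp_norm` are names made up for this example. The key point is that the mean and standard deviation are learned from the training set and then reused on the evaluation set.

```r
# Simulated temperature values (not CalCOFI data)
set.seed(711)
toy_train <- data.frame(temperature_degc = rnorm(200, mean = 12, sd = 4))
toy_eval  <- data.frame(temperature_degc = rnorm(50, mean = 12, sd = 4))

# Learn the centering and scaling values from the training set only
train_mean <- mean(toy_train$temperature_degc)
train_sd   <- sd(toy_train$temperature_degc)

# Apply the same values to both sets
toy_train$temp_norm <- (toy_train$temperature_degc - train_mean) / train_sd
toy_eval$temp_norm  <- (toy_eval$temperature_degc - train_mean) / train_sd

# The training column now has mean 0 and standard deviation 1;
# the evaluation column is close to that, but not exactly
c(mean = mean(toy_train$temp_norm), sd = sd(toy_train$temp_norm))
c(mean = mean(toy_eval$temp_norm), sd = sd(toy_eval$temp_norm))
```

For an ordinary least squares model with an intercept, this rescaling changes the size of the coefficients but not the predictions. It mainly makes the coefficients easier to compare across predictors measured in different units.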

Fit Model to Training Data and Make Predictions

#creating and training a model on the training data
fit_bottle <- bottle_wf %>%
  fit(training_set)

#using the model to make predictions on the validation data   
bottle_results <- fit_bottle %>% 
  predict(evaluation_set) %>%
  bind_cols(evaluation_set) %>% 
  mutate(dic_prediction = .pred_res) %>% 
  relocate(dic,
           .before = id) %>% 
  relocate(dic_prediction,
           .before = id) %>% 
  select(-.pred_res)

Warning in predict.lm(object = object$fit, newdata = new_data, type =
"response"): prediction from rank-deficient fit; attr(*, "non-estim") has
doubtful cases
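
The warning above says the fit is rank-deficient: at least one predictor is an exact linear combination of the others, so lm() cannot estimate a separate coefficient for it. We have not checked which columns cause this in the CalCOFI data. The sketch below uses simulated data with a hypothetical duplicated temperature column, purely to show what a rank-deficient fit looks like.

```r
# Simulated data in which r_temp is an exact copy of temperature_degc
set.seed(711)
n <- 50
temp <- rnorm(n, mean = 12, sd = 3)
toy <- data.frame(temperature_degc = temp,
                  r_temp = temp,
                  depth = runif(n, min = 0, max = 500))
toy$dic <- 2300 - 15 * toy$temperature_degc + 0.2 * toy$depth + rnorm(n, sd = 5)

# Fit a linear regression on all predictors, as the recipe formula dic ~ . does
fit <- lm(dic ~ ., data = toy)

# The duplicated predictor gets an NA coefficient
coef(fit)

# The rank of the fit is smaller than the number of coefficients
fit$rank
length(coef(fit))
```

If the same thing happens in the real data, one possible remedy would be to drop the redundant column before fitting, but that is a modelling choice we have not made in this post.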

#retrieve and evaluate our predictions
bottle_metrics <- bottle_results %>%
  metrics(estimate = dic_prediction,
          truth = dic)

bottle_metrics

# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       6.82 
2 rsq     standard       0.996
3 mae     standard       3.45
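
For readers who want to see what these three metrics measure, here is a small base R sketch that computes them by hand. The `truth` and `estimate` values are made up for illustration and are not taken from our results. The R-squared shown is the squared correlation between truth and estimate, which to our understanding is how yardstick defines `rsq`.

```r
# Made-up observed and predicted DIC values (not from the model above)
truth    <- c(2150, 2200, 2310, 1995, 2260)
estimate <- c(2155, 2190, 2318, 2001, 2251)

# Root mean squared error: penalizes large errors more heavily
rmse <- sqrt(mean((truth - estimate)^2))

# Mean absolute error: average size of the errors, in the units of DIC
mae <- mean(abs(truth - estimate))

# R-squared as the squared correlation between truth and estimate
rsq <- cor(truth, estimate)^2

c(rmse = rmse, rsq = rsq, mae = mae)
```

Both rmse and mae are in the same units as dissolved inorganic carbon, so they can be read directly against the range of predicted values.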

Test Model

## Outputting predictions for our testing data
test_data_predictions <- fit_bottle %>% 
  predict(testing_data) %>%
  bind_cols(testing_data) %>% 
  mutate(DIC = .pred_res) %>% 
  relocate(DIC,
           .before = id) %>% 
  select(id,
         DIC)

Warning in predict.lm(object = object$fit, newdata = new_data, type =
"response"): prediction from rank-deficient fit; attr(*, "non-estim") has
doubtful cases

test_data_predictions

# A tibble: 485 × 2
      id   DIC
   <dbl> <dbl>
 1  1455 2173.
 2  1456 2194.
 3  1457 2325.
 4  1458 1993.
 5  1459 2147.
 6  1460 2036.
 7  1461 2159.
 8  1462 2196.
 9  1463 2270.
10  1464 2314.
# ℹ 475 more rows

Citation

BibTeX citation:

@online{favre2023,
  author = {Favre, Kiran},
  title = {Using {Machine} {Learning} for {Ocean} {Chemistry}
    {Prediction}},
  date = {2023-03-21},
  url = {https://kiranfavre.github.io/posts/2023-03-21_ml_predictions/},
  langid = {en}
}

For attribution, please cite this work as:

Favre, Kiran. 2023. “Using Machine Learning for Ocean Chemistry Prediction.” March 21, 2023. https://kiranfavre.github.io/posts/2023-03-21_ml_predictions/.