Learn how to perform linear and logistic regression using a generalized linear model (GLM) in Azure Databricks. glm fits a generalized linear model, similar to R's glm().
Syntax: glm(formula, data, family, ...)
Parameters:
- formula: Symbolic description of the model to be fitted, for example ResponseVariable ~ Predictor1 + Predictor2. Supported operators: ~, +, -, and .
- data: Any SparkDataFrame
- family: String, "gaussian" for linear regression or "binomial" for logistic regression
- lambda: Numeric, regularization parameter
- alpha: Numeric, elastic-net mixing parameter
Output: MLlib PipelineModel
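Before working through the tutorial, the signature above can be sketched as a single end-to-end call. The data frame and column names below (df, y, x1, x2) and the regularization values are illustrative assumptions, not part of the tutorial's dataset:

```r
library(SparkR)

# Illustrative sketch only: assumes a SparkDataFrame `df` with a numeric
# response column y and predictor columns x1 and x2.
# lambda and alpha are optional; together they configure elastic-net
# regularization (lambda = 0 means no regularization).
model <- glm(y ~ x1 + x2, data = df, family = "gaussian",
             lambda = 0.1, alpha = 0.5)
summary(model)
```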
This tutorial shows how to perform linear and logistic regression on the diamonds dataset.
Load diamonds data and split into training and test sets
require(SparkR)
# Read diamonds.csv dataset as SparkDataFrame
diamonds <- read.df("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv",
                    source = "com.databricks.spark.csv", header = "true", inferSchema = "true")
diamonds <- withColumnRenamed(diamonds, "", "rowID")
# Split data into Training set and Test set
trainingData <- sample(diamonds, FALSE, 0.7)
testData <- except(diamonds, trainingData)
# Exclude rowIDs
trainingData <- trainingData[, -1]
testData <- testData[, -1]
print(count(diamonds))
print(count(trainingData))
print(count(testData))
head(trainingData)
Train a linear regression model using glm()
This section shows how to predict a diamond's price from its features by training a linear regression model using the training data.
The dataset contains a mix of categorical features (cut: Ideal, Premium, Very Good, ...) and continuous features (depth, carat). SparkR automatically encodes these features, so you don't have to encode them manually.
# Family = "gaussian" to train a linear regression model
lrModel <- glm(price ~ ., data = trainingData, family = "gaussian")
# Print a summary of the trained model
summary(lrModel)
Use predict() on the test data to see how well the model performs on new data.
Syntax: predict(model, newData)
Parameters:
- model: MLlib model
- newData: SparkDataFrame, typically your test set
Output: SparkDataFrame
# Generate predictions using the trained model
predictions <- predict(lrModel, newData = testData)
# View predictions against the price column
display(select(predictions, "price", "prediction"))
Evaluate the model.
errors <- select(predictions, predictions$price, predictions$prediction, alias(predictions$price - predictions$prediction, "error"))
display(errors)
# Calculate RMSE
head(select(errors, alias(sqrt(sum(errors$error^2, na.rm = TRUE) / nrow(errors)), "RMSE")))
Train a logistic regression model using glm()
This section shows how to train a logistic regression model on the same dataset to predict a diamond's cut based on some of its features.
Logistic regression in MLlib supports binary classification. To test the algorithm in this example, subset the data to work with two labels.
# Subset data to include rows where diamond cut = "Premium" or diamond cut = "Very Good"
trainingDataSub <- subset(trainingData, trainingData$cut %in% c("Premium", "Very Good"))
testDataSub <- subset(testData, testData$cut %in% c("Premium", "Very Good"))
# Family = "binomial" to train a logistic regression model
logrModel <- glm(cut ~ price + color + clarity + depth, data = trainingDataSub, family = "binomial")
# Print summary of the trained model
summary(logrModel)
# Generate predictions using the trained model
predictionsLogR <- predict(logrModel, newData = testDataSub)
# View predictions against label column
display(select(predictionsLogR, "label", "prediction"))
Evaluate the model.
errorsLogR <- select(predictionsLogR, predictionsLogR$label, predictionsLogR$prediction, alias(abs(predictionsLogR$label - predictionsLogR$prediction), "error"))
display(errorsLogR)
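The errorsLogR frame above can be summarized into a single accuracy figure. The sketch below assumes, as the error column's construction suggests, that label and prediction are numeric and that error is 0 exactly when the prediction is correct:

```r
# Accuracy = fraction of rows where the prediction matched the label.
# Assumes errorsLogR$error is 0 for correct predictions, 1 otherwise.
correct <- count(where(errorsLogR, errorsLogR$error == 0))
total   <- count(errorsLogR)
print(correct / total)
```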