--- title: "Families and model types in fbrglm" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Families and model types in fbrglm} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 6, fig.height = 4 ) set.seed(20260301) ``` ## Overview `fbrglm` exposes a single formula/data interface for regularized GLMs and reuses one set of S3 methods across model types. The common operations are: - `fbrglm(formula, data, family, lambda, ...)` - `print()` / `summary()` / `coef()` / `nobs()` - `predict(newdata, type = c("link", "response", "class"), newoffset)` - `plot()` (delegates to the underlying `glmnet` / `cv.glmnet` plot) - `as_glmnet()` / `as_cv_glmnet()` Classical standard errors, *z* values, and *p* values are **not** produced under `infer = "none"` (the only inference mode currently implemented). The package returns regularized point estimates; inference modes are the next milestone. The `family` argument accepts the same forms as `glmnet::glmnet()` itself: the six canonical character strings (`"gaussian"`, `"binomial"`, `"poisson"`, `"cox"`, `"multinomial"`, `"mgaussian"`), any GLM family object (e.g. `stats::Gamma(link = "log")`), or a family object from another package (e.g. `MASS::negative.binomial(theta = 2)`). fbrglm validates the argument shape, stores both the value passed to glmnet and a display name on the fit, and wraps glmnet's errors with the offending family name when something downstream fails. ```{r setup} library(fbrglm) ``` ## Summary of supported model types | Model | fbrglm formulation | Response | Status | |--------------------------------------|---------------------------------------------------------------------|------------------------|-----------------------| | Linear regression | `family = "gaussian"` | continuous `y` | **supported** | | Logistic regression | `family = "binomial"` | 0 / 1 `y` | **supported** | | Poisson regression | `family = "poisson"` | count `y` | **supported** | | Piecewise exponential survival | `"poisson"` + interval factor + `offset(log(exposure))` | event count / interval | **supported (formulation)** | | Native Cox regression | `family = "cox"` + `Surv(time, status) ~ ...` | survival object | **supported (experimental)** | | Gamma regression | `family = stats::Gamma(link = "log")` | positive continuous | **supported (experimental)** | | Negative binomial (fixed θ) | `family = MASS::negative.binomial(theta = ...)` | overdispersed counts | **supported (fixed θ only)** | | Multinomial regression | `family = "multinomial"` | factor `y` (≥ 3 levels) | **experimental** | | Multi-response Gaussian | `family = "mgaussian"`, `cbind(y1, y2) ~ ...` LHS | matrix `y` | **experimental** | Two important caveats: - The **piecewise exponential survival** model and **native Cox** model are *not* the same thing. The former is a Poisson regression on a long-format dataset with an interval factor and `offset(log(exposure))`; the latter is glmnet's partial-likelihood Cox path triggered by `family = "cox"` together with a `Surv()` LHS. Both are useful, but the parameters they estimate and the data they consume are different. Do not conflate them. - **Negative binomial** support is for the **fixed-θ** family object (`MASS::negative.binomial(theta = ...)`). Joint estimation of θ in the style of `MASS::glm.nb()` is a different problem (alternating optimization plus a separate likelihood for θ) and is **not** implemented here. ## Linear regression ```{r linear} set.seed(101) n <- 150 dat_lm <- data.frame( y = 0.5 + 0.8 * rnorm(n) - 0.3 * rnorm(n) + rnorm(n, sd = 0.5), x1 = rnorm(n), x2 = rnorm(n) ) fit_lm <- fbrglm(y ~ x1 + x2, data = dat_lm, family = "gaussian", lambda = "fix", lambda_value = 0.05) coef(fit_lm) head(predict(fit_lm, type = "response")) ``` ## Logistic regression ```{r logistic} set.seed(102) n <- 200 dat_logit <- data.frame( y = rbinom(n, 1, 0.5), x1 = rnorm(n), x2 = rnorm(n) ) fit_logit <- fbrglm(y ~ x1 + x2, data = dat_logit, family = "binomial", lambda = "fix", lambda_value = 0.05) head(predict(fit_logit, type = "response")) # probabilities in [0, 1] head(predict(fit_logit, type = "class")) # 0 / 1 ``` ## Poisson regression with an offset A common count-data setup attaches an *exposure* (time, area, population) to each row. The Poisson likelihood is written on the event count with `log(exposure)` as an additive offset on the linear predictor. ```{r poisson} set.seed(103) n <- 200 exposure <- runif(n, min = 0.5, max = 2) x1 <- rnorm(n); x2 <- rnorm(n) rate <- exp(-1 + 0.4 * x1 - 0.2 * x2) dat_pois <- data.frame( y = rpois(n, rate * exposure), x1 = x1, x2 = x2, exposure = exposure ) fit_pois <- fbrglm(y ~ x1 + x2, data = dat_pois, family = "poisson", offset = log(dat_pois$exposure), lambda = "fix", lambda_value = 0.05) new_dat <- dat_pois[1:5, ] predict(fit_pois, newdata = new_dat, newoffset = log(new_dat$exposure), type = "response") ``` ## Piecewise exponential survival model via Poisson regression A Cox-like survival model can be fit with a Poisson regression on a long-format dataset whose rows are *subject* × *time-interval*. The construction is: - split each subject's follow-up time into a small number of intervals; - for each interval the subject is at risk in, generate one row with the time at risk (`exposure`), an event indicator (`event`), the interval label (a factor), and the time-fixed covariates; - model `event ~ x1 + x2 + interval + offset(log(exposure))` with a Poisson likelihood. This is the *piecewise exponential* model and is closely related to Cox regression when the intervals are fine enough. It is **not** the same as glmnet's partial-likelihood Cox path (see the next section). ```{r pwe-data} set.seed(104) n_subj <- 120 breaks <- c(0, 1, 2, 3) n_int <- length(breaks) - 1 x1 <- rnorm(n_subj) x2 <- rnorm(n_subj) rate <- exp(-1 + 0.4 * x1 - 0.2 * x2) T_event <- rexp(n_subj, rate) T_obs <- pmin(T_event, 3) delta <- as.integer(T_event <= 3) rows <- list() for (i in seq_len(n_subj)) { for (j in seq_len(n_int)) { lo <- breaks[j]; hi <- breaks[j + 1] if (T_obs[i] <= lo) next exposure <- min(T_obs[i], hi) - lo if (exposure <= 0) next event_here <- as.integer(delta[i] == 1L & T_obs[i] > lo & T_obs[i] <= hi) rows[[length(rows) + 1]] <- data.frame( id = i, interval = j, exposure = exposure, event = event_here, x1 = x1[i], x2 = x2[i] ) } } long_dat <- do.call(rbind, rows) long_dat$interval <- factor(long_dat$interval, levels = seq_len(n_int)) fit_pwe <- fbrglm(event ~ x1 + x2 + interval, data = long_dat, family = "poisson", offset = log(long_dat$exposure), lambda = "fix", lambda_value = 0.05) coef(fit_pwe) ``` ## Native Cox regression `family = "cox"` together with `survival::Surv(time, status) ~ ...` on the LHS hands the problem to glmnet's partial-likelihood Cox path directly. fbrglm carries the `Surv` response through `model.frame()` without converting it. `type = "class"` is not meaningful here and is rejected. ```{r cox, message = FALSE, warning = FALSE} if (requireNamespace("survival", quietly = TRUE)) { set.seed(105) n <- 200 x1 <- rnorm(n); x2 <- rnorm(n) rate <- exp(0.4 * x1 - 0.2 * x2) T_event <- rexp(n, rate) t_obs <- pmin(T_event, 3) delta <- as.integer(T_event <= 3) dat_cox <- data.frame(t_obs = t_obs, delta = delta, x1 = x1, x2 = x2) fit_cox <- fbrglm(survival::Surv(t_obs, delta) ~ x1 + x2, data = dat_cox, family = "cox", lambda = "fix", lambda_value = 0.05) fit_cox$family_name head(predict(fit_cox, type = "link")) } ``` Cox support is marked **experimental**: glmnet's `coxnet` is mature on its own, but fbrglm has not yet been exercised against the full breadth of Cox usage (strata, ties, time-varying covariates), so unusual setups may need to fall back to `as_glmnet(fit_cox)`. ## Gamma regression ```{r gamma} set.seed(106) n <- 200 x1 <- rnorm(n); x2 <- rnorm(n) eta <- 0.4 * x1 - 0.2 * x2 dat_g <- data.frame( y = rgamma(n, shape = 2, rate = exp(-eta)), x1 = x1, x2 = x2 ) fit_g <- fbrglm(y ~ x1 + x2, data = dat_g, family = stats::Gamma(link = "log"), lambda = "fix", lambda_value = 0.05) fit_g$family_name head(predict(fit_g, type = "response")) ``` ## Negative binomial regression (fixed θ) ```{r negbin, message = FALSE} if (requireNamespace("MASS", quietly = TRUE)) { set.seed(107) n <- 200 x1 <- rnorm(n); x2 <- rnorm(n) mu <- exp(0.4 * x1 - 0.2 * x2) dat_nb <- data.frame( y = rnbinom(n, mu = mu, size = 2), x1 = x1, x2 = x2 ) fit_nb <- fbrglm(y ~ x1 + x2, data = dat_nb, family = MASS::negative.binomial(theta = 2), lambda = "fix", lambda_value = 0.05) fit_nb$family_name head(predict(fit_nb, type = "response")) } ``` Joint estimation of θ (`MASS::glm.nb()` style) is **not** in scope here; θ must be chosen externally. ## Multinomial regression (experimental) ```{r multinomial} set.seed(108) n <- 200 dat_mult <- data.frame( y = factor(sample(c("a", "b", "c"), n, replace = TRUE)), x1 = rnorm(n), x2 = rnorm(n) ) fit_mult <- fbrglm(y ~ x1 + x2, data = dat_mult, family = "multinomial", lambda = "fix", lambda_value = 0.05) fit_mult$family_name p_resp <- predict(fit_mult, type = "response") dim(p_resp) # one column per class head(p_resp) head(predict(fit_mult, type = "class")) ``` For multinomial fits, `coef(fit)` returns a list of (sparse) matrices — one per class — exactly as glmnet does; callers who want the underlying object can use `as_glmnet(fit_mult)`. ## Multi-response Gaussian (experimental) ```{r mgaussian} set.seed(109) n <- 150 dat_mg <- data.frame( y1 = rnorm(n), y2 = rnorm(n), x1 = rnorm(n), x2 = rnorm(n) ) fit_mg <- fbrglm(cbind(y1, y2) ~ x1 + x2, data = dat_mg, family = "mgaussian", lambda = "fix", lambda_value = 0.05) fit_mg$family_name pred_mg <- predict(fit_mg, type = "response") dim(pred_mg) # one column per response ``` ## Limitations - `infer = "none"` only — no classical standard errors, *z* values, *p* values, or confidence intervals. Honest post-selection inference is the next milestone. - Native Cox, multinomial, and mgaussian are marked **experimental**: fit and predict work for the basic cases shown here, but the more unusual variants (Cox strata, ties handling, multinomial grouped / ungrouped) have not been exhaustively tested. - Negative binomial supports **fixed θ** only. Joint θ estimation in the spirit of `MASS::glm.nb()` is out of scope. - The piecewise exponential survival example is a Poisson formulation and is *not* identical to native Cox partial likelihood.