Zarathu
| Outcome | Default | Repeated measure | Survey |
|---|---|---|---|
| Continuous | Linear regression | GEE | Survey GLM |
| Event | GLM (logistic) | GEE | Survey GLM |
| Time & Event | Cox | Marginal Cox | Survey Cox |
| 0, 1, 2, 3 (rare event) | GLM (Poisson) | GEE | Survey GLM |
Continuous
Linear regression predicts a continuous outcome (dependent variable) by fitting the best straight-line relationship to one or more explanatory variables (independent variables).
\[Y = \beta_0 + \beta_1 X + \epsilon\]

Estimate \(\beta_0\) and \(\beta_1\) by minimizing the sum of squared errors.
- \(Y\) must be continuous and normally distributed.
- \(X\) can be continuous or categorical.
  - If \(X\) is continuous: equivalent to correlation analysis.
  - If \(X\) is binary: equivalent to a t-test with equal variances.
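The equivalence with correlation can be checked numerically: with a single continuous predictor, the least-squares slope equals \(r \times s_Y/s_X\). A minimal numpy sketch with simulated data (all numbers below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (made-up numbers): Y = 2 + 0.5*X + noise.
x = rng.normal(size=200)
y = 2.0 + 0.5 * x + rng.normal(size=200)

# Least-squares estimates for Y = b0 + b1*X + e.
b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()

# With one continuous predictor, the slope equals r * (sd_Y / sd_X),
# which is why simple regression and correlation give the same test.
r = np.corrcoef(x, y)[0, 1]
assert np.isclose(b1, r * y.std(ddof=1) / x.std(ddof=1))
```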
This regression estimates how `age` affects `nodes` (the number of affected lymph nodes).

- R = .093, R² = .009: a very weak linear relationship.
- Unstandardized coefficient for `age` = -0.028, p < .001.
- As age increases by 1 year, the expected number of nodes decreases by 0.028. The relationship is statistically significant but weak.
This reverses the regression direction, predicting age from nodes.
Although both models are statistically significant (p < .001), the effect size (R² ≈ 0.009) is very small. This indicates that only about 0.9% of the variance in either variable is explained by the other.
We can also use the Bivariate Correlation function to test linear association between two continuous variables.
There is a weak negative, yet statistically significant, correlation between age and nodes.
This analysis fits a linear regression model where time is predicted by sex.
There is no statistically significant association between sex and time.
T-test and Simple Linear Regression are equivalent when comparing 2 groups.
To compare the variable time between male and female groups, use a t-test in SPSS:
The regression result is consistent with the t-test. When the predictor has only two levels, the t-test gives the same result as regression and is often more intuitive.
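The equivalence is easy to verify numerically: with a binary 0/1 predictor, the least-squares intercept is the mean of the reference group and the slope is the difference in group means, exactly the quantity the t-test examines. A sketch with simulated data (group sizes and means are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical two-group data (e.g., time for the two sexes); numbers made up.
g0 = rng.normal(50, 10, 80)   # group coded X = 0
g1 = rng.normal(55, 10, 70)   # group coded X = 1

x = np.concatenate([np.zeros(80), np.ones(70)])
y = np.concatenate([g0, g1])

# Least-squares fit of Y = b0 + b1*X.
b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()

# The intercept is the mean of group 0; the slope is the difference in
# group means -- exactly what the t-test compares.
assert np.isclose(b0, g0.mean())
assert np.isclose(b1, g1.mean() - g0.mean())
```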
The variable `rx` includes three treatment groups:

- 1 = Lev
- 2 = Lev+5FU
- 3 = Obs (reference)
In SPSS, linear regression does not automatically treat string variables as categorical. To include a categorical variable like `rx` in a regression model, we must first convert it to numeric:

- Use Automatic Recode → Numeric values starting from 1
- Result: 1 = Lev, 2 = Lev+5FU, 3 = Obs
Now that `rx_new` is numeric, SPSS will automatically create dummy variables when it is used in regression, BUT it will use the lowest number as the reference group (e.g., 1 = Lev).

To change the reference group, create two dummy variables via Transform → Compute Variable…:

- `rx_lev` = 1 if `rx_new` == 1, else 0
- `rx_lev5fu` = 1 if `rx_new` == 2, else 0

The reference group (the rows where both dummy variables are 0) is Obs, which is not coded explicitly.
Run the regression with `time` as Dependent and `rx_lev`, `rx_lev5fu` as Independent(s).
- The constant is the mean of the Obs group.
- Each coefficient is the difference from Obs:
  - `rx_lev` = difference between Lev and Obs
  - `rx_lev5fu` = difference between Lev+5FU and Obs
- Only `rx_lev5fu` shows a significant difference (p < .001).
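The dummy-coding arithmetic can be checked numerically: with an intercept plus the two dummies, least squares recovers exactly the reference-group mean and the differences from it. A sketch with made-up group means and sizes:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical group data for Lev, Lev+5FU, and Obs (Obs = reference).
lev = rng.normal(1500, 400, 60)
lev5fu = rng.normal(2100, 400, 60)
obs = rng.normal(1400, 400, 60)

y = np.concatenate([lev, lev5fu, obs])
rx_lev = np.concatenate([np.ones(60), np.zeros(120)])
rx_lev5fu = np.concatenate([np.zeros(60), np.ones(60), np.zeros(60)])

# Design matrix: intercept + two dummies; Obs rows have both dummies = 0.
X = np.column_stack([np.ones(180), rx_lev, rx_lev5fu])
b0, b_lev, b_lev5fu = np.linalg.lstsq(X, y, rcond=None)[0]

# The constant is the Obs mean; each coefficient is a difference from Obs.
assert np.isclose(b0, obs.mean())
assert np.isclose(b_lev, lev.mean() - obs.mean())
assert np.isclose(b_lev5fu, lev5fu.mean() - obs.mean())
```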
We can also use ANOVA to test whether any of the `rx_new` groups differ overall.
- Dependent = `time`
- Factor = `rx_new`
This gives a single overall p-value for the group comparison.
\[Y = \beta_0 + \beta_1 X_{1} + \beta_2 X_{2} + \cdots + \epsilon\]
Interpretation of \(\beta_1\): holding \(X_2\), \(X_3\), etc. constant, a one-unit increase in \(X_1\) is associated with a change of \(\beta_1\) in \(Y\).
In academic papers, it is common to present results both before and after adjustment. This allows comparison of crude vs. adjusted effects.
Analyze each variable one at a time (crude model):

- Go to Analyze → Regression → Linear
- Set `time` as Dependent
- Add one predictor (e.g., `sex`) to Independent(s)
Control for multiple variables (adjusted model):

- Set `time` as Dependent
- Add all predictors (e.g., `sex`, `age`, `rx_new`) to Independent(s)
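The crude vs. adjusted contrast can be illustrated with simulated data: when a confounder (here, age) drives both the exposure and the outcome, the unadjusted coefficient is biased and adjustment removes most of the bias. All numbers below are made up, and `ols` is a small helper for this sketch, not an SPSS feature:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500

# Made-up confounded data: age drives both the exposure and the outcome,
# while the exposure itself has no true effect on the outcome.
age = rng.normal(60, 10, n)
exposure = (age + rng.normal(0, 10, n) > 60).astype(float)
time = 3000 - 20 * age + rng.normal(0, 200, n)

def ols(cols, y):
    """Least-squares coefficients for an intercept plus the given columns."""
    X = np.column_stack([np.ones(len(y))] + cols)
    return np.linalg.lstsq(X, y, rcond=None)[0]

crude = ols([exposure], time)[1]          # unadjusted (crude) coefficient
adjusted = ols([exposure, age], time)[1]  # coefficient after adjusting for age

# The crude model blames the exposure for the age effect; adjustment removes it.
print(round(crude, 1), round(adjusted, 1))
```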
Event
Logistic regression predicts a categorical outcome by estimating its probability. Unlike linear regression, it uses a logistic (sigmoid) curve so that the output is always constrained between 0 and 1. Internally, it models the log-odds of the event, ln(P/(1−P)), which can then be converted back into the probability P.
An Odds Ratio (OR) is a measure of association that compares the odds of an event occurring in one group to the odds in another group. In logistic regression, the OR for a one-unit increase in \(X_1\) is \(\exp(\beta_1)\).
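The conversion between log-odds, probability, and the odds ratio can be traced in a few lines. The coefficients below are made up purely for illustration:

```python
import math

# Made-up coefficients: suppose a fitted model gives log-odds = b0 + b1*x.
b0, b1 = -2.0, 0.7

def probability(x):
    """Convert the linear predictor (log-odds) back to P via the sigmoid."""
    log_odds = b0 + b1 * x
    return 1 / (1 + math.exp(-log_odds))

# The sigmoid keeps P between 0 and 1.
assert 0 < probability(-10) < probability(10) < 1

def odds(p):
    return p / (1 - p)

# A one-unit increase in x multiplies the odds by exp(b1): the odds ratio.
OR = odds(probability(1.0)) / odds(probability(0.0))
assert math.isclose(OR, math.exp(b1))
```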
Analyze → Regression → Binary Logistic…

- Move `status` to the Dependent box.
- Move `sex`, `age`, and `rx_new` into the Covariates box.
- For `rx_new`, move it to Categorical Covariates and set the reference category (Obs).
- Exp(B) gives the odds ratios.
Time & Event
Survival analysis focuses on the duration until a specific event of interest, such as death, occurs. A characteristic of this analysis is censoring, which occurs when the event has not been observed for some subjects by the end of the study, or when subjects are lost to follow-up. Most data are right-censored: for each individual we either observe the day of the event or only know that they survived up to their last day of follow-up.
The first step in any survival analysis is asking, “How is this group surviving over time?” The Kaplan-Meier estimator is the most common method used to answer this.
It provides an estimation of the Survival Function, S(t), which is the probability that an individual will survive past time t. So the resulting Kaplan-Meier curve is a step function. The survival probability remains flat until an event occurs, at which point the curve drops.
Set `time` as Time, `status(1)` as Status, and `rx_new` as Factor. The curve is usually accompanied by the log-rank test p-value to compare survival between groups.
\[ \begin{aligned} P(t) &= \frac{\text{Survived at } t}{\text{At risk at } t} \quad \text{(Interval survival)} \\\\ S(t) &= S(t-1) \times P(t) \end{aligned} \]
A Kaplan–Meier curve is built by chaining together little “survival chances” at each time when at least one death happens. Start with everyone alive at time zero and imagine the survival probability is 1. At each event time, look only at the people who were still under observation just before that moment—the “risk set.” The step size is simply the fraction of that group who make it past the event time. We multiply these fractions across event times to keep updating the curve. Censored observations don’t cause a drop; they only shrink the risk set for the future.
Now apply that idea to the figure. Five people begin. At time 3, one person dies while five were still at risk, so four out of five get past time 3; the curve steps down to 0.80. At time 4, someone is censored. Because no one dies, the curve stays flat at 0.80, but the risk set for later times is now smaller. By time 5, three people are still under observation and two die there; only one of those three gets past time 5, so we scale the curve by one-third, landing around 0.27. At time 8, the last person is censored, so nothing changes numerically—the curve remains about 0.27 through study end.
Read this way, the KM plot is a staircase that only steps down on death times. Each step reflects “what fraction of those still being followed just before this time survived beyond it,” and the final height (about 0.27 here) is the estimated probability of surviving past the last event time.
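The staircase logic can be written directly as code. This minimal estimator reproduces the five-subject example above (death at time 3, censoring at 4, two deaths at 5, censoring at 8):

```python
# Minimal Kaplan-Meier sketch. Each record is (time, status):
# status 1 = death, 0 = censored.
records = [(3, 1), (4, 0), (5, 1), (5, 1), (8, 0)]

def kaplan_meier(records):
    """Return [(event_time, S(t))] by chaining interval survival fractions."""
    s = 1.0
    curve = []
    for t in sorted({t for t, status in records if status == 1}):
        at_risk = sum(1 for u, _ in records if u >= t)       # risk set just before t
        deaths = sum(1 for u, st in records if u == t and st == 1)
        s *= (at_risk - deaths) / at_risk                    # interval survival P(t)
        curve.append((t, s))
    return curve

curve = kaplan_meier(records)
# Drops to 4/5 = 0.80 after the death at t=3; censoring at t=4 causes no drop;
# at t=5 two of three at risk die, so S = 0.80 * (1/3), about 0.27.
assert curve == [(3, 0.8), (5, 0.8 * (1 / 3))]
```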
While KM curves and Log-Rank tests are excellent for comparing groups based on a single categorical variable (like “Treatment A” vs. “Treatment B”), they cannot handle the influence of multiple factors simultaneously (like age, tumor size, blood pressure, and treatment type).
The Survival function S(t) described the probability of not having an event by time t. In contrast, the Hazard Function describes the instantaneous risk of the event occurring precisely at time t, given that the individual has survived up to that moment. A high hazard means a high immediate risk of failure.
\[ \begin{aligned} h(t) &= \exp({\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots}) \\\\ &= h_0(t) \exp({\beta_1 X_1 + \beta_2 X_2 + \cdots}) \end{aligned} \]
The Cox model ultimately evaluates the Hazard Ratio (HR). When \(X_1\) increases by 1, \(h(t)\) increases by a factor of \(\exp(\beta_1)\). In other words, \[\text{HR} = \exp{(\beta_1)}\]
- HR = 1: the covariate has no effect on the hazard.
- HR > 1: the covariate increases the risk (e.g., smoking). An HR of 2.0 means the group has double the instantaneous risk of the event compared to the reference group.
- HR < 1: the covariate is protective and decreases the risk (e.g., an effective treatment).
- Time = `time`, Status = `status`
- Move `sex`, `age`, `rx_new` to Covariates
- For `rx_new`, set Reference Category to Last and click Change
- Exp(B) = hazard ratio

The most crucial assumption of this model is the Proportional Hazards Assumption: the HR comparing any two groups must be constant over the entire follow-up time. Visually, this means the survival curves should not cross, and the relative distance between the hazard functions must remain the same over time. No formal test is strictly required; the assumption can be checked visually.
Landmark analysis is an alternative approach often used to handle specific biases (like “immortal time bias,” where patients must survive long enough to receive a treatment, making the treatment look artificially successful).
This method involves setting a specific time point (=landmark point) and includes only those subjects who survived up to the landmark time. These subjects are then divided into groups based on their status at the landmark (e.g., those who received a transplant by 6 months vs. those who did not). The survival analysis (using KM and Log-Rank) then begins from the landmark point onward, comparing the subsequent survival of these fixed groups.
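The two steps can be sketched in code. The subjects below are entirely hypothetical; the point is only the exclusion rule and the clock reset at the landmark:

```python
# Minimal landmark-analysis sketch. Each subject is
# (followup_time, status, treated_by_landmark): status 1 = death, 0 = censored.
LANDMARK = 6  # hypothetical landmark time, e.g., 6 months

subjects = [
    (3, 1, False),   # died before the landmark -> excluded
    (5, 0, True),    # lost to follow-up before the landmark -> excluded
    (10, 1, True),
    (14, 0, False),
    (20, 1, False),
    (24, 0, True),
]

# Step 1: keep only subjects still under observation at the landmark.
at_landmark = [s for s in subjects if s[0] >= LANDMARK]

# Step 2: group by status *at the landmark* and restart the clock there,
# which avoids immortal time bias in the treated group.
def landmark_group(treated):
    return [(t - LANDMARK, status) for t, status, tr in at_landmark if tr == treated]

treated, untreated = landmark_group(True), landmark_group(False)
assert treated == [(4, 1), (18, 0)]
assert untreated == [(8, 0), (14, 1)]
```

The KM and log-rank comparison would then be run on `treated` vs. `untreated` from time zero at the landmark.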
If the effect of a variable (its HR) changes over time, a time-dependent Cox model includes an interaction between the variable and time, allowing the model to estimate an HR that is a function of time, HR(t), rather than a single constant.
Piecewise time-dependent Cox model: this partitions the follow-up time into three distinct intervals and estimates a separate coefficient (effect) for a given variable within each interval.
```
Call:
coxph(formula = Surv(tstart, time, status) ~ trt + prior + karno:strata(tgroup),
    data = vet2)

  n= 225, number of events= 128

                                 coef exp(coef)  se(coef)      z Pr(>|z|)
trt                         -0.011025  0.989035  0.189062 -0.058    0.953
prior                       -0.006107  0.993912  0.020355 -0.300    0.764
karno:strata(tgroup)tgroup=1 -0.048755 0.952414  0.006222 -7.836 4.64e-15 ***
karno:strata(tgroup)tgroup=2  0.008050 1.008083  0.012823  0.628    0.530
karno:strata(tgroup)tgroup=3 -0.008349 0.991686  0.014620 -0.571    0.568
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

                             exp(coef) exp(-coef) lower .95 upper .95
trt                             0.9890      1.011    0.6828    1.4327
prior                           0.9939      1.006    0.9550    1.0344
karno:strata(tgroup)tgroup=1    0.9524      1.050    0.9409    0.9641
karno:strata(tgroup)tgroup=2    1.0081      0.992    0.9831    1.0337
karno:strata(tgroup)tgroup=3    0.9917      1.008    0.9637    1.0205

Concordance= 0.725  (se = 0.024 )
Likelihood ratio test= 63.04  on 5 df,   p=3e-12
Wald test            = 63.7  on 5 df,   p=2e-12
Score (logrank) test = 71.33  on 5 df,   p=5e-14
```
https://openstat.ai/