Latent Variable Models
Formulas on this page use the denominator layout. See Layout Conventions for details.
Overview
A family of methods that estimate latent structures (factors, structural equations, abilities) that cannot be observed directly. They share the same skeleton: the observed distribution is the joint distribution of latent and observed variables, marginalized over the latents. The resulting likelihoods are complex, which makes matrix calculus an essential tool.
This page covers factor analysis (the ML objective and gradients with respect to three parameter blocks), structural equation modeling (SEM) (LISREL/RAM covariance structures and the general gradient formula), and item response theory (IRT) (gradients of 2PL/3PL with respect to discrimination, difficulty, ability, and guessing, plus the information function).
Key formula highlights:
- Factor analysis ML objective $F_{\text{ML}} = \log|\boldsymbol{\Sigma}| + \text{tr}(\boldsymbol{S}\boldsymbol{\Sigma}^{-1}) - \log|\boldsymbol{S}| - p$
- SEM general gradient $\partial F/\partial \theta = \text{tr}[(\boldsymbol{\Sigma}^{-1} - \boldsymbol{\Sigma}^{-1}\boldsymbol{S}\boldsymbol{\Sigma}^{-1})\partial \boldsymbol{\Sigma}/\partial \theta]$
- IRT 2PL/3PL likelihood gradients and the Fisher information function
Factor Analysis
19.1 ML objective for factor analysis
Proof
Factor analysis evaluates whether the observed covariance $\boldsymbol{S}$ can be explained by the model covariance $\boldsymbol{\Sigma}(\boldsymbol{\theta}) = \boldsymbol{\Lambda}\boldsymbol{\Phi}\boldsymbol{\Lambda}^\top + \boldsymbol{\Psi}$. Up to a constant, the multivariate normal log-likelihood is
$$\ell = -\dfrac{n}{2}\!\left(\log|\boldsymbol{\Sigma}| + \text{tr}(\boldsymbol{S}\boldsymbol{\Sigma}^{-1})\right)$$
Instead of maximizing $\ell$, we minimize $-2\ell/n$, which becomes $\log|\boldsymbol{\Sigma}| + \text{tr}(\boldsymbol{S}\boldsymbol{\Sigma}^{-1})$.
This quantity attains its minimum value $\log|\boldsymbol{S}| + p$ at $\boldsymbol{\Sigma} = \boldsymbol{S}$ (equivalently, the discrepancy below is twice the Kullback-Leibler divergence from $\mathcal{N}(\boldsymbol{0},\boldsymbol{S})$ to $\mathcal{N}(\boldsymbol{0},\boldsymbol{\Sigma})$, hence non-negative). Subtracting that minimum to anchor the optimum at $0$ yields the desired discrepancy function
$$F_{\text{ML}} = \log|\boldsymbol{\Sigma}| + \text{tr}(\boldsymbol{S}\boldsymbol{\Sigma}^{-1}) - \log|\boldsymbol{S}| - p \geq 0$$
$F_{\text{ML}} = 0$ corresponds to perfect fit ($\boldsymbol{\Sigma} = \boldsymbol{S}$), and the quantity is used for $\chi^2$ testing: in large samples $n F_{\text{ML}}$ follows a $\chi^2$ distribution whose degrees of freedom equal $p(p+1)/2$ minus the number of free parameters.
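As a quick numerical sanity check, the sketch below evaluates $F_{\text{ML}}$ for a simulated one-factor model and confirms it is positive away from the optimum and exactly zero at $\boldsymbol{\Sigma} = \boldsymbol{S}$. The helper name `f_ml` and all numerical values are illustrative assumptions, not from the text.

```python
# A minimal sketch of the F_ML discrepancy (assumed example values).
import numpy as np

def f_ml(S, Sigma):
    """F_ML = log|Sigma| + tr(S Sigma^{-1}) - log|S| - p."""
    p = S.shape[0]
    Sigma_inv = np.linalg.inv(Sigma)
    return (np.linalg.slogdet(Sigma)[1] + np.trace(S @ Sigma_inv)
            - np.linalg.slogdet(S)[1] - p)

rng = np.random.default_rng(0)
Lam = rng.normal(size=(4, 1))            # loadings (4 items, 1 factor)
Phi = np.array([[1.0]])                  # factor variance
Psi = np.diag(rng.uniform(0.3, 0.6, 4))  # unique variances
Sigma = Lam @ Phi @ Lam.T + Psi          # model-implied covariance

X = rng.multivariate_normal(np.zeros(4), Sigma, size=500)
S = np.cov(X, rowvar=False)              # sample covariance

print(f_ml(S, Sigma))   # small but > 0 (sampling error only)
print(f_ml(S, S))       # exactly 0 at Sigma = S (up to rounding)
```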
19.2 Gradient with respect to factor loadings
Proof
We compute the gradient in two stages. First, the gradient of $F_{\text{ML}}$ with respect to $\boldsymbol{\Sigma}$: the $\log$-determinant differentiates to $\boldsymbol{\Sigma}^{-1}$, and the trace $\text{tr}(\boldsymbol{S}\boldsymbol{\Sigma}^{-1})$ differentiates to $-\boldsymbol{\Sigma}^{-1}\boldsymbol{S}\boldsymbol{\Sigma}^{-1}$, so
$$\dfrac{\partial F_{\text{ML}}}{\partial \boldsymbol{\Sigma}} = \boldsymbol{\Sigma}^{-1} - \boldsymbol{\Sigma}^{-1}\boldsymbol{S}\boldsymbol{\Sigma}^{-1}$$
Writing $\boldsymbol{G}$ for this matrix, we can read it as a "residual matrix": $\boldsymbol{G} = \boldsymbol{0}$ exactly when $\boldsymbol{S} = \boldsymbol{\Sigma}$ (perfect model fit).
Next we use the chain rule to pass to the gradient with respect to $\boldsymbol{\Lambda}$. Using the differential form $dF = \text{tr}(\boldsymbol{G}\,d\boldsymbol{\Sigma})$, the total differential of $\boldsymbol{\Sigma} = \boldsymbol{\Lambda}\boldsymbol{\Phi}\boldsymbol{\Lambda}^\top + \boldsymbol{\Psi}$ obeys the product rule
$$d\boldsymbol{\Sigma} = (d\boldsymbol{\Lambda})\boldsymbol{\Phi}\boldsymbol{\Lambda}^\top + \boldsymbol{\Lambda}\boldsymbol{\Phi}(d\boldsymbol{\Lambda})^\top$$
Substituting into the trace and using the cyclic property $\text{tr}(\boldsymbol{A}\boldsymbol{B}) = \text{tr}(\boldsymbol{B}\boldsymbol{A})$ to move $d\boldsymbol{\Lambda}$ to the end, the symmetry of $\boldsymbol{G}$ ($\boldsymbol{G}^\top = \boldsymbol{G}$) makes the two terms equal:
$$dF = \text{tr}(\boldsymbol{G}\boldsymbol{\Lambda}\boldsymbol{\Phi}\,d\boldsymbol{\Lambda}^\top) + \text{tr}(\boldsymbol{G}\boldsymbol{\Lambda}\boldsymbol{\Phi}\,d\boldsymbol{\Lambda}^\top) = 2\,\text{tr}(\boldsymbol{G}\boldsymbol{\Lambda}\boldsymbol{\Phi}\,d\boldsymbol{\Lambda}^\top)$$
Comparing with $dF = \text{tr}((\partial F/\partial \boldsymbol{\Lambda})^\top d\boldsymbol{\Lambda})$ gives
$$\dfrac{\partial F_{\text{ML}}}{\partial \boldsymbol{\Lambda}} = 2\boldsymbol{G}\boldsymbol{\Lambda}\boldsymbol{\Phi} = 2(\boldsymbol{\Sigma}^{-1} - \boldsymbol{\Sigma}^{-1}\boldsymbol{S}\boldsymbol{\Sigma}^{-1})\boldsymbol{\Lambda}\boldsymbol{\Phi} \quad \square$$
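A minimal sketch, under an arbitrary two-factor setup with a random data covariance, that forms $\boldsymbol{G}$ and the loading gradient $2\boldsymbol{G}\boldsymbol{\Lambda}\boldsymbol{\Phi}$ and compares one entry against a central finite difference. Helper and variable names (`f_ml`, `grad_Lam`) and all values are illustrative assumptions.

```python
# Sketch: residual matrix G and the loading gradient 2 G Lam Phi,
# checked against a central finite difference.
import numpy as np

def f_ml(S, Sigma):
    p = S.shape[0]
    Si = np.linalg.inv(Sigma)
    return (np.linalg.slogdet(Sigma)[1] + np.trace(S @ Si)
            - np.linalg.slogdet(S)[1] - p)

rng = np.random.default_rng(1)
p, m = 5, 2
Lam = rng.normal(size=(p, m))
Phi = np.eye(m)
Psi = np.diag(rng.uniform(0.4, 0.8, p))
S = np.cov(rng.normal(size=(300, p)), rowvar=False)   # any SPD "data" covariance

Sigma = Lam @ Phi @ Lam.T + Psi
Si = np.linalg.inv(Sigma)
G = Si - Si @ S @ Si                      # dF/dSigma
grad_Lam = 2 * G @ Lam @ Phi              # dF/dLambda

# Finite-difference check on one loading entry
eps, (i, k) = 1e-6, (2, 1)
Lp, Lm = Lam.copy(), Lam.copy()
Lp[i, k] += eps; Lm[i, k] -= eps
num = (f_ml(S, Lp @ Phi @ Lp.T + Psi) - f_ml(S, Lm @ Phi @ Lm.T + Psi)) / (2 * eps)
print(grad_Lam[i, k], num)                # should agree closely
```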
19.3 Gradient with respect to unique variances
Proof
The unique-variance matrix $\boldsymbol{\Psi}$ is diagonal, so only its diagonal entries are free parameters. Within $\boldsymbol{\Sigma} = \boldsymbol{\Lambda}\boldsymbol{\Phi}\boldsymbol{\Lambda}^\top + \boldsymbol{\Psi}$, only the additive term depends on $\boldsymbol{\Psi}$, and differentiating with respect to a diagonal entry $\Psi_{ii}$ yields a matrix with a single $1$ at position $(i, i)$.
$$\dfrac{\partial \Sigma_{kl}}{\partial \Psi_{ii}} = \delta_{ki}\delta_{li}, \quad \text{i.e.}\ \dfrac{\partial \boldsymbol{\Sigma}}{\partial \Psi_{ii}} = \boldsymbol{e}_i\boldsymbol{e}_i^\top$$
Substituting into the chain rule $\partial F/\partial \Psi_{ii} = \text{tr}(\boldsymbol{G}\,\partial \boldsymbol{\Sigma}/\partial \Psi_{ii})$ gives $\boldsymbol{e}_i^\top \boldsymbol{G}\,\boldsymbol{e}_i = G_{ii}$.
$$\dfrac{\partial F_{\text{ML}}}{\partial \Psi_{ii}} = G_{ii}$$
That is, the gradient with respect to each diagonal entry of $\boldsymbol{\Psi}$ equals the corresponding diagonal entry of the residual matrix $\boldsymbol{G}$. In matrix form,
$$\dfrac{\partial F_{\text{ML}}}{\partial \boldsymbol{\Psi}} = \text{diag}(\boldsymbol{G}) = \text{diag}(\boldsymbol{\Sigma}^{-1} - \boldsymbol{\Sigma}^{-1}\boldsymbol{S}\boldsymbol{\Sigma}^{-1}) \quad \square$$
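The same kind of finite-difference check works for $\boldsymbol{\Psi}$: because $\boldsymbol{\Psi}$ enters $\boldsymbol{\Sigma}$ additively, perturbing $\Psi_{ii}$ is the same as perturbing $\Sigma_{ii}$. The setup below is an illustrative sketch with made-up values.

```python
# Sketch: the Psi gradient is just the diagonal of G.
import numpy as np

def f_ml(S, Sigma):
    Si = np.linalg.inv(Sigma)
    return (np.linalg.slogdet(Sigma)[1] + np.trace(S @ Si)
            - np.linalg.slogdet(S)[1] - S.shape[0])

rng = np.random.default_rng(2)
p = 4
Lam, Phi = rng.normal(size=(p, 1)), np.array([[1.0]])
Psi = np.diag(rng.uniform(0.5, 1.0, p))
S = np.cov(rng.normal(size=(200, p)), rowvar=False)

Sigma = Lam @ Phi @ Lam.T + Psi
Si = np.linalg.inv(Sigma)
G = Si - Si @ S @ Si

# Perturbing Psi_ii is equivalent to perturbing Sigma_ii (additive term).
i, eps = 1, 1e-6
dPsi = np.zeros((p, p)); dPsi[i, i] = eps
num = (f_ml(S, Sigma + dPsi) - f_ml(S, Sigma - dPsi)) / (2 * eps)
print(G[i, i], num)   # both equal dF/dPsi_ii
```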
19.4 Gradient with respect to factor correlation
Proof
The gradient with respect to the factor correlation $\boldsymbol{\Phi}$ is obtained from the gradient with respect to $\boldsymbol{\Sigma}$, $\boldsymbol{G} = \boldsymbol{\Sigma}^{-1} - \boldsymbol{\Sigma}^{-1}\boldsymbol{S}\boldsymbol{\Sigma}^{-1}$, via the chain rule. Inside $\boldsymbol{\Sigma} = \boldsymbol{\Lambda}\boldsymbol{\Phi}\boldsymbol{\Lambda}^\top + \boldsymbol{\Psi}$, only the central sandwich term depends on $\boldsymbol{\Phi}$, so the total differential is
$$d\boldsymbol{\Sigma} = \boldsymbol{\Lambda}(d\boldsymbol{\Phi})\boldsymbol{\Lambda}^\top$$
Substituting into $dF = \text{tr}(\boldsymbol{G}\,d\boldsymbol{\Sigma})$ and using the cyclic property of trace to move $d\boldsymbol{\Phi}$ to the end:
$$dF = \text{tr}\!\left(\boldsymbol{G}\boldsymbol{\Lambda}(d\boldsymbol{\Phi})\boldsymbol{\Lambda}^\top\right) = \text{tr}\!\left(\boldsymbol{\Lambda}^\top\boldsymbol{G}\boldsymbol{\Lambda}\,d\boldsymbol{\Phi}\right)$$
Comparing with $dF = \text{tr}((\partial F/\partial \boldsymbol{\Phi})^\top d\boldsymbol{\Phi})$ identifies the gradient. Since $\boldsymbol{\Lambda}^\top \boldsymbol{G}\boldsymbol{\Lambda}$ is symmetric, taking the transpose leaves it unchanged, so
$$\dfrac{\partial F_{\text{ML}}}{\partial \boldsymbol{\Phi}} = \boldsymbol{\Lambda}^\top\boldsymbol{G}\boldsymbol{\Lambda} = \boldsymbol{\Lambda}^\top(\boldsymbol{\Sigma}^{-1} - \boldsymbol{\Sigma}^{-1}\boldsymbol{S}\boldsymbol{\Sigma}^{-1})\boldsymbol{\Lambda} \quad \square$$
Interpretation: $\boldsymbol{\Lambda}^\top \boldsymbol{G}\boldsymbol{\Lambda}$ represents the "residual evaluated in factor space" and indicates the direction in which factor correlations should be adjusted.
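A sketch of the $\boldsymbol{\Phi}$ gradient under made-up values. The finite-difference check perturbs a single entry of $\boldsymbol{\Phi}$, i.e. it treats $\Phi_{12}$ and $\Phi_{21}$ as separate unconstrained parameters, exactly as in the derivation above.

```python
# Sketch: the factor-covariance gradient Lam^T G Lam, checked by perturbing
# a single (unsymmetrized) entry of Phi.
import numpy as np

def f_ml(S, Sigma):
    Si = np.linalg.inv(Sigma)
    return (np.linalg.slogdet(Sigma)[1] + np.trace(S @ Si)
            - np.linalg.slogdet(S)[1] - S.shape[0])

rng = np.random.default_rng(3)
p, m = 6, 2
Lam = rng.normal(size=(p, m))
Phi = np.array([[1.0, 0.3], [0.3, 1.0]])
Psi = np.diag(rng.uniform(0.4, 0.9, p))
S = np.cov(rng.normal(size=(400, p)), rowvar=False)

Sigma = Lam @ Phi @ Lam.T + Psi
Si = np.linalg.inv(Sigma)
G = Si - Si @ S @ Si
grad_Phi = Lam.T @ G @ Lam

eps = 1e-6
Pp, Pm = Phi.copy(), Phi.copy()
Pp[0, 1] += eps; Pm[0, 1] -= eps          # perturb Phi_{12} only
num = (f_ml(S, Lam @ Pp @ Lam.T + Psi) - f_ml(S, Lam @ Pm @ Lam.T + Psi)) / (2 * eps)
print(grad_Phi[0, 1], num)                # should agree closely
```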
Structural Equation Modeling (SEM)
19.5 SEM model-implied covariance
Proof
In standard LISREL notation, SEM builds the observed covariance in two stages: a structural model (a system of equations relating the endogenous latent variables $\boldsymbol{\eta}$ to the exogenous latent variables $\boldsymbol{\xi}$) and a measurement model (which maps the latent variables onto the observed variables $\boldsymbol{y}$).
Solving the structural model $\boldsymbol{\eta} = \boldsymbol{B}\boldsymbol{\eta} + \boldsymbol{\Gamma}\boldsymbol{\xi} + \boldsymbol{\zeta}$ for $\boldsymbol{\eta}$, we have $(\boldsymbol{I} - \boldsymbol{B})\boldsymbol{\eta} = \boldsymbol{\Gamma}\boldsymbol{\xi} + \boldsymbol{\zeta}$, hence
$$\boldsymbol{\eta} = (\boldsymbol{I} - \boldsymbol{B})^{-1}(\boldsymbol{\Gamma}\boldsymbol{\xi} + \boldsymbol{\zeta})$$
Assuming the exogenous latent covariance $\text{Cov}(\boldsymbol{\xi}) = \boldsymbol{\Phi}$, the structural disturbance covariance $\text{Cov}(\boldsymbol{\zeta}) = \boldsymbol{\Psi}$, and that $\boldsymbol{\xi}$ and $\boldsymbol{\zeta}$ are uncorrelated, the covariance of $\boldsymbol{\eta}$ is
$$\boldsymbol{\Sigma}_\eta = (\boldsymbol{I} - \boldsymbol{B})^{-1}(\boldsymbol{\Gamma}\boldsymbol{\Phi}\boldsymbol{\Gamma}^\top + \boldsymbol{\Psi})(\boldsymbol{I} - \boldsymbol{B})^{-\top}$$
From the measurement model $\boldsymbol{y} = \boldsymbol{\Lambda}\boldsymbol{\eta} + \boldsymbol{\epsilon}$ with $\text{Cov}(\boldsymbol{\epsilon}) = \boldsymbol{\Theta}$, the observed covariance is
$$\boldsymbol{\Sigma} = \boldsymbol{\Lambda}\boldsymbol{\Sigma}_\eta\boldsymbol{\Lambda}^\top + \boldsymbol{\Theta}$$
Substituting the expression for $\boldsymbol{\Sigma}_\eta$ yields the desired model-implied covariance. The model parameters are $\boldsymbol{\theta} = (\boldsymbol{B}, \boldsymbol{\Gamma}, \boldsymbol{\Phi}, \boldsymbol{\Psi}, \boldsymbol{\Lambda}, \boldsymbol{\Theta})$.
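The model-implied covariance is straightforward to assemble in code by following the two stages literally. The sketch below uses a hypothetical model (two endogenous latents, one exogenous latent, three indicators per $\eta$); the helper name `sem_sigma` and all matrix values are placeholders, not from the text.

```python
# Sketch: assemble the LISREL model-implied covariance in two stages.
import numpy as np

def sem_sigma(B, Gamma, Phi, Psi, Lam, Theta):
    """Sigma = Lam (I-B)^{-1} (Gamma Phi Gamma^T + Psi) (I-B)^{-T} Lam^T + Theta."""
    A_inv = np.linalg.inv(np.eye(B.shape[0]) - B)
    Sigma_eta = A_inv @ (Gamma @ Phi @ Gamma.T + Psi) @ A_inv.T  # structural stage
    return Lam @ Sigma_eta @ Lam.T + Theta                       # measurement stage

B     = np.array([[0.0, 0.0], [0.4, 0.0]])     # eta2 <- eta1
Gamma = np.array([[0.5], [0.2]])               # eta <- xi
Phi   = np.array([[1.0]])                      # Var(xi)
Psi   = np.diag([0.3, 0.3])                    # Cov(zeta)
Lam   = np.kron(np.eye(2), np.ones((3, 1)))    # 3 indicators per eta
Theta = 0.5 * np.eye(6)                        # measurement error covariance

print(sem_sigma(B, Gamma, Phi, Psi, Lam, Theta).shape)   # (6, 6)
```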
19.6 SEM general gradient formula
Proof
This formula is SEM's most powerful tool: the gradient takes the same form for every parameter. The key observation is that $F_{\text{ML}}$ depends on a parameter $\theta_i$ only through $\boldsymbol{\Sigma}(\boldsymbol{\theta})$.
Using the matrix chain rule, the partial derivative of the scalar $F$ with respect to $\theta_i$ is
$$\dfrac{\partial F}{\partial \theta_i} = \text{tr}\!\left(\dfrac{\partial F}{\partial \boldsymbol{\Sigma}} \cdot \dfrac{\partial \boldsymbol{\Sigma}}{\partial \theta_i}\right)$$
where, from the definition of $F_{\text{ML}}$ (the same calculation as in 19.1),
$$\dfrac{\partial F}{\partial \boldsymbol{\Sigma}} = \boldsymbol{\Sigma}^{-1} - \boldsymbol{\Sigma}^{-1}\boldsymbol{S}\boldsymbol{\Sigma}^{-1}$$
Substituting back into the chain rule gives the desired general gradient formula
$$\dfrac{\partial F_{\text{ML}}}{\partial \theta_i} = \text{tr}\!\left[(\boldsymbol{\Sigma}^{-1} - \boldsymbol{\Sigma}^{-1}\boldsymbol{S}\boldsymbol{\Sigma}^{-1})\dfrac{\partial \boldsymbol{\Sigma}}{\partial \theta_i}\right] \quad \square$$
All that remains is to compute $\partial \boldsymbol{\Sigma}/\partial \theta_i$ on a per-parameter basis, so every SEM gradient calculation fits in the same framework.
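In code the framework is essentially a one-liner: compute $\boldsymbol{G}$ once, then take one trace per parameter. The sketch below (the helper `sem_grad` and the example matrices are illustrative assumptions) reuses the diagonal pattern from 19.3: for a measurement-error entry $\Theta_{ii}$, $\partial \boldsymbol{\Sigma}/\partial \Theta_{ii} = \boldsymbol{e}_i\boldsymbol{e}_i^\top$, so the gradient is just $G_{ii}$.

```python
# Sketch: the general SEM gradient recipe as a single trace.
import numpy as np

def sem_grad(S, Sigma, dSigma):
    """dF_ML/dtheta_i = tr[(Sigma^{-1} - Sigma^{-1} S Sigma^{-1}) dSigma/dtheta_i]."""
    Si = np.linalg.inv(Sigma)
    G = Si - Si @ S @ Si
    return np.trace(G @ dSigma)

# Example: dSigma/dTheta_00 = e_0 e_0^T, so the gradient equals G[0, 0].
rng = np.random.default_rng(4)
S = np.cov(rng.normal(size=(100, 3)), rowvar=False)   # illustrative data covariance
Sigma = np.diag([1.2, 0.9, 1.1])                      # illustrative model covariance
dSigma = np.zeros((3, 3)); dSigma[0, 0] = 1.0
print(sem_grad(S, Sigma, dSigma))
```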
19.7 Gradient of SEM structural coefficients
Proof
$\boldsymbol{B}$ is the coefficient matrix of the structural equation, and its effect on $\boldsymbol{\Sigma}$ is mediated by the inverse of $\boldsymbol{A} := \boldsymbol{I} - \boldsymbol{B}$. The derivative of $\boldsymbol{A}$ with respect to a single entry $B_{ij}$ is simply $\partial \boldsymbol{A}/\partial B_{ij} = -\boldsymbol{e}_i\boldsymbol{e}_j^\top$ (a $-1$ at exactly one entry).
Applying the inverse-matrix differentiation formula $\partial \boldsymbol{A}^{-1}/\partial B_{ij} = -\boldsymbol{A}^{-1}(\partial \boldsymbol{A}/\partial B_{ij})\boldsymbol{A}^{-1}$, the negative signs cancel:
$$\dfrac{\partial \boldsymbol{A}^{-1}}{\partial B_{ij}} = \boldsymbol{A}^{-1}\boldsymbol{e}_i\boldsymbol{e}_j^\top\boldsymbol{A}^{-1}$$
Letting $\boldsymbol{M} := \boldsymbol{\Gamma}\boldsymbol{\Phi}\boldsymbol{\Gamma}^\top + \boldsymbol{\Psi}$, we have $\boldsymbol{\Sigma} = \boldsymbol{\Lambda}\boldsymbol{A}^{-1}\boldsymbol{M}\boldsymbol{A}^{-\top}\boldsymbol{\Lambda}^\top + \boldsymbol{\Theta}$. Differentiating with respect to $\boldsymbol{B}$, both $\boldsymbol{A}^{-1}$ and $\boldsymbol{A}^{-\top}$ vary, so the product rule produces two terms.
$$\dfrac{\partial \boldsymbol{\Sigma}}{\partial B_{ij}} = \boldsymbol{\Lambda}\dfrac{\partial \boldsymbol{A}^{-1}}{\partial B_{ij}}\boldsymbol{M}\boldsymbol{A}^{-\top}\boldsymbol{\Lambda}^\top + \boldsymbol{\Lambda}\boldsymbol{A}^{-1}\boldsymbol{M}\dfrac{\partial \boldsymbol{A}^{-\top}}{\partial B_{ij}}\boldsymbol{\Lambda}^\top$$
Packaging this into a $\partial/\partial \boldsymbol{B}$ matrix (or its vectorized form) requires introducing $\boldsymbol{\Sigma}_\eta = \boldsymbol{A}^{-1}\boldsymbol{M}\boldsymbol{A}^{-\top}$ and using a Kronecker-product representation. This is what SEM software (lavaan, OpenMx, etc.) implements internally.
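A sketch of the per-entry derivative under hypothetical parameter values: it builds $\partial \boldsymbol{\Sigma}/\partial B_{ij}$ from the two product-rule terms (the second is the transpose of the first, since $\boldsymbol{M}$ is symmetric) and checks the result against a finite difference of $\boldsymbol{\Sigma}$. Function and variable names are ad hoc.

```python
# Sketch: dSigma/dB_ij from the product rule, checked by finite differences.
import numpy as np

rng = np.random.default_rng(5)
m, q, p = 3, 1, 6                      # endogenous, exogenous, observed
B = np.zeros((m, m)); B[1, 0], B[2, 1] = 0.4, 0.3
Gamma = rng.normal(size=(m, q))
Phi = np.array([[1.0]])
Psi = 0.3 * np.eye(m)
Lam = rng.normal(size=(p, m))
Theta = 0.5 * np.eye(p)

def sigma_of_B(B):
    A_inv = np.linalg.inv(np.eye(m) - B)
    M = Gamma @ Phi @ Gamma.T + Psi
    return Lam @ A_inv @ M @ A_inv.T @ Lam.T + Theta

i, j = 2, 0                            # which entry of B
A_inv = np.linalg.inv(np.eye(m) - B)
M = Gamma @ Phi @ Gamma.T + Psi
Eij = np.zeros((m, m)); Eij[i, j] = 1.0
dA_inv = A_inv @ Eij @ A_inv           # dA^{-1}/dB_ij
term = Lam @ dA_inv @ M @ A_inv.T @ Lam.T
dSigma = term + term.T                 # second product-rule term = transpose of first

eps = 1e-6
Bp, Bm = B.copy(), B.copy(); Bp[i, j] += eps; Bm[i, j] -= eps
num = (sigma_of_B(Bp) - sigma_of_B(Bm)) / (2 * eps)
print(np.max(np.abs(dSigma - num)))    # should be ~0 (finite-difference error)
```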
19.8 RAM model covariance structure
Proof
RAM (Reticular Action Model), proposed by McArdle, is a unified SEM notation that does not distinguish latent and observed variables and bundles every variable into a single vector $\boldsymbol{v}$. Its advantage is that the eight LISREL matrices are condensed into just three matrices ($\boldsymbol{A}, \boldsymbol{S}, \boldsymbol{F}$).
In the RAM structural equation $\boldsymbol{v} = \boldsymbol{A}\boldsymbol{v} + \boldsymbol{u}$, $\boldsymbol{A}$ holds the directed coefficients between variables, and $\boldsymbol{u}$ is exogenous input (errors included). Solving for $\boldsymbol{v}$:
$$(\boldsymbol{I} - \boldsymbol{A})\boldsymbol{v} = \boldsymbol{u} \quad\Longrightarrow\quad \boldsymbol{v} = (\boldsymbol{I} - \boldsymbol{A})^{-1}\boldsymbol{u}$$
Letting $\boldsymbol{S}$ be the covariance of $\boldsymbol{u}$ (in RAM notation this symmetric-paths matrix reuses the letter $\boldsymbol{S}$; it is distinct from the data covariance that appears in $F_{\text{ML}}$), the covariance of all variables is
$$\text{Cov}(\boldsymbol{v}) = (\boldsymbol{I} - \boldsymbol{A})^{-1}\,\boldsymbol{S}\,(\boldsymbol{I} - \boldsymbol{A})^{-\top}$$
To extract only the observed variables, sandwich the result with a 0/1 filter (selection) matrix $\boldsymbol{F}$ that picks out the observed-variable rows.
$$\boldsymbol{\Sigma} = \boldsymbol{F}\,\text{Cov}(\boldsymbol{v})\,\boldsymbol{F}^\top = \boldsymbol{F}(\boldsymbol{I} - \boldsymbol{A})^{-1}\boldsymbol{S}(\boldsymbol{I} - \boldsymbol{A})^{-\top}\boldsymbol{F}^\top$$
This is the RAM model-implied covariance. It is more compact than LISREL and is adopted by software such as OpenMx.
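A minimal sketch of the RAM covariance for a one-factor, three-indicator model. Variable ordering is [latent, y1, y2, y3]; all values are made up, and the symmetric matrix is named `Svar` in code to avoid clashing with the data covariance $\boldsymbol{S}$ used by $F_{\text{ML}}$.

```python
# Sketch: RAM model-implied covariance for a tiny path model.
import numpy as np

nv, ny = 4, 3
A = np.zeros((nv, nv))
A[1:, 0] = [1.0, 0.8, 0.6]            # directed paths: y_k <- latent
Svar = np.diag([1.0, 0.4, 0.5, 0.3])  # latent variance + error variances
F = np.hstack([np.zeros((ny, 1)), np.eye(ny)])   # filter: keep observed rows only

I_A_inv = np.linalg.inv(np.eye(nv) - A)
Sigma = F @ I_A_inv @ Svar @ I_A_inv.T @ F.T
print(Sigma)                          # 3 x 3 observed covariance
```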
19.9 Gradient with respect to the RAM matrix A
Proof
Setting $\boldsymbol{E} := \boldsymbol{I} - \boldsymbol{A}$, we have $\boldsymbol{\Sigma} = \boldsymbol{F}\boldsymbol{E}^{-1}\boldsymbol{S}\boldsymbol{E}^{-\top}\boldsymbol{F}^\top$. Differentiating with respect to a single entry $A_{ij}$, by an argument analogous to 19.7 we get $\partial \boldsymbol{E}/\partial A_{ij} = -\boldsymbol{e}_i\boldsymbol{e}_j^\top$, and the inverse-matrix differentiation formula gives
$$\dfrac{\partial \boldsymbol{E}^{-1}}{\partial A_{ij}} = \boldsymbol{E}^{-1}\boldsymbol{e}_i\boldsymbol{e}_j^\top\boldsymbol{E}^{-1}$$
The expression for $\boldsymbol{\Sigma}$ contains $\boldsymbol{E}^{-1}$ on the left and $\boldsymbol{E}^{-\top}$ on the right, so the product rule leaves two terms.
$$\dfrac{\partial \boldsymbol{\Sigma}}{\partial A_{ij}} = \boldsymbol{F}\dfrac{\partial \boldsymbol{E}^{-1}}{\partial A_{ij}}\boldsymbol{S}\boldsymbol{E}^{-\top}\boldsymbol{F}^\top + \boldsymbol{F}\boldsymbol{E}^{-1}\boldsymbol{S}\dfrac{\partial \boldsymbol{E}^{-\top}}{\partial A_{ij}}\boldsymbol{F}^\top$$
Expanding the first term yields $\boldsymbol{F}\boldsymbol{E}^{-1}\boldsymbol{e}_i\boldsymbol{e}_j^\top\boldsymbol{E}^{-1}\boldsymbol{S}\boldsymbol{E}^{-\top}\boldsymbol{F}^\top$, and the second term is its transpose. Substituting into the general gradient formula 19.6 then gives $\partial F_{\text{ML}}/\partial A_{ij}$, the standard implementation route.
Item Response Theory (IRT)
19.10 IRT log-likelihood
Proof
In an IRT model, the probability $P_{ij}$ that respondent $i$ answers item $j$ correctly is a function of the ability $\theta_i$ and the item parameters ($a_j, b_j, c_j$). Because each response $X_{ij}$ is binary (correct $=1$, incorrect $=0$), it follows a Bernoulli distribution.
$$P(X_{ij} = x_{ij}) = P_{ij}^{x_{ij}}(1 - P_{ij})^{1 - x_{ij}}$$
Assuming respondents are independent and items are conditionally independent within each respondent (local independence), the likelihood of the entire data set is the product of the individual response probabilities.
$$L = \prod_{i,j} P_{ij}^{x_{ij}}(1 - P_{ij})^{1 - x_{ij}}$$
Taking the logarithm yields a sum of Bernoulli log-likelihoods.
$$\ell = \log L = \sum_{i,j}\!\left[x_{ij}\log P_{ij} + (1 - x_{ij})\log(1 - P_{ij})\right]$$
The difference among item models (1PL/2PL/3PL) lies only in the functional form of $P_{ij}$; the structure of the log-likelihood is unchanged. Each gradient formula below is obtained simply by applying the chain rule to differentiate $\ell$ via $P_{ij}$ down to the individual parameters.
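The sketch below evaluates this log-likelihood for a simulated 2PL response matrix. Helper names (`p_2pl`, `loglik`), sample sizes, and parameter values are illustrative assumptions.

```python
# Sketch: Bernoulli log-likelihood for a simulated 2PL response matrix.
import numpy as np

def p_2pl(theta, a, b):
    """P_ij = sigma(a_j (theta_i - b_j)) for all persons i and items j."""
    z = a[None, :] * (theta[:, None] - b[None, :])
    return 1.0 / (1.0 + np.exp(-z))

def loglik(x, P):
    """Sum of Bernoulli log-likelihoods over the whole response matrix."""
    return np.sum(x * np.log(P) + (1 - x) * np.log(1 - P))

rng = np.random.default_rng(6)
theta = rng.normal(size=200)                             # abilities
a, b = rng.uniform(0.8, 2.0, 10), rng.normal(size=10)    # item parameters
P = p_2pl(theta, a, b)
x = (rng.uniform(size=P.shape) < P).astype(float)        # simulated 0/1 responses
print(loglik(x, P))
```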
19.11 Gradient of 2PL discrimination
Proof
In the 2PL model $P_{ij} = \sigma(z_{ij})$ with the linear index $z_{ij} = a_j(\theta_i - b_j)$. The partial derivative with respect to $a_j$ uses the standard logistic identity $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ together with $\partial z_{ij}/\partial a_j = \theta_i - b_j$:
$$\dfrac{\partial P_{ij}}{\partial a_j} = P_{ij}(1 - P_{ij})(\theta_i - b_j)$$
Next, differentiate the log-likelihood with respect to $P_{ij}$ and chain through to $a_j$. The Bernoulli log-likelihood differentiates to
$$\dfrac{\partial}{\partial P_{ij}}[x_{ij}\log P_{ij} + (1-x_{ij})\log(1-P_{ij})] = \dfrac{x_{ij}}{P_{ij}} - \dfrac{1 - x_{ij}}{1 - P_{ij}}$$
Multiplying these together and summing over respondents $i$ neatly cancels the $P_{ij}(1 - P_{ij})$ factor in numerator and denominator.
$$\dfrac{\partial \ell}{\partial a_j} = \sum_i \dfrac{x_{ij}(1-P_{ij}) - (1-x_{ij})P_{ij}}{P_{ij}(1-P_{ij})} \cdot P_{ij}(1-P_{ij})(\theta_i - b_j)$$
The numerator simplifies to $x_{ij} - P_{ij}$, giving
$$\dfrac{\partial \ell}{\partial a_j} = \sum_i (x_{ij} - P_{ij})(\theta_i - b_j) \quad \square$$
This has an intuitively appealing form: a sum of the residual (observed $x_{ij}$ minus predicted $P_{ij}$) weighted by the respondent's ability deviation $(\theta_i - b_j)$.
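A sketch verifying the residual-times-deviation form against a finite difference of the log-likelihood on simulated 2PL data (all names and values are illustrative). The analogous check for $b_j$ in 19.12 only changes the weighting from $(\theta_i - b_j)$ to $-a_j$.

```python
# Sketch: discrimination gradient sum_i (x_ij - P_ij)(theta_i - b_j),
# checked against a finite difference.
import numpy as np

def p_2pl(theta, a, b):
    z = a[None, :] * (theta[:, None] - b[None, :])
    return 1.0 / (1.0 + np.exp(-z))

def loglik(x, theta, a, b):
    P = p_2pl(theta, a, b)
    return np.sum(x * np.log(P) + (1 - x) * np.log(1 - P))

rng = np.random.default_rng(7)
theta = rng.normal(size=500)
a, b = np.array([1.2, 0.7]), np.array([0.0, 1.0])
P = p_2pl(theta, a, b)
x = (rng.uniform(size=P.shape) < P).astype(float)

grad_a = np.sum((x - P) * (theta[:, None] - b[None, :]), axis=0)  # one entry per item

j, eps = 0, 1e-6
ap, am = a.copy(), a.copy(); ap[j] += eps; am[j] -= eps
num = (loglik(x, theta, ap, b) - loglik(x, theta, am, b)) / (2 * eps)
print(grad_a[j], num)    # should agree closely
```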
19.12 Gradient of 2PL difficulty
Proof
The partial derivative with respect to the difficulty parameter $b_j$ is essentially the same calculation as 19.11. The only difference is $\partial z_{ij}/\partial b_j = -a_j$ (a sign flip). The logistic derivative gives
$$\dfrac{\partial P_{ij}}{\partial b_j} = -P_{ij}(1 - P_{ij}) a_j$$
Carrying this negative sign through the same flow as 19.11, $P_{ij}(1 - P_{ij})$ cancels and the sign flips:
$$\dfrac{\partial \ell}{\partial b_j} = \sum_i (x_{ij} - P_{ij})(-a_j) = \sum_i (P_{ij} - x_{ij}) a_j \quad \square$$
The sign is intuitive: when responses are more often correct than predicted ($x_{ij} > P_{ij}$ on average), the gradient is negative, so maximizing $\ell$ lowers $b_j$ (the item is easier than currently modeled); in the opposite case it raises $b_j$.
19.13 Gradient of the ability parameter
Proof
The ability $\theta_i$ is a parameter of respondent $i$, so the partial derivative with respect to $\theta_i$ sums over items $j$ rather than over respondents. With $\partial z_{ij}/\partial \theta_i = a_j$,
$$\dfrac{\partial P_{ij}}{\partial \theta_i} = P_{ij}(1 - P_{ij}) a_j$$
Following the same flow as 19.11 (the cancellation of $P_{ij}(1 - P_{ij})$):
$$\dfrac{\partial \ell}{\partial \theta_i} = \sum_j (x_{ij} - P_{ij}) a_j \quad \square$$
The form is "sum of item residuals weighted by item discrimination $a_j$." It is used in Newton-Raphson and Fisher scoring methods for ability estimation. Items with higher discrimination contribute more, which is intuitive.
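A minimal Fisher-scoring sketch for a single respondent, using this gradient together with the item information $a_j^2 P_j(1 - P_j)$ derived in 19.16. The item parameters and the response pattern are made up for illustration.

```python
# Sketch: Fisher scoring for one respondent's ability under the 2PL model.
import numpy as np

a = np.array([1.5, 1.0, 0.8, 2.0])      # item discriminations
b = np.array([-1.0, 0.0, 0.5, 1.0])     # item difficulties
x = np.array([1.0, 1.0, 0.0, 0.0])      # this respondent's answers

theta = 0.0                             # starting value
for _ in range(10):
    P = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    grad = np.sum(a * (x - P))          # d ell / d theta
    info = np.sum(a**2 * P * (1 - P))   # expected (Fisher) information, see 19.16
    theta += grad / info                # Fisher-scoring update
print(theta)                            # ML ability estimate
```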
19.14 Gradient of 3PL discrimination
Proof
In the 3PL model, the guessing parameter $c_j \in (0, 1)$ acts as a lower bound and lifts the curve $P^*_{ij} = \sigma(z_{ij})$ (with $z_{ij} = a_j(\theta_i - b_j)$) upward.
$$P_{ij} = c_j + (1 - c_j)\,P^*_{ij}$$
Differentiating with respect to $a_j$, since $c_j$ does not depend on $a_j$, the factor $(1 - c_j)$ remains in front and the inner $P^*_{ij}$ differentiates as in 19.11.
$$\dfrac{\partial P_{ij}}{\partial a_j} = (1 - c_j)\,P^*_{ij}(1 - P^*_{ij})(\theta_i - b_j)$$
The gradient of the log-likelihood follows from the chain rule:
$$\dfrac{\partial \ell}{\partial a_j} = \sum_i \dfrac{x_{ij} - P_{ij}}{P_{ij}(1 - P_{ij})}\dfrac{\partial P_{ij}}{\partial a_j}$$
Unlike the 2PL case, here $P_{ij}$ and $P^*_{ij}$ differ ($P_{ij} = c_j + (1-c_j)P^*_{ij}$), so $P_{ij}(1 - P_{ij})$ does not cancel completely; the ratio of $P^*_{ij}(1 - P^*_{ij})$ to $P_{ij}(1 - P_{ij})$ remains.
$$\dfrac{\partial \ell}{\partial a_j} = \sum_i \dfrac{(x_{ij} - P_{ij})(1 - c_j)P^*_{ij}(1 - P^*_{ij})(\theta_i - b_j)}{P_{ij}(1 - P_{ij})} \quad \square$$
As $c_j \to 0$, this reduces to the 2PL case.
19.15 Gradient of the guessing parameter
Proof
We start with the partial derivative with respect to the guessing parameter $c_j$. Differentiating $P_{ij} = c_j + (1 - c_j)P^*_{ij}$ with respect to $c_j$, with $P^*_{ij}$ independent of $c_j$ (it depends only on the inner $z_{ij}$), gives
$$\dfrac{\partial P_{ij}}{\partial c_j} = 1 - P^*_{ij}$$
Combining via the chain rule with the Bernoulli log-likelihood gradient (the same expression for $\partial \ell/\partial P_{ij}$ used in 19.11):
$$\dfrac{\partial \ell}{\partial c_j} = \sum_i\!\left[\dfrac{x_{ij}}{P_{ij}} - \dfrac{1 - x_{ij}}{1 - P_{ij}}\right](1 - P^*_{ij})$$
Putting the numerators over a common denominator gives $x_{ij}(1 - P_{ij}) - (1 - x_{ij})P_{ij} = x_{ij} - P_{ij}$:
$$= \sum_i \dfrac{x_{ij} - P_{ij}}{P_{ij}(1 - P_{ij})}(1 - P^*_{ij})$$
Using the key identity $1 - P_{ij} = 1 - c_j - (1 - c_j)P^*_{ij} = (1 - c_j)(1 - P^*_{ij})$, the factor $(1 - P^*_{ij})$ cancels against part of the denominator.
$$\dfrac{\partial \ell}{\partial c_j} = \sum_i \dfrac{x_{ij} - P_{ij}}{P_{ij}(1 - c_j)} \quad \square$$
(Because we divide by $1 - c_j$, estimation becomes unstable as $c_j$ approaches $1$ — a known practical difficulty in IRT implementations.)
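A sketch checking the simplified $c_j$ gradient against a finite difference on simulated 3PL data for a single item. Helper names and parameter values are assumptions.

```python
# Sketch: guessing-parameter gradient sum_i (x_ij - P_ij) / (P_ij (1 - c_j)),
# checked against a finite difference.
import numpy as np

def p_3pl(theta, a, b, c):
    Pstar = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return c + (1 - c) * Pstar

def loglik(x, theta, a, b, c):
    P = p_3pl(theta, a, b, c)
    return np.sum(x * np.log(P) + (1 - x) * np.log(1 - P))

rng = np.random.default_rng(8)
theta = rng.normal(size=1000)
a, b, c = 1.3, 0.2, 0.2                 # a single item
P = p_3pl(theta, a, b, c)
x = (rng.uniform(size=theta.shape) < P).astype(float)

grad_c = np.sum((x - P) / (P * (1 - c)))

eps = 1e-6
num = (loglik(x, theta, a, b, c + eps) - loglik(x, theta, a, b, c - eps)) / (2 * eps)
print(grad_c, num)   # should agree closely
```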
19.16 2PL item information function
Proof
Start from the definition of Fisher information:
$$I_j(\theta) = \mathbb{E}\!\left[\left(\dfrac{\partial \log P(X_j|\theta)}{\partial \theta}\right)^2\right]$$
For a Bernoulli distribution (the distribution of the response $X_j$ to item $j$), the $\theta$ derivative of the log-likelihood takes the form $\displaystyle\dfrac{X_j - P_j}{P_j(1 - P_j)}P'_j$. Using $\mathbb{E}[X_j] = P_j$ and $\text{Var}(X_j) = P_j(1 - P_j)$ to compute the squared expectation gives the standard Bernoulli Fisher information formula
$$I_j(\theta) = \dfrac{(P'_j)^2}{P_j(1 - P_j)}$$
where $P'_j = \partial P_j/\partial \theta$. For 2PL, the same calculation as 19.13 gives $P'_j = a_j P_j(1 - P_j)$, so
$$I_j(\theta) = \dfrac{[a_j P_j(1 - P_j)]^2}{P_j(1 - P_j)} = a_j^2 P_j(1 - P_j) \quad \square$$
One factor of $P_j(1 - P_j)$ cancels. The maximum $a_j^2/4$ is attained at $P_j = 0.5$ (i.e., $\theta = b_j$), the central insight of IRT: respondents whose ability equals the item difficulty are measured most precisely.
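A short numerical illustration with arbitrary $a_j$ and $b_j$: scanning the information over a grid of abilities locates the peak at $\theta \approx b_j$ with height $\approx a_j^2/4$.

```python
# Sketch: 2PL item information a^2 P(1-P) over a grid of abilities.
import numpy as np

a, b = 1.5, 0.3
theta = np.linspace(-4, 4, 161)
P = 1.0 / (1.0 + np.exp(-a * (theta - b)))
info = a**2 * P * (1 - P)

k = np.argmax(info)
print(theta[k], info[k], a**2 / 4)   # peak at ~b, height ~a^2/4
```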
19.17 3PL item information function
Proof
The standard Fisher information formula $I_j(\theta) = (P'_j)^2 / [P_j(1 - P_j)]$ remains valid for 3PL. What changes is the functional form of $P_j$ and its derivative.
Differentiating $P_j = c_j + (1 - c_j)P^*_j$ with respect to $\theta$, the constant $c_j$ drops out:
$$P'_j = \dfrac{\partial P_j}{\partial \theta} = (1 - c_j)\dfrac{\partial P^*_j}{\partial \theta} = (1 - c_j)\,a_j\,P^*_j(1 - P^*_j)$$
Substituting into the Fisher information formula:
$$I_j(\theta) = \dfrac{(P'_j)^2}{P_j(1 - P_j)} = \dfrac{(1 - c_j)^2\,a_j^2\,P^{*2}_j(1 - P^*_j)^2}{P_j(1 - P_j)} \quad \square$$
Unlike 2PL, the numerator's $P^*_j(1 - P^*_j)$ does not match the denominator's $P_j(1 - P_j)$, so the cancellation does not occur (in 2PL, $P_j = P^*_j$ leads to a complete cancellation). With guessing, low-ability respondents can answer correctly by chance up to a level of $c_j$, lowering effective discrimination and reducing Fisher information at the low end of the ability scale.
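The effect is easy to see numerically. With the same $a_j$ and $b_j$ and an illustrative guessing floor $c_j$, the 3PL information tracks the 2PL value at high abilities but falls well below it at the low end (all values below are made up for illustration).

```python
# Sketch: 2PL vs 3PL item information across the ability range.
import numpy as np

a, b, c = 1.5, 0.0, 0.25
theta = np.linspace(-4, 4, 9)
Pstar = 1.0 / (1.0 + np.exp(-a * (theta - b)))
P3 = c + (1 - c) * Pstar

info_2pl = a**2 * Pstar * (1 - Pstar)
info_3pl = (1 - c)**2 * a**2 * Pstar**2 * (1 - Pstar)**2 / (P3 * (1 - P3))

for t, i2, i3 in zip(theta, info_2pl, info_3pl):
    print(f"theta={t:+.1f}  I_2PL={i2:.3f}  I_3PL={i3:.3f}")
```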
19.18 Gradient of the information function with respect to discrimination
Proof
In test design, we want to know "how the measurement precision at a given ability $\theta$ changes when item $j$'s discrimination $a_j$ varies." This is captured by $\partial I_j/\partial a_j$.
Differentiate the 2PL Fisher information $I_j = a_j^2 P_j(1 - P_j)$ with respect to $a_j$. Since $a_j$ enters both the prefactor $a_j^2$ and the inner $P_j = \sigma(a_j(\theta - b_j))$, the product rule is required.
The contribution through $P_j$ is $\partial P_j/\partial a_j = P_j(1 - P_j)(\theta - b_j)$ (the same calculation as 19.11). With this, the derivative of $P_j(1 - P_j)$ is
$$\dfrac{\partial}{\partial a_j}[P_j(1 - P_j)] = (1 - 2P_j)\,P_j(1 - P_j)(\theta - b_j)$$
(this is the typical second-derivative form of the logistic). Applying the product rule yields two terms:
$$\dfrac{\partial I_j}{\partial a_j} = 2a_j\,P_j(1 - P_j) + a_j^2(1 - 2P_j)\,P_j(1 - P_j)(\theta - b_j)$$
Factoring out the common factor $2a_j P_j(1 - P_j)$:
$$\dfrac{\partial I_j}{\partial a_j} = 2a_j P_j(1 - P_j)\bigl[1 - \tfrac{1}{2}a_j(\theta - b_j)(2P_j - 1)\bigr] \quad \square$$
Writing the bracket as $1 + \tfrac{1}{2}a_j(\theta-b_j)(1-2P_j)$ makes the structure clear: at $\theta = b_j$ ($P_j = 1/2$) the second term vanishes and the gradient reduces to $2a_j P_j(1-P_j) = a_j/2$. This quantifies how raising the discrimination changes Fisher information across the ability range, and is used in test design and item selection (CAT).
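A sketch checking the formula against a finite difference of $I_j(a) = a^2 P(1-P)$ at a fixed ability (parameter values are illustrative).

```python
# Sketch: dI/da for the 2PL information, checked by finite differences.
import numpy as np

def info(a, theta, b):
    P = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a**2 * P * (1 - P)

a, b, theta = 1.4, 0.0, 0.8
P = 1.0 / (1.0 + np.exp(-a * (theta - b)))
analytic = 2 * a * P * (1 - P) * (1 - 0.5 * a * (theta - b) * (2 * P - 1))

eps = 1e-6
numeric = (info(a + eps, theta, b) - info(a - eps, theta, b)) / (2 * eps)
print(analytic, numeric)   # should agree closely
```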