# Chapter 2 Literature Review

Diagnostic classification models (DCMs; also known as cognitive diagnostic models) are a class of psychometric models that define a mastery profile on a predefined set of attributes (Rupp & Templin, 2008b; Rupp et al., 2010). These attributes are categorical in nature, and although they can consist of more than two categories, they are most often binary (Bradshaw, 2017). Given an attribute profile for an individual, the probability of providing a correct response to an item is determined by the attributes that are required by the item.

This profile of attribute mastery that gives rise to item responses makes DCMs an inherently different type of assessment than what is most commonly used in psychometrics. For example, classical test theory (DeVellis, 2006), item response theory (Reckase, 2009), and structural equation modeling (Ullman & Bentler, 2003) all assume a continuous latent trait. This can make it difficult to interpret what an assessment score means. In an educational setting, a process known as standard setting (Cizek, 2006) is typically conducted to categorize the continuous score so that stakeholders and parents can better interpret what a score means (Hambleton, 2006). In contrast, assessments that are scaled with a diagnostic model provide a categorical classification for each attribute that is assessed. This allows for a greater differentiation of respondent latent traits. However, a decision must still be made as to what probability of attribute mastery is sufficient for reporting an individual as a master.

Take, for example, a standard K-12 math assessment. Using traditional test scaling methods, a student would receive an overall math score, a performance level determined by the standard setting process, and possibly a selection of subscores. However, because the unidimensional variants of these methods are most commonly used, subscores have been shown to be problematic with these types of assessments (Feinberg & Wainer, 2014; Sinharay, Haberman, & Wainer, 2011).

Diagnostic models, on the other hand, are multidimensional models. Thus, if the math assessment were scaled using diagnostic models, the student would receive a probability of mastery on each of the attributes that were assessed (Bradshaw & Templin, 2014), although it is also possible to use a standard setting process within DCMs if desired (see Templin, 2010; Templin, Poggio, Irwin, & Henson, 2007). What these attributes are is determined in the test design process. They could be specific skills, educational standards, or subareas within the larger content area (e.g., algebra, geometry, and statistics all fall within the larger math construct). Therefore, it is critical to determine what level of score reporting is desired prior to test construction. For example, both Rupp et al. (2010) and Almond, Mislevy, Steinberg, Yan, & Williamson (2015) discuss the evidence-centered design framework and how this approach to validity can aid in the construction of a diagnostic assessment. Under the evidence-centered design framework, generally speaking, test design begins with the inferences about student ability that are desired, and then works back to the evidence needed to support those inferences. In this way, the grain size of the desired inferences will dictate the grain size of the attribute definitions.

In this chapter, the statistical structure of DCMs is outlined, and the key differences between traditional sub-types of DCMs are highlighted. The log-linear cognitive diagnosis model is then explored in more depth, as this model subsumes most other DCMs and is the basis for this study. Finally, model estimation and reduction techniques are discussed across a variety of psychometric models, including diagnostic models, item response theory, and structural equation modeling.

## 2.1 Structure of diagnostic classification models

In this paper, the discussion of diagnostic models is restricted to binary attributes assessed by dichotomously scored items. Practically, this means that all models presented are extensions of latent class models. Specifically, DCMs can be thought of as restricted latent class models in which each class represents a profile of attribute mastery. When using binary attributes, the number of unique classes is equal to \(2^A\), where \(A\) is the number of attributes assessed. Given the specification of available attribute profiles, the probability of respondent \(r\) providing the observed response pattern is as follows.

\[\begin{equation} P(\text{X}_r=\text{x}_r)=\sum_{c=1}^C\nu_c\prod_{i=1}^I\pi_{ic}^{x_{ir}}(1-\pi_{ic})^{1-x_{ir}} \tag{2.1} \end{equation}\]

In equation (2.1), \(\pi_{ic}\) is the probability of a respondent in class \(c\) providing a correct response to item \(i\), and \(x_{ir}\) is the response (i.e., 0, 1) of respondent \(r\) to item \(i\). Thus, \(\pi_{ic}^{x_{ir}}(1-\pi_{ic})^{1-x_{ir}}\) can be described in words as the probability of a respondent in class \(c\) providing the observed response to item \(i\). The probabilities are then multiplied across all items, giving the probability of a respondent in class \(c\) providing the observed response pattern. This portion of equation (2.1) is known as the *measurement model*, and defines how the items are related to the attributes (equation (2.2) shows just the measurement model).

\[\begin{equation} \prod_{i=1}^I\pi_{ic}^{x_{ir}}(1-\pi_{ic})^{1-x_{ir}} \tag{2.2} \end{equation}\]

Continuing with equation (2.1), the probability of a respondent in class \(c\) providing the observed response vector is then multiplied by \(\nu_c\), the probability that any given respondent belongs to class \(c\). This product represents the probability that a given respondent is in class \(c\) and provided the observed response pattern. Summing over all possible classes gives the probability that a randomly chosen respondent would provide the observed response pattern. The section of equation (2.1) that defines the probability of membership in each class is known as the *structural model* (equation (2.3) shows just the structural model). In the structural model, \(\pmb{\nu}\) is constrained to sum to 1, such that the probability of a respondent not belonging to any class is 0.

\[\begin{equation} \sum_{c=1}^C\nu_c \tag{2.3} \end{equation}\]

Historically, diagnostic models have used an unconstrained structural model, meaning that the values of \(\pmb{\nu}\) directly correspond to the proportion of respondents estimated to be in each of the latent classes. Thus, the structural model is consistent across the wide variety of diagnostic models that exist. What differs between these models is the measurement model, or how the items relate to the attributes. Defining the measurement model begins with the specification of a Q-matrix.
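Equation (2.1) can be evaluated directly for a small example. The following is a minimal sketch, not an estimation routine: the function name `pattern_probability` and all numeric values are hypothetical, chosen only to show how the measurement model (the product over items) and the structural model (the weights \(\nu_c\)) combine.

```python
from itertools import product

def pattern_probability(nu, pi, x):
    """Equation (2.1): marginal probability of the response pattern x.

    nu[c]    -- base rate of latent class c (the values sum to 1)
    pi[i][c] -- probability of a correct response to item i in class c
    x[i]     -- observed response (0 or 1) to item i
    """
    total = 0.0
    for c, nu_c in enumerate(nu):
        likelihood = 1.0  # measurement model for class c
        for i, x_i in enumerate(x):
            p = pi[i][c]
            likelihood *= p ** x_i * (1 - p) ** (1 - x_i)
        total += nu_c * likelihood  # weight by the structural model
    return total

# Hypothetical values: one attribute (two classes) and two items.
nu = [0.4, 0.6]          # non-masters, masters
pi = [[0.2, 0.9],        # item 1
      [0.3, 0.8]]        # item 2

# The probabilities of all 2**2 response patterns should sum to 1.
patterns = list(product([0, 1], repeat=2))
total = sum(pattern_probability(nu, pi, x) for x in patterns)
print(round(total, 10))
print(round(pattern_probability(nu, pi, (1, 1)), 3))  # 0.4*0.06 + 0.6*0.72
```

Because the class base rates sum to 1 and each item contributes a proper Bernoulli probability, the pattern probabilities sum to 1, which is a quick sanity check on any implementation.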

### 2.1.1 The Q-matrix

The specification of which attributes are measured by each item is defined *a priori* by the Q-matrix. The Q-matrix is an \(I \times A\) matrix (items by attributes) filled with 0s and 1s. A 0 indicates that the item does not measure the attribute, whereas a 1 indicates that the item does measure the attribute (Tatsuoka, 1983). The Q-matrix is developed in consultation with content area experts to determine the attributes that need to be present in order for the item to be answered correctly (Bradshaw, 2017). Because the Q-matrix defines how the items relate to the latent attributes, the correct specification of the Q-matrix is critical to the accuracy of the parameter estimates and scores. Both Kunina-Habenicht, Rupp, & Wilhelm (2012) and Rupp & Templin (2008a) used simulation studies to demonstrate the ill effects of misspecification on classification accuracy and parameter bias. Given this importance, it is common practice to make changes to the Q-matrix following the estimation of the DCM (Rupp et al., 2010).

For example, de la Torre (2008) proposed a method for empirically validating the Q-matrix following estimation with the deterministic-input, noisy-and-gate model (see section 2.2.1). Using this method, de la Torre found acceptable Type I and Type II error rates, indicating that the method was able to adequately identify places where the Q-matrix was both correctly and incorrectly specified. Similarly, DeCarlo (2011) found that changing the Q-matrix structure could significantly improve the placement of respondents into latent classes. For example, the initial specification of the Q-matrix for fraction subtraction data (Tatsuoka, 1990) leads to respondents with no correct answers being classified as masters of the majority of skills. By changing the specification of the Q-matrix, DeCarlo (2011) was able to correct this, resulting in a more interpretable output. Chen, Liu, Xu, & Ying (2015) took this approach to the extreme by estimating the entire Q-matrix based only on the dependencies seen in item responses. Using this method, content experts are removed from the process of creating the Q-matrix, and it is specified entirely by empirical methods.

What the Q-matrix is unable to define is how the attributes interact with each other on a given item to influence performance. If an item measures multiple attributes, does the respondent have to have mastered all of the attributes in order to have a high probability of answering the item correctly? Or would mastery of any of the attributes be sufficient? How the attributes interact within an item is defined by the measurement model (equation (2.2)). Traditionally, this choice of compensatory versus non-compensatory has been accomplished by choosing one of a variety of DCMs that have been proposed in the literature.
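To make the Q-matrix concrete, the sketch below stores a small, entirely hypothetical Q-matrix as a list of rows and reports which attributes each item measures. The entries and the helper name `measured_attributes` are illustrative assumptions, not taken from any assessment discussed here.

```python
# Hypothetical Q-matrix: 4 items (rows) by 3 attributes (columns).
q_matrix = [
    [1, 0, 0],  # item 1 measures attribute 1 only
    [0, 1, 0],  # item 2 measures attribute 2 only
    [1, 1, 0],  # item 3 measures attributes 1 and 2
    [0, 1, 1],  # item 4 measures attributes 2 and 3
]

def measured_attributes(q_row):
    """Return the 1-based indices of the attributes an item measures."""
    return [a + 1 for a, q in enumerate(q_row) if q == 1]

num_attributes = len(q_matrix[0])
num_classes = 2 ** num_attributes  # one latent class per mastery profile

for i, row in enumerate(q_matrix, start=1):
    print(f"item {i} measures attributes {measured_attributes(row)}")
print(f"{num_attributes} attributes imply {num_classes} latent classes")
```

Note that the matrix records only which attributes an item measures; nothing in it says whether the attributes of item 3 combine in a compensatory or non-compensatory way.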

## 2.2 Types of DCMs

Traditionally, the type of compensation employed in the measurement model has been defined through the selection of a specific DCM. The individual types of DCMs each define the compensatory or non-compensatory nature of the attributes and items differently. Thus, the relationships of attributes and items must be assumed *a priori*, and then enforced by the selected model. This relationship can be either non-compensatory or compensatory. Ostensibly, both compensatory and non-compensatory models could be estimated, compared, and then a final model selected *a posteriori*; however, usually when selecting one of these models, there is a conceptual reason for the selection, which may not be compatible with other types of DCMs. Non-compensatory DCMs require all of the attributes measured by an item to be mastered in order for the respondent to have a high probability of answering the item correctly. In compensatory DCMs, mastery of some of the attributes measured by an item may be enough to provide a high probability of success. A high-level description of these classes of DCMs follows.

### 2.2.1 Non-compensatory DCMs

Non-compensatory DCMs are defined such that all attributes associated with an item must be mastered in order for the respondent to have a high probability of answering the item correctly. In other words, having an excess of ability on one of the attributes measured by an item cannot make up for the lack of ability on another. This class of DCMs includes the deterministic-input, noisy-and-gate (DINA; de la Torre & Douglas, 2004; Haertel, 1989; Junker & Sijtsma, 2001), noisy-input, deterministic-and-gate (NIDA; Henson & Douglas, 2005; Junker & Sijtsma, 2001), and reduced non-compensatory reparameterized unified (reduced NC-RUM; DiBello, Stout, & Roussos, 1995; Hartz, 2002) models. In the DINA and NIDA models, there are slipping and guessing parameters that are estimated for each item or each attribute, respectively. In these models, the slipping parameter represents the probability of incorrectly applying an attribute that has been mastered, whereas the guessing parameter represents the probability of correctly applying an attribute that hasn’t been mastered. The reduced NC-RUM is parameterized slightly differently. In this model, the probability of providing a correct response when all required attributes have been mastered is estimated, with a penalty factor then applied for each attribute that isn’t mastered. However, in all of these models, the presence of one of the required attributes is unable to make up for the absence of another.

### 2.2.2 Compensatory DCMs

In contrast to the non-compensatory DCMs outlined above, compensatory DCMs are structured such that mastering a subset of the required attributes is sufficient to provide a correct response to the item. This means that only a subset of the attributes that are measured by an item has to be mastered in order for the respondent to have a high probability of success. DCMs in this class include the deterministic-input, noisy-or-gate (DINO; Templin & Henson, 2006), noisy-input, deterministic-or-gate (NIDO; Rupp & Templin, 2008b), and compensatory reparameterized unified (C-RUM; Hartz, 2002) models. The DINO model is parameterized similarly to the DINA model, with slipping and guessing parameters that are estimated for each item. However, in this model, the slipping parameter represents the probability of providing an incorrect response when *at least one* of the required attributes has been mastered (rather than when all attributes have been mastered as in the DINA model). A similar interpretation is made for the guessing parameter.

The NIDO and C-RUM models are parameterized slightly differently. Rather than modeling parameters on the probability scale, a linear predictor is estimated on the log-odds scale, and then mapped to item scores using the logit link function (see section 2.3.1). In the NIDO model, an intercept is added to the linear predictor for each attribute measured by the item, and an additional main effect parameter is added for each of the mastered attributes. The C-RUM model is similar; however, rather than estimating an intercept for each attribute, the C-RUM model estimates an intercept for the entire item, which represents the log-odds of a correct response when none of the measured attributes are mastered. An additional main effect term is then added for each of the mastered attributes.
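The contrast between the two classes of models can be illustrated with the DINA and DINO response functions. This is a minimal sketch under the usual item-level slipping and guessing parameterization; the function names and the parameter values are hypothetical.

```python
def dina_prob(alpha, q, slip, guess):
    """DINA: success requires mastery of every attribute the item measures."""
    has_all = all(a == 1 for a, q_a in zip(alpha, q) if q_a == 1)
    return 1 - slip if has_all else guess

def dino_prob(alpha, q, slip, guess):
    """DINO: mastery of at least one measured attribute is sufficient."""
    has_any = any(a == 1 for a, q_a in zip(alpha, q) if q_a == 1)
    return 1 - slip if has_any else guess

q = [1, 1]               # the item measures both attributes
slip, guess = 0.1, 0.2   # hypothetical item parameters

print("profile  DINA  DINO")
for alpha in ([0, 0], [1, 0], [0, 1], [1, 1]):
    print(alpha, dina_prob(alpha, q, slip, guess), dino_prob(alpha, q, slip, guess))
```

The two models agree for the profiles `[0, 0]` and `[1, 1]` and differ only for respondents who have mastered exactly one of the two attributes.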

## 2.3 The log-linear cognitive diagnosis model

The log-linear cognitive diagnosis model (LCDM) is a general framework for diagnostic models that subsumes most of the existing DCMs, including those discussed in section 2.2 (Henson, Templin, & Willse, 2008; Rupp et al., 2010). Log-linear models are most commonly used in categorical data analysis when examining the change in frequency of respondents in a category across groups (Agresti, 2012). In these models, the frequency of respondents in a category is predicted by dummy coded grouping variables. Consider an example where a researcher is attempting to determine if there is a relationship between gender and political party affiliation (for the purposes of this example, party affiliation is limited to Democratic or Republican). This would result in a 2x2 table similar to Table 2.1.

| | Democratic | Republican |
|---|---|---|
| Male | 400 | 500 |
| Female | 600 | 300 |

The relationship between gender and party affiliation would be written mathematically as:

\[\begin{equation} \ln(F_{ij})=\lambda_0+\lambda_i^{Gender}+\lambda_j^{Party}+\lambda_{ij}^{Gender*Party} \tag{2.4} \end{equation}\]

In equation (2.4), the log frequency of respondents in a cell is given by a linear predictor. The intercept, \(\lambda_0\), represents the log frequency for individuals in the reference group for both gender and party affiliation. The next two terms, \(\lambda_i^{Gender}\) and \(\lambda_j^{Party}\), represent the simple main effects for gender and party affiliation, respectively. Finally, the interaction term, \(\lambda_{ij}^{Gender*Party}\), represents how related the two factors are. For instance, if gender and party affiliation are completely independent of one another, the interaction term would be equal to 0.
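As a worked illustration, the parameters of the saturated model in equation (2.4) can be computed by hand from the counts in Table 2.1. The sketch below assumes dummy coding with Male and Democratic as the reference categories; that choice of reference group, and the variable names, are assumptions made only for this example.

```python
import math

# Cell counts from Table 2.1, keyed by (gender, party).
counts = {
    ("Male", "Democratic"): 400, ("Male", "Republican"): 500,
    ("Female", "Democratic"): 600, ("Female", "Republican"): 300,
}

# Assumption: Male and Democratic are the reference categories.
lam_0 = math.log(counts[("Male", "Democratic")])
lam_female = math.log(counts[("Female", "Democratic")]) - lam_0
lam_republican = math.log(counts[("Male", "Republican")]) - lam_0
lam_interaction = (math.log(counts[("Female", "Republican")])
                   - lam_0 - lam_female - lam_republican)

# The saturated model reproduces every cell count exactly.
fitted = math.exp(lam_0 + lam_female + lam_republican + lam_interaction)
print(round(fitted))              # 300
print(round(lam_interaction, 3))  # -0.916, i.e. log(0.4)
```

The interaction equals the log of the odds ratio, \(\ln\left(\frac{300 \times 400}{600 \times 500}\right)=\ln(0.4)\). Because it is not 0, gender and party affiliation are not independent in this table.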

The LCDM is a log-linear model with categorical latent traits. Consider an item on an achievement test that measures a single attribute, \(\alpha_1\). This would lead to a cross classification table similar to Table 2.1, but with a latent attribute.

| | Master | Nonmaster |
|---|---|---|
| Correct (X=1) | 900 | 200 |
| Incorrect (X=0) | 100 | 600 |

The mathematical definition of Table 2.2 would be as follows:

\[\begin{equation} \ln(F_{ij})=\lambda_0+\lambda_i^{\alpha_1}+\lambda_j^{x}+\lambda_{ij}^{\alpha_1*x} \tag{2.5} \end{equation}\]

Table 2.2 and equation (2.5) could both be extended to multiple latent attributes by creating a three-way cross classification table and adding the appropriate main effects and additional interaction terms. Because mastery of the attributes is unobserved, it must be estimated by relating the observed response to the unobserved attribute. As discussed in section 2.2, the relationship between observed data and the latent attributes is known as the measurement model.

### 2.3.1 LCDM measurement model

In order to use a log-linear model to predict probabilities of events occurring (rather than frequencies), a different link function must be used. Equations (2.4) and (2.5) used a log link, as frequencies are only bounded on the lower end of the distribution by 0. Probabilities, on the other hand, are bounded by 0 on the lower and 1 on the upper ends of the distribution. Therefore, a logistic, or logit, link is used. As discussed in section 2.2.2, these types of generalized linear models involve combining the predictors into what is known as a *kernel* (or linear predictor in the generalized linear modeling literature; Stroup, 2012), which is an unbounded continuous value that is mapped to the item responses through a link function. When dealing with dichotomous data, this is most commonly achieved using the logit link function (Stroup, 2012), defined in equation (2.6).

\[\begin{equation} \eta_{ic} = \text{logit}(\pi_{ic}) = \ln\left(\frac{\pi_{ic}}{1 - \pi_{ic}}\right) \tag{2.6} \end{equation}\]

Similarly, the inverse of the logit can be expressed as follows.

\[\begin{equation} \pi_{ic} = \text{logit}^{-1}(\eta_{ic}) = \frac{\exp(\eta_{ic})}{1 + \exp(\eta_{ic})} \tag{2.7} \end{equation}\]

The inverse logit in equation (2.7) is more commonly seen in psychometrics, especially in reference to item response theory (Ayala, 2009).
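The logit link and its inverse can be written as two short functions. This is only an illustrative sketch; the names `logit` and `inv_logit` are chosen here for convenience.

```python
import math

def logit(p):
    """Map a probability in (0, 1) to the unbounded log-odds scale."""
    return math.log(p / (1 - p))

def inv_logit(eta):
    """Equation (2.7): map a log-odds value back to a probability."""
    return math.exp(eta) / (1 + math.exp(eta))

print(inv_logit(0.0))                    # 0.5: log-odds of 0 are even odds
print(round(inv_logit(logit(0.8)), 10))  # 0.8: the functions are inverses
```

A kernel value of 0 corresponds to a probability of 0.5, positive values to probabilities above 0.5, and negative values to probabilities below 0.5.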

For the LCDM, the general notation used by Rupp et al. (2010) for parameters in the linear predictor is \(\lambda_{i,l,(a,a',...)}\). In this notation, the first subscript identifies the item for the parameter, the second subscript indicates the level of the parameter (i.e., 0 = intercept, 1 = main effect, 2 = two-way interaction, etc.), and the third subscript specifies which attribute(s) are involved in the parameter. For example, if item 1 on an assessment measured both attributes 1 and 2, the probability of a correct response would be defined by an intercept, \(\lambda_{1,0}\), a simple main effect for attribute 1, \(\lambda_{1,1,(1)}\), a simple main effect for attribute 2, \(\lambda_{1,1,(2)}\), and the interaction between attributes 1 and 2, \(\lambda_{1,2,(1,2)}\).

For any number of attributes, \(A\), the kernel for the logit can be defined as follows:

\[\begin{equation} \text{kernel}_i=\lambda_{i,0} +\sum_{a=1}^A\lambda_{i,1,(a)}\alpha_{ca}q_{ia}+\sum_{a=1}^{A-1}\sum_{a'=a+1}^A\lambda_{i,2,(a,a')}\alpha_{ca}\alpha_{ca'}q_{ia}q_{ia'}+... \tag{2.8} \end{equation}\]

Equation (2.8) demonstrates that the kernel for item \(i\) is made up of the intercept, all main effects for attributes that have both been mastered by individuals in latent class \(c\) and are measured by item \(i\), and all two-way interactions for which both attributes in the interaction have been mastered by latent class \(c\) and are measured by item \(i\). Equation (2.8) could continue on, adding higher level interaction terms as more and more attributes are measured by item \(i\), up to the total number of attributes, \(A\).

Written more succinctly, equation (2.8) can be expressed with matrix notation as:

\[\begin{equation} \text{kernel}_i=\lambda_{i,0}+\pmb{\lambda}_i^T\textbf{h}(\pmb{\alpha}_c,\textbf{q}_i) \tag{2.9} \end{equation}\]

For item \(i\), \(\pmb{\lambda}_i^T\) represents the transpose of the \((2^A-1)\times1\) vector of item parameters that contains the main effects and interactions, and \(\textbf{h}(\pmb{\alpha}_c,\textbf{q}_i)\) is the \((2^A-1)\times1\) vector of attribute and Q-matrix combinations. Thus, written in a more general form, the probability of a respondent in latent class \(c\) providing a correct response to item \(i\) can be defined as:

\[\begin{equation} \pi_{ic} = P(X_{ic}=1\ |\ \pmb{\alpha}_c) = \frac{\exp(\lambda_{i,0}+\pmb{\lambda}_i^T\textbf{h}(\pmb{\alpha}_c,\textbf{q}_i))}{1+\exp(\lambda_{i,0}+\pmb{\lambda}_i^T\textbf{h}(\pmb{\alpha}_c,\textbf{q}_i))} \tag{2.10} \end{equation}\]

This expression of a DCM has many advantages over those defined in section 2.2. First, the LCDM has parameters that are easier to interpret than those seen in the traditional DCMs. For example, the DINA and NIDA models both contain guessing and slipping parameters, but the interpretation of them differs due to how parameters are constrained in these models. Additionally, the slipping parameter (the probability of getting the item wrong despite having mastered all of the constituent attributes) is less useful than \((1-s_i)\), or the probability of providing a correct response, which is usually the value of interest. In contrast, the LCDM parameters are directly analogous to the parameters of a generalized linear model. Each parameter represents the change in the log-odds of a correct response.
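The kernel in equation (2.8) and the response probability in equation (2.10) can be sketched for a single item as follows. The dictionary-based parameter layout, the function names, and the parameter values are all hypothetical; this is an illustration of the formulas rather than the implementation used in any particular software.

```python
import math
from itertools import combinations

def lcdm_kernel(intercept, effects, alpha, q):
    """Kernel of equation (2.8) for one item and one attribute profile.

    effects maps a tuple of 0-based attribute indices to its lambda, e.g.
    (0,) for a main effect and (0, 1) for a two-way interaction. A term is
    included only when every attribute in it is both measured by the item
    (q) and mastered in the profile (alpha).
    """
    measured = [a for a, q_a in enumerate(q) if q_a == 1]
    kernel = intercept
    for order in range(1, len(measured) + 1):
        for combo in combinations(measured, order):
            if all(alpha[a] == 1 for a in combo):
                kernel += effects.get(combo, 0.0)
    return kernel

def lcdm_prob(intercept, effects, alpha, q):
    """Equation (2.10): inverse logit of the kernel."""
    eta = lcdm_kernel(intercept, effects, alpha, q)
    return math.exp(eta) / (1 + math.exp(eta))

# Hypothetical parameters for an item measuring attributes 1 and 2.
intercept = -2.0
effects = {(0,): 1.5, (1,): 1.0, (0, 1): 0.5}
q = [1, 1]

for alpha in ([0, 0], [1, 0], [0, 1], [1, 1]):
    print(alpha, lcdm_kernel(intercept, effects, alpha, q),
          round(lcdm_prob(intercept, effects, alpha, q), 3))
```

Each mastered attribute adds its main effect to the log-odds, and the interaction is added only when both attributes are mastered, which mirrors the interpretation of the parameters as changes in the log-odds of a correct response.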

Additionally, by placing constraints on the parameters of the LCDM, it is possible to estimate the aforementioned DCMs. For example, the DINA model requires that all attributes be mastered in order to increase the probability of a correct response. This can be accomplished by constraining all parameters except the intercept and highest level interaction term of equation (2.8) to be 0 (Henson et al., 2008; Rupp et al., 2010). With these constraints, the log-odds of success are equal to \(\lambda_{i,0}\) when not all attributes have been mastered, and \(\lambda_{i,0}+\lambda_{i,2,(a,a')}\) when all attributes have been mastered (assuming the item only measures two attributes). The inverse logit (equation (2.7)) of \(\lambda_{i,0}\) is equal to the guessing parameter \(g_i\) in the DINA model, and the inverse logit of \(\lambda_{i,0}+\lambda_{i,2,(a,a')}\) is equal to \((1-s_i)\).

Similarly, the DINO model can also be replicated through constraints on the LCDM. In the DINO model, mastering one attribute is just as good as mastering a different attribute or multiple attributes that are measured by the item. Thus, the first constraint is that the main effects must be equal. If an item measures two attributes, the increase in the log-odds of providing a correct response is equal, regardless of which of the two is mastered. The second constraint is that the interaction term is equal to the negative of the main effect parameter. This means that three parameters (two main effects and an interaction) all have the same absolute value, but the main effects are positive and the interaction is negative. This has the effect of the interaction cancelling out the additional increase in log-odds of a correct response for mastering additional attributes. Thus, the kernel for the LCDM parameterization of the DINO model can be written as:

\[\begin{equation} \begin{split} \text{kernel}_i&=\lambda_{i,0} + \lambda_i\alpha_1+\lambda_i\alpha_2-\lambda_i\alpha_1\alpha_2 \\ &= \lambda_{i,0} + \lambda_i(\alpha_1+\alpha_2-\alpha_1\alpha_2) \end{split} \tag{2.11} \end{equation}\]

When neither of the attributes measured by the item has been mastered, \(\alpha_1\) and \(\alpha_2\) are 0, and equation (2.11) simplifies to \(\lambda_{i,0}\), which is equivalent to the logit of \(g_i\) in the DINO model. If only attribute 1 has been mastered, then \(\alpha_2\) will be equal to 0, and equation (2.11) simplifies to \(\lambda_{i,0} + \lambda_i\), which is equivalent to the logit of \((1-s_i)\) in the DINO model. Finally, if both attributes have been mastered, then equation (2.11) becomes \(\lambda_{i,0} + \lambda_i(1+1-1\times1) = \lambda_{i,0} + \lambda_i\), which is identical to the result when only one attribute was mastered.
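The constraints that reproduce the DINA and DINO models can be checked numerically for an item measuring two attributes. The sketch below uses hypothetical parameter values and function names; the DINA kernel keeps only the intercept and the two-way interaction, and the DINO kernel follows equation (2.11).

```python
import math

def inv_logit(eta):
    return math.exp(eta) / (1 + math.exp(eta))

def dina_kernel(lam_0, lam_12, a1, a2):
    """LCDM with all main effects fixed to 0 (DINA constraints)."""
    return lam_0 + lam_12 * a1 * a2

def dino_kernel(lam_0, lam, a1, a2):
    """Equation (2.11): equal main effects, interaction equal to -lam."""
    return lam_0 + lam * (a1 + a2 - a1 * a2)

lam_0, lam = -1.5, 3.0  # hypothetical values on the log-odds scale

print("profile  P(DINA)  P(DINO)")
for a1, a2 in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    p_dina = inv_logit(dina_kernel(lam_0, lam, a1, a2))
    p_dino = inv_logit(dino_kernel(lam_0, lam, a1, a2))
    print((a1, a2), round(p_dina, 3), round(p_dino, 3))
```

Under both sets of constraints only two distinct probabilities appear: the inverse logit of the intercept, which plays the role of \(g_i\), and the inverse logit of the intercept plus the remaining effect, which plays the role of \((1-s_i)\).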

Because the LCDM is able to encompass this variety of DCMs, the choice of compensatory versus non-compensatory DCM becomes irrelevant. Instead, the saturated LCDM can be estimated, and if the items truly follow the DINA, DINO, or other lower-level DCM, the estimated parameters will reflect this. Further, the LCDM provides a framework for testing the assumptions of these other DCMs. For example, two models could be fit to the same data: one fully saturated LCDM, the other with constraints on the item parameters. A likelihood ratio test can then be performed to determine if the reduced model fits as well as the saturated model.
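A likelihood ratio test of this kind can be sketched as follows. The log-likelihood values are hypothetical placeholders rather than results from any fitted model, and the chi-square tail probability uses a closed form that is only valid for even degrees of freedom, which keeps the example free of external libraries. The degrees of freedom equal the number of constrained parameters; 2 is assumed here, as when the two main effects of a single two-attribute item are fixed to 0.

```python
import math

def lrt_statistic(loglik_reduced, loglik_saturated):
    """Likelihood ratio statistic: -2 times the log-likelihood difference."""
    return -2.0 * (loglik_reduced - loglik_saturated)

def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability (closed form, even df only)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2.0
    return math.exp(-half) * sum(half ** j / math.factorial(j)
                                 for j in range(df // 2))

# Hypothetical log-likelihoods from a reduced and a saturated model.
loglik_reduced, loglik_saturated = -1052.3, -1049.1
df = 2  # assumed number of parameters fixed in the reduced model

stat = lrt_statistic(loglik_reduced, loglik_saturated)
p_value = chi2_sf_even_df(stat, df)
print(round(stat, 2), round(p_value, 4))
```

A small p-value would suggest that the constraints worsen fit, so the saturated model would be retained; a large p-value would suggest the reduced model fits about as well.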

## 2.4 Structural models

To this point, the discussion has focused on the measurement model of DCMs. Recall from equation (2.1), reprinted here, that this is only one piece of the diagnostic model.

\[\begin{equation} P(\text{X}_r=\text{x}_r)=\sum_{c=1}^C\nu_c\prod_{i=1}^I\pi_{ic}^{x_{ir}}(1-\pi_{ic})^{1-x_{ir}} \end{equation}\]

The measurement model, which relates the attributes to the observed item responses, is concerned with estimating \(\pi_{ic}\). The structural model is focused on \(\nu_c\). In DCMs, \(\nu_c\) represents the base rate probability of each latent class (Rupp et al., 2010). The base rate probabilities allow for the calculation of mastery rates for each attribute marginally, as well as the correlations between the attributes. In an assessment with \(A\) attributes there are \(2^A\) latent classes. Because all elements of \(\pmb{\nu}\) must sum to 1, there are \(2^A - 1\) parameters to estimate, as the final element can be calculated by taking 1 minus the sum of the other elements.

Estimating each of these probabilities directly is referred to as the “unstructured” or “unconstrained” structural model (Rupp et al., 2010). By estimating the probabilities directly, it is possible to observe if there are any classes that have few respondents, possibly indicating the presence of an attribute hierarchy as described by Templin & Bradshaw (2014a). However, this type of unconstrained model can cause problems in high dimensional attribute spaces. Because the number of structural parameters to be estimated is \(2^A-1\), the number of parameters increases exponentially with each added attribute. For example, a five attribute assessment requires \(2^5-1=31\) parameters, whereas a 10 attribute assessment would require \(2^{10}-1=1{,}023\) parameters. Thus, it is often desirable to reduce the number of parameters that need to be estimated.
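The role of the base rates and the growth in the number of structural parameters can both be illustrated with a few lines of code. The base rates below are hypothetical, and the function name `marginal_mastery` is introduced only for this sketch.

```python
from itertools import product

def marginal_mastery(nu, profiles):
    """Sum the class base rates over the classes that master each attribute."""
    num_attributes = len(profiles[0])
    return [sum(nu_c for nu_c, prof in zip(nu, profiles) if prof[a] == 1)
            for a in range(num_attributes)]

# Two attributes -> four classes: (0,0), (0,1), (1,0), (1,1).
profiles = list(product([0, 1], repeat=2))
nu = [0.3, 0.2, 0.1, 0.4]  # hypothetical base rates; they sum to 1

print([round(m, 2) for m in marginal_mastery(nu, profiles)])  # [0.5, 0.6]

# Number of free parameters in the unconstrained structural model.
for num_attributes in (2, 5, 10):
    print(num_attributes, 2 ** num_attributes - 1)
```

With 10 attributes the unconstrained model already requires 1,023 base rates, which is the motivation for the reduced structural models discussed next.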

Two such approaches for reducing the structural model are the unstructured tetrachoric model (Hartz, 2002) and structured tetrachoric model (de la Torre & Douglas, 2004). These approaches work well in many instances (e.g., if the primary interest is the relationships between the attributes); however, they can be rather restrictive in what can be estimated. For example, suppose that an unstructured tetrachoric model is utilized, and it is determined that the structural model has been reduced too much. With this method, it is not straightforward to add parameters back in without going all the way back to the unconstrained model. Additionally, suppose that a researcher is interested in both the potential hierarchical structure of attributes and the attributes’ associations. How should the researcher reduce the model and achieve both of these desired outcomes? Rupp et al. (2010) suggest a “top-down approach” using log-linear models, which is described in the following section (section 2.4.1).

### 2.4.1 Log-linear structural models

The log-linear approach to structural models was proposed by Henson & Templin (2005), and is very similar to the log-linear model that was used to define the measurement model of the LCDM in section 2.3.1. Specifically, the kernel for latent class \(c\) can be defined as follows:

\[\begin{equation} \text{kernel}_c=\sum_{a=1}^A\gamma_{1,(a)}\alpha_{ca}+\sum_{a=1}^{A-1}\sum_{a'=a+1}^A\gamma_{2,(a,a')}\alpha_{ca}\alpha_{ca'}+...+\gamma_{A,(a,a',...)}\prod_{a=1}^A\alpha_{ca} \tag{2.12} \end{equation}\]

The parameters included for latent class \(c\) are a main effect for each attribute that has been mastered by individuals in the class, as well as interactions between the mastered attributes (two-way up to \(A\)-way, where \(A\) is the total number of attributes assessed). For an assessment measuring two attributes, the structural model would be defined as outlined in Table 2.3.

| Class | Attribute Profile | Kernel |
|---|---|---|
| 1 | [0,0] | 0 |
| 2 | [1,0] | \(\gamma_{1,(1)}\) |
| 3 | [0,1] | \(\gamma_{1,(2)}\) |
| 4 | [1,1] | \(\gamma_{1,(1)}+\gamma_{1,(2)}+\gamma_{2,(1,2)}\) |

The fully saturated log-linear structural model is equivalent to the unconstrained structural model. However, this parameterization has many benefits over the tetrachoric methods. First, using this method allows for a hypothesis test on each of the estimated parameters. Thus, non-significant parameters can be removed from the model, allowing for model reduction to occur without enforcing a less flexible structure. This parameterization can also be used to reduce the structural model prior to estimation. For example, Xu & von Davier (2008) used a log-linear structural model in their analysis of data from the National Assessment of Educational Progress. In the structural model, they allowed for only main effects and two-way and three-way interactions. Estimating only main effects results in independent attributes. The addition of the two-way interactions allows for variances (and therefore correlations) to be estimated. Finally, the three-way interactions allow for the third moment, skewness, to be captured. Although the log-linear structural model may not be as intuitive as the tetrachoric models to those familiar with structural models in structural equation modeling and multidimensional item response theory, its flexibility makes it easier to remove parameters from the structural model when the attribute structure is unclear *a priori*.
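The kernels in Table 2.3 can be turned into class base rates with a short sketch. One assumption is made explicit here because the text above does not state it: the base rates are taken to be the exponentiated kernels normalized to sum to 1, \(\nu_c=\exp(\text{kernel}_c)/\sum_{c'}\exp(\text{kernel}_{c'})\). The function name and the \(\gamma\) values are hypothetical.

```python
import math

def class_probabilities(gamma_1, gamma_2, gamma_12):
    """Base rates for the four classes of Table 2.3, in table order.

    Assumption: nu_c is exp(kernel_c) normalized over all classes.
    """
    kernels = [0.0,                            # class 1: [0,0]
               gamma_1,                        # class 2: [1,0]
               gamma_2,                        # class 3: [0,1]
               gamma_1 + gamma_2 + gamma_12]   # class 4: [1,1]
    weights = [math.exp(k) for k in kernels]
    total = sum(weights)
    return [w / total for w in weights]

# With every parameter at 0 the four classes are equally likely.
print(class_probabilities(0.0, 0.0, 0.0))

# Main effects only (interaction fixed to 0): the attributes are independent,
# so nu[0] * nu[3] equals nu[1] * nu[2].
nu = class_probabilities(0.5, -0.3, 0.0)
print(round(nu[0] * nu[3] - nu[1] * nu[2], 12))

# A positive interaction makes joint mastery more likely than independence implies.
nu_assoc = class_probabilities(0.5, -0.3, 1.2)
print(nu_assoc[0] * nu_assoc[3] > nu_assoc[1] * nu_assoc[2])
```

Fixing the interaction to 0 is an example of the kind of reduction described above: one parameter is removed and the attributes are forced to be independent, while the main effects still allow each attribute its own mastery rate.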

## 2.5 Model reduction

As has been generally discussed thus far, model reduction is the process of removing parameters from the model in order to create a more parsimonious and efficient model (Templin & Bradshaw, 2014b), while still maintaining a structure that is capable of capturing the complexity of the data. The parameters can be removed from either the structural model or the measurement model. In practice, model reduction can take place in different contexts. The first is what usually comes to mind when thinking of model reduction: removing parameters after the initial estimation of the model. However, model reduction is also common when the initial model fails to converge. In this scenario, without the parameter estimates and hypothesis tests from the initial model, heuristic decisions must be made in order to reduce the model to a structure that is estimable. Although these contexts differ, understanding both processes is critical to the interpretation of the results. Before examining prior research on model reduction in DCMs, the related literature from other latent variable models such as structural equation modeling and item response theory will be examined.

### 2.5.1 Model reduction in structural equation modeling

As in DCMs, in structural equation modeling, the measurement model relates the observed variables to the latent traits and the structural model defines the relationships between the latent traits. As such, both parts of the structural equation model can be reduced. In practice, this is a multistage process. Both Kline (2002) and Ullman (2012) suggest first fitting each measurement model separately. In other words, for each latent variable, fit a unidimensional model first. Then, add or remove parameters as necessary in order to ensure model fit. Once all of the unidimensional models have been assessed for model fit, they can be estimated simultaneously in the structural equation model, and the structural model can be reparameterized as needed to ensure the fit of the whole model.

Thus, the structural equation modeling literature generally follows a measurement-model-first approach to model reduction, with the structural model reduced second. In an examination of the relationships between mental toughness, motivation, and emotion in sports, Perry, Nicholls, Clough, & Crust (2015) examined the factors from each questionnaire separately before combining them into the full model. They found that the full model had significantly better fit when the measurement models were adjusted prior to the estimation of the full model. Similarly, Burkholder & Harlow (2003) followed this procedure to reduce their model investigating HIV behavior risk. In their model, the measurement model was not reduced because each factor was just-identified. Thus, reduction occurred only at the structural level, where all non-significant regression coefficients were removed.

However, model modifications in structural equation models are not limited to the removal of parameters. It is also common to use modification indices to locate places in the model where there is misfit and add parameters to improve the overall fit (Brown, 2006; Kaplan, 2009). These parameters could include additional regression paths, covariances between latent factors, or residual covariances. Thus, when structural equation models are modified following the initial estimation, it is common practice not only to reduce the model by removing non-significant parameters, but also to add parameters in order to ensure model fit.

### 2.5.2 Model reduction in item response theory

Unlike in structural equation modeling, model reduction is relatively uncommon in item response theory. There are a few possible reasons for this. First, the majority of operationally used item response theory models are unidimensional, with multidimensional item response theory models having yet to see widespread operational use (see Fukuhara & Kamata, 2011; Reckase, 1997; Sinharay, 2010; Thissen & Steinberg, 1986). In unidimensional models, there are no relationships between latent variables to estimate, as there is only one latent variable. Thus, only the measurement model is of consequence. In unidimensional item response theory models, specifying the measurement model comes down to selecting the number of item parameters to be included (i.e., the 1-parameter, 2-parameter, or 3-parameter logistic model). Thus, after choosing an initial model, the options are either to reduce the model by removing items that do not fit, or to change models to add additional parameters.
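
The nesting of these unidimensional models can be made explicit. As a sketch in standard notation (the symbols below are assumptions for illustration, not notation defined elsewhere in this chapter), let $\theta_j$ be the latent trait of respondent $j$, and let $a_i$, $b_i$, and $c_i$ be the discrimination, difficulty, and lower asymptote of item $i$. The 3-parameter logistic model is then

$$
P(X_{ij} = 1 \mid \theta_j) = c_i + (1 - c_i)\,\frac{\exp[a_i(\theta_j - b_i)]}{1 + \exp[a_i(\theta_j - b_i)]}
$$

Constraining $c_i = 0$ for all items yields the 2-parameter logistic model, and additionally constraining $a_i = a$ for all items yields the 1-parameter logistic model. Moving between these models therefore changes the number of parameters for every item at once, rather than removing individual non-significant terms.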

In contrast, multidimensional item response theory models do offer opportunities for model reduction. In multidimensional models, the covariance structure of the latent factors can be reduced, and the latent traits measured by each item can be respecified. However, although there are a few examples of various specifications being tested (see Kingston & McKinley, 1988; McKinley & Kingston, 1988), this is typically not done in practice. Indeed, the most widely cited textbook on multidimensional item response theory models makes no mention of modifying the structure of the multidimensional model following estimation (see Reckase, 2009). Instead, a saturated covariance matrix is estimated for the latent traits, and decisions about the measurement model parameters are confined to removing items that do not fit the model.

### 2.5.3 Model reduction in DCMs

In diagnostic models, the model reduction process is more similar to that of structural equation modeling than that of item response theory, in that it is common practice to reduce the model by removing non-significant parameters after the estimation of the initial model. For example, Jurich & Bradshaw (2013) used the LCDM with a log-linear structural model to scale the Socialcultural Dimension Assessment version 6 (Halonen, Harris, Pastor, Abrahamson, & Huffman, 2005). In the estimation of the LCDM, four separate structural models were defined *a priori*: the fully saturated log-linear model, a reduced model where three- and four-way interactions were removed (constrained to be 0), and a model with only main effects and two-way interactions that were constrained to be equal. Initially, Jurich & Bradshaw (2013) estimated the model with fully saturated measurement and structural models. Following this initial estimation, they removed non-significant parameters from the measurement model before using the reduced measurement model to evaluate the structural models.

A similar approach was used by Bradshaw, Izsák, Templin, & Jacobson (2014) in their analysis of the Diagnosing Teachers’ Multiplicative Reasoning assessment. In this analysis, a fully saturated LCDM and log-linear structural model were used in the initial estimation. However, when three- and four-way interaction terms were specified in the structural model, the model failed to converge. Thus, they settled on a reduced structural model with only main effects and two-way interactions included. Using the reduced structural model, Bradshaw et al. proceeded to remove non-significant terms from the measurement model. The same measurement model procedure was followed by de la Torre, Ark, & Rossi (2015) in their analysis of the Dutch version of the Millon Clinical Multiaxial Inventory-III (T. Millon, Millon, Davis, & Grossman, 2009). Using a saturated structural model, they reduced the measurement model by removing non-significant terms from each item. However, no further reduction was applied to the structural model.
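
To give a sense of the scale of the structural reductions described in these studies, the following minimal sketch counts the free parameters in a log-linear structural model when interactions above a given order are removed. It assumes binary attributes and that the intercept is fixed by the constraint that class probabilities sum to one. The function name and the choice of four attributes (implied by the mention of four-way interactions) are illustrative assumptions and are not taken from the cited studies.

```python
from math import comb


def structural_parameter_count(n_attributes: int, max_order: int) -> int:
    """Count free log-linear structural parameters (hypothetical helper).

    Includes main effects (order 1) and all interactions up to `max_order`.
    The intercept is excluded because it is determined by the constraint
    that the latent class probabilities sum to one.
    """
    if not 1 <= max_order <= n_attributes:
        raise ValueError("max_order must be between 1 and n_attributes")
    return sum(comb(n_attributes, order) for order in range(1, max_order + 1))


# Assumed number of attributes for illustration only.
n_attributes = 4

# Saturated model: main effects plus all two-, three-, and four-way terms.
saturated = structural_parameter_count(n_attributes, max_order=4)

# Reduced model: three- and four-way interactions constrained to 0.
two_way = structural_parameter_count(n_attributes, max_order=2)

print(f"saturated: {saturated}, main effects + two-way: {two_way}")
```

Under these assumptions, removing the three- and four-way interactions reduces the structural model from 15 to 10 free parameters, and the savings grow quickly as the number of attributes increases.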

As can be seen from these studies, there is inconsistency in the order in which model reduction occurs in DCMs. Bradshaw et al. (2014) reduced the structural model first out of necessity due to non-convergence, whereas Jurich & Bradshaw (2013) elected to reduce the measurement model first. However, there has not, to this point, been an investigation as to which of these processes should be preferred.

One option would be to follow the direction of the structural equation modeling literature and always reduce the measurement model first. This is potentially problematic for a few reasons. First, structural equation models assume that the data are normally distributed, whereas diagnostic models assume dichotomous items. It is possible to have non-dichotomous data (e.g., Skrondal & Rabe-Hesketh, 2004), but that is beyond the scope of this study. Additionally, the latent variables in structural equation models are generally continuous, whereas DCMs use categorical latent variables. Because both the observed data and the latent variables follow different distributions in the two types of models, it is possible that the best practices also differ.

Additionally, there is the added complexity of precisely how the measurement model is estimated in structural equation models. Recall from section 2.5.1 that the recommended practice is to estimate each latent variable’s own measurement model first in order to ensure fit (Kline, 2002). However, the measurement model for DCMs requires multiple attributes in order to estimate the interaction effects (see equation (2.8)). Thus, the measurement models could only be estimated separately when the Q-matrix has a simple structure in which each item measures only one attribute. Therefore, the approach taken by the structural equation modeling community is unlikely to be feasible for most DCM assessments.
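
The feasibility condition described above can be checked directly from the Q-matrix. The following is a minimal sketch, assuming a binary Q-matrix stored as a list of rows (items) by columns (attributes). The function names and example Q-matrices are hypothetical and do not correspond to any assessment discussed in this chapter.

```python
from typing import Sequence


def is_simple_structure(q_matrix: Sequence[Sequence[int]]) -> bool:
    """Return True when every item (row) measures exactly one attribute."""
    return all(sum(row) == 1 for row in q_matrix)


def saturated_item_parameter_count(q_row: Sequence[int]) -> int:
    """Number of saturated LCDM parameters for a single item.

    An item measuring k attributes has an intercept, k main effects, and
    all possible interactions among the k attributes: 2**k in total.
    """
    return 2 ** sum(q_row)


# Hypothetical Q-matrices for illustration only.
simple_q = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
]
complex_q = [
    [1, 0, 0],
    [1, 1, 0],  # measures two attributes, so an interaction is estimated
    [0, 1, 1],
]

print(is_simple_structure(simple_q))   # attribute-by-attribute fitting possible
print(is_simple_structure(complex_q))  # interactions tie the attributes together
print([saturated_item_parameter_count(row) for row in complex_q])
```

Only in the simple-structure case could each attribute’s items be fit in isolation, as in the structural equation modeling workflow. As soon as one item measures two or more attributes, its interaction terms require those attributes to be estimated jointly.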

## 2.6 The current study

Despite the uncertainty surrounding the process, model reduction is a crucial aspect of DCM estimation. Rojas, de la Torre, & Olea (2012) showed that estimating unnecessary higher-order interaction terms can decrease classification accuracy compared to a reduced model. However, Rojas et al. (2012) only examined reduction of the measurement model. Therefore, the present study seeks to examine reduction of both the measurement and structural models to answer the following research questions:

- Does choosing a different model reduction process impact the output of the model?
- What are the benefits of reducing the measurement and/or structural model(s)?
- Are there advantages or disadvantages to the order of reduction (i.e., measurement or structural reduction first)?

This is accomplished through two studies. The first is a pilot study to further demonstrate the importance of the order of model reduction. Specifically, the model fit to the Diagnosing Teachers’ Multiplicative Reasoning assessment data (as described in Bradshaw et al., 2014) is reduced in different orders to compare the resulting parameter estimates. Second, given the results of the pilot study, a Monte Carlo simulation study is conducted to examine the effects of model reduction and the order of reduction under a variety of data generation and dimensionality conditions.