Recently, we (Ingmar Visser and Maarten) released the the latest version of depmixS4 on CRAN. DepmixS4 is an R package to estimate mixture and hidden Markov models. This version includes several improvements (e.g., prettier model summaries, faster EM for large models) and a new (experimental!) feature to estimate models by maximizing the “classification likelihood”.

## Classification likelihood

Estimating models with the classification likelihood can be achieved by setting `classification="hard"`

in the `emcontrol`

argument of the fit method. The `emcontrol`

argument expects a list with several arguments and the best way to construct it is with the `em.control`

function.

An example (from the `em.control`

help page) is:

```
# load the package
library(depmixS4)
# load the speed example dataset
data(speed)
# specify a multivariate hidden Markov model with two states
mod <- depmix(list(rt ~ 1, corr ~ 1), data = speed, nstates = 2, family = list(gaussian(),
multinomial("identity")), ntimes = c(168, 134, 137))
# set the seed to get reproducible results
set.seed(1)
# fit the model by calling fit
fmod_CEM1 <- fit(mod, emcontrol = em.control(classification = "hard"))
```

```
iteration 0 logLik: -197.4
converged at iteration 4 with logLik: -174.5
```

Using the classification likelihood, each observation is assigned to a single state (the maximum a posteriori state). For hidden Markov models, the MAP state sequence is computed by the Viterbi algorithm. Using such “hard” assignment of observations to states tends to speed up convergence considerably. However, the classification likelihood is a different objective function than the usual likelihood and the resulting estimates may vary considerably. Initial tests using classification EM (CEM) for hidden Markov models indicate that the resulting estimates depend more strongly on starting-values than when using the regular EM algorithm.

By default, `depmixS4`

initially assigns observations randomly to states, which is a simple way to obtain random starting values. If we want to estimate with different starting values, we can simply run the fit method again (with a different seed):

```
set.seed(3)
fmod_CEM2 <- fit(mod, emcontrol = em.control(classification = "hard"))
```

```
iteration 0 logLik: -433.4
converged at iteration 4 with logLik: -263.5
```

Note the large difference in final likelihood values! Clearly, the starting values in the second run were not overly good, and estimation converged to a poor local maximum. As might be expected, there is also a large difference in the model estimates:

```
summary(fmod_CEM1)
```

```
Initial state probabilties model
pr1 pr2
0 1
Transition matrix
toS1 toS2
fromS1 0.90761 0.09239
fromS2 0.07937 0.92063
Response parameters
Resp 1 : gaussian
Resp 2 : multinomial
Re1.(Intercept) Re1.sd Re2.pr1 Re2.pr2
St1 5.527 0.2093 0.47059 0.5294
St2 6.397 0.2354 0.09524 0.9048
```

```
summary(fmod_CEM2)
```

```
Initial state probabilties model
pr1 pr2
1 0
Transition matrix
toS1 toS2
fromS1 0.7877 0.2123
fromS2 0.6126 0.3874
Response parameters
Resp 1 : gaussian
Resp 2 : multinomial
Re1.(Intercept) Re1.sd Re2.pr1 Re2.pr2
St1 6.137 0.4736 0 1
St2 5.705 0.3583 1 0
```

If we run the same example with the regular EM, we get the following results:

```
set.seed(1)
fmod_EM1 <- fit(mod, emcontrol = em.control(classification = "soft"))
```

```
iteration 0 logLik: -553.9
iteration 5 logLik: -517.9
iteration 10 logLik: -296.5
iteration 15 logLik: -296.1
iteration 20 logLik: -296.1
converged at iteration 22 with logLik: -296.1
```

```
set.seed(3)
fmod_EM2 <- fit(mod) # soft classification is the default
```

```
iteration 0 logLik: -554.7
iteration 5 logLik: -554.4
iteration 10 logLik: -543.8
iteration 15 logLik: -298.8
iteration 20 logLik: -296.1
iteration 25 logLik: -296.1
converged at iteration 28 with logLik: -296.1
```

Both runs converge to the same solution and the parameter estimates happen to be quite similar to the first solution found with the CEM algorithm:

```
summary(fmod_EM2)
```

```
Initial state probabilties model
pr1 pr2
0 1
Transition matrix
toS1 toS2
fromS1 0.89886 0.1011
fromS2 0.08359 0.9164
Response parameters
Resp 1 : gaussian
Resp 2 : multinomial
Re1.(Intercept) Re1.sd Re2.pr1 Re2.pr2
St1 5.521 0.2023 0.47207 0.5279
St2 6.392 0.2396 0.09851 0.9015
```

In conclusion, while it is always advisable to estimate a model repeatedly with different starting values, it seems especially relevant when using the classification likelihood.

The CEM algorithm is used more regularly in mixture modelling. For special cases of mixture models, it is equivalent to the k-means clustering algorithm. It's properties in hidden Markov modelling require further investigation.