Solving the MLE
As we know from calculus, at the maximum of the log-likelihood defined in Equation \eqref{eqn:log-lik}, the derivative with respect to each of the parameters, say \( \vmu_k \), must be 0.
This means that, at the maximum,
\begin{align}
\frac{\partial \ln p\left(\unlabeledset|\set{P(k),\vmu_k,\mSigma_k}_{k=1}^K\right)}{\partial \vmu_k} &= 0 \\
\sum_{\nunlabeledsmall = 1}^{\nunlabeled} \frac{P(k)\Gauss(\vx_\nunlabeledsmall | \vmu_k, \mSigma_k)}{\sum_{j=1}^K P(j)\Gauss(\vx_\nunlabeledsmall | \vmu_j, \mSigma_j)} \mSigma_k^{-1}(\vx_\nunlabeledsmall - \vmu_k) &= 0
\label{eqn:log-lik-deriv}
\end{align}
For simplicity, let us denote the fraction in the above equation by \( \gamma_{\nunlabeledsmall k} \), defined as
\begin{equation}
\gamma_{\nunlabeledsmall k} = \frac{P(k)\Gauss(\vx_\nunlabeledsmall | \vmu_k, \mSigma_k)}{\sum_{j=1}^K P(j)\Gauss(\vx_\nunlabeledsmall | \vmu_j, \mSigma_j)}
\label{eqn:gamma-uk}
\end{equation}
We can interpret \( \gamma_{\nunlabeledsmall k} \) as the responsibility of cluster \( k \) for the example \( \vx_\nunlabeledsmall \), that is, the posterior probability that \( \vx_\nunlabeledsmall \) belongs to cluster \( k \) under the current parameters.
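To make this concrete, here is a minimal NumPy/SciPy sketch of the computation in Equation \eqref{eqn:gamma-uk}. The names are placeholders chosen for this illustration: \texttt{X} is the \( \nunlabeled \times D \) data matrix, and \texttt{priors}, \texttt{mus}, \texttt{Sigmas} hold the current estimates of \( P(k) \), \( \vmu_k \), and \( \mSigma_k \).
\begin{verbatim}
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, priors, mus, Sigmas):
    # Numerator of each gamma_{uk}: P(k) * N(x_u | mu_k, Sigma_k)
    U, K = X.shape[0], len(priors)
    weighted = np.zeros((U, K))
    for k in range(K):
        weighted[:, k] = priors[k] * multivariate_normal.pdf(
            X, mean=mus[k], cov=Sigmas[k])
    # Denominator: sum over clusters j of P(j) * N(x_u | mu_j, Sigma_j)
    return weighted / weighted.sum(axis=1, keepdims=True)
\end{verbatim}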
Substituting this representation into Equation \eqref{eqn:log-lik-deriv}, we get
\begin{align}
\sum_{\nunlabeledsmall = 1}^{\nunlabeled} \gamma_{\nunlabeledsmall k} \mSigma_k^{-1}(\vx_\nunlabeledsmall - \vmu_k) = 0
\label{eqn:log-lik-deriv-gamma}
\end{align}
Since \( \mSigma_k \) is a covariance matrix, we can assume it to be nonsingular; multiplying both sides of the above equation by \( \mSigma_k \) removes the \( \mSigma_k^{-1} \) factor.
\begin{align}
\sum_{\nunlabeledsmall = 1}^{\nunlabeled} \gamma_{\nunlabeledsmall k} (\vx_\nunlabeledsmall - \vmu_k) = 0
\label{eqn:log-lik-deriv-gamma-2}
\end{align}
Rearranging the terms of this equation and solving for \( \vmu_k \), we get
\begin{align}
\vmu_k = \frac{\sum_{\nunlabeledsmall = 1}^{\nunlabeled} \gamma_{\nunlabeledsmall k} \vx_\nunlabeledsmall}{\sum_{\nunlabeledsmall = 1}^{\nunlabeled} \gamma_{\nunlabeledsmall k}}
\label{eqn:muk-soln-first}
\end{align}
Note the normalization term in the denominator.
It sums over all the examples in the dataset, weighting each example by the factor \( \gamma_{\nunlabeledsmall k} \).
Intuitively, we can interpret this term as the soft (effective) number of examples that belong to cluster \( k \).
Thus, we denote this term by \( \nunlabeled_k \).
\begin{equation}
\nunlabeled_k = \sum_{\nunlabeledsmall = 1}^{\nunlabeled} \gamma_{\nunlabeledsmall k}
\label{eqn:u-k}
\end{equation}
With this notation, the optimal value of \( \vmu_k \) can be written as
\begin{align}
\vmu_k = \frac{1}{\nunlabeled_k} \sum_{\nunlabeledsmall = 1}^{\nunlabeled} \gamma_{\nunlabeledsmall k} \vx_\nunlabeledsmall
\label{eqn:muk-soln}
\end{align}
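Continuing the same illustrative sketch, the soft counts of Equation \eqref{eqn:u-k} and the mean update of Equation \eqref{eqn:muk-soln} can be computed from the responsibility matrix, here called \texttt{gamma}, of shape \( \nunlabeled \times K \).
\begin{verbatim}
import numpy as np

def update_means(X, gamma):
    # Soft counts U_k: column sums of the responsibility matrix.
    U_k = gamma.sum(axis=0)               # shape (K,)
    # mu_k = (1 / U_k) * sum_u gamma_{uk} x_u, one row per cluster.
    mus = (gamma.T @ X) / U_k[:, None]    # shape (K, D)
    return U_k, mus
\end{verbatim}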
Along similar lines, taking the derivative with respect to the parameters \( \mSigma_k \) and setting it to 0, we arrive at the solution
\begin{align}
\mSigma_k = \frac{1}{\nunlabeled_k} \sum_{\nunlabeledsmall = 1}^{\nunlabeled} \gamma_{\nunlabeledsmall k} (\vx_\nunlabeledsmall - \vmu_k)(\vx_\nunlabeledsmall - \vmu_k)^T
\label{eqn:sigmak-soln}
\end{align}
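Under the same naming assumptions, a sketch of the covariance update in Equation \eqref{eqn:sigmak-soln} is
\begin{verbatim}
import numpy as np

def update_covariances(X, gamma, U_k, mus):
    K, D = mus.shape
    Sigmas = np.zeros((K, D, D))
    for k in range(K):
        centered = X - mus[k]             # (x_u - mu_k) for every u
        # Sigma_k = (1 / U_k) * sum_u gamma_{uk} (x_u - mu_k)(x_u - mu_k)^T
        Sigmas[k] = (gamma[:, k, None] * centered).T @ centered / U_k[k]
    return Sigmas
\end{verbatim}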
For \( P(k) \), we must additionally enforce the constraint \( \sum_{k=1}^K P(k) = 1 \), which can be done with a Lagrange multiplier; maximizing the resulting objective yields
\begin{align}
P(k) = \frac{\nunlabeled_k}{\nunlabeled}
\label{eqn:pk-soln}
\end{align}
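Finally, a sketch of the mixing-weight update in Equation \eqref{eqn:pk-soln} can use the fact that \( \sum_{k=1}^K \nunlabeled_k = \nunlabeled \), since the responsibilities of each example sum to one.
\begin{verbatim}
def update_priors(U_k):
    # P(k) = U_k / U, where U = sum_k U_k is the total number of examples.
    return U_k / U_k.sum()
\end{verbatim}
Given a data matrix and the responsibilities from the first sketch, these updates recompute all of the parameters \( \set{P(k),\vmu_k,\mSigma_k}_{k=1}^K \).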