Authors: C. M. Rosenberg and L. Moonen
Title: Improving Problem Identification via Automated Log Clustering using Dimensionality Reduction
Affiliation: Software Engineering
Project(s): The Certus Centre (SFI)
Publication Type: Proceedings, refereed
Year of Publication: 2018
Conference Name: 12th International Symposium on Empirical Software Engineering and Measurement (ESEM 2018)

Background: Continuous engineering practices, such as continuous
integration and continuous deployment, see increased adoption in
modern software development.  A frequently reported challenge for
adopting these practices is the need to make sense of the large
amounts of data that they generate.

Goal: We consider the problem of automatically grouping logs of runs
that failed for the same underlying reasons, so that they can be
treated more effectively, and investigate the following questions:
(1) Does an approach developed to identify problems in system logs
generalize to identifying problems in continuous deployment logs?
(2) How does dimensionality reduction affect the quality of automated
log clustering?  (3) How does the criterion used for merging clusters
in the clustering algorithm affect clustering quality?

Method: We replicate and extend earlier work on clustering system
log files to assess its generalization to continuous deployment
logs.  We optionally include one of three dimensionality reduction
techniques: Principal Component Analysis (PCA), Latent Semantic
Indexing (LSI), or Non-negative Matrix Factorization (NMF).
Moreover, we consider three alternative cluster merge
criteria (Single Linkage, Average Linkage, and Weighted Linkage),
in addition to the Complete Linkage criterion used in earlier work.
We empirically evaluate the 16 resulting configurations on continuous
deployment logs provided by our industrial collaborator.
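
To make the configuration space concrete, the sketch below enumerates the 16
combinations (no reduction, PCA, LSI, or NMF, crossed with the four merge
criteria) and shows how the merge criteria differ when two clusters are
joined. It is a minimal illustration, not the pipeline used in the study: the
function names, the distance-threshold stopping rule, and the input distance
matrix are assumptions made for the example, and the dimensionality reduction
step itself is not implemented here.

```python
from itertools import product

# The four reduction settings and four merge criteria named in the abstract.
REDUCTIONS = [None, "PCA", "LSI", "NMF"]
LINKAGES = ["single", "complete", "average", "weighted"]

# Every (reduction, linkage) pair: 4 x 4 = 16 configurations.
CONFIGURATIONS = list(product(REDUCTIONS, LINKAGES))


def merged_distance(linkage, d_ik, d_jk, size_i, size_j):
    """Distance from the merged cluster (i + j) to another cluster k."""
    if linkage == "single":    # closest members decide
        return min(d_ik, d_jk)
    if linkage == "complete":  # farthest members decide
        return max(d_ik, d_jk)
    if linkage == "average":   # mean weighted by cluster size (UPGMA)
        return (size_i * d_ik + size_j * d_jk) / (size_i + size_j)
    if linkage == "weighted":  # plain mean, ignoring cluster size (WPGMA)
        return (d_ik + d_jk) / 2
    raise ValueError(f"unknown linkage: {linkage}")


def agglomerate(dist, linkage, threshold):
    """Merge the closest clusters until the closest pair exceeds threshold.

    dist is a symmetric list-of-lists matrix of pairwise distances
    (hypothetical input; the stopping rule is an assumption of this sketch).
    Returns a sorted list of clusters, each a sorted list of item indices.
    """
    clusters = {i: [i] for i in range(len(dist))}
    d = {(i, j): dist[i][j] for i in clusters for j in clusters if i < j}
    while len(clusters) > 1:
        (i, j), best = min(d.items(), key=lambda kv: kv[1])
        if best > threshold:
            break
        size_i, size_j = len(clusters[i]), len(clusters[j])
        # Update distances from the merged cluster (kept under index i).
        for k in clusters:
            if k in (i, j):
                continue
            d_ik = d.pop((min(i, k), max(i, k)))
            d_jk = d.pop((min(j, k), max(j, k)))
            d[(min(i, k), max(i, k))] = merged_distance(
                linkage, d_ik, d_jk, size_i, size_j
            )
        del d[(i, j)]
        clusters[i] = sorted(clusters[i] + clusters.pop(j))
    return sorted(clusters.values())
```

With the same input and threshold, Single Linkage tends to chain nearby items
into larger clusters, whereas Complete Linkage stops merging earlier; this is
one reason the choice of merge criterion can affect clustering quality.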

Results: Our study shows that (1) identifying problems in continuous
deployment logs via clustering is feasible, (2) including NMF
significantly improves overall accuracy and robustness, and (3)
Complete Linkage performs best of all merge criteria analyzed.

Conclusions: We conclude that problem identification via automated
log clustering is improved by including dimensionality reduction,
as it decreases the pipeline's sensitivity to parameter choice,
thereby increasing its robustness for handling different inputs.

Citation Key: rosenberg:2018:improving