Clustering Algorithms for Long-term Twitter Observation

How can we divide massive data streams into meaningful classes without accessing the full dataset? In this thesis, we will explore clustering algorithms that can process data as it is scraped from social networks.
Master

Clustering is a fundamental concept in unsupervised learning. However, most clustering techniques are offline algorithms, which means that all data must be present from the start. Today, many problems require online algorithms that can cluster data while it is being produced. In this thesis, we will implement one such algorithm, capable of clustering massive datasets.
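
To make the offline/online distinction concrete, the following is a minimal sketch of sequential (online) k-means, in which each centroid is the running mean of the points assigned to it so far. It is an illustration only and not the algorithm to be developed in the thesis. It assumes that stream items have already been turned into fixed-length numeric vectors (how tweets would be mapped to such vectors is left open here), and the class name OnlineKMeans and its interface are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <utility>
#include <vector>

// Hypothetical illustration class (not from any library): sequential k-means.
// Points are processed one at a time and never stored.
class OnlineKMeans {
public:
    using Point = std::vector<double>;

    // Assumption of this sketch: the caller supplies k initial centroid guesses.
    explicit OnlineKMeans(std::vector<Point> seeds)
        : centroids_(std::move(seeds)), counts_(centroids_.size(), 0) {
        assert(!centroids_.empty());
    }

    // Consume one point from the stream and return the index of its cluster.
    std::size_t update(const Point& x) {
        // Find the nearest centroid by squared Euclidean distance.
        std::size_t best = 0;
        double best_dist = std::numeric_limits<double>::infinity();
        for (std::size_t c = 0; c < centroids_.size(); ++c) {
            assert(x.size() == centroids_[c].size());
            double d = 0.0;
            for (std::size_t i = 0; i < x.size(); ++i) {
                const double diff = x[i] - centroids_[c][i];
                d += diff * diff;
            }
            if (d < best_dist) {
                best_dist = d;
                best = c;
            }
        }

        // Move the winning centroid toward x by 1/n, where n counts the points
        // assigned to it so far. The first assigned point therefore replaces
        // the seed, and afterwards the centroid is the running mean.
        ++counts_[best];
        const double rate = 1.0 / static_cast<double>(counts_[best]);
        for (std::size_t i = 0; i < x.size(); ++i) {
            centroids_[best][i] += rate * (x[i] - centroids_[best][i]);
        }
        return best;
    }

    const std::vector<Point>& centroids() const { return centroids_; }

private:
    std::vector<Point> centroids_;
    std::vector<std::size_t> counts_;
};
```

Each call to update touches only the k centroids, so memory use is independent of the stream length. This is the property that separates an online method from an offline one, which needs the whole dataset up front.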

Goal

The goal of this thesis is to implement an online clustering algorithm for massive datasets on a supercomputer, and test it with a continuous stream of Twitter data.

Learning outcome

  • Implementation of large-scale parallel algorithms
  • Use of supercomputers
  • Online clustering algorithms

Qualifications

  • Knowledge of C/C++
  • Familiarity with parallel graph algorithms
  • Experience with MPI and/or OpenMP

Supervisors

  • Johannes Langguth
  • Xing Cai

Collaboration partners

Indiana University Bloomington

Contact person