An IP Address Data Type for MongoDB to Support Efficient Handling of Network Statistics for Big Data Applications in Machine Learning/Artificial Intelligence

Development of an IP address data type for MongoDB, in order to support efficient handling of network statistics for Big Data Applications in Machine Learning/Artificial Intelligence
Master

MongoDB is a NoSQL database for Big Data applications, which is also used to provide the input data for Machine Learning and Artificial Intelligence systems. Unlike SQL databases, where it is possible to define table structures with advanced data types, stored procedures, etc., NoSQL databases just store key-value data with a limited set of data types (like numeric, string, binary). This makes the handling of large databases (e.g. for Big Data applications in Machine Learning/Artificial Intelligence use cases) simple and efficient. However, there is no data type for IP addresses (IPv4 and IPv6) in MongoDB. Simula@OsloMet is collecting large amounts of statistics (many terabytes per year) of network measurements. Storing the IP addresses from these measurements in MongoDB therefore means encoding them into binary form and storing them as binary data.
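As an illustration of what "encoding into binary form" can look like, here is a minimal Python sketch using only the standard library. The helper names `encode_ip` and `decode_ip` are hypothetical, and the choice to store every address as 16 bytes (with IPv4 represented as an IPv4-mapped IPv6 address) is an assumption of this sketch, not the encoding actually used at Simula@OsloMet.

```python
import ipaddress


def encode_ip(text: str) -> bytes:
    """Encode an IPv4 or IPv6 address as 16 bytes in network byte order.

    Assumption of this sketch: IPv4 addresses are stored as
    IPv4-mapped IPv6 addresses (::ffff:a.b.c.d), so that all
    stored values have the same length.
    """
    addr = ipaddress.ip_address(text)
    if addr.version == 4:
        # Place the 32-bit IPv4 value behind the ::ffff:0:0/96 prefix.
        addr = ipaddress.IPv6Address((0xFFFF << 32) | int(addr))
    return addr.packed


def decode_ip(blob: bytes) -> str:
    """Turn the 16-byte form back into a textual address."""
    addr = ipaddress.IPv6Address(blob)
    mapped = addr.ipv4_mapped  # None unless the address is IPv4-mapped
    return str(mapped if mapped is not None else addr)


if __name__ == "__main__":
    for text in ("128.39.37.130", "2001:700:4100::1"):
        blob = encode_ip(text)
        print(text, "->", blob.hex(), "->", decode_ip(blob))
```

The resulting `bytes` value is what would be written to the database as binary data. The database itself only sees an opaque byte string, which is why it cannot offer address-aware operations on it.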

For effective filtering of IP addresses and networks, the lack of support for an IP address data type in MongoDB is a significant performance issue. For example, it is not possible to query for certain ranges of IP addresses (like all addresses belonging to a certain network). This is, however, a frequent task for the Machine Learning/Artificial Intelligence research on our Big Data collections, in order to identify interesting network data.

The goal of this master's thesis is therefore to extend MongoDB with an IP address data type to handle IP addresses efficiently, and to allow for queries matching certain ranges of IP addresses (e.g. all addresses belonging to a certain prefix like 128.39.37.128/26 or 2001:700:4100::/48). As a starting point, there is an old patch (jira.mongodb.org/browse/SERVER-2413; from 2011), which has never been progressed or followed up. Your task would be to analyse this patch, provide a patch for the latest version of MongoDB, develop some example material (e.g. insertion and query examples), make some performance comparisons (e.g. your patch vs. storing IP addresses as binaries), and finally provide your patch as a "pull request" to the upstream MongoDB developers. This will bring you into contact with state-of-the-art Open Source software development, make your work visible in the industry and research communities, and finally provide a building block for the Big Data collections used in our researchers' work on Machine Learning/Artificial Intelligence in large-scale network installations!
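To make the idea of a prefix query concrete: a prefix such as 128.39.37.128/26 corresponds to a contiguous range of addresses, from the network address to the last address in the network. The following Python sketch computes these two bounds and tests membership by comparing the packed bytes. The function names `prefix_bounds` and `in_prefix` are hypothetical, and the sketch only illustrates the semantics an IP address data type would need to provide; it says nothing about how the patch in SERVER-2413 or MongoDB itself implements comparisons.

```python
import ipaddress


def prefix_bounds(prefix: str) -> tuple[bytes, bytes]:
    """Return the first and last address of a prefix as packed bytes."""
    net = ipaddress.ip_network(prefix, strict=True)
    # For IPv6 networks, broadcast_address is simply the last address.
    return net.network_address.packed, net.broadcast_address.packed


def in_prefix(address: str, prefix: str) -> bool:
    """Check whether an address belongs to a prefix via byte comparison."""
    addr = ipaddress.ip_address(address)
    net = ipaddress.ip_network(prefix, strict=True)
    if addr.version != net.version:
        return False  # never match IPv4 against IPv6 or vice versa
    low, high = prefix_bounds(prefix)
    # Packed addresses are big-endian, so byte order equals numeric order.
    return low <= addr.packed <= high


if __name__ == "__main__":
    print(in_prefix("128.39.37.130", "128.39.37.128/26"))
    print(in_prefix("128.39.37.192", "128.39.37.128/26"))
    print(in_prefix("2001:700:4100:2::1", "2001:700:4100::/48"))
```

A native data type could evaluate such a range check inside the database, and potentially with index support, instead of requiring the client to fetch and decode binary values first. Quantifying that difference is part of the performance comparison described above.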

This project is your chance to get involved in NoSQL database development, MongoDB, Open Source software development, as well as in international, top-level research at Simula@OsloMet in Oslo! Are you interested in this challenge? Do not hesitate to contact us!

Goal

  • Develop an extension to MongoDB
  • Develop an extension for querying MongoDB with IP address filtering
  • Performance analysis

Learning outcome

  • You will learn about NoSQL databases and storage of data.
  • You will learn about NoSQL database (MongoDB) programming.
  • You will learn about system performance and efficiency.
  • You will get involved in state-of-the-art Open Source software development.

Qualifications

  • Some knowledge about computer networks.
  • You should have some programming experience in C and/or C++.
  • Experience with databases (SQL and/or NoSQL) is a plus.
  • Experience with scripting languages (e.g. Python, GNU R) is a plus.

Supervisors

  • Thomas Dreibholz
  • Ahmed Elmokashfi
