What is RAWFIE?
RAWFIE (Road-, Air-, and Water-based Future Internet Experimentation) is a project funded by the European Commission (Horizon 2020 programme) under the Future Internet Research and Experimentation (FIRE+) initiative that aims to provide research facilities for Internet of Things (IoT) devices. The DMML team was involved in the RAWFIE project from early 2015 until the project ended, delivering its final contributions in June 2019. The team members invested in the project, Jason and Lionel, were responsible for the design and development of the platform's data analysis tools, enabling experimenters to apply machine learning algorithms to data collected via the platform's unmanned devices.
The project introduces a unique platform by integrating numerous test beds of unmanned vehicles for research experimentation in vehicular, aerial and maritime environments. The platform supports experimenters with smart tools to conduct and monitor experiments in the domains of IoT, networking, sensing and satellite navigation. The project brings together thirteen partner organizations from eight EU countries. Open calls have attracted researchers from academia and industry, test bed operators and unmanned vehicles manufacturers.
As mentioned above, Jason and Lionel were tasked with the conception and implementation of the data analysis components of the platform, which allow experimenters to run various analytics on the data gathered in the test bed they are operating in.
Since the analytics are carried out on real-world data collected by the experimenter with unmanned vehicles, ensuring reliable and secure data persistence and storage is a primary challenge of the data analysis pipeline. As part of a platform that provides a service for users to gather data with expensive hardware and conduct experiments on it, the pipeline was designed for high fault tolerance, fast read/write speeds, and data redundancy. These features were obtained by adopting Apache Kafka, a distributed streaming platform, as RAWFIE's persistent streaming storage and message bus solution.

The operated unmanned vehicles (UXVs) publish streams of records to the central message queue as they interact with their respective surroundings, with each type of sensor publishing to a dedicated topic (e.g. fuel consumption, battery level, on-board processing unit temperature, current position, estimated depth). As UXVs publish data streams to the message bus, the developed analytics component can subscribe to and consume these streams and pipe them through various streaming data processing and streaming machine learning algorithms. Since the message bus stores streams of records in a fault-tolerant, persistent and durable fashion, batch machine learning tasks are also part of the feature suite provided by the platform.

On top of that, to lighten the load on the streaming message bus and further strengthen data redundancy, the collected data is duplicated to an HDFS (Hadoop Distributed File System) storage solution, capable of reliably storing large amounts of data. The data analysis component can read streaming data from Kafka, read batch data from HDFS, and write analytics results to either HDFS or a Grafana dashboard backed by a Whisper database. Interfaces are also provided to browse the contents of the message bus and the HDFS storage.
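The topic-per-sensor publish/subscribe model described above can be sketched with a tiny in-memory stand-in for the message bus. This is illustrative only: the actual platform uses Apache Kafka, and the topic names and record fields below are hypothetical.

```python
from collections import defaultdict


class MessageBus:
    """In-memory stand-in for a Kafka-style message bus: one append-only
    log per topic, with consumers tracking their own read offset."""

    def __init__(self):
        self.topics = defaultdict(list)  # topic name -> append-only log of records

    def publish(self, topic, record):
        self.topics[topic].append(record)

    def consume(self, topic, offset=0):
        # Return all records from `offset` onward. The log itself is retained,
        # mirroring Kafka's persistent, replayable storage model that also
        # makes batch processing of past streams possible.
        return self.topics[topic][offset:]


# Hypothetical sensor topics and records, one topic per sensor type.
bus = MessageBus()
bus.publish("uxv.battery_level", {"vehicle": "usv-01", "pct": 87})
bus.publish("uxv.position", {"vehicle": "uav-02", "lat": 37.97, "lon": 23.73})
bus.publish("uxv.battery_level", {"vehicle": "usv-01", "pct": 86})

# A consumer subscribed to a single sensor topic sees only that stream.
battery = bus.consume("uxv.battery_level")
```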
The data analysis component is composed of two entities: the data analysis engine and the data analysis tool. The data analysis engine relies on Apache Spark, a scalable, high-performance distributed compute engine for both batch and streaming data. The data analysis tool acts as a front-end for the engine and adopts a notebook interface built on the Apache Zeppelin project. The notebook interface allows the service user either to design algorithms from scratch within a notebook or to orchestrate the various algorithms and data processing subroutines embedded in the tool to favor fast prototyping. Once the user has designed the analytics to run on the data retrieved from the various fault-tolerant storage solutions (refer to the previous section for more details), the task is submitted to the data analysis engine for execution. The data sources and destinations are specified in the notebook using user-friendly subroutines from the provided toolbox, which also gives access to a suite of machine learning algorithms. Via the data analysis tool, the experimenter can conduct both real-time online learning tasks on streaming data published by UXVs on the message bus and batch learning tasks on non-real-time records, collected during a past experiment, that have been duplicated to HDFS. The availability of data browsers for every source, combined with the notebook interface to read, process and write data, contributes to the ease of use and flexibility of the developed data analysis solution. Since the notebook interface supports tasks written in various programming languages, with Python and Scala arguably the best suited, the user can draw on each language's common data analytics libraries (e.g. Apache Spark's MLlib, available for Scala and Python, or NumPy and SciPy for Python).
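The source-to-analytics-to-sink workflow that a notebook task follows can be illustrated with a minimal sketch. All names here (`run_pipeline`, the toy source and sink) are hypothetical and merely stand in for the actual toolbox subroutines mentioned above:

```python
def run_pipeline(source, transforms, sink):
    """Pull records from `source`, apply each transform in order,
    and hand the final result to `sink` — the same shape as a
    notebook task wiring a data source to an analytics sink."""
    data = source()
    for transform in transforms:
        data = transform(data)
    return sink(data)


# Toy batch source: fuel-consumption readings from a past experiment
# (in the real platform this would be read from HDFS or Kafka).
readings = [4.2, 4.5, 4.1, 4.8, 4.4]
source = lambda: readings

# Two illustrative processing steps.
drop_outliers = lambda xs: [x for x in xs if x < 4.7]
mean = lambda xs: sum(xs) / len(xs)

collected = []
def sink(x):
    collected.append(x)  # in the real pipeline: write to HDFS or a dashboard
    return x

result = run_pipeline(source, [drop_outliers, mean], sink)
```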
For more advanced analytics, the user can also use the TensorFlow and PyTorch libraries to train powerful neural network models and benefit from their automatic differentiation capabilities. Among others, the algorithmic toolbox bundled with the data analysis component contains streaming k-means, streaming linear regression, streaming logistic regression, principal component analysis, Gaussian mixture models, collaborative filtering, decision trees, naive Bayes, support vector machines and random forests.
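To give a flavor of the streaming algorithms listed above, here is a minimal pure-Python sketch of online k-means — not the Spark MLlib implementation the platform actually uses — where each arriving record nudges its nearest centroid by a count-decayed step:

```python
class StreamingKMeans:
    """Online k-means: each arriving point moves its nearest centroid
    toward it with a learning rate that decays as 1/count."""

    def __init__(self, centroids):
        self.centroids = [list(c) for c in centroids]
        self.counts = [0] * len(centroids)

    def _nearest(self, point):
        # Index of the centroid with the smallest squared distance to `point`.
        return min(
            range(len(self.centroids)),
            key=lambda i: sum((p - c) ** 2 for p, c in zip(point, self.centroids[i])),
        )

    def update(self, point):
        i = self._nearest(point)
        self.counts[i] += 1
        eta = 1.0 / self.counts[i]  # decaying step size
        self.centroids[i] = [c + eta * (p - c) for c, p in zip(self.centroids[i], point)]
        return i


# Toy stream of 2D sensor readings forming two clusters.
model = StreamingKMeans([[0.0, 0.0], [10.0, 10.0]])
stream = [(0.2, 0.1), (10.2, 9.8), (0.0, 0.3), (9.9, 10.1), (0.1, 0.0), (10.0, 10.2)]
for point in stream:
    model.update(point)
```

In the platform itself, the same pattern is applied by subscribing the Spark job to a Kafka topic and updating the model per arriving micro-batch rather than per individual record.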
Results at a glance:
- Conception and implementation of the data analysis components of the platform.
- Creation of the data analysis component, which is composed of two entities: the data analysis engine, and the data analysis tool.
- Creation of a pipeline ensuring reliable and secure data persistence and storage. The pipeline was designed to have high fault-tolerance, fast read/write speeds, and data redundancy mechanisms.