The DAIR Program is no longer accepting applications for cloud resources, but access to BoosterPacks and their resources remains available. BoosterPacks will be maintained and supported until January 17, 2025. 

After January 17, 2025: 

  • Screenshots should remain accurate; however, where you are instructed to log in to your DAIR account in AWS, you will need to log in to a personal AWS account instead. 
  • Links to AWS CloudFormation scripts for the automated deployment of each sample application should remain intact and functional. 
  • Links to GitHub repositories for downloading BoosterPack source code will remain valid as they are owned and sustained by the BoosterPack Builder (original creators of the open-source sample applications). 

Flight Plan: Apption Data Assessment Tool

This BoosterPack was created and authored by: Apption

DAIR BoosterPacks are free, curated packages of cloud-based tools and resources about a specific emerging technology, built by experienced Canadian businesses who have built products or services using that technology and are willing to share their expertise.

Sample Solution Overview

This package introduces a user-friendly solution for:

  1. Analyzing unstructured data, identifying data types, and providing storage recommendations
  2. Identifying sensitive data such as first and last names
  3. Converting data from unstructured sources onto cloud (or other) SQL Server databases in a few guided steps
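To make the first step concrete, here is a minimal sketch of how a column's data type might be inferred from raw string values and mapped to a SQL Server storage recommendation. It is not the tool's actual implementation: the class name `ColumnTypeSketch`, the method `SuggestSqlType`, the order of the checks, and the suggested sizes (such as `DECIMAL(18,4)`) are all assumptions chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

// Hypothetical helper, not part of the Apption code base.
public static class ColumnTypeSketch
{
    // Suggests a SQL Server type for one column, given its raw string values.
    public static string SuggestSqlType(IEnumerable<string> values)
    {
        // Empty cells carry no type information, so they are ignored.
        var cells = values
            .Where(v => !string.IsNullOrWhiteSpace(v))
            .Select(v => v.Trim())
            .ToList();

        // Arbitrary fallback for a column with no data at all.
        if (cells.Count == 0) return "NVARCHAR(1)";

        // Try the narrowest type first and widen only when a value fails to parse.
        if (cells.All(c => int.TryParse(c, NumberStyles.Integer, CultureInfo.InvariantCulture, out _)))
            return "INT";
        if (cells.All(c => decimal.TryParse(c, NumberStyles.Number, CultureInfo.InvariantCulture, out _)))
            return "DECIMAL(18,4)"; // precision and scale are assumed values
        if (cells.All(c => DateTime.TryParse(c, CultureInfo.InvariantCulture, DateTimeStyles.None, out _)))
            return "DATETIME2";

        // Otherwise recommend text sized to the longest observed value.
        return $"NVARCHAR({cells.Max(c => c.Length)})";
    }
}
```

For example, `ColumnTypeSketch.SuggestSqlType(new[] { "1", "42", "" })` returns `"INT"`, while a column containing `"Alice"` and `"Bob"` falls through to `"NVARCHAR(5)"`.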

The Apption Data Assessment Tool is built on the .NET Core platform and can be launched in the CANARIE DAIR Cloud or executed in Electron.NET (embedded browser).

Please see the Sample Solution page for more information on the Solution including how to deploy the sample application.

The Solution showcases the following technologies: Docker, ASP.NET Core, Blazor, Electron.NET.

Objectives

Key Features

Machine learning and analytics in complex systems frequently require the addition of external data sets to generate new insights. These data sets are often unstructured, with a large number of columns, and sensitive data might be hidden in poorly described columns.

Today, to integrate these unstructured files, a data engineer needs many tools and significant effort to understand the data, perform QA, and load the data into a central repository. These tools are expensive and feature-rich; they include data transformation and analysis, but often with a narrow focus.

Also, if the files contain sensitive information, the environment might require specific security considerations. In Canada, PIPEDA (the Personal Information Protection and Electronic Documents Act) requires corporations to put safeguards around the handling of any personal information.

Existing ETL tools require significant effort to create packages, even for simple files, and end up being a bottleneck in any data exploration or data science project. This solution provides a simple four-step workflow covering the most common tasks.

Technical Benefits

In addition to the application features, this solution can be used as a template to integrate with the following technologies:

  • .NET Core 3 on Linux
  • Docker deployment with .NET Core
  • Blazor web pages for building rich interactive UIs using C# instead of JavaScript
  • Electron.NET for packaging web pages as a standalone application
  • Visual Studio Solution with common code for Docker and Electron packages

Scalable & Portable Design

The code base is designed for portability across multiple operating systems (Linux, Windows, macOS) and hosts (Docker, Electron). The underlying architecture follows patterns that enable the efficient handling of large data sets.

The API is extensible and other analyzers can be added to identify new data types.
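As an illustration of what such an extension point could look like, the sketch below defines a hypothetical analyzer interface and registers one analyzer for a new data type. The names `IColumnAnalyzer`, `PostalCodeAnalyzer`, and `AnalyzerRegistry` are invented for this example and are not taken from the solution's API; the postal code pattern is also deliberately simplified.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical extension point: one analyzer per data type to recognize.
public interface IColumnAnalyzer
{
    string TypeName { get; }
    bool Matches(string value);
}

// Example analyzer for Canadian postal codes such as "K1A 0B1".
// Simplified: it does not exclude the letters that never appear in real codes.
public sealed class PostalCodeAnalyzer : IColumnAnalyzer
{
    private static readonly Regex Pattern =
        new Regex(@"^[A-Za-z]\d[A-Za-z][ -]?\d[A-Za-z]\d$", RegexOptions.Compiled);

    public string TypeName => "CanadianPostalCode";

    public bool Matches(string value) => Pattern.IsMatch(value.Trim());
}

// Holds the registered analyzers and applies them to a column.
public sealed class AnalyzerRegistry
{
    private readonly List<IColumnAnalyzer> _analyzers = new List<IColumnAnalyzer>();

    public void Register(IColumnAnalyzer analyzer) => _analyzers.Add(analyzer);

    // Returns the type name of the first analyzer that accepts every
    // non-empty value in the column, or "Unknown" if none does.
    public string Identify(IEnumerable<string> values)
    {
        var cells = values.Where(v => !string.IsNullOrWhiteSpace(v)).ToList();
        if (cells.Count == 0) return "Unknown";

        var match = _analyzers.FirstOrDefault(a => cells.All(a.Matches));
        return match?.TypeName ?? "Unknown";
    }
}
```

With this shape, supporting a new data type means adding one class and one `Register` call; the code that walks the columns does not change.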

Application Workflow

The diagram below illustrates the structure of the solution.

System Architecture

Resources

Reference information about the underlying technologies used in creating the solution can be found here: .NET Core, Blazor, Electron, and Docker.

Tutorials

The table below provides a non-comprehensive list of links to tutorials the author has found to be most useful.

Tutorial Content | Summary
ASP.NET Core | ASP.NET Core is a cross-platform, high-performance, open-source framework for building modern, cloud-based, Internet-connected applications.
Blazor | Blazor is a project that uses WebAssembly (https://webassembly.org) to allow client-side development in C# instead of the usual JavaScript frameworks (React, Angular, etc.).
Electron.NET | Electron.NET (built using Electron, https://electronjs.org) is a tool that allows users to host .NET apps across different platforms.
Docker | Docker technology runs applications (packaged as Docker images) in containers on a Docker Engine. Containers are isolated from one another but, unlike virtual machines, share the kernel of the host operating system.

Tips and Traps

  • Working with Blazor: This new technology has significantly simplified web development by allowing you to write all the front-end logic in C#. However, the lifecycle/rendering of the components needs to be understood for complex user interactions.
    • An example of Blazor C# can be found in the WebAppMaterialize project, in the folder components, and the subfolder pages. Any file ending in .razor will contain client-side C#.
  • .NET Core: It is important to understand Inversion of Control (IoC) and Dependency Injection in order to architect the application properly and design the services.
    • In UploadController.cs in the WebAppMaterialize project, the constructor demonstrates an example of constructor injection, a type of dependency injection.
  • Large File Upload: Code on both the client side and the server side was required to implement XHR file upload (the technology splits the file into chunks instead of sending one large upload). jQuery was used on the web client interface and a custom controller was developed for the server side.
    • The upload functionality can be found in UploadController.cs in the WebAppMaterialize project, in the Controllers folder.
  • Multi-threading: Reactive design with Rx.NET was used to streamline the processing pipeline across multiple threads. Configuring the scheduler threading was key to separating event processing from the UI feedback updates.
    • Multi-threading examples can be found in the StreamReadFileAsync function which is written in FileAnalyzer.cs in the DataTools project.
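The constructor injection and chunked upload tips above can be sketched together in a few lines. This is an illustrative outline only, not the code in UploadController.cs: the types `IChunkStore`, `InMemoryChunkStore`, and `UploadHandler` are hypothetical, and the handler is a plain class rather than an ASP.NET Core controller so that the example stays self-contained.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical service abstraction; the real controller's dependencies may differ.
public interface IChunkStore
{
    void Append(string uploadId, byte[] chunk);
    long Length(string uploadId);
}

// Simple in-memory implementation, suitable only for demonstration.
public sealed class InMemoryChunkStore : IChunkStore
{
    private readonly Dictionary<string, List<byte>> _uploads =
        new Dictionary<string, List<byte>>();

    public void Append(string uploadId, byte[] chunk)
    {
        // Create the buffer the first time an upload id is seen.
        if (!_uploads.TryGetValue(uploadId, out var bytes))
        {
            bytes = new List<byte>();
            _uploads[uploadId] = bytes;
        }
        bytes.AddRange(chunk);
    }

    public long Length(string uploadId) =>
        _uploads.TryGetValue(uploadId, out var bytes) ? bytes.Count : 0;
}

public sealed class UploadHandler
{
    private readonly IChunkStore _store;

    // Constructor injection: the dependency is supplied by the caller
    // (or by the IoC container) instead of being created inside the class.
    public UploadHandler(IChunkStore store)
    {
        _store = store ?? throw new ArgumentNullException(nameof(store));
    }

    // Accepts one chunk of a larger file and returns the bytes received so far.
    public long ReceiveChunk(string uploadId, byte[] chunk)
    {
        _store.Append(uploadId, chunk);
        return _store.Length(uploadId);
    }
}
```

Because `UploadHandler` depends only on the interface, a test can pass in `new InMemoryChunkStore()` directly, while an ASP.NET Core application would typically register an implementation with the built-in container (for example `services.AddSingleton<IChunkStore, InMemoryChunkStore>()`) and let the framework supply it to the constructor.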