Articles | Journal of Emerging Investigators

Utilizing meteorological data and machine learning to predict and reduce the spread of California wildfires

Bilwar et al. | Jan 15, 2024

This study hypothesized that a machine learning model could accurately predict the severity of California wildfires and determine the most influential meteorological factors. It utilized a custom dataset with information from the World Weather Online API and a Kaggle dataset of wildfires in California from 2013-2020. The developed algorithms classified fires into seven categories with promising accuracy (around 55 percent). They found that higher temperatures, lower humidity, lower dew point, higher wind gusts, and higher wind speeds are the most significant contributors to the spread of a wildfire. This tool could vastly improve the efficiency and preparedness of firefighters as they deal with wildfires.

Transfer learning and data augmentation in osteosarcoma cancer detection

Chu et al. | Jun 03, 2023

Osteosarcoma is a type of bone cancer that affects young adults and children. Early diagnosis of osteosarcoma is crucial to successful treatment. The current methods of diagnosis, which include imaging tests and biopsy, are time consuming and prone to human error. Hence, we used deep learning to extract patterns and detect osteosarcoma from histological images. We hypothesized that the combination of two different technologies (transfer learning and data augmentation) would improve the efficacy of osteosarcoma detection in histological images. The dataset used for the study consisted of histological images for osteosarcoma and was quite imbalanced as it contained very few images with tumors. Since transfer learning uses existing knowledge for the purpose of classification and detection, we hypothesized it would be proficient on such an imbalanced dataset. To further improve our learning, we used data augmentation to include variations in the dataset. We further evaluated the efficacy of different convolutional neural network models on this task. We obtained an accuracy of 91.18% using the transfer learning model MobileNetV2 as the base model with various geometric transformations, outperforming the state-of-the-art convolutional neural network based approach.

Comparing model-centric and data-centric approaches to determine the efficiency of data-centric AI

La et al. | Apr 20, 2023

In this study, three models are used to test the hypothesis that data-centric artificial intelligence (AI) will improve the performance of machine learning.

Similarity Graph-Based Semi-supervised Methods for Multiclass Data Classification

Balaji et al. | Sep 11, 2021

The purpose of the study was to determine whether graph-based machine learning techniques, which have become more prevalent in the last few years, can accurately classify data into one of many clusters while requiring less labeled training data and parameter tuning than traditional machine learning algorithms. The results showed that the accuracy of graph-based and traditional classification algorithms depends directly upon the number of features of each dataset, the number of classes in each dataset, and the amount of labeled training data used.

Evaluating need for adversarial training data given algorithmic defense methods against adversarial attacks

Yian et al. | Jul 05, 2026

The purpose of this study was to determine whether the earlier non-algorithmic defense method (Adversarial Training) is still necessary in light of algorithmic defense methods (Gradient Masking and Defensive Distillation) against FGSM attacks. We found that defense methods combined with the non-algorithmic defense method achieved significantly higher image classification accuracy than those without it. Using a McNemar test to assess significance, we determined that the inclusion of non-algorithmic defense methods is still necessary in light of new algorithmic defense methods.
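
As a rough illustration of the kind of McNemar comparison the abstract above mentions, the sketch below compares two classifiers evaluated on the same test images. It is not the authors' code: the function name `mcnemar_test` and the boolean input lists are hypothetical, and it assumes the continuity-corrected chi-squared form of the test with one degree of freedom.

```python
import math


def mcnemar_test(correct_a, correct_b):
    """Continuity-corrected McNemar test for two paired classifiers.

    correct_a, correct_b: hypothetical lists of booleans, one entry per
    test image, marking whether each classifier labeled it correctly.
    Returns (chi-squared statistic, two-sided p-value).
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both classifiers must be scored on the same images")

    # Only discordant pairs (one classifier right, the other wrong) matter.
    only_a = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    only_b = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)

    if only_a + only_b == 0:
        # No disagreements at all: no evidence of a difference.
        return 0.0, 1.0

    stat = (abs(only_a - only_b) - 1) ** 2 / (only_a + only_b)
    # Survival function of a chi-squared variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Made-up example: 30 images both get right, 15 only A, 3 only B.
with_adv = [True] * 30 + [True] * 15 + [False] * 3
without_adv = [True] * 30 + [False] * 15 + [True] * 3
stat, p = mcnemar_test(with_adv, without_adv)
print(f"chi2 = {stat:.3f}, p = {p:.4f}")
```

A small p-value here would suggest the two classifiers differ in accuracy on the shared test set. The counts in the example are invented purely to exercise the function.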

Evaluating the effectiveness of synthetic training data for day-ahead wind speed prediction in the Great Lakes

Wycoff et al. | Dec 21, 2025

The authors examined the feasibility of predicting wind speeds with less reliance on historical data.

Predicting smoking status based on RNA sequencing data

Yang et al. | Aug 30, 2024

Given an association between nicotine addiction and gene expression, we hypothesized that expression of genes commonly associated with smoking status would have variable expression between smokers and non-smokers. To test whether gene expression varies between smokers and non-smokers, we analyzed two publicly-available datasets that profiled RNA gene expression from brain (nucleus accumbens) and lung tissue taken from patients identified as smokers or non-smokers. We discovered statistically significant differences in expression of dozens of genes between smokers and non-smokers. To test whether gene expression can be used to predict whether a patient is a smoker or non-smoker, we used gene expression as the training data for a logistic regression or random forest classification model. The random forest classifier trained on lung tissue data showed the most robust results, with area under curve (AUC) values consistently between 0.82 and 0.93. Both models trained on nucleus accumbens data had poorer performance, with AUC values consistently between 0.65 and 0.7 when using random forest. These results suggest gene expression can be used to predict smoking status using traditional machine learning models. Additionally, based on our random forest model, we proposed KCNJ3 and TXLNGY as two candidate markers of smoking status. These findings, coupled with other genes identified in this study, present promising avenues for advancing applications related to the genetic foundation of smoking-related characteristics.

Effects of different synthetic training data on real test data for semantic segmentation

Zhang et al. | Jun 22, 2023

Semantic segmentation models, which label each pixel in an image with a specific class, require large amounts of manually labeled and collected data to train.

Heat conduction: Mathematical modeling and experimental data

Zhu et al. | Dec 02, 2021

In this experiment, the authors modify the heat equation to account for imperfect insulation during heat transfer and compare it to experimental data to determine which is more accurate.

LawCrypt: Secret Sharing for Attorney-Client Data in a Multi-Provider Cloud Architecture

Zhang et al. | Jul 19, 2020

In this study, the authors develop an architecture to implement in a cloud-based database used by law firms to ensure confidentiality, availability, and integrity of attorney documents while maintaining greater efficiency than traditional encryption algorithms. They assessed whether the architecture satisfies necessary criteria and tested the overall file sizes the architecture could process. The authors found that their system was able to handle larger file sizes and fit engineering criteria. This study presents a valuable new tool that can be used to ensure law firms have adequate security as they shift to using cloud-based storage systems for their files.
