Related Coursework: DTSA 5001 – Data Science Foundations: Statistical Inference
Developed a strong foundation in probability theory essential for data science. Covered key concepts including permutations, combinations, conditional probability, Bayes’ theorem, expectation, and variance.
What I Know
Fundamentals of probability theory and its application to data problems
Joint, marginal, and conditional probability
Permutations and combinations in probabilistic modeling
Project: Built a simulation-based model in Python to estimate probabilities of complex real-world scenarios (e.g., card draws, A/B test outcomes), verifying theoretical results with empirical outcomes.
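A minimal sketch of this kind of check, not the original project code: the scenario (at least one ace in a five-card hand) and the function names are illustrative assumptions. It compares an exact answer from combinations with a Monte Carlo estimate.

```python
import random
from math import comb


def exact_p_at_least_one_ace(hand_size=5):
    """Exact probability via combinations: 1 - C(48, k) / C(52, k)."""
    return 1 - comb(48, hand_size) / comb(52, hand_size)


def simulated_p_at_least_one_ace(hand_size=5, trials=20_000, seed=0):
    """Monte Carlo estimate: draw hands without replacement and count hits."""
    rng = random.Random(seed)
    deck = ["ace"] * 4 + ["other"] * 48
    hits = sum("ace" in rng.sample(deck, hand_size) for _ in range(trials))
    return hits / trials


exact = exact_p_at_least_one_ace()
estimate = simulated_p_at_least_one_ace()
print(f"exact={exact:.4f} simulated={estimate:.4f}")
```

With enough trials the simulated frequency should settle close to the exact value; the gap shrinks roughly with the square root of the number of trials.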
Related Coursework: DTSA 5002 – Statistical Inference for Estimation in Data Science
Focused on parameter estimation methods using real-world data.
What I Know
Maximum Likelihood Estimation (MLE) and method of moments
Bias, variance, and mean squared error as evaluation metrics
Confidence intervals for population parameters
Project: Estimated parameters of a normal distribution using MLE on financial market returns. Implemented the estimation process in R and visualized confidence intervals, comparing empirical and theoretical distributions.
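The project itself was done in R; the following is a small Python sketch of the same idea under stated assumptions. The return values are made up, the function names are hypothetical, and the interval uses a normal critical value of 1.96 rather than a t quantile.

```python
import math


def normal_mle(data):
    """MLE for a normal sample: the sample mean and the 1/n (not 1/(n-1)) std dev."""
    n = len(data)
    mu = sum(data) / n
    var = sum((x - mu) ** 2 for x in data) / n
    return mu, math.sqrt(var)


def mean_confidence_interval(data, z=1.96):
    """Approximate 95% interval for the mean using a normal critical value."""
    mu, sigma = normal_mle(data)
    half_width = z * sigma / math.sqrt(len(data))
    return mu - half_width, mu + half_width


returns = [0.012, -0.008, 0.005, 0.021, -0.015, 0.003]  # made-up daily returns
mu_hat, sigma_hat = normal_mle(returns)
low, high = mean_confidence_interval(returns)
print(f"mu={mu_hat:.4f} sigma={sigma_hat:.4f} CI=({low:.4f}, {high:.4f})")
```

The MLE of the variance divides by n, so it is slightly biased downward in small samples; that is the bias/variance trade-off listed above in concrete form.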
Related Coursework: DTSA 5003 – Statistical Inference and Hypothesis Testing in Data Science Applications
Concentrated on statistical hypothesis testing for data-driven decision-making.
What I Know
One-sided and two-sided hypothesis tests
Uniformly Most Powerful (UMP) tests
p-values, Type I and Type II errors, and power analysis
Project: Conducted a hypothesis test comparing average Reddit sentiment scores for two time periods to identify shifts in community opinion on specific stocks. Applied both classical and simulation-based techniques to validate findings.
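One common simulation-based technique for this kind of comparison is a two-sided permutation test on the difference in means. The sketch below is an assumption about what such a test could look like, with made-up sentiment scores and a hypothetical function name; it is not the project's actual code.

```python
import random


def permutation_test(a, b, n_perm=2000, seed=0):
    """Two-sided permutation p-value for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_perm):
        # Shuffle the pooled scores and split them into two pseudo-groups.
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:len(a)], pooled[len(a):]
        diff = abs(sum(perm_a) / len(perm_a) - sum(perm_b) / len(perm_b))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (extreme + 1) / (n_perm + 1)


period_1 = [0.10, 0.25, -0.05, 0.30, 0.15]    # made-up sentiment scores
period_2 = [-0.20, -0.10, 0.05, -0.30, -0.15]
print(permutation_test(period_1, period_2))
```

A small p-value suggests the observed shift would be unusual if period labels were interchangeable; it can be compared against a classical two-sample t-test on the same data.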
Related Coursework: DTSA 5504/CSCA 5502 – Data Mining Foundations and Practice
Explored the complete data mining lifecycle including data collection, cleaning, transformation, and storage in preparation for downstream analytics.
What I Know
End-to-end data pipelines for mining large datasets
Techniques for data preprocessing: handling missing values, normalization, transformation
Importance of scalability, data quality, and automation in pipeline design
Project: Created a data pipeline to ingest and process jiu-jitsu competition data from multiple sources (bjj.university, bjj.tips). Automated collection and transformation, then loaded data into BigQuery for analysis of technique frequency and athlete trends.
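Two of the preprocessing steps named above, handling missing values and normalization, can be sketched in a few lines. This is a generic illustration rather than the pipeline's code; mean imputation and min-max scaling are assumed choices, and the `match_minutes` field is hypothetical.

```python
def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("cannot impute a column with no observed values")
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


match_minutes = [4.0, None, 10.0, 6.0]  # hypothetical field with a gap
cleaned = min_max_scale(impute_mean(match_minutes))
print(cleaned)
```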
Related Coursework: DTSA 5505/CSCA 5512 – Data Mining Methods
Studied core data mining algorithms and techniques, focusing on supervised and unsupervised learning.
What I Know
Classification, regression, and clustering techniques
Model evaluation using confusion matrices, ROC curves, and cross-validation
Hands-on experience with algorithms such as KNN, decision trees, and K-means
Project: Built a classification model to predict Reddit sentiment (positive, neutral, negative) based on post content. Compared performance of logistic regression, decision trees, and random forests using scikit-learn.
Related Coursework: DTSA 5506/CSCA 5522 – Data Mining Project
Applied end-to-end data mining techniques on a self-directed project, synthesizing pipeline and modeling skills.
What I Know
Full-cycle data mining: data collection, preprocessing, modeling, evaluation
Feature engineering, hyperparameter tuning, and cross-validation
Communicating insights effectively through visualization and reporting
Project: Developed a time series forecasting model to predict short-term stock price movement using Reddit mention volume and sentiment. Used VADER for sentiment scoring, engineered lag features, and tested ARIMA and LSTM models to assess predictive performance.
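Lag features turn a series into supervised-learning rows by pairing each value with earlier values of the same series. A minimal sketch, assuming a plain list of daily values and a hypothetical helper name:

```python
def make_lag_features(series, lags=(1, 2)):
    """Return (features, target) rows where features are earlier values of the series."""
    max_lag = max(lags)
    rows = []
    for t in range(max_lag, len(series)):
        # For target series[t], the features are series[t - k] for each lag k.
        features = [series[t - k] for k in lags]
        rows.append((features, series[t]))
    return rows


mention_volume = [10, 11, 12, 13]  # made-up daily mention counts
for features, target in make_lag_features(mention_volume):
    print(features, "->", target)
```

The first `max(lags)` observations are dropped because they have no complete history, which also keeps future values from leaking into the features.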
Related Coursework: DTSA 5011 – Statistical Modeling for Data Science Applications
Focused on developing and interpreting linear regression models using R.
What I Know
Simple and multiple linear regression
Model assumptions, diagnostics, and residual analysis
Variable selection, multicollinearity, and interaction effects
Project: Analyzed factors influencing user engagement on a content website by fitting multiple linear regression models. Explored feature importance and multicollinearity, and visualized predictions using ggplot2 in R.
Related Coursework: DTSA 5012 – ANOVA and Experimental Design
Covered statistical methods for comparing group means and designing experiments to assess treatment effects.
What I Know
One-way and two-way ANOVA
Randomized design principles
F-tests, post-hoc comparisons, and interpreting interaction effects
Project: Designed and analyzed a simulated A/B/C test to evaluate content strategies. Used one-way ANOVA to detect significant differences in user behavior across groups and applied Tukey’s HSD for post-hoc analysis.
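The one-way ANOVA F statistic is the ratio of between-group to within-group mean squares. The sketch below computes it from scratch on made-up data; the function name is hypothetical, and the p-value lookup and Tukey's HSD step are left out.

```python
def one_way_anova_f(groups):
    """F statistic for one-way ANOVA: mean square between / mean square within."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Made-up engagement scores for strategies A, B, and C.
f_stat = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(f_stat)
```

For these three groups the between-group sum of squares is 26 on 2 degrees of freedom and the within-group sum of squares is 6 on 6, so F = 13.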
Related Coursework: DTSA 5013 – Generalized Linear Models and Nonparametric Regression
Expanded on classical linear models to handle non-normal data distributions and non-linear relationships.
What I Know
Generalized linear models (GLMs) for binary and count data
Link functions (logit, log, etc.) and distribution families
Intro to nonparametric regression (e.g., LOESS, splines)
Project: Built a logistic regression model (a binomial GLM with a logit link) to predict conversion on a marketing dataset from user attributes, and compared its performance with nonparametric smoothing techniques.
Related Coursework: DTSA 5509/CSCA 5622 – Introduction to Machine Learning
Introduced core concepts and algorithms for supervised learning with a focus on classification and regression tasks.
What I Know
Algorithms: logistic regression, decision trees, random forests, support vector machines (SVMs)
Model training, validation, and evaluation using metrics like accuracy, precision, recall, and F1-score
Cross-validation, hyperparameter tuning, and overfitting prevention
Project: Developed a classification model to predict stock price movement direction using Reddit sentiment scores. Compared logistic regression, decision trees, and random forest models to assess predictive power.
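The evaluation metrics listed above all come from the four cells of a binary confusion matrix. A minimal sketch with made-up labels (1 = price went up, 0 = it did not); the function name is hypothetical:

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


metrics = binary_metrics([1, 0, 1, 1, 0], [1, 1, 1, 0, 0])
print(metrics)
```

Returning 0.0 when a denominator is zero is a convention chosen here for simplicity; libraries differ on how they handle that case.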
Related Coursework: DTSA 5510/CSCA 5632 – Unsupervised Algorithms in Machine Learning
Focused on uncovering structure in unlabeled data using clustering and dimensionality reduction techniques.
What I Know
K-means, DBSCAN, hierarchical clustering
Principal Component Analysis (PCA) and t-SNE for dimensionality reduction
Evaluating clusters using silhouette score and inertia
Project: Used unsupervised learning to cluster Reddit posts based on word embeddings. Visualized topic clusters with PCA and t-SNE to explore emerging market themes and investor sentiment.
Related Coursework: DTSA 5511/CSCA 5642 – Introduction to Deep Learning
Explored the fundamentals of neural networks and deep learning architectures.
What I Know
Feedforward neural networks, backpropagation, and activation functions
Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs)
Frameworks: TensorFlow and Keras
Project: Trained a basic neural network on textual Reddit data to classify post sentiment. Experimented with architecture depth and dropout regularization, comparing performance against traditional models.
Related Coursework: CSCA 5112 – Introduction to Generative AI
Provided a foundational overview of generative models and their applications in modern AI systems, including text, image, and audio generation.
What I Know
Core concepts: transformers, diffusion models, GANs, and autoregressive models
Use cases in text generation, image synthesis, and conversational AI
Ethical implications, risks, and biases in generative systems
Project: Fine-tuned a GPT-based language model using custom text data to generate topic-specific summaries. Evaluated coherence, relevance, and bias, and explored prompt engineering techniques to guide outputs effectively.
Related Coursework: DTSA 5301 – Data Science as a Field
Introduced the data science lifecycle, interdisciplinary nature of the field, and key roles and responsibilities in real-world data teams.
What I Know
Overview of the data science workflow: problem definition, data collection, analysis, and communication
Interplay between statistics, computer science, and domain expertise
Importance of reproducibility, documentation, and collaboration
Project: Created a case study presentation outlining how a data science team might approach optimizing conversion rates for a digital product using user behavior data and A/B testing.
Related Coursework: DTSA 5302 – Cybersecurity for Data Science
Examined best practices for maintaining data privacy, integrity, and security in data pipelines and analytics workflows.
What I Know
Secure handling of sensitive and personally identifiable information (PII)
Common data vulnerabilities and threat models
Regulatory frameworks (e.g., GDPR, HIPAA) relevant to data professionals
Project: Performed a privacy risk assessment of a mock data pipeline to identify weak points in data encryption, access control, and anonymization practices.
Related Coursework: DTSA 5303 – Ethical Issues in Data Science
Explored the ethical responsibilities of data scientists and the impact of algorithmic decisions on individuals and society.
What I Know
Bias in data and algorithms
Fairness, accountability, and transparency in modeling
Ethical dilemmas in predictive analytics and surveillance
Project: Analyzed a real-world case of algorithmic bias (e.g., facial recognition or hiring algorithms) and proposed an ethical framework for improving transparency and fairness.
Related Coursework: DTSA 5304 – Fundamentals of Data Visualization
Focused on effectively communicating data insights through clear, compelling visualizations.
What I Know
Principles of visual perception, design, and storytelling
Choosing the right chart type for the message
Tools: matplotlib, seaborn, ggplot2, and interactive dashboards
Project: Designed an interactive dashboard using Plotly and Dash to explore sentiment trends in Reddit posts over time, integrating filters and hover features to enhance user experience.
Related Coursework: DTSA 5733 – Relational Database Design
Explored the fundamentals of designing scalable, efficient relational databases tailored for analytical workflows.
What I Know
Entity-relationship modeling and schema normalization
Primary/foreign keys, integrity constraints, and referential integrity
Trade-offs between normalization and performance in data warehousing
Project: Designed a normalized relational schema to support an analytics dashboard for tracking jiu-jitsu techniques and athlete performance over time. Implemented the schema in PostgreSQL to support live querying.
Related Coursework: DTSA 5734 – The Structured Query Language (SQL)
Gained proficiency in SQL for querying, manipulating, and aggregating structured data in relational databases.
What I Know
Advanced SQL queries: joins, subqueries, window functions, CTEs
Aggregation, filtering, and conditional logic for analytics
Performance tuning with indexing and query optimization
Project: Built complex queries to extract trends and insights from a Google BigQuery dataset consisting of stock market and Reddit mention data. Used CTEs and window functions to calculate moving averages and detect anomalies.
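The CTE plus window-function pattern for a moving average can be sketched as follows. This uses Python's built-in sqlite3 module as a stand-in for BigQuery (it assumes the bundled SQLite is version 3.25 or later, which added window functions), and the `mentions` table, its columns, and the ticker are hypothetical.

```python
import sqlite3


def moving_average(rows, window=3):
    """Return (day, moving average of daily mentions) using a CTE and a window function."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE mentions (day TEXT, ticker TEXT, n INTEGER)")
    conn.executemany("INSERT INTO mentions VALUES (?, ?, ?)", rows)
    query = f"""
        WITH daily AS (
            SELECT day, SUM(n) AS total
            FROM mentions
            GROUP BY day
        )
        SELECT day,
               AVG(total) OVER (
                   ORDER BY day
                   ROWS BETWEEN {int(window) - 1} PRECEDING AND CURRENT ROW
               ) AS moving_avg
        FROM daily
        ORDER BY day
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


sample = [
    ("2024-01-01", "ABC", 1),
    ("2024-01-02", "ABC", 2),
    ("2024-01-03", "ABC", 3),
    ("2024-01-04", "ABC", 4),
]
print(moving_average(sample, window=2))
```

The CTE aggregates to one row per day first, so the window frame counts days rather than raw mention rows.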
Related Coursework: DTSA 5735 – Advanced Topics and Future Trends in Database Technologies (Elective)
Examined evolving technologies in data storage and management, including NoSQL databases and cloud-native solutions.
What I Know
Comparison of relational and non-relational (NoSQL) database models
Introduction to distributed databases and horizontal scaling
Trends in real-time data processing and cloud-based data architectures
Project: Prototyped a hybrid data architecture using BigQuery and Firestore to manage both structured and semi-structured data. Designed the system to support future integration with real-time analytics tools.
Related Coursework: CSCA 5414 / DTSA 5503 – Foundations of Data Structures and Algorithms
Introduced essential algorithmic strategies for solving complex optimization problems efficiently.
What I Know
Design and implementation of dynamic programming solutions
Greedy algorithms and their correctness analysis
Time and space complexity analysis using Big-O notation
Project: Implemented dynamic programming and greedy algorithms to solve a set of optimization challenges, including longest common subsequence and interval scheduling. Evaluated trade-offs between approaches and benchmarked performance in Python.
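The two problems named above illustrate the contrast between the approaches: longest common subsequence needs a table of subproblem results, while interval scheduling is solved by a greedy rule (always take the interval that finishes earliest). A minimal sketch with hypothetical function names:

```python
def lcs_length(a, b):
    """Length of the longest common subsequence via an O(len(a) * len(b)) DP table."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def max_non_overlapping(intervals):
    """Greedy interval scheduling: sort by end time, keep each compatible interval."""
    chosen = []
    last_end = float("-inf")
    for start, end in sorted(intervals, key=lambda iv: iv[1]):
        if start >= last_end:
            chosen.append((start, end))
            last_end = end
    return chosen


print(lcs_length("ABCBDAB", "BDCABA"))
print(max_non_overlapping([(1, 4), (3, 5), (0, 6), (5, 7), (8, 11), (12, 16)]))
```

The greedy routine costs O(n log n) for the sort, whereas the DP table costs time and space proportional to the product of the two input lengths.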
Related Coursework: CSCA 5424 – Approximation Algorithms and Linear Programming
Focused on near-optimal solutions for NP-hard problems and optimization using linear programming.
What I Know
Formulating linear programs, solving them with the simplex method, and reasoning about them through duality
Design of approximation algorithms with provable guarantees
Applications to scheduling, graph problems, and resource allocation
Project: Formulated a linear program to optimize ad placements on a content website and developed an approximation algorithm to handle scalability for larger input sets. Visualized solution efficiency across test scenarios.
Related Coursework: CSCA 5434 – Advanced Data Structures, RSA and Quantum Algorithms
Explored advanced algorithmic techniques and emerging concepts in cryptography and quantum computation.
What I Know
Advanced data structures: tries, AVL trees, heaps, and hash maps
RSA encryption algorithm and number theory foundations
Introduction to quantum computing concepts and quantum search algorithms (e.g., Grover’s algorithm)
Project: Simulated RSA encryption and decryption using Python, applying modular arithmetic and prime generation. Also explored quantum algorithm principles through pseudocode implementation and comparison to classical search methods.
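A toy version of the RSA round trip, for illustration only: the primes are tiny textbook values, there is no padding, and nothing here is secure. It assumes Python 3.8 or later, where the built-in `pow` accepts a negative exponent to compute a modular inverse. Function names are hypothetical.

```python
def make_keys(p, q, e):
    """Build a toy RSA key pair from two primes and a public exponent."""
    n = p * q
    phi = (p - 1) * (q - 1)
    d = pow(e, -1, phi)  # modular inverse of e mod phi; raises ValueError if none exists
    return (e, n), (d, n)


def encrypt(message, public_key):
    e, n = public_key
    return pow(message, e, n)  # modular exponentiation


def decrypt(ciphertext, private_key):
    d, n = private_key
    return pow(ciphertext, d, n)


public_key, private_key = make_keys(p=61, q=53, e=17)
ciphertext = encrypt(65, public_key)
print(ciphertext, decrypt(ciphertext, private_key))
```

Decryption recovers the message because e * d is congruent to 1 modulo (p - 1)(q - 1), which is the number-theory fact the scheme rests on.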
Related Coursework: CSCA 5063 – Network Systems Foundation
Introduced the principles and architecture of modern computer networks, focusing on how data is transmitted, routed, and secured across systems.
What I Know
OSI and TCP/IP networking models
IP addressing, subnetting, and routing protocols
Basics of network performance, congestion, and fault tolerance
Project: Mapped and analyzed network traffic flows in a simulated enterprise environment using Wireshark. Evaluated protocol layers, packet loss, and latency to diagnose performance bottlenecks.
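The IP addressing and subnetting item above can be illustrated with Python's standard `ipaddress` module. The address block and the host address are arbitrary private-range examples, and the helper name is hypothetical.

```python
import ipaddress


def split_network(cidr, new_prefix):
    """Split a network into equal subnets with the given prefix length."""
    network = ipaddress.ip_network(cidr)
    return list(network.subnets(new_prefix=new_prefix))


subnets = split_network("192.168.1.0/24", 26)
for subnet in subnets:
    # num_addresses counts every address in the block, including network and broadcast.
    print(subnet, subnet.num_addresses, subnet.broadcast_address)

# Find which subnet a given host address belongs to.
host = ipaddress.ip_address("192.168.1.70")
owner = next(s for s in subnets if host in s)
print(host, "is in", owner)
```

Moving from a /24 to a /26 borrows two host bits, which yields four subnets of 64 addresses each.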
Related Coursework: CSCA 5073 – Linux Networking
Explored network configuration, monitoring, and troubleshooting in Linux-based systems through hands-on exercises.
What I Know
Network interface configuration, firewall rules, and SSH tunneling
Tools such as netstat, ip, tcpdump, and iptables
Basics of shell scripting for network diagnostics and automation
Project: Configured a virtual Linux server to host and secure a basic web application. Set up port forwarding, implemented firewall rules with iptables, and created scripts to monitor uptime and network usage.
Related Coursework: DTSA 5020 – Regression and Classification
Focused on applying statistical models for supervised learning, emphasizing the mathematical underpinnings and interpretability of model output.
What I Know
Linear and logistic regression for prediction and classification
Model assessment with training/test splits, confusion matrices, and ROC curves
Bias-variance tradeoff and regularization techniques (Lasso, Ridge)
Project: Built logistic regression models, including a Lasso-penalized (L1) variant, to predict customer churn based on behavioral data. Evaluated accuracy and interpretability using coefficient analysis and cross-validation.
Related Coursework: DTSA 5021 – Resampling, Selection and Splines
Explored advanced model tuning techniques and flexible function fitting for non-linear relationships.
What I Know
Cross-validation (k-fold, LOOCV), bootstrapping for variance estimation
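Variable selection and shrinkage techniques: stepwise selection, Lasso (which can set coefficients exactly to zero), and Ridge (which shrinks coefficients without removing variables)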
Polynomial regression and smoothing splines for modeling complex trends
Project: Applied cross-validation and spline regression to model housing prices with non-linear features (e.g., square footage vs. price). Compared performance with standard linear models and visualized residual patterns.
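The k-fold splitting behind cross-validation is simple to sketch. This version is an illustration with a hypothetical function name; it does no shuffling, and setting k equal to n gives leave-one-out cross-validation (LOOCV).

```python
def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) pairs for k contiguous folds over n rows."""
    # Spread the remainder over the first n % k folds so sizes differ by at most 1.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    indices = list(range(n))
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


for train, test in k_fold_indices(5, 2):
    print("train:", train, "test:", test)
```

Each row lands in exactly one test fold, so averaging the per-fold errors uses every observation for validation once.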
Related Coursework: DTSA 5022 – Trees, SVM, and Unsupervised Learning
Integrated both supervised and unsupervised learning techniques with a focus on model robustness and interpretability.
What I Know
Decision trees, random forests, and boosting methods
Support Vector Machines (SVM) with various kernels
K-means clustering and hierarchical clustering for unsupervised analysis
Project: Compared decision trees, random forests, and SVMs for classifying sentiment in Reddit posts. Used unsupervised clustering to identify emerging topics and validate classification boundaries.
Related Coursework: DTSA 5507 / CSCA 5008 – Fundamentals of Software Architecture for Big Data
Introduced foundational principles of software architecture with a focus on scalability, modularity, and performance in big data systems.
What I Know
Core architectural components: services, APIs, storage, and compute layers
Trade-offs in design: consistency vs. availability, batch vs. stream processing
Key concepts: fault tolerance, load balancing, and system resilience
Project: Designed a high-level architecture for a Reddit sentiment analysis pipeline. Mapped out ingestion, preprocessing, storage, and model inference stages with considerations for scalability and modular design.
Related Coursework: DTSA 5508 / CSCA 5018 – Software Architecture Patterns for Big Data
Explored common architectural patterns and paradigms used in large-scale data systems.
What I Know
Lambda and Kappa architectures for real-time and batch data processing
Microservices vs. monolithic architecture trade-offs
Event-driven, layered, and service-oriented patterns
Project: Prototyped a Lambda-style architecture using Cloud Functions and BigQuery to process and analyze high-volume Reddit data streams. Incorporated Pub/Sub messaging for real-time data handling.
Related Coursework: DTSA 5714 / CSCA 5028 – Applications of Software Architecture for Big Data
Applied architectural principles to real-world data problems, emphasizing system integration and end-to-end data flow.
What I Know
End-to-end design and deployment of scalable data pipelines
Integration of storage systems (e.g., BigQuery, Firestore) with application layers
Monitoring, logging, and performance optimization strategies
Project: Developed a full-stack pipeline to analyze Reddit sentiment and stock market data. Integrated APIs, scheduled ETL jobs, and built dashboards to present insights, with an emphasis on maintainability and modularity in the system design.