MAT 6493
Geometric Data Analysis
Fall 2023
Guy Wolf (wolfguy@mila.quebec) & Yanlei (Kaly) Zhang (yanlei.zhang@mila.quebec)
Modern data analysis methods are expected to handle massive amounts of high-dimensional data being collected across a variety of domains.
The high dimensionality of such data introduces numerous challenges, typically referred to as the "curse of dimensionality",
which render traditional statistical learning approaches impractical or ineffective for their analysis. To cope with these challenges,
significant effort has been focused on developing geometric data analysis approaches that model and capture the intrinsic geometry of processed data,
rather than directly modeling their distribution. In this course we will explore such approaches and provide an analytical study of the models and algorithms they use.
We will start by considering supervised learning and distinguish classifiers that are based on geometric principles from posterior and likelihood estimation approaches.
Next, we will consider the unsupervised learning task of clustering data and contrast approaches based on density estimation with ones that rely on metric spaces or graph constructions.
Finally, we will consider more fundamental tasks in intrinsic representation learning, with particular focus on dimensionality reduction and manifold learning, e.g.,
with diffusion maps, t-SNE, and PHATE. Time permitting, we will include guest talks on research areas related to the course,
and discuss recent developments in graph signal processing and geometric deep learning.
This is a graduate-level 4 credit course at UdeM, available also via the ISM. It is suitable for CS, statistics, and applied math students interested in data science and machine learning.
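As an informal illustration of the "curse of dimensionality" mentioned above (a small NumPy sketch, not part of the official course materials): pairwise Euclidean distances between uniformly random points concentrate as the dimension grows, so the contrast between the nearest and farthest pair shrinks and distance-based reasoning loses discriminative power.

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_contrast(dim, n_points=500):
    """Relative spread (max - min) / min of pairwise Euclidean
    distances between uniformly random points in [0, 1]^dim."""
    pts = rng.random((n_points, dim))
    # squared pairwise distances via the identity |x - y|^2 = |x|^2 + |y|^2 - 2 x.y
    sq = (pts ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * pts @ pts.T
    iu = np.triu_indices(n_points, k=1)       # distinct pairs only
    d = np.sqrt(np.clip(d2[iu], 0.0, None))   # clip tiny negative rounding errors
    return (d.max() - d.min()) / d.min()

for dim in (2, 10, 100, 1000):
    print(f"dim={dim:4d}  contrast={distance_contrast(dim):.2f}")
```

In low dimension the nearest pair is much closer than the farthest one (contrast well above 1), while in high dimension all pairwise distances become nearly equal (contrast well below 1).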
Meetings
Lectures:
Tuesdays & Thursdays, 5183 Pav. Andre-Aisenstadt, from 15h30 to 17h20 (officially, but possibly ending early depending on pace & covered materials)
Office Hours:
By request on MS Teams with possibility to set an in-person meeting if necessary.
Textbooks
No required textbook, but the following books were used when preparing some (although not all) of the course materials:
Other course materials are based on research papers that will be cited in the course slides.
Topics:
Slides will be made available on MS Teams and are designed to be sufficient for studying the course even without regularly attending class.
- Topic 01 - Introduction (incl. curse of dimensionality & overview of data analysis tasks)
- Topic 02 - Data Formalism (incl. summary statistics, data types, preprocessing, and simple visualizations)
- Topic 03 - Bayesian Classification (incl. decision boundaries, MLE, MAP, Bayes error rate, and Bayesian belief networks)
- Topic 04 - Decision Trees (incl. random forests, random projections, and Johnson-Lindenstrauss lemma)
- Topic 05 - Principal Component Analysis (incl. preprocessing & dimensionality reduction)
- Topic 06 - Support Vector Machines (incl. the "kernel trick" & Mercer kernels)
- Topic 07 - Multidimensional Scaling (incl. spectral theorem & distance metrics)
- Topic 08 - Density-based Clustering (incl. intro. to clustering & cluster evaluation with the Rand index)
- Topic 09 - Partitional Clustering (incl. lazy learners, kNN, Voronoi partitions)
- Topic 10 - Hierarchical Clustering (incl. large-scale & graph partitioning)
- Topic 11 - Manifold Learning (incl. Isomap & LLE)
- Topic 12 - Diffusion Maps
Guest speakers (covering specialised topics of interest):
- Nov. 9th -- Graph representations & graph neural networks
- Nov. [TBA] -- Optimal transport, Earth-mover distances, and neural ODE [tentative]
- Nov. [TBA] -- Topological data analysis [tentative]
Final grade composition:
The final grade in this class will be based on three components:
- 30% -- homework
- 45% -- final project
- 25% -- literature review
Final projects:
- For the final project, students will form small groups (of 2-4 team members).
- Each group should designate a point of contact (POC) for the group.
- Group members are expected to equally contribute to the project.
- Each group member will be expected to specify their individual contribution in the final report.
- By October 13th [tentative], each group should submit a project proposal.
- Proposals are expected to span 2-3 pages and include at least the following sections:
- Project description & goals;
- Planned contributions of each team member;
- Used data / data sources.
- Projects should involve multiple methods applied to data analysis tasks chosen by each team (subject to approval of the submitted proposal), demonstrating understanding of underlying principles learned in class.
Literature review:
- Each student is expected to perform a literature review on a topic of interest relevant to (geometric) data analysis.
- The literature review should include 3-4 papers on the chosen topic.
- Selected papers should typically be at least 8-10 pages long, not including references.
- Shorter papers (4-5 pages) may be acceptable, but then you should cover more of them (e.g., 4-5 papers).
- If you cover significantly longer papers (e.g., 20+ pages), you can choose 2-3 papers.
- Selected papers should be published within the past 10 years in a reputable peer-reviewed venue (e.g., well known conferences or journals).
- You may include at most one unpublished preprint, as long as it is available on arXiv or bioRxiv. The rest of the papers should be published.
- You may include at most one older paper (published more than 10 years ago), as long as you can justify it as a seminal work required to understand the other selected papers.
- The review will consist of a short written report (4-6 pages) and a short recorded presentation (10-20 minutes).
- By October 20th [tentative], each student should propose a topic for their review (no teamwork - each student must have a separate topic)
- By November 13th [tentative], each student should propose and justify the papers to be included in their report
Homework:
While discussion between students is not discouraged, homework assignments are meant to be completed & submitted individually.
- Problem Set I - due in Oct.
- Topics covered: T01-T04. [tentative]
- Instructions will be available on MS Teams.
- Problem Set II - due in Nov.
- Topics covered: T05-T06. [tentative]
- Instructions will be available on MS Teams.
- Problem Set III - due in Dec.
- Topics covered: T07-T12. [tentative]
- Instructions will be available on MS Teams.