FE582 Foundations of Financial Data Science
Course Catalog Description
This course will provide an overview of issues and trends in data quality, data storage, data scrubbing, and data
flows. Topics will include data abstractions and integration, enterprise level data issues, data management issues
with collection, warehousing, preprocessing and querying, similarity and distances, clustering methods,
classification methods, text mining, and time series. Case studies will be presented in support of the theoretical
concepts. Furthermore, the Hadoop-based programming framework for big data will be introduced, along with related
governance and policy issues. These concepts will be applied to areas such as digital marketing and computational
advertising, energy and healthcare analytics, social media and social networks, and capital markets financial data.
A one-credit Hanlon lab course, FE-513: Practical Aspects of Database Design, is a co-requisite for this course, to
facilitate learning the practical side of data management.
This course is the first course for the certificate in Financial Services Analytics. Financial services analytics
is the science and technology of creating data-driven decision-making analytics for the financial services
industry. This can lead to more effective business operations, enhanced customer services and product offerings,
and improved risk analysis and risk management. This course is the key building block in this certificate because
good data and an understanding of data are critical to the creation of robust financial services analytics. The
financial services analytics certificate has four key areas making up its knowledge base:
- Foundations of Financial Data Science (FE-582)
- Introduction to Knowledge Engineering (FE-590)
- Financial Systems Technology (FE-595)
- Data Visualization Applications (FE-550)
Co-Requisite: FE 513 – Practical Aspects of Database Design
After taking this course, the students will be able to:
- Have a working knowledge of the issues of data quality, data storage, data scrubbing, data flows, and data
encryption and their potential solutions.
- Understand and design various schemas needed for the representation of financial data.
- Tackle problems dealing with data management issues such as collection, warehousing, preprocessing, and querying.
- Get a primer on database management, including its advantages and disadvantages, from the attached lab
course FE 513.
- Understand how to write applications using the map-reduce feature of Hadoop clusters.
- Have a working understanding of all the databases available to them through the Hanlon lab.
- Apply the newly acquired data management and database skills to financial data from the capital markets,
social media, and the financial services sector.
No single textbook covers all the topics. Several references will be used and supplementary notes will be
provided whenever appropriate.
- Charu C. Aggarwal, Data Classification: Algorithms and Applications, CRC Press, 2015. (ISBN: 978-1-4665-8674-1)
- Charu C. Aggarwal, Data Mining, Springer, 2015. (ISBN: 978-3-319-14141-8)
- Deborah Nolan and Duncan T. Lang, Data Science in R: A Case Studies Approach to Computational Reasoning and Problem Solving, CRC Press, 2015. (ISBN: 978-1-4822-3481-7)
- Norman Matloff, The Art of R Programming, No Starch Press, 2011. (ISBN: 978-1-59327-384-2)
- Cathy O’Neil and Rachel Schutt, Doing Data Science, O’Reilly, 2014. (ISBN: 978-1-449-35865-5)
- Assignments 60%
- Project 40%
- Final Exam 50%
| Week 1
||Introduction to Financial Data Science
Data Science Process
Sample Data Processing
The Basic Data Types
The Major Building Blocks: A Bird’s Eye View
Introduction to R
Case Study: Exploratory Data Analysis (NYC Real Estate)
| Week 2
||Financial Data Quality Issues and Data Scrubbing.
Feature Extraction and Portability
Data Reduction and Transformation
Handling Missing Entries
Handling Incorrect and Inconsistent Entries
Sampling for Static Data and Data Streams
Dimensionality Reduction Intro
| Week 3
||Case Study: Data and Web Technologies
Web page retrieval, scraping, regular expression extraction, basic statistical techniques to
identify wrong data entries
Piecewise linear model
| Week 4
||Similarity and Distances
Impact of High Dimensionality
Lp-norm. Generalized Minkowski Distance. Contrast
Impact of Locally Irrelevant Features. Impact of Different Lp-Norms
Match-Based Similarity Computation
Impact of Data Distribution. ISOMAP
Impact of Local Data Distribution. Similarity on Categorical Data
Similarity on Mixed Quantitative and Categorical Data
Text Similarity Measures. Time Series Similarity Measures
| Week 5
Linear Discriminant Analysis
Quadratic Discriminant Analysis, K-NN
| Week 6
| Week 7
||Tree-Based Methods. Regression Trees. Tree Pruning.
Using Decision Tree to Trade Stock. Building a Trading Strategy. Handling Time-Dependent Data in
Python. The Prediction Models.
| Week 8
||Financial Time Series
Using Decision Tree to Trade Stock
Building a Trading Strategy
Handling Time-Dependent Data in R
The Prediction Models
| Week 9
||Mining Text Data
Document Preparation and Similarity Computation
Specialized Clustering Methods for Text
Specialized Classification Methods for Text
| Week 10
||Case Study: Using Statistics to Identify Spam
| Week 11
| Week 12
||No Class (Thanksgiving Recess).
| Week 13
||Hadoop. HDFS. MapReduce. Hive. Pig
| Week 14
||Final Project Presentations