FA582 Foundations of Financial Data Science

Course Catalog Description


This course will provide an overview of issues and trends in data quality, data storage, data scrubbing, and data flows. Topics will include data abstractions and integration; data management issues with collection, warehousing, pre-processing, and querying; similarity and distances; clustering methods; classification methods; text mining; and time series. Case studies will be presented in support of the theoretical concepts. Furthermore, the Hadoop-based programming framework for big data issues will be introduced along with associated governance and policy issues. These concepts will be applied to areas such as real estate, social media and social networks, and capital markets financial data. A one-credit Hanlon lab course, FE-513: Practical Aspects of Database Design, is a co-requisite for this course to facilitate learning the practical side of data management.

Campus Fall Spring Summer
On Campus X X
Web Campus X X X


Professor Email Office
Dragos Bozdog dbozdog@stevens.edu Babbio 429A


Course Description

This course provides the theoretical and practical foundation for Financial Analytics. The main objective is to enable students to make data-driven decisions in the financial industry using data science tools and analytics methods. The goal is to enable effective business operations, enhanced customer services and product offerings, and improved risk analysis and risk management.

Course Outcomes

After taking this course, the students will be able to:

  1. Have a working knowledge of the issues of data quality, data storage, data scrubbing, data flows, and data encryption and their potential solutions.
  2. Understand and design various schemas needed for the representation of financial data.
  3. Tackle problems dealing with data management issues such as collection, warehousing, preprocessing and querying.
  4. Gain a primer on database management, including its advantages and disadvantages, from the attached lab course FE 513.
  5. Have a working understanding of all the databases available to them through the Hanlon lab.
  6. Apply the newly acquired data management and database skills to financial data from the capital markets, social media, and the financial services sector.

Course Resources


No single textbook covers all the topics. Several references will be used and supplementary notes will be provided whenever appropriate.

Additional References

  • CS1: Charu C. Aggarwal, Data Classification: Algorithms and Applications, CRC Press, 2015. (ISBN: 978-1-4665-8674-1)
  • CS2: Charu C. Aggarwal, Data Mining, Springer, 2015. (ISBN: 978-3-319-14141-8)
  • CS3: Deborah Nolan and Duncan Temple Lang, Data Science in R: A Case Studies Approach to Computational Reasoning and Problem Solving, CRC Press, 2015. (ISBN: 978-1-4822-3481-7)
  • NM: Norman Matloff, The Art of R Programming, No Starch Press, 2011. (ISBN: 978-1-59327-384-2)
  • CO: Cathy O’Neil and Rachel Schutt, Doing Data Science, O’Reilly, 2014. (ISBN: 978-1-449-35865-5)


Grading Policies

1 Class participation 5%
2 Assignments 55%
3 Project 40%

Lecture Outline

Week Topic
Week 1 Introduction to Financial Data Science.
Data Science Process. Sample Data Processing.
The Basic Data Types.
The Major Building Blocks: A Bird’s Eye View. Introduction to R.
Case Study: Exploratory Data Analysis (NYC Real Estate)
Week 2 Financial Data Quality Issues and Data Scrubbing.
Data Preparation.
Feature Extraction and Portability.
Data Cleaning.
Handling Missing Entries.
Handling Incorrect and Inconsistent Entries.
Scaling and Normalization.
Data Reduction and Transformation.
Sampling for Static Data and Data Streams.
Dimensionality Reduction Intro.
Week 3 Web page retrieval, scraping, regular expression extraction, and basic statistical techniques to identify incorrect data entries.
Case Study: Data and Web Technologies. Linear Model. Piecewise linear model.
Week 4 Similarity and Distances. Impact of High Dimensionality.
Generalized Minkowski Distance. Match-Based Similarity Computation.
Impact of Data Distribution. ISOMAP.
Impact of Local Data Distribution.
Similarity on Categorical Data.
Similarity on Mixed Quantitative and Categorical Data.
Text Similarity Measures.
Time Series Similarity Measures.
Week 5 Classification Methods.
Week 6 Tree-Based Methods.
Week 7 Clustering Methods.
Week 8 No Class (Spring Recess)
Week 9 Financial Time Series Data.
Using Decision Trees to Trade Stocks. Building a Trading Strategy.
Handling Time-Dependent Data in R.
The Prediction Models.
Week 10 Mining Text Data.
Document Preparation and Similarity Computation.
Specialized Clustering Methods for Text.
Probabilistic Algorithms.
Topic Modeling.
Week 11 Case Study: Using Statistics to Identify Spam
Week 12 Outlier Analysis. (Extreme Values. Clustering Models. Distance-Based Models.
Density-Based Models.
Probabilistic Models. Information-Theoretical Models).
Week 13 Hadoop. Hadoop Applications.
HDFS and MapReduce.
Week 14 Review and Catching up.
Week 15 Final presentations.