FE512 Database Engineering

Course Catalog Description

Introduction

This lab course will introduce HFSL databases resources and the tools required to use them. More generally, the course will introduce a variety of software tools used to interact with data storage systems and manipulate them. The focus will be on how to use these tools in practice to perform common data engineering tasks, with a focus on the financial services industry, in both on-prem and cloud environments. The course will require students to spend a significant amount of time programming in Python, SQL, and other languages.

Campus Fall Spring Summer
On Campus X
Web Campus

Instructors

Professor Email Office
Ionut Florescu ifloresc@stevens.edu Babbio 544
Edward Loeser eloeser@stevens.edu Babbio 109

More Information

Course Description

The course covers basic and advanced knowledge of SQL using PostgreSQL. Python and R will be used alongside SQL to interact with data systems. Additional topics will be introduced.


Course Outcomes

At the end of this course, students will be able to: evaluate a data problem and implement a simple software stack to solve it; connect to databases, call data APIs, and manipulate data using a programming language; use SQL to write DDL and DML statement

The objective of the course is for students to:
• build fundamental skills in writing programs to manipulate data and interact with data APIs
• learn to write more complex, data-intensive programs
• understand basic database concepts
• become comfortable researching, evaluating, and implementing open-source software tools
• gain a basic understanding of the large and rapidly changing data storage technology landscape


Course Resources

Prerequisites and Textboks

There are no course prerequisites. Students will need to be comfortable with Python, SQL, and text-based user interfaces. There is no required textbook. However we present a list of reference books for the course:
• Designing Data-Intensive Applications, Martin Kleppmann
• Options, Futures, and Other Derivatives, John C. Hull
• Effective awk Programming, Arnold Robbins
• SQL in 10 Minutes, Ben Forta
PostgreSQL Docs , The PostgreSQL Global Development Group


Grading

Grading Policies

All Stevens, graduate students promise to be fully truthful and avoid dishonesty, fraud, misrepresentation, and deceit of any type in relation to their academic work. A student's submission of work for academic credit indicates that the work is the student's own. All outside assistance must be acknowledged. Any student who violates this code or who knowingly assists another student in violating this code shall be subject to discipline. If identical submissions are founded, the students will receive 0 point for the homework.

Grading Procedures:
• Programming Assignment 1: 20%
• Programming Assignment 2: 20%
• Programming Assignment 3: 30%
• Homework Assignments: 30%
Late Policy: All assignments will be due on the date and time posted.


Lecture Outline

Topic Due
SQL basics and programming interfaces
Week 1 Environment Setup, Python, R, CLI Tools, Linux, Windows PowerShell, MacOS Terminal, Shell Commands/Scripting, SSH, Generating Keys, Git, ODBC, JDBC, psycopg2
Week 2 SQL Basics, SQL Structure, pgAdmin, SQL Math HW 1
Week 3 SQL Concepts, SQL DML, PostgreSQL, Postgres Ecosystem, Using R and Python to run SQL commands HW 2
Databases and data wrangling
Week 4 Databases, File Storage, File Formats, Importing / Exporting Data, JSON, XML, XBRL [Machine Readable News Case Study] HW 3
Week 5 Databases, SQL DDL, Schema Design, Normalizing Data. [Machine Readable News Case Study] HW 4
Week 6 Parsing Larger Data Files. Constraints / Technology Choices. Improving performance by indexing. [Datascope Select Case Study] HW 5
Week 7 Troubleshooting Techniques, Command Line Interface (CLI) Tools: hexdump, regex, awk, etc. [Datascope Select Case Study] PA 1, HW 6
Application Programming Interfaces (API's) and Extract, Transform, Load (ETL)
Week 8 APIs, File Formats, HTTP Protocol, REST APIs. [Bloomberg, Kaiko Case Studies] HW 7
Week 9 Pulling Data from APIs, OpenAI, EDGAR [Datascope Select Case Study] HW 8
Week 10 Data Engineering Overview, ETL / ELT Concepts and Ecosystem, Introduction to Airflow, Using Python and R [Machine Readable News Case Study] HW 9
Containerization tools and more ETL
Week 11 DevOps Overview, Docker, Docker Compose, Kubernetes HW 10, PA 2
Week 12 Airflow Concepts, Running Airflow Locally [Machine Readable News Case Study] HW 11
Week 13 Airflow Environment, Writing Directed Acyclic Graphs (DAGs) [Machine Readable News Case Study] HW 12
Week 14 Writing and Debugging DAGs [Machine Readable News Case Study] PA 3