BIA678 Big Data Technologies Seminar



Course Catalog Description

Introduction

The field of Big Data is emerging as one of the transformative business processes of recent times. It utilizes classic techniques from Business Intelligence & Analysis, along with new tools and processes to deal with the volume, velocity, and variety associated with big data. As they enter the workforce, a significant percentage of BIA students will be directly involved with big data as technologists, managers, or users. This course will build on their understanding of the basic concepts of BI&A to provide them with the background to succeed in the evolving data-centric world, not only from the point of view of the technologies required, but also in terms of management, governance, and organization. Tools will include Hadoop, HBase, and related software.

Prerequisites:

Admission requirements for the BI&A program.

Course Objectives

The objective of this course is to study key technological, management, and governance techniques for the application of big data. This will be done through a series of readings and lectures, some by outside experts; case studies of the application of big data; application of technologies typical of the field (e.g. Map/Reduce); and a semester-long, small-team project applying what has been learned. Students will learn how to apply selected tools in areas such as data management, data analysis, and data visualization, and also learn how to deal with the issues related to the management of large sets of data. The course will concentrate on what is different in a big data environment from what students have already learned about standard BIA environments. Finally, through the analysis and discussion of case studies, students will gain useful insights on how to optimize the value of big data processes and operations, to streamline the goals, and to design flexible systems. Students taking the course will be expected to have some background in areas such as multivariate statistics, data mining, data management, and programming.

Additional learning objectives include the development of:

Written and oral communication skills: The individual project proposal will be used to assess written skills, and the final presentations will be used to assess presentation skills.

Technical Reading Capability: Students will be required to read, and lead discussions on, seminal papers in the field of big data.

Team skills: The final project for the course will involve student teams; an online survey instrument will be used to measure individual contributions to team performance.


Campus Fall Spring Summer
On Campus
Web Campus

Instructors

Professor Email Office
David Belanger dbelange@stevens.edu Babbio 409

More Information

Course Outcomes

After taking this course, students will be able to:

  1. Understand and discuss what big data is, and how it differs from traditional approaches to BI&A.
  2. Plan and use the primary tools associated with big data in creating systems to take advantage of big data.
  3. Extract knowledge and intelligence from datasets which exhibit high volume, velocity, and/or variety.
  4. Plan and execute a project that includes the use of at least one big data dataset.
  5. Understand and discuss the meta issues around big data such as governance, security, privacy, and OAM&P.
  6. Understand and be able to execute analyses oriented to streaming data.
  7. Have a framework with which to understand new advances in the field, and distinguish hype from reality.
  8. Understand and discuss organizational issues related to big data.

Course Resources

Textbook

Required Text(s):

Soares, Sunil, “Big Data Governance – An Emerging Imperative.” Boise ID, MC Press, 2012.

Supplementary Readings:

  • Wu, et al., “Data Mining with Big Data”, IEEE Transactions on Knowledge and Data Engineering, 1/2014 http://www.cs.umb.edu/~ding/papers/TKDE2013.pdf
  • Lin & Ryaboy, “Scaling Big Data Mining Infrastructure: The Twitter Experience”, SIGKDD Explorations, V14 I2 http://www.kdd.org/sites/default/files/issues/14-2-2012-12/V14-02-02-Lin.pdf
  • McKinsey Global Institute, “Big Data: The next frontier for innovation, competition, and productivity”, 2011 http://www.mckinsey.com/Search.aspx?q=big%20data%20the%20next%20frontier%20for%20innovation%20competition%20and%20productivity&l=Insights%20%26%20Publications
  • Dean & Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters”, http://static.googleusercontent.com/media/research.google.com/en/us/archive/mapreduce-osdi04.pdf, 2004
  • Ghemawat, et al., “Google File System”, http://static.googleusercontent.com/media/research.google.com/en/us/archive/gfs-sosp2003.pdf, 2003
  • Compression vcodex, http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.93.4161&rep=rep1&type=pdf
  • Cortes, et al., Communities of Interest, http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=737FC800B052765E59749637FAB5AF7D?doi=10.1.1.23.8792&rep=rep1&type=pdf
  • CAP, IEEE Computer V45 N2 2/2012 pp. 21-58, esp. 21, 23, 30, 37, 43. Lynch & Gilbert, “Perspectives on the CAP Theorem”, http://groups.csail.mit.edu/tds/papers/Gilbert/Brewer2.pdf, 2012
  • Abadi, et al., “Column-Stores vs. Row-Stores: How Different Are They Really?”, http://db.csail.mit.edu/projects/cstore/abadi-sigmod08.pdf
  • Chang, et al., “Bigtable: A Distributed Storage System for Structured Data”, http://static.googleusercontent.com/media/research.google.com/en/us/archive/bigtable-osdi06.pdf, 2006
  • DeCandia, et al., “Dynamo: Amazon’s Highly Available Key-value Store”, http://www.read.seas.harvard.edu/~kohler/class/cs239-w08/decandia07dynamo.pdf
  • Hbase Basics (Cassandra Basics), O’Reilly; http://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf
  • Widom, et al., “STREAM: The Stanford Data Stream Management System”, http://ilpubs.stanford.edu:8090/641/1/2004-20.pdf, 2004
  • Cranor, et al., “Gigascope: A Stream Database for Network Applications”, http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.68.4467&rep=rep1&type=pdf, 2004
  • Johnson, “Stream Warehousing”, http://www.stanford.edu/group/mmds/slides2012/s-johnson.pdf, an application to Darkstar
  • IBM Infosphere Streams, http://www-03.ibm.com/software/products/en/infosphere-streams
  • Wu, et al., “Top 10 Algorithms in Data Mining”, Knowledge and Information Systems, 2007, http://www.cs.umd.edu/~samir/498/10Algorithms-08.pdf
  • Marai, Liz, http://vis.cs.pitt.edu/teaching/cs2620/lectures/L04_TufteDesign.pdf
  • Shneiderman, Ben, Extreme Visualization: Squeezing a Billion Records into a Million Pixels http://www.cs.umd.edu/~ben/papers/Shneiderman2008Extreme.pdf
  • Scheidegger, et al, “Visual Embedding, a Model for Visualization”, http://cscheid.net/static/papers/visual_embedding.pdf, 1/2014
  • http://docs.media.bitpipe.com/io_11x/io_113511/item_821580/Big%20Data%20Needs%20Agile%20Information%20And%20Integration%20Governance.PDF
  • NIST Big Data Public Working Group and Standardization activities, http://bigdatawg.nist.gov/_uploadfiles/M0270_v1_9179221138.pdf

  • Privacy Policies, for example: jpmorgan, at&t, google, smaller folks, …

  • Johnson, et al., “Bistro Data Feed Management System”, http://www.research.att.com/export/sites/att_labs/techdocs/TD_100454.pdf
  • MIT, “an evaluation framework for data quality tools”, http://mitiq.mit.edu/iciq/pdf/an%20evaluation%20framework%20for%20data%20quality%20tools.pdf

Grading

Grading Policies

Class Discussion Leadership (10%)

Each student will be required to lead the class discussion of one or more of the assigned readings. This will be done in front of the class. All students are expected to have read the assigned readings, and to take part in the discussion.

Individual Term Paper (33%)

Each student will be required to write a term paper of approximately 5 – 10 pages on a topic of their choice within the domain of big data.

Team Project Report & Presentation (33%)

The class will be divided into teams of approximately 5 students each. Each team will be expected to select a data set appropriate to big data, to conduct a variety of analyses on the data using tools associated with big data, to present the project results to the class, and to create a written report on the project.


Lecture Outline

Topics and readings by week:

Week 1. Introduction to Big Data
  • Wu, et al., 2014
  • Lin & Ryaboy, 2012
  • McKinsey Global Institute, 2011

Week 2. Core Technologies for Distribution and Scale
  • Dean & Ghemawat, 2004
  • Ghemawat, et al., 2003
  • Vcodex
  • Cortes, et al.
  • Cloudera Tutorial

Week 3. Data Base Management
  • Lynch & Gilbert, 2012
  • Abadi, et al., 2008
  • Chang, et al., 2006
  • Decandia, et al., 2008

Week 4. Data Stream Management
  • Widom, et al., 2004
  • Cranor, et al., 2004
  • Johnson, 2012
  • Infosphere Speaker

Week 5. Data Analytics
  • Wu, et al., 2007

Week 6. Visualization in a Big Data World
  • Marai, 2004
  • Shneiderman, 2008
  • Scheidegger, et al., 2012

Week 7. Data Governance
  • Soares, Parts 1, 2, and 3

Week 8. Meta Issues in Big Data Governance
  • NIST Documents
  • Privacy Policies of selected companies (e.g. JPMorgan, AT&T)
  • MIT

Week 9. Applications

Week 10. Student Presentations of Term Projects