The Master's programme builds upon the foundation laid in a Bachelor's programme and goes into greater depth in mathematics, programming, algorithms, databases, statistics, optimisation, bioinformatics, information retrieval and machine learning. The programme offers a rich set of elective courses covering cutting-edge achievements in data science.
Master's students apply their newly acquired knowledge to a programme-long project based on an idea they propose during enrollment. The project expands as students’ knowledge grows and, upon graduation, the outcome is presented to peers, mentors, industry partners and even venture funds at the university demo day.
During the first year, students' knowledge of mathematics, programming and data analysis will be significantly extended. The programme also offers an opportunity to learn the key soft skills for the professional world, including technical project management, writing and presenting. Finally, during the first year students are expected to attend many of the talks and workshops offered by the university and to work on the capstone project.
This course provides a basic review of combinatorics and graph theory. After that, the course delves into deeper results and modern methods of discrete analysis underlying several mathematical disciplines necessary for data analysis.
This course is particularly important for those who are new to C++ or looking to improve their understanding of it. The course will help students familiarize themselves with algorithms and data structures. However, there will also be tools that may be new to those who have studied C++ at the bachelor’s level. At the master’s level, the student is ready, at least mentally, to do "industrial programming." In addition, the course will continue in the second semester, and it will eventually give the student an idea of the many subtleties of this programming language.
In this course, students first review the basics of algorithms and data structures, and then study and model complex modern algorithms, including algorithms on graphs (the construction of a spanning tree, topological sorting, shortest paths) and their associated data structures, such as disjoint-set data structures and binomial and Fibonacci heaps. Students will also learn about another set of problems associated with string processing and text index construction, covering the Rabin-Karp algorithm, the Knuth-Morris-Pratt (KMP) algorithm, Ukkonen's algorithm for building suffix trees, and suffix arrays.
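As an illustration of the string algorithms named above, the following is a minimal sketch of Knuth-Morris-Pratt matching. It is not course material; the function names `prefix_function` and `kmp_search` are chosen here for illustration only.

```python
def prefix_function(pattern):
    """For each position i, length of the longest proper prefix of
    pattern[:i + 1] that is also a suffix of it."""
    pi = [0] * len(pattern)
    k = 0  # length of the currently matched prefix
    for i in range(1, len(pattern)):
        # Fall back to shorter borders until the next character fits.
        while k > 0 and pattern[i] != pattern[k]:
            k = pi[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        pi[i] = k
    return pi


def kmp_search(text, pattern):
    """Return the start indices of all (possibly overlapping) matches."""
    if not pattern:
        return []
    pi = prefix_function(pattern)
    matches = []
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = pi[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = pi[k - 1]  # keep going to find overlapping matches
    return matches
```

The prefix table lets the search resume after a mismatch without re-reading text characters, which is what gives the algorithm its linear running time.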
The course on databases is designed for students who know the basics of programming and are familiar with the basic principles of how a computer works, in particular how the memory and disk subsystems operate. Students will learn the basics of relational algebra and SQL. They will also familiarize themselves with the configuration of a database management system (DBMS), learn to design a database schema to solve applied problems, study the principles of query optimization, and get to know the mechanisms of database fault tolerance and concurrent database access.
In this course, students will move from the foundations of probability to more advanced concepts, facts, and tools needed to analyze data. In particular, the course examines the main types of distributions, limit theorems and laws of large numbers, the methods of parametric statistics (point and interval estimation of parameters, hypothesis testing), regression, and nonparametric statistics.
A practical introduction to using the Unix operating system with a focus on Linux command line skills. Topics include: grep and regular expressions, ZSH, Vim and Emacs, basic and advanced GDB features, permissions, working with the file system, revision control, Unix utilities, environment customization, and using Python for shell scripts.
The course is offered in cooperation with the Interaction Design programme of Harbour.Space University. Students in the data science programme get an opportunity to learn about the challenges of creating usable, intuitive, efficient and practical designs for software applications and websites. The course describes challenges in the field of interaction design and discusses solutions. The course project will help students appreciate the value that design can bring to a product. The course is cross-listed with the Interaction Design curriculum, giving students of both programmes an opportunity to collaborate and work on a project that requires significant input from both fields.
This is an intensive course that outlines a set of algorithms and approaches for a variety of complex problems with a large number of applications in modern data analysis. These problems include the set cover problem, vertex cover, the shortest path problem, minimum spanning tree, matching, assignment problems, job shop scheduling problems, packing problems (bin packing, knapsack problem), flow problems (maximum flow, minimum-cost flow, multi-commodity flows), transportation theory (Hitchcock-Koopmans problem), and the traveling salesman problem.
This course on machine learning consists of two modules. It introduces students to some of the elements of modern data analysis. In the first module, students get an introduction to the foundational problems of machine learning and become more fully acquainted with algorithms for solving classification and clustering problems. Classification algorithms covered in this course include the nearest neighbor algorithm, the support vector machine (SVM) algorithm, Bayesian methods, decision trees, and rule lists. Clustering problems are addressed both with algorithms that assume a fixed number of clusters (K-Means, Expectation-Maximization (EM)) and with methods that determine the number of clusters automatically (agglomerative and divisive clustering). The second module of the course is devoted to regression analysis, building compositions of algorithms, model selection criteria and feature selection methods.
Students will become familiar with the programming language Python, which is an important tool. The course will pay special attention to the basics of the language, object-oriented programming (naturally extending from C++), error handling, code design and testing, string manipulation, the memory model, functional programming, a review of libraries, and concurrent computing in Python. The last topic will serve as an excellent way to reinforce the knowledge from the module on concurrent and distributed computing, which students will have just completed before this course.
The course introduces students to the principles and practice of computer networking. Structure and components of computer networks, packet switching, layered architectures. Applications: web/HTTP, voice-over-IP, P2P file sharing and socket programming. Reliable transport: TCP/IP, reliable transfer, flow control, and congestion control. The network layer: names and addresses, routing. Local area networks: Ethernet and switches. Wireless networks and network security.
This course introduces computer programming using the Java programming language with object-oriented programming principles. You will learn the complex aspects of the language: data types, memory management and garbage collection, generics, annotations, standard data structures, IO, JDBC/JPA and multithreading. Special attention will be paid to the process of application development, debugging and testing. Emphasis is also placed on the development of a web server application.
The course objectives are to explain the need for Large Scale Machine Learning (LSML) and how it differs from traditional Machine Learning (ML); to provide a theoretical understanding of basic ML algorithms that work with big data; and to teach students to use existing LSML programs through practical exercises and modeling.
Among the problems and methods studied in this course are various modifications of the gradient method, the conjugate gradient method, Newton's method, self-concordant functions, results from convex analysis, minimization of non-smooth functions, non-smooth unconstrained minimization, the projection method, methods of stochastic search, constrained extremum problems, dual problems, the interior point method (central path following), regression problems, applications to classification, and compressed sensing.
The course introduces students to practical aspects of working within a group of peers. We discuss ways of splitting tasks and facilitating productive meetings in situations of professional disagreement and differing styles and personalities. The course is heavily centered around discussions and exploration of practical examples. Students are split into groups and given an opportunity to observe the described phenomena.
The course covers methods for efficient, structured and organised presentation of technical data in written and oral form. It introduces common structures and formats for technical documents, ranging from workplace email communication and presentations to software requirements, API documentation and conference presentations. Students are taught to recognise their audience and to present information that meets its needs at an appropriate technical level. Students will be introduced to professional writing and presentation instruments and will gain introductory experience using such tools through extensive exercises. Additionally, the course focuses on the creation of visual materials, including diagrams and charts.
The course will introduce students to methods of assessing how difficult certain computational problems are to solve, as well as to the limits of mathematical algorithms and computers. The issues and challenges discussed in this course include computational models, complexity assessment, polynomially solvable problems, polynomial algorithms, hierarchy theorems and their use as evidence of solvability, polynomial reductions, reducibility in NP, proofs of NP-completeness, NP-complete problems, approximate solution of optimization problems, problems in the polynomial hierarchy and PSPACE, probabilistic polynomial algorithms, PSPACE-completeness, circuit complexity, first-order complexity, interactive proofs, interactive protocols, and one-way functions and their use in cryptography.
Students learn about challenges often encountered during collaborative work on a large project. These challenges stem from a professional reality that involves frequently changing requirements, imperfect effort estimates, frequent changes of direction in the execution of the project, and a lack of coordination between team members. The course introduces proven techniques to manage these challenges. Successful project management allows teams to control costs, manage risks and meet deadlines. Students will learn methods to structure technical projects, identify key stages and tasks, determine task dependencies, assess the level of effort and design a project plan. We introduce popular project management software and offer students an opportunity to plan their course project.
This course will cover the following important topics: general nonlinear optimization and its complexity, lower complexity bounds for smooth functions, fast gradient methods, lower complexity bounds for non-smooth functions, subgradient methods, polynomial algorithms, structural optimization, polynomial interior point methods, the most important applications of interior point methods, and many others.
By this point, students will have accumulated a broad knowledge of probabilistic and statistical methods and tools with which a variety of data can be analyzed. This course will cover a substantial part of these techniques and will consist of three modules.
In their first year, Master's programme students will work on identifying the approach for implementing the project. This will include the creation of a development plan and the implementation of a prototype. At the end of the year students will submit an outline document detailing the progress, the results of the literature research and a description of the prototype. Students will also rehearse a presentation for their mentor, in preparation for the end-of-programme presentation that will take place at the end of the final year.
The university will offer regular open lectures by professors, experts and key figures in the technology field. Students in the data science programme are required to attend many of the lectures and submit a write-up describing what they learned during the talk. During the first year, students will be required to describe the problem statement and its significance and to outline the presented approach to the solution.
A significant part of the year will be allocated to the completion of the capstone project. By completing the programme, students will learn to conduct data analysis on any scale, develop the software necessary for analysis and present the results in a professional and efficient way.
The purpose of the ‘Parallel and Distributed Computing’ course is to acquaint students with the principles of organization, the technologies, and the place and role of distributed and parallel computing in the field of information technology. Students will work through practical exercises to consolidate the material and to prepare for further study of modern network computing tools and their effective application in research. The course covers many methods, because modern big data analysis is deep and diverse.
A continuation of Statistical Data Analysis - 1
This course focuses on techniques for software design in the development of large and complex software systems. Topics will include software architecture, modeling (including UML), object-oriented design patterns, and processes for carrying out analysis and design.
This course will focus on gradient methods of convex optimization under certain relaxations of the ability to compute gradients. In particular, the course will focus on:
1. Randomized methods
2. Dual methods
3. Opportunities for parallelization
4. Exploiting sparsity
5. Markov Chain Monte Carlo (MCMC) methods
6. Non-gradient methods
7. Coordinate descent methods
Students will go over applications of these methods to the problems of ranking web pages and finding transport and economic equilibria in large networks.
The course will introduce students to the basic concepts of modern cryptography, will then go over methods of synthesis and analysis of cryptographic protocols, and will finally explore numerous protocols required for various applications. These include protocols for authenticated key distribution based on private-key cryptography, protocols for key exchange based on public-key cryptography, protocols for authenticated key establishment based on password information, and more.
This course will discuss methods of working with big data using the MapReduce programming model.
This course will go over modern distributed databases, which are necessary for both the storage and processing of big data. In particular, this course will cover HBase, Apache Cassandra and others.
How does one teach a computer to determine that one text is about sports and another about politics? What if we want to go further and implement a search for similar texts or automatic keyword detection?
A lot of interesting tasks are related to text mining. They range from the classic problem of building spam filters to more extravagant undertakings, such as predicting stock exchange quotations from Twitter messages. This course will discuss the various problems of text mining and the mathematics behind them. We will also learn how to solve some of these problems through practical examples.
One of the important objectives of modern data analysis is understanding complex networks including social networks, which are the most important type of these complex networks. This course will discuss the principles and models for the formation of social networks as well as the dissemination of information within these social networks.
This course focuses on the presentation of the methods and results related to the theory of modern robust (stable) optimization, which is the building block of many practical applications. This course will pay special attention to the examination of robust linear optimization problems, conic duality, and robust conic programming.
This is an important and modern course in which students will spend two semesters examining several different algorithms for image and video recognition. In particular, the course will focus on image processing, Internet vision, computer vision, optical flow etc.
This course will go over the modern problems and methods of information search and retrieval. One of the main topics covered in this course will be the ranking of documents based on the search query.
This course will give students an introduction to modern auction theory. The programme includes the following topics: reserve price, revenue equivalence theorem, mechanisms, optimal mechanism, efficient mechanism, price-driven auctions, multi-unit auctions, and the generalized second-price auction.
A continuation of Statistical Data Analysis - 1 and 2
The many issues examined in this course, which is important both for theory and practice, include: Hartley function, topics on sorting, topics on communication protocols, application of the rectangle method, Shannon entropy, the logic of knowledge, conditional Shannon entropy and the amount of information, coding with a small average code length, information inequality, Shannon limit, text encryption, use of Shannon entropy in statistics, forecasting, Kolmogorov complexity, conditional complexity, PAC learning, Vapnik-Chervonenkis (VC) dimension.
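As an illustration of one of the listed notions, the Shannon entropy of a sequence can be estimated from its empirical symbol frequencies as H = -Σ p(x) log2 p(x). The sketch below is a minimal example and not course material; the function name `shannon_entropy` is chosen here for illustration only.

```python
import math
from collections import Counter


def shannon_entropy(symbols):
    """Empirical Shannon entropy, in bits per symbol, of a sequence."""
    counts = Counter(symbols)
    total = sum(counts.values())
    # Sum -p * log2(p) over the observed symbols only, so p is never zero.
    return -sum(
        (count / total) * math.log2(count / total)
        for count in counts.values()
    )
```

A sequence of one repeated symbol has entropy 0, while a sequence in which four symbols appear equally often has entropy 2 bits per symbol.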
This course will explore the following topics: literal translation model, language model, rule-based machine translation, syntax-oriented analysis and synthesis, phrase-based translation models, hierarchical and syntax-based translation models etc.
The course introduces popular data visualization packages. We discuss methods and approaches to ad-hoc data visualizations as well as factors that make visualization clear, informative and attractive.
For the analysis of character sequences that occur in a variety of practical problems, it is useful to apply the methods of bioinformatics, which deals to a large extent with the analysis of DNA and protein sequences. The various topics covered in this course include: introduction to the analysis of character sequences, dynamic programming in graphs, dynamic programming for hypergraphs, pairwise alignment of sequences, other methods of sequence comparison in general, search for local similarities, multiple sequence alignment, search for multiple local similarities, examples of multiple sequence comparisons, probabilistic models of sequence families, examples of the use of cell mixture model (CMM).
This course will cover the following topics: Laplacian matrix, adjacency matrix, eigenvalues of the graph and their estimates, graph conductance, Cheeger inequality, combinatorial and geometric expanders, random walks, pseudo-random generators and random walks on expander graphs, concentration inequalities for graph spectra, spectra of random graphs, basic concepts about codes, expander codes.
For data analysis on the Internet, it is extremely important to be able to work with the Internet as a graph, where the vertices are web pages and the edges are hyperlinks. As it happens, this graph has a definite "topology." The course will discuss this topic and how to use this knowledge to analyze the Internet and other similar complex networks, including social, biological and inter-bank networks. In addition, the course will discuss modern algorithms on large graphs, such as PageRank, which is used to rank web pages in search results, and epidemics on graphs, i.e. the spread of real epidemics as well as of information in social networks.
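To make the web-as-a-graph view concrete, the following is a minimal sketch of PageRank computed by power iteration on a small link graph. It is an illustration rather than course material: the function name `pagerank`, the damping factor of 0.85 and the fixed iteration count are assumptions chosen for the example, and every linked page is assumed to appear as a key of the input dictionary.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank scores by power iteration.

    links maps each page to the list of pages it links to. Every link
    target is assumed to also be a key of links.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a baseline share from random jumps.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank uniformly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank
```

For example, with `links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}`, page `c` receives links from both other pages and ends up ranked above `b`, which is linked only from `a`.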
Students will complete the project in their final year. By the end of year two, they will finish the development of software, testing, deployment, data acquisition and analysis, preparation of their project report and documentation and the final presentation. The project will be presented to peers, mentors, the programme director, academic and industrial partners as well as venture capital organisations.