BIG DATA

Vincenzo DEUFEMIA

0522700002
COMPUTER SCIENCE
EQF7
CYBERSECURITY AND CLOUD TECHNOLOGIES
2023/2024

COMPULSORY
YEAR OF COURSE 1
YEAR OF DIDACTIC SYSTEM 2023
SPRING SEMESTER
CFU  HOURS  ACTIVITY
7    56     LESSONS
2    16     LAB
Objectives
The course aims to provide students with the methodological and technological skills necessary for the design of large data collections, including collections and streams of data generated by sensors or social networks, possibly using distributed architectures and distributed paradigms for their processing. It also aims to provide students with the tools needed to perform complex analyses of such data by means of data and sequence mining algorithms and machine learning models, including neural networks, and to help them understand the main techniques for visualizing the results.

Knowledge and Understanding
The student:
• will know the methodological and technological foundations for acquiring, organizing and processing large data collections distributed in the cloud, also in compliance with data privacy regulations;
• will understand which are the most appropriate techniques to use in different real world applications to extract knowledge from large collections of more or less structured data, including sensor data and social network data;
• will know the main data preparation techniques and the main supervised and unsupervised machine learning models, acquiring the ability to use them in real-world applications;
• will understand the main concepts related to datasets, to address problems in emerging disciplines such as anomaly detection and recommender systems;
• will know the main techniques for visual communication of the results of big data analytics, data mining and machine learning applications.

Applying knowledge and understanding
The student will be able to:
• design and implement systems for processing large collections of complex, distributed and heterogeneous data, choosing the most appropriate architectures and models for the problem under examination;
• inspect large datasets in order to identify privacy problems;
• preformat large datasets to prepare them for use in predictive model training;
• use languages and tools to perform complex analyses of datasets and streaming data, as well as to train predictive models, including models based on neural networks.

Autonomy of judgment
The student will acquire theoretical and practical skills that will allow them to develop useful solutions in various disciplinary fields. In any context in which large amounts or series of data are generated, the student will be able to identify the most appropriate architectural solutions, models and algorithms to extract useful knowledge from such data and to address tasks such as preserving privacy and security, identifying anomalies, and monitoring patients. The student will be able to rely not only on the wealth of methods covered during the course, but also on the ability to independently explore further methods in the literature, thanks to an acquired mastery of the fundamental concepts.

Communication skills
- The student will acquire the necessary communication skills to obtain the active involvement of stakeholders from various application domains in design activities, in order to make shared design choices, as well as to obtain useful information for the design process, such as labeling datasets for supervised training.
- The student will also acquire communication skills to make stakeholders of various application domains understand the results of complex data analysis activities and predictive models, while acquiring the ability to choose the visual notations and terminologies most appropriate to the stakeholder's know-how of a specific application domain.

Learning skills
The student will be able to learn the dynamics of complex data processing in various application fields. Thanks to an acquired mastery of the techniques studied during the course, the student will be able to understand the sustainability and ethical limits of their applicability, as well as how the technological solutions of a specific application context work and how the methods and techniques studied can improve their use.
Prerequisites
STUDENTS SHOULD BE FAMILIAR WITH FUNDAMENTALS OF DATA MANAGEMENT, DISTRIBUTED SYSTEMS, OBJECT-ORIENTED PARADIGM, AND A PROGRAMMING LANGUAGE.
Contents
AFTER INTRODUCING THE NEW APPLICATION SCENARIOS RELATED TO THE MANAGEMENT OF DISTRIBUTED AND HETEROGENEOUS DATA, INCLUDING THE POTENTIALS OF NEW TECHNOLOGIES CAPABLE OF EXTRACTING KNOWLEDGE FROM DATA, THE COURSE WILL FOCUS ON THE FOLLOWING TOPICS:

BIG DATA (8 HOURS OF THEORY)
•BIG DATA ISSUES (2 HOURS OF THEORY)
•TECHNOLOGIES FOR BIG DATA (1 HOUR OF THEORY)
•MAPREDUCE (3 HOURS OF THEORY)
•HADOOP DISTRIBUTED FILE SYSTEM (HDFS) (2 HOURS OF THEORY)

DISTRIBUTED DATABASE CONCEPTS (8 HOURS OF THEORY)
•DATA FRAGMENTATION, REPLICATION, AND ALLOCATION TECHNIQUES FOR DISTRIBUTED DATABASE DESIGN (2 HOURS OF THEORY)
•CONCURRENCY CONTROL IN DISTRIBUTED DATABASES (1 HOUR OF THEORY)
•TRANSACTION MANAGEMENT IN DISTRIBUTED DATABASES (1 HOUR OF THEORY)
•QUERY PROCESSING IN DISTRIBUTED DATABASES (1 HOUR OF THEORY)
•TYPES OF DISTRIBUTED DATABASE SYSTEMS (1 HOUR OF THEORY)
•DISTRIBUTED DATABASE ARCHITECTURES (1 HOUR OF THEORY)
•DISTRIBUTED CATALOG MANAGEMENT (1 HOUR OF THEORY)

NOSQL DATABASES AND BIG DATA STORAGE SYSTEMS (6 HOURS OF THEORY)
•INTRODUCTION TO NOSQL SYSTEMS (2 HOURS OF THEORY)
•DOCUMENT-BASED NOSQL SYSTEMS AND MONGODB (1 HOUR OF THEORY)
•NOSQL KEY-VALUE STORES (1 HOUR OF THEORY)
•COLUMN-BASED OR WIDE COLUMN NOSQL SYSTEMS (1 HOUR OF THEORY)
•NOSQL GRAPH DATABASES AND NEO4J (1 HOUR OF THEORY)

DATABASE SECURITY (8 HOURS OF THEORY)
•INTRODUCTION TO DATABASE SECURITY ISSUES (1 HOUR OF THEORY)
•ACCESS CONTROL IN DATABASES (2 HOURS OF THEORY)
•SQL INJECTION (1 HOUR OF THEORY)
•INTRODUCTION TO STATISTICAL DATABASE SECURITY (1 HOUR OF THEORY)
•INTRODUCTION TO FLOW CONTROL (1 HOUR OF THEORY)
•PRIVACY ISSUES AND PRESERVATION (1 HOUR OF THEORY)
•CHALLENGES TO MAINTAINING DATABASE SECURITY (1 HOUR OF THEORY)

DATA STREAMS (5 HOURS OF THEORY)
•THE DATA STREAM MODEL (1 HOUR OF THEORY)
•SAMPLING DATA STREAMS (1 HOUR OF THEORY)
•MAIN ALGORITHMS FOR DATA STREAM ANALYSIS (3 HOURS OF THEORY)

DATA SERIES ANALYSIS (5 HOURS OF THEORY)
•INTRODUCTION TO DATA SERIES (1 HOUR OF THEORY)
•DATA SERIES REPRESENTATION (1 HOUR OF THEORY)
•DISTANCE MEASURES FOR DATA SERIES (2 HOURS OF THEORY)
•ANALYSIS METHODS (1 HOUR OF THEORY)

MACHINE LEARNING (16 HOURS OF THEORY)
•INTRODUCTORY CONCEPTS (1 HOUR OF THEORY)
•DATA PREPARATION (2 HOURS OF THEORY)
•TRAINING MODELS (1 HOUR OF THEORY)
•CLASSIFICATION/REGRESSION (1 HOUR OF THEORY)
•DATA MINING AND APRIORI ALGORITHM (2 HOURS OF THEORY)
•DECISION TREES (2 HOURS OF THEORY)
•ENSEMBLE LEARNING AND RANDOM FOREST (1 HOUR OF THEORY)
•CLUSTERING (2 HOURS OF THEORY)
•OVERVIEW ON DIMENSIONALITY REDUCTION TECHNIQUES (1 HOUR OF THEORY)
•INTRODUCTION TO NEURAL NETWORKS (1 HOUR OF THEORY)
•MULTILAYER PERCEPTRONS AND DEEP NEURAL NETWORKS (2 HOURS OF THEORY)


LABORATORY (16 HOURS)
•FRAMEWORKS FOR DISTRIBUTED DATA PROCESSING (4 HOURS)
•PYTHON LANGUAGE (4 HOURS)
•SCIKIT-LEARN (2 HOURS)
•TOOLS FOR DATA VISUALIZATION (2 HOURS)
•TENSORFLOW (2 HOURS)
•CASE STUDIES (2 HOURS)
Teaching Methods
THE COURSE INCLUDES 56 HOURS OF LECTURES ON THEORETICAL TOPICS AND 16 HOURS ON TOOLS, AIMING TO INTRODUCE CONCEPTS AND TO DEVELOP THE ABILITY TO ANALYZE BIG DATA PROBLEMS, WITH PARTICULAR EMPHASIS ON DATA PRIVACY AND SECURITY ASPECTS, AND TO IMPLEMENT EFFECTIVE SOLUTIONS. COURSE CONTENTS ARE PRESENTED THROUGH POWERPOINT SLIDES, STIMULATING CRITICAL DISCUSSIONS WITH THE STUDENTS. FOR EACH PRESENTED TOPIC, THE INSTRUCTORS WILL ILLUSTRATE POTENTIAL TASKS ON WHICH A STUDENT OR A GROUP CAN DEVELOP THE COURSE PROJECT. TOOLS ARE ALSO INTRODUCED THROUGH POWERPOINT SLIDES, WHICH PRESENT THE CONCEPTS AND ADDITIONAL RESOURCES SUCH AS LINKS TO FORUMS, MANUALS, AND OTHER SITES. IN ADDITION, DURING OFFICE HOURS STUDENTS CAN ASK FOR SUPPORT ON SIMULATIONS THEY HAVE PERFORMED ON THEIR PERSONAL COMPUTERS, ASK FOR CLARIFICATIONS, AND SOLVE TECHNICAL PROBLEMS WITH THE ASSISTANCE OF THE INSTRUCTORS.
Verification of learning
THE ACHIEVEMENT OF THE COURSE OBJECTIVES IS CERTIFIED BY MEANS OF AN EXAM, WHOSE FINAL GRADE IS EXPRESSED ON A SCALE OF 30. THE EXAM CONSISTS OF A COURSE PROJECT AND AN ORAL EXAMINATION.
THROUGH THE COURSE PROJECT, STUDENTS MUST SHOW THEIR ABILITY TO APPLY THE ACQUIRED KNOWLEDGE IN REAL SCENARIOS. THE PROJECT CAN BE CARRIED OUT INDIVIDUALLY OR IN GROUPS OF UP TO 3 STUDENTS, AND CAN BE CHOSEN FROM A RANGE OF PROPOSALS PROVIDED BY THE INSTRUCTORS. DURING PROJECT DEVELOPMENT, STUDENTS CAN INTERACT WITH THE INSTRUCTORS TO REPORT ON THE PROJECT'S PROGRESS AND ANY CRITICAL ISSUES, AND TO DISCUSS ITS GOALS AND HOW TO PROCEED. AT THE END, STUDENTS SHOULD DELIVER A TECHNICAL REPORT CONTAINING THE PROJECT DOCUMENTATION AND GIVE A POWERPOINT PRESENTATION LASTING ABOUT 30 MINUTES, EITHER TOGETHER WITH THE ORAL EXAMINATION OR BEFORE IT.
THE ORAL EXAMINATION MUST ALWAYS BE THE FINAL STEP. IT CONSISTS OF AN INTERVIEW WITH QUESTIONS ON THE THEORETICAL AND METHODOLOGICAL CONTENTS TAUGHT DURING THE COURSE, AIMING TO ASSESS THE LEVEL OF KNOWLEDGE AND UNDERSTANDING, AS WELL AS THE ABILITY TO PRESENT CONCEPTS.
THE FINAL GRADE IS THE AVERAGE OF THE GRADES, EACH EXPRESSED ON A SCALE OF 30, OBTAINED IN THE PROJECT AND IN THE ORAL EXAMINATION.
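AS AN ILLUSTRATION ONLY, THE AVERAGING RULE CAN BE SKETCHED AS FOLLOWS. THE SKETCH ASSUMES AN UNWEIGHTED MEAN OF THE TWO GRADES; THE WEIGHTS AND ANY ROUNDING RULE ARE NOT STATED HERE, AND THE FUNCTION NAME `final_grade` IS HYPOTHETICAL.

```python
def final_grade(project: float, oral: float) -> float:
    """Return the mean of two grades, each on a 30-point scale.

    Assumption: both grades carry equal weight; no rounding is applied,
    because the syllabus does not specify a rounding rule.
    """
    for grade in (project, oral):
        if not 0 <= grade <= 30:
            raise ValueError("grades must be between 0 and 30")
    return (project + oral) / 2


# Example: 28/30 on the project and 30/30 on the oral examination.
print(final_grade(28, 30))  # 29.0
```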
Texts
1. RAMEZ ELMASRI AND SHAMKANT B. NAVATHE, FUNDAMENTALS OF DATABASE SYSTEMS, 7TH EDITION, PEARSON, 2016.
2. P. ATZENI, S. CERI, P. FRATERNALI, S. PARABOSCHI, R. TORLONE, BASI DI DATI, 6TH EDITION, MCGRAW-HILL, 2023.
3. JURE LESKOVEC, ANAND RAJARAMAN, JEFFREY D. ULLMAN, MINING OF MASSIVE DATASETS, 3RD EDITION, CAMBRIDGE UNIVERSITY PRESS, 2020.
4. AURÉLIEN GÉRON, HANDS-ON MACHINE LEARNING WITH SCIKIT-LEARN, KERAS, AND TENSORFLOW: CONCEPTS, TOOLS, AND TECHNIQUES TO BUILD INTELLIGENT SYSTEMS, 2ND EDITION, O'REILLY, 2019.
5. P. DEITEL, H. DEITEL, INTRODUZIONE A PYTHON – PER L’INFORMATICA E LA DATA SCIENCE, PEARSON, 2021.
More Information
COURSE ATTENDANCE IS STRONGLY RECOMMENDED. STUDENTS MUST BE PREPARED TO SPEND A FAIR AMOUNT OF TIME STUDYING OUTSIDE OF LESSONS. FOR A SATISFACTORY PREPARATION, STUDENTS NEED TO SPEND AN AVERAGE OF ONE HOUR OF STUDY TIME FOR EACH HOUR SPENT IN CLASS AND ABOUT 80 HOURS DEVELOPING THE COURSE PROJECT.
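THE WORKLOAD ESTIMATE ABOVE CAN BE ADDED UP AS IN THE SHORT SKETCH BELOW. IT USES THE 56 LECTURE HOURS AND 16 LAB HOURS LISTED FOR THE COURSE; THE FIGURE OF 25 HOURS PER CFU IS AN ASSUMPTION ABOUT THE CREDIT SYSTEM, NOT SOMETHING STATED IN THIS SYLLABUS, AND ALL NAMES ARE HYPOTHETICAL.

```python
LECTURE_HOURS = 56          # from the course structure
LAB_HOURS = 16              # from the course structure
PROJECT_HOURS = 80          # approximate, as stated in the syllabus
STUDY_PER_CLASS_HOUR = 1    # one hour of study per hour in class

CFU = 9                     # 7 (lessons) + 2 (lab)
HOURS_PER_CFU = 25          # assumption: not stated in the syllabus


def estimated_workload() -> int:
    """Total hours: class time + individual study + course project."""
    class_hours = LECTURE_HOURS + LAB_HOURS
    study_hours = class_hours * STUDY_PER_CLASS_HOUR
    return class_hours + study_hours + PROJECT_HOURS


print(estimated_workload())   # 224
print(CFU * HOURS_PER_CFU)    # 225, under the assumed hours per CFU
```

UNDER THAT ASSUMPTION, THE ESTIMATE IS CLOSE TO THE NOMINAL WORKLOAD IMPLIED BY THE 9 CFU OF THE COURSE.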

COURSE MATERIALS WILL BE AVAILABLE FOR DOWNLOAD FROM THE DEPARTMENTAL E-LEARNING PLATFORM
HTTP://ELEARNING.INFORMATICA.UNISA.IT/EL-PLATFORM/

CONTACTS
PROF. GIUSEPPE POLESE
GPOLESE@UNISA.IT
PROF. VINCENZO DEUFEMIA
DEUFEMIA@UNISA.IT
Data source: ESSE3 [Last synchronization: 2024-11-05]