FONDAMENTI DI DATA SCIENCE E MACHINE LEARNING

Giuseppe POLESE FONDAMENTI DI DATA SCIENCE E MACHINE LEARNING

0522500135
COMPUTER SCIENCE
EQF7
COMPUTER SCIENCE
2024/2025



YEAR OF COURSE 1
YEAR OF DIDACTIC SYSTEM 2016
SPRING SEMESTER
CFU    HOURS    ACTIVITY
9      72       LESSONS
Objectives
THE MAIN GOAL OF THE COURSE IS TO PROVIDE THE STUDENTS WITH THE METHODOLOGICAL AND TECHNOLOGICAL SKILLS NECESSARY TO EXTRACT KNOWLEDGE FROM BIG DATA COLLECTIONS, EXPLOITING DATA PROFILING, DATA MINING, AND MACHINE LEARNING TECHNIQUES, USING PROPER VISUALIZATION TECHNIQUES TO EXPLAIN RESULTS. IN PARTICULAR, THE COURSE AIMS TO COMPLEMENT SKILLS ACQUIRED DURING PREVIOUS DATA MANAGEMENT COURSES WITH SKILLS USEFUL FOR THE DATA SCIENTIST PROFESSIONAL PROFILE.
THE MAIN KNOWLEDGE AREAS THAT WILL BE ACQUIRED ARE:
•BIG DATA
•DATA WRANGLING
•AUTOMATIC DISCOVERY OF FUNCTIONAL DEPENDENCIES FROM BIG DATA
•DATA QUALITY AND DATA CLEANSING
•DATA INTEGRATION
•DATA MINING
•MAPREDUCE
•SIMILARITY FUNCTIONS
•MACHINE LEARNING
•NEURAL NETWORKS

THE MAIN SKILLS (I.E., THE ABILITY TO APPLY THE ACQUIRED KNOWLEDGE) WILL BE:
•ACQUIRE, ORGANIZE, MANAGE AND PROCESS BIG DATA COLLECTIONS
•EXTRACT KNOWLEDGE FROM BIG DATA
•SELECT USEFUL DATA
•ORGANIZE A PROJECT BASED ON MACHINE LEARNING TECHNIQUES
•EFFECTIVELY COMMUNICATE KNOWLEDGE EXTRACTED FROM DATA THROUGH SEVERAL REPRESENTATION FORMATS, INCLUDING VISUAL ONES.
Prerequisites
STUDENTS SHOULD BE FAMILIAR WITH THE FUNDAMENTALS OF DATA MANAGEMENT, DISTRIBUTED SYSTEMS, THE OBJECT-ORIENTED PARADIGM, AND AT LEAST ONE PROGRAMMING LANGUAGE.
Contents

AFTER INTRODUCING THE NEW APPLICATION SCENARIOS RELATED TO THE MANAGEMENT OF DISTRIBUTED, HETEROGENEOUS, BIG DATA COLLECTIONS, INCLUDING THE POTENTIAL OF NEW TECHNOLOGIES CAPABLE OF EXTRACTING KNOWLEDGE FROM DATA, THE COURSE WILL FOCUS ON THE FOLLOWING TOPICS:
BIG DATA (4 HOURS OF THEORY)
•ISSUES RELATED TO BIG DATA (2 HOURS OF THEORY)
•OVERVIEW OF TECHNOLOGIES FOR BIG DATA (2 HOURS OF THEORY)

DATA PREPARATION (12 HOURS OF THEORY)
•DATA PROFILING (4 HOURS OF THEORY)
•RELAXED FUNCTIONAL DEPENDENCIES AND THEIR USE IN DATA QUALITY (4 HOURS OF THEORY)
•DATA INTEGRATION FROM MULTIPLE DATA SOURCES (4 HOURS OF THEORY)

KNOWLEDGE DISCOVERY FROM BIG DATA SETS (12 HOURS OF THEORY)
•MAPREDUCE (4 HOURS OF THEORY)
•SIMILARITY EVALUATION (5 HOURS OF THEORY)
•INTRODUCTION TO DATA MINING (2 HOURS OF THEORY)
•APRIORI ALGORITHM (1 HOUR OF THEORY)

MACHINE LEARNING (24 HOURS OF THEORY)
•INTRODUCTORY CONCEPTS (4 HOURS OF THEORY)
•PHASES OF A MACHINE LEARNING PROJECT (5 HOURS OF THEORY)
•TRAINING MODELS (2 HOURS OF THEORY)
•CLASSIFICATION/REGRESSION (3 HOURS OF THEORY)
•DECISION TREES (2 HOURS OF THEORY)
•ENSEMBLE LEARNING AND RANDOM FOREST (2 HOURS OF THEORY)
•CLUSTERING (2 HOURS OF THEORY)
•DIMENSIONALITY REDUCTION (2 HOURS OF THEORY)
•SUPPORT VECTOR MACHINE (2 HOURS OF THEORY)

NEURAL NETWORKS (14 HOURS OF THEORY)
•INTRODUCTION TO NEURAL NETWORKS (2 HOURS OF THEORY)
•TENSORFLOW (2 HOURS OF THEORY)
•MULTILAYER PERCEPTRONS AND DEEP NEURAL NETWORKS (2 HOURS OF THEORY)
•CONVOLUTIONAL NEURAL NETWORKS (2 HOURS OF THEORY)
•RECURRENT NETWORKS (4 HOURS OF THEORY)
•AUTOENCODERS (2 HOURS OF THEORY)

TOOLS FOR DATA SCIENCE (6 HOURS OF THEORY)
•PYTHON LANGUAGE (4 HOURS OF LECTURES)
•WEKA (2 HOURS OF LECTURES)
Teaching Methods
THE COURSE INCLUDES 66 HOURS OF LECTURES ON THEORETICAL TOPICS AND 6 HOURS ON PROGRAMMING LANGUAGES AND TOOLS, AIMING TO INTRODUCE CONCEPTS AND TO DEVELOP THE ABILITY TO DESIGN AND IMPLEMENT SOLUTIONS FOR DATA SCIENCE AND MACHINE LEARNING PROBLEMS. COURSE CONTENTS ARE PRESENTED THROUGH POWERPOINT SLIDES, STIMULATING CRITICAL DISCUSSION WITH THE STUDENTS. FOR EACH PRESENTED TOPIC, THE INSTRUCTOR WILL ILLUSTRATE POTENTIAL TASKS ON WHICH A STUDENT OR A GROUP CAN DEVELOP THE COURSE PROJECT. LANGUAGES AND TOOLS ARE ALSO INTRODUCED THROUGH POWERPOINT SLIDES, WHICH PRESENT THE CONCEPTS AND POINT TO ADDITIONAL RESOURCES SUCH AS FORUMS, MANUALS, AND OTHER SITES. IN ADDITION, DURING OFFICE HOURS STUDENTS CAN ASK FOR SUPPORT ON SIMULATIONS THEY HAVE RUN ON THEIR PERSONAL COMPUTERS, ASK FOR CLARIFICATIONS, AND SOLVE TECHNICAL PROBLEMS WITH THE ASSISTANCE OF THE INSTRUCTOR.
Verification of learning
THE ACHIEVEMENT OF THE COURSE OBJECTIVES IS CERTIFIED BY MEANS OF AN EXAM, WHOSE FINAL GRADE IS EXPRESSED ON A SCALE OF 30. THE EXAM CONSISTS OF A WRITTEN TEST (ALTERNATIVELY, A MIDTERM WRITTEN TEST) AND AN ORAL EXAMINATION. MOREOVER, STUDENTS CAN OPTIONALLY DEVELOP A COURSE PROJECT TO INCREASE THE GRADE ACHIEVED WITH THESE TWO EXAMINATIONS. THE WRITTEN TEST (OR THE MIDTERM TEST) AIMS TO ASSESS THE UNDERSTANDING OF THEORETICAL CONCEPTS. THE ORAL EXAMINATION CONSISTS OF AN INTERVIEW WITH QUESTIONS ON THE THEORETICAL AND METHODOLOGICAL CONTENTS TAUGHT DURING THE COURSE, AIMING TO ASSESS THE LEVEL OF KNOWLEDGE AND UNDERSTANDING, AS WELL AS THE ABILITY TO PRESENT CONCEPTS. THE ORAL EXAMINATION IS THE FINAL STEP OF THE EXAM; HENCE IT CAN BE TAKEN ONLY AFTER PASSING THE WRITTEN TEST AND, IF THE STUDENT HAS CHOSEN TO DEVELOP A COURSE PROJECT, AFTER COMPLETING AND DISCUSSING THE PROJECT.
THE PROJECT AIMS TO ASSESS THE ABILITY TO APPLY THE ACQUIRED KNOWLEDGE IN REAL SCENARIOS. IT CAN BE CARRIED OUT INDIVIDUALLY OR IN TEAMS OF UP TO 3 STUDENTS, CHOOSING FROM A RANGE OF PROPOSALS PROVIDED BY THE INSTRUCTOR. DURING THE PROJECT DEVELOPMENT, STUDENTS CAN INTERACT WITH THE INSTRUCTOR TO REPORT THE PROJECT’S PROGRESS AND POSSIBLE CRITICAL ISSUES, DISCUSSING THE GOALS OF THE PROJECT AND HOW TO PROCEED. AT THE END OF THE PROJECT, STUDENTS MUST DELIVER A TECHNICAL REPORT CONTAINING THE PROJECT DOCUMENTATION, BASED ON WHICH THEY WILL RECEIVE WITHIN A FEW WEEKS A FIRST EVALUATION OF THE PROJECT, POSSIBLY WITH SOME REQUESTS FOR INTEGRATION AND/OR REVISION OF THE REPORT. THUS, THE PROJECT REPORT MUST BE SUBMITTED SOME WEEKS BEFORE THE PERIOD IN WHICH THE STUDENT PLANS TO TAKE THE ORAL EXAMINATION AND CONCLUDE THE EXAM, SO AS TO GIVE THE INSTRUCTOR ENOUGH TIME TO EVALUATE THE PROJECT (ALSO CONSIDERING THAT SEVERAL PROJECTS MIGHT BE SUBMITTED IN THE SAME PERIOD), AND THE PROJECT TEAM ENOUGH TIME TO REVISE IT BASED ON THE INSTRUCTOR’S REMARKS. AFTER PROJECT COMPLETION, THE PROJECT TEAM MIGHT BE REQUESTED TO PREPARE A TALK WITH A POWERPOINT PRESENTATION OF ABOUT 30 MINUTES. AFTER THE PROJECT PRESENTATION, STUDENTS CAN TAKE THE INDIVIDUAL ORAL EXAMINATION.
THE FINAL GRADE, GIVEN ON A SCALE OF 30, IS GENERALLY THE AVERAGE OF THE GRADES ACHIEVED ON THE WRITTEN TEST (ALTERNATIVELY, THE MIDTERM WRITTEN TEST) AND THE ORAL EXAMINATION, WITH THE POSSIBILITY TO INCREASE IT BY UP TO 3 POINTS THROUGH THE OPTIONAL COURSE PROJECT.
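AS AN ILLUSTRATION ONLY, THE GRADING RULE ABOVE CAN BE SKETCHED IN PYTHON. THE FUNCTION NAME final_grade AND ITS PARAMETERS ARE HYPOTHETICAL, AND CAPPING THE RESULT AT 30 IS AN ASSUMPTION NOT STATED IN THE SYLLABUS. SINCE THE GRADE IS ONLY "GENERALLY" ASSIGNED THIS WAY, THE INSTRUCTOR'S ACTUAL EVALUATION MAY DIFFER.

```python
def final_grade(written: float, oral: float, project_bonus: float = 0.0) -> float:
    """Average the written and oral grades, then add the optional project bonus.

    All names here are hypothetical; this is a sketch of the stated rule,
    not the instructor's official procedure.
    """
    if not 0 <= project_bonus <= 3:
        raise ValueError("project bonus must be between 0 and 3 points")
    average = (written + oral) / 2
    # Assumption: the result is capped at the top of the 30-point scale.
    return min(average + project_bonus, 30.0)


print(final_grade(26, 28))     # 27.0
print(final_grade(26, 28, 2))  # 29.0
```

FOR EXAMPLE, A WRITTEN GRADE OF 26 AND AN ORAL GRADE OF 28 AVERAGE TO 27; A PROJECT WORTH 2 POINTS WOULD RAISE THE RESULT TO 29.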
Texts
1. JURE LESKOVEC, ANAND RAJARAMAN, JEFFREY D. ULLMAN, "MINING OF MASSIVE DATASETS", 2ND EDITION, CAMBRIDGE UNIVERSITY PRESS, 2014.
2. AURÉLIEN GÉRON, "HANDS-ON MACHINE LEARNING WITH SCIKIT-LEARN, KERAS, AND TENSORFLOW: CONCEPTS, TOOLS, AND TECHNIQUES TO BUILD INTELLIGENT SYSTEMS", 2ND EDITION, O'REILLY MEDIA, 2019.
3. ANDREAS C. MÜLLER, SARAH GUIDO, "INTRODUCTION TO MACHINE LEARNING WITH PYTHON: A GUIDE FOR DATA SCIENTISTS", O'REILLY MEDIA, 2016.
4. CHIRAG SHAH, "A HANDS-ON INTRODUCTION TO DATA SCIENCE", CAMBRIDGE UNIVERSITY PRESS, 2020.
5. FOSTER PROVOST, TOM FAWCETT, "DATA SCIENCE FOR BUSINESS: WHAT YOU NEED TO KNOW ABOUT DATA MINING AND DATA-ANALYTIC THINKING", O'REILLY MEDIA.
6. P. DEITEL, H. DEITEL, "INTRODUZIONE A PYTHON – PER L’INFORMATICA E LA DATA SCIENCE", PEARSON, 2021.
More Information
COURSE ATTENDANCE IS STRONGLY RECOMMENDED. STUDENTS MUST BE PREPARED TO SPEND A FAIR AMOUNT OF TIME STUDYING OUTSIDE OF LESSONS. FOR SATISFACTORY PREPARATION, STUDENTS NEED ON AVERAGE ONE HOUR OF STUDY FOR EACH HOUR SPENT IN CLASS, AND ABOUT 80 HOURS TO DEVELOP THE PROJECT.

COURSE MATERIALS WILL BE AVAILABLE FOR DOWNLOAD FROM THE DEPARTMENTAL E-LEARNING PLATFORM: HTTP://ELEARNING.INFORMATICA.UNISA.IT/EL-PLATFORM/

CONTACTS
PROF. GIUSEPPE POLESE
GPOLESE@UNISA.IT
Data source: ESSE3 [Last synchronization: 2024-11-18]