ADVANCED STATISTICAL MODELING FOR BIG DATA

Michele LA ROCCA

0222800010
DEPARTMENT OF MANAGEMENT & INNOVATION SYSTEMS
EQF7
DATA SCIENCE E GESTIONE DELL'INNOVAZIONE
2024/2025

COMPULSORY
YEAR OF COURSE 2
YEAR OF DIDACTIC SYSTEM 2022
AUTUMN SEMESTER
CFU    HOURS    ACTIVITY
9      63       LESSONS
Exam        Date
LA ROCCA    09/06/2025 - 14:30
LA ROCCA    24/06/2025 - 14:30
LA ROCCA    08/07/2025 - 14:30
LA ROCCA    02/09/2025 - 14:30
Objectives
KNOWLEDGE AND UNDERSTANDING SKILLS

THE TEACHING AIMS TO PROVIDE THE FOLLOWING:
- KNOWLEDGE OF THE ANALYSIS OF ADVANCED STATISTICAL MODELS USEFUL FOR UNDERSTANDING PROBLEMS AND IMPROVING DECISION-MAKING PROCESSES;
- KNOWLEDGE OF ADVANCED STATISTICAL MODELS AND STATISTICAL LEARNING TOOLS USEFUL AS DECISION SUPPORT FOR PHENOMENA AND SYSTEMS IN WHICH LARGE AMOUNTS OF DATA, VARIABILITY AND UNCERTAINTY IMPLY A LEVEL OF COMPLEXITY THAT IS UNMANAGEABLE USING TRADITIONAL TECHNIQUES;
- THE ABILITY TO ANALYSE AND INTERPRET COMPLEX DATA AND TO PRODUCE PREDICTIVE AND ANALYTICAL MODELS THAT SUPPORT COMPANY MANAGEMENT AND CONTROL POLICIES IN THE PUBLIC AND PRIVATE SECTORS.

ABILITY TO APPLY KNOWLEDGE AND UNDERSTANDING

ALL STATISTICAL MODELS WILL BE PRESENTED AS PREDICTIVE AND ANALYTICAL/INTERPRETATIVE TOOLS TO UNDERSTAND THE PROBLEMS IN A GENERAL DECISION-MAKING PROCESS.

IN PARTICULAR, STUDENTS WILL DEVELOP THE ABILITY TO SPECIFY, ESTIMATE, AND VALIDATE A BROAD CLASS OF STATISTICAL MODELS WHEN APPLIED TO COMPLEX DATA STRUCTURES.

A SPECIFIC FOCUS WILL BE GIVEN TO THE MODERN TOOLS AVAILABLE FOR MANAGING AND ANALYSING BIG DATA, AND TO THE STATISTICAL PROGRAMMING LANGUAGES USED TO DEVELOP AND IMPLEMENT EFFECTIVE ANALYTICAL SOLUTIONS. DIFFERENT CASE STUDIES WILL BE PRESENTED AND DISCUSSED TO BUILD STUDENTS' ABILITY TO APPLY THEIR KNOWLEDGE TO THE ANALYSIS OF REAL PROBLEMS AND DATASETS.
Prerequisites
KNOWLEDGE OF CALCULUS AND MATRIX CALCULUS, BASIC PROGRAMMING, THE STATISTICAL LANGUAGE R, PROBABILITY, AND STATISTICAL INFERENCE IS REQUIRED.
Contents
REGRESSION MODELS, PREDICTIVE MODELS AND ANALYTICAL MODELS. PROBABILITY MODELS FOR NON-GAUSSIAN DATA. THE EXPONENTIAL FAMILY. GENERALIZED LINEAR MODELS (GLM). MODELS FOR GAUSSIAN DATA. MODELS FOR NON-GAUSSIAN CONTINUOUS DATA. MODELS FOR BINARY DATA. MODELS FOR COUNT DATA. TWO-PART MODELS. LINEAR MODELS AND GLMS FOR BIG DATA. ESTIMATION OF MANY MODELS ON DIFFUSED DATASETS. ESTIMATION IN THE PRESENCE OF HIGH DIMENSIONALITY. PENALIZED ESTIMATION FOR GLMS: RIDGE AND LASSO. GENERALIZATIONS OF THE LASSO. ELASTIC NET. THE GROUP LASSO. THE FUSED LASSO. ESTIMATION OF STATISTICAL MODELS IN SPARK. LINEAR MODELS AND GLMS FOR BIG DATA IN R. PENALIZED ESTIMATION IN R. CASE STUDIES AND APPLICATIONS TO NOTABLE PROBLEMS.
FOR THE STUDENTS OF DATA SCIENCE AND INNOVATION MANAGEMENT, THERE WILL BE AN ADDITIONAL LESSON (3 HOURS) TO PRESENT AND DISCUSS APPLICATIONS OF DATA SCIENCE TO MANAGEMENT PROBLEMS.
Teaching Methods
THE COURSE INCLUDES 60 HOURS OF CLASSROOM TEACHING. ALTHOUGH ATTENDANCE IS NOT MANDATORY, GIVEN THE NATURE OF THE COURSE, IT IS STRONGLY RECOMMENDED.
DURING THE LESSONS, THEORETICAL ISSUES WILL BE DISCUSSED AND CONSTANTLY SUPPORTED BY THE IMPLEMENTATION OF THE PROPOSED METHODOLOGIES IN THE STATISTICAL LANGUAGE R AND BY THE PRESENTATION OF CASE STUDIES. THE CASE STUDIES WILL ILLUSTRATE HOW THE TECHNIQUES ARE IMPLEMENTED IN PRACTICE, THE CONTEXTS IN WHICH THE DIFFERENT TOOLS ARE USED, AND THE POSSIBLE INTERPRETATIONS OF THE RESULTS OBTAINED. THE EXERCISES WILL THEREFORE BE AN INTEGRAL PART OF THE PLANNED LESSONS. THE COURSE WILL BE DELIVERED IN WEB-ENHANCED MODE: THE LECTURES WILL BE ACCOMPANIED BY A WEB SPACE FOR DISTRIBUTING HANDOUTS, SUPPLEMENTARY READINGS, DATASETS FOR DEVELOPING CASE STUDIES, AND ANY CLARIFICATIONS ON THE TOPICS COVERED IN THE COURSE. THE WEB SPACE IS NOT AN ALTERNATIVE TO OR A REPLACEMENT FOR IN-PERSON LESSONS.
Verification of learning
THE STUDENT WILL BE EVALUATED DURING THE FINAL EXAM, WHICH WILL BE HELD ON THE DEPARTMENT'S SCHEDULED EXAM DATES.
DURING THE FINAL EXAM, THE STUDENT MUST DISCUSS A PROJECT WORK AND TAKE AN ORAL TEST ON THE TOPICS IN THE PROGRAM. THE PROJECT WORK MUST BE AGREED UPON WITH THE TEACHER DURING THE COURSE AND CARRIED OUT INDIVIDUALLY. IT AIMS TO EVALUATE THE STUDENT'S ABILITY TO SPECIFY AND VALIDATE NEURAL MODELS TO SOLVE A SPECIFIC PROBLEM AND TO COMMUNICATE THE RESULTS THROUGH A STATISTICAL REPORT. THE EVALUATION OF THE PROJECTS WILL TAKE INTO ACCOUNT THE FOLLOWING ASPECTS:
- EFFECTIVE FORMULATION AND FRAMING OF THE PROBLEM WITH CLEAR RESEARCH QUESTIONS;
- CORRECTNESS AND EFFECTIVENESS OF THE SPECIFICATION AND VALIDATION OF THE MODELS PROPOSED FOR THE SOLUTION OF THE PROBLEMS FORMULATED IN THE PREVIOUS POINT;
- CORRECTNESS AND EFFICIENCY OF THE COMPUTATIONAL SOLUTIONS ADOPTED;
- CORRECTNESS AND EFFECTIVENESS OF THE COMMENTS ON THE RESULTS OBTAINED;
- CONTENT, STRUCTURE AND COMMUNICATIVE EFFECTIVENESS OF THE REPORT.
THE FINAL GRADE, AWARDED OUT OF THIRTY, WILL CONSIDER THE QUALITY OF THE PROJECT WORK DEVELOPED, THE LEVEL OF THEORETICAL KNOWLEDGE ACQUIRED ON THE TOPICS IN THE PROGRAM, THE AUTONOMY OF ANALYSIS AND JUDGMENT, AND THE STUDENT'S PRESENTATION SKILLS IN THE ORAL TEST.
Texts
LECTURE NOTES, WEB SITES AND SUGGESTED PAPERS WILL BE MADE AVAILABLE BY THE INSTRUCTOR DURING SCHEDULED CLASSES.
- GENERALIZED LINEAR MODELS FOR INSURANCE DATA, PIET DE JONG AND GILLIAN Z. HELLER, CAMBRIDGE UNIVERSITY PRESS
- MASTERING SPARK WITH R, JAVIER LURASCHI, KEVIN KUO AND EDGAR RUIZ, O'REILLY
More Information
THE INSTRUCTOR PROVIDES FURTHER EXPLANATIONS AND METHODOLOGICAL SUPPORT TO STUDENTS DURING OFFICE HOURS.
THE DAYS, TIMES AND PLACE OF THE OFFICE HOURS, AS WELL AS ANY CHANGES, ARE COMMUNICATED ON THE INSTRUCTOR’S WEB PAGE.
AN APPOINTMENT OUTSIDE THE SCHEDULED OFFICE HOURS CAN BE ARRANGED BY SENDING AN EMAIL TO THE INSTRUCTOR.