Michele LA ROCCA | STATISTICAL METHODS FOR BIG DATA
cod. 0212800016
DEPARTMENT OF ECONOMICS AND STATISTICS | |
EQF6 | |
STATISTICS FOR BIG DATA | |
2024/2025 |
COMPULSORY | |
YEAR OF COURSE 3 | |
YEAR OF DIDACTIC SYSTEM 2018 | |
AUTUMN SEMESTER |
SSD | CFU | HOURS | ACTIVITY | |
---|---|---|---|---|
SECS-S/01 | 5 | 30 | LESSONS |
Exam | Date | Session | |
---|---|---|---|
LA ROCCA | 10/12/2024 - 10:30 | ORDINARY SESSION | |
LA ROCCA | 10/12/2024 - 10:30 | MAKE-UP SESSION |
Objectives | |
---|---|
THE COURSE INTRODUCES THE MAIN METHODS, MODELS AND TECHNIQUES FOR THE STATISTICAL ANALYSIS OF REAL PROBLEMS, WITH PARTICULAR REFERENCE TO CASES WHERE THE UNDERLYING DATASET IS LARGE AND CANNOT BE MANAGED WITH STANDARD STATISTICAL METHODS. KNOWLEDGE AND UNDERSTANDING: THE COURSE INTRODUCES THE STUDENT TO THE MAIN TECHNIQUES FOR EXPLORING AND ANALYZING DATASETS CHARACTERIZED BY HIGH DIMENSIONALITY, IN TERMS OF BOTH THE NUMBER OF OBSERVATIONS AND THE NUMBER OF FEATURES. IN PARTICULAR, STUDENTS WILL LEARN BOTH THE BASIC THEORETICAL CONCEPTS AND THE COMPUTATIONAL SKILLS NECESSARY FOR THEIR CORRECT IMPLEMENTATION, INCLUDING THE TECHNIQUES THAT MAKE THE ANALYSIS SCALABLE AND APPLICABLE TO DISTRIBUTED DATASETS. THE TOPICS COVERED DURING THE COURSE WILL BE ACCOMPANIED BY EXERCISES ON REAL DATA DEVELOPED USING STATISTICAL SOFTWARE. ABILITY TO APPLY KNOWLEDGE AND UNDERSTANDING: ATTENDING THE COURSE WILL ENABLE THE STUDENT TO ACQUIRE THE FOLLOWING SKILLS: (I) THE ABILITY TO USE THE DESCRIPTIVE-EXPLORATORY AND INFERENTIAL METHODS NEEDED TO SUPPORT DECISIONS ABOUT PHENOMENA AND/OR SYSTEMS WHERE LARGE QUANTITIES OF DATA, VARIABILITY AND UNCERTAINTY CREATE A LEVEL OF COMPLEXITY THAT CANNOT BE ADDRESSED WITH OTHER TECHNIQUES; (II) THE ABILITY TO ANALYZE AND INTERPRET QUANTITATIVE INFORMATION, AND TO PRODUCE INDICATORS, STATISTICAL MODELS AND REPORTS TO SUPPORT DECISION-MAKING, PARTICULARLY IN AREAS CHARACTERIZED BY HIGH DIMENSIONALITY. |
Prerequisites | |
---|---|
KNOWLEDGE OF MATRIX CALCULUS, BASIC PROGRAMMING, THE STATISTICAL LANGUAGE R, PROBABILITY MODELS AND STATISTICAL INFERENCE, AND REGRESSION MODELS IS REQUIRED. |
Contents | |
---|---|
DATA ANALYSIS AND BIG DATA. BIG DATA: POTENTIAL AND PROBLEMS. THE FEATURES OF BIG DATA AND THEIR CONSEQUENCES FOR DATA ANALYSIS METHODS (6H). MAPREDUCE, DIVIDE & CONQUER, SPLIT-APPLY-COMBINE ANALYSIS APPROACHES. FUNCTIONAL PROGRAMMING IN R. HADOOP AND SPARK. USING SPARK IN R. THE SPARKLYR PACKAGE (6H). DATA WRANGLING ON LARGE DATASETS AND DISTRIBUTED DATASETS. VISUALIZATION TECHNIQUES FOR BIG DATA IN R. CASE STUDIES IN R (8H). REGRESSION MODELS FOR BIG DATA. PENALIZED ESTIMATION: RIDGE, LASSO AND ELASTIC NET. REGRESSION MODELS FOR BIG DATA WITH SPARKLYR. CASE STUDIES IN R (10H). |
Teaching Methods | |
---|---|
THE COURSE INCLUDES 30 HOURS OF CLASSROOM TEACHING. ALTHOUGH NOT MANDATORY, GIVEN THE NATURE OF THE COURSE, ATTENDANCE IS STRONGLY RECOMMENDED. DURING CLASSES, THEORETICAL ISSUES WILL BE ADDRESSED, CONSTANTLY SUPPORTED BY THE PRESENTATION OF CASE STUDIES THROUGH WHICH THE METHODS OF IMPLEMENTATION OF THE TECHNIQUES, THE CONTEXTS OF USE OF THE VARIOUS TOOLS AND THE POSSIBLE INTERPRETATIONS OF THE RESULTS OBTAINED WILL BE CLARIFIED. AS A CONSEQUENCE, EXERCISES WILL FORM AN INTEGRAL PART OF THE SCHEDULED LESSONS. |
Verification of learning | |
---|---|
THE STUDENT WILL BE ASSESSED IN A FINAL EXAM HELD ON THE EXAM DATES SCHEDULED BY THE DEPARTMENT. THE FINAL EXAM CONSISTS OF A WRITTEN TEST (GRADED OUT OF 30) AND AN ORAL TEST, WHICH IS TYPICALLY HELD IN THE DAYS IMMEDIATELY FOLLOWING. THE DATE OF THE WRITTEN TEST IS SCHEDULED BY THE DEPARTMENT, AND THE DAY OF THE ORAL TEST IS COMMUNICATED TO THE STUDENTS AT THE END OF THE WRITTEN TEST. THE WRITTEN TEST (LASTING ABOUT 90 MINUTES) IS AIMED AT ASCERTAINING THE STUDENT'S ABILITY TO USE THE SOFTWARE TOOLS COVERED BY THE COURSE, TO APPLY THE EXPLORATORY AND INFERENTIAL STATISTICAL TECHNIQUES STUDIED, AND TO INTERPRET AND COMMENT ON THE STATISTICAL RESULTS OBTAINED. DURING THE WRITTEN TEST, THE STUDENT WILL RECEIVE AN EXAM PAPER AND WILL BE ASKED TO ANSWER 5 QUESTIONS (EACH WITH A MAXIMUM SCORE OF 6 POINTS) ON THE ENTIRE COURSE PROGRAM. THE ORAL TEST (LASTING ABOUT 30 MINUTES) CONSISTS OF AN INTERVIEW WITH QUESTIONS AND A DISCUSSION OF THE WRITTEN PAPER. THE FINAL MARK (MIN 18, MAX 30 WITH POSSIBLE HONORS) IS ASSIGNED BY EVALUATING THE RESULTS OF THE WRITTEN AND ORAL TESTS, WHICH ASSESS MASTERY OF THE COURSE CONTENT, APPROPRIATENESS OF THE DEFINITIONS AND THEORETICAL REFERENCES, CLARITY OF THE ARGUMENT, AND COMMAND OF SPECIALIZED LANGUAGE. |
Texts | |
---|---|
LECTURE NOTES, WEB RESOURCES AND ARTICLES SUGGESTED BY THE TEACHER DURING THE COURSE WILL BE MADE AVAILABLE TO ALL ATTENDING STUDENTS. R FOR DATA SCIENCE (2E), HADLEY WICKHAM, MINE ÇETINKAYA-RUNDEL & GARRETT GROLEMUND, O'REILLY. MASTERING SPARK WITH R, JAVIER LURASCHI, KEVIN KUO, EDGAR RUIZ, O'REILLY. TO RESPOND FLEXIBLY TO THE SPECIFIC NEEDS OF EACH INDIVIDUAL STUDENT, THE TEACHER CAN SUGGEST ALTERNATIVE OR ADDITIONAL READINGS UPON STUDENTS’ REQUEST. |
More Information | |
---|---|
THE INSTRUCTOR PROVIDES FURTHER EXPLANATIONS AND METHODOLOGICAL SUPPORT TO THE STUDENTS DURING OFFICE HOURS. DAYS, TIMES AND PLACES, AS WELL AS ANY CHANGES, ARE COMMUNICATED ON THE TEACHER'S WEB PAGE. IN ANY CASE, IT IS POSSIBLE TO ARRANGE AN APPOINTMENT OUTSIDE THE SCHEDULED TIMES FOR OFFICE HOURS BY SENDING AN EMAIL TO THE INSTRUCTOR’S EMAIL ADDRESS. |
Data source: ESSE3 [Last synchronization: 2024-11-18]