MINING AND PROFILING DATA STREAMS

LOREDANA CARUCCIO MINING AND PROFILING DATA STREAMS

8860100006
COMPUTER SCIENCE
Corso di Dottorato (D.M.226/2021)
COMPUTER SCIENCE
2024/2025

YEAR OF COURSE 1
YEAR OF DIDACTIC SYSTEM 2024
AUTUMN SEMESTER
CFU    HOURS    ACTIVITY
3      18       LESSONS
Objectives
THE GOAL OF THIS COURSE IS TO PROVIDE STUDENTS WITH THE METHODOLOGICAL AND TECHNOLOGICAL SKILLS TO ANALYZE BIG DATA STREAMS IN REAL TIME. SPECIFICALLY, THE COURSE EXPLORES METHODS AND TECHNIQUES FOR PROFILING DATA AND SEARCHING STREAMS FOR USEFUL INFORMATION, IN ORDER TO EXTRACT METADATA AND PERFORM UNSUPERVISED LEARNING ACTIVITIES.

KNOWLEDGE AND UNDERSTANDING:
PROVIDE THE STUDENT WITH KNOWLEDGE OF THE ALGORITHMS, MODELS, AND TECHNOLOGIES FOR MANAGING BIG DATA STREAMS AND DATA SERIES, WITH THE AIM OF AUTOMATICALLY EXTRACTING USEFUL INFORMATION, DATA CORRELATIONS, AND PROPERTIES. MORE SPECIFICALLY, THE COURSE AIMS TO PROVIDE STUDENTS WITH KNOWLEDGE OF THE FOLLOWING TOPICS:

- DATA PROPERTIES
- ALGORITHMS FOR EXTRACTING PROFILING METADATA
- CONTINUOUS DATA PROFILING
- SEQUENCE DATA MINING
- ANALYSIS OF DATA SERIES

APPLYING KNOWLEDGE AND UNDERSTANDING:
THE COURSE AIMS TO PROVIDE STUDENTS WITH THE FOLLOWING ABILITIES:
• KNOW HOW TO MANAGE AND ANALYZE BIG DATA STREAMS AND THEIR PROPERTIES.
• KNOW HOW TO SELECT THE MOST APPROPRIATE DATA PROFILING AND MINING TECHNIQUES FOR ANALYZING DATA IN SCENARIOS AT DIFFERENT LEVELS OF COMPLEXITY.
Prerequisites
STUDENTS SHOULD BE FAMILIAR WITH THE FUNDAMENTALS OF DATA MANAGEMENT, DISTRIBUTED SYSTEMS, AND THE OBJECT-ORIENTED PARADIGM, AND SHOULD KNOW A PROGRAMMING LANGUAGE.
Contents
THE COURSE WILL FOCUS ON THE FOLLOWING TOPICS:

INTRODUCTION TO DATA PROFILING (2 HOURS OF THEORY)
• DATA PROFILING TASKS AND TOOLS (1 HOUR OF THEORY)
• OPEN CHALLENGES (1 HOUR OF THEORY)

DISCOVERY TASK: UCCS & INDS (2 HOURS OF THEORY)
• UNIQUE COLUMN COMBINATIONS (1 HOUR OF THEORY)
• INCLUSION DEPENDENCIES (1 HOUR OF THEORY)

FUNCTIONAL DEPENDENCIES AND THEIR DISCOVERY ALGORITHMS (2 HOURS OF THEORY)
• DEFINITION AND PROPERTIES (1 HOUR OF THEORY)
• THE TANE ALGORITHM (1 HOUR OF THEORY)

RELAXED FUNCTIONAL DEPENDENCIES (2 HOURS OF THEORY)
• DEFINITION AND RELAXATION CRITERIA (1 HOUR OF THEORY)
• DIME AND DOMINO ALGORITHMS (1 HOUR OF THEORY)

CONTINUOUS PROFILING (2 HOURS OF THEORY)
• PROBLEM ISSUES (1 HOUR OF THEORY)
• ALGORITHMIC SOLUTIONS (1 HOUR OF THEORY)

REAL-TIME SEQUENCE MINING (6 HOURS OF THEORY)
• THE STREAM DATA MODEL (1 HOUR OF THEORY)
• SAMPLING DATA STREAMS (1 HOUR OF THEORY)
• FILTERING STREAMS: THE BLOOM FILTER (1 HOUR OF THEORY)
• COUNTING DISTINCT ELEMENTS IN A STREAM (1 HOUR OF THEORY)
• DECAYING WINDOWS (1 HOUR OF THEORY)
• MINING SEQUENTIAL PATTERNS (1 HOUR OF THEORY)

DATA SERIES ANALYTICS (2 HOURS OF THEORY)
Teaching Methods
THE COURSE INCLUDES 18 HOURS OF LECTURES ON THEORETICAL TOPICS, WHICH INTRODUCE THE CONCEPTS AND DEVELOP THE ABILITY TO DESIGN AND IMPLEMENT SOLUTIONS FOR REAL-TIME ANALYSIS OF DATA STREAMS. COURSE CONTENTS ARE PRESENTED THROUGH POWERPOINT SLIDES AND ARE MEANT TO STIMULATE CRITICAL DISCUSSION WITH THE STUDENTS. FOR EACH TOPIC PRESENTED, THE INSTRUCTORS WILL ILLUSTRATE POTENTIAL TASKS ON WHICH A STUDENT OR A GROUP CAN DEVELOP THE COURSE PROJECT.
Verification of learning
THE ACHIEVEMENT OF THE COURSE OBJECTIVES IS CERTIFIED BY MEANS OF AN EXAM, WHOSE FINAL GRADE IS EXPRESSED ON A SCALE OF 30. THE EXAM CONSISTS OF THE DEVELOPMENT OF A COURSE PROJECT THROUGH WHICH STUDENTS CAN DEMONSTRATE THEIR ABILITY TO APPLY THE ACQUIRED KNOWLEDGE IN REAL SCENARIOS. THE PROJECT CAN BE CARRIED OUT INDIVIDUALLY OR IN GROUPS OF UP TO 2 STUDENTS, WHO CAN CHOOSE FROM A RANGE OF PROPOSALS PROVIDED BY THE INSTRUCTORS. DURING THE PROJECT DEVELOPMENT, STUDENTS CAN INTERACT WITH THE INSTRUCTORS TO REPORT ON THE PROJECT’S PROGRESS AND ANY CRITICAL ISSUES, AND TO DISCUSS THE GOALS OF THE PROJECT AND HOW TO PROCEED WITH IT. AT THE END OF THE PROJECT, STUDENTS SHOULD DELIVER A TECHNICAL REPORT CONTAINING THE PROJECT DOCUMENTATION AND GIVE A POWERPOINT PRESENTATION LASTING ABOUT 30 MINUTES.
Texts
1. MINING OF MASSIVE DATASETS. ANAND RAJARAMAN, JEFFREY DAVID ULLMAN. 2011. CAMBRIDGE UNIVERSITY PRESS.
2. PROFILING RELATIONAL DATA: A SURVEY. ZIAWASCH ABEDJAN, LUKASZ GOLAB, FELIX NAUMANN, VLDB JOURNAL, VOL. 24(4):557-581, 2015.
3. RELAXED FUNCTIONAL DEPENDENCIES: A SURVEY OF APPROACHES. LOREDANA CARUCCIO, VINCENZO DEUFEMIA, GIUSEPPE POLESE. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING 28.1 (2015): 147-165.
More Information
COURSE ATTENDANCE IS STRONGLY RECOMMENDED. STUDENTS MUST BE PREPARED TO SPEND A FAIR AMOUNT OF TIME STUDYING OUTSIDE OF LESSONS. FOR SATISFACTORY PREPARATION, STUDENTS NEED TO SPEND AN AVERAGE OF ONE HOUR OF STUDY TIME FOR EACH HOUR SPENT IN CLASS. STUDENTS SHOULD ALSO EXPECT TO SPEND ABOUT 30 HOURS DEVELOPING THE PROJECT.

COURSE MATERIALS WILL BE AVAILABLE FOR DOWNLOAD FROM THE DEPARTMENTAL E-LEARNING PLATFORM:
HTTP://ELEARNING.INFORMATICA.UNISA.IT/EL-PLATFORM/

CONTACTS
PROF. GIUSEPPE POLESE
GPOLESE@UNISA.IT

PROF. LOREDANA CARUCCIO
LCARUCCIO@UNISA.IT

  BETA VERSION Data source ESSE3 [Last Synchronization: 2024-12-13]