Heterogeneous Multi-Domain Dataset Synthesis to Facilitate Privacy and Risk Assessments in Smart City IoT
- Univ. of Nebraska, Lincoln, NE (United States); Oregon State Univ., Corvallis, OR (United States)
- Univ. of Nebraska, Lincoln, NE (United States)
- Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
The emergence of the Smart Cities paradigm and the rapid expansion and integration of Internet of Things (IoT) technologies within this context have created unprecedented opportunities for high-resolution behavioral analytics, urban optimization, and context-aware services. However, this same proliferation intensifies privacy risks, particularly those arising from cross-modal data linkage across heterogeneous sensing platforms. To address these challenges, this paper introduces a comprehensive, statistically grounded framework for generating synthetic, multimodal IoT datasets tailored to Smart City research. The framework produces behaviorally plausible synthetic data suitable for preliminary privacy risk assessment and as a benchmark for future re-identification studies, as well as for evaluating algorithms in mobility modeling, urban informatics, and privacy-enhancing technologies. As part of our approach, we formalize probabilistic methods for synthesizing three heterogeneous and operationally relevant data streams (cellular mobility traces, payment terminal transaction logs, and Smart Retail nutrition records) that capture the behaviors of a large number of synthetically generated urban residents over a 12-week period. The framework integrates spatially explicit merchant selection using k-dimensional (KD) tree nearest-neighbor search, temporally correlated anchor-based mobility simulation reflective of daily urban rhythms, and dietary-constraint filtering to preserve ecological validity in consumption patterns. In total, the system generates approximately 116 million mobility pings, 5.4 million transactions, and 1.9 million itemized purchases, yielding a reproducible benchmark for evaluating multimodal analytics, privacy-preserving computation, and secure IoT data-sharing protocols. To validate the dataset, the underlying distributions of the synthetic residents were compared against, and found consistent with, distributions reported in published research.
We present preliminary uniqueness and cross-modal linkage indicators; comprehensive re-identification benchmarking against specific attack algorithms is planned as future work. This framework can be easily adapted to various scenarios of interest in Smart Cities and other IoT applications. By aligning methodological rigor with the operational needs of Smart City ecosystems, this work fills critical gaps in synthetic data generation for privacy-sensitive domains, including intelligent transportation systems, urban health informatics, and next-generation digital commerce infrastructures.
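The KD-tree merchant selection mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`build_kdtree`, `nearest`), the merchant tuples, and the planar coordinates are all hypothetical, and the paper may select among several nearby merchants probabilistically rather than always taking the single closest one.

```python
import math

def build_kdtree(points, depth=0):
    """Recursively build a 2-D KD-tree from (x, y, merchant_id) tuples."""
    if not points:
        return None
    axis = depth % 2  # alternate between x (0) and y (1)
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def nearest(node, query, best=None):
    """Return (distance, point) for the stored point closest to query."""
    if node is None:
        return best
    point, axis = node["point"], node["axis"]
    dist = math.dist(query, point[:2])
    if best is None or dist < best[0]:
        best = (dist, point)
    diff = query[axis] - point[axis]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, query, best)
    # Search the far side only if the splitting plane is closer than the best hit.
    if abs(diff) < best[0]:
        best = nearest(far, query, best)
    return best

# Hypothetical merchants on an arbitrary planar grid (assumed units, not real data).
merchants = [(0.0, 0.0, "grocer"), (5.0, 5.0, "cafe"), (9.0, 1.0, "pharmacy")]
tree = build_kdtree(merchants)

# A synthetic resident at (8, 2) is matched to the closest merchant.
distance, merchant = nearest(tree, (8.0, 2.0))
print(merchant[2], round(distance, 3))
```

For a dataset at the scale described (millions of transactions), a library implementation such as a compiled KD-tree would likely be preferable to this pure-Python version, which is kept dependency-free only for clarity.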
- Research Organization: Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
- Sponsoring Organization: USDOE
- Grant/Contract Number: AC05-00OR22725
- OSTI ID: 3018132
- Journal Information: Journal Name: Electronics; Journal Volume: 15; Journal Issue: 3; ISSN 2079-9292
- Publisher: MDPI
- Country of Publication: United States
- Language: English
Similar Records
Investigating Users’ Privacy Concerns of Internet of Things (IoT) Smart Devices
Conference · Sat Oct 01 00:00:00 EDT 2022 · OSTI ID:1975369
Privacy by Design in Distributed Edge Systems: Innovating Secure Workflows for Smart Cities
Journal Article · Mon Sep 30 20:00:00 EDT 2024 · IEEE Smart Cities · OSTI ID:2491435
TSDC: Transportation Secure Data Center: Real-World Data for Planning, Modeling, and Analysis
Program Document · Mon Jan 26 19:00:00 EST 2026 · OSTI ID:3015898