U.S. Department of Energy
Office of Scientific and Technical Information

Heterogeneous Multi-Domain Dataset Synthesis to Facilitate Privacy and Risk Assessments in Smart City IoT

Journal Article · Electronics
The emergence of the Smart Cities paradigm and the rapid expansion and integration of Internet of Things (IoT) technologies within it have created unprecedented opportunities for high-resolution behavioral analytics, urban optimization, and context-aware services. However, this same proliferation intensifies privacy risks, particularly those arising from cross-modal data linkage across heterogeneous sensing platforms. To address these challenges, this paper introduces a comprehensive, statistically grounded framework for generating synthetic, multimodal IoT datasets tailored to Smart City research. The framework produces behaviorally plausible synthetic data suitable for preliminary privacy risk assessment and as a benchmark for future re-identification studies, as well as for evaluating algorithms in mobility modeling, urban informatics, and privacy-enhancing technologies. As part of our approach, we formalize probabilistic methods for synthesizing three heterogeneous and operationally relevant data streams (cellular mobility traces, payment terminal transaction logs, and Smart Retail nutrition records) that capture the behaviors of a large number of synthetically generated urban residents over a 12-week period. The framework integrates spatially explicit merchant selection using k-dimensional tree (KD-tree) nearest-neighbor algorithms, temporally correlated anchor-based mobility simulation reflective of daily urban rhythms, and dietary-constraint filtering to preserve ecological validity in consumption patterns. In total, the system generates approximately 116 million mobility pings, 5.4 million transactions, and 1.9 million itemized purchases, yielding a reproducible benchmark for evaluating multimodal analytics, privacy-preserving computation, and secure IoT data-sharing protocols. To support the validity of this dataset, the underlying distributions of the synthetic residents were validated against distributions reported in published research.
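The abstract names KD-tree nearest-neighbor search as the basis for spatially explicit merchant selection but gives no implementation details. The following is a minimal sketch of that general technique, not the authors' code. It assumes merchants and residents are given as planar (x, y) coordinates, and the names `KDNode`, `build_kdtree`, and `nearest` are hypothetical. In practice a library structure such as SciPy's `scipy.spatial.cKDTree` would typically be used instead of a hand-written tree.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]


class KDNode:
    """One node of a 2-D KD-tree; stores a point and its original index."""

    def __init__(self, point: Point, index: int, axis: int,
                 left: Optional["KDNode"], right: Optional["KDNode"]) -> None:
        self.point = point
        self.index = index
        self.axis = axis
        self.left = left
        self.right = right


def build_kdtree(points: List[Point],
                 indices: Optional[List[int]] = None,
                 depth: int = 0) -> Optional[KDNode]:
    """Recursively split the points at the median, alternating x and y."""
    if indices is None:
        indices = list(range(len(points)))
    if not indices:
        return None
    axis = depth % 2
    ordered = sorted(indices, key=lambda i: points[i][axis])
    mid = len(ordered) // 2
    return KDNode(
        point=points[ordered[mid]],
        index=ordered[mid],
        axis=axis,
        left=build_kdtree(points, ordered[:mid], depth + 1),
        right=build_kdtree(points, ordered[mid + 1:], depth + 1),
    )


def nearest(node: Optional[KDNode], target: Point,
            best: Optional[Tuple[float, int]] = None
            ) -> Optional[Tuple[float, int]]:
    """Return (squared distance, index) of the point closest to target."""
    if node is None:
        return best
    d2 = (node.point[0] - target[0]) ** 2 + (node.point[1] - target[1]) ** 2
    if best is None or d2 < best[0]:
        best = (d2, node.index)
    diff = target[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, target, best)
    # Search the far side only if the splitting plane is closer than the best hit.
    if diff * diff < best[0]:
        best = nearest(far, target, best)
    return best


# Hypothetical merchant locations in a projected planar coordinate system.
merchants = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0)]
tree = build_kdtree(merchants)
closest = nearest(tree, (9.0, 1.0))
```

For the four illustrative merchants above, a resident located at (9.0, 1.0) is matched to merchant index 1, the point (10.0, 0.0). A probabilistic selection scheme, as the abstract suggests, would likely query several nearest candidates and sample among them rather than always taking the single closest one.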
We present preliminary uniqueness and cross-modal linkage indicators; comprehensive re-identification benchmarking against specific attack algorithms is planned as future work. This framework can be easily adapted to various scenarios of interest in Smart Cities and other IoT applications. By aligning methodological rigor with the operational needs of Smart City ecosystems, this work fills critical gaps in synthetic data generation for privacy-sensitive domains, including intelligent transportation systems, urban health informatics, and next-generation digital commerce infrastructures.
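The abstract does not define its preliminary uniqueness indicators. As a hedged illustration only, the sketch below computes a common unicity-style measure from the mobility-privacy literature: the fraction of residents whose trace is the only one containing k randomly chosen points from it. The function name `uniqueness_rate`, the (cell, hour) point encoding, and the sample data are all assumptions made for this example.

```python
import random
from typing import Dict, Set, Tuple

TracePoint = Tuple[str, int]  # assumed encoding: (cell identifier, hour of day)


def uniqueness_rate(traces: Dict[str, Set[TracePoint]], k: int,
                    seed: int = 0) -> float:
    """Fraction of users singled out by k points sampled from their own trace."""
    rng = random.Random(seed)  # fixed seed keeps the sampling reproducible
    eligible = 0
    unique = 0
    for points in traces.values():
        if len(points) < k:
            continue  # users with fewer than k points cannot be sampled
        eligible += 1
        sample = set(rng.sample(sorted(points), k))
        # Count every trace (including the user's own) that contains the sample.
        matches = sum(1 for other in traces.values() if sample <= other)
        if matches == 1:
            unique += 1
    return unique / eligible if eligible else 0.0


# Hypothetical toy traces: residents "a" and "c" are indistinguishable.
toy_traces = {
    "a": {("c1", 8), ("c2", 18)},
    "b": {("c1", 8), ("c3", 18)},
    "c": {("c1", 8), ("c2", 18)},
}
rate_k2 = uniqueness_rate(toy_traces, k=2)
```

With k = 2 in this toy data, only resident "b" is singled out, so the rate is one third. A cross-modal linkage indicator could follow the same pattern by testing whether points drawn from one stream, such as transactions, match exactly one trace in another stream.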
Research Organization:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Organization:
USDOE
Grant/Contract Number:
AC05-00OR22725
OSTI ID:
3018132
Journal Information:
Journal Name: Electronics; Journal Volume: 15; Journal Issue: 3; ISSN 2079-9292
Publisher:
MDPI
Country of Publication:
United States
Language:
English

Similar Records

Investigating Users’ Privacy Concerns of Internet of Things (IoT) Smart Devices
Conference · Sat Oct 01 00:00:00 EDT 2022 · OSTI ID:1975369

Privacy by Design in Distributed Edge Systems: Innovating Secure Workflows for Smart Cities
Journal Article · Mon Sep 30 20:00:00 EDT 2024 · IEEE Smart Cities · OSTI ID:2491435

TSDC: Transportation Secure Data Center: Real-World Data for Planning, Modeling, and Analysis
Program Document · Mon Jan 26 19:00:00 EST 2026 · OSTI ID:3015898