Data Optimization through Compression Methods Using Information Technology

Compact Data Representation; Compressed Copy of Tabular Data; Data Similarity; Information Technology; Control and Systems Engineering; Information Systems; Computer Science Applications; Computer Networks and Communications; Computational Theory and Mathematics; Artificial Intelligence

Abstract :

Efficient comparison of heterogeneous tabular datasets is difficult when sources are unknown or weakly documented. We address this problem by introducing a unified, type-aware framework that builds compact data representations (CDRs), that is, concise summaries sufficient for downstream analysis, together with a corresponding similarity graph (and tree) over a data corpus. Our novelty is threefold: (i) a principled vocabulary and procedure for constructing CDRs per variable type (factor, time, numeric, string), (ii) a weighted, type-specific similarity metric we call Data Information Structural Similarity (DISS) that aggregates distances across heterogeneous summaries, and (iii) an end-to-end, cloud-scalable realization that supports large corpora. Methodologically, factor variables are summarized by frequency tables; time variables by fixed-bin histograms; numeric variables by moment vectors (up to the fourth order); and string variables by TF–IDF vectors. Pairwise similarities use Hellinger, Wasserstein (p=1), total variation, and L1/L2 distances, with MAE/MAPE for numeric summaries; the DISS score combines these via learned or user-set weights to form an adjacency graph whose minimum-spanning tree yields a similarity tree. In experiments on multi-source CSVs, the approach enables accurate retrieval of closest datasets and robust corpus-level structuring while reducing storage and I/O. This contributes a reproducible pathway from raw tables to a similarity tree, clarifying terminology and providing algorithms that practitioners can deploy at scale.
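The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it covers only two of the four variable types (factor and numeric), and the function names, the CDR layout, the weighted-sum form of the DISS aggregation, the toy corpus, and the weights are all assumptions made for illustration. The abstract does not give the exact aggregation formula.

```python
import math
from collections import Counter
from itertools import combinations


def factor_summary(values):
    """Frequency table, normalised to relative frequencies."""
    counts = Counter(values)
    total = sum(counts.values())
    return {level: n / total for level, n in counts.items()}


def moment_summary(xs):
    """Mean, variance, skewness and kurtosis (moments up to fourth order)."""
    n = len(xs)
    mean = sum(xs) / n
    m2, m3, m4 = (sum((x - mean) ** k for x in xs) / n for k in (2, 3, 4))
    if m2 == 0:  # constant column: standardised moments are undefined
        return (mean, 0.0, 0.0, 0.0)
    return (mean, m2, m3 / m2 ** 1.5, m4 / m2 ** 2)


def hellinger(p, q):
    """Hellinger distance between two discrete distributions (dicts)."""
    keys = set(p) | set(q)
    s = sum((math.sqrt(p.get(k, 0.0)) - math.sqrt(q.get(k, 0.0))) ** 2
            for k in keys)
    return math.sqrt(s / 2.0)


def mae(a, b):
    """Mean absolute error between two equal-length summary vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def build_cdr(table):
    """Hypothetical CDR layout: one summary per variable type."""
    return {"factor": factor_summary(table["factor"]),
            "numeric": moment_summary(table["numeric"])}


def diss_distance(cdr_a, cdr_b, weights):
    """Assumed aggregation: weighted sum of the type-specific distances."""
    return (weights["factor"] * hellinger(cdr_a["factor"], cdr_b["factor"])
            + weights["numeric"] * mae(cdr_a["numeric"], cdr_b["numeric"]))


def similarity_tree(cdrs, weights):
    """Minimum spanning tree (Prim's algorithm) over the complete graph."""
    names = sorted(cdrs)
    dist = {frozenset(pair): diss_distance(cdrs[pair[0]], cdrs[pair[1]], weights)
            for pair in combinations(names, 2)}
    in_tree, edges = [names[0]], []
    while len(in_tree) < len(names):
        # Cheapest edge that connects the tree to a dataset outside it.
        u, v = min(((a, b) for a in in_tree for b in names if b not in in_tree),
                   key=lambda e: dist[frozenset(e)])
        edges.append((u, v, dist[frozenset((u, v))]))
        in_tree.append(v)
    return edges


# Toy corpus (made-up data): two similar tables and one unrelated table.
corpus = {
    "sales_a": {"factor": ["x", "x", "y"], "numeric": [1.0, 2.0, 3.0, 4.0]},
    "sales_b": {"factor": ["x", "y", "y"], "numeric": [1.0, 2.0, 3.0, 5.0]},
    "logs": {"factor": ["z", "z", "z"], "numeric": [10.0, 20.0, 30.0, 40.0]},
}
cdrs = {name: build_cdr(table) for name, table in corpus.items()}
tree = similarity_tree(cdrs, {"factor": 0.5, "numeric": 0.5})
```

In this sketch the raw moments are compared without rescaling, so columns with large magnitudes dominate the score; a relative error such as MAPE, which the abstract also mentions, or a per-component normalisation would likely be needed in practice.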

Disciplines :

Mathematics

Author, co-author :

Malyk, Igor V. ^✱; Department of Mathematical Problems of Control and Cybernetics, Yuriy Fedkovych Chernivtsi National University, Chernivtsi, Ukraine

Kyrychenko, Yevhen ; Department of Mathematical Modeling, Yuriy Fedkovych Chernivtsi National University, Chernivtsi, Ukraine

Gorbatenko, Mykola ; Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Belvaux, Luxembourg

Lukashiv, Taras ^✱; Clinical and Translational Informatics, Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg ; Department of Mathematical Problems of Control and Cybernetics, Yuriy Fedkovych Chernivtsi National University, Chernivtsi, Ukraine

^✱ These authors have contributed equally to this work.

External co-authors :

yes

Language :

English

Title :

Data Optimization through Compression Methods Using Information Technology

Publication date :

October 2025

Journal title :

International Journal of Information Technology and Computer Science

ISSN :

2074-9007

eISSN :

2074-9015

Publisher :

Modern Education and Computer Science Press

Volume :

Issue :

Pages :

84 - 99

Peer reviewed :

Peer reviewed

Additional URL :

https://www.mecs-press.org/ijitcs/ijitcs-v17-n5/IJITCS-V17-N5-7.pdf

Available on ORBilu :

since 19 January 2026
