Paper published in a book (Scientific congresses, symposiums and conference proceedings)
Adaptive Sparsity Level during Training for Efficient Time Series Forecasting with Transformers
Atashgahi, Zahra; Pechenizkiy, Mykola; Veldhuis, Raymond et al.
2024, in ECMLPKDD 2024: Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases
Peer reviewed
 

Files
Full Text: 2305.18382.pdf (author postprint, 1.93 MB)

Details



Keywords :
Computer Science - Learning; Sparse Training; Sparse Neural Networks; Time Series
Abstract :
[en] Efficient time series forecasting has become critical for real-world applications, particularly with deep neural networks (DNNs). Efficiency in DNNs can be achieved through sparse connectivity and a reduced model size. However, finding the sparsity level automatically during training remains challenging because the loss-sparsity tradeoff varies across datasets. In this paper, we propose "Pruning with Adaptive Sparsity Level" (PALS) to automatically seek an optimal balance between loss and sparsity, without requiring a predefined sparsity level. PALS draws inspiration from both sparse training and during-training pruning methods. It introduces a novel "expand" mechanism for training sparse neural networks, allowing the model to dynamically shrink, expand, or remain stable until it finds a proper sparsity level. We focus on achieving efficiency in transformers, which are known for their excellent time series forecasting performance but high computational cost; nevertheless, PALS can be applied directly to any DNN, and we also demonstrate its effectiveness on the DLinear model. Experimental results on six benchmark datasets and five state-of-the-art transformer variants show that PALS substantially reduces model size while maintaining performance comparable to the dense model. Notably, PALS even outperforms the dense model in 12 and 14 out of 30 cases in terms of MSE and MAE loss, respectively, while reducing the parameter count by 65% and FLOPs by 63% on average. Our code will be made publicly available upon acceptance of the paper.
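The shrink/expand/stable idea described in the abstract can be pictured with a small, hypothetical sketch. The snippet below is not the authors' PALS algorithm; it only illustrates, under an assumed loss-based heuristic, how a sparsity level might be raised, lowered, or left unchanged during training, combined with simple magnitude pruning. The function names (update_sparsity, apply_sparsity) and thresholds are illustrative assumptions, not quantities from the paper.

import numpy as np

def update_sparsity(sparsity, prev_loss, curr_loss, step=0.05, tol=0.01,
                    min_sparsity=0.0, max_sparsity=0.95):
    # Hypothetical loss-driven update (not the exact PALS criterion):
    # if the loss is stable or improving, the network is assumed to tolerate
    # more pruning ("shrink"); if it degrades clearly, lower the sparsity
    # ("expand"); otherwise keep the sparsity level stable.
    rel_change = (curr_loss - prev_loss) / max(abs(prev_loss), 1e-12)
    if rel_change <= tol:                      # shrink: prune more
        return min(sparsity + step, max_sparsity)
    if rel_change > 2 * tol:                   # expand: allow more weights
        return max(sparsity - step, min_sparsity)
    return sparsity                            # remain stable

def apply_sparsity(weights, sparsity):
    # Magnitude pruning: zero the smallest |w| entries to reach `sparsity`.
    # (Regrowth of previously pruned weights is omitted in this toy sketch.)
    k = int(sparsity * weights.size)
    if k == 0:
        return weights
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    return np.where(np.abs(weights) > threshold, weights, 0.0)

# Toy usage: the sparsity level adapts as a simulated validation loss evolves.
rng = np.random.default_rng(0)
weights = rng.normal(size=(64, 64))
sparsity, losses = 0.2, [1.00, 0.98, 0.97, 1.10, 1.09]
for prev, curr in zip(losses, losses[1:]):
    sparsity = update_sparsity(sparsity, prev, curr)
    weights = apply_sparsity(weights, sparsity)
    print(f"loss {curr:.2f} -> sparsity {sparsity:.2f}")

In this toy run the sparsity level first increases while the simulated loss improves, drops back when the loss degrades, and then rises again, which is the adaptive behavior the abstract attributes to PALS; the actual decision rule and regrowth scheme are defined in the paper.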
Disciplines :
Computer science
Author, co-author :
Atashgahi, Zahra;  University of Twente [NL]
Pechenizkiy, Mykola;  Eindhoven University of Technology [NL]
Veldhuis, Raymond;  University of Twente [NL]
Mocanu, Decebal Constantin; University of Luxembourg > Faculty of Science, Technology and Medicine (FSTM) > Department of Computer Science (DCS); University of Twente [NL]; Eindhoven University of Technology [NL]
External co-authors :
yes
Language :
English
Title :
Adaptive Sparsity Level during Training for Efficient Time Series Forecasting with Transformers
Publication date :
2024
Event name :
ECMLPKDD 2024: European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases
Event place :
Vilnius, Lithuania
Event date :
from 9 to 13 September 2024
Audience :
International
Main work title :
ECMLPKDD 2024: Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases
Publisher :
Springer (Lecture Notes in Computer Science, LNCS)
Peer reviewed :
Peer reviewed
Focus Area :
Computational Sciences
Development Goals :
9. Industry, innovation and infrastructure
Available on ORBilu :
since 15 January 2024

Statistics
Number of views: 128 (9 by Unilu)
Number of downloads: 48 (1 by Unilu)
Scopus citations®: 1 (1 without self-citations)
OpenAlex citations: 1
