How to Cite?
Senhadji Sarra, Megaiz Samia, Sadok Riad Mustapha, "A Hybrid Approach For Fault Tolerance In Datagrid," International Journal of Computer Trends and Technology, vol. 68, no. 11, pp. 53-58, 2020. Crossref, 10.14445/22312803/IJCTT-V68I11P107
Abstract
In recent years, the volume of data that must be stored, analyzed, and exploited has grown considerably. Grid systems emerged in response, offering large-scale networks and geographically distributed resource sharing around the world. However, grids are extremely dynamic: their nodes are heterogeneous and volatile, which increases the probability of failure. Two main families of techniques address this problem: masking and non-masking. With a masking technique, the fault and its resolution are hidden from the client and the system remains operational. With a non-masking technique, by contrast, the fault can halt execution until it is resolved. In this paper, we propose a hybrid solution that combines two fault-tolerance methods, a masking one based on replication and a non-masking one based on recovery.
Keywords
fault tolerance, recovery, replication, Datagrid