Get ready to unlock the power of your data. With the fourth edition of this comprehensive guide, you’ll learn how to build and maintain reliable, scalable, distributed systems with Apache Hadoop. This book is ideal for programmers looking to analyze datasets of any size, and for administrators who want to set up and run Hadoop clusters. Using Hadoop 2 exclusively, author Tom White presents new chapters on YARN and several Hadoop-related projects such as Parquet, Flume, Crunch, and Spark. You’ll learn about recent changes to Hadoop, and explore new case studies on Hadoop’s role in healthcare systems and genomics data processing. Learn fundamental components such as MapReduce, HDFS, and YARN Explore MapReduce in depth, including steps for developing applications with it Set up and maintain a Hadoop cluster running HDFS and MapReduce on YARN Learn two data formats: Avro for data serialization and Parquet for nested data Use data ingestion tools such as Flume (for streaming data) and Sqoop (for bulk data transfer) Understand how high-level data processing tools like Pig, Hive, Crunch, and Spark work with Hadoop Learn the HBase distributed database and the ZooKeeper distributed configuration service
Efficient Computational Solutions for Protein Structures
Author: Dariusz Mrozek
This book presents a focus on proteins and their structures. The text describes various scalable solutions for protein structure similarity searching, carried out at main representation levels and for prediction of 3D structures of proteins. Emphasis is placed on techniques that can be used to accelerate similarity searches and protein structure modeling processes. The content of the book is divided into four parts. The first part provides background information on proteins and their representation levels, including a formal model of a 3D protein structure used in computational processes, and a brief overview of the technologies used in the solutions presented in the book. The second part of the book discusses Cloud services that are utilized in the development of scalable and reliable cloud applications for 3D protein structure similarity searching and protein structure prediction. The third part of the book shows the utilization of scalable Big Data computational frameworks, like Hadoop and Spark, in massive 3D protein structure alignments and identification of intrinsically disordered regions in protein structures. The fourth part of the book focuses on finding 3D protein structure similarities, accelerated with the use of GPUs and the use of multithreading and relational databases for efficient approximate searching on protein secondary structures. The book introduces advanced techniques and computational architectures that benefit from recent achievements in the field of computing and parallelism. Recent developments in computer science have allowed algorithms previously considered too time-consuming to now be efficiently used for applications in bioinformatics and the life sciences. Given its depth of coverage, the book will be of interest to researchers and software developers working in the fields of structural bioinformatics and biomedical databases.
Release on 2014-06-25 | by Sherif Sakr,Mohamed Gaber
Processing and Management
Author: Sherif Sakr,Mohamed Gaber
Pubpsher: CRC Press
Large Scale and Big Data: Processing and Management provides readers with a central source of reference on the data management techniques currently available for large-scale data processing. Presenting chapters written by leading researchers, academics, and practitioners, it addresses the fundamental challenges associated with Big Data processing tools and techniques across a range of computing environments. The book begins by discussing the basic concepts and tools of large-scale Big Data processing and cloud computing. It also provides an overview of different programming models and cloud-based deployment models. The book’s second section examines the usage of advanced Big Data processing techniques in different domains, including semantic web, graph processing, and stream processing. The third section discusses advanced topics of Big Data processing such as consistency management, privacy, and security. Supplying a comprehensive summary from both the research and applied perspectives, the book covers recent research discoveries and applications, making it an ideal reference for a wide range of audiences, including researchers and academics working on databases, data mining, and web scale data processing. After reading this book, you will gain a fundamental understanding of how to use Big Data-processing tools and techniques effectively across application domains. Coverage includes cloud data management architectures, big data analytics visualization, data management, analytics for vast amounts of unstructured data, clustering, classification, link analysis of big data, scalable data mining, and machine learning techniques.
This book constitutes the refereed proceedings of the 27th IFIP TC 11 International Information Security Conference, SEC 2012, held in Heraklion, Crete, Greece, in June 2012. The 42 revised full papers presented together with 11 short papers were carefully reviewed and selected from 167 submissions. The papers are organized in topical sections on attacks and malicious code, security architectures, system security, access control, database security, privacy attitudes and properties, social networks and social engineering, applied cryptography, anonymity and trust, usable security, security and trust models, security economics, and authentication and delegation.
Release on 2016-10-27 | by Florin Pop,Joanna Kołodziej,Beniamino Di Martino
Algorithms, Modelling, and High-Performance Computing Techniques
Author: Florin Pop,Joanna Kołodziej,Beniamino Di Martino
Serving as a flagship driver towards advance research in the area of Big Data platforms and applications, this book provides a platform for the dissemination of advanced topics of theory, research efforts and analysis, and implementation oriented on methods, techniques and performance evaluation. In 23 chapters, several important formulations of the architecture design, optimization techniques, advanced analytics methods, biological, medical and social media applications are presented. These chapters discuss the research of members from the ICT COST Action IC1406 High-Performance Modelling and Simulation for Big Data Applications (cHiPSet). This volume is ideal as a reference for students, researchers and industry practitioners working in or interested in joining interdisciplinary works in the areas of intelligent decision systems using emergent distributed computing paradigms. It will also allow newcomers to grasp the key concerns and their potential solutions.