Hadoop (Big Data) System Administration
Topics Covered in the Course

The course covers systems and topics such as: Hadoop (HDFS, YARN), Ambari, Hive, Apache Spark, HAWQ, Apache Slider, Apache Oozie, ZooKeeper, Apache Falcon, Apache Ranger, Cluster Management and Monitoring, Security in Hadoop (Authentication, Authorization, Auditing, etc.), Kerberos, Backup/Restore, GUI & CLI Tools, etc.

For details, see the outline below:

 ➢ Understanding Hadoop
  • Why Big Data
  • Structured vs Semi-Structured vs Unstructured Data
  • Apache Hadoop: What Is Hadoop (Libraries/Frameworks)?
  • Hadoop Design (Concepts)
  • Apache Software Framework
  • Data Management, Operations, Data Access, Integration and Security Frameworks
  • Version Compatibility and Interoperability
  • Hadoop Cluster Management Options
  • Puppet, Chef, Ansible and Other Tools for Hadoop
  • Cluster GUI and CLI tools for Management

 ➢ Hadoop Installation
  • Hadoop Deployment Options
  • Hadoop Deployment Modes
  • Planning the Hadoop Cluster and Cluster Workload
  • Storage Calculation
  • Hardware for Master and Slave Nodes
  • Network Design and Hardware Testing
  • OS Pre-configuration
  • Database configuration

 ➢ Apache Ambari
  • Apache Ambari and Cluster Architecture
  • Ambari Server Architecture
  • Web UI (Permissions, Privileges, Types, Create, Modify, etc)
  • Ambari User and Groups, Permissions
  • Local Users vs LDAP/AD Users
  • Configuration, Synchronization and Verification

 ➢ Hadoop Services
  • Core/Main Hadoop Configuration Files, Properties and Precedence
  • Configuration Management Options
  • Hadoop Services – Listing, Launching, Monitoring, Heat Maps, Maintenance Mode, Properties, Options, Configuration Core/Advanced, Setting Values, Configuration Revisions, Configuration History and Shifting Among Versions
  • REST API (Cluster Management, Advanced API Calls, etc)
  • Hadoop Memory Configuration (Calculations and Utility Usage)

 ➢ HDFS (Hadoop Distributed File System)
  • Hadoop and Storage
  • HDFS Architecture and Characteristics
  • How to Access HDFS
  • Namenode and Datanode
  • HDFS Superuser and File System (Permissions, Directories, Accounts)
  • HDFS Command Line (create, delete, permissions, .Trash, copy, moving, viewing, etc)
  • Ambari Files View, Browsing through Namenode UI
  • Java Native API and WebHDFS
  • HDFS ACL (Create, Modify, Delete, Masking, etc) and Quotas
  • HDFS Architecture (Write/Read Operations, Block Creation, Block ID, Checksum, etc)
  • Data Replication, Block Placement, Block Management, Reporting and Data Integrity
  • Disk Management (fsimage, edit log, checkpoints, etc)
  • Block, Disk, Datanode and/or Rack Failure
  • HDFS Management and Monitoring Options and Tools (HDFS Health, Status, Usage)
  • HDFS Command Line (fsck, dfsadmin, blocks, corrupt blocks, mis-replicated blocks, etc)
  • HDFS Health, Status and Usage
  • HDFS Storage Design and Storage Types
  • Storage Preferences and Archive
  • Data Block Replica
  • HDFS Storage Policy Management and HDFS Mover
  • HDFS and NFS Gateway Architecture (Gateway Planning, Deployment, Testing)
  • HDFS Centralized Cache Usage, Operation and Configuration

 ➢ YARN – Yet Another Resource Negotiator
  • YARN Architecture, Operation and Resource Allocation
  • Resource Manager and Node Manager Operation
  • Application Master and Container Management
  • YARN UI Configuration and Monitoring
  • YARN Advanced Configuration
  • YARN Resource Manager UI and Application ID tracking
  • YARN Command Line (jar, classpath, application, node, logs, rmadmin, curl)
  • YARN Failure (Properties Check, Troubleshooting)
  • YARN Log Aggregation
  • Running simple YARN Applications
  • YARN and MapReduce, Hive, Pig
  • Resource Availability Calculation
  • Organizing the Queue Tree and Planning SLAs (Single Queue, FIFO, etc)
  • Resource Reservation for Application
  • Creating and Managing Resource Pools (Limits, Capacity Scheduling, etc)
  • Queue ACL, Queue Mappings, User Limit Configurations and Administration

 ➢ Node Management & Rack Awareness
  • Adding, Deleting and Replacing Nodes; Installing Worker Nodes
  • Ambari Agents Installation, Configuration and Management
  • Rebalancing Cluster (GUI and CLI)
  • Decommissioning and Recommissioning
  • Start/Stop Cluster Components and Configuration Check 
  • Rack Awareness Benefits and Concept
  • Rack Awareness Configuration and Testing
  • Rack Awareness Command-Line Verification

 ➢ HDFS and YARN High Availability
  • Namenode HA (Active/Standby)
  • The Role of Zookeeper
  • JournalNodes Management
  • HA Configuration and Verification
  • ResourceManager HA (Active/Standby) Configuration

 ➢ Monitoring a Cluster
  • Ambari Metrics
  • Embedded and Distributed Mode
  • Ambari Widgets
  • Ambari Alerts (Configuration, Groups, SNMP, Email)

 ➢ Backup the Cluster
  • What to Backup and Why to Backup Big Data
  • HDFS Snapshots
  • Distributed Copy (DistCp)
  • Restoring Process
  • Ambari Blueprints Deployment

 ➢ Hadoop Security
  • Authentication in Hadoop Explained (Hosts and Services)
  • Authentication Using Kerberos
  • How Kerberos Protocol Works and Kerberos Deep Dive
  • Planning and Enabling Kerberos Protocol on Most Components
  • Kerberos Client vs Keytab (Usage, Best Practices)
  • Authorization in Hadoop Explained
  • Authorization Using Apache Ranger
  • Apache Ranger Design, Planning and Installation
  • Apache Ranger Configuration (LDAP, Rules, Policies, Audit, Logging)
  • Apache Ranger HA Configuration
  • Ranger KMS and Encryption
  • Security Tools to Use
  • Enabling SSL/TLS on Most Components
  • Java Truststore and KeyStore Configuration (Generation, Storing, etc)
  • Audit & Logging (Debugging applications, What and Where to look for logs)
  • Knox Installation and Configuration

 ➢ Data Compression
  • Cache Usage/Operation, File Formats and Compression
  • Compression Benefits, Ratio, Pros & Cons
  • SerDe Library (XML, Java)
  • Compression Formats (Record, Block, Split)
  • Compression Algorithms and Compression Configuration

 ➢ Apache Hive
  • Structuring Unstructured Data
  • Hive Architecture and Hive Queries Explained
  • Hive Command-Line (Beeline vs Hive CLI)
  • HiveServer2 and WebHCat HA Architecture
  • Install and Configure Hive HA
  • Batch vs Interactive Processing
  • Hive Performance Tuning
  • Interactive Queries Optimization
  • Multiple HiveServer2 Instances and Queue Approaches
  • Tez Idle and Held Containers

 ➢ Apache Spark
  • Spark Architecture
  • Spark Planning and Deployment
  • Spark Configuration (Properties, Performance Optimization, Security)
  • Spark Command-Line (Spark-Submit)
  • Running Jobs and Performance Comparison with Other Components

 ➢ Apache Slider
  • With and Without Slider
  • Slider Components and Logical View
  • Deploy, Launch and Monitor Slider

 ➢ Apache Oozie
  • Intro to Oozie, Workflow, Coordinator, Bundle and Benefits
  • Oozie Architecture
  • Oozie Installation and Configuration 
  • Deploy Oozie Workflow
  • Oozie Command-Line and UI
  • Apache Oozie HA Architecture, Components and Benefits
  • Apache Oozie HA Installation and Configuration

 ➢ Apache Falcon
  • Introduction to Falcon
  • Data Lifecycle and Data Pipeline
  • Data Lifecycle and Data Pipeline Development and Management
  • Falcon Architecture, Installation and Configuration
  • Falcon UI