
Cloud Computing INFS3208/INFS7208
Re-cap – Lecture 7
• Database Background
• Relational Databases
– Revisit Relational DBs
– ACID Properties *
– Clustered RDBMSs
• Non-relational Databases
– NoSQL concepts
– CAP Theorem ***
– MongoDB
– Cassandra
– HBase

Outline
• Background of Distributed File Systems
– Big Data, Data Centre Technology, and Storage Hardware
• Distributed File System
– File System
– Server/Client System
– Sun’s Network File System (NFS)
• Clustered File System (CFS)
– Google File System (GFS)
– Hadoop Distributed File System (HDFS)
– HDFS Shell Commands

Issues of huge volume of data generated daily!!
[Figure: traditional vs. new data sources. Structured data from traditional sources (data warehouse, transaction data, OLTP systems, ERP, mainframe, internal app data) is repeatable and linear; unstructured data from new sources (Hadoop and streams, multimedia, web logs, social data, text data such as emails, sensor data such as images, RFID sources) is exploratory and dynamic.]

Issues of huge volume of data generated daily!!
We are facing PBs of data every day!
https://www.simplilearn.com/data-science-vs-big-data-vs-data-analytics-article

Measuring the Size of Data
Byte: 8 bits.
Kilobyte: 1,024 bytes; 2^10; approx. 1,000 or 10^3. (2 Kilobytes: a typewritten page)
Megabyte: 1,048,576 bytes; 2^20; approx. 1,000,000 or 10^6. (5 Megabytes: the complete works of Shakespeare)
Gigabyte: 1,073,741,824 bytes; 2^30; approx. 1,000,000,000 or 10^9. (20 Gigabytes: an audio collection of the works of Beethoven)
Terabyte: 1,099,511,627,776 bytes; 2^40; approx. 1,000,000,000,000 or 10^12. (10 Terabytes: the printed collection of the U.S. Library of Congress, with 130 million items on about 530 miles of bookshelves, including 29 million books, 2.7 million recordings, 12 million photographs, 4.8 million maps, and 58 million manuscripts)
Petabyte: 1,125,899,906,842,624 bytes; 2^50; approx. 1,000,000,000,000,000 or 10^15. (2 Petabytes: all U.S. academic research libraries)
Exabyte: 1,152,921,504,606,846,976 bytes; 2^60; approx. 10^18. (5 Exabytes: all words ever spoken by human beings)
Zettabyte: 1,180,591,620,717,411,303,424 bytes; 2^70; approx. 10^21.
Yottabyte: 1,208,925,819,614,629,174,706,176 bytes; 2^80; approx. 10^24. (A stack of 549,755,813,888 storage media ≈ 43,980,465 km, about 114.413 times the 384,402 km between the earth and the moon)
https://en.wikipedia.org/wiki/Floppy_disk
UQ INFS1200/INFS7900 Week 1 Lecture Notes

Big Data Statistics
How much data are we generating?
• By 2020, it’s estimated that 1.7MB of data will be created every second for every person on earth and there will be around 40 trillion gigabytes of data (40 zettabytes).
• 90% of all data has been created in the last two years.
• In 2018, internet users spent 2.8 million years online.
• Social media accounts for 33% of the total time spent online.
• Today it would take a person approximately 181 million years to download all the data from the internet.
• 97.2% of organizations are investing in big data and AI.
• Using big data, Netflix saves $1 billion per year on customer retention.
https://www.socialmediatoday.com/news/how-much-data-is-generated-every-minute-infographic-1/525692/
https://techjury.net/stats-about/big-data-statistics/

Data Centre Technology
A data centre is a specialised IT infrastructure that houses centralised IT resources:
• Servers (rack in cabinet);
• Databases and software systems;
• Networking and telecommunication devices.
Typical technologies and components
• Virtualisation
• Standardisation and Modularity
• Remote Operation and Management
• High Availability
• Security-Aware Design, Operation and Management
• Facilities

Hardware: Array of Hard Disks
https://www.wikihow.com/Recover-a-Dead-Hard-Disk
http://www.comnetdrc.com/

Hardware: Network Switch
Network Switch Can Route 15 Terabits Per Second
http://exd-int.com/my-product/space-mind/

Google Data Centre

Outline
• Background of Distributed File Systems
– Big Data, Data Centre Technology, and Storage Hardware
• Distributed File System
– File System
– Server/Client System
– Sun’s Network File System (NFS)
• Clustered File System (CFS)
– Google File System (GFS)
– Hadoop Distributed File System (HDFS)
– HDFS Shell Commands

What is a File System?
• A file system is an abstraction that enables users to manipulate and organize data.
• Typically, a FS is organized as a hierarchical tree of files and directories.
• A FS provides a uniform view, independent of the underlying storage devices: floppy/optical drives, hard drives, USB sticks, etc.
• The connection between the logical file system and the storage device was typically a one-to-one mapping.
• Examples:
– Windows: NTFS, FAT32, FAT
– macOS: Apple File System (APFS)
– Linux: Ext4, etc.

What is a Distributed File System (DFS)?
• DFS is a distributed implementation of the classical time-sharing model of a file system, where multiple users share files and storage resources.
• A distributed file system spreads over multiple, autonomous computers.
• A distributed file system should have the following characteristics:
– Access transparency
– Location transparency
– Concurrency transparency
– Failure transparency
– Replication transparency
– Migration transparency
– Heterogeneity
– Scalability
https://en.wikipedia.org/wiki/Clustered_file_system#Distributed_file_systems

What is a Distributed File System (DFS)?
• In general, files in a DFS can be located in “any” system.
– Servers: data holders (the “source(s)” of files)
– Clients: data users who are accessing the servers.
• Potentially, a server for a file can become a client for another file.
• However, most distributed systems distinguish between clients and servers in a stricter way:
– Clients simply access files and do not have/share local files.
– Even if clients have disks, they are used for swapping, caching, loading the OS, etc.
– Servers are the actual sources of files.
– In most cases, servers are more powerful machines (in terms of CPU, physical memory, disk bandwidth, etc.)
https://goo.gl/images/Nwq6tV
https://www.slideshare.net/AnamikaSingh211/distributed-file-system-72294718

Sun Network File System (NFS)
• Network File System (NFS) is a distributed file system protocol originally developed by Sun Microsystems in 1984.
• NFS allows a user on a client computer to access files over a computer network much like local storage is accessed.
• NFS is a client-server application, where a user can view, store and update the files on a remote computer.
• NFS builds on the Remote Procedure Call (RPC) system to route requests between clients and servers.
• NFS protocol is designed to be independent of computer, operating system, network architecture, and transport protocol.
• NFS allows the user or system administrator to mount a complete or partial file system on a server.
• The portion of the file system that is mounted can be accessed by clients with different privileges (e.g. read-only or read-write); a minimal example follows.
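For concreteness, here is a minimal sketch of exporting and mounting an NFS share on Linux. The host name nfs-server, the export path /export/data, and the client subnet are illustrative assumptions, not details from the lecture:

  # on the NFS server: export a directory to clients on the subnet (illustrative values)
  echo '/export/data 192.168.1.0/24(rw,sync)' | sudo tee -a /etc/exports
  sudo exportfs -ra                                  # re-read the export table

  # on an NFS client: mount the remote export like a local directory
  sudo mkdir -p /mnt/data
  sudo mount -t nfs nfs-server:/export/data /mnt/data
  ls /mnt/data                                       # remote files appear as if local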

Sun Network File System (NFS)
• Design: [Figure: Client 1, Client 2, and Client 3 access a single Server over the Network.]
Advantages:
• easy sharing of data across clients
• centralized administration (backups done on a few servers instead of on many clients)
• security (put the server behind a firewall)
http://pages.cs.wisc.edu/~remzi/OSTEP/dist-nfs.pdf

Sun Network File System (NFS)
• Each file server presents a standard view of its local file system
• Transparent access to remote files
• Compatibility with multiple operating systems and platforms.
• Easy crash recovery and backup at server
• A software component namely, Virtual File System (VFS) is available in most OS (Operating Systems) as the interface to different local and distributed file systems.
• Virtual File System (VFS) in an OS (Operating System) acts as an interface between the system-call layer and all files in network nodes.
• The user interface to NFS is the same as the interface to local file systems. The calls go to the VFS layer, which passes them either to a local file system or to the NFS client.
Making remote files appear as if they were local to the client.
http://pages.cs.wisc.edu/~remzi/OSTEP/dist-nfs.pdf
For Windows, the equivalent functionality is provided by SMB (Server Message Block), one version of which is known as CIFS (Common Internet File System).

A Real Example of NFS
• A big hard drive (30 TB) on a Windows server:
• The drive can be shared by users on Linux.

A Real Example of NFS
• File Sharing: [Figure: a Windows server holds experimental data and experimental programs; Linux Servers 1, 2, ..., 7 each run an application (App 1, App 2, ..., App 7) on behalf of User 1, accessing the shared data over the network.]

Outline
• Background of Distributed File Systems
– Big Data, Data Centre Technology, and Storage Hardware
• Distributed File System
– File System
– Server/Client System
– Sun’s Network File System (NFS)
• Clustered File System (CFS)
– Google File System (GFS)
– Hadoop Distributed File System (HDFS)
– HDFS Shell Commands

Why use a Clustered File System (CFS)?
• For larger-scale data storage, a cluster of thousands of data servers is required.
• Scalability and availability should be provided for big data storage.
• Resiliency and load balancing are also essential.

What is a Clustered File System (CFS)?
• Clustered File System (CFS) is not a single server with a set of clients, but instead a cluster of servers that all work together to provide high performance service to their clients.
• To the clients of CFS, the cluster is transparent.
• A CFS organizes storage and data access across the whole cluster.
https://gitlab.com/arm-hpc/packages/wikis/packages/BeeGFS

CFS is Managed at a Block Level
The difference between (a) distributing whole files across several servers and (b) striping files into blocks for parallel access. [Figure: file A divided into blocks spread across multiple servers.]
Tanenbaum & Van Steen, Distributed Systems: Principles and Paradigms, 2e, (c) 2007 Prentice-Hall, Inc. All rights reserved. 0-13-239227-5

What are the differences among DFS, NFS, CFS?
Distributed File System (DFS): a file system whose components are spread across multiple systems; the client component runs on a different system from the underlying physical storage and its management (client-server architecture).

Network File System (NFS): files are not local; they are served over a network, with the physical storage units and their management hosted by a different entity. An NFS is therefore inherently a distributed file system as well.

Clustered File System (CFS): built by pooling several discrete components, typically multiple servers and multiple disks, working together to provide a unified namespace (cluster-server architecture). A client is not aware of the physical boundaries that make up the file system.

https://www.quora.com/What-is-the-difference-between-a-distributed-file-system-clustered-file-system-and-a-network-file-system

Outline
• Background of Distributed File Systems
– Big Data, Data Centre Technology, and Storage Hardware
• Distributed File System
– File System
– Server/Client System
– Sun’s Network File System (NFS)
• Clustered File System (CFS)
– Google File System (GFS)
– Hadoop Distributed File System (HDFS)
– HDFS Shell Commands

Google File System
• Google File System (GFS or GoogleFS) is a scalable distributed file system developed by Google to provide efficient, reliable access to data using large clusters of commodity hardware.
• Shares many of the same goals as previous distributed file systems such as performance, scalability, reliability, and availability.
• GFS was internally used and became the creation basis of Hadoop Distributed File System (HDFS)
https://en.wikipedia.org/wiki/Google_File_System#/media/File:GoogleFileSystemGFS.svg

Design Assumptions
• Thousands of commodity computers – cheap hardware but distributed
• GFS has high component failure rates
– System is built from many inexpensive commodity components
• Modest number of huge files
– A few million files, each typically 100MB or larger (multi-GB files are common)
– No need to optimize for small files
• Workloads : two kinds of reads and writes
– Large streaming reads (1MB or more) and small random reads (a few KBs)
– Sequential appends to files by hundreds of data producers
• High sustained throughput is more important than latency
– Response time for individual reads and writes is not critical

GFS Design Overview
• Files stored as chunks
– with a fixed size of 64MB each
– with a 64-bit ID each
• Reliability through replication
– Each chunk is replicated across 3 (by default) or more chunk servers
– The replication number can be manually set
• Single Master
– Centralized management
– Only stores meta-data
[Figure: a file with four chunks C1–C4 and Replicas=2; each chunk is stored on two of the chunk servers, so every chunk has two copies across the cluster.]

GFS Architecture – Read

GFS Architecture – Write
1. Client asks the master which chunk server holds the current lease of the chunk and the locations of the other replicas.
2. Master replies with the identity of the primary and the locations of the secondary replicas.
3. Client pushes data to all replicas.
4. Once all replicas have acknowledged receiving the data, the client sends a write request to the primary. The primary assigns consecutive serial numbers to all the mutations it receives, providing serialization. It applies the mutations in serial number order.
5. Primary forwards the write request to all secondary replicas. They apply mutations in the same serial number order.
6. Secondary replicas reply to the primary indicating they have completed the operation.
7. Primary replies to the client with a success or error message.

GFS Components – Master
• Master maintains all system metadata
– Name space, access control info, file-to-chunk mappings, chunk locations, etc.
• Periodically communicates with chunk servers
– Through HeartBeat messages
• Advantages:
– Simplifies the design
• Disadvantages:
– Single point of failure
• Solution
– Replication of Master state on multiple machines
– Operational log and check points are replicated on multiple machines

GFS Components – Chunks
• Fixed size of 64MB (vs the 4KB cluster size of NTFS)
• Advantages
– Size of meta data is reduced
– Involvement of Master is reduced
– Network overhead is reduced
– Lazy space allocation avoids internal fragmentation
• Disadvantages
– Hot spots
 A small file consists of a small number of chunks, perhaps just one. The chunk servers storing those chunks may become hot spots if many clients access the same file.
 Solutions: increase the replication factor and stagger application start times; allow clients to read data from other clients.
https://support.microsoft.com/en-us/help/140365/default-cluster-size-for-ntfs-fat-and-exfat

GFS Components – Metadata and Operational Log
• Three major types of metadata
– The file and chunk namespaces
– The mapping from files to chunks
– Locations of each chunk’s replicas
• All the metadata is kept in the Master’s memory
• A 64MB chunk has about 64 bytes of metadata
• Chunk locations
– Chunk servers keep track of their chunks and relay data to Master through HeartBeat messages
• Master “operation log”
– Consists of namespaces and file-to-chunk mappings
– Replicated on remote machines
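A back-of-the-envelope check (our own arithmetic, not a figure from the GFS paper) shows why keeping all metadata in the Master's RAM is feasible: 1 PB stored as 64MB chunks gives 2^50 / 2^26 = 2^24 ≈ 16.8 million chunks; at roughly 64 bytes of metadata per chunk, that is 2^24 × 64 B = 1 GiB of Master memory per petabyte stored.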

Outline
• Background of Distributed File Systems
– Big Data, Data Centre Technology, and Storage Hardware
• Distributed File System
– File System
– Server/Client System
– Sun’s Network File System (NFS)
• Clustered File System (CFS)
– Google File System (GFS)
– Hadoop Distributed File System (HDFS)
– HDFS Shell Commands

Hadoop Distributed File System
• Apache Hadoop is a collection of open-source software utilities for dealing with big data problems; its distributed file system was described in detail by Shvachko et al. in 2010.
• The core of Apache Hadoop consists of a storage part, known as Hadoop Distributed File System (HDFS), and a processing part which is a MapReduce programming model.
– Hadoop splits files into large blocks and distributes them across nodes in a cluster.
– It then transfers packaged code into nodes.
– It takes advantage of data locality.
Shvachko, Konstantin, Hairong Kuang, Sanjay Radia, and Robert Chansler. “The Hadoop Distributed File System.” In MSST, vol. 10, pp. 1-10. 2010

Design Motivations
• Many inexpensive commodity hardware components; failures are very common
• Many big files: millions of files, ranging from MBs to GBs
• Two types of reads
– Large streaming reads
– Small random reads
• Once written, files are seldom modified
– Random writes are supported but do not have to be efficient
• High sustained bandwidth is more important than low latency

HDFS Component – NameNode
NameNode
• Equivalent to the Master node in GFS
• represents files and directories on the NameNode as inodes
• records attributes like permissions, modification and access times, namespace and disk space quotas
• maintains the namespace tree and the mapping of file blocks to DataNodes
• When writing data, the NameNode nominates a suite of three DataNodes to host the block replicas.
• The client then writes data to the DataNodes in a pipeline fashion.
• keeps meta-data in RAM

HDFS Component – Meta-Data
Meta-Data
• fsimage:
– contains the entire filesystem namespace at the latest checkpoint
– block information for each file (location, timestamp, etc.)
– folder information (ownership, access, etc.)
– stored as an image file in the NameNode’s local file system
• editlog:
– contains all the recent modifications made to the file system since the most recent fsimage
– create/update/delete requests from the client

HDFS Component – Checkpoint Node
Checkpoint Node (Secondary NameNode)
• “Secondary” does not mean a second NameNode that acts the same as, or similar to, the primary one.
• Regularly queries the primary for the fsimage and editlogs (2).
• The primary NameNode stops writing to the editlog and copies the edits and fsimage to the secondary NameNode.
• All new edits after that point are fed into edits.new (1).
• The copied edits and fsimage on the secondary NameNode are merged (3).
• The merged fsimage.ckpt is copied back to the primary NameNode and used as the new fsimage (edits.new becomes the latest editlog).
• Finally, the editlog file gets smaller and the fsimage gets updated.

HDFS Component – DataNode
DataNode
• Equivalent to a chunk server in GFS
• Each block replica contains two files:
– the data itself
– the block’s meta-data
• On startup, handshakes with the NameNode
– to verify the namespace ID & software version
• The namespace ID is persistently stored on DataNodes, thus preserving integrity.
• The internal storage ID is an identifier of the DataNode within the cluster (even if IP:port changes) and will never change after registration.

HDFS Component – DataNode
DataNode
• During normal operation, a DataNode also reports its available block replicas to the NameNode.
• sends heartbeats (by default every 3 secs) to the NameNode
• No heartbeats from a DataNode in ten minutes means:
– the DataNode is out of service
– the block replicas hosted by that DataNode are unavailable
– the NameNode schedules creation of new replicas of those blocks on other DataNodes
• receives maintenance commands from the NameNode indirectly (in replies to heartbeats):
– replicate blocks to other nodes;
– remove local block replicas;
– re-register or shut down the node;
– send an immediate block report.

HDFS Component – Client
HDFS Client
• User applications access the file system using the HDFS client
• User application knows nothing about data storage.
• Reading:
– first asks the NameNode for the list of DataNodes hosting the blocks
– then contacts a DataNode directly and requests the transfer
• Writing:
– first asks the NameNode to choose DataNodes to host replicas of the first block of the file, organizes a pipeline from node to node, and sends the data
– then requests new DataNodes to be chosen to host replicas of the next block, with a new pipeline
– Each choice of DataNodes is likely to be different.

HDFS – Architecture Revisit
Master/Slave (worker) architecture
[Figure: the HDFS client talks to the NameNode (which maintains the fsimage and is supported by a Secondary NameNode) and directly to DataNodes; the NameNode coordinates replication, balancing, and heartbeats across the DataNodes, each of which stores blocks on its local disks.]

HDFS Block Placement
• For a large cluster, it may not be practical to connect all nodes in a flat topology.
• A common practice is to spread the nodes across multiple racks.
• Nodes of a rack share a switch, and rack switches are connected by one or more core switches.
• Communication between two nodes in different racks has to go through multiple switches.
• In most cases, network bandwidth between nodes in the same rack is greater than network bandwidth between nodes in different racks.

HDFS Block Placement Policy
• The default HDFS block placement policy provides a tradeoff between minimizing the write cost, and maximizing data reliability, availability and aggregate read bandwidth.
• When a new block is created, the policy is as follows (a command for inspecting actual placements is sketched after this list):
– HDFS places the first replica on the node where the writer is located,
– the second and the third replicas on two different nodes in a different rack,
– and the rest are placed on random nodes with restrictions
 no more than one replica is placed at one node
 no more than two replicas are placed in the same rack when the number of replicas is less than twice the number of racks.
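To see where HDFS actually placed a file's blocks, fsck can list block locations. A minimal sketch; the path is illustrative, reusing the sales.txt example from the later shell-command slides:

  hdfs fsck /home/data/sales.txt -files -blocks -locations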

HDFS Replication – Over-replicated
• The NameNode detects that a block has become under- or over-replicated.
• When a block becomes over-replicated, the NameNode chooses a replica to remove:
– it first prefers not to reduce the number of racks that host replicas,
– it then prefers to remove a replica from the DataNode with the least amount of available disk space.
• The goal is to balance storage utilization across DataNodes without reducing the block’s availability.
[Figure: with Replicas=3, a block A that has four copies gets one replica removed.]

HDFS Replication – Under-replicated
• The NameNode detects that a block has become under- or over-replicated.
• When a block becomes under-replicated, it is put in the replication priority queue.
– Replication priority will be decided according to the number of replicas
– E.g. A block with only one replica has the highest priority.
– A background thread periodically scans the head of the replication queue to decide where to place new replicas.
[Figure: with Replicas=3, block E has too few replicas; the NameNode schedules a new replica of E on another DataNode.]
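The per-file replication factor can also be changed by hand with hdfs dfs -setrep, which triggers exactly these over-/under-replication paths. A minimal sketch; the path is illustrative:

  hdfs dfs -setrep -w 2 /home/data/sales.txt   # -w waits until re-replication finishes

Lowering the factor makes the NameNode remove surplus replicas; raising it puts the block in the replication priority queue as described above.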

HDFS vs GFS
• Differences:

Platform: HDFS is cross-platform (Linux, Mac, Windows); GFS runs on Linux.
Development: HDFS is developed in a Java environment; GFS is developed in a C/C++ environment.
Block/Chunk size: HDFS uses 128MB blocks; GFS uses 64MB chunks.
Nodes: HDFS has a NameNode and DataNodes; GFS has a Master node and chunk servers.
Log: HDFS uses an editlog; GFS uses an operational log.
Write operation: HDFS allows no more than one writer at a time; GFS can have multiple writers to one file at a time.
File deletion: in HDFS, deleted files are renamed into a particular folder and then removed via garbage collection; in GFS, deleted files are not reclaimed immediately, are renamed into a hidden namespace, and are deleted after three days if not in use.

Outline
• Background of Distributed File Systems
– Big Data, Data Centre Technology, and Storage Hardware
• Distributed File System
– File System
– Server/Client System
– Sun’s Network File System (NFS)
• Clustered File System (CFS)
– Google File System (GFS)
– Hadoop Distributed File System (HDFS)
– HDFS Shell Commands

HDFS – Shell Commands
There are two types of shell commands:
• User Commands
– hdfs dfs – runs filesystem commands on the HDFS (ls, du, df, etc.)
 Use “ls” to display files and directories:
hdfs dfs -ls
hdfs dfs -ls /
hdfs dfs -ls -R /dir
 Use “du” to display disk usage information:
hdfs dfs -du -h /
 Use “df” to display disk free information (see the sketch below)
– hdfs fsck – runs a HDFS filesystem checking command
• Administration Commands
– hdfs dfsadmin – runs HDFS administration commands
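The slide does not reproduce a “df” command line; a minimal sketch of the likely intended usage:

  hdfs dfs -df -h /     # free and used capacity of HDFS, human-readable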

HDFS – Shell Commands: COPY
• From Local to HDFS:
– hdfs dfs -copyFromLocal [path_to_local_file] [path_to_hdfs_location]
– Example:
 Make a directory in /home (this is the home directory on HDFS)
 Copy a data file on the local machine (VM) to the created folder on HDFS
 Check the copied file (see the sketch below)
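The original slide showed these steps as terminal screenshots; a minimal equivalent sketch, where the directory and file names are illustrative (reusing the sales.txt example from the following slides):

  hdfs dfs -mkdir -p /home/data                     # make a directory on HDFS
  hdfs dfs -copyFromLocal sales.txt /home/data/     # copy a local file to HDFS
  hdfs dfs -ls /home/data                           # check the copied file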

HDFS – Shell Commands: COPY
• From HDFS to Local:
– hdfs dfs -copyToLocal [path_to_hdfs_location] [path_to_local_file]
– Example:
 Copy “sales.txt” back to a new location on the local machine (VM)
 Check the copied file on the local machine (VM)
 Check the md5sum of “sales.txt” on HDFS
 Check the md5sum of “sales_new.txt” on the local machine (VM) (see the sketch below)
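A minimal sketch of those steps (paths illustrative). Note that md5sum cannot run directly on HDFS, so the HDFS copy is piped through cat:

  hdfs dfs -copyToLocal /home/data/sales.txt sales_new.txt
  ls -l sales_new.txt                             # check the local copy
  hdfs dfs -cat /home/data/sales.txt | md5sum     # checksum of the HDFS copy
  md5sum sales_new.txt                            # should print the same hash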

HDFS – Shell Commands DELETE
• To remove a file on HDFS:
– hdfs dfs -rm [option] [path]
– Example:
 Create a copy of sales.txt on HDFS
 Show the copy
 Delete the copy and show the result (see the sketch below)
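A minimal sketch; the copy's name sales2.txt is a hypothetical stand-in for the name used in the original screenshot:

  hdfs dfs -cp /home/data/sales.txt /home/data/sales2.txt
  hdfs dfs -ls /home/data                # show the copy
  hdfs dfs -rm /home/data/sales2.txt     # delete the copy
  hdfs dfs -ls /home/data                # show the result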

HDFS – Shell Commands
• You can use fsck to display some file information:
– hdfs fsck [path] [option]
– Example:
 hdfs fsck /home/data/sales
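fsck prints a health summary for the given path; abbreviated, illustrative output (the values are made up):

  Status: HEALTHY
   Total size: 1048576 B
   Total blocks (validated): 1 (avg. block size 1048576 B)
   Average block replication: 3.0
   Corrupt blocks: 0
  The filesystem under path '/home/data/sales' is HEALTHY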

HDFS – Shell Commands Administration
• You can use dfsadmin to check the status of HDFS:
– hdfs dfsadmin [option]
– Example:
 hdfs dfsadmin -report
 hdfs dfsadmin -printTopology
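-report summarises cluster capacity and then lists each DataNode; abbreviated, illustrative output (the numbers are made up):

  Configured Capacity: 53687091200 (50 GB)
  DFS Remaining: 42949672960 (40 GB)
  Live datanodes (3): ...

-printTopology lists each rack and the DataNodes attached to it, which is a quick way to check the rack-aware block placement described earlier.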

HDFS – Shell Commands Administration
• You can dump the NameNode fsimage to an XML file:
– hdfs oiv -i [fsimage file] -o [output file] -p XML
– Example:
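A minimal sketch; the fsimage path and its transaction-ID suffix are illustrative (real fsimage files live under the NameNode's configured dfs.namenode.name.dir):

  hdfs oiv -i /hadoop/dfs/name/current/fsimage_0000000000000000042 -o fsimage.xml -p XML
  less fsimage.xml     # browse the namespace as XML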

HDFS – Graphical User Interface
• You can visit the GUI using the HTTP protocol:
– http://IP:50070
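The same port also serves the WebHDFS REST API when dfs.webhdfs.enabled is set, so the filesystem can be queried over plain HTTP as well. A minimal sketch with an illustrative host name and path:

  curl -i "http://namenode-host:50070/webhdfs/v1/home/data?op=LISTSTATUS"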

References
1. HDFS Tutorial – A Complete Hadoop HDFS Overview. https://data-flair.training/blogs/hadoop-hdfs-tutorial/
2. HDFS Overview. https://www.tutorialspoint.com/hadoop/hadoop_hdfs_overview.htm
3. Machine Learning Library (MLlib) Guide. https://spark.apache.org/docs/latest/ml-guide.html
4. https://www.bmc.com/blogs/using-logistic-regression-scala-spark/
5. http://www.cse.chalmers.se/~tsigas/Courses/DCDSeminar/Files/afs_report.pdf
6. https://courses.cs.washington.edu/courses/cse490h/11wi/CSE490H_files/gfs.pdf
7. https://www.slideshare.net/YuvalCarmel/gfs-vs-hdfs
8. https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-common/FileSystemShell.html
