Big Data HDFS Lecture Notes (5) - 阿南达文事网

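Contents

7. HDFS Java API operations
- Creating a Maven project and importing the JARs
- Accessing data through the FileSystem API (essential)
  - Ways to obtain a FileSystem
  - Recursively traversing all files in the file system
  - Traversing with the official API
  - Downloading a file to the local machine
  - Creating a directory on HDFS
  - Uploading a file to HDFS
  - Basic Java API operations
- HDFS permissions and impersonating a user
- Merging small files on HDFS
- HDFS Web UI overview
  - Overview
  - Summary
  - Datanodes
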
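7. HDFS Java API operations

Because of licensing issues, the CDH builds of these components are not all published to the central Maven repository; they are hosted on Cloudera's own servers instead. A default Maven setup therefore cannot download them, and you have to add the CDH repository to the project yourself. The two references below come from the official documentation; read them carefully.

Official documentation: CDH 5 Maven Repository

CDH repository download page: Maven Artifacts for CDH 5.14.x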

Creating a Maven project and importing the JARs

<repositories>
    <repository>
        <id>cloudera</id>
        <url>/</url>
    </repository>
</repositories>
<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.6.0-mr1-cdh5.14.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.6.0-cdh5.14.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>2.6.0-cdh5.14.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-core</artifactId>
        <version>2.6.0-cdh5.14.0</version>
    </dependency>
    <!--  -->
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.11</version>
        <scope>test</scope>
    </dependency>
    <dependency>
        <groupId>org.testng</groupId>
        <artifactId>testng</artifactId>
        <version>RELEASE</version>
    </dependency>
</dependencies>
<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.0</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
                <encoding>UTF-8</encoding>
                <!--    <verbal>true</verbal>-->
            </configuration>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>2.4.3</version>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                    <configuration>
                        <minimizeJar>true</minimizeJar>
                    </configuration>
                </execution>
            </executions>
        </plugin>
        <!--
        <plugin>
            <artifactId>maven-assembly-plugin</artifactId>
            <configuration>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
                <archive>
                    <manifest>
                        <mainClass>cn.itcast.hadoop.db.DBToHdfs2</mainClass>
                    </manifest>
                </archive>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
        -->
    </plugins>
</build>

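Accessing data via URL (for reference)

@Test
public void demo1() throws Exception {
    // Step 1: register the HDFS URL handler so that java.net.URL recognizes hdfs:// URLs.
    // The JVM allows this factory to be set only once.
    URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    InputStream inputStream = null;
    FileOutputStream outputStream = null;
    // URL of the file to read
    String url = "hdfs://192.168.52.100:8020/test/input/install.log";
    try {
        // Open an input stream on the HDFS file and copy it to a local file.
        // IOUtils here is org.apache.commons.io.IOUtils.
        inputStream = new URL(url).openStream();
        outputStream = new FileOutputStream(new File("c:\\hello.txt"));
        IOUtils.copy(inputStream, outputStream);
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        IOUtils.closeQuietly(inputStream);
        IOUtils.closeQuietly(outputStream);
    }
}

If running the code reports an error, see the accompanying materials for a fix, or simply ignore it; it does not affect program execution. Remember to restart your IDE after configuring the environment variables.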

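Accessing data through the FileSystem API (essential)

Working with HDFS from Java mainly involves the following classes:
Configuration: an object of this class encapsulates the client or server configuration.
FileSystem: an object of this class represents a file system, and its methods operate on files. You obtain one through the static method FileSystem.get:
FileSystem fs = FileSystem.get(conf);
The get method decides which kind of file system to return from the value of the fs.defaultFS parameter in conf. If the code does not set fs.defaultFS and no configuration file on the project classpath supplies it, conf falls back to the default from core-default.xml inside the Hadoop JAR, which is file:///. In that case you get a client for the local file system rather than a DistributedFileSystem instance.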
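The fallback described above can be illustrated without a cluster. The following is only a toy model, assuming a plain Map stands in for Configuration; the class DefaultFsSketch and its describe method are hypothetical names, and this is not Hadoop's actual lookup code. It shows how the URI scheme of fs.defaultFS (or of the file:/// default) decides which kind of client you would get.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Toy model only: mimics the scheme-based choice described in the text,
// it is NOT Hadoop's real implementation.
class DefaultFsSketch {
    // Default named in the text: core-default.xml sets fs.defaultFS to file:///
    static final String DEFAULT_FS = "file:///";

    // Returns a label for the kind of file system the given settings would select.
    static String describe(Map<String, String> conf) {
        String value = conf.get("fs.defaultFS");
        if (value == null) {
            value = DEFAULT_FS; // nothing configured: fall back to the local default
        }
        String scheme = URI.create(value).getScheme();
        if ("hdfs".equals(scheme)) {
            return "distributed (hdfs)";
        }
        if ("file".equals(scheme)) {
            return "local (file)";
        }
        return "other (" + scheme + ")";
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>(); // hypothetical stand-in for Configuration
        System.out.println(describe(conf));         // no fs.defaultFS set
        conf.put("fs.defaultFS", "hdfs://192.168.52.100:8020");
        System.out.println(describe(conf));         // fs.defaultFS points at the NameNode
    }
}
```

With an empty map the sketch reports the local file system; once fs.defaultFS carries an hdfs:// address it reports the distributed one, which mirrors why the examples in these notes always pass or set the NameNode address explicitly.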

Ways to obtain a FileSystem

Method 1: pass the NameNode URI to FileSystem.get
@Test
public void getFileSystem() throws URISyntaxException, IOException {
    Configuration configuration = new Configuration();
    FileSystem fileSystem = FileSystem.get(new URI("hdfs://192.168.52.100:8020"), configuration);
    System.out.println(fileSystem.toString());
}

Method 2: set fs.defaultFS in the configuration
@Test
public void getFileSystem2() throws URISyntaxException, IOException {
    Configuration configuration = new Configuration();
    configuration.set("fs.defaultFS", "hdfs://192.168.52.100:8020");
    // The URI "/" has no scheme, so the file system is taken from fs.defaultFS
    FileSystem fileSystem = FileSystem.get(new URI("/"), configuration);
    System.out.println(fileSystem.toString());
}

Method 3: pass the NameNode URI to FileSystem.newInstance
@Test
public void getFileSystem3() throws URISyntaxException, IOException {
    Configuration configuration = new Configuration();
    // Unlike get, which may return a cached instance, newInstance always creates a new one
    FileSystem fileSystem = FileSystem.newInstance(new URI("hdfs://192.168.52.100:8020"), configuration);
    System.out.println(fileSystem.toString());
}

Method 4: set fs.defaultFS and call FileSystem.newInstance
@Test
public void getFileSystem4() throws Exception {
    Configuration configuration = new Configuration();
    configuration.set("fs.defaultFS", "hdfs://192.168.52.100:8020");
    FileSystem fileSystem = FileSystem.newInstance(configuration);
    System.out.println(fileSystem.toString());
}

Recursively traversing all files in the file system

Traversing the HDFS file system with hand-written recursion:

@Test
public void listFile() throws Exception {
    FileSystem fileSystem = FileSystem.get(new URI("hdfs://192.168.52.100:8020"), new Configuration());
    FileStatus[] fileStatuses = fileSystem.listStatus(new Path("/"));
    for (FileStatus fileStatus : fileStatuses) {
        if (fileStatus.isDirectory()) {
            // Descend into the directory
            Path path = fileStatus.getPath();
            listAllFiles(fileSystem, path);
        } else {
            System.out.println("File path: " + fileStatus.getPath().toString());
        }
    }
}

// Prints every file under path, recursing into subdirectories
public void listAllFiles(FileSystem fileSystem, Path path) throws Exception {
    FileStatus[] fileStatuses = fileSystem.listStatus(path);
    for (FileStatus fileStatus : fileStatuses) {
        if (fileStatus.isDirectory()) {
            listAllFiles(fileSystem, fileStatus.getPath());
        } else {
            Path path1 = fileStatus.getPath();
            System.out.println("File path: " + path1);
        }
    }
}

Traversing with the official API

/**
 * Recursive traversal using the API that Hadoop provides
 * @throws Exception
 */
@Test
public void listMyFiles() throws Exception {
    // Obtain the FileSystem
    FileSystem fileSystem = FileSystem.get(new URI("hdfs://192.168.52.100:8020"), new Configuration());
    // listFiles returns a RemoteIterator over the files under the given path;
    // the first argument is the path to traverse, the second enables recursion
    RemoteIterator<LocatedFileStatus> locatedFileStatusRemoteIterator = fileSystem.listFiles(new Path("/"), true);
    while (locatedFileStatusRemoteIterator.hasNext()) {
        LocatedFileStatus next = locatedFileStatusRemoteIterator.next();
        System.out.println(next.getPath().toString());
    }
    fileSystem.close();
}

Downloading a file to the local machine

/**
 * Copy a file from HDFS to the local machine
 * @throws Exception
 */
@Test
public void getFileToLocal() throws Exception {
    FileSystem fileSystem = FileSystem.get(new URI("hdfs://192.168.52.100:8020"), new Configuration());
    // Open the HDFS file for reading and a local file for writing
    FSDataInputStream open = fileSystem.open(new Path("/test/input/install.log"));
    FileOutputStream fileOutputStream = new FileOutputStream(new File("c:\\install.log"));
    IOUtils.copy(open, fileOutputStream);
    IOUtils.closeQuietly(open);
    IOUtils.closeQuietly(fileOutputStream);
    fileSystem.close();
}

Creating a directory on HDFS

@Test
public void mkdirs() throws Exception {
    FileSystem fileSystem = FileSystem.get(new URI("hdfs://192.168.52.100:8020"), new Configuration());
    // Creates the directory together with any missing parent directories
    boolean mkdirs = fileSystem.mkdirs(new Path("/hello/mydir/test"));
    fileSystem.close();
}

Uploading a file to HDFS

@Test
public void putData() throws Exception {
    FileSystem fileSystem = FileSystem.get(new URI("hdfs://192.168.52.100:8020"), new Configuration());
    // First argument: local source file; second argument: destination on HDFS
    fileSystem.copyFromLocalFile(new Path("file:///c:\\install.log"), new Path("/hello/mydir/test"));
    fileSystem.close();
}

Basic Java API operations

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;

import java.net.URI;
import java.net.URISyntaxException;

public class demo {
    // Encapsulates the client or server configuration
    static Configuration conf = new Configuration();

    // List everything directly under a given path
    public static void listStatus() throws Exception {
        // A FileSystem object represents one file system
        FileSystem hdfs = FileSystem.get(new URI("hdfs://192.168.100.201:8020"), conf);
        // Get the status of every entry in the directory
        FileStatus[] stats = hdfs.listStatus(new Path("/"));
        // Print each path
        for (int i = 0; i < stats.length; ++i)
            System.out.println(stats[i].getPath().toString());
        hdfs.close();
    }

    // Rename
    public static void rename() throws Exception {
        FileSystem hdfs = FileSystem.get(new URI("hdfs://192.168.100.201:8020"), conf);
        Path fromPath = new Path("/aaa");
        Path toPath = new Path("/aaaaaaa");
        boolean isRename = hdfs.rename(fromPath, toPath);
        String result = isRename ? "Rename succeeded!" : "Rename failed!";
        System.out.println(result);
    }

    // Get a file's modification time (milliseconds since the epoch)
    public static void GetTime() throws Exception {
        FileSystem hdfs = FileSystem.get(new URI("hdfs://192.168.100.201:8020"), conf);
        FileStatus fileStatus = hdfs.getFileStatus(new Path("/yarn-daemons.txt"));
        long modiTime = fileStatus.getModificationTime();
        System.out.println(modiTime);
    }

    // Delete a file or directory (true = recursive)
    public static void deletefile() throws Exception {
        FileSystem hdfs = FileSystem.get(new URI("hdfs://192.168.100.201:8020"), conf);
        boolean isDeleted = hdfs.delete(new Path("/user/new"), true);
        System.out.println("Delete?" + isDeleted);
    }

    // Create a directory
    public static void mkdir() throws Exception {
        FileSystem hdfs = FileSystem.get(new URI("hdfs://192.168.100.201:8020"), conf);
        boolean bool2 = hdfs.mkdirs(new Path("/user/new"));
        if (bool2) {
            System.out.println("Directory created!");
        } else {
            System.out.println("Directory creation failed!");
        }
    }

    // Create a file and write data into it
    public static void AddFile() throws Exception {
        FileSystem hdfs = FileSystem.get(new URI("hdfs://192.168.100.201:8020"), conf);
        byte[] buff = "hello hadoop world!\r\n hadoop ".getBytes();
        FSDataOutputStream outputStream = hdfs.create(new Path("/tmp/file.txt"));
        outputStream.write(buff, 0, buff.length);
        outputStream.close();
    }

    // Upload a local file
    public static void put() throws Exception {
        FileSystem hdfs = FileSystem.get(new URI("hdfs://192.168.100.201:8020"), conf);
        Path src = new Path("C:/123.py");
        Path dst = new Path("/");
        hdfs.copyFromLocalFile(src, dst);
    }

    // Check whether a path exists
    public static void check() throws Exception {
        FileSystem hdfs = FileSystem.get(new URI("hdfs://192.168.100.201:8020"), conf);
        Path findf = new Path("/abc");
        boolean isExists = hdfs.exists(findf);
        System.out.println("Exist?" + isExists);
    }
}

HDFS permissions and impersonating a user

First stop the HDFS cluster by running the following commands on node01:
[root@node01 ~]# cd /export/servers/hadoop-2.6.0-cdh5.14.0
[root@node01 hadoop-2.6.0-cdh5.14.0]# sbin/stop-dfs.sh

Edit hdfs-site.xml on node01:
[root@node01 ~]# cd /export/servers/hadoop-2.6.0-cdh5.14.0/etc/hadoop
[root@node01 hadoop]# vim hdfs-site.xml

Add the following property to hdfs-site.xml to enable permission checking (every node needs this setting):
<property>
    <name>dfs.permissions</name>
    <value>true</value>
</property>

Where there are permissions, there must be an administrator:
	The Linux superuser is root
	The HDFS superuser is hdfs (on some big data platforms)

After editing, copy the configuration file to the other machines:
scp hdfs-site.xml node02:$PWD
scp hdfs-site.xml node03:$PWD

Restart the HDFS cluster:
[root@node01 ~]# cd /export/servers/hadoop-2.6.0-cdh5.14.0
[root@node01 hadoop-2.6.0-cdh5.14.0]# sbin/start-dfs.sh

Upload a few files to the Hadoop cluster to use for testing:
[root@node01 ~]# cd /export/servers/hadoop-2.6.0-cdh5.14.0/etc/hadoop
[root@node01 hadoop]# hdfs dfs -mkdir /config
[root@node01 hadoop]# hdfs dfs -put *.xml /config
[root@node01 hadoop]# hdfs dfs -chmod 600 /config/core-site.xml

Downloading the file from code

@Test
public void getConfig() throws Exception {
    // The third argument is the user name to act as; here we impersonate root
    FileSystem fileSystem = FileSystem.get(new URI("hdfs://192.168.52.100:8020"), new Configuration(), "root");
    fileSystem.copyToLocalFile(new Path("/config/core-site.xml"), new Path("file:///c:/core-site.xml"));
    fileSystem.close();
}
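To make the effect of the chmod 600 step concrete, here is a small sketch of how a 9-bit mode maps to the familiar rwx string and who may read the file. It is a simplified POSIX-style model written for illustration: the class ModeSketch and its methods are hypothetical names, and it deliberately ignores the HDFS superuser and any ACLs.

```java
// Simplified POSIX-style model for illustration; real HDFS permission
// checking also involves the superuser and optional ACLs.
class ModeSketch {
    // Renders a 9-bit mode such as 0600 as a string like "rw-------".
    static String toSymbolic(int mode) {
        char[] flags = {'r', 'w', 'x'};
        StringBuilder sb = new StringBuilder();
        for (int bit = 8; bit >= 0; bit--) {
            boolean set = ((mode >> bit) & 1) == 1;
            sb.append(set ? flags[(8 - bit) % 3] : '-');
        }
        return sb.toString();
    }

    // Selects the owner, group, or other triple and tests its read bit.
    static boolean canRead(int mode, boolean isOwner, boolean inGroup) {
        int shift = isOwner ? 6 : (inGroup ? 3 : 0);
        return ((mode >> shift) & 4) != 0;
    }

    public static void main(String[] args) {
        int mode = 0600; // octal literal, the mode set by "hdfs dfs -chmod 600"
        System.out.println(toSymbolic(mode));            // rw-------
        System.out.println(canRead(mode, true, false));  // owner: true
        System.out.println(canRead(mode, false, false)); // other users: false
    }
}
```

Under this model only the owner can read /config/core-site.xml, which is why the download test passes "root" (the user that uploaded the file) as the user name.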

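Merging small files on HDFS

Hadoop is good at storing large files because a large file needs relatively little metadata. If a cluster holds a huge number of small files, every one of them needs its own metadata entry, which greatly increases the memory pressure on the NameNode that manages that metadata. In practice, therefore, small files should be merged into large files for processing whenever that is feasible.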
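A rough calculation shows why small files are costly. The sketch below rests on two assumptions that do not come from this article: a 128 MB block size, and the commonly quoted rule of thumb of about 150 bytes of NameNode heap per file or block object. The class SmallFileSketch and its method names are hypothetical, and the numbers are estimates, not measurements.

```java
// Back-of-the-envelope estimate only; both constants are assumptions.
class SmallFileSketch {
    static final long BLOCK_SIZE = 128L * 1024 * 1024; // assumed HDFS block size
    static final long BYTES_PER_OBJECT = 150;          // assumed rule-of-thumb heap cost per object

    // One metadata object per file plus one per block it occupies
    // (assumes non-empty files, so each file has at least one block).
    static long objects(long fileCount, long fileSizeBytes) {
        long blocksPerFile = Math.max(1, (fileSizeBytes + BLOCK_SIZE - 1) / BLOCK_SIZE);
        return fileCount * (1 + blocksPerFile);
    }

    // Estimated NameNode heap consumed by that many objects.
    static long estimatedHeapBytes(long objectCount) {
        return objectCount * BYTES_PER_OBJECT;
    }

    public static void main(String[] args) {
        long oneBigFile = objects(1, 1024L * 1024 * 1024);  // a single 1 GiB file
        long manySmall = objects(10_000, 100L * 1024);      // 10,000 files of 100 KiB each
        System.out.println(oneBigFile + " objects, ~" + estimatedHeapBytes(oneBigFile) + " bytes");
        System.out.println(manySmall + " objects, ~" + estimatedHeapBytes(manySmall) + " bytes");
    }
}
```

Under these assumptions, roughly the same amount of data costs 9 metadata objects as one large file but 20,000 objects as small files, which is the pressure the paragraph above describes.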
From the HDFS shell, you can merge many HDFS files into one large file while downloading it to the local machine. The command is:

[root@node01 ~]# cd /export/servers
[root@node01 servers]# hdfs dfs -getmerge /config/*.xml  ./hello.xml

Since small files can be merged into one large file on download, they can equally be merged into one large file on upload.
The code is as follows:

/**
 * Upload several local files to HDFS, merging them into one large file
 * @throws Exception
 */
@Test
public void mergeFile() throws Exception {
    // Obtain the distributed file system, acting as user root
    FileSystem fileSystem = FileSystem.get(new URI("hdfs://192.168.52.100:8020"), new Configuration(), "root");
    FSDataOutputStream outputStream = fileSystem.create(new Path("/bigfile.xml"));
    // Obtain the local file system
    LocalFileSystem local = FileSystem.getLocal(new Configuration());
    // List the files in the local directory that holds the small files
    FileStatus[] fileStatuses = local.listStatus(new Path("file:///F:\\上传小文件合并"));
    for (FileStatus fileStatus : fileStatuses) {
        // Append each small file to the single HDFS output stream
        FSDataInputStream inputStream = local.open(fileStatus.getPath());
        IOUtils.copy(inputStream, outputStream);
        IOUtils.closeQuietly(inputStream);
    }
    IOUtils.closeQuietly(outputStream);
    local.close();
    fileSystem.close();
}

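Summary
HDFS -> Local: use the -getmerge command that HDFS provides
Local -> HDFS: iterate over the small files, append each one to a single file, and upload that (the small files themselves are not on HDFS)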
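The append-into-one-stream pattern can be tried without a cluster. The following is a local stand-in that uses only the JDK, with java.nio.file in place of the Hadoop FileSystem classes; the class LocalMergeSketch, its methods, and the temporary file names are hypothetical and exist only for this illustration.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Local, JDK-only stand-in for the merge-on-upload idea; no HDFS involved.
class LocalMergeSketch {
    // Appends every regular file in dir (sorted by name) to target,
    // which is created or overwritten.
    static void merge(Path dir, Path target) {
        List<Path> parts = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path p : stream) {
                if (Files.isRegularFile(p)) {
                    parts.add(p);
                }
            }
            Collections.sort(parts); // deterministic order
            try (OutputStream out = Files.newOutputStream(target)) {
                for (Path p : parts) {
                    Files.copy(p, out); // append this file's bytes to the shared stream
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Self-contained demo on temporary files; returns the merged content.
    static String demo() {
        try {
            Path dir = Files.createTempDirectory("small-files");
            Files.write(dir.resolve("a.xml"), "<a/>\n".getBytes(StandardCharsets.UTF_8));
            Files.write(dir.resolve("b.xml"), "<b/>\n".getBytes(StandardCharsets.UTF_8));
            Path target = Files.createTempFile("bigfile", ".xml");
            merge(dir, target);
            return new String(Files.readAllBytes(target), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(demo());
    }
}
```

In the HDFS version the only structural difference is that the shared output stream comes from fileSystem.create(...) rather than from a local file, so the merged bytes land directly on the cluster.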

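HDFS Web UI overview

After starting the HDFS cluster and opening the HDFS Web UI at http://master:50070/, we often use Browse the file system under Utilities to look at the files in HDFS.

All files under the HDFS root directory are then listed.

This is one of the common ways to inspect HDFS files, and it is quite convenient.

Once the HDFS cluster is running, it can be reached at http://master:50070, where master is the host name of the machine running the NameNode. The HDFS Web UI consists of seven main modules.

This article looks in detail at three of them: Overview, Datanodes, and Utilities.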

Overview

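Item 1, master:9999, is the base address of the current HDFS cluster. The value is read from fs.defaultFS in core-site.xml.
Item 2, Started, is the time at which the cluster was started.
Item 3, Version, is the Hadoop version in use; here it is Hadoop 2.7.5.
Item 4, Compiled, shows when the Hadoop distribution (hadoop-2.7.5.tar.gz) was compiled and packaged, and by whom.
Item 5, Cluster ID, is the unique ID of the current HDFS cluster.
Item 6, Block Pool ID, identifies the block pool of the current NameNode. With HDFS Federation, one HDFS cluster can be configured with several NameNodes, and each NameNode is assigned its own Block Pool ID.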

Summary

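Item 1, Security is off, means the HDFS cluster is running without security enabled.
Item 2, Safemode is off, means the cluster is not in safe mode. If it shows Safemode is on, the cluster is in safe mode and cannot be used normally.
Item 3 shows that the cluster holds 3846 files and directories and 1452 blocks, so the NameNode keeps 3846 + 1452 = 5298 file system objects in memory.
Item 4 shows that the NameNode's heap memory is 312 MB, of which 287.3 MB is used; the maximum heap size is 889 MB.
Item 5 shows the NameNode's non-heap memory usage: 61.44 MB is committed and 60.36 MB is used. There is no maximum for non-heap memory, but non-heap plus heap memory cannot exceed the maximum memory requested by the JVM (1000 MB by default).
Item 6, Configured Capacity, is the total disk capacity of the HDFS cluster. It is calculated as Total Disk Space - Reserved Space, where Total Disk Space is the total size of the disks on the DataNode machines and Reserved Space is space set aside for operating-system-level use. Reserved Space is configured in hdfs-site.xml through dfs.datanode.du.reserved (default 0).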
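The two pieces of arithmetic behind items 3 and 4 can be written out as a small sketch. It uses only the figures quoted above; the class NameNodeSummarySketch and its method names are hypothetical.

```java
// Arithmetic behind the Summary figures quoted in the text.
class NameNodeSummarySketch {
    // File system objects held in NameNode memory: files and directories plus blocks.
    static long totalObjects(long filesAndDirs, long blocks) {
        return filesAndDirs + blocks;
    }

    // Share of the committed heap that is currently in use, in percent.
    static double heapUsedPercent(double usedMb, double committedMb) {
        return 100.0 * usedMb / committedMb;
    }

    public static void main(String[] args) {
        System.out.println(totalObjects(3846, 1452));    // 5298
        System.out.println(heapUsedPercent(287.3, 312)); // roughly 92 percent
    }
}
```

A heap that is about 92% used out of the committed 312 MB is not alarming by itself, because the JVM can still grow the heap up to the 889 MB maximum.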

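Why is the total capacity here 33.97 GB? Checking the disk usage of the two slaves with du -h shows about 17 GB on each.

17 GB + 17 GB = 34 GB, and no Reserved Space is configured, so the total HDFS capacity is 33.97 GB (the small difference can be ignored).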
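The capacity calculation can be sketched as follows, using the 17 GB per slave figure from the text. The whole-gigabyte units, the class CapacitySketch, and the 1 GB reservation in the second call are assumptions made only for this illustration.

```java
// Configured Capacity = sum over DataNodes of (total disk space - reserved space).
class CapacitySketch {
    // All values in whole gigabytes for simplicity (an assumption of this sketch).
    static long configuredCapacity(long[] totalPerNodeGb, long reservedPerNodeGb) {
        long sum = 0;
        for (long total : totalPerNodeGb) {
            // A node cannot contribute negative capacity
            sum += Math.max(0, total - reservedPerNodeGb);
        }
        return sum;
    }

    public static void main(String[] args) {
        long[] slaves = {17, 17};                          // the two slaves from the text
        System.out.println(configuredCapacity(slaves, 0)); // 34, nothing reserved
        System.out.println(configuredCapacity(slaves, 1)); // 32, hypothetical 1 GB reserved per node
    }
}
```

With a non-zero dfs.datanode.du.reserved, each DataNode's contribution shrinks by the reserved amount, so the figure on the Summary page drops accordingly.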

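Item 7, DFS Used, is the disk capacity HDFS has already consumed, in other words the total size of the files on HDFS (including every replica of every block).
Item 8, Non DFS Used, is the disk capacity on the DataNodes taken up by data that is not under the configured dfs.datanode.data.dir, that is, the space occupied by non-HDFS files.

The dfs.datanode.data.dir setting is the directory in which a DataNode stores its data.

Item 9, DFS Remaining = Configured Capacity - DFS Used - Non DFS Used. This is the capacity actually still available to HDFS.
Item 10, Block Pool Used, is the disk capacity used by the current block pool.
Item 11, DataNodes usages%, summarizes disk usage across all DataNodes (min/median/max/standard deviation).
Item 12, Live Nodes, is the number of DataNodes that are alive. Decommissioned counts the DataNodes that have been taken out of service.
Item 13, Dead Nodes, is the number of DataNodes that are dead. Decommissioned counts the DataNodes that have been taken out of service.
Item 14, Decommissioning Nodes, is the number of DataNodes currently being decommissioned.
Item 15, Total Datanode Volume Failures, is the number of failed storage volumes across the DataNodes, with the estimated capacity lost.
Item 16, Number of Under-Replicated Blocks, is the number of blocks that have fewer replicas than required.
Item 17, Number of Blocks Pending Deletion, is the number of blocks waiting to be deleted.
Item 18, Block Deletion Start Time, is the time from which blocks may be deleted. It equals the cluster start time plus the value of dfs.namenode.startup.delay.block.deletion.sec, which defaults to 0 seconds.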
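Two of the items above are simple formulas, sketched here for concreteness. The megabyte figures and the timestamp are hypothetical values chosen for the example, and the class SummaryFormulaSketch is an illustrative name, not part of Hadoop.

```java
import java.time.Instant;

// Worked versions of two formulas from the Summary section.
class SummaryFormulaSketch {
    // DFS Remaining = Configured Capacity - DFS Used - Non DFS Used (same unit throughout).
    static long dfsRemaining(long configured, long dfsUsed, long nonDfsUsed) {
        return configured - dfsUsed - nonDfsUsed;
    }

    // Block Deletion Start Time = cluster start time + configured delay in seconds.
    static Instant blockDeletionStart(Instant startup, long delaySeconds) {
        return startup.plusSeconds(delaySeconds);
    }

    public static void main(String[] args) {
        // Hypothetical figures in MB: 34000 configured, 1000 DFS used, 3000 non-DFS used
        System.out.println(dfsRemaining(34_000, 1_000, 3_000)); // 30000

        Instant startup = Instant.parse("2020-01-01T00:00:00Z"); // hypothetical start time
        System.out.println(blockDeletionStart(startup, 0));      // default delay: same as startup
        System.out.println(blockDeletionStart(startup, 3600));   // one hour later
    }
}
```

With the default delay of 0 seconds the Block Deletion Start Time shown in the UI is simply the cluster start time.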

Datanodes

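The Admin State column deserves an explanation. It can take the following values:

1. In Service: the DataNode is operating normally
2. Decommission In Progress: the DataNode is being decommissioned
3. Decommissioned: the DataNode has been decommissioned
4. Entering Maintenance: the DataNode is entering maintenance
5. In Maintenance: the DataNode is in maintenance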

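Here we cover Browse the file system in detail; Logs is explained in the summary on viewing HDFS logs.
Clicking Browse the file system opens the file browser page.

The fields on that page have the following meanings:
Permission: the permissions of the file or directory, following the same rules as Linux file permissions
Owner: the owner of the file or directory
Group: the group that the file or directory belongs to
Size: the size of the file or directory; for a directory it always shows 0 B
Last Modified: the time the file or directory was last modified
Replication: the replication factor of the file; for a directory it always shows 0
Block Size: the block size of the file; for a directory it always shows 0 B
Name: the name of the file or directory

Clicking a Name opens the corresponding directory or file:
For a directory, this lists the files and subdirectories it contains.
For a file, this shows the file's details, for example when opening the file /user/omneo.csv: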

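Journal Manager: the path where the JournalNode stores EditLog data
State: the name of the file in which the JournalNode stores EditLog data

NameNode storage paths
Path where the NameNode stores edits:
/export/servers/hadoop-2.6.0-cdh5.14.0/hadoopDatas/dfs/nn/edits
Path where the NameNode stores the fsimage:
/export/servers/hadoop-2.6.0-cdh5.14.0/hadoopDatas/namenodeDatas

Storage Type: the cluster storage type, DISK
Configured Capacity: configured capacity, 135.01 GB
Capacity Used: capacity in use, 355.01 MB (0.26%)
Capacity Remaining: remaining capacity, 117.49 GB (87.02%)
Block Pool Used: space used by the block pool, 355.01 MB
Nodes In Service: number of nodes in service, 3

Datanode usage histogram: histogram of DataNode usage
Disk usage of each DataNode (%): disk usage percentage of each DataNode
In operation: nodes currently running

Entering Maintenance: list of nodes entering maintenance
Decommissioning: list of nodes being decommissioned

Snapshot Summary: summary of snapshots
Snapshottable directories: directories that allow snapshots, 2
Snapshotted directories: snapshots that have been created, 4

Startup Progress: the fsimage and edits loaded when the cluster starts
fsimage loaded at startup: fsimage_0000000000000000537
edits loaded at startup: edits_0000000000000000538-0000000000000000538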
