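How do you really understand Docker? Build one by hand.
This is no idle boast: the core technology boils down to a few kernel features, namely namespaces, cgroups, and UnionFS.
Implementing a tiny-docker is also a great way to learn about the OS!
Enough talk, show me the code!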
Simply put, a container is another process on your machine that has
been isolated from all other processes on the host machine. That
isolation leverages kernel namespaces and cgroups, features that have
been in Linux for a long time. Docker has worked to make these
capabilities approachable and easy to use.
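Commands
ps, pstree -pan (tree view)
Some options of ps:
a: all processes attached to a terminal
u: show each process's owning user and its memory usage
x: also show processes that are not attached to a terminal
j: show each process's process group ID, session ID, and parent process ID
f: draw the process hierarchy as ASCII art
Common combinations: ps aux (focus on the processes themselves) / ps axjf (focus on the relationships between processes)
Both work even better when piped into grep.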
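Process management
A program is just a blob of binary code. Once it is executed, it becomes a process. Every process is identified by a unique integer, its pid.
One program can of course spawn many processes. The fork function creates a new process while keeping the original one running; in other words, the parent process spawns a child process.
To execute a file, use the exec family of functions (execvp, for example). A program usually ends when main returns 0 or when it calls exit().
When a child finishes, the parent has to reclaim its resources with wait or waitpid. What if it doesn't? The child becomes a zombie process (its task_struct in the kernel is never released). A zombie cannot be removed by killing it, since it is already dead; it disappears once the parent reaps it or the parent itself terminates. If the parent terminates before the child does, the child is adopted by the init process (pid = 1).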
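The fork/wait lifecycle described above can be sketched as a small helper. This is only an illustration and not part of mocker; the name `reap_child_exit_code` is made up for this example. The child exits immediately with a given code, and the parent reaps it with `waitpid` and decodes the status with `WIFEXITED` / `WEXITSTATUS`.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper (not part of mocker): fork a child that exits with
// `code`, reap it in the parent, and return the exit code the parent saw.
// Returns -1 if fork/waitpid fails or the child did not exit normally.
int reap_child_exit_code(int code) {
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(code);  // child: terminate right away, skipping atexit handlers
    int status = 0;
    if (waitpid(pid, &status, 0) < 0)  // parent: this is the "reaping" step
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Between the child's `_exit` and the parent's `waitpid`, the child is exactly the kind of zombie described above; if the parent slept in between, `ps` would list the child as defunct during that window.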
The first step toward Docker is a parrot: a program that simply runs whatever command it is handed!

    #include <iostream>
    #include <string>
    #include <cstring>
    #include <cstdlib>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    using namespace std;

    static string cmd(int argc, char **argv);
    static void run(int argc, char **argv);
    static void run_child(int argc, char **argv);

    // Usage: ./mocker run <command> [args...]
    int main(int argc, char **argv) {
        if (argc < 3) {
            cerr << "Too few arguments" << endl;
            exit(-1);
        }
        if (!strcmp(argv[1], "run"))
            run(argc - 2, &argv[2]);
        return 0;
    }

    // Fork a child, let it exec the command, and wait for it to finish.
    static void run(int argc, char **argv) {
        cout << "Running " << cmd(argc, argv) << endl;
        pid_t child_pid = fork();
        if (child_pid < 0) {
            cerr << "Failed to fork" << endl;
            return;
        }
        if (child_pid) {
            // Parent: reap the child so that it does not linger as a zombie.
            if (waitpid(child_pid, NULL, 0) < 0)
                cerr << "Fail to wait for child" << endl;
            else
                cout << "Child terminated" << endl;
        } else {
            run_child(argc, argv);
        }
    }

    // Child: replace this process image with the requested command.
    static void run_child(int argc, char **argv) {
        // execvp only returns if it failed.
        if (execvp(argv[0], argv)) {
            cerr << "Failed to exec" << endl;
            exit(-1);
        }
    }

    // Join the argument vector into one printable string.
    static string cmd(int argc, char **argv) {
        string prompt = "";
        for (int i = 0; i < argc; i++)
            prompt.append(argv[i] + string(" "));
        return prompt;
    }
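Now for the real work: implementing isolation!
UTS namespace
Linux namespaces: kernel-level isolation of system resources. UTS: lets the container show its own hostname.
A Linux namespace is a bit like a namespace in C++, where using namespace gives you access to whatever lives in a particular namespace. Namespaces exist to limit "what you are able to see".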
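UTS is just one of these resource isolation mechanisms. Others include Mount (isolates files and mount points), PID (isolates process IDs), User (isolates users and groups), and so on. Each kind of namespace limits what a process can see in one particular respect, and each has its own flag; for example, CLONE_NEWUTS selects the UTS namespace.
The UTS namespace isolates each process's view of the hostname, so changing the hostname inside the container's UTS namespace leaves the hostname in the host's namespace untouched.
First, a question: why do the hostname and domain name need to be isolated at all?
Because a hostname or domain name can stand in for an IP address. Without this layer of isolation, network access between different containers on the same host could run into trouble.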
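The interfaces for working with namespaces include:
clone(): create a new process in its own namespaces.
setns(): move the calling process into an existing namespace.
unshare(): move the calling process out into new namespaces.
plus a handful of files under /proc.
Try listing a process's existing namespaces through /proc:
ls -al /proc/$pid/ns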
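The same information is available from code. As a minimal sketch (the helper name `ns_link` is made up for this example and is not part of mocker), each entry under `/proc/self/ns` is a symlink whose target names the namespace type and an inode number, something like `uts:[4026531838]`. Two processes are in the same namespace exactly when these targets match.

```cpp
#include <limits.h>
#include <string>
#include <unistd.h>

// Hypothetical helper: return the target of /proc/self/ns/<name>
// (for example "uts:[<inode>]"), or "" if the link cannot be read.
std::string ns_link(const std::string &name) {
    const std::string path = "/proc/self/ns/" + name;
    char buf[PATH_MAX];
    // readlink does not NUL-terminate, so build the string from the length.
    ssize_t n = readlink(path.c_str(), buf, sizeof(buf) - 1);
    if (n < 0)
        return "";
    return std::string(buf, static_cast<size_t>(n));
}
```

Calling `ns_link("uts")` before and after a successful `unshare(CLONE_NEWUTS)` should give two different strings, which is a quick way to confirm that the process really moved.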
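    // Needs #include <sched.h> for unshare() and CLONE_NEWUTS.
    static void run_child(int argc, char **argv) {
        // Detach from the host's UTS namespace before running the command.
        int flags = CLONE_NEWUTS;
        if (unshare(flags) < 0) {
            cerr << "Fail to unshare in child" << endl;
            exit(-1);
        }
        // execvp only returns if it failed.
        if (execvp(argv[0], argv)) {
            cerr << "Failed to exec" << endl;
            exit(-1);
        }
    }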
    besthope:~/mini-docker$ hostname
    LAPTOP-3CUODIL3
    besthope:~/mini-docker$ sudo ./mocker run /bin/bash
    Running /bin/bash
    root@LAPTOP-3CUODIL3:/home/besthope/mini-docker
    root@LAPTOP-3CUODIL3:/home/besthope/mini-docker
    container
    root@LAPTOP-3CUODIL3:/home/besthope/mini-docker
    exit
    Child terminated
    besthope:~/mini-docker$ hostname
    LAPTOP-3CUODIL3
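Rather than changing the hostname by hand inside the container, we can automate it with sethostname (child_hostname is a global string set to "container"):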
    static void run_child(int argc, char **argv) {
        int flags = CLONE_NEWUTS;
        if (unshare(flags) < 0) {
            cerr << "Fail to unshare in child" << endl;
            exit(-1);
        }
        // Only affects the new UTS namespace; the host keeps its hostname.
        if (sethostname(child_hostname, strlen(child_hostname)) < 0) {
            cerr << "Fail to change hostname" << endl;
            exit(-1);
        }
        // execvp only returns if it failed.
        if (execvp(argv[0], argv)) {
            cerr << "Failed to exec" << endl;
            exit(-1);
        }
    }
Now it is starting to look the part, isn't it...?
    besthope:~/mini-docker$ sudo ./mocker run /bin/bash
    Running /bin/bash
    root@container:/home/besthope/mini-docker
    total 32
    -rwxr-xr-x 1 besthope besthope 24656 Apr 8 20:34 mocker
    -rw-r--r-- 1 besthope besthope 1507 Apr 8 20:34 mocker.cc
    root@container:/home/besthope/mini-docker
     PID TTY      TIME     CMD
    1067 pts/6    00:00:00 sudo
    1068 pts/6    00:00:00 mocker
    1069 pts/6    00:00:00 bash
    1085 pts/6    00:00:00 ps
PID namespace
The pids of the processes run inside mocker simply continue counting from the host's.
What if we want pids to start from 1? For that we need to isolate the PID namespace.
    static void run(int argc, char **argv) {
        cout << "Parent running " << cmd(argc, argv) << "as " << getpid() << endl;
        // Children forked from now on live in a new PID namespace.
        if (unshare(CLONE_NEWPID) < 0) {
            cerr << "Fail to unshare PID namespace" << endl;
            exit(-1);
        }
        pid_t child_pid = fork();
        if (child_pid < 0) {
            cerr << "Failed to fork" << endl;
            return;
        }
        if (child_pid) {
            // Parent: reap the child so that it does not linger as a zombie.
            if (waitpid(child_pid, NULL, 0) < 0)
                cerr << "Fail to wait for child" << endl;
            else
                cout << "Child terminated" << endl;
        } else {
            run_child(argc, argv);
        }
    }
    besthope:~/mini-docker$ sudo ./mocker run /bin/bash
    Parent running /bin/bash as 1176
    Child running /bin/bash as 1
    root@container:/home/besthope/mini-docker
     PID TTY      TIME     CMD
    1175 pts/6    00:00:00 sudo
    1176 pts/6    00:00:00 mocker
    1177 pts/6    00:00:00 bash
    1184 pts/6    00:00:00 ps
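Why does the pid reported by ps differ from what getpid returns? Blame ps. More precisely, ps reads its process information from the /proc directory, and although the container now lives in a new PID namespace, the /proc it reads is still the host's, so it shows exactly what a host process would see.
That does not change the fact that we have successfully isolated the PID namespace.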
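The mismatch can be observed without ps. As a minimal sketch (the helper name `pid_according_to_proc` is made up here and is not part of mocker), `/proc/self` is a symlink whose target is the caller's pid as seen by whichever procfs instance is mounted at `/proc`.

```cpp
#include <cstdlib>
#include <limits.h>
#include <unistd.h>

// Hypothetical helper: the pid that the mounted /proc believes we have,
// read from the /proc/self symlink. Returns -1 if /proc is unavailable.
long pid_according_to_proc() {
    char buf[PATH_MAX];
    ssize_t n = readlink("/proc/self", buf, sizeof(buf) - 1);
    if (n < 0)
        return -1;
    buf[n] = '\0';  // readlink does not NUL-terminate
    return std::strtol(buf, nullptr, 10);
}
```

In an ordinary process this matches `getpid()`. Inside the container at this stage, `getpid()` returns 1 while this helper should still return the host-side pid, because the procfs being read belongs to the host's PID namespace.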
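One detail: the unshare that creates the PID namespace is called in the parent, not in the child as with the UTS namespace. The reason is how the kernel defines it: unshare(CLONE_NEWPID) does not move the calling process into the new namespace. Only the children it creates afterwards are placed there, and the first of them becomes pid 1.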
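An alternative to unshare-then-fork is to create the child directly inside a new PID namespace with `clone()`. The following is a sketch under a few assumptions: it is not how mocker does it, the names `child_main` and `child_sees_pid_one` are made up, it relies on `_GNU_SOURCE` being defined (g++ does this by default on Linux), and it needs root (or CAP_SYS_ADMIN), so it reports -1 when `clone` is refused.

```cpp
#include <sched.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <vector>

// Runs in the child: exit code 0 means "I am pid 1 in my namespace".
static int child_main(void *) {
    return getpid() == 1 ? 0 : 1;
}

// Hypothetical helper: returns 0 if the cloned child saw itself as pid 1,
// 1 if it did not, and -1 if clone/waitpid failed (e.g. EPERM without root).
int child_sees_pid_one() {
    std::vector<char> stack(1024 * 1024);
    // The stack grows downwards on common architectures, so pass its top.
    pid_t pid = clone(child_main, stack.data() + stack.size(),
                      CLONE_NEWPID | SIGCHLD, nullptr);
    if (pid < 0)
        return -1;
    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

With `clone()` the namespace flag applies to the new child itself, which is why no separate call in the parent is needed; the unshare-plus-fork version used in mocker reaches the same end state in two steps.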
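Filesystem isolation
A separate mount namespace does not, by itself, give you filesystem isolation.
The problem: mount isolation != filesystem isolation.
Why? A new mount namespace starts out as a copy of its parent's mount list, so the process still sees exactly the same files.
Suppose I have an ubuntu-fs directory at hand, containing: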
    bin boot dev etc home init lib lib32 lib64 libx32 lost+found media mnt opt proc root run sbin snap srv sys tmp usr var
With a small change to the child's code, we can restrict the container's view of the whole filesystem to this directory:
    static void run_child(int argc, char **argv) {
        cout << "Child running " << cmd(argc, argv) << "as " << getpid() << endl;
        int flags = CLONE_NEWUTS;
        if (unshare(flags) < 0) {
            cerr << "Fail to unshare in child" << endl;
            exit(-1);
        }
        // Make ../ubuntu-fs the root directory seen by this process.
        if (chroot("../ubuntu-fs") < 0) {
            cerr << "Fail to chroot" << endl;
            exit(-1);
        }
        // Move the working directory inside the new root as well.
        if (chdir("/") < 0) {
            cerr << "Fail to chdir to /" << endl;
            exit(-1);
        }
        if (sethostname(child_hostname, strlen(child_hostname)) < 0) {
            cerr << "Fail to change hostname" << endl;
            exit(-1);
        }
        // execvp only returns if it failed.
        if (execvp(argv[0], argv)) {
            cerr << "Failed to exec" << endl;
            exit(-1);
        }
    }
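The key is chroot: make path the root directory. From then on, every file access through an absolute path starts from this new root. Note that chroot only isolates the view; a process can still escape the new root through relative paths.
chdir then switches the working directory to the newly set /: change the working directory to path.
Looks like the real thing already...?
There is one thing to keep in mind, though: the /proc problem is still unsolved.
Running ps at this point fails outright, because /proc under ubuntu-fs is completely empty...
But a single mount fixes that: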
    // Needs #include <sys/mount.h> for mount().
    static void run_child(int argc, char **argv) {
        cout << "Child running " << cmd(argc, argv) << "as " << getpid() << endl;
        int flags = CLONE_NEWUTS;
        if (unshare(flags) < 0) {
            cerr << "Fail to unshare in child" << endl;
            exit(-1);
        }
        if (chroot("../ubuntu-fs") < 0) {
            cerr << "Fail to chroot" << endl;
            exit(-1);
        }
        if (chdir("/") < 0) {
            cerr << "Fail to chdir to /" << endl;
            exit(-1);
        }
        // Mount a fresh procfs at ./proc, i.e. /proc inside the new root.
        if (mount("proc", "proc", "proc", 0, NULL) < 0) {
            cerr << "Fail to mount /proc" << endl;
            exit(-1);
        }
        if (sethostname(child_hostname, strlen(child_hostname)) < 0) {
            cerr << "Fail to change hostname" << endl;
            exit(-1);
        }
        // execvp only returns if it failed.
        if (execvp(argv[0], argv)) {
            cerr << "Failed to exec" << endl;
            exit(-1);
        }
    }
This raises a new problem. Back on the host, run
cat /proc/mounts | grep ^proc
and you will find that
proc /home/user/ubuntu-fs/proc proc rw,relatime 0 0
has shown up on the host as well!
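Mount namespace
The annoying thing about Mount is that adding a flag alone is not enough to get mount point isolation. You also have to reset the propagation type of the root mount point; otherwise, if / is mounted as shared, mounts made in the new namespace still propagate back to the host.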
    static void run_child(int argc, char **argv) {
        cout << "Child running " << cmd(argc, argv) << "as " << getpid() << endl;
        // New UTS namespace and new mount namespace.
        int flags = CLONE_NEWUTS | CLONE_NEWNS;
        if (unshare(flags) < 0) {
            cerr << "Fail to unshare in child" << endl;
            exit(-1);
        }
        // Recursively mark every mount under / as a slave, so that mounts
        // made here no longer propagate back to the host.
        if (mount(NULL, "/", NULL, MS_SLAVE | MS_REC, NULL) < 0) {
            cerr << "Fail to mount /" << endl;
            exit(-1);
        }
        if (chroot("../ubuntu-fs") < 0) {
            cerr << "Fail to chroot" << endl;
            exit(-1);
        }
        if (chdir("/") < 0) {
            cerr << "Fail to chdir to /" << endl;
            exit(-1);
        }
        // Mount a fresh procfs at ./proc, i.e. /proc inside the new root.
        if (mount("proc", "proc", "proc", 0, NULL) < 0) {
            cerr << "Fail to mount /proc" << endl;
            exit(-1);
        }
        if (sethostname(child_hostname, strlen(child_hostname)) < 0) {
            cerr << "Fail to change hostname" << endl;
            exit(-1);
        }
        // execvp only returns if it failed.
        if (execvp(argv[0], argv)) {
            cerr << "Failed to exec" << endl;
            exit(-1);
        }
    }
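Now the /proc mounted while the container runs no longer propagates to the host's view; the visibility of the container's /proc mount point is confined to the container's mount namespace.
Of course, while the container is running, you can still see from the host that this mount namespace exists.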
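One way to check which propagation type a mount ended up with is to read `/proc/self/mountinfo`, where the optional fields before the `-` separator carry tags such as `shared:N` or `master:N`. The parser below is only a sketch of that idea; `propagation_of` is a made-up name, it takes a single mountinfo line as input, and it is not part of mocker.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: classify the propagation type of one line of
// /proc/self/mountinfo by inspecting its optional fields, i.e. the tokens
// between the mount options and the "-" separator.
std::string propagation_of(const std::string &mountinfo_line) {
    std::istringstream in(mountinfo_line);
    std::string tok;
    // Skip the six fixed fields: mount ID, parent ID, major:minor,
    // root, mount point, mount options.
    for (int i = 0; i < 6; i++)
        if (!(in >> tok))
            return "unknown";
    bool shared = false, slave = false, unbindable = false;
    while (in >> tok && tok != "-") {
        if (tok.rfind("shared:", 0) == 0) shared = true;
        else if (tok.rfind("master:", 0) == 0) slave = true;
        else if (tok == "unbindable") unbindable = true;
    }
    if (shared && slave) return "shared+slave";
    if (shared) return "shared";
    if (slave) return "slave";
    if (unbindable) return "unbindable";
    return "private";  // no propagation tags at all
}
```

Feeding it the lines of `/proc/self/mountinfo` from inside the container, after the `MS_SLAVE | MS_REC` remount, should report "slave" for mounts that were "shared" on the host, while the freshly mounted procfs carries no tag and comes out as "private".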
Wrapping up
Here is the complete code:
    #include <iostream>
    #include <string>
    #include <cstring>
    #include <cstdlib>
    #include <sched.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <sys/mount.h>

    using namespace std;

    const char *child_hostname = "container";

    static string cmd(int argc, char **argv);
    static void run(int argc, char **argv);
    static void run_child(int argc, char **argv);

    // Usage: sudo ./mocker run <command> [args...]
    int main(int argc, char **argv) {
        if (argc < 3) {
            cerr << "Too few arguments" << endl;
            exit(-1);
        }
        if (!strcmp(argv[1], "run"))
            run(argc - 2, &argv[2]);
        return 0;
    }

    // Parent: create a new PID namespace, fork the container process, wait for it.
    static void run(int argc, char **argv) {
        cout << "Parent running " << cmd(argc, argv) << "as " << getpid() << endl;
        // Children forked from now on live in a new PID namespace.
        if (unshare(CLONE_NEWPID) < 0) {
            cerr << "Fail to unshare PID namespace" << endl;
            exit(-1);
        }
        pid_t child_pid = fork();
        if (child_pid < 0) {
            cerr << "Failed to fork" << endl;
            return;
        }
        if (child_pid) {
            // Reap the child so that it does not linger as a zombie.
            if (waitpid(child_pid, NULL, 0) < 0)
                cerr << "Fail to wait for child" << endl;
            else
                cout << "Child terminated" << endl;
        } else {
            run_child(argc, argv);
        }
    }

    // Child: set up UTS and mount isolation, switch root, then exec the command.
    static void run_child(int argc, char **argv) {
        cout << "Child running " << cmd(argc, argv) << "as " << getpid() << endl;
        int flags = CLONE_NEWUTS | CLONE_NEWNS;
        if (unshare(flags) < 0) {
            cerr << "Fail to unshare in child" << endl;
            exit(-1);
        }
        // Recursively mark every mount under / as a slave, so that mounts
        // made here no longer propagate back to the host.
        if (mount(NULL, "/", NULL, MS_SLAVE | MS_REC, NULL) < 0) {
            cerr << "Fail to mount /" << endl;
            exit(-1);
        }
        if (chroot("../ubuntu-fs") < 0) {
            cerr << "Fail to chroot" << endl;
            exit(-1);
        }
        if (chdir("/") < 0) {
            cerr << "Fail to chdir to /" << endl;
            exit(-1);
        }
        // Mount a fresh procfs at ./proc, i.e. /proc inside the new root.
        if (mount("proc", "proc", "proc", 0, NULL) < 0) {
            cerr << "Fail to mount /proc" << endl;
            exit(-1);
        }
        if (sethostname(child_hostname, strlen(child_hostname)) < 0) {
            cerr << "Fail to change hostname" << endl;
            exit(-1);
        }
        // execvp only returns if it failed.
        if (execvp(argv[0], argv)) {
            cerr << "Failed to exec" << endl;
            exit(-1);
        }
    }

    // Join the argument vector into one printable string.
    static string cmd(int argc, char **argv) {
        string prompt = "";
        for (int i = 0; i < argc; i++)
            prompt.append(argv[i] + string(" "));
        return prompt;
    }
The code isn't all that long (?)
References
Containers From Scratch • Liz Rice • GOTO 2018
mini-docker: illustrate what docker really is in 100 lines of C/C++, a C++ implementation; a Chinese version of it is also available.