To compare the performance of 3FS (Fire-Flyer File System) with other filesystem products for AI training use cases, we’ll evaluate it against some prominent filesystems commonly used in high-performance computing (HPC) and AI workloads: Lustre, CephFS, GPFS (IBM Spectrum Scale), and JuiceFS. Since 3FS is a relatively new distributed filesystem from DeepSeek, designed specifically for AI training and inference workloads, its performance metrics will be contextualized using available data and typical characteristics of these competitors. Here's the breakdown:
3FS Overview
3FS is a high-performance, parallel, distributed filesystem built by DeepSeek to leverage modern SSDs and RDMA (Remote Direct Memory Access) networks. It’s optimized for AI workloads, offering features like strong consistency via Chain Replication with Apportioned Queries (CRAQ), a disaggregated architecture combining thousands of SSDs, and stateless metadata services backed by a transactional key-value store (e.g., FoundationDB).
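To make the CRAQ consistency scheme concrete, here is a minimal Python sketch of the protocol's read/write paths (a conceptual model only, not 3FS's actual implementation): writes flow from the chain's head to its tail and commit at the tail, while reads can be served by any replica, deferring to the tail's committed version only when a write is still in flight.

```python
# Minimal sketch of CRAQ (Chain Replication with Apportioned Queries).
# Conceptual model only -- 3FS's real implementation differs in many details.

class CraqNode:
    def __init__(self):
        self.committed = {}  # key -> (version, value) known to be committed
        self.dirty = {}      # key -> (version, value) written but not yet acked by the tail

class CraqChain:
    def __init__(self, length=3):
        self.nodes = [CraqNode() for _ in range(length)]

    def write(self, key, value):
        """Writes flow head -> tail; the tail is the commit point, then acks flow back."""
        version = self.nodes[-1].committed.get(key, (0, None))[0] + 1
        for node in self.nodes[:-1]:
            node.dirty[key] = (version, value)             # mark dirty while the write is in flight
        self.nodes[-1].committed[key] = (version, value)   # tail commits
        for node in self.nodes[:-1]:                       # ack propagates back, clearing dirty state
            node.committed[key] = (version, value)
            node.dirty.pop(key, None)

    def read(self, key, node_index):
        """Apportioned queries: any replica serves clean reads; dirty keys defer to the tail."""
        node = self.nodes[node_index]
        if key in node.dirty:                              # a write is in flight for this key
            return self.nodes[-1].committed.get(key, (0, None))[1]
        return node.committed.get(key, (0, None))[1]

chain = CraqChain()
chain.write("chunk-0001", b"training-batch-bytes")
print(chain.read("chunk-0001", node_index=1))              # served directly by a middle replica
```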
Reported Performance:
Aggregate read throughput of 6.6 TiB/s (approximately 6,758 GiB/s; see the unit-conversion check below) on a 180-node cluster, each node with 2×200 Gbps InfiniBand NICs and sixteen 14 TiB NVMe SSDs, alongside 500+ client nodes.
Sorting throughput of 3.66 TiB/min (about 62.5 GiB/s) on the GraySort benchmark, per DeepSeek's published results.
Use Case Fit: Tailored for AI training, emphasizing high throughput for large datasets and low-latency access to massive numbers of files.
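Since TiB (binary) and GB (decimal) figures are easy to mix up, the reported numbers can be sanity-checked with a few lines of arithmetic:

```python
# Sanity-check the reported 3FS throughput figures in binary vs. decimal units.
TIB = 1024**4   # tebibyte, bytes
GIB = 1024**3   # gibibyte, bytes
GB = 1000**3    # gigabyte, bytes

read_tput = 6.6 * TIB              # reported aggregate read throughput, bytes per second
print(read_tput / GIB)             # ~6,758 GiB/s
print(read_tput / GB)              # ~7,257 GB/s (decimal)

sort_tput = 3.66 * TIB / 60        # reported 3.66 TiB/min, converted to bytes per second
print(sort_tput / GIB)             # ~62.5 GiB/s
```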
Comparison with Other Filesystems
1. Lustre
Overview: A widely-used parallel filesystem in HPC and supercomputing, known for scalability and high throughput. It separates metadata and data services, using object storage targets (OSTs) and metadata servers (MDS).
Performance:
Example: Large Lustre deployments at national labs (e.g., the Lustre-based Orion filesystem serving ORNL's Frontier supercomputer) achieve aggregate read/write throughput of roughly 2.5–5+ TB/s with thousands of nodes and disks, though this varies heavily by configuration.
Per-disk throughput is often cited at around 50–100 MB/s per HDD in HDD-based setups, while SSD-based Lustre can hit 1–2 GB/s per OST with NVMe; a back-of-the-envelope aggregate estimate from these per-OST figures is sketched at the end of this section.
AI Training Fit:
Strengths: Excellent for sequential, large-file I/O common in HPC and some AI datasets (e.g., large video or image files).
Weaknesses: Metadata performance can bottleneck with millions of small files (common in AI preprocessing), and setup complexity is high.
Comparison to 3FS:
3FS’s 6.6 TiB/s exceeds typical Lustre deployments, likely due to its SSD+RDMA optimization. Lustre can scale to comparable aggregate numbers with enough nodes, but 3FS appears to deliver higher per-node throughput thanks to its disaggregated design and focus on modern hardware.
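For a rough sense of how those per-OST numbers add up, a back-of-the-envelope model (with assumed, hypothetical cluster parameters) is simply OST count times per-OST bandwidth, capped by the network fabric:

```python
# Back-of-the-envelope aggregate throughput for a Lustre cluster (illustrative assumptions only).
def lustre_aggregate_gbs(num_osts: int, per_ost_gbs: float, fabric_cap_gbs: float) -> float:
    """Aggregate throughput is bounded by both total OST bandwidth and the network fabric."""
    return min(num_osts * per_ost_gbs, fabric_cap_gbs)

# Hypothetical NVMe-backed deployment: 2,000 OSTs at 1.5 GB/s each, 4,000 GB/s of fabric.
print(lustre_aggregate_gbs(2000, 1.5, 4000))   # 3000.0 GB/s -> storage-bound, multi-TB/s range
```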
2. CephFS
Overview: Part of the Ceph ecosystem, a distributed filesystem with a unified object, block, and file interface. It uses a dynamic metadata server (MDS) cluster and relies on object storage daemons (OSDs).
Performance:
Throughput varies widely: a well-tuned CephFS cluster with SSDs might achieve 1–3 TB/s in large setups, but smaller clusters often see 100–500 GB/s.
Metadata operations lag with small files; benchmarks show 10–50K IOPS per MDS, scaling with more MDS nodes.
AI Training Fit:
Strengths: Unified storage (object+file) is versatile for AI pipelines; good for mixed workloads.
Weaknesses: Slower metadata handling and lower throughput compared to 3FS or Lustre for massive parallel reads.
Comparison to 3FS:
3FS’s 6.6 TiB/s dwarfs CephFS’s typical throughput, especially in read-heavy AI training scenarios. CephFS struggles with the extreme parallelism and small-file intensity that 3FS targets.
3. GPFS (IBM Spectrum Scale)
Overview: A high-performance, parallel filesystem from IBM, used in enterprise and HPC environments. It supports distributed metadata and scales across thousands of nodes.
Performance:
Example: IBM Elastic Storage Server (ESS) GL4S model (HDD-based) delivers 24 GB/s with 334 drives (~72 MB/s/drive), while SSD configs can hit 50–100 GB/s in smaller clusters. Large-scale setups reach 1–3 TB/s.
NVMe-optimized GPFS can exceed 5 TB/s with enough nodes.
AI Training Fit:
Strengths: Strong consistency and scalability; good for large-scale, enterprise-grade AI training.
Weaknesses: Centralized metadata can bottleneck with small files; less optimized for RDMA compared to 3FS.
Comparison to 3FS:
3FS’s 6.6 TiB/s outpaces most GPFS deployments, especially in SSD+RDMA contexts. GPFS can match or approach this with massive infrastructure, but 3FS’s design gives it an edge in AI-specific throughput.
4. JuiceFS
Overview: An open-source distributed filesystem optimized for cloud and AI workloads, using a metadata engine (e.g., Redis) and object storage (e.g., S3) backend with local caching.
Performance:
Throughput depends on the backend: With SSD caching and high-bandwidth networks, it can achieve 10–50 GB/s in small clusters. Large-scale tests are less documented but suggest 100–500 GB/s with optimization.
Excels with small files thanks to its dedicated metadata engine; read performance scales with the cache hit rate (a simple effective-bandwidth model is sketched at the end of this section).
AI Training Fit:
Strengths: Handles massive small-file datasets (e.g., millions of images) well; cost-effective with cloud storage.
Weaknesses: Throughput caps out lower than 3FS due to reliance on object storage and network latency.
Comparison to 3FS:
3FS’s 6.6 TiB/s far exceeds JuiceFS’s practical limits, as JuiceFS prioritizes flexibility over raw throughput. For AI training requiring terabytes-per-second reads, 3FS is superior.
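To illustrate how strongly JuiceFS-style read performance depends on the cache hit rate, here is a simple effective-bandwidth model with hypothetical numbers (real behavior also depends on prefetching, concurrency, and object sizes):

```python
# Effective read bandwidth for a caching filesystem like JuiceFS (illustrative model only).
def effective_read_gbs(hit_rate: float, cache_gbs: float, backend_gbs: float) -> float:
    """Time per byte is a blend of cache and backend service time, weighted by hit rate."""
    time_per_unit = hit_rate / cache_gbs + (1.0 - hit_rate) / backend_gbs
    return 1.0 / time_per_unit

# Hypothetical numbers: 5 GB/s local NVMe cache, 1 GB/s to an S3-compatible backend.
for hit_rate in (0.50, 0.90, 0.99):
    print(hit_rate, round(effective_read_gbs(hit_rate, cache_gbs=5.0, backend_gbs=1.0), 2))
# 0.5 -> ~1.67 GB/s, 0.9 -> ~3.57 GB/s, 0.99 -> ~4.81 GB/s
```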
Performance Metrics Summary
| Filesystem | Peak Throughput (Example) | AI Training Strengths | AI Training Weaknesses |
|---|---|---|---|
| 3FS | 6.6 TiB/s (~6,758 GiB/s) | Extreme throughput, RDMA+SSD optimized | Limited real-world deployment data |
| Lustre | 2.5–5 TB/s | Scalable, large-file I/O | Metadata bottlenecks with small files |
| CephFS | 1–3 TB/s | Unified storage, versatile | Lower throughput, metadata scaling |
| GPFS | 1–5 TB/s | Enterprise-grade, consistent | Less RDMA focus, metadata limits |
| JuiceFS | 100–500 GB/s (estimated) | Small-file handling, cloud-friendly | Throughput limited by backend |
Analysis for AI Training Use Case
Throughput: 3FS’s 6.6 TiB/s is a standout, likely due to its focus on NVMe SSDs and RDMA, making it ideal for feeding massive datasets to GPU clusters in AI training. Lustre and GPFS can approach this in large-scale HPC setups, but 3FS seems to achieve it with fewer nodes.
Small Files: AI training often involves millions of small files (e.g., images, text). 3FS’s stateless metadata services backed by a transactional key-value store likely give it an edge over Lustre and GPFS, which struggle with metadata-heavy workloads (a sketch of this key-value metadata pattern follows this list). JuiceFS competes here but lacks 3FS’s throughput.
Scalability: All systems scale well, but 3FS’s disaggregated architecture and CRAQ consistency could simplify scaling for AI compared to Lustre’s complex OST/MDS setup or CephFS’s MDS limits.
Ease of Use: 3FS’s appliance-like design (per DeepSeek’s claims) may reduce complexity compared to Lustre or GPFS, though JuiceFS wins for cloud integration.
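To make the metadata argument concrete, the sketch below shows the general key-value metadata pattern used by systems like 3FS (FoundationDB-backed) and JuiceFS (Redis/KV-backed). The key layout and API here are hypothetical and chosen purely for illustration: directory entries and inode attributes become small KV records, so creating or looking up a file is a point operation rather than a round trip to a centralized metadata server.

```python
# Illustrative sketch: filesystem metadata in a transactional key-value store.
# The schema and API below are hypothetical; real systems use their own layouts
# and transaction machinery (e.g., FoundationDB transactions in 3FS's case).
import itertools

class KVStore:
    """Stand-in for a transactional key-value store such as FoundationDB."""
    def __init__(self):
        self.data = {}

class MetadataService:
    def __init__(self, kv: KVStore):
        self.kv = kv
        self._ino = itertools.count(start=2)  # inode 1 reserved for the root directory

    def create_file(self, parent_ino: int, name: str) -> int:
        """A create is two small KV writes (one transaction in a real system): dirent + inode."""
        ino = next(self._ino)
        self.kv.data[("dirent", parent_ino, name)] = ino
        self.kv.data[("inode", ino)] = {"size": 0, "chunks": []}
        return ino

    def lookup(self, parent_ino: int, name: str):
        """Path resolution is a point read; it never touches the data path."""
        return self.kv.data.get(("dirent", parent_ino, name))

meta = MetadataService(KVStore())
for i in range(100_000):                        # many small files are just many small keys
    meta.create_file(parent_ino=1, name=f"img_{i:07d}.jpg")
print(meta.lookup(1, "img_0000042.jpg"))        # -> the file's inode number
```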
Conclusion
For AI training use cases requiring extreme read throughput and parallelism (e.g., large-scale deep learning with terabyte-sized datasets), 3FS appears to outperform Lustre, CephFS, GPFS, and JuiceFS based on its reported 6.6 TiB/s metric. Its design leverages modern hardware (NVMe SSDs, RDMA) more aggressively than competitors, giving it a theoretical edge in feeding data-hungry GPU clusters. However, its real-world adoption is less documented compared to established players like Lustre or GPFS, so practical performance may depend on specific configurations and workloads. If your AI training prioritizes small-file access over raw throughput, JuiceFS could be a contender, but for peak performance, 3FS looks like the leader among these options.
If you’d like me to search for more specific benchmarks or X posts about 3FS in AI contexts, let me know!