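Firecrawl is an API service. It takes a URL, crawls the corresponding page, and converts the content into clean Markdown or structured data.

It can crawl every accessible subpage and return clean, well-formatted data for each one. No sitemap is required.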

Official service: https://www.firecrawl.dev/

Source code: https://github.com/mendableai/firecrawl

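I. Why choose Firecrawl to implement Web Search?

The tool list on the Dify platform includes many search-related tools.

Among them, the familiar Google and Bing both provide APIs. The APIs respond quickly, but both cost money.

SearXNG is a good open-source web search option, but in testing it proved unstable from networks in mainland China, and its integration with Dify is currently rough, with quite a few bugs.

Firecrawl is a powerful structured crawler built on Playwright. By scraping the results page of a web search site (Baidu, for example), it can provide the capability of a Web Search plugin.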

II. Self-hosting Firecrawl

Official deployment guide: https://github.com/mendableai/firecrawl/blob/main/CONTRIBUTING.md

This article deploys with Docker Compose.

1. Clone the repository

git clone https://github.com/mendableai/firecrawl.git

cd firecrawl

2. Configure the environment

Copy the official example configuration as-is.

cp apps/api/.env.example apps/api/.env

3. Start the containers

Run in the foreground:

docker compose up

Run in the background:

docker compose up -d

4. Test the API

curl -X GET http://localhost:3002/test

On success it returns: Hello, world!

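III. Connecting to the Dify platform

Only the Firecrawl API address needs to be configured. The default port is 3002.

For a self-hosted deployment, the API key can be any value.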

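IV. Building a ChatFlow with web search capability

1. Add a single-page scrape node

Scrape https://www.baidu.com/s?wd={user input}, keep only the elements whose class attribute contains result, and return the output in links format.

This node returns the list of links on the Baidu results page. Each link points to one of the result pages.

The debug output has the following structure: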

{
  "text": "",
  "files": [],
  "json": [
    {
      "success": true,
      "data": {
        "metadata": {
          "theme-color": "#ffffff",
          "referrer": "always",
          "title": "全球票房最高的动画片是什么_百度搜索",
          "favicon": "https://www.baidu.com/favicon.ico",
          "scrapeId": "a6b1bed8-7ac7-4bb8-b37f-ddc90d936f5d",
          "sourceURL": "https://www.baidu.com/s?wd=全球票房最高的动画片是什么",
          "url": "https://www.baidu.com/s?wd=全球票房最高的动画片是什么",
          "statusCode": 200
        },
        "links": [
          "http://www.baidu.com/link?url=xzbQnC6i_MULtA48_qlGC1q1Jna2LebBf6dui64qwFKiIaBT5bldOg8WKw5Yl2kzyTAdhDjXg8fhOAUuSR0fvpwQ7kTPFB-00BOS0TaSwsO",
          "http://www.baidu.com/link?url=xzbQnC6i_MULtA48_qlGC1q1Jna2LebBf6dui64qwFNvD9E44qfYsc_CRAkAaDSLLL_OZbEFrFQiDs1vzQdsHKVoczUckq3tgOnP4TvzGfy",
          "http://www.baidu.com/link?url=xzbQnC6i_MULtA48_qlGC1q1Jna2LebBf6dui64qwFMLSKanxg5wetFwpRfsvMLFqPjubf0Q79hiCwwLk0XKJ2ZV3M64hHqIGPfscUPLZue",
          "http://www.baidu.com/link?url=xzbQnC6i_MULtA48_qlGC1q1Jna2LebBf6dui64qwFKDBzUshbz_zll4SvSx-EID7IndCL1Xm3QNdL78B9KuvTNq9Az180HZ6F8vySsvmuS",
          "http://www.baidu.com/link?url=xzbQnC6i_MULtA48_qlGC1q1Jna2LebBf6dui64qwFLu1lPaS1L3aIN164b5IAB_0iDZV-XzWC9A5kqVBECNegvVSYKko9wD_0ZxQfl5c4S",
          "http://www.baidu.com/link?url=xzbQnC6i_MULtA48_qlGC1q1Jna2LebBf6dui64qwFKDBzUshbz_zll4SvSx-EID-vOAlRTrK0bQRF_mm5qBOav-iqkxX7OlvWt5htLOP3y"
        ]
      }
    }
  ]
}
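To reproduce this node outside Dify, the request can be sketched roughly as follows. This is a minimal sketch, not the Dify plugin's actual code: it assumes the Firecrawl v1 REST endpoint `POST /v1/scrape` with the body fields `url`, `formats`, `includeTags`, `onlyMainContent` and `timeout`, and it needs Node.js 18+ for the built-in `fetch`. The helper names `buildSearchRequest` and `scrape` are hypothetical.

```javascript
// Hypothetical helper: build the scrape request body for a Baidu results page.
// Assumes the Firecrawl v1 body fields url, formats, includeTags, onlyMainContent, timeout.
function buildSearchRequest(query) {
  return {
    // Percent-encode the query so spaces and non-ASCII characters are URL-safe.
    url: "https://www.baidu.com/s?wd=" + encodeURIComponent(query),
    formats: ["links"],        // return the list of links found on the page
    includeTags: [".result"],  // keep only elements whose class contains "result"
    onlyMainContent: true,
    timeout: 30000,
  };
}

// Hypothetical helper: send the body to a self-hosted Firecrawl instance.
// The /v1/scrape path and the Bearer header are assumptions about the deployment.
async function scrape(body, baseUrl = "http://localhost:3002", apiKey = "any-value") {
  const response = await fetch(baseUrl + "/v1/scrape", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: "Bearer " + apiKey,
    },
    body: JSON.stringify(body),
  });
  if (!response.ok) {
    throw new Error("Scrape failed with HTTP " + response.status);
  }
  return response.json();
}
```

If the API behaves as assumed, `scrape(buildSearchRequest("some question"))` resolves to an object whose `data.links` array corresponds to the `links` field in the debug output above.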

2. Add a code execution node

This node runs a short JS script that extracts the links array so that later nodes can iterate over it.
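The script in the DSL file is essentially `main({arg1})` returning `arg1[0].data.links`. The version below is a slightly more defensive sketch of the same idea. The empty-array fallback and the de-duplication are additions for illustration, not part of the original flow, and `arg1` is assumed to be the `json` output of the scrape node.

```javascript
// Dify calls main() with the variables declared on the code node.
// arg1 is assumed to be the `json` output of the scrape node (an array).
function main({ arg1 }) {
  const first = Array.isArray(arg1) ? arg1[0] : undefined;
  const links =
    first && first.data && Array.isArray(first.data.links) ? first.data.links : [];
  // Keep only strings and drop duplicates so the iteration node
  // does not scrape the same page twice.
  const unique = [...new Set(links.filter((link) => typeof link === "string"))];
  return { result: unique };
}
```

The returned `result` field matches the `array[string]` output that the iteration node reads.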

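3. Add an iteration node

The iteration scrapes the content of each result page. It is configured to scrape only the main content and to remove tags that add no value (for example: style, script, img, svg, a).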
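As a rough illustration of what each iteration step sends, the sketch below builds one scrape request body per link. It assumes the Firecrawl v1 body fields `includeTags`, `excludeTags` and `onlyMainContent`, and assumes Markdown output (the node leaves the format unset). The helper name `buildPageRequests` is hypothetical.

```javascript
// Tags stripped from every result page, as configured on the scrape node
// inside the iteration.
const EXCLUDED_TAGS = ["style", "script", "img", "svg", "a"];

// Hypothetical helper: one scrape request body per search-result link.
function buildPageRequests(links) {
  return links.map((link) => ({
    url: link,
    formats: ["markdown"],      // assumption: Markdown output
    includeTags: ["body"],
    excludeTags: EXCLUDED_TAGS,
    onlyMainContent: true,
    timeout: 30000,
  }));
}
```

Removing `a` tags drops navigation and related-link noise, which keeps the text passed to the model shorter.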

4. Add an LLM node

This node gathers the page content scraped in the previous step and instructs the model to summarize it in its reply.
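The way the scraped pages end up in the system prompt can be sketched as below. This is an illustration under assumptions, not what Dify does internally: the helper `buildSystemPrompt`, the separator, and the per-page character limit are all hypothetical, and the prompt wording is an English paraphrase of the template in the DSL file.

```javascript
// Hypothetical helper: merge scraped page texts into one system prompt.
// maxCharsPerPage is an illustrative guard against overlong contexts.
function buildSystemPrompt(pages, maxCharsPerPage = 4000) {
  const context = pages
    .filter((text) => typeof text === "string" && text.trim() !== "")
    .map((text) => text.trim().slice(0, maxCharsPerPage))
    .join("\n\n---\n\n");
  return [
    "You are a helpful assistant.",
    "Use the following context as knowledge you have learned. " +
      "It comes from a web search, not from the user.",
    context,
    "When answering, avoid mentioning that the information came from the context.",
    "Answer in the language of the user's question.",
  ].join("\n\n");
}
```

Truncating each page is one simple way to stay within a model's context window when several result pages are long.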

5. Add a direct reply node

6. Complete DSL file

It can be imported from the right-click menu in the Dify workflow editor.

app:
  description: ''
  icon: 🤖
  icon_background: '#FFEAD5'
  mode: advanced-chat
  name: Web Search Bot
  use_icon_as_answer_icon: false
kind: app
version: 0.1.5
workflow:
  conversation_variables: []
  environment_variables: []
  features:
    file_upload:
      allowed_file_extensions:
      - .JPG
      - .JPEG
      - .PNG
      - .GIF
      - .WEBP
      - .SVG
      allowed_file_types:
      - image
      allowed_file_upload_methods:
      - local_file
      - remote_url
      enabled: false
      fileUploadConfig:
        audio_file_size_limit: 50
        batch_count_limit: 5
        file_size_limit: 15
        image_file_size_limit: 10
        video_file_size_limit: 100
        workflow_file_upload_limit: 10
      image:
        enabled: false
        number_limits: 3
        transfer_methods:
        - local_file
        - remote_url
      number_limits: 3
    opening_statement: ''
    retriever_resource:
      enabled: true
    sensitive_word_avoidance:
      enabled: false
    speech_to_text:
      enabled: false
    suggested_questions: []
    suggested_questions_after_answer:
      enabled: false
    text_to_speech:
      enabled: false
      language: ''
      voice: ''
  graph:
    edges:
    - data:
        isInIteration: false
        sourceType: start
        targetType: tool
      id: 1740103540241-source-1740121470685-target
      selected: false
      source: '1740103540241'
      sourceHandle: source
      target: '1740121470685'
      targetHandle: target
      type: custom
      zIndex: 0
    - data:
        isInIteration: false
        sourceType: tool
        targetType: code
      id: 1740121470685-source-1740122039767-target
      selected: false
      source: '1740121470685'
      sourceHandle: source
      target: '1740122039767'
      targetHandle: target
      type: custom
      zIndex: 0
    - data:
        isInIteration: false
        sourceType: code
        targetType: iteration
      id: 1740122039767-source-1740122637931-target
      source: '1740122039767'
      sourceHandle: source
      target: '1740122637931'
      targetHandle: target
      type: custom
      zIndex: 0
    - data:
        isInIteration: true
        iteration_id: '1740122637931'
        sourceType: iteration-start
        targetType: tool
      id: 1740122637931start-source-1740122872165-target
      source: 1740122637931start
      sourceHandle: source
      target: '1740122872165'
      targetHandle: target
      type: custom
      zIndex: 1002
    - data:
        isInIteration: false
        sourceType: llm
        targetType: answer
      id: 1740122936994--1740122982132-target
      source: '1740122936994'
      sourceHandle: source
      target: '1740122982132'
      targetHandle: target
      type: custom
      zIndex: 0
    - data:
        isInIteration: false
        sourceType: iteration
        targetType: llm
      id: 1740122637931-source-1740122936994-target
      source: '1740122637931'
      sourceHandle: source
      target: '1740122936994'
      targetHandle: target
      type: custom
      zIndex: 0
    nodes:
    - data:
        desc: ''
        selected: false
        title: 开始
        type: start
        variables: []
      height: 54
      id: '1740103540241'
      position:
        x: 30
        y: 406
      positionAbsolute:
        x: 30
        y: 406
      selected: false
      sourcePosition: right
      targetPosition: left
      type: custom
      width: 244
    - data:
        desc: ''
        provider_id: firecrawl
        provider_name: firecrawl
        provider_type: builtin
        selected: false
        title: 单页面抓取
        tool_configurations:
          excludeTags: null
          formats: links
          headers: null
          includeTags: .result
          onlyMainContent: 1
          prompt: null
          schema: null
          systemPrompt: null
          timeout: 30000
          waitFor: 0
        tool_label: 单页面抓取
        tool_name: scrape
        tool_parameters:
          url:
            type: mixed
            value: https://www.baidu.com/s?wd={{#sys.query#}}
        type: tool
      height: 324
      id: '1740121470685'
      position:
        x: 334
        y: 406
      positionAbsolute:
        x: 334
        y: 406
      selected: false
      sourcePosition: right
      targetPosition: left
      type: custom
      width: 244
    - data:
        code: "\nfunction main({arg1}) {\n return {\n result: arg1[0].data.links\n\
          \ }\n}\n"
        code_language: javascript
        desc: ''
        outputs:
          result:
            children: null
            type: array[string]
        selected: false
        title: 代码执行
        type: code
        variables:
        - value_selector:
          - '1740121470685'
          - json
          variable: arg1
      height: 54
      id: '1740122039767'
      position:
        x: 638
        y: 406
      positionAbsolute:
        x: 638
        y: 406
      selected: false
      sourcePosition: right
      targetPosition: left
      type: custom
      width: 244
    - data:
        desc: ''
        error_handle_mode: terminated
        height: 412
        is_parallel: false
        iterator_selector:
        - '1740122039767'
        - result
        output_selector:
        - '1740122872165'
        - text
        output_type: array[string]
        parallel_nums: 10
        selected: false
        start_node_id: 1740122637931start
        title: 迭代
        type: iteration
        width: 692
      height: 412
      id: '1740122637931'
      position:
        x: 942
        y: 406
      positionAbsolute:
        x: 942
        y: 406
      selected: false
      sourcePosition: right
      targetPosition: left
      type: custom
      width: 692
      zIndex: 1
    - data:
        desc: ''
        isInIteration: true
        selected: false
        title: ''
        type: iteration-start
      draggable: false
      height: 48
      id: 1740122637931start
      parentId: '1740122637931'
      position:
        x: 24
        y: 68
      positionAbsolute:
        x: 966
        y: 474
      selectable: false
      sourcePosition: right
      targetPosition: left
      type: custom-iteration-start
      width: 44
      zIndex: 1002
    - data:
        desc: ''
        isInIteration: true
        iteration_id: '1740122637931'
        provider_id: firecrawl
        provider_name: firecrawl
        provider_type: builtin
        selected: false
        title: 单页面抓取
        tool_configurations:
          excludeTags: style,script,img,svg,a
          formats: null
          headers: null
          includeTags: body
          onlyMainContent: 1
          prompt: null
          schema: null
          systemPrompt: null
          timeout: 30000
          waitFor: 0
        tool_label: 单页面抓取
        tool_name: scrape
        tool_parameters:
          url:
            type: mixed
            value: '{{#1740122637931.item#}}'
        type: tool
      height: 324
      id: '1740122872165'
      parentId: '1740122637931'
      position:
        x: 283.42857142857133
        y: 66.57142857142856
      positionAbsolute:
        x: 1225.4285714285713
        y: 472.57142857142856
      selected: false
      sourcePosition: right
      targetPosition: left
      type: custom
      width: 244
      zIndex: 1002
    - data:
        context:
          enabled: false
          variable_selector: []
        desc: ''
        model:
          completion_params:
            temperature: 0.7
          mode: chat
          name: deepseek-r1:70b
          provider: ollama
        prompt_template:
        - edition_type: basic
          id: 10d89ba1-6562-486a-83c3-6cdb53359a7a
          jinja2_text: ''
          role: system
          text: '你是一个乐于助人的助手。

            XML标记中使用以下上下文作为您学到的知识。这些知识来源于网络搜索,不是用户提供给你的。

            {{#1740122637931.output#}}

            回答用户时,避免提到你是从上下文中获得信息的。

            并根据用户提问的语言进行回答。'
        selected: false
        title: LLM
        type: llm
        variables: []
        vision:
          enabled: false
      height: 98
      id: '1740122936994'
      position:
        x: 1694
        y: 406
      positionAbsolute:
        x: 1694
        y: 406
      selected: true
      sourcePosition: right
      targetPosition: left
      type: custom
      width: 244
    - data:
        answer: '{{#1740122936994.text#}}'
        desc: ''
        selected: false
        title: 直接回复
        type: answer
        variables: []
      height: 103
      id: '1740122982132'
      position:
        x: 1998
        y: 406
      positionAbsolute:
        x: 1998
        y: 406
      selected: false
      sourcePosition: right
      targetPosition: left
      type: custom
      width: 244
    viewport:
      x: -629
      y: -20
      zoom: 0.7

V. Chat experience
