Arcana-backed enterprise RAG tools and plugins for CMDC.

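cmdc_rag_arcana is a standalone RAG extension package for CMDC. Rather than pushing the Arcana dependency into cmdc core, it connects to enterprise knowledge bases through the standard CMDC.Tool / CMDC.Plugin boundaries.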

Capabilities

| Module | Purpose |
| --- | --- |
| CMDCRAGArcana.Tool.Search | rag_search: read-only retrieval; returns chunks / citations / scores |
| CMDCRAGArcana.Tool.Answer | rag_answer: generates answers with citations, backed by Arcana |
| CMDCRAGArcana.Tool.IngestStatus | rag_ingest_status: read-only query of index status |
| CMDCRAGArcana.Plugin.AccessControl | Collection ACL; fails closed at :before_tool |
| CMDCRAGArcana.Plugin.CitationAudit | Citation access events; emitted at :after_tool |
| CMDCRAGArcana.Ingestion | Import adapter contract callable from Oban workers |
| CMDCRAGArcana.Ingestion.ParsedDocument | Contract for the parsed artifact produced by an OCR / parser sidecar |
| CMDCRAGArcana.CitationSpan | Citation location at page / table / bbox / char offset granularity |
| CMDCRAGArcana.ProgressEvent | Unified progress event payload for ingestion / reembed / graph |
| CMDCRAGArcana.Eval.ArcanaAdapter | Arcana Evaluation → cmdc_eval adapter |
| CMDCRAGArcana.Eval.TelemetryBridge | Arcana Evaluation telemetry → CMDC EventBus |
| CMDCRAGArcana.Eval.Gate | RAG Eval release-gate recipe and threshold checks |
| CMDCRAGArcana.Maintenance | Arcana maintenance wrapper with unified progress telemetry/events |
| CMDCRAGArcana.Backend | Behaviour for Arcana calls, making testing and substitution easier |

rag_search / rag_answer fail closed by default: even when the Agent passes collections, access must still be granted through allowed_collections, collection_policies, or an explicit default_allow?: true.
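The fail-closed rule can be illustrated with a minimal sketch in plain Elixir. The module `ACLSketch` and its return shapes are hypothetical and are not the actual CMDCRAGArcana.Plugin.AccessControl implementation; only the option names `allowed_collections` and `default_allow?` come from this README, and `collection_policies` is left out because its shape is not described here.

```elixir
defmodule ACLSketch do
  @moduledoc """
  Hypothetical fail-closed collection check (illustration only).
  """

  # `requested` is the list of collections the Agent asked for.
  # `config` reuses the option names from this README; how the real
  # plugin combines them is an assumption.
  def authorize(requested, config) when is_list(requested) do
    allowed = Map.get(config, :allowed_collections, [])
    default_allow? = Map.get(config, :default_allow?, false)

    cond do
      # An explicit opt-in lets everything through (development only).
      default_allow? == true ->
        {:ok, requested}

      # An empty request has nothing to authorize, so it is denied.
      requested == [] ->
        {:error, :no_collections}

      # Every requested collection must be on the allowlist.
      Enum.all?(requested, &(&1 in allowed)) ->
        {:ok, requested}

      # Anything else, including a missing allowlist, is denied.
      true ->
        {:error, {:forbidden, Enum.reject(requested, &(&1 in allowed))}}
    end
  end
end
```

The point of the sketch is that denial is the fallthrough case: a missing or empty allowlist rejects the call instead of silently permitting it.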

Explicit non-goals

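  • Agents are not allowed to ingest / delete enterprise knowledge-base documents directly.
  • Arcana Loop is not exposed by default, so that nesting an agent inside an Agent does not weaken CMDC trace / cost / approval controls.
  • No Arcana / pgvector / Nx / Bumblebee dependencies are introduced into cmdc core.
  • The current version does not include a full Knowledge UI / GraphRAG / data flywheel.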

Knowledge Control Plane

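Starting with v0.2, this package provides the seams for an enterprise knowledge-base control plane, but it holds no enterprise Ecto schemas and has no Oban dependency. The production platform should maintain the following in its Phoenix app: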

  • KnowledgeCollection / KnowledgeDocument / DocumentVersion
  • IngestionRun / IndexStatus / SourceMapping
  • Tenants, ACLs, approvals, retention periods, sensitivity levels, and active version switching

For the detailed schema draft, the Oban worker skeleton, the Arcana dashboard boundaries, and maintenance usage, see the Knowledge Control Plane guide.

Parser / OCR Sidecar

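Arcana's built-in parsers are suited to text extraction from txt/md/pdf. Complex OCR, layout analysis, and table extraction should be handled by a Python sidecar or an enterprise parser that outputs a ParsedDocument: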

%CMDCRAGArcana.Ingestion.ParsedDocument{
  text: "Policy body...",
  content_type: "application/pdf",
  checksum: "sha256:...",
  source_uri: "kb://policies/approval.pdf",
  pages: [
    %CMDCRAGArcana.Ingestion.ParsedPage{
      page_number: 3,
      text: "High-risk operations require approval",
      section: "Approval Policy",
      bbox: %{x: 10, y: 20, width: 300, height: 80}
    }
  ],
  tables: [
    %CMDCRAGArcana.Ingestion.ParsedTable{
      id: "tbl-approval",
      page_number: 3,
      markdown: "| Risk | Approval |\n| L3 | Manager approval |"
    }
  ]
}

Ingestion.run/2 normalizes this artifact into Arcana ingest text plus document metadata. Citations can then output a span:

{
  "source_uri": "kb://policies/approval.pdf",
  "span": {
    "page_number": 3,
    "section": "Approval Policy",
    "table_id": "tbl-approval",
    "bbox": {"x": 10, "y": 20, "width": 300, "height": 80}
  }
}
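To make the relationship between a parsed page, a parsed table, and the resulting span concrete, here is a small sketch using plain maps. `SpanSketch.build/3` is a hypothetical helper, not part of CMDCRAGArcana; the assumption that a table is attached only when it sits on the same page is for illustration only.

```elixir
defmodule SpanSketch do
  # Hypothetical helper: builds the span map for one page, attaching a
  # table id when a table on the same page is supplied.
  def build(source_uri, page, table \\ nil) do
    span =
      %{
        page_number: page.page_number,
        section: page[:section],
        bbox: page[:bbox]
      }
      |> maybe_put_table(page, table)
      # Drop keys the parser did not provide so the span stays compact.
      |> Enum.reject(fn {_key, value} -> is_nil(value) end)
      |> Map.new()

    %{source_uri: source_uri, span: span}
  end

  # The repeated `n` only matches when page and table share a page number.
  defp maybe_put_table(span, %{page_number: n}, %{page_number: n, id: id}),
    do: Map.put(span, :table_id, id)

  defp maybe_put_table(span, _page, _table), do: span
end
```

Calling `SpanSketch.build/3` with the page and table from the example above would yield a map with the same keys as the JSON span; a page without section or bbox data produces a span that carries only `page_number`.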

Installation

defp deps do
  [
    {:cmdc, "~> 0.6"},
    {:cmdc_eval, "~> 0.2"},
    {:cmdc_rag_arcana, "~> 0.3"}
  ]
end

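Arcana itself requires an Ecto Repo, PostgreSQL with pgvector, and embedder configuration. Production projects should follow Arcana's official installation process to complete the migrations and the supervision tree setup.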

Agent integration

{:ok, session} =
  CMDC.create_agent(
    model: "anthropic:claude-sonnet-4-5",
    tools: [
      CMDCRAGArcana.Tool.Search,
      CMDCRAGArcana.Tool.Answer,
      CMDCRAGArcana.Tool.IngestStatus
    ],
    plugins: [
      {CMDCRAGArcana.Plugin.AccessControl,
       allowed_collections: ["policies", "sop"]},
      CMDCRAGArcana.Plugin.CitationAudit
    ],
    user_data: %{
      tenant_id: "tenant-a",
      user_id: "alice",
      roles: ["ops"],
      cmdc_rag_arcana: %{
        repo: MyApp.Repo,
        llm: "openai:gpt-4o-mini",
        status_backend: MyApp.Knowledge.RAGStatusBackend,
        allowed_collections: ["policies", "sop"]
      }
    }
  )

When the Agent calls rag_search, it should pass collections:

{
  "query": "How many levels of approval do high-risk operations require?",
  "collections": ["policies"],
  "top_k": 5,
  "mode": "hybrid"
}

The return value is a JSON string containing results, citations, and metadata. CitationAudit additionally emits:

  • :rag_retrieved
  • :rag_answered
  • :rag_citation_used

When the Agent calls rag_ingest_status, it only performs a read-only status query:

{
  "collection": "policies",
  "document_id": "doc-1",
  "version_id": "ver-2026-05"
}

The return value includes fields such as status.status, status.graph_status, status.stale?, and status.chunk_count. This tool never triggers ingest/delete/rebuild.

Background import and re-embedding progress use a unified payload:

%CMDCRAGArcana.ProgressEvent{
  kind: :ingestion,
  event: :progress,
  status: :running,
  tenant_id: "tenant-a",
  collection: "policies",
  document_id: "doc-1",
  version_id: "ver-2026-05",
  current: 10,
  total: 100,
  percent: 10.0
}
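As a rough illustration of how `percent` relates to `current` and `total` in the payload above, the following sketch derives it with plain maps. `ProgressSketch` is a hypothetical module; the rounding to one decimal place, the clamp at 100, and the 0.0 fallback for an unknown total are assumptions, not documented behaviour of CMDCRAGArcana.ProgressEvent.

```elixir
defmodule ProgressSketch do
  # Hypothetical helper: derives `percent` from `current` and `total`,
  # returning 0.0 while the total is still unknown or zero.
  def percent(_current, total) when total in [nil, 0], do: 0.0

  def percent(current, total) when is_integer(current) and is_integer(total) do
    # `/` always yields a float; clamp so an overshooting counter never
    # reports more than 100%.
    min(Float.round(current * 100 / total, 1), 100.0)
  end

  # Adds the derived `percent` to a map that carries `current` and `total`.
  def with_percent(%{current: current, total: total} = event) do
    Map.put(event, :percent, percent(current, total))
  end
end
```

With `current: 10` and `total: 100` this gives the `percent: 10.0` shown in the payload.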

Swapping the backend in tests

defmodule MyMockRAG do
  @behaviour CMDCRAGArcana.Backend

  def search(_query, _opts), do: {:ok, [%{id: "c1", text: "policy", score: 0.9}]}
  def answer(_question, _opts), do: {:ok, "answer", [%{id: "c1", text: "policy"}]}
end

Then configure it in user_data or in a direct call:

cmdc_rag_arcana: %{backend: MyMockRAG, allowed_collections: ["policies"]}

If a development environment needs to open up the collection ACL temporarily, configure it explicitly:

cmdc_rag_arcana: %{backend: MyMockRAG, default_allow?: true}

Production environments should use allowed_collections / collection_policies and should not rely on default_allow?: true.

RAG Eval and release gating

v0.3 reuses Arcana's built-in Evaluation and feeds the results into cmdc_eval:

alias CMDCRAGArcana.Eval.{ArcanaAdapter, Gate, TelemetryBridge}

handler_id = TelemetryBridge.attach(session_id: "release-run-1")

{:ok, result} =
  ArcanaAdapter.run(
    repo: MyApp.Repo,
    mode: :hybrid,
    evaluate_answers: true,
    llm: &MyApp.LLM.complete/4,
    target: :ask
  )

Gate.check(result.cmdc_metadata,
  recall_at_5: 0.85,
  faithfulness: 0.8,
  correctness: 0.8,
  unauthorized_source_count: 0
)

TelemetryBridge.detach(handler_id)

Gate.recipe/2 gives the recommended order to run before publishing an AgentSpec: RAG Eval → Tool Calling Eval → Safety Eval. Report fields cover recall, citation, faithfulness, correctness, unauthorized source, cost, and latency.
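The idea behind a threshold gate can be sketched in a few lines of plain Elixir. `GateSketch` is a hypothetical module and does not reproduce Gate.check/2; in particular, the return shapes, the treatment of `unauthorized_source_count` as an upper bound, and the rule that a missing metric fails the gate are assumptions made for illustration.

```elixir
defmodule GateSketch do
  # Metrics where a lower value is better; everything else is treated as
  # "higher is better". This split is an assumption for illustration.
  @max_metrics [:unauthorized_source_count]

  # Returns :ok when every threshold holds, otherwise the failing metrics.
  def check(metrics, thresholds) when is_map(metrics) and is_list(thresholds) do
    failures =
      Enum.reject(thresholds, fn {name, limit} ->
        passes?(name, Map.get(metrics, name), limit)
      end)

    if failures == [], do: :ok, else: {:error, Keyword.keys(failures)}
  end

  # A missing metric fails the gate rather than being skipped.
  defp passes?(_name, nil, _limit), do: false
  defp passes?(name, value, limit) when name in @max_metrics, do: value <= limit
  defp passes?(_name, value, limit), do: value >= limit
end
```

For example, `GateSketch.check(%{recall_at_5: 0.9, correctness: 0.75}, recall_at_5: 0.85, correctness: 0.8)` reports `:correctness` as the failing metric, which is the kind of signal a release pipeline would use to block publication.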

License

Apache 2.0