
Dataset Formats

RoleThread Lite Docs

Repo: Lattice-Foundry/RoleThread-Lite
Path: docs/help/03_dataset_formats.md
Version: 1.4.45
Commit: 3fbdfa7320
Synced: May 29, 2026, 03:35 AM UTC

RoleThread Lite works with JSONL datasets for narrative AI training. The two main conversation formats are ChatML and ShareGPT.

You do not need to memorize every detail, but it helps to understand what RoleThread expects and what it changes during import/export.

JSONL Basics

JSONL means "JSON Lines." Each line is one complete JSON object, and lines are separated by newlines rather than wrapped in an outer array.

A small dataset might look like this:

{"messages":[{"role":"system","content":"You are a helpful narrator."},{"role":"user","content":"Describe the old house."},{"role":"assistant","content":"The house leaned against the hill, windows dark and watchful."}],"tags":["setting_description"]}
{"messages":[{"role":"system","content":"You are a helpful narrator."},{"role":"user","content":"What does Mara notice?"},{"role":"assistant","content":"Mara notices the candle smoke moving against the draft."}],"tags":["observation"]}

Each entry is separate. RoleThread Lite loads those entries into a browser so you can inspect, edit, tag, validate, and export them.
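To inspect a JSONL file outside the app, a few lines of Python are enough. The sketch below is not part of RoleThread Lite; `read_jsonl` is an illustrative helper name, and skipping blank lines is an assumption made here for convenience.

```python
import io
import json

def read_jsonl(stream):
    """Yield one parsed JSON object per non-empty line of a text stream."""
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # assumption: blank lines are tolerated and skipped
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            # Report which line failed so a bad record is easy to find.
            raise ValueError(f"line {line_number}: invalid JSON ({exc.msg})") from exc

# An in-memory stand-in for a file opened with open(path, encoding="utf-8").
sample = io.StringIO(
    '{"messages":[{"role":"user","content":"Describe the old house."}],"tags":["setting_description"]}\n'
    '{"messages":[{"role":"user","content":"What does Mara notice?"}],"tags":["observation"]}\n'
)
entries = list(read_jsonl(sample))
```

Because each line parses on its own, one malformed record can be reported by line number without affecting the others.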

ChatML

ChatML is RoleThread Lite's main working format. A ChatML-style entry has a messages list with standard roles:

{
  "messages": [
    {"role": "system", "content": "You are writing a grounded fantasy scene."},
    {"role": "user", "content": "What does the traveler see?"},
    {"role": "assistant", "content": "The road bends toward a watchtower half swallowed by ivy."}
  ],
  "tags": ["roleplay", "descriptive"]
}

The usual roles are:

  • system: instructions or framing for the entry
  • user: the prompt or human-side turn
  • assistant: the response the model is trained to produce

Most RoleThread editing tools work directly with this structure.
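The shape described above can be checked with a small function. This is a minimal sketch and not RoleThread's own Validation feature; `chatml_problems` and `STANDARD_ROLES` are names invented for this example, and it assumes each entry is already a parsed JSON object (a Python dict).

```python
STANDARD_ROLES = {"system", "user", "assistant"}

def chatml_problems(entry):
    """Return a list of problems; an empty list means the shape looks fine."""
    messages = entry.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]
    problems = []
    for index, message in enumerate(messages):
        if not isinstance(message, dict):
            problems.append(f"message {index}: not an object")
            continue
        # Only the three standard roles are accepted.
        if message.get("role") not in STANDARD_ROLES:
            problems.append(f"message {index}: unexpected role {message.get('role')!r}")
        if not isinstance(message.get("content"), str):
            problems.append(f"message {index}: 'content' must be a string")
    return problems
```

Returning a list of problems, rather than raising on the first one, makes it possible to report everything wrong with an entry in one pass.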

ShareGPT

ShareGPT-style records usually use a conversations list instead of messages:

{
  "conversations": [
    {"from": "human", "value": "Say hello as Assistant."},
    {"from": "gpt", "value": "Hi, User. I missed you."}
  ],
  "tags": ["greeting"]
}

When RoleThread Lite loads ShareGPT data, it converts each record into a ChatML-style entry for editing. Common role names are mapped to the standard roles.

For example:

  • human becomes user
  • gpt becomes assistant
  • system-like turns become system

If a ShareGPT record has no system prompt, RoleThread may inject a simple internal system prompt so the entry has a valid ChatML shape.
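The conversion can be pictured roughly as follows. This is a sketch of the idea, not RoleThread's actual importer: `sharegpt_to_chatml`, `ROLE_MAP`, and the placeholder system prompt text are all assumptions, and only the literal role name `system` is treated as system-like here because the full mapping is not listed in this page.

```python
ROLE_MAP = {"human": "user", "gpt": "assistant", "system": "system"}

# Placeholder text only; the prompt RoleThread injects is not documented here.
DEFAULT_SYSTEM_PROMPT = "You are a helpful assistant."

def sharegpt_to_chatml(record, default_system=DEFAULT_SYSTEM_PROMPT):
    """Convert one ShareGPT-style record into a ChatML-style entry."""
    messages = []
    for turn in record["conversations"]:
        source = turn["from"]
        if source not in ROLE_MAP:
            raise ValueError(f"unmapped ShareGPT role: {source!r}")
        messages.append({"role": ROLE_MAP[source], "content": turn["value"]})
    # Add a system turn when the record has none, so the entry has one.
    if not any(m["role"] == "system" for m in messages):
        messages.insert(0, {"role": "system", "content": default_system})
    # Carry over other top-level fields (for example "tags") unchanged.
    converted = {key: value for key, value in record.items() if key != "conversations"}
    converted["messages"] = messages
    return converted

record = {
    "conversations": [
        {"from": "human", "value": "Say hello as Assistant."},
        {"from": "gpt", "value": "Hi, User. I missed you."},
    ],
    "tags": ["greeting"],
}
entry = sharegpt_to_chatml(record)
```

Raising on an unknown role name, instead of guessing, keeps a bad mapping from silently entering the dataset.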

RoleThread-Native Metadata

When RoleThread saves entries, it may add a _rolethread metadata block. That block can include:

  • app version
  • native/trusted save marker
  • entry UUID
  • dataset UUID
  • validation timestamp

This metadata helps RoleThread preserve identity across edits, merges, sidecars, and selection workflows.

It is not meant to be part of the conversational training text.

Entry UUID and Dataset UUID

An entry UUID is a stable identity for one entry. It helps RoleThread keep track of the same entry even after filtering, pagination, editing, or merging.

A dataset UUID identifies a saved RoleThread dataset. It helps RoleThread check that nearby sidecar metadata belongs with the dataset being loaded.

Most users do not need to manage UUIDs manually. They exist so the app can track entries and match sidecar metadata to the right dataset reliably.

Group Chat Mode Still Exports Standard Roles

Group Chat mode lets you assign characters to individual turns while editing. This affects display and metadata only.

The exported JSONL training roles remain:

  • system
  • user
  • assistant

RoleThread does not export custom role names as training roles. Character names are preserved separately as metadata and sidecar information.

That distinction matters. It keeps exported datasets compatible with standard training expectations while still letting you organize multi-character scenes.
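One way to picture the separation is a function that splits editing-time turns into exportable messages and a separate character map. This is purely illustrative: the per-turn `character` key and the `split_characters` helper are hypothetical, and RoleThread's real metadata and sidecar layout are not described in this page.

```python
def split_characters(turns):
    """Separate per-turn character names from the exportable messages.

    Assumes each turn is a dict with "role" and "content" plus an optional
    "character" key (a hypothetical field name used only in this sketch).
    """
    messages = []
    characters = {}
    for index, turn in enumerate(turns):
        # The exported message keeps only the standard role and the text.
        messages.append({"role": turn["role"], "content": turn["content"]})
        name = turn.get("character")
        if name:
            characters[index] = name  # stored outside the training record
    return messages, characters

turns = [
    {"role": "user", "content": "Who goes there?", "character": "Mara"},
    {"role": "assistant", "content": "A friend, nothing more.", "character": "Traveler"},
]
messages, characters = split_characters(turns)
```

In this sketch the training record never sees "Mara" or "Traveler" as role names; they exist only in the side mapping keyed by turn index.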

Clean Export

Clean export removes RoleThread-specific metadata from the exported training records.

Clean export is useful when you want a plain dataset for training or sharing without:

  • _rolethread metadata
  • internal identity fields
  • tags or other non-message fields, depending on export settings

Use clean export when the target system should only receive conversation records.
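A rough equivalent of clean export, for records you process outside the app, is to keep only the conversation fields. The sketch below is not RoleThread's exporter: `clean_record`, `to_jsonl`, and the `keep_tags` switch are hypothetical stand-ins for the app's export settings.

```python
import json

def clean_record(entry, keep_tags=False):
    """Return a copy of the entry holding only conversation data."""
    cleaned = {
        "messages": [
            {"role": m["role"], "content": m["content"]} for m in entry["messages"]
        ]
    }
    # Everything else, including "_rolethread", is dropped by omission.
    if keep_tags and "tags" in entry:
        cleaned["tags"] = list(entry["tags"])
    return cleaned

def to_jsonl(entries, keep_tags=False):
    """Serialize entries as JSON Lines text, one cleaned record per line."""
    return "".join(
        json.dumps(clean_record(e, keep_tags), ensure_ascii=False) + "\n"
        for e in entries
    )

entry = {
    "messages": [{"role": "user", "content": "Hi"}],
    "tags": ["greeting"],
    "_rolethread": {},  # contents left empty; the real field names are not listed here
}
output = to_jsonl([entry])
```

Building the cleaned record from an allowlist of fields, rather than deleting known metadata keys, means any field not anticipated here is also left out.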

Practical Guidance

  • Use ChatML as the normal editing format in RoleThread Lite.
  • Import ShareGPT when you already have data in that shape.
  • Run Validation after import or conversion.
  • Use clean export for final training files when you do not want RoleThread metadata included.
  • Keep sidecar files with datasets when you want tags, characters, and prompt metadata to travel with them.