ragrank.dataset

ragrank.dataset.base

Contains all of the base classes for datasets.

ragrank.dataset.reader

Reader module for Ragrank

Contains all of the modules related to datasets.

class ragrank.dataset.ColumnMap(*, question: str = 'question', context: str = 'context', response: str = 'response')
Represents a mapping of column names to their corresponding names in a dataset.

question

The name of the column containing questions.

Type:

str

context

The name of the column containing contexts.

Type:

str

response

The name of the column containing responses.

Type:

str

model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'context': FieldInfo(annotation=str, required=False, default='context', description='The name of the column containing contexts'), 'question': FieldInfo(annotation=str, required=False, default='question', description='The name of the column containing questions'), 'response': FieldInfo(annotation=str, required=False, default='response', description='The name of the column containing responses')}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.
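The role of the column map can be illustrated with a minimal standard-library sketch. `ColumnMapSketch` and `remap_columns` are hypothetical stand-ins written for this example, not part of ragrank; only the three field names and their defaults are taken from the signature above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ColumnMapSketch:
    """Hypothetical stand-in mirroring ColumnMap's three fields and defaults."""

    question: str = "question"
    context: str = "context"
    response: str = "response"


def remap_columns(raw: Dict[str, List], column_map: ColumnMapSketch) -> Dict[str, List]:
    """Read the mapped source columns and return them under the canonical names."""
    return {
        "question": raw[column_map.question],
        "context": raw[column_map.context],
        "response": raw[column_map.response],
    }


# Source data whose column names differ from the defaults.
raw = {
    "query": ["What is RAG?"],
    "docs": [["RAG stands for retrieval-augmented generation."]],
    "answer": ["Retrieval-augmented generation."],
}
mapped = remap_columns(
    raw, ColumnMapSketch(question="query", context="docs", response="answer")
)
```

With the default mapping, the source columns would be expected to be named `question`, `context`, and `response` already.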

class ragrank.dataset.DataNode(*, question: str, context: List[str], response: str)

Represents a single data point in a dataset.

question

The question associated with the data point.

Type:

str

context

The context or background information related to the question.

Type:

List[str]

response

The response or answer to the question.

Type:

str

model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'context': FieldInfo(annotation=List[str], required=True, description='The context information related to the question'), 'question': FieldInfo(annotation=str, required=True, description='The question associated with the data point'), 'response': FieldInfo(annotation=str, required=True, description='The response or answer to the question')}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

to_dataset() → Dataset

Convert the data node to a Dataset instance.

Returns:

A Dataset instance containing the current data node.

Return type:

Dataset
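Judging by the field types (`str` and `List[str]` on a data node, `List[str]` and `List[List[str]]` on a dataset), the conversion presumably wraps each field in a one-element list. The sketch below shows that shape change with plain dictionaries; `node_to_dataset_columns` is a hypothetical helper, not the library's implementation.

```python
from typing import Dict, List


def node_to_dataset_columns(question: str, context: List[str], response: str) -> Dict[str, list]:
    """Wrap each field of a single data point in a one-element list."""
    return {
        "question": [question],
        "context": [context],
        "response": [response],
    }


single_point = node_to_dataset_columns(
    question="What is RAG?",
    context=["RAG stands for retrieval-augmented generation."],
    response="Retrieval-augmented generation.",
)
```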

class ragrank.dataset.Dataset(*, question: List[str], context: List[List[str]], response: List[str])
Represents a dataset containing questions, contexts, and responses.

question

A list of questions.

Type:

List[str]

context

A list of contexts, each represented as a list of strings.

Type:

List[List[str]]

response

A list of responses corresponding to the questions.

Type:

List[str]

append(data_node: DataNode) → None

Append a DataNode to the dataset.

Parameters:

data_node (DataNode) – The DataNode to append.
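As a rough illustration of what appending means for the three parallel lists, the sketch below adds one item to each column so that they stay the same length. `append_node` and the plain-dictionary representation are assumptions made for this example and do not reflect ragrank's internals.

```python
from typing import Dict, List


def append_node(columns: Dict[str, list], question: str, context: List[str], response: str) -> None:
    """Add one data point by extending every column with exactly one item."""
    columns["question"].append(question)
    columns["context"].append(context)
    columns["response"].append(response)


columns = {"question": ["q1"], "context": [["c1"]], "response": ["r1"]}
append_node(columns, "q2", ["c2a", "c2b"], "r2")
```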

model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'context': FieldInfo(annotation=List[List[str]], required=True, description='A list of contexts, each represented as a list of strings'), 'question': FieldInfo(annotation=List[str], required=True, description='A list of questions, each represented as a string'), 'response': FieldInfo(annotation=List[str], required=True, description='A list of responses corresponding to the questions')}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

to_csv(path: str | Path, **kwargs: Any) → None

Save the data as a CSV file.

Parameters:

path (str | Path) – The path to the CSV file.

Returns:

None

to_dataframe() → DataFrame

Return a pandas DataFrame of the data.

Parameters:

None

Returns:

A DataFrame representation of the data.

Return type:

DataFrame

to_dict() → Dict[str, List[str] | List[List[str]]]

Return a dict of the data.

Parameters:

None

Returns:

A dictionary representation of the data.

Return type:

dict

validator() → Dataset

Validate the dataset after instantiation.

Raises:

ValueError – If the number of data points is not consistent across question, context, and response.
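The consistency rule described above can be sketched as a plain length check. `check_consistent_lengths` is a hypothetical function written for illustration, and the error message is an assumption, not the library's wording.

```python
from typing import List


def check_consistent_lengths(
    question: List[str], context: List[List[str]], response: List[str]
) -> None:
    """Raise ValueError unless all three columns hold the same number of items."""
    lengths = {
        "question": len(question),
        "context": len(context),
        "response": len(response),
    }
    if len(set(lengths.values())) != 1:
        raise ValueError(f"Inconsistent number of data points: {lengths}")


# Two questions, two contexts, two responses: passes silently.
check_consistent_lengths(["q1", "q2"], [["c1"], ["c2"]], ["r1", "r2"])

# Two questions but only one context: rejected.
try:
    check_consistent_lengths(["q1", "q2"], [["c1"]], ["r1", "r2"])
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```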

with_progress(purpose: str = 'Iterating') → tqdm

Return a tqdm progress bar for iterating over the dataset.

Parameters:

purpose (str) – The purpose for iterating over the dataset.

Returns:

A tqdm progress bar.

Return type:

tqdm

ragrank.dataset.from_csv(path: str | Path, *, column_map: ColumnMap | None = None, **kwargs: Any) → Dataset | DataNode

Create a Dataset or DataNode object from a CSV file.

Parameters:
  • path (Union[str, Path]) – The path to the CSV file.

  • column_map (ColumnMap, optional) – Column mapping. Defaults to ColumnMap().

  • **kwargs – Keyword arguments to pass to pandas read_csv function.

Returns:

Either a Dataset or DataNode object.

Return type:

Union[Dataset, DataNode]

ragrank.dataset.from_dataframe(data: DataFrame, *, return_as_dataset: bool = False, column_map: ColumnMap | None = None) → Dataset | DataNode

Create a Dataset or DataNode object from a Pandas DataFrame.

Parameters:
  • data (pd.DataFrame) – The DataFrame containing the data.

  • return_as_dataset (bool, optional) – If True, return as Dataset object, otherwise return as DataNode. Defaults to False.

  • column_map (ColumnMap, optional) – Column mapping. Defaults to ColumnMap().

Returns:

Either a Dataset or DataNode object.

Return type:

Union[Dataset, DataNode]

ragrank.dataset.from_dict(data: Dict[str, List[str] | str] | Dict[str, List[str] | List[List[str]]], *, return_as_dataset: bool = False, column_map: ColumnMap | None = None) → Dataset | DataNode

Create a Dataset or DataNode object from a dictionary representation.

Parameters:
  • data (Union[DATANODE_TYPE, DATASET_TYPE]) – The dictionary containing the data representation.

  • return_as_dataset (bool, optional) – If True, return as Dataset object, otherwise return as DataNode. Defaults to False.

  • column_map (ColumnMap, optional) – Column mapping. Defaults to ColumnMap().

Returns:

Either a Dataset or DataNode object.

Return type:

Union[Dataset, DataNode]

Raises:

ValueError – If the column specified in column_map is not present in the data.
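The documented error case can be sketched as a lookup that fails early when a mapped column is absent from the input. `require_columns` is a hypothetical helper, and the column map is represented here as a plain dictionary from canonical name to source column; the real function may perform the check differently.

```python
from typing import Any, Dict, List


def require_columns(data: Dict[str, Any], column_map: Dict[str, str]) -> List[str]:
    """Return the mapped source column names, or raise ValueError if any is missing."""
    missing = [column for column in column_map.values() if column not in data]
    if missing:
        raise ValueError(f"Columns not present in the data: {missing}")
    return list(column_map.values())


column_map = {"question": "query", "context": "docs", "response": "answer"}

# All mapped columns are present.
found = require_columns(
    {"query": "What is RAG?", "docs": ["Some context."], "answer": "An answer."},
    column_map,
)

# The "answer" column is absent, so the lookup is rejected.
try:
    require_columns({"query": "What is RAG?", "docs": ["Some context."]}, column_map)
    missing_detected = False
except ValueError:
    missing_detected = True
```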

ragrank.dataset.from_hfdataset(url: str | Tuple[str], *, split: str, column_map: ColumnMap | None = None) → Dataset

Create a Dataset object from a Hugging Face dataset.

Parameters:
  • url (Union[str, Tuple[str]]) – The URL or tuple of URLs pointing to the dataset.

  • split (str) – The name of the split to load from the dataset.

  • column_map (ColumnMap, optional) – Column mapping. Defaults to ColumnMap().

Returns:

A Dataset object containing the loaded data.

Return type:

Dataset