COAT: Discovery of the Hidden World with Large Language Models

1Hong Kong Baptist University, 2MBZUAI, 3The Chinese University of Hong Kong
4The University of Sydney, 5The University of Melbourne, 6Carnegie Mellon University
(*Equal Contribution)


Science originates with discovering new causal knowledge from a combination of known facts and observations. Traditional causal discovery approaches mainly rely on high-quality measured variables, usually given by human experts, to find causal relations. However, such causal variables are unavailable in a wide range of real-world applications. The rise of large language models (LLMs), which are trained to learn rich knowledge from massive observations of the world, provides a new opportunity to assist with discovering high-level hidden variables from raw observational data. We therefore introduce COAT: Causal representatiOn AssistanT. COAT incorporates LLMs as a factor proposer that extracts potential causal factors from unstructured data. Moreover, LLMs can be instructed to provide additional information used to collect data values (e.g., annotation criteria) and to further parse the raw unstructured data into structured data. The annotated data is then fed to a causal learning module (e.g., the FCI algorithm) that provides both rigorous explanations of the data and useful feedback to further improve the extraction of causal factors by LLMs. We verify the effectiveness of COAT in uncovering the underlying causal system with two case studies: review rating analysis and neuropathic diagnosis.

COAT Framework


Figure 1. Illustration of the COAT framework.

Inspired by real-world causal discovery applications, given a new task with unstructured observational data, COAT aims to uncover the Markov blanket with respect to a target variable:
  • (a) Factor Proposal. COAT first adopts an LLM to read, comprehend, and relate the rich knowledge acquired during pre-training to propose a series of candidate factors, along with meta-information such as annotation guidelines.
  • (b) Factor Annotation. Based on the candidate factors, COAT then prompts another LLM to annotate or fetch structured values from the unstructured data.
  • (c1) Causal Discovery. With the annotated structured data, a causal discovery algorithm is called to find causal relations among the factors.
  • (c2) Feedback Construction. By looking at samples where the target variable cannot be well explained by the existing factors, the LLM is expected to associate more related knowledge and uncover the desired causal factors.
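
The four steps above can be read as a loop. The following is a minimal sketch of that control flow, not the authors' implementation: the function names (coat_loop, unexplained) and the toy stubs are hypothetical, the LLM calls are replaced by keyword lookups, and the causal discovery step is replaced by a simple marginal-dependence filter rather than FCI.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, List, Sequence

Row = Dict[str, int]  # factor name -> annotated binary value


def unexplained(rows: Sequence[Row], targets: Sequence[int],
                blanket: Sequence[str]) -> List[int]:
    """(c2) Indices of samples whose target differs from the majority target
    among samples that share the same values of the current blanket factors.
    This grouping rule is an assumption made for illustration."""
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        groups[tuple(row[f] for f in blanket)].append(i)
    bad: List[int] = []
    for members in groups.values():
        majority = Counter(targets[i] for i in members).most_common(1)[0][0]
        bad.extend(i for i in members if targets[i] != majority)
    return sorted(bad)


def coat_loop(texts: Sequence[str], targets: Sequence[int],
              propose: Callable[[List[str], List[str]], List[str]],
              annotate: Callable[[str, str], int],
              discover: Callable[[List[Row], Sequence[int], List[str]], List[str]],
              max_rounds: int = 3) -> List[str]:
    factors: List[str] = []
    blanket: List[str] = []
    feedback = list(range(len(texts)))  # first round: show all samples
    for _ in range(max_rounds):
        # (a) Factor proposal from the current feedback samples.
        samples = [texts[i] for i in feedback]
        new = [f for f in propose(samples, factors) if f not in factors]
        if not new:
            break
        factors.extend(new)
        # (b) Factor annotation: unstructured text -> structured rows.
        rows = [{f: annotate(t, f) for f in factors} for t in texts]
        # (c1) Causal discovery over the annotated factors.
        blanket = discover(rows, targets, factors)
        # (c2) Feedback: samples the current factors do not explain.
        feedback = unexplained(rows, targets, blanket)
        if not feedback:
            break
    return blanket


# --- Toy stand-ins (hypothetical; a real system would call LLMs and FCI) ---
KEYWORDS = {"size": "large", "taste": "sweet"}


def toy_propose(samples: List[str], known: List[str]) -> List[str]:
    # Proposes one not-yet-known factor per round; ignores the samples.
    for name in KEYWORDS:
        if name not in known:
            return [name]
    return []


def toy_annotate(text: str, factor: str) -> int:
    return int(KEYWORDS[factor] in text)


def toy_discover(rows: List[Row], targets: Sequence[int],
                 factors: List[str]) -> List[str]:
    # Keeps factors whose value shifts the mean target; NOT the FCI algorithm.
    kept = []
    for f in factors:
        on = [t for r, t in zip(rows, targets) if r[f] == 1]
        off = [t for r, t in zip(rows, targets) if r[f] == 0]
        if on and off and abs(sum(on) / len(on) - sum(off) / len(off)) > 0.1:
            kept.append(f)
    return kept


texts = ["large sweet apple", "small sweet apple", "large sour apple",
         "small sour apple", "large sweet apple", "small sweet apple"]
targets = [1, 1, 0, 0, 1, 1]  # toy rating: high iff the apple is sweet
result = coat_loop(texts, targets, toy_propose, toy_annotate, toy_discover)
print(result)
```

In this toy run the first proposed factor ("size") does not explain the ratings, so the two low-rated samples are fed back and the second round proposes "taste", which the stand-in discovery step keeps.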

Results on AppleGastronome Benchmark

In this benchmark, gastronomes comment and rate apples according to their preference. Each apple has its own attributes, including size, smell, and taste (e.g. sweet or sour). The target variable is the rating score.


Box 1. Examples of AppleGastronome data, grouped by the value of scores.

Can LLMs be an effective factor proposer? Compared to other uses of LLMs, COAT obtains significant improvements regardless of which LLM is used. In contrast, directly using LLMs to reason about the causal relations is highly sensitive to the capabilities of the LLM.


Table 1. Causal discovery results in AppleGastronome. MB, NMB, and OT refer to the number of discovered factors that lie in the underlying Markov blanket, in the causal graph but not in the Markov blanket, and among the other variables, respectively. Recall, precision, and F1 for factor proposal evaluate the discovered causal ancestors. Recall, precision, and F1 for relation extraction evaluate the recovery of the causal edges. The Data baseline refers to pairwise causal relation inference (Kiciman et al., 2023) based on the factors discovered by Data.
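
To make the MB / NMB / OT counts and the precision, recall, and F1 columns concrete, here is a small sketch. The graph, the factor names, and the choice to score against the Markov blanket are all hypothetical assumptions for illustration; they are not the benchmark's ground truth or the paper's exact evaluation protocol.

```python
from typing import Dict, Set, Tuple


def markov_blanket(parents: Dict[str, Set[str]], target: str) -> Set[str]:
    """Markov blanket in a DAG: parents, children, and the children's other parents."""
    pa = set(parents.get(target, set()))
    children = {v for v, ps in parents.items() if target in ps}
    spouses = {p for c in children for p in parents[c]} - {target}
    return pa | children | spouses


def classify(proposed: Set[str], parents: Dict[str, Set[str]],
             target: str) -> Dict[str, int]:
    """Count proposed factors in the blanket (MB), elsewhere in the graph (NMB),
    or not in the graph at all (OT)."""
    nodes = set(parents) | {p for ps in parents.values() for p in ps}
    mb = markov_blanket(parents, target)
    counts = {"MB": 0, "NMB": 0, "OT": 0}
    for f in proposed:
        if f in mb:
            counts["MB"] += 1
        elif f in nodes:
            counts["NMB"] += 1
        else:
            counts["OT"] += 1
    return counts


def precision_recall_f1(proposed: Set[str],
                        relevant: Set[str]) -> Tuple[float, float, float]:
    tp = len(proposed & relevant)
    precision = tp / len(proposed) if proposed else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical DAG, given as child -> set of parents.
parents = {
    "score": {"taste", "size"},
    "recommend": {"score", "price"},
    "taste": {"ripeness"},
}
proposed = {"taste", "size", "ripeness", "packaging"}

blanket = markov_blanket(parents, "score")
counts = classify(proposed, parents, "score")
metrics = precision_recall_f1(proposed, blanket)
print(sorted(blanket), counts, metrics)
```

Here "ripeness" is an ancestor of the target but sits outside its Markov blanket, so it is counted as NMB, while "packaging" does not appear in the graph and is counted as OT.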

Can COAT reliably recover the causal relationships? Compared to the ground truth, directly adopting LLMs to reason about the causal relations easily elicits many false-positive causal relations. In contrast, the relations recovered by COAT have high precision as well as high recall. COAT cannot recover the directed edge between “taste” and “juiciness” because of the limitations of FCI.


Figure 2. The discovered causal graphs in AppleGastronome.

Can LLMs be an effective factor annotator? Since LLMs are also used to annotate the data according to the proposed annotation guidelines, we analyze their annotation accuracy. Both LLMs are generally good at annotating objective attributes. For subjective human preferences, the accuracy of GPT-3.5 decreases but remains relatively high.


Figure 3. Annotation accuracies of GPT-4 and GPT-3.5 for apple attributes and preference matching in AppleGastronome.

Will LLMs introduce additional confounders in annotating factors? Since annotations produced by LLMs involve additional noise, or even additional confounders, we also conduct independence tests among the annotation noises and the features. With highly capable LLMs, e.g., GPT-4-Turbo, the dependencies can be kept at an acceptable level.


Table 2. Independence tests of the annotation noises with annotated features and other noises in AppleGastronome.
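
As an illustration of what such an independence test can look like, the sketch below runs a Pearson chi-square test on two binary variables: an annotation-noise indicator (1 where the LLM annotation disagrees with the ground truth) and an annotated feature. The choice of test, the data, and the function name are assumptions for illustration; the page does not specify which test was used. The chi-square approximation is also rough when expected cell counts are small.

```python
from collections import Counter
from typing import Sequence

# Critical value of the chi-square distribution with 1 degree of freedom
# at significance level 0.05.
CHI2_CRIT_DF1_ALPHA05 = 3.841


def chi_square_2x2(x: Sequence[int], y: Sequence[int]) -> float:
    """Pearson chi-square statistic for two binary (0/1) sequences."""
    n = len(x)
    counts = Counter(zip(x, y))
    stat = 0.0
    for a in (0, 1):
        for b in (0, 1):
            row = counts[(a, 0)] + counts[(a, 1)]  # marginal count of x == a
            col = counts[(0, b)] + counts[(1, b)]  # marginal count of y == b
            expected = row * col / n
            if expected > 0:
                stat += (counts[(a, b)] - expected) ** 2 / expected
    return stat


# Made-up data: 20 samples, feature is 1 for the first ten and 0 for the rest.
feature = [1] * 10 + [0] * 10
# Noise that hits both feature groups equally often (2 errors in each group).
noise_indep = [1, 1] + [0] * 8 + [1, 1] + [0] * 8
# Noise that occurs exactly when the feature is 1 (fully dependent).
noise_dep = list(feature)

stat_indep = chi_square_2x2(noise_indep, feature)
stat_dep = chi_square_2x2(noise_dep, feature)
print(stat_indep, stat_indep > CHI2_CRIT_DF1_ALPHA05)
print(stat_dep, stat_dep > CHI2_CRIT_DF1_ALPHA05)
```

A statistic above the critical value rejects independence at the 0.05 level, which would suggest the annotation errors are systematically tied to that feature.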

Results on Neuropathic Benchmark

In the original dataset, there are three levels of causal variables: the symptom level, the radiculopathy level, and the pathophysiology level. In this project, we mainly consider the target variable of right shoulder impingement. When generating the clinical diagnosis notes as x using GPT-4, we avoid mentioning any variables other than symptoms.


Box 2. Examples of Neuropathic data, grouped by the presence of one certain symptom.

Factor proposal. Similarly, COAT consistently outperforms all of the baselines regardless of which LLM is incorporated. In particular, COAT can boost the weakest backbone, LLaMA2-7b, to be better than any other LLM.


Table 3. Causal discovery results in Neuropathic. PA, AN, and OT refer to the parents, ancestors, and others, respectively. Accuracy and F1 measure the recovery of the causal ancestors.

Causal relation recovery. Due to the faithfulness issue of the original dataset (Tu et al., 2019), we mainly conduct a qualitative comparison of the baselines and COAT against the ground truth that is faithful to the data.


Figure 4. The discovered causal graphs in Neuropathic.


Please see our paper for more details of this work. If you have any questions, feel free to contact us.

If you find our paper and repo useful, please consider citing:

      title={COAT: Discovery of the Hidden World with Large Language Models}, 
      author={Chenxi Liu and Yongqiang Chen and Tongliang Liu and Mingming Gong and James Cheng and Bo Han and Kun Zhang},
      journal = {arXiv preprint},
      volume = {arXiv:2402.03941}