CoAuthorAI: A Human in the Loop System For Scientific Book Writing

arXiv cs.CL 04/23/26, 04:00 AM Papers
Summary
CoAuthorAI is a human-in-the-loop system that combines retrieval-augmented generation and hierarchical outlines to enable accurate, coherent scientific book writing, achieving 98% recall and 82% human satisfaction in evaluations.
arXiv:2604.19772v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used in scientific writing but struggle with book-length tasks, often producing inconsistent structure and unreliable citations. We introduce CoAuthorAI, a human-in-the-loop writing system that combines retrieval-augmented generation, expert-designed hierarchical outlines, and automatic reference linking. The system allows experts to iteratively refine text at the sentence level, ensuring coherence and accuracy. In evaluations of 500 multi-domain literature review chapters, CoAuthorAI achieved a maximum soft-heading recall of 98%; in a human evaluation of 100 articles, the generated content reached a satisfaction rate of 82%. The book AI for Rock Dynamics generated with CoAuthorAI and Kexin Technology's LUFFA AI model has been published with Springer Nature. These results show that systematic human-AI collaboration can extend LLMs' capabilities from articles to full-length books, enabling faster and more reliable scientific publishing.
# CoAuthorAI: A Human in the Loop System For Scientific Book Writing
Source: [https://arxiv.org/html/2604.19772](https://arxiv.org/html/2604.19772)
Yangjie Tian<sup>1,2</sup>, Xungang Gu<sup>1,3</sup>, Yun Zhao<sup>1</sup>, Jiale Yang<sup>1</sup>, Lin Yang<sup>1</sup>, Ning Li<sup>1</sup>, He Zhang<sup>1,\*</sup>, Ruohua Xu<sup>1</sup>, Hua Wang<sup>2</sup>, Kewen Liao<sup>3</sup>, Ming Liu<sup>3,\*</sup>

<sup>1</sup>Kexin Technology, Beijing 100012, China
<sup>2</sup>Institute for Sustainable Industries and Liveable Cities, Victoria University, VIC 3011, Australia
<sup>3</sup>School of Information Technology, Deakin University, Melbourne, VIC 3125, Australia

<sup>\*</sup>Corresponding author.

yangjie.tian@live.vu.edu.au, zhanghe@kxsz.net, m.liu@deakin.edu.au

###### Abstract

Large language models (LLMs) are increasingly used in scientific writing but struggle with book-length tasks, often producing inconsistent structure and unreliable citations. We introduce *CoAuthorAI*, a human-in-the-loop writing system that combines retrieval-augmented generation, expert-designed hierarchical outlines, and automatic reference linking. The system allows experts to iteratively refine text at the sentence level, ensuring coherence and accuracy. In evaluations of 500 multi-domain literature review chapters, *CoAuthorAI* achieved a maximum soft-heading recall of 98%; in a human evaluation of 100 articles, the generated content reached a satisfaction rate of 82%. The book *AI for Rock Dynamics*, generated with CoAuthorAI and Kexin Technology's LUFFA AI model, has been published with Springer Nature. These results show that systematic human–AI collaboration can extend LLMs' capabilities from articles to full-length books, enabling faster and more reliable scientific publishing. The system demonstration video is available at [https://youtu.be/PAWQz48tsdA](https://youtu.be/PAWQz48tsdA).


## 1 Introduction

Scientific writing is essential but complex and time\-consuming\. Writing books requires extensive research, careful organization, and multiple revisions\. Large language models \(LLMs\) offer new ways to speed up and improve this process, from generating drafts to suggesting edits, saving authors significant time\.

Recent work shows that LLMs such as ChatGPT, GPT-4, and Claude have begun to reshape short-form scientific text generation, including literature summarization (Agarwal et al., [2024](https://arxiv.org/html/2604.19772#bib.bib4); Wang et al., [2024](https://arxiv.org/html/2604.19772#bib.bib5)), report drafting (Aljamaan et al., [2024](https://arxiv.org/html/2604.19772#bib.bib6); Taylor et al., [2022](https://arxiv.org/html/2604.19772#bib.bib7); Wang et al., [2023](https://arxiv.org/html/2604.19772#bib.bib12)), and even chapter writing (Schoenenberger, [2023](https://arxiv.org/html/2604.19772#bib.bib8)). In these scenarios, fluent prose is produced rapidly while retrieval modules provide up-to-date facts, yielding tangible productivity gains. Nevertheless, fully automatic generation frequently suffers from citation hallucinations, factual inaccuracies, and stylistic inconsistencies across long documents (Alkaissi and McFarlane, [2023](https://arxiv.org/html/2604.19772#bib.bib9)).

To mitigate these shortcomings, the research community has embraced the human-in-the-loop (HITL) paradigm, where human expertise is interleaved with model inference for planning, content vetting, and approval (Hsu et al., [2024](https://arxiv.org/html/2604.19772#bib.bib10); Agarwal et al., [2024](https://arxiv.org/html/2604.19772#bib.bib4)). This paradigm leverages the complementary strengths of humans (domain knowledge, critical judgment) and LLMs (linguistic fluency, rapid drafting), and is becoming the de facto standard for high-stakes scientific communication.

Despite notable advances in short-form writing, the systematic development of long-form book drafting remains substantially underexplored. Existing prototypes such as Beta Writer's fully automated monograph (Beta Writer, [2019](https://arxiv.org/html/2604.19772#bib.bib11)) and Meta's Galactica demonstration (Taylor et al., [2022](https://arxiv.org/html/2604.19772#bib.bib7)) illustrate both the potential benefits and the significant limitations of generating book-length manuscripts without sustained expert oversight. In practice, large publishers still rely on labor-intensive editing cycles to maintain coherence, control narrative depth, and ensure traceable citations (Schoenenberger, [2023](https://arxiv.org/html/2604.19772#bib.bib8)). This raises a key question: how can we scale LLM assistance to book-length writing while keeping human authors firmly in control?

To address this challenge, we present CoAuthorAI, a production\-ready HITL system for scientific book generation\. Our contributions are threefold:

1. Design a *modular architecture* combining retrieval-augmented generation, expert-designed hierarchical outlines, and automatic reference linking, enabling chapter-level generation with sentence-level traceability.
2. Implement *interactive feedback loops* that let experts iteratively refine outlines, regenerate sections, and verify citations, ensuring control over style, depth, and accuracy.
3. Explore the boundaries between LLMs and domain experts in book-writing tasks, and use the system together with Kexin Technology's LUFFA model to assist an author team in publishing [*AI for Rock Dynamics*](https://link.springer.com/book/10.1007/978-981-96-5342-3?sap-outbound-id=D94D3E307CE1F96013B03FB247B741415100E16B).

Collectively, these advances extend collaborative writing with LLMs from articles to full\-length books, providing a practical workflow for authors and publishers\.

## 2 Related Work

We organise the discussion around three strands of research that converge on human‑in‑the‑loop book generation\.

#### Literature Summarisation

Early systems like LitLLM (Agarwal et al., [2024](https://arxiv.org/html/2604.19772#bib.bib4)) and AutoSurvey (Wang et al., [2024](https://arxiv.org/html/2604.19772#bib.bib5)) employ retrieval-augmented generation pipelines to convert collections of papers into structured literature reviews. Interactive platforms such as Elicit (Ought, [2024](https://arxiv.org/html/2604.19772#bib.bib13)) and SciSpace (Typeset, [2024](https://arxiv.org/html/2604.19772#bib.bib14)) extend this idea, offering query-driven paper discovery and summary previews that researchers can curate manually. These works demonstrate the efficiency gains of combining search and generation but are typically limited to section-scale outputs.

#### Scientific Report Generation

General-purpose models (e.g. GPT-3.5/4, Claude) have been tested for writing grant proposals and clinical reports, yet high hallucination rates in references remain a major challenge (Alkaissi and McFarlane, [2023](https://arxiv.org/html/2604.19772#bib.bib9)). Domain-constrained approaches leverage multimodal grounding (e.g. R2GenGPT for radiology) or curated corpora (Elsevier's ScienceDirect AI) to improve factuality (Wang et al., [2023](https://arxiv.org/html/2604.19772#bib.bib12); Bio-IT World Staff, [2025](https://arxiv.org/html/2604.19772#bib.bib15)). While these systems showcase HITL verification interfaces, they tackle documents far shorter than a full book.

#### Book Drafting and HITL Workflows

Beta Writer initiated the study of automatic book construction by clustering and summarising existing academic articles, although the resulting prose exhibited substantial fragmentation and required extensive human post-editing (Beta Writer, [2019](https://arxiv.org/html/2604.19772#bib.bib11)). Subsequent industrial efforts have explored the integration of conversational large language models into collaborative authoring pipelines. For example, Springer Nature's GPT-assisted textbook reportedly reduced production time by 50% while retaining human authorship as the central decision-making component (Schoenenberger, [2023](https://arxiv.org/html/2604.19772#bib.bib8)). However, publicly available technical accounts detailing system architectures, evaluation protocols, and design trade-offs remain limited, leaving key questions unresolved concerning scalability, citation reliability, and the granularity of expert involvement. Our work addresses this gap by providing a fully documented system together with a comprehensive empirical analysis.

![Refer to caption](https://arxiv.org/html/2604.19772v1/AI_book_pipeline.png)

Figure 1: Overview of CoAuthorAI, illustrating the frontend for expert inputs and revision, and the backend for PDF parsing, content compression, section generation, and post-processing through LLM interaction.

## 3 CoAuthorAI

Our CoAuthorAI demonstration system is a web application built with [Streamlit](https://streamlit.io/), with preprocessing tasks implemented in [Python](https://www.python.org/). A PDF parsing tool extracts content from PDF documents, and the resulting embeddings are stored in [Milvus](https://milvus.io/zh). The system is composed of two primary components: a front-end and a back-end. As depicted in Figure [1](https://arxiv.org/html/2604.19772#S2.F1), the front-end handles user interactions, such as document uploads, book outline creation, and expert content revisions. On the back-end, the PDF parsing tool converts PDF articles into machine-readable formats and, combined with retrieval-augmented techniques, large language models generate the chapter content. Section 3.3 provides a step-by-step guide to using CoAuthorAI.

![Refer to caption](https://arxiv.org/html/2604.19772v1/CoAuthorAI_System_Page_Display.png)

Figure 2: User interface and workflow of the CoAuthorAI system. The interface guides users through a seven-step human–AI collaborative pipeline: (1) selecting or creating a project, (2) viewing and managing book-level information, (3) uploading outlines and reference PDFs, (4) parsing PDF content, (5) compressing literature, (6) manual refinement of AI-generated text, and (7) performing citation tracing.

### 3.1 Frontend Design

#### Human Guidance and Looping

Human guidance and iterative feedback are essential for ensuring the quality and accuracy of AI-generated content. As shown in Figure [2](https://arxiv.org/html/2604.19772#S3.F2), experts establish the book's title and outline and upload relevant literature for each chapter to provide direction and factual material. Because LLM outputs can be random and machine-like, experts review, refine, and verify the generated content on the platform so that it meets academic and industry standards. This involves multiple rounds of human intervention and feedback to continuously improve the AI-generated text.

### 3.2 Backend Design

#### PDF Extraction

We use a self-developed PDF parsing tool designed to efficiently analyze and extract content from PDF documents that contain complex elements such as images, formulas, and tables. It converts multi-modal PDFs into Markdown, making them easier to process and suitable for computational analysis and machine learning tasks. With this tool, users can construct large language model (LLM) corpora more effectively and make full use of the information contained in PDF materials.

#### Content Compression

In experiments with multi\-document long\-form content generation, we found two major limitations of LLMs\. First, the context window acts as the model’s working memory, limiting the size of documents it can process at once, which is often smaller than the total reference material for a book\. Second, even with techniques to expand the context, generated content tends to focus on conceptual explanations and lacks detailed coverage of specific knowledge\. To address this, we first use LLMs to compress and summarize full\-text documents, then feed the processed text as reference\. This improves generation efficiency, depth, and accuracy\.
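The compress-then-generate idea can be sketched roughly as follows. This is a minimal illustration, not the system's actual code: `call_llm`, the prompt wording, and the 300-word budget are all assumptions introduced here, and `fake_llm` is only a stand-in so the sketch runs without a model.

```python
from typing import Callable, Dict


def compress_documents(
    docs: Dict[str, str],
    call_llm: Callable[[str], str],
    max_words: int = 300,  # assumed budget, not taken from the paper
) -> Dict[str, str]:
    """Summarize each parsed document on its own so that many
    references can later fit into a single generation prompt."""
    compressed = {}
    for doc_id, text in docs.items():
        # Hypothetical prompt; the real prompts are listed in the paper's appendix.
        prompt = (
            f"Summarize the following paper in at most {max_words} words, "
            "keeping methods, datasets, and quantitative findings.\n\n" + text
        )
        compressed[doc_id] = call_llm(prompt)
    return compressed


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call: returns the first 20 words of the paper text."""
    body = prompt.split("\n\n", 1)[1]
    return " ".join(body.split()[:20])
```

Compressing each document independently keeps every call well inside the context window, at the cost of one LLM call per reference.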

#### Section Generation

The section generation module serves as the core of our content production pipeline\. We employ LLMs to generate section content based on compressed reference materials\. We observed that more detailed prompts lead to higher\-quality outputs, so domain experts refine content outlines down to second\-level subheadings before generation\. Due to context window limitations and after repeated experiments, we found that to ensure in\-depth discussion, the number of compressed reference materials should not exceed 40 documents\. For sections requiring more references, our backend system uses a multi\-stage generation approach:

- Step 1: Using the same prompt engineering, we generate the content of the chapter for every 40 reference documents, treating each as an intermediate result;
- Step 2: After batch generation, these intermediate results are used as reference materials to produce the final version of the chapter;
- Step 3: The final generated chapter is then linked back to the original materials to establish correspondence.

This batch\-processing architecture ensures maximum utilization of reference materials while maintaining coherent narrative flow\.
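Steps 1 and 2 of the multi-stage approach can be sketched as below. The batch size of 40 comes from the text; the function names and the `call_llm(outline, references)` signature are hypothetical, and Step 3 (linking back to the sources) is left to the reference linking module.

```python
from typing import Callable, List

BATCH_SIZE = 40  # upper bound on compressed references per call, as reported in the text


def batched(items: List[str], size: int) -> List[List[str]]:
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def generate_section(
    outline: str,
    refs: List[str],
    call_llm: Callable[[str, List[str]], str],  # hypothetical model wrapper
    batch_size: int = BATCH_SIZE,
) -> str:
    """Generate a section, falling back to two-stage generation
    when there are more references than fit in one call."""
    if len(refs) <= batch_size:
        return call_llm(outline, refs)
    # Step 1: one intermediate draft per batch of references.
    drafts = [call_llm(outline, batch) for batch in batched(refs, batch_size)]
    # Step 2: the intermediate drafts become the references for the final draft.
    return call_llm(outline, drafts)
```

With 100 references this issues three batch calls (40, 40, and 20 references) followed by one merging call. The sketch assumes the number of intermediate drafts itself stays below the batch size.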

#### Reference Linking

The reference linking module ensures the accuracy and credibility of citations in the generated content. We first decompose the source documents into smaller semantic blocks: sentences are segmented at regular punctuation marks and then grouped into sets of three, with each set containing an overlapping sentence. The overlap retains richer context, improves recall, and keeps each block semantically coherent. Both the generated content and the reference blocks are then transformed into high-dimensional vector representations using the bge-m3 embedding model. The block embeddings are stored in Milvus, a high-performance open-source vector database optimized for large volumes of unstructured data, and its IVF_SQ8 index is used to run efficient similarity searches between the reference vectors and the generated text. In this way, the references most relevant to a generated passage can be located quickly together with a similarity score, and similarity calculations remain fast and scalable.
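The chunk-and-match logic can be illustrated with a small, dependency-free sketch. It assumes blocks of three sentences where neighbouring blocks share one sentence, and it replaces bge-m3 and the Milvus search with a toy bag-of-words embedding and a brute-force cosine comparison; all function names are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def split_sentences(text: str) -> List[str]:
    """Segment text at common sentence-ending punctuation."""
    parts = re.split(r"(?<=[.!?;])\s+", text.strip())
    return [p for p in parts if p]


def make_blocks(sentences: List[str], size: int = 3, overlap: int = 1) -> List[str]:
    """Group sentences into blocks of `size`, sharing `overlap` sentences
    between neighbouring blocks (assumed reading of the text)."""
    step = size - overlap
    blocks = []
    for start in range(0, len(sentences), step):
        blocks.append(" ".join(sentences[start:start + size]))
        if start + size >= len(sentences):
            break
    return blocks


def toy_embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def best_block(sentence: str, blocks: List[str]) -> Tuple[int, float]:
    """Return the index and similarity score of the closest source block."""
    query = toy_embed(sentence)
    scores = [cosine(query, toy_embed(b)) for b in blocks]
    idx = max(range(len(blocks)), key=scores.__getitem__)
    return idx, scores[idx]
```

In a real deployment the brute-force loop in `best_block` would be replaced by an approximate nearest-neighbour query against the vector index, which is what makes the search scale to a book's worth of references.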

#### Head and Tail Generation

Following the generation of section content, the head and tail generation module is responsible for creating the introduction and conclusion parts of the book\. This module uses the generated section content as reference material and employs Large Language Models \(LLMs\) to craft the introductory and concluding sections of the book, providing a cohesive narrative that frames the entire work\.

### 3.3 System Walkthrough

Figure [2](https://arxiv.org/html/2604.19772#S3.F2) gives a comprehensive walkthrough of how users operate the system.

- Step 1: Users upload the outline, specify sections, and upload relevant reference documents.
- Step 2: Use the PDF parsing tool to extract content from PDF documents.
- Step 3: Use LLMs to compress the content of the parsed documents.
- Step 4: Use LLMs to generate content for sections; experts can manually edit the generated content in the editing area.
- Step 5: Conduct reference verification for the generated content to ensure the accuracy of citations.

## 4 Evaluation

We evaluate the system in two phases. First, we use the system to perform literature review tasks, assessing the model's ability to produce structured outputs and the readability of the generated content. Once the system's usability is confirmed, we further evaluate its capability in book-writing tasks during the human–AI collaborative generation of *AI for Rock Dynamics*.

### 4.1 Literature Review Evaluation

#### Datasets

We collected 500 English scientific research reviews ([EnSciRL-500](https://github.com/Kexin-Technology/EnSciRL-500)). Table [4](https://arxiv.org/html/2604.19772#A1.T4) provides a data example from one of the scientific literature reviews. In addition, we extracted the outlines of these reference reviews so that the outlines generated by the large language models can be evaluated against them.

#### Implementation Setups

During the experiments, we selected several mainstream large language models from both domestic and international sources. We observed that, regardless of prompt adjustments, LLMs tend to produce subsection-style outputs (with subheadings) when generating long-form text. Therefore, we focused on evaluating LLMs' ability to generate outlines in the literature review task. First, we directly fed metadata (title, subject, references) into LLMs to generate an initial outline of the literature review. Then, following the *section generation* procedure, we generated the content for each section, ensuring that the output was both relevant and supported by existing literature. Finally, we stitched together the generated text from each section to form a complete and coherent literature review, which was then used for evaluating the generated content.

#### Automatic Evaluation Results

We used the *ROUGE-1/2/L* implementation provided by [Google](https://github.com/google-research/google-research/tree/master/rouge) to evaluate the content, and *Soft Heading Recall* (S-H Recall) (Fränti and Mariescu-Istodor, [2023](https://arxiv.org/html/2604.19772#bib.bib3)) to evaluate the outline of the generated survey.

$$\text{Sim}(t_i, t_j) = \cos\left(\text{embed}(t_i), \text{embed}(t_j)\right) \tag{1}$$

$$\text{card}(T) = \sum_{i=1}^{|T|} \frac{1}{\sum_{j=1}^{|T|} \text{Sim}(t_i, t_j)} \tag{2}$$

$$\text{card}(R \cap G) = \text{card}(R) + \text{card}(G) - \text{card}(R \cup G) \tag{3}$$

$$\text{soft heading recall} = \frac{\text{card}(R \cap G)}{\text{card}(R)} \tag{4}$$

Here $T = \{t_1, t_2, t_3, \cdots, t_K\}$ represents a group of chapter titles/headings in a generated or reference survey. $R$ and $G$ are the chapter titles of the reference and generated survey, respectively, so that the recall in Eq. (4) is normalized by the reference outline. The [bge-large-en-v1.5](https://huggingface.co/BAAI/bge-large-en-v1.5) model is used for text embedding. This score rewards similarity between generated and reference chapter titles while penalizing similarity among titles within the generated survey.
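Equations (1) to (4) translate into a few lines of code. The sketch below follows the formulas term by term but swaps the bge-large-en-v1.5 embedding for a toy bag-of-words embedding so that it runs without any model; `toy_embed` and the function names are assumptions made for illustration, and titles are assumed to be non-empty.

```python
import math
import re
from collections import Counter
from typing import Callable, List


def toy_embed(title: str) -> Counter:
    """Toy stand-in for a sentence-embedding model: lowercase word counts."""
    return Counter(re.findall(r"\w+", title.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Eq. (1): cosine similarity between two embedded titles."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def soft_card(titles: List[str], embed: Callable[[str], Counter] = toy_embed) -> float:
    """Eq. (2): soft cardinality; near-duplicate titles count as less than one each."""
    vecs = [embed(t) for t in titles]
    # The inner sum includes j == i, so it is at least 1 for non-empty titles.
    return sum(1.0 / sum(cosine(vi, vj) for vj in vecs) for vi in vecs)


def soft_heading_recall(
    reference: List[str],
    generated: List[str],
    embed: Callable[[str], Counter] = toy_embed,
) -> float:
    """Eqs. (3) and (4): soft intersection divided by the reference cardinality."""
    card_r = soft_card(reference, embed)
    card_g = soft_card(generated, embed)
    card_union = soft_card(reference + generated, embed)
    return (card_r + card_g - card_union) / card_r
```

With this toy embedding, a generated outline identical to the reference scores 1, and an outline sharing no words with the reference scores 0; a dense embedding model gives graded values in between for paraphrased headings.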

Table 1: Performance Comparison of Large Language Models on S-H Recall and ROUGE Metrics

Table [1](https://arxiv.org/html/2604.19772#S4.T1) presents the scores of different models on the literature review generation task. The evaluation metrics are the ROUGE scores and the Soft Heading Recall (S-H Recall) score. The Claude model leads across all evaluation metrics, with an especially high S-H Recall of 0.9802. These automatic metrics indicate that the system performs well in terms of outline generation and text coherence.

#### Human Evaluation Results

The human evaluation criteria cover five aspects: (i) fluent and clear language; (ii) logical structure; (iii) reliable citations; (iv) consistency of content with the theme; (v) broad analytical scope.

Table 2: Results of Human Evaluation

Table [2](https://arxiv.org/html/2604.19772#S4.T2) presents the results of the human evaluation of 100 articles by a 5-member team. The evaluation followed a structured pipeline: one primary evaluator (Prim-eval) scored all articles on the five aspects; two secondary evaluators (Sec-eval) jointly assessed the articles and averaged their scores; one first examiner (Fir-exam) randomly checked 20 articles to ensure quality; and a final examiner (Fin-exam) reviewed the overall results and assigned team performance levels (A–E). The table reveals a high degree of consistency in the human evaluations. After careful deliberation and consensus within the evaluation team, the final human evaluation scores were determined by averaging the scores given by the primary evaluator and the secondary evaluators.
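One plausible reading of the score aggregation is sketched below: the two secondary evaluators are averaged first, and that value is then averaged with the primary evaluator's score, per aspect. The exact weighting is an assumption, and the function name and any numbers passed to it are purely illustrative rather than taken from Table 2.

```python
from statistics import mean
from typing import Dict, List


def final_scores(
    primary: Dict[str, float],
    secondaries: List[Dict[str, float]],
) -> Dict[str, float]:
    """Average the primary score with the mean secondary score for each aspect."""
    result = {}
    for aspect, prim in primary.items():
        # The secondary evaluators are first averaged among themselves.
        sec = mean(s[aspect] for s in secondaries)
        result[aspect] = mean([prim, sec])
    return result
```

For example, a primary score of 5.0 and secondary scores of 4.0 and 3.0 on one aspect would give a secondary average of 3.5 and a final score of 4.25.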

### 4.2 Book Writing Evaluation

Based on the results of the literature review experiments, we validated the usability of *CoAuthorAI* for long-form document generation. The system, in combination with Kexin Technology's LUFFA model, completed *AI for Rock Dynamics* with the support of the Artificial Pen Project. Because evaluating a whole book is a heavy and complex task, we present only some comparative results between the machine-generated draft and the final draft of *AI for Rock Dynamics*.

#### Datasets

The book *AI for Rock Dynamics* contains an average of around 130 references per chapter across Chapters 2–8, totaling 910 references (7 chapters × 130). Including the Introduction and Conclusion chapters, whose references are the seven generated chapters themselves, the entire book includes 917 references in total (910 + 7).

#### Implementation Setups

Together with the author team, we iteratively developed the prompts for text compression and section generation. We strictly followed the procedure outlined in Section *3.3 System Walkthrough* to generate each chapter, and used the generated sections as reference material to produce the Introduction and Conclusion chapters.

#### Evaluation Results

Table [3](https://arxiv.org/html/2604.19772#S4.T3) presents comparative results between the machine-generated draft and the final draft, including citation accuracy (the proportion of traceable citations among all citations produced by the LLM) and the manual correction rate for each chapter conducted by the author team.

To compute the correction rate, both the machine-generated draft and the final draft are segmented at the sentence level, resulting in $n$ sentences in the initial draft and $m$ sentences in the final version. For each corrected chapter, we iterate through the sentences of the final draft and count how many of them also appear in the initial draft; let $s$ denote this count. The correction rate is then defined as:

$$\text{Correction Rate} = \frac{n - s}{s}. \tag{5}$$

Table 3: Citation Accuracy and Manual Correction Rate per Chapter

The results show that the average citation accuracy after system verification reaches 77.4%, mitigating the impact of LLM hallucinations. The manual correction rate fluctuates between 11% and 21%, with an average of 15.4%. These findings indicate that *CoAuthorAI* can achieve satisfactory performance in book writing, though a moderate level of manual intervention is still required to ensure reliability.
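The correction rate in Eq. (5) can be computed with a short script. The sketch below implements the formula as written, with $n$ the number of sentences in the initial draft and $s$ the number of final-draft sentences that also occur in the initial draft. The regex-based sentence splitter and the exact-match comparison after whitespace normalization are assumptions, since the text does not specify how sentences are segmented or matched.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naive sentence segmentation with whitespace normalization."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [" ".join(p.split()) for p in parts if p.strip()]


def correction_rate(initial_draft: str, final_draft: str) -> float:
    """Eq. (5): (n - s) / s, where n counts initial-draft sentences and
    s counts final-draft sentences that also appear in the initial draft."""
    initial = split_sentences(initial_draft)
    final = split_sentences(final_draft)
    n = len(initial)
    initial_set = set(initial)
    s = sum(1 for sentence in final if sentence in initial_set)
    if s == 0:
        raise ValueError("no sentence of the final draft appears in the initial draft")
    return (n - s) / s
```

For instance, if the initial draft has three sentences and two of them survive unchanged into the final draft, the rate is (3 - 2) / 2 = 0.5.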

## 5 Conclusion

The*CoAuthorAI*system offers a novel and effective approach to scientific book writing by integrating human expertise with the capabilities of large language models\. By leveraging human guidance through expert\-crafted outlines and iterative feedback loops, the system ensures the quality and precision of the generated content\. The CoAuthorAI system has the potential to significantly streamline the scientific writing process and contribute to the production of high\-quality scientific books\. Future work may focus on further refining the system’s capabilities and exploring its applications in other scientific writing tasks\.

## Limitations

Despite continuous adjustments to the prompts in Table [5](https://arxiv.org/html/2604.19772#A1.T5), books produced with *CoAuthorAI* still retain a typical machine-generated format. The lack of visual elements such as images and tables reduces the readability and appeal of the books. Compared with traditional books, the content generated from existing literature lacks innovative material and mainly consists of summaries and syntheses of past knowledge.

## References

- S. Agarwal, G. Sahu, A. Puri, I. H. Laradji, K. D. Dvijotham, J. Stanley, L. Charlin, and C. Pal (2024). LitLLM: a toolkit for scientific literature review. arXiv preprint arXiv:2402.01788.
- F. Aljamaan, M. Temsah, I. Altamimi, A. Al-Eyadhy, A. Jamal, K. Alhasan, T. A. Mesallam, M. Farahat, and K. H. Malki (2024). Reference hallucination score for medical artificial intelligence chatbots: development and usability study. JMIR Medical Informatics 12(1), e54345.
- H. Alkaissi and S. I. McFarlane (2023). Artificial hallucinations in ChatGPT: implications in scientific writing. Cureus 15(2), e35179. [doi:10.7759/cureus.35179](https://dx.doi.org/10.7759/cureus.35179)
- Beta Writer (2019). Lithium-ion batteries: a machine-generated summary of current research. Springer. [doi:10.1007/978-3-030-16800-1](https://dx.doi.org/10.1007/978-3-030-16800-1)
- Bio-IT World Staff (2025). Elsevier launches AI search, summary, comparison tool for ScienceDirect. [Link](https://www.bio-itworld.com/news/2025/03/20/elsevier-launches-ai-search-summary-comparison-tool-for-sciencedirect). Accessed 05 May 2025.
- P. Fränti and R. Mariescu-Istodor (2023). Soft precision and recall. Pattern Recognition Letters 167, pp. 115–121. ISSN 0167-8655. [doi:10.1016/j.patrec.2023.02.005](https://doi.org/10.1016/j.patrec.2023.02.005), [Link](https://www.sciencedirect.com/science/article/pii/S0167865523000296)
- C. Hsu, E. Bransom, J. Sparks, B. Kuehl, C. Tan, D. Wadden, L. L. Wang, and A. Naik (2024). CHIME: LLM-assisted hierarchical organization of scientific studies for literature review support. In Findings of ACL 2024, pp. 118–132.
- Ought (2024). Elicit research assistant. [https://elicit.org](https://elicit.org/). Accessed 05 May 2025.
- H. Schoenenberger (2023). Accelerating textbook production with GPT: a Springer Nature case study. Presentation at Frankfurt Book Fair. Accessed 05 May 2025.
- R. Taylor, M. Kardas, G. Cucurull, T. Scialom, A. Hartshorn, E. Saravia, A. Poulton, V. Kerkez, and R. Stojnic (2022). Galactica: a large language model for science. arXiv preprint arXiv:2211.09085.
- Typeset (2024). SciSpace AI research assistant. [https://typeset.io/scispace](https://typeset.io/scispace). Accessed 05 May 2025.
- Y. Wang, Q. Guo, W. Yao, H. Zhang, X. Zhang, Z. Wu, M. Zhang, X. Dai, M. Zhang, Q. Wen, et al. (2024). AutoSurvey: large language models can automatically write surveys. Advances in Neural Information Processing Systems 37, pp. 115119–115145.
- Z. Wang, L. Liu, L. Wang, and L. Zhou (2023). R2GenGPT: radiology report generation with frozen LLMs. arXiv preprint arXiv:2309.09812.

## Appendix A

Table 4: An Example from the Datasets

Table 5: Prompts for CoAuthorAI