@KyrieCheungYep: https://x.com/KyrieCheungYep/status/2066703125659156572
Summary
This article details how to build an automated information collection pipeline using Codex CLI, including AGENTS.md configuration, MCP integration, Skill usage, and three practical scenarios (customer research, policy tracking, US stock monitoring), helping users automate repetitive information gathering tasks.
Cached at: 06/16/26, 03:38 PM
Save 3 Hours a Day: Build an Automated Information Collection Pipeline with Codex
Every morning when I open my computer, the first things I do:
- Check clients: the latest company moves, such as funding, hiring, and negative news.
- Check policies: Any updates on regulations, subsidies, industry documents relevant to my business.
- Scan US stocks: What happened overnight in my sectors, any important news or data.
These three tasks have one thing in common: they are repetitive, fragmented, and have to be done every day. Individually, none is worth spending half an hour on. I used to open a dozen tabs, search around, copy-paste into Obsidian, then organize everything by hand. That took one to two hours, and I had to do it all over again the next day.
I handed this whole workflow over to Codex. While I drink my morning coffee, it has already compiled client updates, policy changes, and US stock highlights from the night before into Markdown, placed in my vault, each item with its original link.
I’ll break down this method completely for everyone. More process, less theory. After reading, you can build your own version.
Why Codex?
Many people think: for information gathering, why not just ask ChatGPT?
The difference is huge. In a chat window, AI responds to your queries one at a time, results stay in the conversation, and you have to manually transfer them. Codex adds a crucial capability: it can work on your computer. It can read files, modify files, run commands, search the web, connect to external tools via the MCP protocol, and write results directly into your specified directory.
There’s a frequently underestimated fact: your terminal can already see everything on your computer. Obsidian notes are just a bunch of Markdown files. You cd into your vault, and Codex can read and write them directly. No plugins, no API keys, no copy-pasting. A developer who lives in Cursor switched to Codex CLI precisely because of this: every time he wanted AI to help organize his notes, he either dragged the entire vault into the editor or went back and forth copying context – annoying. Later he realized those frictions were self-imposed.
How powerful is Codex for information retrieval? An OpenAI internal researcher, a heavy user who can burn $10,000 in API fees a month, described it like this:
“Codex is an exceptionally good search engine.”
When he asked Codex to do research, it would browse relevant Slack channels, read discussions, pull up experimental branches mentioned by others, look at screenshots, read documents and tables, and finally aggregate into a note with links. Each piece of information would have its source marked. Using this method, he generated over 700 verifiable hypotheses in a few hours. His rough assessment: in scenarios where mistakes are easy and costly, you need a very diligent search agent with high recall.
This addresses exactly the three hardest parts of information gathering: covering everything, tracing each item back to its source, and ending up with a usable file. Codex excels at all three.
Five Mindset Tips: Turn Codex from a Tool into a Teammate
Before diving in, remember the most useful parts from OpenAI’s official best practices. The core is simple: don’t treat Codex as a one-time assistant; configure it as a long-term teammate working alongside you.
Specifically, five points:
- Give enough context first – don’t make Codex guess your intent.
- Use AGENTS.md for long-term guidance – mold it to fit your workflow.
- Connect external systems with MCP – minimize copy-paste.
- Abstract repetitive tasks into Skills – don’t repeatedly type the same prompts.
- Turn stable processes into Automations – hand over fixed actions.
Also, remember one distinction:
Skill defines how to do something; Automation defines when to do it.
Applied to information gathering, it’s an upgrade path: first teach it how to search, give it internet eyes, then solidify the search method into a Skill, and finally let it run on a schedule. Let’s go step by step.
Build the Foundation: Configure Codex as Your Information Assistant
3.1 Give it a Job Description: AGENTS.md
AGENTS.md is Codex’s onboarding manual, loaded automatically at each startup. It shapes how Codex understands your needs. For information gathering, I place this in the vault root directory:
Note the key points: each item must include a link, fact must be distinguished from speculation, and nothing goes in that can’t be traced back to a source. These rules keep Codex from being a confidently wrong assistant and make it behave more like a reliable researcher.
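The "every item needs a link" rule is easy to spot-check after a run. Below is a minimal sketch, assuming the briefing uses "- " bullets and plain http(s) URLs; the function name and the sample briefing are hypothetical and not part of Codex or AGENTS.md.

```python
import re

# Matches a plain http(s) URL anywhere in a line.
URL_RE = re.compile(r"https?://\S+")


def find_untraceable_items(markdown: str) -> list[str]:
    """Return the text of every '- ' bullet that carries no source link."""
    missing = []
    for line in markdown.splitlines():
        stripped = line.strip()
        if stripped.startswith("- ") and not URL_RE.search(stripped):
            missing.append(stripped[2:])
    return missing


# Hypothetical sample briefing: one sourced item, one unsourced item.
briefing = """\
- Acme raised a Series B (https://example.com/acme-series-b)
- Acme is rumored to be hiring in Bangkok
"""

untraceable = find_untraceable_items(briefing)
print(untraceable)
```

Anything the function returns is an item to send back to Codex for a source, or to delete.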
Quick tip: type /init directly in Codex, and it will generate a basic AGENTS.md for you to modify.
3.2 Configure Some Convenient Aliases
Different scenarios require different configurations. Aliases save a lot of effort. You can add them to ~/.zshrc:
A note on permissions. --search only lets Codex reach the internet through the official search API; it cannot visit arbitrary URLs. Full network access (Full Access) lets it run curl against any resource. For daily information gathering, --search is usually sufficient: you get the latest content without opening up network permissions too far.
Connect Real-Time Internet via MCP
--search solves the “findability” problem. Many scenarios also need more specific capabilities: scraping structured data from a page, reading pages with your logged-in browser, or calling specialized market and news data sources. That’s where MCP (Model Context Protocol) comes in.
MCP is Codex’s standard protocol to connect external tools. Once connected, Codex is not just a file editor; it can operate your information toolchain.
4.1 Two Configuration Methods
Codex’s MCP configuration lives in ~/.codex/config.toml. There are two common ways to add:
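For orientation, here is a rough sketch of what one server entry in that file can look like, assuming the `[mcp_servers.<name>]` table layout with `command` and `args` keys. The helper `render_mcp_server` is hypothetical, and the Playwright package name in the example is an assumption; check the documentation of the MCP server you actually install.

```python
import json


def render_mcp_server(name: str, command: str, args: list[str]) -> str:
    """Render one [mcp_servers.<name>] table as TOML text.

    json.dumps produces double-quoted, escaped strings, which are also
    valid TOML basic strings for ordinary ASCII values like these.
    """
    args_toml = ", ".join(json.dumps(arg) for arg in args)
    return (
        f"[mcp_servers.{name}]\n"
        f"command = {json.dumps(command)}\n"
        f"args = [{args_toml}]\n"
    )


# Assumed example entry: a Playwright MCP server started through npx.
entry = render_mcp_server("playwright", "npx", ["@playwright/mcp@latest"])
print(entry)
```

The printed fragment is what you would paste into ~/.codex/config.toml by hand.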
4.2 Use Your Logged-In Browser to Scrape
Valuable information often sits behind login walls. Here’s a low-barrier solution: Playwright MCP with your existing browser session – no extra API key needed.
Once configured, Codex can reuse your logged-in identity in Chrome, open pages, read content, and save results as Markdown. Someone used this trick to automatically sync Linear’s todo tasks into Obsidian: let Codex open the page, wait for load, read task titles and IDs, then save with links as Tasks.md. The content in the final file matches what’s on the web page.
Replace Linear tasks with your frequently checked client intelligence sites, policy databases requiring login, broker portfolio pages – same logic.
4.3 Scrape Structured Data from Web Pages
If you need to grab a page as clean structured data, connect a Web MCP like Bright Data which has tools such as search_engine, scrape_as_markdown. A real example: let Codex scrape a product page, save as product.json, then write a script to read and process it. Throughout the process, Codex selects tools, scrapes, saves, and validates format by itself, obtaining real data from the page.
Simply put, --search tells it where information exists, and MCP brings that information back.
4.4 Don’t Reinvent the Wheel: Check for Public Skills First
MCP is the underlying pipeline, but you don’t have to connect every data source yourself. The community already has many packaged public Skills you can install directly. For information gathering, I prioritize tools that cover multiple platforms, like Agent Reach.
Agent Reach wraps a dozen platforms into commands Codex can directly use: search engines, Xiaohongshu, Weibo, Douyin, Bilibili, Twitter, Reddit, V2EX, LinkedIn, GitHub, WeChat official accounts, web pages, RSS, YouTube, podcasts. Once installed, you don’t worry about how each platform is scraped or logged in. Just talk to Codex in natural language:
It will automatically call Agent Reach to fetch from the corresponding platform, then write into files following the rules in your AGENTS.md. This works well for my three scenarios:
- Clients: check Xiaohongshu, Weibo, Maimai, LinkedIn for company reputation, employee gossip, hiring signals. These soft signals are usually invisible in company registration data.
- Policies: monitor relevant WeChat official accounts and RSS, automatically pull latest posts.
- US Stocks: scan Twitter and Reddit discussion heat to complement news with sentiment.
How to find such public Skills:
- Go to aggregation sites like mcpmarket.com, ComposioHQ/awesome-codex-skills, and search. There are already many information-gathering Skills, such as research-collector, lead-research-assistant, content-research-writer.
- After finding one, drop it into ~/.codex/skills/<skill_name>/, restart Codex, and it will be auto-recognized.
- Too lazy? Send the repo link to Codex and ask it to install for you.
My habit: first check whether something already exists, then consider connecting MCP myself. No need to fall into pitfalls that others have already worked through.
Three Real Daily Scenarios
With the foundation built, let’s jump into practice. Below are three scenarios I run daily – prompts can be copied and adapted.
Scenario 1: Client Information – Multi-angle Background Check, Each Item with Source
When I get a new client, what I want most is a multi-angle, traceable background dossier. Automated web research is perfect for this: multi-angle search, extract content, verify source reliability, then organize into a report.
My prompt goes something like this. I use cxr read-only mode, pure collection:
It will search, read, cross-validate, and finally give me a profile with links. I only need to review it once and judge which speculations are reliable. The decision is still in my hands, but 90% of the grunt work is saved.
For advanced use, don’t let Codex rely solely on search engines. With Agent Reach mentioned earlier, scanning Xiaohongshu, Weibo, Maimai, LinkedIn often reveals reputation and internal signals not found in public registries. Add an existing research-collector Skill, and multi-angle search with source verification can be done together.
Scenario 2: Policy Information – Targeted Scraping, Structured Output
Policy tracking has two characteristics: fixed sources, fixed format requirements. The sites and columns you care about are usually just those few, and the output is nothing more than title, publish date, key points, impact. It’s especially suitable for codex exec non-interactive mode with JSON Schema output, because the format is stable, making archiving and further processing easy.
codex exec is built for scripts and automation. These are the parameters that come up most often in information gathering:
By default, exec runs in a read-only sandbox, so it won’t touch your files. That makes it a good fit for information collection.
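To make the "fixed format" idea concrete, here is a small sketch of a schema for one policy item plus a standard-library check against it. The field names follow the four outputs named above (title, publish date, key points, impact); `source_url` and every identifier here are assumptions for illustration, not the author's actual schema.

```python
import json

# Hypothetical JSON Schema for one policy item.
POLICY_ITEM_SCHEMA = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "publish_date": {"type": "string"},
        "key_points": {"type": "array", "items": {"type": "string"}},
        "impact": {"type": "string"},
        "source_url": {"type": "string"},
    },
    "required": ["title", "publish_date", "key_points", "impact", "source_url"],
    "additionalProperties": False,
}


def check_policy_item(item: dict) -> list[str]:
    """Return a list of problems; an empty list means the item fits the schema."""
    problems = []
    props = POLICY_ITEM_SCHEMA["properties"]
    for field in POLICY_ITEM_SCHEMA["required"]:
        if field not in item:
            problems.append(f"missing field: {field}")
    for field, value in item.items():
        if field not in props:
            problems.append(f"unexpected field: {field}")
            continue
        expected = props[field]["type"]
        if expected == "string" and not isinstance(value, str):
            problems.append(f"{field} should be a string")
        if expected == "array" and not (
            isinstance(value, list) and all(isinstance(v, str) for v in value)
        ):
            problems.append(f"{field} should be a list of strings")
    return problems


# Made-up item that forgot its source link.
item = json.loads(
    '{"title": "Subsidy notice", "publish_date": "2026-06-15", '
    '"key_points": ["Applies to SMEs"], "impact": "May lower costs"}'
)
problems = check_policy_item(item)
print(problems)
```

Saving the schema dict with `json.dump` gives a file that can be handed to a schema-constrained run, and the same check can guard the archive step so malformed items never reach the vault.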
Scenario 3: US Stock Watch – Real-time News + Market Data
US stocks are all about timeliness. You care about what happened from last night to today. For this, I generally connect two types of MCP: one for real-time news (some even with bias scores and live quotes, like helium-mcp), another for scraping specific data sources.
After running, a watch briefing with links lands in my vault. I go from searching and organizing for an hour to reviewing for three minutes.
Make the Agent Automatically Stand Guard
If you have to manually type the prompt every day for the above three scenarios, it’s still tiring. The real time-saver is the next two steps.
6.1 Step 1: Solidify Search Method into a Skill
If you use the same prompt repeatedly, it should become a Skill. A Skill is essentially a SKILL.md file that specifies the operational guidelines for a task. Put it into ~/.codex/skills/ directory, and Codex will automatically read and follow it when encountering related tasks.
Skill has a clever design called progressive disclosure. It loads in three layers, not wasting context:
- Layer 1, metadata: name + description, about 100 words, always in context. Codex uses this to decide whether to trigger this skill.
- Layer 2, SKILL.md body: loaded only after being triggered, usually kept under 5000 words.
- Layer 3, attached scripts and resources: loaded on demand; scripts can be executed directly without long-term context occupation.
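The three layers can be pictured with a toy loader. This is only an illustration of the idea, assuming a SKILL.md that opens with a `---` front-matter block holding `name` and `description`; it is not how Codex implements skill loading, and the sample skill text is made up.

```python
from dataclasses import dataclass

# Made-up SKILL.md content for the customer-recon example.
SKILL_MD = """\
---
name: customer-recon
description: Multi-angle background check on a company, every item with a source link.
---
# Steps
1. Search funding, hiring, and negative news.
2. Record each finding with its original link.
"""


@dataclass
class Skill:
    name: str
    description: str
    _raw: str

    def body(self) -> str:
        """Layer 2: the instructions, needed only once the skill is triggered.

        A real loader would read the file from disk at this point.
        """
        return self._raw.split("---", 2)[2].strip()


def load_metadata(raw: str) -> Skill:
    """Layer 1: pull only name and description out of the front matter."""
    front_matter = raw.split("---", 2)[1]
    fields = {}
    for line in front_matter.strip().splitlines():
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return Skill(fields["name"], fields["description"], raw)


skill = load_metadata(SKILL_MD)
print(skill.name)                     # metadata is cheap and always available
print(skill.body().splitlines()[0])   # body is fetched only on demand
```

Layer 3 (scripts and resources) would sit next to SKILL.md as ordinary files and be run or opened only when a step in the body calls for them.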
For example, turning the client background check from Scenario 1 into a customer-recon Skill. Later, whenever I say “Research XX company,” it automatically runs according to that specification. Don’t overthink the Skill design principles: one skill does one thing, includes 2-3 use cases, and clearly states input, output, and trigger phrases.
The easiest way is to let Codex write the SKILL.md for you. Don’t write from scratch.
6.2 Let Codex Record and Improve Its Own Workflow
That OpenAI researcher who burned $10,000 – his true power wasn’t in a single trick, but in a habit: letting Codex continuously record and improve its own workflow. He would have Codex take notes while working, depositing reusable methods into a dedicated folder. After a few runs, these notes stabilize, and Codex becomes faster and more accurate for frequent tasks. He said he never read those notes; the value was mainly in making Codex perform better.
Someone in China verified the same trick. Instead of teaching from scratch each time, tell Codex globally:
“In this project directory, you should build a reusable accumulation system. For similar tasks in the future, abstract the process yourself, no need to reason from scratch each time.”
Then Codex will decide which parts to solidify into Skills, which to write as documentation, design and implement them itself. Once it works, a lot of repetitive labor is saved. Multiple session contexts under the same directory can also communicate, gradually understanding your working style.
6.3 Step 2: Let It Run on a Schedule
Once the process is stable, add automation. In April 2026, OpenAI introduced Automations for Codex. You set a schedule, and it executes on time, pushing results to you.
Three core concepts:
- Schedule: daily, weekly, or Cron expression, e.g., 30 8 * * * for 8:30 daily.
- Trigger: file change, Webhook, etc.
- Context Persistence: returns to the same conversation thread, remembers what was reported last time, only reports new changes. Very useful for stock watch and policy tracking.
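Codex's context persistence lives in the conversation thread. If you run the pipeline yourself, the "only report new changes" behavior can be approximated with a small state file. This is a sketch under that assumption; the function name, the `url` key, and the `seen.json` file name are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def filter_new_items(items: list[dict], state_file: Path) -> list[dict]:
    """Return only items whose URL was not reported before, then remember them."""
    seen = set(json.loads(state_file.read_text())) if state_file.exists() else set()
    fresh = [item for item in items if item["url"] not in seen]
    seen.update(item["url"] for item in fresh)
    state_file.write_text(json.dumps(sorted(seen)))
    return fresh


# Made-up items standing in for two consecutive morning runs.
yesterday = [
    {"title": "Rate decision", "url": "https://example.com/rates"},
    {"title": "Earnings beat", "url": "https://example.com/earnings"},
]
today_only = {"title": "New subsidy rule", "url": "https://example.com/subsidy"}

with tempfile.TemporaryDirectory() as tmp:
    state = Path(tmp) / "seen.json"
    first_run = filter_new_items(yesterday, state)
    second_run = filter_new_items(yesterday + [today_only], state)

print(len(first_run), len(second_run))
```

The first run reports everything; the second run reports only the item that was not seen before.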
That’s how I set my daily morning briefing: at 8:30 AM, Codex automatically runs client updates, policy changes, and US stock highlights, outputting a table summary. Official advice is sensible: first converse, then automate. First tune the task in regular conversation until satisfied, then save it as an Automation.
If you don’t want to rely on official Automation (e.g., you want to run on your own server), you can use codex exec with system cron:
Note: when running locally, your computer needs to stay awake. For critical tasks, use a cloud instance.
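One way to wire this up is a small wrapper script that cron calls. The sketch below assumes `codex exec` accepts the prompt as an argument together with `--sandbox read-only`, and that its final answer is printed to stdout; verify both against `codex exec --help` on your version. The function names and the file naming scheme are hypothetical, and the demo at the bottom uses a fake runner so it does not need Codex installed.

```python
import subprocess
import tempfile
from datetime import date
from pathlib import Path
from typing import Callable, Optional


def run_codex(prompt: str) -> str:
    """Call codex exec in read-only mode and return what it printed (assumed: stdout)."""
    result = subprocess.run(
        ["codex", "exec", "--sandbox", "read-only", prompt],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def write_briefing(
    prompt: str,
    out_dir: Path,
    runner: Callable[[str], str] = run_codex,
    today: Optional[date] = None,
) -> Path:
    """Run the prompt and save the answer as <date>-briefing.md in out_dir."""
    today = today or date.today()
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / f"{today.isoformat()}-briefing.md"
    target.write_text(runner(prompt), encoding="utf-8")
    return target


# Demo with a fake runner in place of the real codex call.
with tempfile.TemporaryDirectory() as tmp:
    path = write_briefing(
        "Summarize overnight news for my sectors, every item with a link.",
        Path(tmp),
        runner=lambda p: f"# Briefing\n\nPrompt was: {p}\n",
        today=date(2026, 6, 16),
    )
    name, text = path.name, path.read_text(encoding="utf-8")

print(name)
```

A crontab entry such as `30 8 * * * python3 /path/to/morning_briefing.py` (path hypothetical) would then run it at 8:30 every day, with `out_dir` pointed at a folder inside the vault.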
Advanced Play: One Commander Leading a Team of Sub-Agents
When your collection needs become complex – e.g., simultaneously covering clients, policies, and US stocks, each requiring deep dive – single-threading becomes slow.
That OpenAI researcher’s later workflow: talk to only one Agent, let it command a team of sub-agents. Some search data, some read code, some write, some do data analysis. This way he avoids switching between multiple tasks and can leverage parallelism. The new generation of codex models is especially good at managing multiple concurrent sub-agents.
The official Cookbook also provides a paradigm: use Agents SDK to treat codex mcp-server as a tool, letting a project manager Agent act as orchestrator, dispatching tasks sequentially to specialized Agents. Each step should verify the previous output file exists before proceeding. In information gathering, the main agent breaks down tasks, sub-agents search clients, policies, US stocks respectively, and finally the main agent aggregates into a daily report.
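The fan-out and aggregate shape can be sketched in a few lines. The three collector functions below are stubs that stand in for real sub-agents, and their names and return values are made up. This is not the Agents SDK or the Cookbook code, only the control flow: split by topic, collect, merge into one report.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


# Stub sub-agents: each would really be a separate Codex run.
def collect_clients() -> str:
    return "- Placeholder client finding (https://example.com/client)"


def collect_policies() -> str:
    return "- Placeholder policy update (https://example.com/policy)"


def collect_us_stocks() -> str:
    return "- Placeholder market note (https://example.com/market)"


SUB_AGENTS: dict[str, Callable[[], str]] = {
    "Clients": collect_clients,
    "Policies": collect_policies,
    "US Stocks": collect_us_stocks,
}


def build_daily_report(sub_agents: dict[str, Callable[[], str]]) -> str:
    """Main agent: dispatch every topic concurrently, then aggregate in a fixed order."""
    with ThreadPoolExecutor(max_workers=len(sub_agents)) as pool:
        futures = {topic: pool.submit(fn) for topic, fn in sub_agents.items()}
        # Dicts keep insertion order, so sections appear in the order defined above.
        sections = [f"## {topic}\n{future.result()}" for topic, future in futures.items()]
    return "# Daily Report\n\n" + "\n\n".join(sections)


report = build_daily_report(SUB_AGENTS)
print(report)
```

For the sequential variant, replace the thread pool with a plain loop and check that each step's output file exists before starting the next.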
Another efficient pattern often mentioned is 4x Codex: first spend time writing clear specs, then launch 4 concurrent tasks running different versions. You see multiple results simultaneously, fill in missing details, and finally humans verify.
You don’t need to do this from the start. Just know the upper limit exists.
Pitfall Checklist
I’ve compiled the common pitfalls from the official guidance and real-world usage for your reference:
- Prompt too vague. “Check what’s new” yields different results each time. Write down the goal, context, constraints, and completion conditions – these are the four most critical things in official prompt guidelines.
- Persistent rules stuffed into prompts. Long-term preferences should go into AGENTS.md or Skills, not repeated each time.
- Automating before the process is stable. Frequent errors waste more time. Turn it into a Skill first, stabilize it, then automate.
- Overly permissive permissions. Opening full access without understanding the process is risky. For information gathering, prefer --sandbox read-only – pure collection doesn’t need file modification permissions.
- Not requiring traceability. Always enforce in AGENTS.md that each item includes a link and distinguishes fact from speculation, otherwise you might get hallucinations.
- Context overflow. Don’t cram too many things into one session. Use /compact to compress or /new to start a new session.
- Hardcoded API keys. When running automation, use environment variables or Secrets, not written directly in scripts.
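For the last pitfall, a minimal sketch of reading a key from the environment in a wrapper script and failing loudly when it is absent. The function name and the variable name `DEMO_NEWS_API_KEY` are hypothetical.

```python
import os


def require_secret(name: str) -> str:
    """Fetch a secret from the environment and refuse to continue if it is missing."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Set the {name} environment variable before running the job")
    return value


# Demo only: in real use the variable is set by the shell, cron, or a secrets manager.
os.environ["DEMO_NEWS_API_KEY"] = "demo-value"
api_key = require_secret("DEMO_NEWS_API_KEY")
print(len(api_key))
```

Failing at startup is deliberate: a scheduled job that silently runs without credentials tends to produce an empty briefing that looks like "no news".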
The core of this method is simple: don’t treat information gathering as a daily manual chore; transform it into a pipeline that Codex guards for you.
From manual search to teaching Codex to search, to letting it solidify search methods, to automated scheduling, you’ll find that the one to two hours saved daily can be spent on things that need human judgment: talking to clients, making decisions, strategizing.
Tools will keep evolving, model version numbers will keep rising. But this skeleton from context to MCP to Skill and Automation won’t become obsolete soon.
Start today: spend ten minutes configuring the simplest daily morning briefing and get it running. Move first, then adjust gradually.
About the Author
Kyrie — Former R&D engineer at a large Chinese internet company, now based in Bangkok, doing overseas business development for Chinese tech companies. Sharing real overseas experiences and practical AI usage in business, occasionally talking about US stock investment and life abroad.
- X: @KyrieCheungYep