
Introduction

Security researchers are raising alarms over the persistence of online data exposure, even after the data has been made private. A recent discovery by Israeli cybersecurity firm Lasso reveals that thousands of once-public GitHub repositories—including those from major corporations such as Microsoft—can still be accessed through Microsoft Copilot, highlighting the risks posed by generative AI tools that index and retain publicly available information.
The Discovery
Lasso co-founder Ophir Dror disclosed that the company found content from its own GitHub repository appearing in Copilot’s responses. The repository had been mistakenly made public for a brief period before being set to private. Despite GitHub returning a “page not found” error for the repository, Copilot could still generate responses containing its contents.
“If I was to browse the web, I wouldn’t see this data,” Dror explained. “But anyone in the world could ask Copilot the right question and get this data.”
This revelation led Lasso to investigate further. Analyzing repositories that had been public at any point during 2024, the firm identified more than 20,000 that had since been deleted or set to private. Because Bing had cached their contents, Copilot still had access to data from these repositories, an exposure affecting more than 16,000 organizations.
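Lasso has not published its tooling, but the core check is straightforward to reproduce. The Python sketch below (repository names are hypothetical placeholders, not real findings) takes a list of repositories known to have been public at some point and flags those that now return a 404 on GitHub; any cached copy of such a "zombie" repository is exactly the kind of data Copilot could still surface.

```python
import requests

# Hypothetical input: repositories observed as public at some point
# (e.g., from a crawl or archive of 2024 listings).
candidates = [
    "example-org/internal-tools",
    "example-org/deploy-scripts",
]

for full_name in candidates:
    # An unauthenticated GET to the repository page returns 404 both for
    # deleted repositories and for ones later switched to private.
    resp = requests.get(f"https://github.com/{full_name}", timeout=10)
    if resp.status_code == 404:
        # Gone from the public web, but a cached copy may still be
        # retrievable through Bing's index -- and therefore through Copilot.
        print(f"{full_name}: no longer publicly visible (zombie candidate)")
    else:
        print(f"{full_name}: still public (HTTP {resp.status_code})")
```

Run at scale over a 2024 snapshot of public repositories, the same loop would produce the kind of zombie list Lasso describes.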
Impact on Major Corporations
The affected companies include technology giants such as Amazon Web Services, Google, IBM, PayPal, Tencent, and Microsoft itself. According to Lasso, some of these companies had confidential GitHub archives exposed through Copilot, containing sensitive corporate data, intellectual property, access keys, and security tokens.
One particularly concerning discovery was a now-deleted GitHub repository—previously hosted by Microsoft—that contained a tool for generating “offensive and harmful” AI images using Microsoft’s cloud AI service. Despite the repository’s removal, Copilot could still surface its contents.
Industry Response and Microsoft’s Stance
Lasso contacted the affected organizations to alert them to the exposure and recommended rotating or revoking any potentially compromised access keys. However, none of the named companies, including Microsoft, responded to TechCrunch's inquiries about the issue.
Microsoft was informed of Lasso’s findings in November 2024. The company classified the issue as “low severity,” describing Copilot’s reliance on Bing’s cache as “acceptable” behavior. To mitigate risks, Microsoft removed links to Bing’s cached pages from its search results in December 2024. Even so, Lasso notes that Copilot can still retrieve data from deleted or private repositories, suggesting the fix is only partial and does not resolve the underlying risk.
Security Implications and Recommendations
This incident highlights the challenges organizations face in protecting sensitive information once it has been exposed, even momentarily. The persistence of cached data in AI-driven tools raises critical security concerns, especially for companies that rely on GitHub for software development and collaboration.
To mitigate such risks, organizations should:
- Regularly audit public repositories for accidental exposure of sensitive data (a minimal scanning sketch follows this list).
- Rotate and revoke access keys immediately if a repository is mistakenly made public (see the revocation example after this list).
- Work with AI vendors to ensure that data indexing policies align with security best practices.
- Raise awareness within developer teams about the potential long-term impact of even brief public exposure.
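On the first point, even a lightweight pre-publish audit catches the most common credential formats. The Python sketch below scans a checked-out tree for two well-known token patterns; it is a minimal illustration, and a real audit should use a dedicated scanner such as gitleaks or trufflehog with a far larger ruleset.

```python
import re
from pathlib import Path

# Minimal patterns for two common credential formats; dedicated scanners
# ship hundreds of rules plus entropy checks.
PATTERNS = {
    "AWS access key ID": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "GitHub personal access token": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
}

def scan_repo(root: str) -> None:
    """Walk a checked-out repository and report likely credentials."""
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        try:
            text = path.read_text(errors="ignore")
        except OSError:
            continue  # unreadable file (permissions, broken symlink, etc.)
        for label, pattern in PATTERNS.items():
            for match in pattern.finditer(text):
                # Print only a prefix so the report itself doesn't leak keys.
                print(f"{path}: possible {label}: {match.group()[:8]}...")

scan_repo(".")  # run from the repository root before publishing
```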
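On the second point, because cached copies can outlive the repository itself, revoking the credential is the only reliable remedy; making the repository private again does not claw back what Bing has already indexed. As one illustrative example, assuming the exposed credential is an AWS access key and boto3 is configured with sufficient IAM permissions, the key can be disabled and then deleted:

```python
import boto3

def revoke_aws_key(user_name: str, access_key_id: str) -> None:
    """Disable, then delete, an exposed AWS access key.

    Deactivating first gives a short window to confirm nothing
    legitimate still depends on the key before it is removed for good.
    """
    iam = boto3.client("iam")
    iam.update_access_key(
        UserName=user_name, AccessKeyId=access_key_id, Status="Inactive"
    )
    iam.delete_access_key(UserName=user_name, AccessKeyId=access_key_id)

# Placeholder values -- substitute the user and key found during the audit.
revoke_aws_key("deploy-bot", "AKIAXXXXXXXXXXXXXXXX")
```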
Conclusion
As generative AI tools continue to evolve, so do the security risks associated with their use. The persistence of previously public GitHub data in Copilot serves as a cautionary tale for organizations handling sensitive information online. Until a comprehensive solution is implemented, companies must take proactive measures to safeguard their intellectual property and sensitive data against unintended AI-driven exposure.