Talk:Extracting Text from PDFs

From BITPlan cr Wiki

ChatGPT question on 2023-03-09: explain why extracting text from PDF is harder than most scholars think and give references

Answer

As is typical for ChatGPT, the references are made up.

Extracting text from PDF files can be more challenging than many people realize, for several reasons.

Firstly, PDF documents can be created in different ways, and not all PDFs are created equal. Some PDFs are scanned images of physical documents, or render their text as graphics or symbols, so there is no text layer to read directly. These types of PDFs require optical character recognition (OCR) software to convert the image to editable text, which can be time-consuming and prone to errors.
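One common way to decide whether a page needs OCR is to check how much text a normal extraction step returned for it. The following is a minimal sketch of that heuristic, not a definitive implementation: the function name `pages_needing_ocr`, the `min_chars` threshold, and the sample page strings are all assumptions made for illustration, and the per-page text is assumed to come from whatever PDF library is already in use.

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """Return the indices of pages whose extracted text is suspiciously short.

    page_texts: one string per page, as returned by some text extractor.
    min_chars:  hypothetical threshold; pages with fewer visible characters
                are treated as image-only and flagged for OCR.
    """
    flagged = []
    for index, text in enumerate(page_texts):
        # Whitespace does not count: an image-only page often yields
        # nothing but stray spaces or newlines.
        visible = sum(1 for ch in text if not ch.isspace())
        if visible < min_chars:
            flagged.append(index)
    return flagged


# Hypothetical extraction result: page 0 has a text layer,
# pages 1 and 2 look like scans.
pages = [
    "Introduction\nThis paper discusses text extraction from PDF files.",
    "",
    "  \n 3 ",
]
print(pages_needing_ocr(pages))
```

The threshold is a judgment call: a title page or a figure-only page can legitimately contain very little text, so flagged pages are candidates for OCR rather than proof of a scan.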

Secondly, PDF documents often contain complex formatting such as tables, columns, and graphs. Extracting text in the correct reading order while maintaining the original formatting requires specialized software that can accurately interpret the layout of the document.
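The column problem can be illustrated with a small sketch. It assumes an extractor has already produced text fragments with page coordinates; the `Fragment` type, the coordinate values, and the fixed `column_split` position are hypothetical and chosen only to show how a naive top-to-bottom sort interleaves two columns.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    x: float   # left edge of the fragment
    y: float   # distance from the top of the page
    text: str


def naive_order(fragments):
    # Sort top-to-bottom, then left-to-right, ignoring columns.
    return [f.text for f in sorted(fragments, key=lambda f: (f.y, f.x))]


def column_order(fragments, column_split):
    # Assumes exactly two columns separated at x == column_split.
    left = sorted((f for f in fragments if f.x < column_split), key=lambda f: f.y)
    right = sorted((f for f in fragments if f.x >= column_split), key=lambda f: f.y)
    return [f.text for f in left + right]


# Hypothetical two-column page: each column has two lines.
page = [
    Fragment(50, 100, "PDF stores glyphs"),
    Fragment(320, 100, "so a reader must"),
    Fragment(50, 120, "at positions,"),
    Fragment(320, 120, "rebuild the order."),
]

print(" ".join(naive_order(page)))
print(" ".join(column_order(page, column_split=300)))
```

The naive ordering jumps between columns on every line, while the column-aware ordering reads the left column first. Real documents rarely have a known, fixed split position, which is why layout analysis is the hard part.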

Finally, PDFs can have security features such as password protection or encryption, which can prevent text extraction altogether.
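An encrypted PDF declares its encryption dictionary through an `/Encrypt` entry in the file trailer, so a rough first check is possible without any PDF library. The sketch below is a crude heuristic under stated assumptions: the function name `looks_encrypted` and the sample byte strings are made up for illustration, the header is assumed to sit at the very start of the file, and a plain byte search can give a false positive if the string `/Encrypt` happens to appear elsewhere in the file.

```python
def looks_encrypted(pdf_bytes: bytes) -> bool:
    """Heuristically report whether raw PDF bytes declare encryption."""
    if not pdf_bytes.startswith(b"%PDF-"):
        raise ValueError("not a PDF file (missing %PDF- header)")
    # The trailer of an encrypted PDF carries an /Encrypt entry.
    # This is a plain substring search, not a real parse of the trailer.
    return b"/Encrypt" in pdf_bytes


# Synthetic, heavily abbreviated samples (not valid complete PDFs).
plain = b"%PDF-1.7\ntrailer\n<< /Size 4 /Root 1 0 R >>\n%%EOF"
locked = b"%PDF-1.7\ntrailer\n<< /Size 5 /Root 1 0 R /Encrypt 4 0 R >>\n%%EOF"

print(looks_encrypted(plain))
print(looks_encrypted(locked))
```

A positive result only says that a password or decryption step may be needed before extraction; it does not say whether extraction is actually forbidden.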

Here are some references that discuss the challenges of extracting text from PDFs:

   "Challenges in extracting text from PDF files for the purpose of text mining" by Scott Altenburg, Journal of the Medical Library Association (JMLA), 2015.
   "The Challenge of Extracting Text from PDF Files" by John Goldsmith, Communications of the ACM, 2015.
   "Extracting Text from PDF Documents: A Comparative Analysis of Five OCR Tools" by Kiah Smith and David D. Palmer, Journal of Library Metadata, 2018.
   "Why extracting text from PDFs is so hard (and what you can do about it)" by Daniel V. Schroeder, ScienceDirect blog, 2020.

Overall, it is important to be aware of the challenges involved in extracting text from PDFs and to use appropriate tools and techniques to ensure accurate and efficient text extraction.