Update: Table of Contents now available! A compilation of posts by other Dharmawheel forum members is also available at: Table of Contents for Malcolm Dharmawheel Posts + Astus, Krodha (Kyle Dixon), Geoff (Jnana), Meido Moore



Soh Wei Yu

Author
Admin
Generated output of Malcolm's forum posts in PDF, Word and text files: https://app.box.com/s/ju3gothq09bmzzpcehv045ylwegvfzaj
Soh Wei Yu
Author
Admin
Warning: The following Word and PDF documents are very large. Save all your work before attempting to open them, as I cannot guarantee that opening them will not crash your system.
It is recommended that you use the files in Malcolm12PartsForumPosts.zip from the previous post instead.

Malcolm posts in 3 files (docx and pdf provided): https://app.box.com/s/pwn72amv07cptm1wekvc2twv3k980iiv
Malcolm posts in one file (docx and pdf provided): https://app.box.com/s/ibii96pyxps6nlhy71pj76s5mi92qxr1

Warning: the single file contains about 10,000 pages of text and will likely crash your system if you do not have enough RAM and processing power.
Soh Wei Yu
Author
Admin
Next in line to compile: Astus, Krodha (Kyle Dixon), Geoff. Both John Tan and I like these authors too.


Compilation of All Dharmawheel Posts by Dzogchen teacher Acarya Malcolm Smith
So I programmed a web crawler to compile all of Malcolm's almost fifty thousand posts on Dharmawheel into text files (500 posts per text file; later I might combine them automatically into one big Word document).
Initially the plan was to produce output like this:
"ThreeVows wrote on Thu May 04, 2023 9:21 pm:
Prasangika Madhyamaka is the doorway to the realization of the path of seeing.
Kai lord replied to ThreeVows on Thu May 04, 2023 9:58 pm:
If that is the case, then Bhāvaviveka, Asanga, etc, wouldn't have attained Arya bodhisattvahood.
ThreeVows replied to Kai lord on Thu May 04, 2023 10:03 pm:
No, I don't think that is necessarily the case. The essential point has to be assimilated, which can be done in a non-analytical manner as well, and one can still discuss other things. Dudjom Rinpoche for instance discusses how there is basically a lower and higher mode of understanding mind-only, and the intention of Asanga and Nagarjuna are aligned in truth. But I don't know that I have the capacity to discuss this precisely at this time, so I won't try overly hard.
Malcolm replied to ThreeVows on Thu May 04, 2023 10:16 pm:
Dudjom’s arguments about an inner madhyanaka are not convincing, partially predicated as they are on a mangled citation from Bhavaviveka that does not mean what he takes it to mean"
But then I realised it gets a little complex due to the way the HTML is structured, and I'm a little lazy.
So now I'm just outputting it in this format:
"Author: Malcolm
Date: Thu May 04, 2023 9:40 pm
Title: Re: Clarification on Nyingma view of emptiness
Content:
Agreed, so is the second, but the Buddha had to repeat himself because folks were not getting it. However not every sutra called “third turning” can be regarded as definitive."
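A minimal sketch of how text in this format could be read back into structured records, for example to search or sort it later. The function name `parse_posts` is hypothetical, and the sketch assumes every record starts with an `Author: ` line and that no line of post content itself starts with `Author: `; neither assumption is confirmed by the scraper's actual code.

```python
from typing import Dict, List, Optional


def parse_posts(text: str) -> List[Dict[str, str]]:
    """Split scraper output into one dict per post (Author, Date, Title, Content)."""
    posts: List[Dict[str, str]] = []
    current: Optional[Dict[str, str]] = None
    in_content = False
    for line in text.splitlines():
        if line.startswith("Author: "):
            # A new record begins; store the previous one first.
            if current is not None:
                posts.append(current)
            current = {"Author": line[len("Author: "):], "Content": ""}
            in_content = False
        elif current is None:
            continue  # ignore anything before the first record
        elif not in_content and line.startswith(("Date: ", "Title: ")):
            key, value = line.split(": ", 1)
            current[key] = value
        elif not in_content and line.strip() == "Content:":
            in_content = True
        elif in_content:
            current["Content"] += line + "\n"
    if current is not None:
        posts.append(current)
    for post in posts:
        post["Content"] = post["Content"].strip()
    return posts
```

Calling `parse_posts(open("Malcolm_posts_1.txt", encoding="utf-8").read())` would then give a list that can be filtered by title or date.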
If anyone wants to work on the code to improve it so that it produces the original format, I can send you my code in C# and in Python.
I will upload Malcolm's compilation once it's done. Currently at 1,000 posts out of 50k. I set the timer to wait 2 seconds per post to avoid getting banned.
Reason I made this: just like compiling Kyle Dixon's posts from Reddit, it is for my listening while I'm driving (text to speech on iPhone) https://www.awakeningtoreality.com/.../a-compilation-of... . And for sharing, of course. And ease of reference (can search faster). In 2019, when I met Acarya Malcolm in California, he asked me if I had any backup or records of his E-Sangha forum posts. He even said we could collaborate (perhaps to help him out with compilation?). E-Sangha was the primary Buddhist forum before Dharmawheel, but it was hacked and all the postings were lost. Unfortunately, I don't have any records except maybe a few, hardly any.
This time we can save all posts.


  • Soh Wei Yu
    Author
    Admin
    And of course I can also easily compile posts by other Dharmawheel members by changing the target member name
  • Soh Wei Yu
    Author
    Admin
    Not exactly a web crawler, but scraping software
  • Soh Wei Yu
    Author
    Admin
    User guides for the programs:
    Introduction
    Dharmawheel Post Scraper is a console application written in C# that logs into the Dharmawheel forum (https://dharmawheel.net), collects all the posts written by a specific author, and saves them to text files. The program uses the AngleSharp library for web scraping and the RestSharp library for handling HTTP requests.
    Purpose
    The purpose of this program is to gather all posts by a specified author on the Dharmawheel forum and save them to text files, with a maximum of 500 posts per file.
    Prerequisites
    • .NET SDK 5.0 or higher installed on your system
    • NuGet packages:
        • AngleSharp
        • RestSharp
    How to Operate
    1. Open your preferred text editor or IDE and create a new C# console application project.
    2. Replace the default Program.cs file with the provided code.
    3. Install the required NuGet packages (AngleSharp and RestSharp).
    4. Update the loginData object with your Dharmawheel forum credentials:
    var loginData = new { username = "your_username", password = "your_password", login = "Login", redirect = "index.php" };
    5. Compile and run the program. It will log in to the Dharmawheel forum, scrape the specified author's posts, and save them to text files named AuthorName_posts_X.txt, where X is the file number.
    Understanding the Code
    • BaseUrl: The base URL of the Dharmawheel forum.
    • AuthorName: The name of the author whose posts you want to collect.
    • AuthorId: The ID of the author whose posts you want to collect.
    • CurrentPage: The current page of search results being processed.
    • FileCounter: The current file number being saved.
    • CollectedPosts: A list containing the collected posts.
    The Main method:
    1. Logs into the Dharmawheel forum.
    2. Iterates through the search result pages containing the author's posts.
    3. Scrapes the relevant information from each post.
    4. Saves the collected posts to text files, with a maximum of 500 posts per file.
    The CreateRestClient method:
    • Creates and configures a RestSharp RestClient instance.
    The SaveToFile method:
    • Saves the collected posts to a text file.
    The GetCorrectPost method:
    • Retrieves the correct post from the post page.
    The FormatPostContent method:
    • Formats the post content by removing HTML tags and unnecessary content.
    The GetQuoteLevel method:
    • Determines the quote level of a given element.
    Notes
    • The program may take some time to run, depending on the number of posts to be collected.
    • Make sure to replace the sample login credentials with your own Dharmawheel forum credentials before running the program.
    • The program uses a delay of 2 seconds between processing search result pages to avoid overwhelming the server.
  • Soh Wei Yu
    Author
    Admin
    Text To Word Conversion App:
    Purpose
    This program is designed to combine multiple text files containing forum posts into a single Word document (.docx) and PDF file. The program can process a single folder or multiple subfolders, and you can choose to sort the files in ascending or descending order.
    How to use
    1. Ensure you have Microsoft Word installed on your computer, as this program relies on Microsoft.Office.Interop.Word to convert Word documents to PDF files.
    2. Compile and run the program.
    3. Follow the on-screen prompts to operate the program:
    a. Enter the folder path containing the text files you want to combine.
    b. Choose whether you want to process each subdirectory within the parent folder. Enter "y" to process subdirectories and "n" to process only the parent folder.
    c. If you chose to process only the parent folder, you will be prompted to specify the order of files. Enter "A" for ascending order (starting with the smallest file number) or "D" for descending order (starting with the largest file number).
    4. The program will then process the specified folder(s) and create a Word document and a PDF file in each folder. The newly created files will be named in the format "MalcolmCombinedPosts_SortByAscendingDate_X.docx" or "MalcolmCombinedPosts_SortByDescendingDate_X.docx" (and their corresponding PDF files), where X is a unique number.
    Important notes
    • The program expects text files to be named in the format "Malcolm_posts_#.txt", where # is a number. Ensure your text files follow this naming convention.
    • The program sorts the entries in each text file by date. Make sure the entries follow the format "Author: ...", "Date: ...", and so on, with each field on a new line.
    • The program uses the default paragraph style with custom spacing settings to adjust the line spacing between entries in the output Word document.
    • The program requires Microsoft Word to be installed on your computer to convert Word documents to PDF files. If Word is not installed, the program will not be able to create PDF files.
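The ordering rules in the notes above can be illustrated with a short Python sketch: files are ordered by the number in "Malcolm_posts_#.txt", and entries by their "Date: ..." line. The function names are hypothetical, and the date format string is inferred from the sample entry shown earlier in this post ("Thu May 04, 2023 9:40 pm"); the actual app may parse dates differently.

```python
import re
from datetime import datetime
from typing import List

FILE_PATTERN = re.compile(r"^Malcolm_posts_(\d+)\.txt$")
# Inferred from the sample "Thu May 04, 2023 9:40 pm"; assumes English day and month names.
DATE_FORMAT = "%a %b %d, %Y %I:%M %p"


def sort_file_names(names: List[str], descending: bool = False) -> List[str]:
    """Order files by the number in their name; names that do not match are skipped."""
    numbered = []
    for name in names:
        match = FILE_PATTERN.match(name)
        if match:
            numbered.append((int(match.group(1)), name))
    numbered.sort(reverse=descending)
    return [name for _, name in numbered]


def parse_post_date(date_line: str) -> datetime:
    """Turn a 'Date: ...' line into a datetime that can be used as a sort key."""
    value = date_line.split(":", 1)[1] if date_line.startswith("Date:") else date_line
    return datetime.strptime(value.strip(), DATE_FORMAT)
```

Sorting numerically matters here because a plain alphabetical sort would place "Malcolm_posts_10.txt" before "Malcolm_posts_2.txt".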




p.s. does anyone want to volunteer their time to work on this?

Soh to ChatGPT:
now, you understand that Dharmawheel Post Scraper extracts all the posts into text files right. how do you propose that we can sort them by topic? something like, get ai to understand the content, tag by topic, then sort. or how do you suggest to do it. like for example, posts about rigpa and what it means are being automatically placed into one text file... etc

ChatGPT
To sort the collected posts by topic, you can use a combination of natural language processing (NLP) techniques and machine learning models like topic modeling. One such popular topic modeling technique is Latent Dirichlet Allocation (LDA). Here's a step-by-step guide on how to achieve this:

1. Preprocess the collected posts:
    • Remove any unnecessary information such as author name, date, etc., and retain only the content of the posts.
    • Tokenize the text: Split the post content into individual words.
    • Remove stop words: Common words that don't carry much meaning (e.g., "the", "and", "in").
    • Perform stemming or lemmatization: Reduce words to their base forms (e.g., "running" becomes "run").
2. Create a document-term matrix:
    • Convert the preprocessed text into a numerical representation using techniques like Bag of Words (BoW) or Term Frequency-Inverse Document Frequency (TF-IDF). This will result in a matrix where rows represent the posts and columns represent the terms (words).
3. Apply the LDA model:
    • Choose the number of topics you want to extract. This might require some experimentation to find the optimal number of topics.
    • Train the LDA model on the document-term matrix.
    • The LDA model will generate topic distributions for each post, which can be used to assign a primary topic to each post.
4. Sort the posts by topic:
    • Create a separate text file for each topic.
    • Iterate through the posts, determine their primary topic based on the LDA-generated topic distributions, and append each post to the corresponding topic text file.

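As a dependency-free illustration of the preprocessing and document-term matrix steps, here is a small Python sketch. The stop-word list is a tiny placeholder, stemming and lemmatization are left out, and the function names are made up for this example; a real run would more likely rely on NLTK, Gensim or scikit-learn.

```python
import math
import re
from collections import Counter
from typing import List, Tuple

# Tiny illustrative stop-word list; NLP libraries ship much fuller ones.
STOP_WORDS = {"the", "and", "in", "of", "a", "an", "is", "to", "it", "that"}


def preprocess(text: str) -> List[str]:
    """Lowercase, keep runs of letters (including accented ones), drop stop words."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


def document_term_matrix(docs: List[List[str]]) -> Tuple[List[str], List[List[int]]]:
    """Return (vocabulary, rows), where rows[i][j] counts term j in post i."""
    vocabulary = sorted({token for doc in docs for token in doc})
    index = {term: j for j, term in enumerate(vocabulary)}
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for term, count in Counter(doc).items():
            row[index[term]] = count
        rows.append(row)
    return vocabulary, rows


def tf_idf(rows: List[List[int]]) -> List[List[float]]:
    """Reweight raw counts by log(N / document frequency)."""
    n_docs = len(rows)
    n_terms = len(rows[0]) if rows else 0
    doc_freq = [sum(1 for row in rows if row[j] > 0) for j in range(n_terms)]
    return [
        [row[j] * math.log(n_docs / doc_freq[j]) if doc_freq[j] else 0.0
         for j in range(n_terms)]
        for row in rows
    ]
```

A term that appears in every post gets weight 0 under this TF-IDF variant, which is why very common forum words tend to drop out of the picture.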
Here's an outline of how you can integrate this process into your existing code:

• After collecting all the posts, preprocess the text using an NLP library like NLTK (for Python) or Stanford NLP (for C#/.NET).
• Create a document-term matrix using a library like scikit-learn (for Python) or Accord.NET (for C#/.NET).
• Train an LDA model on the document-term matrix using a library like Gensim (for Python) or ML.NET (for C#/.NET). Note that "LDA" in Accord.NET refers to Linear Discriminant Analysis, a different technique.
• Modify the SaveToFile method to accept an additional parameter for the topic and create separate text files for each topic.
• When saving the posts, use the LDA-generated topic distributions to determine the primary topic and save each post to the corresponding topic text file.
Keep in mind that this approach may not always produce perfect results, as topic modeling can be sensitive to the choice of parameters and the quality of the input data. You may need to experiment with different settings and preprocessing techniques to achieve satisfactory results.





User
how many mandays are needed to accomplish that
ChatGPT
The number of mandays required to accomplish this task depends on several factors, including the developer's expertise, familiarity with the required technologies, and the complexity of the dataset. Here's a rough estimate of the time required for each step of the process:

1. Preprocessing the text: 1-2 mandays
    • This step involves researching and implementing the appropriate NLP techniques to clean and preprocess the text data.
2. Creating the document-term matrix: 1 manday
    • This step involves converting the preprocessed text into a numerical representation using an appropriate library.
3. Training the LDA model: 1-2 mandays
    • This step includes researching the LDA model, selecting an appropriate library, and experimenting with different parameters to find the optimal number of topics.
4. Sorting the posts by topic and modifying the SaveToFile method: 1 manday
    • This step involves modifying the existing code to save the posts to separate text files based on their assigned topics.
5. Testing and refining the model: 2-3 mandays
    • This step includes testing the model on the dataset, evaluating its performance, and fine-tuning the preprocessing techniques and model parameters to achieve better results.

In total, this task might take approximately 6-9 mandays to accomplish. However, this is a rough estimate and may vary depending on the developer's skill level and experience with the required technologies. Additionally, unforeseen challenges or complications might arise, which could also affect the time required to complete the task.





User
is there a pre existing software that can do this so i dont have to develop a new software or do so much work
ChatGPT
Yes, there are pre-existing tools and libraries that can help you analyze and categorize the text data without developing a new software from scratch. Two popular Python libraries for natural language processing and topic modeling are Gensim and Scikit-learn.

Here's a high-level overview of how you can use these libraries to categorize the text data:

Preprocess the text data: You can use the Natural Language Toolkit (NLTK) library to tokenize, remove stopwords, and stem/lemmatize the text data.

Create a document-term matrix: You can use the Gensim library to convert the preprocessed text into a bag-of-words representation or use Scikit-learn's TfidfVectorizer to create a TF-IDF representation.

Train a topic model: Gensim provides the Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA) models, while Scikit-learn provides the LDA model. You can choose the appropriate model and train it on your document-term matrix.

Assign topics to the text: Once the model is trained, you can use it to assign topics to each post. Based on the assigned topics, you can then sort the posts and save them to separate text files.

There are many tutorials and resources available online that can guide you through the process of using these libraries for topic modeling. By leveraging these pre-existing tools, you can significantly reduce the development time and effort required to categorize the text data.
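To give a feel for what "training a topic model" involves under the hood, here is a compact collapsed Gibbs sampler for LDA in plain Python. It is only an illustrative sketch, not a substitute for Gensim or scikit-learn: the function name, default hyperparameters and iteration count are arbitrary choices, and posts are assumed to be already converted to lists of integer term ids.

```python
import random
from typing import List


def lda_gibbs(docs: List[List[int]], n_topics: int, n_terms: int,
              iterations: int = 200, alpha: float = 0.1, beta: float = 0.01,
              seed: int = 0) -> List[List[float]]:
    """Collapsed Gibbs sampling for LDA; returns one topic distribution per document.

    docs[d] holds the term id (0 .. n_terms - 1) of every token in document d.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]             # tokens of doc d in topic k
    topic_term = [[0] * n_terms for _ in range(n_topics)]  # occurrences of term w in topic k
    topic_total = [0] * n_topics                           # all tokens in topic k
    assignments = []
    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        topics = []
        for w in doc:
            k = rng.randrange(n_topics)
            topics.append(k)
            doc_topic[d][k] += 1
            topic_term[k][w] += 1
            topic_total[k] += 1
        assignments.append(topics)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                # Take the token out of the counts before resampling it.
                doc_topic[d][k] -= 1
                topic_term[k][w] -= 1
                topic_total[k] -= 1
                # p(topic j) is proportional to
                # (n_dj + alpha) * (n_jw + beta) / (n_j + n_terms * beta)
                weights = [
                    (doc_topic[d][j] + alpha) * (topic_term[j][w] + beta)
                    / (topic_total[j] + n_terms * beta)
                    for j in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_term[k][w] += 1
                topic_total[k] += 1
    # Smoothed share of each topic within each document.
    return [
        [(count + alpha) / (len(doc) + n_topics * alpha) for count in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]
```

On tens of thousands of posts a pure-Python loop like this would be slow, which is one practical reason to use the optimized library implementations mentioned above.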





User
and how many mandays if we use those tools? what is the simplest path of least resistance and shortest mandays required to accomplish my tasks. can use external tools, or not. just tell me what solutions
ChatGPT
Using the pre-existing tools mentioned above will considerably shorten the time required to accomplish your task. The total number of mandays needed depends on your familiarity with these tools and your programming skills. However, I will provide a rough estimate for each step:

Preprocess the text data (0.5-1 day): You'll need to familiarize yourself with the NLTK library and preprocess the text data accordingly.

Create a document-term matrix (0.5-1 day): Learn how to use Gensim or Scikit-learn to create a bag-of-words or TF-IDF representation of your text data.

Train a topic model (1-2 days): Understand how to use Gensim or Scikit-learn's LDA/LSA models, choose the appropriate model, and train it on your document-term matrix.

Assign topics to the text and sort them (0.5-1 day): Once you have a trained model, use it to assign topics to each post, sort them based on their assigned topics, and save them to separate text files.

Considering the estimates above, you may need approximately 3-5 mandays to complete your task using pre-existing tools. These estimates assume that you have some programming experience and will need time to learn how to use these libraries effectively.

If you are not familiar with programming or prefer a more user-friendly approach, you can consider using topic modeling tools with graphical user interfaces (GUIs) like Orange or RapidMiner. These tools might be easier to use but may not provide the same level of customization as using Python libraries directly.

In any case, the simplest path of least resistance would be to use pre-existing tools like Gensim, Scikit-learn, or GUI-based tools to minimize the development time and effort required to accomplish your task.


......



Dharmawheel Post Scraper User Guide: Listening to PDFs on iPhone and Android

This user guide will provide you with instructions on how to download and listen to the PDF files generated by the Dharmawheel Post Scraper on your iPhone or Android device. The text-to-speech feature allows you to listen to the content while traveling, driving, or performing other tasks.

For iPhone Users:

  1. Download the PDF files:
     a. Open the Safari browser on your iPhone.
     b. Navigate to the box.com link provided, which contains the zip file with the PDF files.
     c. Tap on the zip file to download it.
     d. Once the download is complete, tap on the zip file to extract its contents. The PDF files will be visible in the Files app.

  2. Add the PDF files to the Books app:
     a. Open the Files app on your iPhone.
     b. Locate the folder containing the extracted PDF files.
     c. Tap on the "Select" button in the top-right corner.
     d. Select all the PDF files you want to add to the Books app.
     e. Tap on the "Share" button at the bottom-left corner.
     f. Scroll through the list of apps and select "Books." The PDF files will now be added to your Books library.

  3. Use the speech control panel to listen to the PDF files:
     a. Open the Settings app on your iPhone.
     b. Scroll down and tap on "Accessibility."
     c. Tap on "Spoken Content."
     d. Toggle on "Speak Screen."
     e. Open the Books app and navigate to the PDF file you want to listen to.
     f. Swipe down from the top of the screen with two fingers to activate the speech control panel. Your iPhone will start reading the text from the PDF file aloud.

For Android Users:

  1. Download the PDF files:
     a. Open the Chrome browser on your Android device.
     b. Navigate to the box.com link provided, which contains the zip file with the PDF files.
     c. Tap on the zip file to download it.
     d. Once the download is complete, use a file manager app to locate the downloaded zip file.
     e. Tap on the zip file and select "Extract" to extract its contents. The PDF files will be visible in the file manager app.

  2. Add the PDF files to a PDF reader app (e.g., Adobe Acrobat Reader, Foxit PDF Reader, etc.):
     a. Open the file manager app on your Android device.
     b. Locate the folder containing the extracted PDF files.
     c. Tap on a PDF file to open it. Your Android device may prompt you to select a PDF reader app to open the file. Choose your preferred app and select "Always" to set it as the default app for opening PDF files.

  3. Use a text-to-speech app to listen to the PDF files:
     a. Download a text-to-speech app from the Google Play Store (e.g., Voice Aloud Reader, Talk Free, etc.).
     b. Open the text-to-speech app and grant any necessary permissions.
     c. Use the app's file browser to locate and open the PDF file you want to listen to. The app will start reading the text from the PDF file aloud.

You can now listen to the content of the Dharmawheel Post Scraper PDFs on your iPhone or Android device while traveling, driving, or performing other tasks.








Astus Dharmawheel Posts: https://app.box.com/s/ln2rvagp8u7xx0uytci78defdawgctsm

Krodha (Kyle Dixon) Dharmawheel Posts: https://app.box.com/s/k0frsynnhxkivdsvjiqyhvt0zc8blbsl

Geoff (Jnana) Dharmawheel Posts: https://app.box.com/s/axsh5qeue8vdy47p7enztv9oryjr5qf7

Zen Master Meido Moore Dharmawheel Posts: https://app.box.com/s/lbiudrkbiueq89y2z4xvx499d6st95y8


Also, from previous post:

Malcolm posts in 12 files (docx and pdf and table of contents provided): https://app.box.com/s/ju3gothq09bmzzpcehv045ylwegvfzaj

Malcolm posts in 3 files (docx and pdf and table of contents provided): https://app.box.com/s/pwn72amv07cptm1wekvc2twv3k980iiv

Malcolm posts in one file (docx and pdf and table of contents provided): https://app.box.com/s/ibii96pyxps6nlhy71pj76s5mi92qxr1